Time series statistics are tricky

In one of my last projects at BigCo, I had to implement a bunch of metrics monitoring and alerting for our services. This was years ago now, but every so often I think back on it and how it drove me a bit nutty... Here's a short write-up of some of the problems I remember, just from considering one aspect: metrics on a single API endpoint's requests.

People seem to think statistics like "average requests per minute" or "p99 response times" are straightforward metrics that can be pulled with a simple query. But they can be quite complex, and the results can be very misleading depending on how the events and queries have been defined.

So again, starting with something simple: we just want a request counter for a single endpoint. How you define this counter changes how you interpret its data. One approach is to emit an event each time the endpoint is hit, logging every single request. Alternatively, you could maintain an asynchronous counter that emits its value at fixed intervals (every minute, say), incrementing only when new requests come in.

In the first case, calculating the total all-time requests up to a certain time t involves querying all events preceding t and summing them. In the second case, you have another choice to make first. Do you only ever increment the counter? Then you just need the value of the most recent counter emission before t. Or do you reset the counter to 0 after each emission? Then you need to query all events preceding t and sum them too, but at least the number of events you scan depends only on elapsed time, not on the request rate (which may be very high for some applications). If you want to compute the count of requests in a single day, for the first representation you have to count up all events emitted over that day. For the second representation, if you don't reset the counter after each emission, you just have to query the events closest to the end of day and the start of day and subtract.
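To make that concrete, here's a minimal Python sketch. All the event lists, timestamps, and function names are made up for illustration; in reality these would be queries against whatever time series store you're using.

```python
from bisect import bisect_right

# Representation 1: one event per request (value is always 1).
per_request_events = [(3, 1), (17, 1), (64, 1), (65, 1), (130, 1)]  # (timestamp_sec, 1)

# Representation 2: a cumulative counter emitted every 60 seconds, never reset.
periodic_counter_events = [(0, 0), (60, 2), (120, 4), (180, 5)]  # (emit_time_sec, total_so_far)

def total_up_to(events, t):
    """All-time total at time t for per-request events: scan and sum everything at or before t."""
    return sum(v for ts, v in events if ts <= t)

def counter_value_at(events, t):
    """All-time total at time t for a never-reset cumulative counter:
    just the most recent emission at or before t."""
    times = [ts for ts, _ in events]
    i = bisect_right(times, t)
    return events[i - 1][1] if i else 0

def requests_in_window(events, start, end, cumulative=True):
    """Requests that arrived in (start, end]."""
    if cumulative:
        # Two point lookups and a subtraction.
        return counter_value_at(events, end) - counter_value_at(events, start)
    # Otherwise: touch every event inside the window.
    return sum(v for ts, v in events if start < ts <= end)

print(total_up_to(per_request_events, 130))                                # 5
print(counter_value_at(periodic_counter_events, 130))                      # 4
print(requests_in_window(periodic_counter_events, 60, 180))                # 3
print(requests_in_window(per_request_events, 60, 180, cumulative=False))   # 3
```

The never-reset counter turns both questions into a couple of point lookups, while the per-request stream makes you touch every event in the range.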

Now consider a more complex metric like average requests per minute. It's tempting to think of this as a straightforward calculation: count the requests over some period and divide by the number of minutes. But what does "average" actually mean here? Are we talking about the average over the last five minutes, the last hour, or all time? And how does this definition change depending on how the events are structured?

If you're emitting events for every request, your queries might need to scan massive amounts of data just to compute an hourly average. But it's still tricky even if the request rate is low. Imagine you only had two requests in some particular hour you're interested in, and they were spread out over 30 minutes. If you calculate an average requests per minute over that hour, you might report something like $$\frac{2\text{ reqs}\times 1\text{ hr}}{1\text{ hr}\times 60\text{ mins}}=\frac{1}{30}$$ requests per minute. But if the hour of interest is still ongoing, do you divide by 60, or by the number of minutes that have actually elapsed? Or to ask another way, if you actually want the stat of requests per minute for not just any hour, but the "last hour", do you mean the last 60 minutes from "now", or the n minutes that have elapsed since the last new hour mark?

And if you want the average hourly requests per minute over the course of a day, do you compute the requests per minute for each hour and average those, or take the day's total requests and divide by all the minutes (1440) in the whole day? If you're not careful, you'll get different numbers. E.g., suppose you had 100 requests over the whole day: 100/1440 is about 0.069. Now suppose 50 of those came at 4pm, and the other 50 at 5pm. So you have twenty-four computations of rpm, one for each hour; all but two are 0, and the other two are 50/60. Are you correctly computing (50/60 + 50/60 + 0*22)/24 to get the same 0.069 number? Or are you going to accidentally not count those 0s somehow and wind up with (50/60 + 50/60)/2 because you only have events for those two hours?
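To put numbers on that day-long example, here's a toy sketch with made-up hourly buckets:

```python
# 100 requests in a day: 50 during the 16:00 hour, 50 during the 17:00 hour.
requests_per_hour = [0] * 24
requests_per_hour[16] = 50
requests_per_hour[17] = 50

# Requests per minute over the whole day: total over all 1440 minutes.
overall_rpm = sum(requests_per_hour) / 1440                  # ~0.069

# Average of the 24 hourly rates, zeros included: same answer.
hourly_rpm = [n / 60 for n in requests_per_hour]
avg_of_all_hours = sum(hourly_rpm) / 24                      # ~0.069

# The trap: averaging only the hours that happen to have events.
nonzero = [r for r in hourly_rpm if r > 0]
avg_of_nonzero_hours = sum(nonzero) / len(nonzero)           # ~0.833, wildly off
```

The first two agree because the empty hours carry their proper weight of zero; the last one quietly answers a different question ("average rpm among hours that had traffic") and is off by more than a factor of ten.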

If you're maintaining a request count metric that gets emitted periodically instead, you limit the maximum number of events you'll have to query. The requests per minute falls out of the request counts, at the resolution of however often you're emitting these events. If you're emitting once every minute, then the requests per minute for a particular minute is just the total requests of one event minus the total requests from the preceding event. Taking an average for a particular hour, if you don't reset the counts you can just take the count from the end of the hour, subtract the count at the start of the hour, and divide by 60. And to take the average hourly requests per minute over the course of the day, you can just query the total counts at the end and start of the day, subtract, and divide by the total minutes in the day.
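A small sketch of that differencing, with hypothetical numbers and only a few minutes of data to keep it short:

```python
# Cumulative request counts emitted once a minute and never reset.
# Hypothetical data covering the first few minutes of some hour.
emissions = [(0, 0), (1, 4), (2, 4), (3, 9), (4, 15)]  # (minute_mark, running_total)

# Requests during each individual minute: difference between adjacent emissions.
per_minute = [(t2, c2 - c1) for (t1, c1), (t2, c2) in zip(emissions, emissions[1:])]
# -> [(1, 4), (2, 0), (3, 5), (4, 6)]

# Average requests per minute over the span: one subtraction and one division,
# no matter how many requests actually arrived.
(start_min, start_count), (end_min, end_count) = emissions[0], emissions[-1]
avg_rpm = (end_count - start_count) / (end_min - start_min)  # 15 / 4 = 3.75
```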

You do lose access to finer-grained details with those periodic events, though. Like if you emit once every 10 minutes instead, you can't get accurate requests-per-minute data over the last 5 minutes. Interpolation and prediction might give a good enough estimate.
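For instance, a crude linear interpolation between the two most recent emissions, with made-up numbers, assuming the rate was roughly uniform inside the interval:

```python
# Counter emitted only every 10 minutes; estimate the last 5 minutes by assuming
# the rate was roughly uniform inside the latest emission interval.
prev_emit = (600, 140)    # (time_sec, cumulative_count) -- hypothetical values
last_emit = (1200, 200)

rate_per_sec = (last_emit[1] - prev_emit[1]) / (last_emit[0] - prev_emit[0])  # 0.1 req/s
estimated_last_5_min = rate_per_sec * 5 * 60                                  # ~30 requests
```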

Complexity increases again in large distributed systems where multiple machines are emitting events. Each machine might have its own event stream, and aggregating these streams into a global average is not trivial. Are you taking an average of averages? Summing up the totals and then averaging? What about time synchronization between machines? When it comes to setting up alerts about potential issues, are you able to properly alert on problems with individual machines vs. problems with all machines?
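Even setting time synchronization aside, "average of averages" versus "sum then average" already bites here. A toy sketch with two hypothetical machines at very different traffic levels:

```python
# Per-machine request totals over the same hour.
machine_totals = {"web-1": 5400, "web-2": 60}   # hypothetical: one busy, one mostly idle

# Fleet-wide average requests per minute: sum the totals, divide by the minutes once.
global_rpm = sum(machine_totals.values()) / 60                      # 91.0

# Average of per-machine averages: a different number answering a different question.
per_machine_rpm = [total / 60 for total in machine_totals.values()]
avg_machine_rpm = sum(per_machine_rpm) / len(per_machine_rpm)       # 45.5
```

Both are legitimately "average requests per minute", but one describes the fleet's throughput and the other describes the typical machine, and they're nowhere near each other.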

Another common metric is the p99 time, or the 99th percentile of request handling times. It's useful for understanding tail-end worst-case performance scenarios. But p99 over what period? All time? The last day? And how do you compute this across multiple machines? Is it the p99 for just one machine or across all machines?

What happens if you're interested in whether any request in the last five minutes exceeded the p99 of the last 24 hours? If you're emitting on every request, you might need to query a lot of events over the two time ranges. For rarely requested endpoints, however, the p99 over the last 24 hours might be meaningless if there were few or no requests during that period.
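If you can pull the raw per-request timings for both windows, the comparison itself is trivial; the expensive part is the pulling, and the judgment call is whether the sample sizes make the percentile meaningful at all. A sketch using Python's statistics module, with synthetic timings standing in for those queries:

```python
import random
import statistics

random.seed(0)
# Synthetic response times (ms) standing in for the real queried data.
last_24h_timings = [random.lognormvariate(3, 0.5) for _ in range(5000)]
last_5min_timings = [random.lognormvariate(3, 0.5) for _ in range(40)]

# 99th percentile of the 24-hour window; quantiles() wants a reasonably large sample.
p99_24h = statistics.quantiles(last_24h_timings, n=100)[98]

# Did anything in the last five minutes exceed it?
alert = any(t > p99_24h for t in last_5min_timings)
print(p99_24h, alert)
```

With only a handful of requests in the 24-hour window, that "p99" is basically just the max.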

If you're emitting periodically, how do you even report request performance time? You were maintaining a simple counter, which every request could just increment, but now you have variable timings to report. Your event is about to get bigger with how much data it has to carry. If your emission period is long, that makes it even harder to accurately represent the performance data.

And all those problems of the time series stats I raised before are now coming for you in your application state as well: if your emission period is once every minute, and you got 10 requests in the last minute, of course you report a count of 10, or add 10 to whatever value was there before (perhaps you report both numbers, the running total and the total of just the last period), but what do you do for the performance timings of those 10? Do you just average the 10 together? (Do you have a big enough buffer to correctly compute the average? I believe I had a 100-item max ring buffer for such a purpose for each endpoint.) Do you want to maintain an all-time average or not? Does it make sense to report any sort of p99 here, or calculate that at the time series level? (I do think it's useful to at least additionally report min and max. And sometimes, hey, you only get 1 request for that 1 minute, so the avg = min = max = actual response time.)
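Something like the following is what I mean: a from-memory sketch (not the actual code, and the names are invented) of a per-endpoint accumulator with a bounded ring buffer of timings, summarized on each emission:

```python
from collections import deque
import threading

class EndpointStats:
    """Per-endpoint request counter plus a bounded ring buffer of response
    times, summarized into one event per emission period."""

    def __init__(self, max_samples=100):
        self._lock = threading.Lock()
        self.total_requests = 0                   # all-time running total, never reset
        self.period_requests = 0                  # reset on every emission
        self.timings = deque(maxlen=max_samples)  # ring buffer: oldest samples fall off

    def record(self, response_time_ms):
        with self._lock:
            self.total_requests += 1
            self.period_requests += 1
            self.timings.append(response_time_ms)

    def emit(self):
        """Called by a background thread once per period (say, every minute)."""
        with self._lock:
            samples = list(self.timings)
            event = {
                "total_requests": self.total_requests,
                "period_requests": self.period_requests,
                "avg_ms": sum(samples) / len(samples) if samples else None,
                "min_ms": min(samples) if samples else None,
                "max_ms": max(samples) if samples else None,
            }
            self.period_requests = 0
            self.timings.clear()
            return event
```

Note that if more than max_samples requests land in one period, the ring buffer has already dropped the oldest ones, so the reported average only covers the most recent 100; any real p99 is probably better computed at the time series level, if you can afford it.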

Maintaining metrics in the application raises more complex questions around persistence, too. Machines restart. What happens to the longer-time-scale metrics? Should they be persisted to another database so they can be read back in at application start and continue to be built on? If there are multiple machines, how is this coordinated? This potential complexity makes it inviting to reset various metrics after each reporting period, or only report "since machine started" continuous metrics, but then you lose out on some query simplicity. On the other hand, if you do nothing about this problem, you'll see weird event series data like 50 total requests at time t and 2 total requests at time t+1, instead of 52.
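One common query-side mitigation, if you do let counters reset when a machine restarts, is to treat any decrease in a cumulative counter as a restart rather than as negative traffic; this is the same spirit in which tools like Prometheus handle counter resets. A small sketch:

```python
def increase(counter_samples):
    """Total growth across a series of cumulative counter samples, treating any
    decrease as a process restart (the counter started over from 0)."""
    total = 0
    prev = None
    for value in counter_samples:
        if prev is not None:
            # A drop means the machine restarted; the new value is all growth since then.
            total += value - prev if value >= prev else value
        prev = value
    return total

# Counter was at 50, the machine restarted, and 7 more requests came in afterward:
print(increase([40, 45, 50, 2, 7]))  # 17, not -33
```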

Oh yeah, and I haven't even gotten to complexities on the query side, where once you've bitten the bullet of needing to examine lots of individual events to compute some data you need, you then decide to complicate matters further by not actually examining every event you could, but only a down-sampled subset of them.

The ultimate point of this post is just again that statistics in time series databases are tricky, and at least at the time I worked on this stuff, there didn't seem to be really great solutions. Maybe that's changed now, but I doubt it. What was most frustrating was a lot of other teams were implementing dashboards on similar "metrics" without any of this thought put into them, so they were really reporting on nonsense.


Posted on 2024-09-02 by Jach

Tags: programming, rant

Permalink: https://www.thejach.com/view/id/431
