HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Maintain statistics over a sliding window (robust & efficient)

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
robustslidingefficientmaintainstatisticswindowover

Problem

I am looking for an algorithms to maintain several statistics over a sliding window. The setup is as follows: There is a datastream consisting of (real value,timestamp) tuples. The values for the last x seconds are stored. In each iteration at least one new tuple arrives and between zero and more values are retired.
Now I want to maintain mean and variance without recomputing everything from scratch. There are methods if values are never retired, e.g. Welford's method. But what if values are retired?

The second problem I am facing is, how to compute covariance between two sliding windows in a similar fashion?

Are you aware of a solution or references?

Solution

To compute the mean of all the values in your window, separately compute (a) the sum of all the values in the window, and (b) a count of the number of values in the window.

To keep track of the sum of all the values in the sliding window, each time you receive a new value, add it to the sum. Each time a value leaves the window, subtract it from the sum. Now you'll always have a sum of the values in the window.

To keep track of the number of values in the window, each time you receive a new value, increment the counter. Each time a value leaves the window, decrement the counter.

Finally, the mean is the sum divided by the count.

To keep track of the variance, you need the count, the sum of the values, and the sum of the squares of the values. You can keep track of the sum of the squares of the values in the window in the same way (when a new value arrives, add its square to the running sum; when a value leaves, subtract its square). These three values are enough to let you compute the variance of the values in the sliding window.

Note that when using the sum of squares method to compute the variance, you will need to compute the three intermediate values to enough digits of precision. If you don't have enough digits of precision for those intermediate values, then you get a very inaccurate estimate. This is the robustness problem that is mentioned in the blog post you link to and that is addressed by Welford's method. However, there is another pragmatic way to address it that will work for most practical settings: just compute those three intermediate values to more digits of precision.

For the covariance between $x_1,\dots,x_n$ and $y_1,\dots,y_n$, you need to know the sum of the values $x_i$, $x_i^2$, $y_i^2$, $x_i y_i$, and the count $n$. So, it's basically the same principle as above, but now there are five intermediate values you need to maintain, instead of just three.

Context

StackExchange Computer Science Q#41025, answer score: 2

Revisions (0)

No revisions yet.