HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Calculate 40 day moving average w.r.t to a field

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fieldaveragemovingcalculateday

Problem

I have a table that stores the information about user calls in a call center. The table has a call_id, date when the call was made, actual date and time of the call, call type and a score associated with the call.

My requirement is to calculate a 40 day moving average of the score with respect to the call day. The 40 day should start from the previous day from the call date. If there are no call in the past 40 days it should include rows for the call date for which the moving average is being calculated.

Below is sample data:

select * from test_aes;


Output:

```
call_id | call_dt_key | call_type_id | call_dt_tm | aes_raw
1 | 2016-01-01 | CT1 | 2016-01-01 00:00:10-08 | 10
2 | 2016-01-01 | CT1 | 2016-01-01 00:00:20-08 | 20
3 | 2016-01-01 | CT1 | 2016-01-01 00:00:30-08 | 10
4 | 2016-01-01 | CT1 | 2016-01-01 00:00:40-08 | 20
5 | 2016-01-01 | CT1 | 2016-01-01 00:00:50-08 | 10
6 | 2016-01-01 | CT1 | 2016-01-01 00:01:00-08 | 20
7 | 2016-01-01 | CT1 | 2016-01-01 00:02:00-08 | 10
8 | 2016-01-01 | CT1 | 2016-01-01 00:03:00-08 | 20
9 | 2016-01-01 | CT1 | 2016-01-01 00:04:00-08 | 10
10 | 2016-01-01 | CT1 | 2016-01-01 00:05:00-08 | 20
11 | 2016-01-05 | CT1 | 2016-01-05 00:00:10-08 | 10
12 | 2016-01-05 | CT1 | 2016-01-05 00:00:20-08 | 10
13 | 2016-01-05 | CT1 | 2016-01-05 00:00:30-08 | 20
14 | 2016-01-05 | CT1 | 2016-01-05 00:00:40-08 | 20
15 | 2016-01-05 | CT1 | 2016-01-05 00:00:50-08 | 20
16 | 2016-01-10 | CT1 | 2016-01-10 00:00:10-08 | 10
17 | 2016-01-10 | CT1 | 2016-01-10 00:00:20-08 | 20
18 | 2016-01-15 | CT1 | 2016-01-15 00:00:10-08 | 10
19 | 2016-01-15 | CT1 | 2016-01-15

Solution

It is not really clear from the question what is the role of the call_type_id column. I will ignore it until you clarify.

Without window functions

Here is a simple variant that doesn't use window functions at all.

Make sure that there is an index on (call_dt_key, aes_raw).

CTE_Dates returns a list of all dates in the table and calculates average for each day. This average_current_day will be needed for the first day. The server will scan the whole index any way, so calculating such average is cheap.

Then, for each distinct day I use a self-join to calculate the average for 40 previous days. This will return NULL for the first day, which is replaced with average_current_day in the main query.

You don't have to use CTE here, it just makes the query easier to read.

WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,call_dt_key - INTERVAL '41 day' AS dt_from
        ,call_dt_key - INTERVAL '1 day' AS dt_to
        ,AVG(test_aes.aes_raw) AS average_current_day
    FROM test_aes
    GROUP BY call_dt_key
)
SELECT
    CTE_Dates.call_dt_key
    ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
    CTE_Dates
    LEFT JOIN LATERAL
    (
        SELECT AVG(test_aes.aes_raw) AS average_40
        FROM test_aes
        WHERE
                test_aes.call_dt_key >= CTE_Dates.dt_from
            AND test_aes.call_dt_key <= CTE_Dates.dt_to
    ) AS prev40 ON true
ORDER BY call_dt_key;


Result

|                call_dt_key |         average_40 |
|----------------------------|--------------------|
|  January, 01 2016 00:00:00 |                 15 |
|  January, 05 2016 00:00:00 |                 15 |
|  January, 10 2016 00:00:00 | 15.333333333333334 |
|  January, 15 2016 00:00:00 | 15.294117647058824 |
|  January, 16 2016 00:00:00 |               15.5 |
|  January, 20 2016 00:00:00 | 15.652173913043478 |
|  January, 21 2016 00:00:00 |               15.6 |
|  January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
|    March, 04 2016 00:00:00 |                 15 |
|    March, 18 2016 00:00:00 |                 15 |


Here is SQL Fiddle.

With the recommended index this solution should not be too bad.

There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support RANGE with a window of specified size, while SQL Server doesn't at this moment. So, solution for Postgres is likely to be a bit simpler.

The key part would be:

AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)


To calculate the moving average using these window functions you'd likely have to fill the gaps in dates first, so that the table has at least one row for each day (with NULL values for aes_raw in these dummy rows).

...

As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the RANGE clause in Postgres still has limitations similar to SQL Server. Docs say:


The value PRECEDING and value FOLLOWING cases are currently only
allowed in ROWS mode.

So, this method with the RANGE above would not work for you even if you used Postgres 9.5.

Using window functions

You can use approaches outlined in the question for SQL Server above. For example, group your data into daily sums, add rows for missing days, calculate the moving SUM and COUNT using OVER with ROWS and then calculate moving average.

Something along these lines:

WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,SUM(test_aes.aes_raw) AS sum_daily
        ,COUNT(*) AS cnt_daily
        ,AVG(test_aes.aes_raw) AS avg_daily
        ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
    FROM test_aes
    GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
    SELECT
        CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
        ,avg_daily
        ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
        /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
    FROM
        CTE_Dates
        INNER JOIN LATERAL
            generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day') 
            AS all_dates(dt) ON true
)
SELECT
    final_dt
    ,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;


Result is the same as in the first variant. See SQL Fiddle.

Again, this could be written with inlined sub-queries without CTEs.

It is worth checking on real data the performance of different variants.

Code Snippets

WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,call_dt_key - INTERVAL '41 day' AS dt_from
        ,call_dt_key - INTERVAL '1 day' AS dt_to
        ,AVG(test_aes.aes_raw) AS average_current_day
    FROM test_aes
    GROUP BY call_dt_key
)
SELECT
    CTE_Dates.call_dt_key
    ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
    CTE_Dates
    LEFT JOIN LATERAL
    (
        SELECT AVG(test_aes.aes_raw) AS average_40
        FROM test_aes
        WHERE
                test_aes.call_dt_key >= CTE_Dates.dt_from
            AND test_aes.call_dt_key <= CTE_Dates.dt_to
    ) AS prev40 ON true
ORDER BY call_dt_key;
|                call_dt_key |         average_40 |
|----------------------------|--------------------|
|  January, 01 2016 00:00:00 |                 15 |
|  January, 05 2016 00:00:00 |                 15 |
|  January, 10 2016 00:00:00 | 15.333333333333334 |
|  January, 15 2016 00:00:00 | 15.294117647058824 |
|  January, 16 2016 00:00:00 |               15.5 |
|  January, 20 2016 00:00:00 | 15.652173913043478 |
|  January, 21 2016 00:00:00 |               15.6 |
|  January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
|    March, 04 2016 00:00:00 |                 15 |
|    March, 18 2016 00:00:00 |                 15 |
AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)
WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,SUM(test_aes.aes_raw) AS sum_daily
        ,COUNT(*) AS cnt_daily
        ,AVG(test_aes.aes_raw) AS avg_daily
        ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
    FROM test_aes
    GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
    SELECT
        CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
        ,avg_daily
        ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
        /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
    FROM
        CTE_Dates
        INNER JOIN LATERAL
            generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day') 
            AS all_dates(dt) ON true
)
SELECT
    final_dt
    ,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;

Context

StackExchange Database Administrators Q#140358, answer score: 7

Revisions (0)

No revisions yet.