patternsqlMinor
Calculate 40 day moving average w.r.t to a field
Viewed 0 times
fieldaveragemovingcalculateday
Problem
I have a table that stores the information about user calls in a call center. The table has a call_id, date when the call was made, actual date and time of the call, call type and a score associated with the call.
My requirement is to calculate a 40 day moving average of the score with respect to the call day. The 40 day should start from the previous day from the call date. If there are no call in the past 40 days it should include rows for the call date for which the moving average is being calculated.
Below is sample data:
Output:
```
call_id | call_dt_key | call_type_id | call_dt_tm | aes_raw
1 | 2016-01-01 | CT1 | 2016-01-01 00:00:10-08 | 10
2 | 2016-01-01 | CT1 | 2016-01-01 00:00:20-08 | 20
3 | 2016-01-01 | CT1 | 2016-01-01 00:00:30-08 | 10
4 | 2016-01-01 | CT1 | 2016-01-01 00:00:40-08 | 20
5 | 2016-01-01 | CT1 | 2016-01-01 00:00:50-08 | 10
6 | 2016-01-01 | CT1 | 2016-01-01 00:01:00-08 | 20
7 | 2016-01-01 | CT1 | 2016-01-01 00:02:00-08 | 10
8 | 2016-01-01 | CT1 | 2016-01-01 00:03:00-08 | 20
9 | 2016-01-01 | CT1 | 2016-01-01 00:04:00-08 | 10
10 | 2016-01-01 | CT1 | 2016-01-01 00:05:00-08 | 20
11 | 2016-01-05 | CT1 | 2016-01-05 00:00:10-08 | 10
12 | 2016-01-05 | CT1 | 2016-01-05 00:00:20-08 | 10
13 | 2016-01-05 | CT1 | 2016-01-05 00:00:30-08 | 20
14 | 2016-01-05 | CT1 | 2016-01-05 00:00:40-08 | 20
15 | 2016-01-05 | CT1 | 2016-01-05 00:00:50-08 | 20
16 | 2016-01-10 | CT1 | 2016-01-10 00:00:10-08 | 10
17 | 2016-01-10 | CT1 | 2016-01-10 00:00:20-08 | 20
18 | 2016-01-15 | CT1 | 2016-01-15 00:00:10-08 | 10
19 | 2016-01-15 | CT1 | 2016-01-15
My requirement is to calculate a 40 day moving average of the score with respect to the call day. The 40 day should start from the previous day from the call date. If there are no call in the past 40 days it should include rows for the call date for which the moving average is being calculated.
Below is sample data:
select * from test_aes;Output:
```
call_id | call_dt_key | call_type_id | call_dt_tm | aes_raw
1 | 2016-01-01 | CT1 | 2016-01-01 00:00:10-08 | 10
2 | 2016-01-01 | CT1 | 2016-01-01 00:00:20-08 | 20
3 | 2016-01-01 | CT1 | 2016-01-01 00:00:30-08 | 10
4 | 2016-01-01 | CT1 | 2016-01-01 00:00:40-08 | 20
5 | 2016-01-01 | CT1 | 2016-01-01 00:00:50-08 | 10
6 | 2016-01-01 | CT1 | 2016-01-01 00:01:00-08 | 20
7 | 2016-01-01 | CT1 | 2016-01-01 00:02:00-08 | 10
8 | 2016-01-01 | CT1 | 2016-01-01 00:03:00-08 | 20
9 | 2016-01-01 | CT1 | 2016-01-01 00:04:00-08 | 10
10 | 2016-01-01 | CT1 | 2016-01-01 00:05:00-08 | 20
11 | 2016-01-05 | CT1 | 2016-01-05 00:00:10-08 | 10
12 | 2016-01-05 | CT1 | 2016-01-05 00:00:20-08 | 10
13 | 2016-01-05 | CT1 | 2016-01-05 00:00:30-08 | 20
14 | 2016-01-05 | CT1 | 2016-01-05 00:00:40-08 | 20
15 | 2016-01-05 | CT1 | 2016-01-05 00:00:50-08 | 20
16 | 2016-01-10 | CT1 | 2016-01-10 00:00:10-08 | 10
17 | 2016-01-10 | CT1 | 2016-01-10 00:00:20-08 | 20
18 | 2016-01-15 | CT1 | 2016-01-15 00:00:10-08 | 10
19 | 2016-01-15 | CT1 | 2016-01-15
Solution
It is not really clear from the question what is the role of the
Without window functions
Here is a simple variant that doesn't use window functions at all.
Make sure that there is an index on
Then, for each distinct day I use a self-join to calculate the average for 40 previous days. This will return
You don't have to use CTE here, it just makes the query easier to read.
Result
Here is SQL Fiddle.
With the recommended index this solution should not be too bad.
There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support
The key part would be:
To calculate the moving average using these window functions you'd likely have to fill the gaps in dates first, so that the table has at least one row for each day (with
...
As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the
The value PRECEDING and value FOLLOWING cases are currently only
allowed in ROWS mode.
So, this method with the
Using window functions
You can use approaches outlined in the question for SQL Server above. For example, group your data into daily sums, add rows for missing days, calculate the moving
Something along these lines:
Result is the same as in the first variant. See SQL Fiddle.
Again, this could be written with inlined sub-queries without CTEs.
It is worth checking on real data the performance of different variants.
call_type_id column. I will ignore it until you clarify.Without window functions
Here is a simple variant that doesn't use window functions at all.
Make sure that there is an index on
(call_dt_key, aes_raw).CTE_Dates returns a list of all dates in the table and calculates average for each day. This average_current_day will be needed for the first day. The server will scan the whole index any way, so calculating such average is cheap.Then, for each distinct day I use a self-join to calculate the average for 40 previous days. This will return
NULL for the first day, which is replaced with average_current_day in the main query.You don't have to use CTE here, it just makes the query easier to read.
WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,call_dt_key - INTERVAL '41 day' AS dt_from
,call_dt_key - INTERVAL '1 day' AS dt_to
,AVG(test_aes.aes_raw) AS average_current_day
FROM test_aes
GROUP BY call_dt_key
)
SELECT
CTE_Dates.call_dt_key
,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
CTE_Dates
LEFT JOIN LATERAL
(
SELECT AVG(test_aes.aes_raw) AS average_40
FROM test_aes
WHERE
test_aes.call_dt_key >= CTE_Dates.dt_from
AND test_aes.call_dt_key <= CTE_Dates.dt_to
) AS prev40 ON true
ORDER BY call_dt_key;Result
| call_dt_key | average_40 |
|----------------------------|--------------------|
| January, 01 2016 00:00:00 | 15 |
| January, 05 2016 00:00:00 | 15 |
| January, 10 2016 00:00:00 | 15.333333333333334 |
| January, 15 2016 00:00:00 | 15.294117647058824 |
| January, 16 2016 00:00:00 | 15.5 |
| January, 20 2016 00:00:00 | 15.652173913043478 |
| January, 21 2016 00:00:00 | 15.6 |
| January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
| March, 04 2016 00:00:00 | 15 |
| March, 18 2016 00:00:00 | 15 |Here is SQL Fiddle.
With the recommended index this solution should not be too bad.
There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support
RANGE with a window of specified size, while SQL Server doesn't at this moment. So, solution for Postgres is likely to be a bit simpler.The key part would be:
AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)To calculate the moving average using these window functions you'd likely have to fill the gaps in dates first, so that the table has at least one row for each day (with
NULL values for aes_raw in these dummy rows)....
As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the
RANGE clause in Postgres still has limitations similar to SQL Server. Docs say:The value PRECEDING and value FOLLOWING cases are currently only
allowed in ROWS mode.
So, this method with the
RANGE above would not work for you even if you used Postgres 9.5.Using window functions
You can use approaches outlined in the question for SQL Server above. For example, group your data into daily sums, add rows for missing days, calculate the moving
SUM and COUNT using OVER with ROWS and then calculate moving average.Something along these lines:
WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,SUM(test_aes.aes_raw) AS sum_daily
,COUNT(*) AS cnt_daily
,AVG(test_aes.aes_raw) AS avg_daily
,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
FROM test_aes
GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
SELECT
CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
,avg_daily
,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
/SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
FROM
CTE_Dates
INNER JOIN LATERAL
generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day')
AS all_dates(dt) ON true
)
SELECT
final_dt
,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;Result is the same as in the first variant. See SQL Fiddle.
Again, this could be written with inlined sub-queries without CTEs.
It is worth checking on real data the performance of different variants.
Code Snippets
WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,call_dt_key - INTERVAL '41 day' AS dt_from
,call_dt_key - INTERVAL '1 day' AS dt_to
,AVG(test_aes.aes_raw) AS average_current_day
FROM test_aes
GROUP BY call_dt_key
)
SELECT
CTE_Dates.call_dt_key
,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
CTE_Dates
LEFT JOIN LATERAL
(
SELECT AVG(test_aes.aes_raw) AS average_40
FROM test_aes
WHERE
test_aes.call_dt_key >= CTE_Dates.dt_from
AND test_aes.call_dt_key <= CTE_Dates.dt_to
) AS prev40 ON true
ORDER BY call_dt_key;| call_dt_key | average_40 |
|----------------------------|--------------------|
| January, 01 2016 00:00:00 | 15 |
| January, 05 2016 00:00:00 | 15 |
| January, 10 2016 00:00:00 | 15.333333333333334 |
| January, 15 2016 00:00:00 | 15.294117647058824 |
| January, 16 2016 00:00:00 | 15.5 |
| January, 20 2016 00:00:00 | 15.652173913043478 |
| January, 21 2016 00:00:00 | 15.6 |
| January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
| March, 04 2016 00:00:00 | 15 |
| March, 18 2016 00:00:00 | 15 |AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,SUM(test_aes.aes_raw) AS sum_daily
,COUNT(*) AS cnt_daily
,AVG(test_aes.aes_raw) AS avg_daily
,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
FROM test_aes
GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
SELECT
CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
,avg_daily
,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
/SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
FROM
CTE_Dates
INNER JOIN LATERAL
generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day')
AS all_dates(dt) ON true
)
SELECT
final_dt
,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;Context
StackExchange Database Administrators Q#140358, answer score: 7
Revisions (0)
No revisions yet.