HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Scalable query for running counts of events within x previous days

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
previouseventsscalablequerywithinrunningforcountsdays

Problem

I already posted this question on stackoverflow but I thought that I might get a better answer here.

I have a table storing millions of events occurring to users:

Table "public.events"
   Column   |           Type           |                         Modifiers                         
------------+--------------------------+-----------------------------------------------------------
 event_id   | integer                  | not null default nextval('events_event_id_seq'::regclass)
 user_id    | bigint                   | 
 event_type | integer                  | 
 ts         | timestamp with time zone |


There are 5 different values for event_type, millions of users, and varying number of events per user per event_type usually ranging from 1 to 50.

A sample of the data:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+


I would like to get, for each event, the number of events of the same user and the same event_type that occurred within 30 days before the event.

It should look like the following:

```
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 |

Solution

Assuming this sanitized table definition

CREATE TABLE events (
  event_id   serial PRIMARY KEY
, user_id    int
, event_type int
, ts         timestamp  -- don't use reserved word as identifier
);


Your comparison seems unfair, the first query has ORDER BY event_id, but the second hasn't. The EXPLAIN output does not fit the first query (no sort step). Be sure to run all tests with the same ORDER BY clause to get valid results. Best run a couple of times and compare the best of 5 to eliminate caching effects.
Index

The key to performance is this multicolumn index:

CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);


The sequence of columns matters! Why?

  • Multicolumn index and performance



Queries

Each of your queries can be improved:
Query 1

Remove group by event_type, user_id without substitution:

SELECT event_id, user_id, event_type, ts
    , (SELECT count(*) 
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
      ) AS  ct
FROM   events e
ORDER  BY event_id;


Equivalent with more modern LATERAL join (Postgres 9.3+):

SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;


Which might also be the fastest query in combination with above index.

Related answer with more explanation:

  • Optimize GROUP BY query to retrieve latest record per user



Query 2

  • last_value(ts) OVER w as lv is just an expensive copy of ts.



  • ROWS UNBOUNDED PRECEDING is the default and hence just noise.



SELECT event_id, user_id, event_type, ts, count(*) AS ct
FROM  (
   SELECT event_id, user_id, event_type, ts
        , unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
                                     ORDER BY ts)) AS agg
   FROM   events   
   ) e
WHERE  agg >= ts - interval '30 days'
GROUP  BY event_id, user_id, event_type, ts
ORDER  BY event_id;


But this is needlessly complex. The same logic can be had much cheaper with a join instead of the subquery with window function:

SELECT e.*, count(*) AS ct
FROM   events e
JOIN   events x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id
ORDER  BY e.event_id;


Which is my other favorite for top performance. Again with the above index.
Other query

Here is another idea, but I doubt it can compete. Give it a go, though:

WITH cte AS (
   SELECT event_id, user_id, event_type, ts
        , row_number(*) OVER (PARTITION BY user_id, event_type
                              ORDER BY ts) AS rn
   FROM   events
   )
SELECT e.event_id, e.user_id, e.event_type, e.ts, e.rn - min(x.rn) + 1 AS ct
FROM   cte e
JOIN   cte x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id, e.user_id, e.event_type, e.ts, e.rn
ORDER  BY e.event_id;


SQL Fiddle demonstrating all in Postgres 9.3.

Code Snippets

CREATE TABLE events (
  event_id   serial PRIMARY KEY
, user_id    int
, event_type int
, ts         timestamp  -- don't use reserved word as identifier
);
CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
SELECT event_id, user_id, event_type, ts
    , (SELECT count(*) 
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
      ) AS  ct
FROM   events e
ORDER  BY event_id;
SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;
SELECT event_id, user_id, event_type, ts, count(*) AS ct
FROM  (
   SELECT event_id, user_id, event_type, ts
        , unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
                                     ORDER BY ts)) AS agg
   FROM   events   
   ) e
WHERE  agg >= ts - interval '30 days'
GROUP  BY event_id, user_id, event_type, ts
ORDER  BY event_id;

Context

StackExchange Database Administrators Q#97076, answer score: 5

Revisions (0)

No revisions yet.