patternsqlMinor
Scalable query for running counts of events within x previous days
Viewed 0 times
previouseventsscalablequerywithinrunningforcountsdays
Problem
I already posted this question on stackoverflow but I thought that I might get a better answer here.
I have a table storing millions of events occurring to users:
There are 5 different values for event_type, millions of users, and varying number of events per user per event_type usually ranging from 1 to 50.
A sample of the data:
I would like to get, for each event, the number of events of the same user and the same
It should look like the following:
```
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 |
I have a table storing millions of events occurring to users:
Table "public.events"
Column | Type | Modifiers
------------+--------------------------+-----------------------------------------------------------
event_id | integer | not null default nextval('events_event_id_seq'::regclass)
user_id | bigint |
event_type | integer |
ts | timestamp with time zone |There are 5 different values for event_type, millions of users, and varying number of events per user per event_type usually ranging from 1 to 50.
A sample of the data:
+-----------+----------+-------------+----------------------------+
| event_id | user_id | event_type | timestamp |
+-----------+----------+-------------+----------------------------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+I would like to get, for each event, the number of events of the same user and the same
event_type that occurred within 30 days before the event.It should look like the following:
```
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 |
Solution
Assuming this sanitized table definition
Your comparison seems unfair, the first query has
Index
The key to performance is this multicolumn index:
The sequence of columns matters! Why?
Queries
Each of your queries can be improved:
Query 1
Remove
Equivalent with more modern
Which might also be the fastest query in combination with above index.
Related answer with more explanation:
Query 2
But this is needlessly complex. The same logic can be had much cheaper with a join instead of the subquery with window function:
Which is my other favorite for top performance. Again with the above index.
Other query
Here is another idea, but I doubt it can compete. Give it a go, though:
SQL Fiddle demonstrating all in Postgres 9.3.
CREATE TABLE events (
event_id serial PRIMARY KEY
, user_id int
, event_type int
, ts timestamp -- don't use reserved word as identifier
);Your comparison seems unfair, the first query has
ORDER BY event_id, but the second hasn't. The EXPLAIN output does not fit the first query (no sort step). Be sure to run all tests with the same ORDER BY clause to get valid results. Best run a couple of times and compare the best of 5 to eliminate caching effects.Index
The key to performance is this multicolumn index:
CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);The sequence of columns matters! Why?
- Multicolumn index and performance
Queries
Each of your queries can be improved:
Query 1
Remove
group by event_type, user_id without substitution:SELECT event_id, user_id, event_type, ts
, (SELECT count(*)
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) AS ct
FROM events e
ORDER BY event_id;Equivalent with more modern
LATERAL join (Postgres 9.3+):SELECT *
FROM events e
, LATERAL (
SELECT count(*) AS ct
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) ct
ORDER BY event_id;Which might also be the fastest query in combination with above index.
Related answer with more explanation:
- Optimize GROUP BY query to retrieve latest record per user
Query 2
last_value(ts) OVER w as lvis just an expensive copy ofts.
ROWS UNBOUNDED PRECEDINGis the default and hence just noise.
SELECT event_id, user_id, event_type, ts, count(*) AS ct
FROM (
SELECT event_id, user_id, event_type, ts
, unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
ORDER BY ts)) AS agg
FROM events
) e
WHERE agg >= ts - interval '30 days'
GROUP BY event_id, user_id, event_type, ts
ORDER BY event_id;But this is needlessly complex. The same logic can be had much cheaper with a join instead of the subquery with window function:
SELECT e.*, count(*) AS ct
FROM events e
JOIN events x USING (user_id, event_type)
WHERE x.ts >= e.ts - interval '30 days'
AND x.ts <= e.ts
GROUP BY e.event_id
ORDER BY e.event_id;Which is my other favorite for top performance. Again with the above index.
Other query
Here is another idea, but I doubt it can compete. Give it a go, though:
WITH cte AS (
SELECT event_id, user_id, event_type, ts
, row_number(*) OVER (PARTITION BY user_id, event_type
ORDER BY ts) AS rn
FROM events
)
SELECT e.event_id, e.user_id, e.event_type, e.ts, e.rn - min(x.rn) + 1 AS ct
FROM cte e
JOIN cte x USING (user_id, event_type)
WHERE x.ts >= e.ts - interval '30 days'
AND x.ts <= e.ts
GROUP BY e.event_id, e.user_id, e.event_type, e.ts, e.rn
ORDER BY e.event_id;SQL Fiddle demonstrating all in Postgres 9.3.
Code Snippets
CREATE TABLE events (
event_id serial PRIMARY KEY
, user_id int
, event_type int
, ts timestamp -- don't use reserved word as identifier
);CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);SELECT event_id, user_id, event_type, ts
, (SELECT count(*)
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) AS ct
FROM events e
ORDER BY event_id;SELECT *
FROM events e
, LATERAL (
SELECT count(*) AS ct
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) ct
ORDER BY event_id;SELECT event_id, user_id, event_type, ts, count(*) AS ct
FROM (
SELECT event_id, user_id, event_type, ts
, unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
ORDER BY ts)) AS agg
FROM events
) e
WHERE agg >= ts - interval '30 days'
GROUP BY event_id, user_id, event_type, ts
ORDER BY event_id;Context
StackExchange Database Administrators Q#97076, answer score: 5
Revisions (0)
No revisions yet.