Improving sort performance in GROUP BY clause
Problem
I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

events table

```
CREATE TABLE events (
  id         serial NOT NULL PRIMARY KEY,
  event_type text NOT NULL,
  event_path jsonb,
  event_data jsonb,
  created_at timestamp with time zone NOT NULL
);

-- Index on type and created time
CREATE INDEX events_event_type_created_at_idx
ON events (event_type, created_at);
```

event_refs table

```
CREATE TABLE event_refs (
  event_id        integer NOT NULL,
  reference_key   text NOT NULL,
  reference_value text NOT NULL,
  CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
  CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
      REFERENCES events (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);
```

Both tables hold 2M rows. This is the query I'm running:
```
SELECT
    EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
    events
INNER JOIN
    event_refs
ON
    event_refs.event_id = events.id AND
    event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY
    event_refs.reference_value
HAVING COUNT(*) > 1
```

I am aware of the operator precedence in the WHERE clause. It is only supposed to filter events with the type 'event2' by date.
This is the EXPLAIN ANALYZE output:

```
GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
  Group Key: event_refs.reference_value
  Filter: (count(*) > 1)
  Rows Removed by Filter: 41315
  ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
        Sort Key: event_refs.reference_value
        Sort Method: external merge  Disk: 3904kB
        ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
              Hash Cond: (event_refs.ev
```
Solution
work_mem

That's what makes your sort expensive:

```
Sort Method: external merge  Disk: 3904kB
```

The sort spills to disk, which kills performance. You need more RAM; in particular, you need to increase the setting for work_mem. The manual: "work_mem (integer): Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files."
In this particular case, raising the setting by a meager 4MB should do the trick. But you are going to need a lot more in your full deployment of 60M rows, and since a high general setting for work_mem can backfire (read the manual where I linked to!), consider setting it high enough locally, just for your query:

```
BEGIN;
SET LOCAL work_mem = '500MB';  -- adapt to your needs
SELECT ...;
COMMIT;
```

Be aware that even SET LOCAL sticks until the end of the transaction. You may want to reset it if you put more into the same transaction:

```
RESET work_mem;
```

Or encapsulate the query in a function with a function-local setting. Related answer with an example for a function:

- Postgres not using index when index scan is much better option
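The function-local approach can be sketched like this (the function name and return type are illustrative, not from the original answer; the SET clause scopes work_mem to each call):

```sql
-- Hypothetical wrapper function; work_mem is raised only while it runs
CREATE OR REPLACE FUNCTION funnel_times()
  RETURNS SETOF double precision AS
$func$
SELECT EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at)))
FROM   events
JOIN   event_refs ON event_refs.event_id = events.id
                 AND event_refs.reference_key = 'project'
GROUP  BY event_refs.reference_value
HAVING count(*) > 1
$func$ LANGUAGE sql
SET work_mem = '500MB';  -- function-local setting, reverted when the function returns
```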
Indexes

I would also try these indexes:

```
CREATE INDEX events_event_type_created_at_idx ON events (event_type, created_at, id);
```

Adding id as last column only makes sense if you get index-only scans out of it. See:

- The manual
- The Postgres Wiki
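Whether the planner actually produces index-only scans can be checked with EXPLAIN; a sketch (the visibility map must be reasonably current, hence the VACUUM):

```sql
VACUUM events;  -- refresh the visibility map; index-only scans depend on it

EXPLAIN (ANALYZE)
SELECT id
FROM   events
WHERE  event_type = 'event1'
AND    created_at >= '2015-07-01 00:00:00+08:00';
-- look for "Index Only Scan using events_event_type_created_at_idx"
-- and a low "Heap Fetches" count in the output
```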
And a partial index on event_refs:

```
CREATE INDEX event_refs_foo_idx ON event_refs (event_id, reference_value)
WHERE reference_key = 'project';
```

The predicate WHERE reference_key = 'project' does not help much in your test case (except maybe for query planning), but it should help a lot in your full deployment, where different reference types will be added later. And this should also allow index-only scans.
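A partial index is only considered when the query's WHERE clause lets the planner prove the index predicate holds; a minimal sketch:

```sql
-- Can use the partial index: the predicate matches the index's WHERE clause
SELECT event_id, reference_value
FROM   event_refs
WHERE  reference_key = 'project';

-- Cannot use it: nothing implies reference_key = 'project'
SELECT event_id, reference_value
FROM   event_refs;
```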
Possible alternative query

Since you are selecting large parts of events anyway, this alternative query might be faster (it heavily depends on data distribution):

```
SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
FROM   events e
JOIN  (
   SELECT event_id, reference_value, count(*) AS ct
   FROM   event_refs
   WHERE  reference_key = 'project'
   GROUP  BY event_id, reference_value
   ) r ON r.event_id = e.id
WHERE (e.event_type = 'event1' OR
       e.event_type = 'event2')  -- see below!
AND    e.created_at >= '2015-07-01 00:00:00+08:00'
AND    e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING sum(r.ct) > 1;
```

I suspect a bug in the query, and that you want parentheses in the WHERE clause like I added: according to operator precedence, AND binds before OR.

This only makes sense if there are many rows per (event_id, reference_value) in event_refs. Again, the above index would help.
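The precedence pitfall can be demonstrated in isolation:

```sql
-- AND binds tighter than OR, so these two expressions differ:
SELECT true OR false AND false;    -- true:  parsed as true OR (false AND false)
SELECT (true OR false) AND false;  -- false
```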
Context
StackExchange Database Administrators Q#125094, answer score: 14