HiveBrain v1.2.0

Improving sort performance in GROUP BY clause

Submitted by: @import:stackexchange-dba

Problem

I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

events table

CREATE TABLE events (
  id serial NOT NULL PRIMARY KEY,
  event_type text NOT NULL,
  event_path jsonb,
  event_data jsonb,
  created_at timestamp with time zone NOT NULL
);

-- Index on type and created time

CREATE INDEX events_event_type_created_at_idx
  ON events (event_type, created_at);


event_refs table

CREATE TABLE event_refs (
  event_id integer NOT NULL,
  reference_key text NOT NULL,
  reference_value text NOT NULL,
  CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
  CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
      REFERENCES events (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);


Both tables hold 2M rows. This is the query I'm running:

SELECT
  EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
  events
INNER JOIN
  event_refs
ON
  event_refs.event_id = events.id AND
  event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY
    event_refs.reference_value
HAVING COUNT(*) > 1


I am aware of the operator precedence in the where clause. It is only supposed to filter events with the type 'event2' by date.
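To spell that out: AND binds tighter than OR, so without parentheses Postgres parses the predicate as

```sql
WHERE events.event_type = 'event1'
   OR (events.event_type = 'event2'
       AND events.created_at >= '2015-07-01 00:00:00+08:00'
       AND events.created_at <  '2015-12-01 00:00:00+08:00')
```

i.e. 'event1' rows pass regardless of date, while only 'event2' rows are date-filtered.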

This is the EXPLAIN ANALYZE output

```
GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
  Group Key: event_refs.reference_value
  Filter: (count(*) > 1)
  Rows Removed by Filter: 41315
  ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
        Sort Key: event_refs.reference_value
        Sort Method: external merge  Disk: 3904kB
        ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
              Hash Cond: (event_refs.ev
```

Solution

work_mem

That's what makes your sort expensive:


Sort Method: external merge Disk: 3904kB

The sort spills to disk, which kills performance. You need more RAM. In particular, you need to increase the setting for work_mem. The manual:


work_mem (integer)


Specifies the amount of memory to be used by internal sort operations
and hash tables before writing to temporary disk files.

In this particular case, raising the setting by a meager 4MB should do the trick. Generally, since you are going to need a lot more in your full deployment with 60M rows, and since a high general setting for work_mem can backfire (read the manual I linked to!), consider setting it high enough locally, just for your query, like:

BEGIN;
SET LOCAL work_mem = '500MB';  -- adapt to your needs
SELECT ...;
COMMIT;


Be aware that even SET LOCAL sticks until the end of the transaction. You may want to reset if you put more into the same transaction:

RESET work_mem;


Or encapsulate the query in a function with a function-local setting. Related answer with example for function:

  • Postgres not using index when index scan is much better option
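A minimal sketch of such a function (the name, return type, and the parenthesized WHERE clause are mine; the trailing SET clause is the point - the setting applies only while the function runs):

```sql
-- Hypothetical wrapper: function-local work_mem, reverted automatically
CREATE OR REPLACE FUNCTION funnel_times()
  RETURNS SETOF double precision AS
$func$
SELECT EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at)))
FROM   events
JOIN   event_refs ON event_refs.event_id = events.id
                 AND event_refs.reference_key = 'project'
WHERE  events.event_type = 'event1'
   OR (events.event_type = 'event2'
  AND  events.created_at >= '2015-07-01 00:00:00+08:00'
  AND  events.created_at <  '2015-12-01 00:00:00+08:00')
GROUP  BY event_refs.reference_value
HAVING count(*) > 1
$func$  LANGUAGE sql  SET work_mem = '500MB';
```

Call it with `SELECT * FROM funnel_times();`.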



Indexes

I would also try these indexes:

CREATE INDEX events_event_type_created_at_idx ON events (event_type, created_at, id);


Adding id as last column only makes sense if you get index-only scans out of it. See:

  • The manual



  • The Postgres Wiki



And a partial index on event_refs:

CREATE INDEX event_refs_foo_idx ON event_refs (event_id, reference_value)
WHERE  reference_key = 'project';


The predicate WHERE reference_key = 'project' does not help much in your test case (except maybe for query planning), but it should help a lot in your full deployment where there will be different types added later.

And this should also allow index-only scans.
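Note that Postgres only considers a partial index when the query's WHERE clause provably implies the index predicate, so the condition has to be repeated in each query that should use it:

```sql
-- Can use event_refs_foo_idx (predicate matches):
SELECT event_id, reference_value
FROM   event_refs
WHERE  reference_key = 'project';

-- Cannot use it (nothing implies the index predicate):
SELECT event_id, reference_value
FROM   event_refs;
```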

Possible alternative query

Since you are selecting large parts of events anyway, this alternative query might be faster (heavily depends on data distribution):

SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
FROM   events e
JOIN  (
   SELECT event_id, reference_value, count(*) AS ct
   FROM   event_refs
   WHERE  reference_key = 'project'                   
   GROUP  BY event_id, reference_value
   ) r ON r.event_id = e.id
WHERE (e.event_type = 'event1' OR
       e.event_type = 'event2')        -- see below !
AND    e.created_at >= '2015-07-01 00:00:00+08:00'
AND    e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING sum(r.ct) > 1;


I suspect a bug in the query: you probably want the parentheses in the WHERE clause as I added them. By operator precedence, AND binds before OR.

Only makes sense if there are many rows per (event_id, reference_value) in event_refs. Again, the above index would help.


Context

StackExchange Database Administrators Q#125094, answer score: 14
