HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Slow window function query with big table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
withqueryfunctionslowbigwindowtable

Problem

I'm doing some performance testing on a new DB design on PostgreSQL 9.4rc1 and I'm seeing some pretty slow queries using window functions. Here is my table setup:

CREATE TABLE player_stat (
  player_id    VARCHAR(200) NOT NULL,
  stat_id      BIGINT NOT NULL,
  value        BIGINT NOT NULL DEFAULT 0,
  last_updated TIMESTAMP WITH TIME ZONE NOT NULL,
  last_active  TIMESTAMP WITH TIME ZONE DEFAULT NULL,

  CONSTRAINT player_stat_pk PRIMARY KEY (player_id, stat_id),
  CONSTRAINT player_stat_fk1 FOREIGN KEY(stat_id) REFERENCES stat (id)
);
CREATE INDEX player_stat_stat_value_player_desc
  ON player_stat (stat_id, value DESC, player_id ASC);


I've inserted 30 million rows into this table split among 3 stats:

INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 1, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 2, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 3, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;


Then I try to rank the players for a given stat (EDIT):

SELECT * FROM 
( SELECT player_id
       , rank() OVER (ORDER BY value DESC, player_id ASC)  as rank 
  FROM player_stat 
  WHERE stat_id = 1
) as t 
WHERE rank <= 20 
ORDER BY rank ASC;


This query takes about 5.5 seconds to return. Running Explain on it returns the following:

```
"Sort (cost=1167612.28..1176082.26 rows=3387993 width=15) (actual time=9726.132..9726.135 rows=20 loops=1)"
" Sort Key: t.rank"
" Sort Method: quicksort Memory: 25kB"
" -> Subquery Scan on t (cost=0.56..684349.57 rows=3387993 width=15) (actual time=0.080..9726.116 rows=20 loops=1)"
" Filter: (t.rank WindowAgg (cost=0.56..55729

Solution

Is there any way I can speed this up?

Yes. Don't use a varchar column for an integer number. Use integer or bigint if you burn that many IDs - much smaller in table and index and faster to process. Since you are ranking 10 million rows in your test, this is going to make a substantial difference.

player_id VARCHAR(200) NOT NULL,

player_id int NOT NULL,

Or a uuid if you must (I doubt that):

  • Postgres uuid: Use as primary key, or in addition to SERIAL -for disconnected app-



Your query ranks 10 million rows. This is going to take some time, even when read from the index directly and no sort step.

Side note: If you bulk-insert rows first and add index and PK constraint (and FK constraint) after, that's going to be much faster, plus you get perfect indexes without bloat without running REINDEX or VACUUM FULL.
Do make sure ANALYZE has been run on the table before testing performance, though.

What you didn't ask

.. but, going out on a limb here, what are probably looking for.

The EXPLAIN output reveals that you filter the top 20 rows: (t.rank

-
That's all due to the sequence of events in a
SELECT query: window functions are applied before LIMIT. Only if the sort order agrees, we don't have to consider the rest of the applicable 10000000 rows.

-
We can use
LIMIT 20 because the top 20 ranks are guaranteed to span no more than 20 rows. The PK on (player_id, stat_id) guarantees unique player_id per stat_id and since that is included in the ORDER BY, each rank is only assigned once - which also means we can use the slightly cheaper row_number()` instead.

Code Snippets

SELECT * FROM (
   SELECT player_id
        , rank() OVER (ORDER BY value DESC, player_id ASC) AS rank
   FROM   player_stat
   WHERE  stat_id = 1
   ) t
WHERE t.rank <= 20;
SELECT row_number() OVER (ORDER BY value DESC, player_id ASC) AS rank
     , player_id
FROM   player_stat
WHERE  stat_id = 1
ORDER  BY value DESC, player_id
LIMIT  20;

Context

StackExchange Database Administrators Q#106513, answer score: 6

Revisions (0)

No revisions yet.