HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL window functions very slow on big table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlfunctionsslowbigverywindowtable

Problem

I have a PostgreSQL table constructed as

device_id | point     | dt_edit
----------+-----------+----------
100       | geometry  | timestamp
101       | geometry  | timestamp
100       | geometry  | timestamp
102       | geometry  | timestamp
102       | geometry  | timestamp
101       | geometry  | timestamp


I need to select the last 2 records ordered by dt_edit from unique device_id. This query works very slow, on 1 billion records - 500 seconds:

SELECT dt.device_id,
    dt.point,
    dt.dt_edit
   FROM ( SELECT gps_data.device_id,
            gps_data.point,
            gps_data.dt_edit,
            rank() OVER (PARTITION BY gps_data.device_id
                         ORDER BY gps_data.dt_edit DESC) AS rank
           FROM gps_data) dt
  WHERE dt.rank <= 2


Any ideas?

Solution

If we can assume a device table holding all devices of interest. Example:

CREATE TABLE device (device_id int, device text);
INSERT INTO device (device_id, device) VALUES
  (100, 'a')
, (101, 'b')
, (102, 'c');


The query can be very simple:

SELECT d.device_id, g.point, g.dt_edit
FROM   device d
,      LATERAL (
   SELECT point, dt_edit
   FROM   gps_data
   WHERE  device_id = d.device_id
   ORDER  BY dt_edit DESC NULLS LAST -- more items?
   LIMIT  2
   ) g;


Assuming a much smaller cardinality in table device than in table gps_data, i.e. many rows per device in the latter.

If dt_edit is not UNIQUE, add more items to ORDER BY to disambiguate the sort order.

All you need for this to be fast is a multicolumn index;

CREATE INDEX gps_data_foo_idx
ON gps_data (device_id, dt_edit DESC NULLS LAST); -- more items?


The NULLS LAST modifier is useful if the column can be NULL, but does not hurt in any case. Either way, the query must match the index.

  • PostgreSQL sort by datetime asc, null first?



If you don't have a device table, I would suggest to create one (and keep it current).

But there are ways to make this fast even without device table ...

Details:

  • Optimize GROUP BY query to retrieve latest record per user



  • How do I efficiently get "the most recent corresponding row"?



  • SELECT DISTINCT on multiple columns

Code Snippets

CREATE TABLE device (device_id int, device text);
INSERT INTO device (device_id, device) VALUES
  (100, 'a')
, (101, 'b')
, (102, 'c');
SELECT d.device_id, g.point, g.dt_edit
FROM   device d
,      LATERAL (
   SELECT point, dt_edit
   FROM   gps_data
   WHERE  device_id = d.device_id
   ORDER  BY dt_edit DESC NULLS LAST -- more items?
   LIMIT  2
   ) g;
CREATE INDEX gps_data_foo_idx
ON gps_data (device_id, dt_edit DESC NULLS LAST); -- more items?

Context

StackExchange Database Administrators Q#106987, answer score: 4

Revisions (0)

No revisions yet.