HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Efficient query to get greatest value per group from big table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
pergreatestgroupqueryefficientvaluebiggetfromtable

Problem

Given the table:

Column    |            Type             
 id           | integer                     
 latitude     | numeric(9,6)                
 longitude    | numeric(9,6)                
 speed        | integer                     
 equipment_id | integer                     
 created_at   | timestamp without time zone
Indexes:
    "geoposition_records_pkey" PRIMARY KEY, btree (id)


The table has 20 million records that is not, relatively speaking, a large number. But it makes sequential scans slow.

How can I get the last record (max(created_at)) of each equipment_id?

I've tried both the following queries, with several variants that I've read through many answers of this topic:

select max(created_at),equipment_id from geoposition_records group by equipment_id;

select distinct on (equipment_id) equipment_id,created_at 
  from geoposition_records order by equipment_id, created_at desc;


I have also tried creating btree indexes for equipment_id,created_at but Postgres finds that using a seqscan is faster. Forcing enable_seqscan = off is of no use either since reading the index is as slow as the seq scan, probably worse.

The query must run periodically returning always the last.

Using Postgres 9.3.

Explain/analyze (with 1.7 million records):

```
set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
" -> Seq Scan on geoposition_records (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"

set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
" -> Index Scan using geoposition_records_equipment_id_created

Solution

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);


Why DESC NULLS LAST?

  • Unused index in range of dates query



It's safe to assume you have an equipment table? Then performance won't be a problem:
Correlated subquery

Based on this equipment table, run a lowly correlated subquery to great effect:

SELECT equipment_id
     , (SELECT created_at
        FROM   geoposition_records
        WHERE  equipment_id = eq.equipment_id
        ORDER  BY created_at DESC NULLS LAST
        LIMIT  1) AS latest
FROM   equipment eq;


For a small number of rows in the equipment table (57 judging from your EXPLAIN ANALYZE output), that's very fast.
LATERAL join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;


Detailed explanation:

  • Optimize GROUP BY query to retrieve latest record per user



Performance similar to the correlated subquery.
Function

If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;


Makes for a nice call, too:

SELECT * FROM f_latest_equip();


Performance comparison:

db<>fiddle here

Old sqlfiddle

Code Snippets

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);
SELECT equipment_id
     , (SELECT created_at
        FROM   geoposition_records
        WHERE  equipment_id = eq.equipment_id
        ORDER  BY created_at DESC NULLS LAST
        LIMIT  1) AS latest
FROM   equipment eq;
SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;
CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;
SELECT * FROM f_latest_equip();

Context

StackExchange Database Administrators Q#51895, answer score: 11

Revisions (0)

No revisions yet.