patternsqlModerate
Efficient query to get greatest value per group from big table
Viewed 0 times
pergreatestgroupqueryefficientvaluebiggetfromtable
Problem
Given the table:
The table has 20 million records that is not, relatively speaking, a large number. But it makes sequential scans slow.
How can I get the last record (
I've tried both the following queries, with several variants that I've read through many answers of this topic:
I have also tried creating btree indexes for
The query must run periodically returning always the last.
Using Postgres 9.3.
Explain/analyze (with 1.7 million records):
```
set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
" -> Seq Scan on geoposition_records (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"
set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
" -> Index Scan using geoposition_records_equipment_id_created
Column | Type
id | integer
latitude | numeric(9,6)
longitude | numeric(9,6)
speed | integer
equipment_id | integer
created_at | timestamp without time zone
Indexes:
"geoposition_records_pkey" PRIMARY KEY, btree (id)The table has 20 million records that is not, relatively speaking, a large number. But it makes sequential scans slow.
How can I get the last record (
max(created_at)) of each equipment_id? I've tried both the following queries, with several variants that I've read through many answers of this topic:
select max(created_at),equipment_id from geoposition_records group by equipment_id;
select distinct on (equipment_id) equipment_id,created_at
from geoposition_records order by equipment_id, created_at desc;I have also tried creating btree indexes for
equipment_id,created_at but Postgres finds that using a seqscan is faster. Forcing enable_seqscan = off is of no use either since reading the index is as slow as the seq scan, probably worse.The query must run periodically returning always the last.
Using Postgres 9.3.
Explain/analyze (with 1.7 million records):
```
set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
" -> Seq Scan on geoposition_records (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"
set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
" -> Index Scan using geoposition_records_equipment_id_created
Solution
Index
A plain multicolumn B-tree index should work after all:
Why
It's safe to assume you have an
Correlated subquery
Based on this
For a small number of rows in the
Detailed explanation:
Performance similar to the correlated subquery.
Function
If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one
Makes for a nice call, too:
Performance comparison:
db<>fiddle here
Old sqlfiddle
A plain multicolumn B-tree index should work after all:
CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);Why
DESC NULLS LAST?- Unused index in range of dates query
It's safe to assume you have an
equipment table? Then performance won't be a problem:Correlated subquery
Based on this
equipment table, run a lowly correlated subquery to great effect:SELECT equipment_id
, (SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1) AS latest
FROM equipment eq;For a small number of rows in the
equipment table (57 judging from your EXPLAIN ANALYZE output), that's very fast.LATERAL join in Postgres 9.3+SELECT eq.equipment_id, r.latest
FROM equipment eq
LEFT JOIN LATERAL (
SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) r(latest) ON true;Detailed explanation:
- Optimize GROUP BY query to retrieve latest record per user
Performance similar to the correlated subquery.
Function
If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one
equipment_id at a time uses the index.CREATE OR REPLACE FUNCTION f_latest_equip()
RETURNS TABLE (equipment_id int, latest timestamp)
LANGUAGE plpgsql STABLE AS
$func$
BEGIN
FOR equipment_id IN
SELECT e.equipment_id FROM equipment e ORDER BY 1
LOOP
SELECT g.created_at
FROM geoposition_records g
WHERE g.equipment_id = f_latest_equip.equipment_id
-- prepend function name to disambiguate
ORDER BY g.created_at DESC NULLS LAST
LIMIT 1
INTO latest;
RETURN NEXT;
END LOOP;
END
$func$;Makes for a nice call, too:
SELECT * FROM f_latest_equip();Performance comparison:
db<>fiddle here
Old sqlfiddle
Code Snippets
CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);SELECT equipment_id
, (SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1) AS latest
FROM equipment eq;SELECT eq.equipment_id, r.latest
FROM equipment eq
LEFT JOIN LATERAL (
SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) r(latest) ON true;CREATE OR REPLACE FUNCTION f_latest_equip()
RETURNS TABLE (equipment_id int, latest timestamp)
LANGUAGE plpgsql STABLE AS
$func$
BEGIN
FOR equipment_id IN
SELECT e.equipment_id FROM equipment e ORDER BY 1
LOOP
SELECT g.created_at
FROM geoposition_records g
WHERE g.equipment_id = f_latest_equip.equipment_id
-- prepend function name to disambiguate
ORDER BY g.created_at DESC NULLS LAST
LIMIT 1
INTO latest;
RETURN NEXT;
END LOOP;
END
$func$;SELECT * FROM f_latest_equip();Context
StackExchange Database Administrators Q#51895, answer score: 11
Revisions (0)
No revisions yet.