patternsqlModerate
Postgres Index a query with MAX and groupBy
Viewed 0 times
postgreswithquerygroupbymaxandindex
Problem
Is there any way to index the following query?
I've tried creating sorted (non-composite) indexes on
Misc info:
Schema:
Explain analyze:
SELECT run_id, MAX ( frame ) , MAX ( time ) FROM run.frames_stat GROUP BY run_id;I've tried creating sorted (non-composite) indexes on
frame and time, and an index on run_id, but the query planner doesn't use them.Misc info:
- Unfortunately (and for reasons I won't get into) I cannot change the query
- The
frames_stattable has 42 million rows
- The table is unchanging (no further inserts/deletes will ever take place)
- The query was always slow, it's just gotten slower because this dataset is larger than in the past.
- There are no indexes on the table
- We are using Postgres 9.4
- The db's "work_mem" size is 128MB (if that's relevant).
- Hardware: 130GB Ram, 10 core Xeon
Schema:
CREATE TABLE run.frame_stat (
id bigint NOT NULL,
run_id bigint NOT NULL,
frame bigint NOT NULL,
heap_size bigint NOT NULL,
"time" timestamp without time zone NOT NULL,
CONSTRAINT frame_stat_pkey PRIMARY KEY (id)
)Explain analyze:
HashAggregate (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
Group Key: run_id
-> Seq Scan on zulu (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)Solution
Too bad
If you cannot change the query at all, that's too bad. You won't get a good solution.
If you had not table-qualified the table (
Here's a recipe for such a technique:
@Joishi's idea with a
Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
This is even more efficient when based on a separate table with one row per relevant
Implemented with correlated subqueries:
Create two multicolumn indexes with matching sort order for lightening performance:
With only 280 distinct
MATERIALIZED VIEW
Or, based on these key pieces of information:
The "frames_stat" table has 42 million rows
rows=280 -- number of returned rows = disctinct run_id
The table is unchanging (no inserts/deletes)
Use a
You still need to change the query to base it on the MV instead of the table.
Aside: never use reserved words like
If you cannot change the query at all, that's too bad. You won't get a good solution.
If you had not table-qualified the table (
run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally just in sessions where this is desirable) - for hugely superior performance.Here's a recipe for such a technique:
- How can I fake inet_client_addr() for unit tests in PostgreSQL?
@Joishi's idea with a
RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
- Optimize GROUP BY query to retrieve latest record per user
This is even more efficient when based on a separate table with one row per relevant
run_id - let's call it run_tbl. Create it if you don't have it, yet!Implemented with correlated subqueries:
SELECT run_id
, (SELECT frame
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY frame DESC NULLS LAST
LIMIT 1) AS max_frame
, (SELECT "time"
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY "time" DESC NULLS LAST
LIMIT 1) AS max_time
FROM run_tbl r;Create two multicolumn indexes with matching sort order for lightening performance:
CREATE index fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE index fun_frame_idx ON run.frames_stat (run_id, "time" DESC NULLS LAST);NULLS LAST is only necessary if there can be null values. But it won't hurt either way.- Unused index in range of dates query
With only 280 distinct
run_id, this will be very fast.MATERIALIZED VIEW
Or, based on these key pieces of information:
The "frames_stat" table has 42 million rows
rows=280 -- number of returned rows = disctinct run_id
The table is unchanging (no inserts/deletes)
Use a
MATERIALIZED VIEW, it will be tiny (only 280 rows) and super fast.You still need to change the query to base it on the MV instead of the table.
Aside: never use reserved words like
time (in standard SQL) as identifier.Code Snippets
SELECT run_id
, (SELECT frame
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY frame DESC NULLS LAST
LIMIT 1) AS max_frame
, (SELECT "time"
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY "time" DESC NULLS LAST
LIMIT 1) AS max_time
FROM run_tbl r;CREATE index fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE index fun_frame_idx ON run.frames_stat (run_id, "time" DESC NULLS LAST);Context
StackExchange Database Administrators Q#93442, answer score: 14
Revisions (0)
No revisions yet.