Optimizing large database query (25+ million rows, using max() and GROUP BY)
Problem
I'm using Postgres 9.3.5 and I have a large table in the database. It currently has over 25 million rows and tends to grow rapidly. I'm trying to select specific rows (all `unit_id`s with only the latest `unit_timestamp` for each of them) with a simple query like:

```
SELECT unit_id, max(unit_timestamp) AS latest_timestamp FROM all_units GROUP BY unit_id;
```

Without any index, this query takes about 35 seconds to execute. With an index defined (`CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);`) the query time is shortened to around (only) 19 seconds.

I wonder if it will ever be possible to execute my query in even less time (like just a few seconds) and if so, what steps should I take in order to optimize it even further?

My table structure dump looks like this:

```
CREATE TABLE "all_units" (
"unit_id" int4 NOT NULL,
"unit_timestamp" timestamp(6) NOT NULL,
"lon" float4,
"lat" float4,
"speed" float4,
"status" varchar(255) COLLATE "default"
);
ALTER TABLE "all_units" ADD PRIMARY KEY ("unit_id", "unit_timestamp");
```

The `EXPLAIN (ANALYZE, BUFFERS)` follows:

```
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=663998.38..664069.73 rows=7135 width=12) (actual time=84715.050..84732.021 rows=11094 loops=1)
  Buffers: shared hit=192 read=286819
  -> Seq Scan on ais_sorted (cost=0.00..538335.92 rows=25132492 width=12) (actual time=0.608..41264.196 rows=25132492 loops=1)
       Buffers: shared hit=192 read=286819
Total runtime: 84746.501 ms
```

and my psql settings on server look like this:

```
name | context | min_val | max_val | boot_val
-------------------------------------+------------+-----------+--------------+-----------------------------------------
allow_system_table_mods | postmaster | | | off
application_name | user
```
Solution
Query
Your query is forced to scan the whole table (or the whole index). Every row could be another distinct unit. The only way to substantially shorten the process would be a separate table with all available units, which would help as long as there are substantially fewer units than entries in `all_units`.

Since you have ~11k units (added in comment) for 25M entries, this should definitely help.
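The queries further down reference such a table under the name `unit`, but its definition is not shown. A minimal sketch of how it might be created and filled from existing data follows. The single-column layout, the primary key and the one-off `SELECT DISTINCT` are assumptions, and the temp tables with synthetic rows only stand in for the real data:

```sql
-- Throwaway stand-in for the real table (assumption): 5 units, 100 rows each.
CREATE TEMP TABLE all_units (
   unit_timestamp timestamp NOT NULL,
   unit_id        int4      NOT NULL,
   PRIMARY KEY (unit_id, unit_timestamp)
);

INSERT INTO all_units (unit_id, unit_timestamp)
SELECT u.unit_id, timestamp '2014-01-01' + g.n * interval '1 minute'
FROM   generate_series(1, 5)   AS u(unit_id)
     , generate_series(1, 100) AS g(n);

-- Hypothetical lookup table holding one row per known unit.
CREATE TEMP TABLE unit (
   unit_id int4 PRIMARY KEY
);

-- One-off fill from existing data. This still scans the big table once.
INSERT INTO unit (unit_id)
SELECT DISTINCT unit_id
FROM   all_units;
```

After the initial fill, the table has to be kept current whenever a new unit appears, for example by the application that registers units.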
Depending on frequencies of values, there are a couple of query techniques to get your result considerably faster:
- recursive CTE
- JOIN LATERAL
- correlated subquery

Details in this related answer on SO:
- Optimize GROUP BY query to retrieve latest record per user
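The recursive CTE variant is only named in the list, so here is a rough sketch of what it could look like for this table. It emulates a loose index scan on the primary key and does not need a separate table of units. The demo table and its data are assumptions for illustration:

```sql
-- Throwaway stand-in for the real table (assumption): 5 units, 100 rows each.
CREATE TEMP TABLE all_units (
   unit_timestamp timestamp NOT NULL,
   unit_id        int4      NOT NULL,
   PRIMARY KEY (unit_id, unit_timestamp)
);

INSERT INTO all_units (unit_id, unit_timestamp)
SELECT u.unit_id, timestamp '2014-01-01' + g.n * interval '1 minute'
FROM   generate_series(1, 5)   AS u(unit_id)
     , generate_series(1, 100) AS g(n);

-- Walk the PK index from the highest unit_id downwards:
-- fetch the latest row of one unit, then jump to the next lower unit_id.
WITH RECURSIVE cte AS (
   (  -- parentheses are required for ORDER BY / LIMIT in a UNION branch
   SELECT unit_id, unit_timestamp
   FROM   all_units
   ORDER  BY unit_id DESC, unit_timestamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT l.unit_id, l.unit_timestamp
   FROM   cte c
        , LATERAL (
      SELECT unit_id, unit_timestamp
      FROM   all_units
      WHERE  unit_id < c.unit_id
      ORDER  BY unit_id DESC, unit_timestamp DESC
      LIMIT  1
      ) l
   )
SELECT unit_id, unit_timestamp AS latest_timestamp
FROM   cte
ORDER  BY unit_id;
```

The recursion stops by itself once the `LATERAL` subquery finds no lower `unit_id`.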
Only needing the implicit index of the primary key on `(unit_id, unit_timestamp)`, this query should do the trick, using an implicit `JOIN LATERAL`:

```
SELECT u.unit_id, a.max_ts
FROM   unit u
     , LATERAL (
        SELECT unit_timestamp AS max_ts
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1
       ) a;
```

Excludes units without entry in `all_units`, like your original query.

Or a simple correlated subquery (probably even faster):

```
SELECT u.unit_id
     , (SELECT unit_timestamp
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1) AS max_ts
FROM   unit u;
```

Includes units without entry in `all_units`.

Efficiency depends on the number of entries per unit. The more entries, the more potential for one of these queries.
In a quick local test with similar tables (500 "units", 1M rows in big table), the query with correlated subqueries was ~ 500x faster than your original. Index-only scans on the PK index of the big table vs. sequential scan in your original query.
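A comparison along these lines could be reproduced roughly as follows. The row counts mirror the figures quoted above (500 units, 1M rows). The temp tables, the synthetic one-row-per-minute data and the use of `generate_series` are assumptions, not the setup of the original test:

```sql
-- Synthetic stand-ins (assumption): 500 units x 2000 rows = 1M rows.
CREATE TEMP TABLE unit (
   unit_id int4 PRIMARY KEY
);

CREATE TEMP TABLE all_units (
   unit_timestamp timestamp NOT NULL,
   unit_id        int4      NOT NULL,
   PRIMARY KEY (unit_id, unit_timestamp)
);

INSERT INTO unit (unit_id)
SELECT generate_series(1, 500);

INSERT INTO all_units (unit_id, unit_timestamp)
SELECT u.unit_id, timestamp '2014-01-01' + g.n * interval '1 minute'
FROM   unit u
     , generate_series(1, 2000) AS g(n);

-- Refresh planner statistics before comparing plans.
ANALYZE unit;
ANALYZE all_units;

-- Original query: aggregates over every row.
EXPLAIN (ANALYZE, BUFFERS)
SELECT unit_id, max(unit_timestamp) AS latest_timestamp
FROM   all_units
GROUP  BY unit_id;

-- Correlated subquery: one index probe per unit.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.unit_id
     , (SELECT unit_timestamp
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1) AS max_ts
FROM   unit u;
```

Timings will differ per machine. Index-only scans also depend on an up-to-date visibility map, so if the second plan reports many heap fetches, run `VACUUM` on the table and try again.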
Since your table "tends to get even larger rapidly", a materialized view is probably not an option.

There is also `DISTINCT ON` as another possible query technique, but it's hardly going to be faster than your original query, so it's not the answer you are looking for. Details here:
- How do I efficiently get "the most recent corresponding row"?
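For completeness, a sketch of the `DISTINCT ON` form of the same request. The demo table and data are assumptions, and the query is shown only to illustrate the syntax, since it still has to read every row:

```sql
-- Throwaway stand-in for the real table (assumption): 5 units, 100 rows each.
CREATE TEMP TABLE all_units (
   unit_timestamp timestamp NOT NULL,
   unit_id        int4      NOT NULL,
   PRIMARY KEY (unit_id, unit_timestamp)
);

INSERT INTO all_units (unit_id, unit_timestamp)
SELECT u.unit_id, timestamp '2014-01-01' + g.n * interval '1 minute'
FROM   generate_series(1, 5)   AS u(unit_id)
     , generate_series(1, 100) AS g(n);

-- Keep the first row per unit_id in the given sort order,
-- which is the row with the latest unit_timestamp.
SELECT DISTINCT ON (unit_id)
       unit_id, unit_timestamp AS latest_timestamp
FROM   all_units
ORDER  BY unit_id, unit_timestamp DESC;
```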
Index
Your `partial_idx`:

```
CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);
```

is not in fact a partial index, and it is also redundant. Postgres can scan indexes backwards at practically the same speed, so the PK serves well. Drop this additional index.
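A short sketch of checking and dropping the redundant index. The demo table is an assumption standing in for the real one, and the `WHERE` clause in the comment is purely hypothetical:

```sql
-- Throwaway stand-in for the real table (assumption).
CREATE TEMP TABLE all_units (
   unit_timestamp timestamp NOT NULL,
   unit_id        int4      NOT NULL,
   PRIMARY KEY (unit_id, unit_timestamp)
);

-- The index from the question. It covers all rows, so it is not partial.
-- A partial index would carry a predicate, e.g. (hypothetical):
--   CREATE INDEX ... ON all_units (unit_id, unit_timestamp) WHERE speed > 0;
CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);

-- List the indexes on the table: the PK index and the redundant one.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'all_units';

-- Remove the redundant index; the PK index on the same columns remains.
DROP INDEX partial_idx;
```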
Table layout
A couple of points for your table definition.

```
CREATE TABLE all_units (
unit_timestamp timestamp,
unit_id int4,
lon float4,
lat float4,
speed float4,
status varchar(255), -- might be improved.
PRIMARY KEY (unit_id, unit_timestamp)
);
```

- `timestamp(6)` doesn't make much sense; it's effectively the same as plain `timestamp`, which already stores a maximum of 6 fractional digits.
- I switched positions of the first two columns to save 4 bytes of padding, which amounts to ~100 MB for 25M rows (exact result depends on `status`). Smaller tables are typically faster for everything.
- If `status` isn't free text, but some kind of standardized note, you could replace it with something a lot cheaper. More about varchar(255) in Postgres.

Server configuration

You need to configure your server. Most of your settings seem to be conservative defaults. 1 MB on `shared_buffers` or `work_mem` seems way too low for an installation with millions of rows. And `random_page_cost = 4` is too high for any modern system with plenty of RAM. Start with the manual and the Postgres Wiki:
- http://www.postgresql.org/docs/current/interactive/runtime-config.html
- http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
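As a starting point for experiments, the current values can be inspected and some of them overridden for a single session before anything is changed in `postgresql.conf`. The values below are placeholders chosen for illustration, not recommendations; suitable numbers depend on available RAM and workload:

```sql
-- Show the current values and where they come from.
SELECT name, setting, unit, source
FROM   pg_settings
WHERE  name IN ('shared_buffers', 'work_mem', 'random_page_cost');

-- Session-level overrides for testing (placeholder values, assumptions).
SET work_mem = '64MB';
SET random_page_cost = 1.1;

-- shared_buffers cannot be changed per session: it has to be set in
-- postgresql.conf and needs a server restart to take effect.
```

Re-running the query with `EXPLAIN (ANALYZE, BUFFERS)` after each change shows whether the plan or the timing actually improves.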
Code Snippets
```
SELECT u.unit_id, a.max_ts
FROM   unit u
     , LATERAL (
        SELECT unit_timestamp AS max_ts
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1
       ) a;
```

```
SELECT u.unit_id
     , (SELECT unit_timestamp
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1) AS max_ts
FROM   unit u;
```

```
CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);
```

```
CREATE TABLE all_units (
unit_timestamp timestamp,
unit_id int4,
lon float4,
lat float4,
speed float4,
status varchar(255), -- might be improved.
PRIMARY KEY (unit_id, unit_timestamp)
);
```

Context
StackExchange Database Administrators Q#75963, answer score: 14