gotchasqlMinor
Why does Postgres run SORT on many rows, before GROUP BY?
Viewed 0 times
rowswhybeforepostgresgroupdoesmanysortrun
Problem
I'm using Postgres 9.4, with a recently analyzed database. These are my tables:
This is my query:
The results of the analyse shows a very slow sort on 5 million rows, BEFORE running the GroupAggregate.
The result of the GroupAggregate is just 5,000 rows. So wouldn't it make more sense to aggregate first, then sort?
Explain here: http://explain.depesz.com/s/IS1
Any other suggestions for speeding up the query would also be very welcome.
Materialized view "public.vw_chemical_summary_by_ccg"
Column | Type | Modifiers
-----------------+----------------------+-----------
processing_date | date |
pct_id | character varying(3) |
chemical_id | character varying(9) |
items | bigint |
cost | double precision |
Indexes:
"vw_idx_chem_by_ccg_chem_id" btree (chemical_id)
"vw_idx_chem_by_ccg_chem_id_vc" btree (chemical_id varchar_pattern_ops)
"vw_idx_chem_by_ccg_joint_id" btree (pct_id, chemical_id)
Table "public.frontend_pct"
Column | Type | Modifiers
-------------------+-------------------------+-----------
code | character varying(3) | not null
name | character varying(200) |
org_type | character varying(9) | not null
Indexes:
"frontend_pct_pkey" PRIMARY KEY, btree (code)
"frontend_pct_code_1df55e2c36c298b2_like" btree (code varchar_pattern_ops)This is my query:
EXPAIN (ANALYSE, BUFFERS)
SELECT pr.pct_id AS row_id, pc.name AS row_name,
pr.processing_date AS date, SUM(pr.cost) AS actual_cost,
SUM(pr.items) AS items
FROM vw_chemical_summary_by_ccg pr
JOIN frontend_pct pc
ON pr.pct_id=pc.code AND pc.org_type='CCG'
GROUP BY row_id, row_name, date
ORDER BY date, row_id;The results of the analyse shows a very slow sort on 5 million rows, BEFORE running the GroupAggregate.
The result of the GroupAggregate is just 5,000 rows. So wouldn't it make more sense to aggregate first, then sort?
Explain here: http://explain.depesz.com/s/IS1
Any other suggestions for speeding up the query would also be very welcome.
Solution
I ran into the same situation, a large number of rows of raw data being sorted on disk in order to be fed into a GroupAggregate which drastically reduced the number of rows.
I tried removing the
It also scaled similarly with more data. I'm only using about a million rows, so you may still need a higher
I added
I tested this on 8.4 and 9.3 and got the same results with both.
Update:
If one of the aggregates being calculated is a
I tried removing the
ORDER BY clause; that allowed the optimizer to choose HashAggregate, which was significantly faster. To give a picture of the speed increase, a particular subset of my data took about- 13.9 seconds using the original Sort and GroupAggregate
- 9.4 seconds for a simple
EXPLAIN ANALYZE SELECTwithout aGROUP BYclause
- 9.6 seconds using the unsorted HashAggregate
It also scaled similarly with more data. I'm only using about a million rows, so you may still need a higher
work_mem for your larger dataset (I am using 10MB). Although http://www.depesz.com/2013/05/09/explaining-the-unexplainable-part-3/#hash-aggregate suggests that HashAggregate will go to disk if it doesn't fit in memory, I found that reducing work_mem just prompted the planner to revert to the Sort/GroupAggregate plan.I added
SELECT * FROM () AS nested ORDER BY 1,2,3 to the outside and this sorts the aggregated output to give the same results as the original query, while still using the fast HashAggregate. I think there is perhaps an optimization opportunity here that the planner is not taking.I tested this on 8.4 and 9.3 and got the same results with both.
Update:
If one of the aggregates being calculated is a
COUNT(DISTINCT ), then apparently HashAggregate can not be used.Context
StackExchange Database Administrators Q#109073, answer score: 4
Revisions (0)
No revisions yet.