patternsqlModerate
Aggregated column causes full table scan even though the right index is present
Viewed 0 times
indexfullthoughscanthecolumnpresentrightaggregatedeven
Problem
I have a query where I want to fetch the first few rows from the table datasets ordered by the date_added column. The column that is sorted by is indexed, so the basic version of this table is very fast:
But I have an issue once I make the query a bit more complicated. I want to join another table representing a many-to-many relationship and aggregate the results in an array column. To do that I need to add a GROUP BY id clause:
Just by adding the GROUP BY clause the query now does a full scan of the datasets table instead of using the index on the date_added column like previously.
A simplified version of the actual query I want to do is the following:
Why does the GROUP BY claus
SELECT datasets.id FROM datasets ORDER BY date_added LIMIT 25"Limit (cost=0.28..6.48 rows=25 width=12) (actual time=0.040..0.092 rows=25 loops=1)"
" -> Index Scan using datasets_date_added_idx2 on datasets (cost=0.28..1244.19 rows=5016 width=12) (actual time=0.037..0.086 rows=25 loops=1)"
"Planning time: 0.484 ms"
"Execution time: 0.139 ms"But I have an issue once I make the query a bit more complicated. I want to join another table representing a many-to-many relationship and aggregate the results in an array column. To do that I need to add a GROUP BY id clause:
SELECT datasets.id FROM datasets GROUP BY datasets.id ORDER BY date_added LIMIT 25"Limit (cost=551.41..551.47 rows=25 width=12) (actual time=9.926..9.931 rows=25 loops=1)"
" -> Sort (cost=551.41..563.95 rows=5016 width=12) (actual time=9.924..9.926 rows=25 loops=1)"
" Sort Key: date_added"
" Sort Method: top-N heapsort Memory: 26kB"
" -> HashAggregate (cost=359.70..409.86 rows=5016 width=12) (actual time=7.016..8.604 rows=5016 loops=1)"
" Group Key: datasets_id"
" -> Seq Scan on datasets (cost=0.00..347.16 rows=5016 width=12) (actual time=0.009..1.574 rows=5016 loops=1)"
"Planning time: 0.502 ms"
"Execution time: 10.235 ms"Just by adding the GROUP BY clause the query now does a full scan of the datasets table instead of using the index on the date_added column like previously.
A simplified version of the actual query I want to do is the following:
SELECT
datasets.id,
array_remove(array_agg(other_table.some_column), NULL) AS other_table
FROM datasets
LEFT JOIN other_table
ON other_table.id = datasets.id
GROUP BY datasets.id
ORDER BY date_added
LIMIT 25Why does the GROUP BY claus
Solution
The problem is that your 2nd query:
does not mean what you expect. It does give you the first 25 rows ordered by
It seems however that the optimizer does not always remove the redundant
You might get a better plan if you change the query to have matching
But in any case, my advice would be "don't use redundant / complicated syntax when there is a simpler one".
Now for the 3rd query, with the join, while the
We could also avoid
Both queries are written so that the plan produced will do the (fast) limit subquery first and then the join, avoiding a full table scan of either table.
If you need aggregate from more columns, a third method combines both of the above, using a correlated (
SELECT datasets.id
FROM datasets
GROUP BY datasets.id
ORDER BY date_added
LIMIT 25 ;does not mean what you expect. It does give you the first 25 rows ordered by
date_added only because the id is the primary key of the table, so the GROUP BY can be removed without changing the result.It seems however that the optimizer does not always remove the redundant
GROUP BY and thus it produces a different plan. I'm not sure why - the various features of the optimizer that do these simplifications are far from covering all cases. You might get a better plan if you change the query to have matching
GROUP BY and ORDER BY clauses:SELECT d.id
FROM datasets AS d
GROUP BY d.date_added, d.id
ORDER BY d.date_added, d.id
LIMIT 25 ;But in any case, my advice would be "don't use redundant / complicated syntax when there is a simpler one".
Now for the 3rd query, with the join, while the
GROUP BY method is working, you can rewrite it by using standard SQL window functions (ROW_NUMBER()) or Postgres DISTINCT ON or by joining to a derived table (which uses your very first query!, with minor details changed):SELECT
d.id,
array_remove(array_agg(o.some_column), NULL) AS other_table
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN other_table AS o
ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;We could also avoid
GROUP BY completely (well, it's hidden in the inline subquery):SELECT
d.id,
( SELECT array_remove(array_agg(o.some_column), NULL)
FROM other_table AS o
WHERE o.id = d.id
) AS other_table
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25 ;Both queries are written so that the plan produced will do the (fast) limit subquery first and then the join, avoiding a full table scan of either table.
If you need aggregate from more columns, a third method combines both of the above, using a correlated (
LATERAL) subquery in the FROM clause:SELECT
d.id,
o.other_table
-- more aggregates
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN LATERAL
( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
-- more aggregates
FROM other_table AS o
WHERE o.id = d.id
) AS o
ON TRUE
ORDER BY d.date_added
LIMIT 25 ;Code Snippets
SELECT datasets.id
FROM datasets
GROUP BY datasets.id
ORDER BY date_added
LIMIT 25 ;SELECT d.id
FROM datasets AS d
GROUP BY d.date_added, d.id
ORDER BY d.date_added, d.id
LIMIT 25 ;SELECT
d.id,
array_remove(array_agg(o.some_column), NULL) AS other_table
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN other_table AS o
ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;SELECT
d.id,
( SELECT array_remove(array_agg(o.some_column), NULL)
FROM other_table AS o
WHERE o.id = d.id
) AS other_table
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25 ;SELECT
d.id,
o.other_table
-- more aggregates
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN LATERAL
( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
-- more aggregates
FROM other_table AS o
WHERE o.id = d.id
) AS o
ON TRUE
ORDER BY d.date_added
LIMIT 25 ;Context
StackExchange Database Administrators Q#161775, answer score: 10
Revisions (0)
No revisions yet.