HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Aggregated column causes full table scan even though the right index is present

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
indexfullthoughscanthecolumnpresentrightaggregatedeven

Problem

I have a query where I want to fetch the first few rows from the table datasets ordered by the date_added column. The column that is sorted by is indexed, so the basic version of this table is very fast:

SELECT datasets.id FROM datasets ORDER BY date_added LIMIT 25


"Limit  (cost=0.28..6.48 rows=25 width=12) (actual time=0.040..0.092 rows=25 loops=1)"
"  ->  Index Scan using datasets_date_added_idx2 on datasets  (cost=0.28..1244.19 rows=5016 width=12) (actual time=0.037..0.086 rows=25 loops=1)"
"Planning time: 0.484 ms"
"Execution time: 0.139 ms"


But I have an issue once I make the query a bit more complicated. I want to join another table representing a many-to-many relationship and aggregate the results in an array column. To do that I need to add a GROUP BY id clause:

SELECT datasets.id FROM datasets GROUP BY datasets.id ORDER BY date_added LIMIT 25


"Limit  (cost=551.41..551.47 rows=25 width=12) (actual time=9.926..9.931 rows=25 loops=1)"
"  ->  Sort  (cost=551.41..563.95 rows=5016 width=12) (actual time=9.924..9.926 rows=25 loops=1)"
"        Sort Key: date_added"
"        Sort Method: top-N heapsort  Memory: 26kB"
"        ->  HashAggregate  (cost=359.70..409.86 rows=5016 width=12) (actual time=7.016..8.604 rows=5016 loops=1)"
"              Group Key: datasets_id"
"              ->  Seq Scan on datasets  (cost=0.00..347.16 rows=5016 width=12) (actual time=0.009..1.574 rows=5016 loops=1)"
"Planning time: 0.502 ms"
"Execution time: 10.235 ms"


Just by adding the GROUP BY clause the query now does a full scan of the datasets table instead of using the index on the date_added column like previously.

A simplified version of the actual query I want to do is the following:

SELECT 
    datasets.id,
    array_remove(array_agg(other_table.some_column), NULL) AS other_table
FROM datasets 
LEFT JOIN other_table 
    ON other_table.id = datasets.id
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25


Why does the GROUP BY claus

Solution

The problem is that your 2nd query:

SELECT datasets.id 
FROM datasets 
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25 ;


does not mean what you expect. It does give you the first 25 rows ordered by date_added only because the id is the primary key of the table, so the GROUP BY can be removed without changing the result.

It seems however that the optimizer does not always remove the redundant GROUP BY and thus it produces a different plan. I'm not sure why - the various features of the optimizer that do these simplifications are far from covering all cases.

You might get a better plan if you change the query to have matching GROUP BY and ORDER BY clauses:

SELECT d.id 
FROM datasets AS d 
GROUP BY d.date_added, d.id 
ORDER BY d.date_added, d.id 
LIMIT 25 ;


But in any case, my advice would be "don't use redundant / complicated syntax when there is a simpler one".

Now for the 3rd query, with the join, while the GROUP BY method is working, you can rewrite it by using standard SQL window functions (ROW_NUMBER()) or Postgres DISTINCT ON or by joining to a derived table (which uses your very first query!, with minor details changed):

SELECT  
    d.id,
    array_remove(array_agg(o.some_column), NULL) AS other_table
FROM 
  ( SELECT d.id, d.date_added
    FROM datasets AS d 
    ORDER BY d.date_added 
    LIMIT 25 
  ) AS d
LEFT JOIN other_table AS o
    ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;


We could also avoid GROUP BY completely (well, it's hidden in the inline subquery):

SELECT  
    d.id,
    ( SELECT array_remove(array_agg(o.some_column), NULL)
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS other_table
FROM  datasets AS d 
ORDER BY d.date_added 
LIMIT 25 ;


Both queries are written so that the plan produced will do the (fast) limit subquery first and then the join, avoiding a full table scan of either table.

If you need aggregate from more columns, a third method combines both of the above, using a correlated (LATERAL) subquery in the FROM clause:

SELECT  
    d.id,
    o.other_table
    -- more aggregates
FROM 
    ( SELECT d.id, d.date_added
      FROM datasets AS d 
      ORDER BY d.date_added 
      LIMIT 25 
    ) AS d
  LEFT JOIN LATERAL
    ( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
             -- more aggregates
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS o
    ON TRUE
ORDER BY d.date_added
LIMIT 25 ;

Code Snippets

SELECT datasets.id 
FROM datasets 
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25 ;
SELECT d.id 
FROM datasets AS d 
GROUP BY d.date_added, d.id 
ORDER BY d.date_added, d.id 
LIMIT 25 ;
SELECT  
    d.id,
    array_remove(array_agg(o.some_column), NULL) AS other_table
FROM 
  ( SELECT d.id, d.date_added
    FROM datasets AS d 
    ORDER BY d.date_added 
    LIMIT 25 
  ) AS d
LEFT JOIN other_table AS o
    ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;
SELECT  
    d.id,
    ( SELECT array_remove(array_agg(o.some_column), NULL)
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS other_table
FROM  datasets AS d 
ORDER BY d.date_added 
LIMIT 25 ;
SELECT  
    d.id,
    o.other_table
    -- more aggregates
FROM 
    ( SELECT d.id, d.date_added
      FROM datasets AS d 
      ORDER BY d.date_added 
      LIMIT 25 
    ) AS d
  LEFT JOIN LATERAL
    ( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
             -- more aggregates
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS o
    ON TRUE
ORDER BY d.date_added
LIMIT 25 ;

Context

StackExchange Database Administrators Q#161775, answer score: 10

Revisions (0)

No revisions yet.