HiveBrain v1.2.0

Does Postgres optimize this JOIN with subquery?

Submitted by: @import:stackexchange-dba

Problem

In Postgres 12, I have a table purchase_orders and another table for its items. I'm running a query that returns the POs for a given shop along with the sum of items ordered on each PO:

```
SELECT po.id,
       SUM(grouped_items.total_quantity) AS total_quantity
FROM purchase_orders po
LEFT JOIN (
  SELECT purchase_order_id,
         SUM(quantity) AS total_quantity
  FROM items
  GROUP BY purchase_order_id
) grouped_items ON po.id = grouped_items.purchase_order_id
WHERE po.shop_id = 195
GROUP BY po.id
```


This query returns the desired result. The aggregation is in a subquery because there will be other JOINs to other tables, so this produces an already-grouped derived table to join to.

I wrote another query with a correlated SELECT subquery instead of a JOIN. The execution times of the two methods were practically identical, so it was hard to tell which one was faster. I ran EXPLAIN ANALYZE but can't interpret the output very well.
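For reference, the correlated version I tested looked roughly like this (a sketch reconstructed from the description above; the correlated subselect replaces both the JOIN and the outer GROUP BY):

```
SELECT po.id,
       (SELECT SUM(quantity)
        FROM items
        WHERE items.purchase_order_id = po.id) AS total_quantity
FROM purchase_orders po
WHERE po.shop_id = 195
```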

Question:
In the example above, will Postgres process the entire items table in the subquery, and only afterwards join it with purchase_orders? Or is it smart enough to filter the set of items down first?

The EXPLAIN output mentions "Seq Scan on items...", which seems to cover all rows in items, and that row count then gets reduced as it moves up the tree. But I'm not sure whether that means it actually summed the entire table in memory.

EXPLAIN:

```
GroupAggregate  (cost=6948.16..6973.00 rows=1242 width=40) (actual time=165.099..166.321 rows=1242 loops=1)
  Group Key: po.id
  Buffers: shared hit=4148
  ->  Sort  (cost=6948.16..6951.27 rows=1242 width=16) (actual time=165.090..165.406 rows=1242 loops=1)
        Sort Key: po.id
        Sort Method: quicksort  Memory: 107kB
        Buffers: shared hit=4148
        ->  Hash Right Join  (cost=6668.31..6884.34 rows=1242 width=16) (actual time=99.951..120.627 rows=1242 loops=1)
              Hash Cond: (items.purchase_order_id = po.id)
              Buffers: shared hit=4148
              ->  HashAggregate  (cost=5906.04..5993.80 r
```

Solution

I've been looking into this issue a bit myself lately, and my conclusion is that the planner is not smart enough to optimize this particular case. The correlated subselect will be executed once for each outer row, even if that is a huge number of rows, while the uncorrelated one will be executed to completion even if only a few of its rows are needed.
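Since the planner won't push the outer shop_id filter into the derived table for you, one option when items is large is to restrict the subquery yourself. A possible rewrite (a sketch, reusing the schema from the question; p2 is just an alias I've introduced):

```
SELECT po.id,
       grouped_items.total_quantity
FROM purchase_orders po
LEFT JOIN (
  SELECT i.purchase_order_id,
         SUM(i.quantity) AS total_quantity
  FROM items i
  JOIN purchase_orders p2 ON p2.id = i.purchase_order_id
  WHERE p2.shop_id = 195
  GROUP BY i.purchase_order_id
) grouped_items ON po.id = grouped_items.purchase_order_id
WHERE po.shop_id = 195
```

With the filter inside, the HashAggregate only has to process the items belonging to that shop's POs, and the outer GROUP BY becomes unnecessary because the derived table is already one row per PO.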

It does know that one plan will be faster than the other (assuming the estimated row counts are reasonably accurate), but it lacks the capacity to recognize that the two formulations are equivalent, so it cannot choose between their execution plans based on estimated performance.

Although in your case the two queries would not be identical, because they handle missing rows in "items" differently: the correlated subselect is equivalent to the LEFT JOIN, not to an INNER JOIN.
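The difference shows up for POs that have no rows in items: both the LEFT JOIN and the correlated subselect return those POs with a NULL total, while an INNER JOIN would drop them entirely. If you want 0 instead of NULL, you can wrap the total (a sketch against the same schema):

```
SELECT po.id,
       COALESCE(grouped_items.total_quantity, 0) AS total_quantity
FROM purchase_orders po
LEFT JOIN (
  SELECT purchase_order_id,
         SUM(quantity) AS total_quantity
  FROM items
  GROUP BY purchase_order_id
) grouped_items ON po.id = grouped_items.purchase_order_id
WHERE po.shop_id = 195
```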

Context

StackExchange Database Administrators Q#270009, answer score: 5
