principlesqlMinor
Exists subselect vs inner join?
Viewed 0 times
innersubselectjoinexists
Problem
I'm moving up to the next level of my mystery query. It looks like there's a subselect inside of an exists, but on the same table. I think this could probably be simplified with an
Using PostgreSQL 9.4.2.
Table definitions (
Cardinalities:
Here's the original query, without modifications suggested for the subquery inside
The part I noticed looking weird is the
I tried changing this with the modification from my other SO post:
```
SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
SELECT quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
A
INNER JOIN higher up.Using PostgreSQL 9.4.2.
Table definitions (
/d+): https://gist.github.com/neezer/879f5d3649ca1903c6f3Cardinalities:
billing_pricequote: 1,462,625 rowsbilling_pricequotestatus: 3,331,657 rowsbilling_lineitem: 43,687,855 rowsHere's the original query, without modifications suggested for the subquery inside
EXISTS:SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
SELECT s1.quote_id
FROM billing_pricequotestatus s1
INNER JOIN (
SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC
) AS s2
ON s1.quote_id = s2.quote_id
AND s1.created_at = s2.max_created_at
WHERE s1.name IN ('adjustment','payment','billable')
)
GROUP BY i.quote_id, i.acct_id
;The part I noticed looking weird is the
SELECT on billing_pricequotestatus and then another subselect on the same table inside the INNER JOIN.I tried changing this with the modification from my other SO post:
```
SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
SELECT quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
A
Solution
The purpose of your subquery as I understand it:
Select rows where the latest related entry in
Incorrect 2nd query
It's not immediately clear to me why my modification didn't produce equivalent output
The 1st query picks the latest row from
The 2nd query is backwards: it check for any row with qualifying
Consequently, you get more rows from the 2nd query.
Incorrect time range
This predicate is a mess. Inefficient and possibly incorrect - or at least a ticking bomb:
The column
Even worse, the time zone abbreviation
You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.
And there's yet another problem: While the column value is shifted by a hard-coded time offset ('PST'), your upper bound
Don't do any of this. Don't translate
Query
Note the
Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.
Also note that I base the
Index
Currently, you get:
Seq Scan on billing_pricequote pq
Only 70k of your 1,5M rows are selected, which is around 5 %. An index on
Even more efficient with
Statistics
Postgres is over-estimating the selectivity of your time range:
(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408
rows=70,623 loops=1)
It might help to increase the statistics target for just the column
Table definition
Exemplary for
```
Column | Type | Modifiers
------------+--------------------------+------------------------------------------
id | integer | not null default nextval('...
quote_id | integer | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with tim
Select rows where the latest related entry in
billing_pricequotestatus has a qualifying name.Incorrect 2nd query
It's not immediately clear to me why my modification didn't produce equivalent output
The 1st query picks the latest row from
billing_pricequotestatus and checks whether the name qualifies (name IN ('adjustment','payment','billable')).The 2nd query is backwards: it check for any row with qualifying
name (not just the last one). Also, it doesn't make sense to compute an aggregate in an EXISTS semi-join. You don't want that. And it's not equivalent.Consequently, you get more rows from the 2nd query.
Incorrect time range
This predicate is a mess. Inefficient and possibly incorrect - or at least a ticking bomb:
WHERE pq.date_applied AT TIME ZONE 'PST'
BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz The column
date_applied is of type timestamptz. The construct AT TIME ZONE 'PST' converts it to type timestamp and shifts by the time offset hard-coded into the time zone abbreviation 'PST' - which is a bad move to begin with. It makes the expression non-sargable. That's more expensive and, more importantly, rules out the use of any plain index on date_applied.Even worse, the time zone abbreviation
'PST' is not aware of DST or any historic shifts in time. If your time zone has (or had in the past) daylight saving time, and your set spans different DST periods, your current expression is most probably incorrect:- "AT TIME ZONE" with zone name PostgreSQL bug?
- Ignoring timezones altogether in Rails and PostgreSQL
You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.
And there's yet another problem: While the column value is shifted by a hard-coded time offset ('PST'), your upper bound
'2016-03-03T22:27:41.734102-08:00'::timestamptz is provided as timestamptz and silently coerced to match the data type timestamp. Since no explicit time offset is provided, the cast defaults to the time zone of the current session. So you can get different results depending on the current time zone setting of your session. I can't think of a use case where this would make sense.Don't do any of this. Don't translate
timestamptz column date_applied to local time at all, don't mix data types like you do and don't mix different ways to cast. Instead, use the column as is and provide timestamptz parameters.Query
SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM billing_pricequote pq
JOIN LATERAL (
SELECT name
FROM billing_pricequotestatus
WHERE quote_id = pq.id
ORDER BY created_at DESC
LIMIT 1
) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN billing_lineitem i ON i.quote_id = pq.id
WHERE pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST') -- !
AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP BY 1,2;Note the
LATERAL join, but not LEFT JOIN, make it an INNER JOIN to implement your predicate right away.Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.
Also note that I base the
LATERAL JOIN on billing_pricequote - before joining to the big table billing_lineitem. This way we can eliminate rows early, which should be cheaper.Index
Currently, you get:
Seq Scan on billing_pricequote pq
Only 70k of your 1,5M rows are selected, which is around 5 %. An index on
date_applied might help a bit, but not much. However, this multicolumn index should help substantially if you can get index-only scans out of it:CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);Even more efficient with
name_id instead of name as suggested below.Statistics
Postgres is over-estimating the selectivity of your time range:
(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408
rows=70,623 loops=1)
It might help to increase the statistics target for just the column
date_applied. Details here:- Configuring PostgreSQL for read performance
Table definition
Exemplary for
billing_pricequotestatus:name seems to be one of a couple of possible types. It would help performance to normalize some more and just use a 4-byte integer referencing a lookup-table instead of a varchar(20) repeated over an over in 3.3M rows. Also, reordering columns (if possible) like I demonstrate would help some more:```
Column | Type | Modifiers
------------+--------------------------+------------------------------------------
id | integer | not null default nextval('...
quote_id | integer | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with tim
Code Snippets
WHERE pq.date_applied AT TIME ZONE 'PST'
BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptzSELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM billing_pricequote pq
JOIN LATERAL (
SELECT name
FROM billing_pricequotestatus
WHERE quote_id = pq.id
ORDER BY created_at DESC
LIMIT 1
) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN billing_lineitem i ON i.quote_id = pq.id
WHERE pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST') -- !
AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP BY 1,2;CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);Column | Type | Modifiers
------------+--------------------------+------------------------------------------
id | integer | not null default nextval('...
quote_id | integer | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with time zone | not null
name_id | integer | not null REFERENCES name_table(name_id)
notes | text | not nullContext
StackExchange Database Administrators Q#131310, answer score: 7
Revisions (0)
No revisions yet.