principlesqlMinor

Exists subselect vs inner join?

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

innersubselectjoinexists

Problem

I'm moving up to the next level of my mystery query. It looks like there's a subselect inside of an exists, but on the same table. I think this could probably be simplified with an INNER JOIN higher up.

Using PostgreSQL 9.4.2.

Table definitions (/d+): https://gist.github.com/neezer/879f5d3649ca1903c6f3

Cardinalities:

billing_pricequote: 1,462,625 rows

billing_pricequotestatus: 3,331,657 rows

billing_lineitem: 43,687,855 rows

Here's the original query, without modifications suggested for the subquery inside EXISTS:

SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
                                AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
  SELECT s1.quote_id
  FROM billing_pricequotestatus s1
  INNER JOIN (
    SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    WHERE quote_id=i.quote_id
    GROUP BY quote_id, created_at
    ORDER BY quote_id, created_at DESC
  ) AS s2
  ON s1.quote_id = s2.quote_id
  AND s1.created_at = s2.max_created_at
  WHERE s1.name IN ('adjustment','payment','billable')
)
GROUP BY i.quote_id, i.acct_id
;

The part I noticed looking weird is the SELECT on billing_pricequotestatus and then another subselect on the same table inside the INNER JOIN.

I tried changing this with the modification from my other SO post:

```
SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
SELECT quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
A

Solution

The purpose of your subquery as I understand it:

Select rows where the latest related entry in billing_pricequotestatus has a qualifying name.
Incorrect 2nd query

It's not immediately clear to me why my modification didn't produce equivalent output

The 1st query picks the latest row from billing_pricequotestatus and checks whether the name qualifies (name IN ('adjustment','payment','billable')).

The 2nd query is backwards: it check for any row with qualifying name (not just the last one). Also, it doesn't make sense to compute an aggregate in an EXISTS semi-join. You don't want that. And it's not equivalent.

Consequently, you get more rows from the 2nd query.
Incorrect time range

This predicate is a mess. Inefficient and possibly incorrect - or at least a ticking bomb:

WHERE pq.date_applied AT TIME ZONE 'PST'
       BETWEEN '2016-02-02T00:00:00'::timestamp
           AND '2016-03-03T22:27:41.734102-08:00'::timestamptz

The column date_applied is of type timestamptz. The construct AT TIME ZONE 'PST' converts it to type timestamp and shifts by the time offset hard-coded into the time zone abbreviation 'PST' - which is a bad move to begin with. It makes the expression non-sargable. That's more expensive and, more importantly, rules out the use of any plain index on date_applied.

Even worse, the time zone abbreviation 'PST' is not aware of DST or any historic shifts in time. If your time zone has (or had in the past) daylight saving time, and your set spans different DST periods, your current expression is most probably incorrect:

"AT TIME ZONE" with zone name PostgreSQL bug?

Ignoring timezones altogether in Rails and PostgreSQL

You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.

And there's yet another problem: While the column value is shifted by a hard-coded time offset ('PST'), your upper bound '2016-03-03T22:27:41.734102-08:00'::timestamptz is provided as timestamptz and silently coerced to match the data type timestamp. Since no explicit time offset is provided, the cast defaults to the time zone of the current session. So you can get different results depending on the current time zone setting of your session. I can't think of a use case where this would make sense.

Don't do any of this. Don't translate timestamptz column date_applied to local time at all, don't mix data types like you do and don't mix different ways to cast. Instead, use the column as is and provide timestamptz parameters.
Query

SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM   billing_pricequote pq
JOIN   LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = pq.id
   ORDER  BY created_at DESC
   LIMIT  1
   ) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN   billing_lineitem i ON i.quote_id = pq.id
WHERE  pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST')  -- !
                           AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP  BY 1,2;

Note the LATERAL join, but not LEFT JOIN, make it an INNER JOIN to implement your predicate right away.

Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.

Also note that I base the LATERAL JOIN on billing_pricequote - before joining to the big table billing_lineitem. This way we can eliminate rows early, which should be cheaper.
Index

Currently, you get:

Seq Scan on billing_pricequote pq

Only 70k of your 1,5M rows are selected, which is around 5 %. An index on date_applied might help a bit, but not much. However, this multicolumn index should help substantially if you can get index-only scans out of it:

CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);

Even more efficient with name_id instead of name as suggested below.
Statistics

Postgres is over-estimating the selectivity of your time range:

(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408
rows=70,623 loops=1)

It might help to increase the statistics target for just the column date_applied. Details here:

Configuring PostgreSQL for read performance

Table definition

Exemplary for billing_pricequotestatus:

name seems to be one of a couple of possible types. It would help performance to normalize some more and just use a 4-byte integer referencing a lookup-table instead of a varchar(20) repeated over an over in 3.3M rows. Also, reordering columns (if possible) like I demonstrate would help some more:

```
Column | Type | Modifiers
------------+--------------------------+------------------------------------------
id | integer | not null default nextval('...
quote_id | integer | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with tim

Code Snippets

WHERE pq.date_applied AT TIME ZONE 'PST'
       BETWEEN '2016-02-02T00:00:00'::timestamp
           AND '2016-03-03T22:27:41.734102-08:00'::timestamptz

SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM   billing_pricequote pq
JOIN   LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = pq.id
   ORDER  BY created_at DESC
   LIMIT  1
   ) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN   billing_lineitem i ON i.quote_id = pq.id
WHERE  pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST')  -- !
                           AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP  BY 1,2;

CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);

Column   |           Type           |              Modifiers
------------+--------------------------+------------------------------------------
 id         | integer                  | not null default nextval('...
 quote_id   | integer                  | not null
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 name_id    | integer                  | not null REFERENCES name_table(name_id)
 notes      | text                     | not null

Context

StackExchange Database Administrators Q#131310, answer score: 7

Revisions (0)

No revisions yet.