HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Is there a faster way to get UNION ALL behavior in Postgres?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresallunionbehaviorwayfastergetthere

Problem

I have a table (t0), in my Postgres DB, with data that looks something like this:

t1_id  t2_id
1      1      
2      1
2      1
4      null
4      null
5      null


And I have a query to return my desired results of:

t1_id  t2_id
1      1
4      null
5      null


My query looks something like this:

(
  SELECT DISTINCT ON (t2_id) t1_id, t2_id
  FROM t0
  WHERE t2_id IS NOT NULL
)
UNION ALL
(
  SELECT DISTINCT ON (t1_id) t1_id, t2_id
  FROM t0
  WHERE t2_id IS NULL
)


Is there a faster way to do an operation like this? It's not too bad, but I'm doing it in several places (with joins) and all these repeated queries seems to slow stuff down a bit. Seems like there must be a better way.

Here's the query in fiddle form: http://sqlfiddle.com/#!15/d41d8/3603

Solution

For the simple case, I can only think of minor improvements to the query:

(
  SELECT DISTINCT ON (t2_id)
         t1_id, t2_id
  FROM   t0
  WHERE  t2_id IS NOT NULL
  ORDER  BY t2_id, t1_id  -- to get consistent results
)
UNION ALL
(
  SELECT DISTINCT ON (t1_id)
         t1_id, NULL      -- cheaper
  FROM   t0
  WHERE  t2_id IS NULL
  -- if you retrieve more columns, add ORDER BY, too
)


-
As ypercube mentioned You need to add ORDER BY with an unambiguous list of expressions to get deterministic results.

-
You can use the constant NULL instead of t2_id in the second leg of the query. Also relevant for below index support.

-
The key to performance is indexing. Try two partial indexes matching the two parts of the query:

CREATE INDEX t0_part1_idx ON t0 (t2_id, t1_id) WHERE t2_id IS NOT NULL;
CREATE INDEX t0_part2_idx ON t0 (t1_id)        WHERE t2_id IS NULL;


You may or may not want to include additional columns to allow index-only scans.

Depending on table size and data distribution, there may be faster alternatives:

  • Select first row in each GROUP BY group? (last chapter)



  • Optimize GROUP BY query to retrieve latest record per user



Single SELECT

If you want to condense it into a single SELECT:

SELECT DISTINCT ON (coalesce(t2_id, t1_id), t2_id)
       t1_id, t2_id
FROM   t0
ORDER  BY coalesce(t2_id, t1_id), t2_id, t1_id;


Equivalent, except for sort order. If you want this to be fast, try a functional index:

CREATE INDEX t0_func_idx ON t0 (coalesce(t2_id, t1_id), t2_id, t1_id);

Code Snippets

(
  SELECT DISTINCT ON (t2_id)
         t1_id, t2_id
  FROM   t0
  WHERE  t2_id IS NOT NULL
  ORDER  BY t2_id, t1_id  -- to get consistent results
)
UNION ALL
(
  SELECT DISTINCT ON (t1_id)
         t1_id, NULL      -- cheaper
  FROM   t0
  WHERE  t2_id IS NULL
  -- if you retrieve more columns, add ORDER BY, too
)
CREATE INDEX t0_part1_idx ON t0 (t2_id, t1_id) WHERE t2_id IS NOT NULL;
CREATE INDEX t0_part2_idx ON t0 (t1_id)        WHERE t2_id IS NULL;
SELECT DISTINCT ON (coalesce(t2_id, t1_id), t2_id)
       t1_id, t2_id
FROM   t0
ORDER  BY coalesce(t2_id, t1_id), t2_id, t1_id;
CREATE INDEX t0_func_idx ON t0 (coalesce(t2_id, t1_id), t2_id, t1_id);

Context

StackExchange Database Administrators Q#79980, answer score: 5

Revisions (0)

No revisions yet.