HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Why is an OR statement slower than UNION?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whystatementunionslowerthan

Problem

Database version: PostgreSQL 12.6

I have a table with 600,000 records.

The table has the columns:

  • name (varchar)



  • location_type (int) enum values: (1,2,3)



  • ancestry (varchar)



Indexes:

  • ancestry (btree)



The ancestry column is a way to build a tree where every row has an ancestry containing all parent ids separated by '/'.

Consider the following example:

id
name
ancestry

1
root
null

5
node
'1'

12
node
'1/5'

22
leaf
'1/5/12'

The following query takes 686 ms to execute:

SELECT * FROM geolocations
WHERE EXISTS (
   SELECT 1 FROM geolocations g2
   WHERE g2.ancestry =
      CONCAT(geolocations.ancestry, '/', geolocations.id)
)


This query runs in 808 ms:

SELECT * FROM geolocations
WHERE location_type = 2


When combining both queried with an OR, it takes around 4475 ms to finish if it ever finishes.

SELECT * FROM geolocations
WHERE EXISTS (
   SELECT 1 FROM geolocations g2
   WHERE g2.ancestry =
      CONCAT(geolocations.ancestry, '/', geolocations.id)
) OR location_type = 2


Explain:

```
[
{
"Plan": {
"Node Type": "Seq Scan",
"Parallel Aware": false,
"Relation Name": "geolocations",
"Alias": "geolocations",
"Startup Cost": 0,
"Total Cost": 2760473.54,
"Plan Rows": 582910,
"Plan Width": 68,
"Filter": "((SubPlan 1) OR (location_type = 2))",
"Plans": [
{
"Node Type": "Index Only Scan",
"Parent Relationship": "SubPlan",
"Subplan Name": "SubPlan 1",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "index_geolocations_on_ancestry",
"Relation Name": "geolocations",
"Alias": "g2",
"Startup Cost": 0.43,
"Total Cost": 124.91,
"Plan Rows": 30,
"Plan Width": 0,
"Index Cond": "(ancestry = concat(geolocations.ancestry, '/', geolocations.id))"
}
]
},
"JIT": {
"Worker Number": -1,
"Functi

Solution

Jeff already hinted at this, but I feel the need to point out the elephant in the room:

The two queries are not equivalent!

UNION removes all duplicates across the SELECT list.

While the other query with OR keeps them.

You have SELECT * FROM geolocations, and no other tables in the FROM list. So if there are no duplicate rows in the table (which is guaranteed by any UNIQUE index on NOT NULL columns including PRIMARY KEY and UNIQUE constraints), there cannot be duplicates in the result and the two queries are equivalent after all. But any JOIN or any SELECT list with a (not-unique) subset of columns can introduce duplicate rows in the result!

Using UNION ALL instead is even further off. It produces duplicates that OR-ed predicates will not. If the same row qualifies for multiple OR-ed predicates, it qualifies once. Rewriting with UNION ALL will select that same row multiple times.

There is no way to filter "bad" duplicates and keep the "good" ones with UNION / UNION ALL. So Postgres cannot generally replace cases of "ugly OR" with UNION plans. Even where it could, it's not certain that UNION will, in fact, be faster.

But Postgres can typically combine multiple OR-ed predicates on the same table in a bitmap index scan. See:

  • What is a “bitmap index”?



An "ugly OR" is where predicates on different underlying tables are OR-ed together. (Or even the same table, but separate instances like in your case.) This can make queries more expensive, even when no indexes are involved. But it gets particularly expensive when an efficient index scan is foiled by this. (Indexes on different tables cannot be combined in a bitmap index scan!) It typically matters most for selective queries returning a small percentage of rows from underlying big tables. When more than a few percent of all rows have to be read anyway, index scans lose their power. Ugly ORs don't hurt as much in those queries.

Related blog post by Laurenz Albe:

  • https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/



Solutions

First of all, your use of CONCAT() is incorrect. It concatenates offspring with a leading / for root nodes with ancestry IS NULL. Like '/1' instead of '1'. Use CONCAT_WS():

SELECT *
FROM   geolocations g
WHERE  EXISTS (
   SELECT FROM geolocations g2
   WHERE  g2.ancestry = CONCAT_WS('/', g.ancestry, g.id)  -- !
   )
    OR location_type = 2;


See:

  • How to concatenate columns in a Postgres SELECT?



Still ugly. If you run queries like this a lot you might do more. If the table is read-only, add a boolean flag named has_children. Else consider a MATERIALIZED VIEW with that extra column or keep the table column current with triggers. Then your query can just be:

SELECT *
FROM   geolocations
WHERE  has_children
    OR location_type = 2;


has_children is typically not selective, so the query produces a lot of result rows and is never going to be very cheap (though a lot cheaper). Indexing the column won't help. We'd need complete information to maybe find a different angle. That's beyond the scope of this question.

Either way, if you need that redundant ancestry with every row, consider a proper array with a GIN index instead of the ugly string. Or maybe the additional module ltree, which is old, but for that purpose exactly.

Code Snippets

SELECT *
FROM   geolocations g
WHERE  EXISTS (
   SELECT FROM geolocations g2
   WHERE  g2.ancestry = CONCAT_WS('/', g.ancestry, g.id)  -- !
   )
    OR location_type = 2;
SELECT *
FROM   geolocations
WHERE  has_children
    OR location_type = 2;

Context

StackExchange Database Administrators Q#293836, answer score: 26

Revisions (0)

No revisions yet.