HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Postgres 11: Query plan uses seq scan after upgrade

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
afterscanpostgresseqqueryusesplanupgrade

Problem

The Situation

We have a database hosted on RDS with a few hundred tables, a few of which are quite large.

We recently upgraded the database from 9.5.22 to 11.8 and performance is significantly degraded.

After upgrading, we ran VACUUM ANALYZE on the instance (as opposed to ./analyze_new_cluster.sh as we're unable to run a shell on RDS instances).

This has not helped the situation. I spun up another standalone 11.8 instance of the database and ran a VACUUM FULL ANALYZE, and that database exhibits the same query planner behavior so including FULL in the VACUUM command did not help (as is suggested in some SO answers).

We have found one query that shows the most drastic change in performance before and after the upgrade:
SELECT f.uuid, p.name
FROM flights f
LEFT OUTER JOIN passengers p
    ON f.uuid = p.flight_id
WHERE f.uuid IN ()
ORDER BY f.date_created ASC;


Previously P95 latency was under 4 ms. Now, P95 is 15 seconds.

the trouble arises when the number of UUIDs in the WHERE clause includes 5 or more UUIDs.

The tables involved have the following (simplified) structure:
` Table "public.flights"
Column | Type | Modifiers | Storage | Stats target
--------------+--------------------------+-----------+---------+--------------
uuid | uuid | not null | plain |
date_created | timestamp with time zone | not null | plain |

Indexes:
"flights_pkey" PRIMARY KEY, btree (uuid)

Table "public.passengers"
Column | Type | Modifiers | Storage | Stats target
-----------+------------------------+-------------------------------+---------+-------------
id | bigint | not null default nextval(...) | plain |
flight_id | uuid | not null | plain |
name | character varying(128) | not null

Solution

Your statistics haven't changed much between versions. They are quite a bit off in both. But what changed is that the bad stats make parallel plans look attractive, now that parallel plans exist.

This is actually because the UUIDs I provide to the query did not exist in the flights table. I merely wanted to pass in a larger number to trigger the different behavior on how to scan the flights table.

Querying a column with high cardinality for values that happen to not exist is inherently hard to estimate well. Why are you doing this? It sounds here like you are doing it to intentionally create a problem, but it also sounds like you happened upon this problem because it was occuring naturally. How can both be true? Maybe this problem you have artificially created does not have the same root cause (or same solution) as the natural problems you are running into.

I experimented by setting max_parallel_workers_per_gather = 0, which also had the desired effect of coercing the postgres 11 to return a query plan that avoided the sequential scan, but I do not think it is wise to disable that functionality for the database.

The reason for setting this to off seems clear (assuming your one example is a representative instance of the real problems), whereas your reason for not wanting to turn it off seems pretty nebulous. Did you do the upgrade specifically to get access to parallel queries?

So before someone suggests to increase that value, why would that help the postgres 11 instance if the postgres 9.5 instance produces a "good" plan without them?

It is easy to make the right decision accidentally, if you don't have many options to choose from. Opening up parallelization offers many more ways to screw up, and having bad stats makes that likely for one of those bad ways to get chosen. Having said that, increasing the stats target wouldn't help anyway, unless you could increase to over 17 million, which you can't (and even then I don't think it would help anyway).

Is there something about postgres 11 (parallel workers?) that makes it think it can perform the sequential scan faster than the index scan? This seems unlikely given that the planner expects to return 21 rows but at a huge cost

It is not just the seq scan that it thinks benefits from the parallel query. By doing the seq scan on flights in parallel, it thinks that the index scan on passengers will also get done in parallel inherently, and that is where it thinks most of the supposed benefit will come from. Although this is not a complete explanation, as by turning off enable_seqscan, I would still expect to use a parallel plan, just with a parallel index scan or parallel bitmap heap scan on flights. I can't explain why it would give up on parallel altogether just due to enable_seqscan=off. And I can't reproduce that behavior in v11 with simulated data.

Context

StackExchange Database Administrators Q#282271, answer score: 2

Revisions (0)

No revisions yet.