Why do my queries take longer together than when I run them separately?
Problem
I have a series of update statements that follow a general pattern: one update aggregates values from another table (or sometimes multiple tables), and the next update produces a rank based on the aggregated values. This process is repeated 23 times, for a total of 46 update statements. Each update couplet takes 30-40 seconds to run on its own, but when I run them all together as a single transaction via PgAdmin it takes over an hour instead of the ~15 minutes I would expect based on the individual query times. (I ended up stopping execution and running them separately last time I tried.)
If I run the same set of updates in a file via psql, the process completes in the expected 15-minute timeframe.
Is there some quirk of the query planner that would change the execution plan when a lot of update statements run in a single transaction? Given the different behavior between psql and PgAdmin, I assume this is related to the way the queries are packaged for execution, but I'm not familiar enough to know the difference.
Is there a way to write my code so as to improve performance when running it all as a single transaction through PgAdmin?
I'm on PostgreSQL 9.5 on Ubuntu 16.04.
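For what it's worth, the single-transaction behavior can be reproduced in psql too, in case that matters for testing (a sketch; `updates.sql` is a hypothetical file holding the statements):
```
-- By default psql autocommits: each statement in the file is its own transaction.
\i updates.sql

-- With autocommit off, psql opens one transaction and keeps it open,
-- mimicking the single-transaction run in PgAdmin:
\set AUTOCOMMIT off
\i updates.sql
COMMIT;
```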
Here are two example couplets from the code:
```
-- bike_driver_aggressive
UPDATE generated.crash_aggregates
SET bike_driver_aggressive = (
    SELECT COUNT(*)
    FROM crashes_bike2 c
    WHERE c.int_id = crash_aggregates.int_id
    AND c.aggressive_driverfault
);
WITH ranks AS (
    SELECT int_id,
           rank() OVER (ORDER BY bike_driver_aggressive DESC) AS rank
    FROM crash_aggregates
)
UPDATE generated.crash_aggregates
SET bike_driver_aggressive_rank = ranks.rank
FROM ranks
WHERE crash_aggregates.int_id = ranks.int_id;

-- bike_allinjury
UPDATE generated.crash_aggregates
SET bike_allinjury = (
    SELECT COUNT(*)
    FROM crashes_bike1 c
    WHERE c.int_id = crash_aggregates.int_id
    AND c.injurycrash
) + (
    SELECT COUNT(*)
    FROM crashes_bike2 c
    WHERE c.int_id = crash_aggregates.int_id
    AND c.injurycrash
);
WITH ranks AS (
    SELECT int_id,
           rank() OVER (ORDER BY bike_allinjury DESC) AS rank
    FROM crash_aggregates
)
UPDATE generated.crash_aggregates
SET bike_allinjury_rank = ranks.rank
FROM ranks
WHERE crash_aggregates.int_id = ranks.int_id;
```
Solution
Debunking
If you are doing all your updates in the same transaction, each of them will have to work on an increasingly bigger set of (physical) tuples. See the following example:
```
CREATE TABLE explode (id integer, something text);
INSERT INTO explode SELECT i, md5(i::text) FROM generate_series(1, 100000) t(i);

\dt+ explode -- done in psql
                  List of relations
 Schema │  Name   │ Type  │ Owner  │  Size   │ Description
────────┼─────────┼───────┼────────┼─────────┼─────────────
 test   │ explode │ table │ avaczi │ 6704 kB │

BEGIN;

UPDATE explode SET something = something || 'a';
\dt+ explode
 test   │ explode │ table │ avaczi │ 13 MB   │

UPDATE explode SET something = something || 'a';
\dt+ explode
 test   │ explode │ table │ avaczi │ 20 MB   │

COMMIT;
```
You might see this even when the different updates are not in the same transaction. This will be the case if autovacuum (or a manual VACUUM) does not keep up with the speed of change.
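To check whether vacuum is keeping up, the statistics views show the dead row versions piling up. A quick look, using the explode table from above (the view and columns are standard PostgreSQL; the exact numbers will differ):
```
-- each whole-table UPDATE leaves one dead version per row behind
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'explode';

-- make the dead versions reusable; VACUUM FULL would also shrink the file,
-- but it takes an exclusive lock on the table
VACUUM explode;
```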
The reason is how MVCC in PostgreSQL works. When doing an update, PostgreSQL creates a new physical row with the new values, marking the old one invisible to transactions that start after the current one. This means that there will be row versions on the physical level that are 'wasted space' once the transaction commits. These eventually have to be freed by (auto)vacuuming.
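The new physical row can be observed directly through the system columns, for example with the explode table (ctid is the row's physical location, xmin the ID of the transaction that wrote it):
```
SELECT ctid, xmin, id FROM explode WHERE id = 1;

UPDATE explode SET something = 'changed' WHERE id = 1;

-- the same logical row now sits at a different physical location,
-- written by a newer transaction
SELECT ctid, xmin, id FROM explode WHERE id = 1;
```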
Now while you are inside a transaction, it is not clear whether it will be committed or rolled back, so the DB has to keep all the obsolete row versions at hand in case it has to revert to the original ones. (Well, not all; there seem to be some optimizations around this, but definitely some: in my experiments, the table seen above grew up to 200 MB instead of 6.6 MB.) This means the physical size of your table will grow and grow; the size of each step depends on what exactly you do in the UPDATE statements.
And this is where you can improve your process a lot. Currently, you have many updates that change the whole table unconditionally, effectively meaning that in each iteration you have twice the size of the table you had in the previous one. After 46 (or 23) rounds, even a relatively small table can grow to something really big.
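(Conversely, if each round commits on its own and the table is vacuumed in between, the dead versions get reused instead of piling up. A sketch, continuing the explode example in autocommit mode, where every statement is its own transaction:)
```
UPDATE explode SET something = something || 'a';
VACUUM explode;  -- the dead versions from that update can now be reused

UPDATE explode SET something = something || 'a';
\dt+ explode     -- the table stays roughly the same size instead of doubling
```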
Accordingly, operating on all the versions that are kept around takes more and more time. In my runs, the first UPDATE finished in 190-200 ms, and the 20th iteration reached 400 ms. As you have nice wide tables and the updates already take 30-40 seconds when run alone, you can expect some serious slowdown.
The (possible) solution
As your updates look very similar, and they just affect different columns of the same table, you could try something like:
```
UPDATE generated.crash_aggregates
SET bike_driver_aggressive = (
        SELECT COUNT(*)
        FROM crashes_bike2 c
        WHERE c.int_id = crash_aggregates.int_id
        AND c.aggressive_driverfault
    ),
    bike_allinjury = (
        SELECT COUNT(*)
        FROM crashes_bike1 c
        WHERE c.int_id = crash_aggregates.int_id
        AND c.injurycrash
    ) + (
        SELECT COUNT(*)
        FROM crashes_bike2 c
        WHERE c.int_id = crash_aggregates.int_id
        AND c.injurycrash
    ),
    [...];
```
This means the CPUs will have to think a bit about all the counts, but you paid for them to do their work. At the same time, the table will be rewritten only once.
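To see what the combined statement costs compared to a single couplet, the plan and timing can be inspected without keeping the changes, since EXPLAIN ANALYZE really executes an UPDATE (a sketch; these EXPLAIN options are available in PostgreSQL 9.5):
```
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE generated.crash_aggregates
SET bike_driver_aggressive = (
    SELECT COUNT(*)
    FROM crashes_bike2 c
    WHERE c.int_id = crash_aggregates.int_id
    AND c.aggressive_driverfault
);
ROLLBACK;  -- discard the update; we only wanted the plan and timings
```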
When all this is done, it seems to be enough to calculate the ranks only once as well, by similarly composing one UPDATE query that sets them all in a single round.
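Such a rank-updating statement is not spelled out above; a sketch of what it could look like, assuming the ..._rank column names follow the pattern from the question:
```
WITH ranks AS (
    SELECT int_id,
           rank() OVER (ORDER BY bike_driver_aggressive DESC) AS bike_driver_aggressive_rank,
           rank() OVER (ORDER BY bike_allinjury DESC) AS bike_allinjury_rank
           -- ... one window function per aggregate column ...
    FROM generated.crash_aggregates
)
UPDATE generated.crash_aggregates AS ca
SET bike_driver_aggressive_rank = r.bike_driver_aggressive_rank,
    bike_allinjury_rank = r.bike_allinjury_rank
    -- ... and so on for the remaining rank columns ...
FROM ranks AS r
WHERE ca.int_id = r.int_id;
```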
Improve more
Well, the above UPDATE (as also suggested in a comment) is far from being the most efficient ever. Here is a take on improving it a bit more.
As it looks, the counts are calculated one by one, meaning that the two source tables will be visited many times. Depending on their size (and some other factors), this might be terrible. In your case it stays under the pain threshold, otherwise you would already complain about that, too ;) In other cases, it might not be so.
The idea is that it is possible to collect all the aggregates in a single run, and to use this as a source for the updates. For this, we can build a big aggregate-all structure:
```
SELECT int_id,
       sum(c1.injurycrash::integer) + sum(c2.injurycrash::integer),
       sum(aggressive_driverfault::integer),
       ...
FROM crashes_bike1 AS c1, crashes_bike2 AS c2
GROUP BY int_id;
```
Here I suppose that the distribution of the different flags (like injurycrash) is such that TRUE values are not very rare. If they are rare, this one big scan of the whole crashes_bike table (well, the two of them) might be worse than many index (only) scans. However, I don't see that you have everything indexed there (and it might make no sense; it's hard to judge without actual data).
I've replaced count() with sum() in the aggregates above. The trick is to convert the boolean flags to integer, so that summing them counts the rows where the flag is TRUE.
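Since you are on 9.5, the cast-and-sum trick can also be written with an aggregate FILTER clause (available since PostgreSQL 9.4); both forms count the same rows:
```
SELECT int_id,
       sum(injurycrash::integer) AS injuries_via_sum,
       count(*) FILTER (WHERE injurycrash) AS injuries_via_filter
FROM crashes_bike1
GROUP BY int_id;
```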
Joining crashes_bike1 and crashes_bike2 directly, as in the sketch above, would multiply rows wherever an int_id appears in both tables, so it is safer to aggregate the two tables separately and combine the results in a single UPDATE:
```
WITH aggregates1 AS (
    SELECT int_id,
           sum(injurycrash::integer) AS injurycrash,
           sum(aggressive_driverfault::integer) AS aggressive_driverfault,
           ...
    FROM crashes_bike1
    GROUP BY int_id
), aggregates2 AS (
    SELECT int_id,
           sum(injurycrash::integer) AS injurycrash,
           ...
    FROM crashes_bike2
    GROUP BY int_id
)
UPDATE crash_aggregates AS ca
-- coalesce covers int_ids that appear in only one of the source tables
SET bike_allinjury = coalesce(a1.injurycrash, 0) + coalesce(a2.injurycrash, 0),
    bike_driver_aggressive = a1.aggressive_driverfault,
    ...
FROM aggregates1 AS a1
FULL JOIN aggregates2 AS a2 USING (int_id)
WHERE ca.int_id = a1.int_id OR ca.int_id = a2.int_id;
```
Context
StackExchange Database Administrators Q#156616, answer score: 11