principlesqlModerate
PostgreSQL update join vs SQL Server update join
Viewed 0 times
postgresqlupdatesqljoinserver
Problem
I recently started converting a personal project from Microsoft SQL Server to PostgreSQL and I was surprised at the abysmal performance I encountered doing an
Suppose they look something like:
In T-SQL I would do an update using a join using something like this:
But running in Postgres, the query is glacially slow.
If I change it to:
it's not a problem.
I get that the syntax is different but I would've expected the query optimizer to work out something in the same ball park. Instead, things go bananas. Besides the syntactical differences, is there a nuanced difference between the two queries that I fail to see?
Explain plans
`Update on foo (cost=87575.47..535974.25 rows=581621 width=1022)
-> Hash Join (cost=87575.47..535974.25 rows=581621 width=1022)
Hash Cond: (foo.id = t2.id)
-> Seq Scan on foo (cost=0.00..151301.17 rows=1140417 width=1011)
-> Hash (cost=75761.21..75761.21 rows=581621 width=36)
-> Seq Scan on foo2
UPDATE JOIN between two tables.Suppose they look something like:
CREATE TABLE foo (
id INTEGER NOT NULL PRIMARY KEY,
bar INTEGER NULL
);
CREATE TABLE foo2 (
id INTEGER NOT NULL PRIMARY KEY,
bar INTEGER NULL
);
In T-SQL I would do an update using a join using something like this:
UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2
ON t1.id = t2.id;
But running in Postgres, the query is glacially slow.
If I change it to:
UPDATE foo
SET bar = t2.bar
FROM foo2 t2
WHERE foo.id = t2.id;
it's not a problem.
I get that the syntax is different but I would've expected the query optimizer to work out something in the same ball park. Instead, things go bananas. Besides the syntactical differences, is there a nuanced difference between the two queries that I fail to see?
Explain plans
Update on foo (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
-> Nested Loop (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
-> Seq Scan on foo (cost=0.00..145721.10 rows=582410 width=1010)
-> Materialize (cost=85852.43..247935.91 rows=580908 width=17)
-> Hash Join (cost=85852.43..241627.37 rows=580908 width=17)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on foo t1 (cost=0.00..145721.10 rows=582410 width=10)
-> Hash (cost=75754.08..75754.08 rows=580908 width=15)
-> Seq Scan on foo2 t2 (cost=0.00..75754.08 rows=580908 width=15)
`Update on foo (cost=87575.47..535974.25 rows=581621 width=1022)
-> Hash Join (cost=87575.47..535974.25 rows=581621 width=1022)
Hash Cond: (foo.id = t2.id)
-> Seq Scan on foo (cost=0.00..151301.17 rows=1140417 width=1011)
-> Hash (cost=75761.21..75761.21 rows=581621 width=36)
-> Seq Scan on foo2
Solution
But running in Postgres, the query is glacially slow.
There is no join condition between
Could be fixed by adding a join condition like:
Well, technically, a
But that's just putting lipstick on a pig. While
The manual advises for the
Do not repeat the target table as a
And:
When a
target table is joined to the tables mentioned in the
and each output row of the join represents an update operation for the
target table. When using
produces at most one output row for each row to be modified. In other
words, a target row shouldn't join to more than one row from the other
table(s). If it does, then only one of the join rows will be used to
update the target row, but which one will be used is not readily predictable.
Such a self-join makes sense (or is even necessary!) if you need a
UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2 ON t1.id = t2.id;There is no join condition between
foo and t1, the implicit CROSS JOIN forces a Cartesian product, i.e. O(N²) (!) update operations instead of just O(N). And the result is non-deterministic nonsense. The effect also becomes apparent in the query plan: rows=338326628280 instead of rows=581621 (Also: both plans were produced off slightly different tables, but that seems irrelevant to the question.)Could be fixed by adding a join condition like:
UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2 ON t1.id = t2.id
WHERE foo.id = t1.id; -- !Well, technically, a
WHERE condition, but all the same.But that's just putting lipstick on a pig. While
id is the PK column of each table, that's just adding noise. Use the command you already found instead:UPDATE foo
SET bar = t2.bar
FROM foo2 t2
WHERE foo.id = t2.id;The manual advises for the
FROM clause of UPDATE:Do not repeat the target table as a
from_item unless you intend a self-join (in which case it must appear with an alias in the from_item).And:
When a
FROM clause is present, what essentially happens is that thetarget table is joined to the tables mentioned in the
from_item list,and each output row of the join represents an update operation for the
target table. When using
FROM you should ensure that the joinproduces at most one output row for each row to be modified. In other
words, a target row shouldn't join to more than one row from the other
table(s). If it does, then only one of the join rows will be used to
update the target row, but which one will be used is not readily predictable.
Such a self-join makes sense (or is even necessary!) if you need a
LEFT [OUTER] JOIN to additional table(s). Sadly, there is no provision in SQL to say "FROM LEFT" in an UPDATE. Example:- Nullify column in update if subquery returns empty
Code Snippets
UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2 ON t1.id = t2.id;UPDATE foo
SET bar = t2.bar
FROM foo t1
JOIN foo2 t2 ON t1.id = t2.id
WHERE foo.id = t1.id; -- !UPDATE foo
SET bar = t2.bar
FROM foo2 t2
WHERE foo.id = t2.id;Context
StackExchange Database Administrators Q#291632, answer score: 18
Revisions (0)
No revisions yet.