patternsqlMinor
Finding distinct rows across two tables: Full Outer Join more efficient than Union?
Viewed 0 times
distinctrowstablesfullthanmoreefficientfindingjoinunion
Problem
When finding distinct rows across two tables where we can't necessarily ensure are pre-sorted, is it a good idea to use a
We were able to bring a specific production query from ~10 minutes to ~3 minutes by re-writing a
See the following scripts if you'd like to run a simplified and anonymized version of our production query:
Setup script
FULL OUTER JOIN
The
```
-- CPU time = 3058 ms, elapsed time = 783 ms.
-- MaxUsedMemory: 29016 KB
-- Table '#t1'. Scan count 5, logical reads 1301, physical reads 0
-- Table '#t2'. Scan count 5, logical reads 12989, physical reads
FULL OUTER JOIN rather than a UNION? Are there any downsides to this approach? If it is consistently faster, why does the query optimizer not choose the same plan for UNION that the FULL OUTER JOIN would use?We were able to bring a specific production query from ~10 minutes to ~3 minutes by re-writing a
UNION as a FULL OUTER JOIN. A UNION seems like the more intuitive way to write the logic, but upon exploring both options, I have observed that the FULL OUTER JOIN is more efficient in terms of both memory and CPU usage.See the following scripts if you'd like to run a simplified and anonymized version of our production query:
Setup script
-- Create a 500K row table
SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY NEWID()) AS id, v1.number % 5 AS val
INTO #t1
FROM master..spt_values v1
CROSS JOIN master..spt_values v2
-- Create a 5MM row table that will match some, but not all, of the 500K row table
SELECT TOP 5000000 ROW_NUMBER() OVER (ORDER BY NEWID()) AS id, v1.number % 5 AS val
INTO #t2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2
-- Optionally, key both tables to see the impact it has on query plans and performance
-- Both queries end up with essentially the same plan and performance in this case
-- So that means that at least there is not a downside to using the FULL OUTER JOIN when the data is sorted
--ALTER TABLE #t1
--ADD UNIQUE CLUSTERED (id)
--ALTER TABLE #t2
--ADD UNIQUE CLUSTERED (id)FULL OUTER JOIN
The
FULL OUTER JOIN chooses the small of the two tables as the build side of a hash join, meaning that the memory usage is proportional to the size of the smaller table (500K rows).```
-- CPU time = 3058 ms, elapsed time = 783 ms.
-- MaxUsedMemory: 29016 KB
-- Table '#t1'. Scan count 5, logical reads 1301, physical reads 0
-- Table '#t2'. Scan count 5, logical reads 12989, physical reads
Solution
The semantics of the two queries are not the same -
Output:
That said, the optimizer does not know many
Note that with a unique constraint only on the larger table, the optimizer chooses a hash union, without expensive duplicate-removal on the probe input, that makes it choose Concat Union All in the question example:
The
If such a uniqueness guarantee does exist, and yet the optimizer does not select a Hash Union, you might try an
UNION removes duplicates, whereas the FULL OUTER JOIN will not:DECLARE @T1 AS table (id bigint NULL, val integer NULL);
DECLARE @T2 AS table (id bigint NULL, val integer NULL);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
SELECT COALESCE(t1.id, t2.id) AS id, COALESCE(t1.val, t2.val) AS val
FROM @t1 t1
FULL OUTER JOIN @t2 t2
ON t2.id = t1.id
AND t2.val = t1.val;
SELECT t1.id, t1.val
FROM @t1 t1
UNION
SELECT t2.id, t2.val
FROM @t2 t2;Output:
╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║ 1 ║ 1 ║
║ 1 ║ 1 ║
║ 1 ║ 1 ║
║ 1 ║ 1 ║
╚════╩═════╝
╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║ 1 ║ 1 ║
╚════╩═════╝That said, the optimizer does not know many
FOJN tricks, so it is always possible that there is a better way to express the query than the natural UNION. Only commonly-useful and always-correct transformations are implemented.Note that with a unique constraint only on the larger table, the optimizer chooses a hash union, without expensive duplicate-removal on the probe input, that makes it choose Concat Union All in the question example:
ALTER TABLE #t2
ADD CONSTRAINT UQ2
UNIQUE CLUSTERED (id);
SELECT COUNT(*), AVG(x.id), AVG(x.val)
FROM (
SELECT t1.id, t1.val
FROM #t1 t1
UNION
SELECT t2.id, t2.val
FROM #t2 t2
) AS x;The
FOJN rewrite may well be a useful one in cases where you know there cannot be duplicates within each input set, but this condition is not enforced with a unique constraint or index (particularly on the large input).If such a uniqueness guarantee does exist, and yet the optimizer does not select a Hash Union, you might try an
OPTION (HASH UNION) hint, to see how it compares.Code Snippets
DECLARE @T1 AS table (id bigint NULL, val integer NULL);
DECLARE @T2 AS table (id bigint NULL, val integer NULL);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
SELECT COALESCE(t1.id, t2.id) AS id, COALESCE(t1.val, t2.val) AS val
FROM @t1 t1
FULL OUTER JOIN @t2 t2
ON t2.id = t1.id
AND t2.val = t1.val;
SELECT t1.id, t1.val
FROM @t1 t1
UNION
SELECT t2.id, t2.val
FROM @t2 t2;╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║ 1 ║ 1 ║
║ 1 ║ 1 ║
║ 1 ║ 1 ║
║ 1 ║ 1 ║
╚════╩═════╝
╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║ 1 ║ 1 ║
╚════╩═════╝ALTER TABLE #t2
ADD CONSTRAINT UQ2
UNIQUE CLUSTERED (id);
SELECT COUNT(*), AVG(x.id), AVG(x.val)
FROM (
SELECT t1.id, t1.val
FROM #t1 t1
UNION
SELECT t2.id, t2.val
FROM #t2 t2
) AS x;Context
StackExchange Database Administrators Q#112300, answer score: 9
Revisions (0)
No revisions yet.