patternsqlMajor
Updating a table efficiently using JOIN
Viewed 0 times
efficientlyupdatingjoinusingtable
Problem
I have a table that has the details of households and another that has the details of all the persons associated with the households. For the household table I have a primary key defined using two columns in it -
Using the sorting dictated by the clustered indexing on primary keys, I have generated a unique ID for each household
Now, my next step is to associate each person with the corresponding households ie; map a
I have a total of 1928783 household records and 5239842 person records. The execution time is currently very high.
Now, my questions:
are the thumb rules for optimizing a join query?
execution time?
I have uploaded the execution plan generated by SQL Server 2008 for the whole script to SQLPerformance.com
[tempId,n]. For the person table I have a primary key defined using 3 of its columns [tempId,n,sporder]Using the sorting dictated by the clustered indexing on primary keys, I have generated a unique ID for each household
[HHID] and each person [PERID] record (the snippet below is for generating PERID]:ALTER TABLE dbo.persons
ADD PERID INT IDENTITY
CONSTRAINT [UQ dbo.persons HHID] UNIQUE;Now, my next step is to associate each person with the corresponding households ie; map a
[PERID] to a [HHID]. The crosswalk between the two tables is based on the two columns[tempId,n]. For this I have the following inner join statement.UPDATE t1
SET t1.HHID = t2.HHID
FROM dbo.persons AS t1
INNER JOIN dbo.households AS t2
ON t1.tempId = t2.tempId AND t1.n = t2.n;I have a total of 1928783 household records and 5239842 person records. The execution time is currently very high.
Now, my questions:
- Is it possible to optimize this query further? More generally, what
are the thumb rules for optimizing a join query?
- Is there another query construct that can achieve the result I want with better
execution time?
I have uploaded the execution plan generated by SQL Server 2008 for the whole script to SQLPerformance.com
Solution
I'm pretty certain the table definitions are close to this:
I don't have statistics for these tables or your data, but the following will at least set the table cardinality correct (the page counts are a guess):
Query Plan Analysis
The query you have now is:
This generates the rather inefficient plan:
The main problems in this plan are the hash join and sort. Both require a memory grant (the hash join needs to build a hash table, and the sort needs room to store the rows while sorting progresses). Plan Explorer shows this query was granted 765 MB:
This is quite a lot of server memory to dedicate to one query! More to the point, this memory grant is fixed before execution starts based on row count and size estimates.
If the memory turns out to be insufficient at execution time, at least some data for the hash and/or sort will be written to physical tempdb disk. This is known as a 'spill' and it can be a very slow operation. You can trace these spills (in SQL Server 2008) using the Profiler events Hash Warnings and Sort Warnings.
The estimate for the hash table's build input is very good:
The estimate for the sort input is less accurate:
You would have to use Profiler to check, but I suspect the sort will spill to tempdb in this case. It is also possible that the hash table spills too, but that is less clear-cut.
Note that the memory reserved for this query is split between the hash table and sort, because they run concurrently. The Memory Fractions plan property shows the relative amount of the memory grant expected to be used by each operation.
Why Sort and Hash?
The sort is introduced by the query optimizer to ensure that rows arrive at the Clustered Index Update operator in clustered key order. This promotes sequential access to the table, which is often much more efficient than random access.
The hash join is a less obvious choice, because it's inputs are similar sizes (to a first approximation, anyway). Hash join is best where one input (the one that builds the hash table) is relatively small.
In this case, the optimizer's costing model determines that hash join is the cheaper of the three options (hash, merge, nested loops).
Improving Performance
The cost model does not always get it right. It tends to over-estimate the cost of parallel merge join, especially as the number of threads increases. We can force a merge join with a query hint:
This produces a plan that does not require as much memory (because merge join does not need a hash table):
The problematic sort is still there, because merge join only preserves the order of its join keys (tempId, n) but the clustered keys are (tempId, n, sporder). You may find the merge join plan performs no better than the hash join plan.
Nested Loops Join
We can also try a nested loops join:
The plan for this query is:
This query plan is considered the worst by the optimizer's costing model, but it does have some very desirable features. First, nested loops join does not require a memory grant. Second, it can preserve the key order from the
Parallel Nested Loops
The big drawback with the nested loops plan is that it runs on a single thread. It is likely this query benefits from parallelism, but the optimizer decides there is no advantage in doing that here. This is not necessarily correct either. Unfortunately, there is no built-in query hint to get a parallel plan, but there is an undocumented way:
```
UPDATE t1
SET t1.HHID = t2.HHID
FROM dbo.persons AS t1
INNER JOIN dbo.households AS t2
ON t1.tempId = t2.tempId AND t1.n = t2.n
OPTION (LOOP JOIN,
CREATE TABLE dbo.households
(
tempId integer NOT NULL,
n integer NOT NULL,
HHID integer IDENTITY NOT NULL,
CONSTRAINT [UQ dbo.households HHID]
UNIQUE NONCLUSTERED (HHID),
CONSTRAINT [PK dbo.households tempId, n]
PRIMARY KEY CLUSTERED (tempId, n)
);
CREATE TABLE dbo.persons
(
tempId integer NOT NULL,
sporder integer NOT NULL,
n integer NOT NULL,
PERID integer IDENTITY NOT NULL,
HHID integer NOT NULL,
CONSTRAINT [UQ dbo.persons HHID]
UNIQUE NONCLUSTERED (PERID),
CONSTRAINT [PK dbo.persons tempId, n, sporder]
PRIMARY KEY CLUSTERED (tempId, n, sporder)
);I don't have statistics for these tables or your data, but the following will at least set the table cardinality correct (the page counts are a guess):
UPDATE STATISTICS dbo.persons
WITH
ROWCOUNT = 5239842,
PAGECOUNT = 100000;
UPDATE STATISTICS dbo.households
WITH
ROWCOUNT = 1928783,
PAGECOUNT = 25000;Query Plan Analysis
The query you have now is:
UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n;This generates the rather inefficient plan:
The main problems in this plan are the hash join and sort. Both require a memory grant (the hash join needs to build a hash table, and the sort needs room to store the rows while sorting progresses). Plan Explorer shows this query was granted 765 MB:
This is quite a lot of server memory to dedicate to one query! More to the point, this memory grant is fixed before execution starts based on row count and size estimates.
If the memory turns out to be insufficient at execution time, at least some data for the hash and/or sort will be written to physical tempdb disk. This is known as a 'spill' and it can be a very slow operation. You can trace these spills (in SQL Server 2008) using the Profiler events Hash Warnings and Sort Warnings.
The estimate for the hash table's build input is very good:
The estimate for the sort input is less accurate:
You would have to use Profiler to check, but I suspect the sort will spill to tempdb in this case. It is also possible that the hash table spills too, but that is less clear-cut.
Note that the memory reserved for this query is split between the hash table and sort, because they run concurrently. The Memory Fractions plan property shows the relative amount of the memory grant expected to be used by each operation.
Why Sort and Hash?
The sort is introduced by the query optimizer to ensure that rows arrive at the Clustered Index Update operator in clustered key order. This promotes sequential access to the table, which is often much more efficient than random access.
The hash join is a less obvious choice, because it's inputs are similar sizes (to a first approximation, anyway). Hash join is best where one input (the one that builds the hash table) is relatively small.
In this case, the optimizer's costing model determines that hash join is the cheaper of the three options (hash, merge, nested loops).
Improving Performance
The cost model does not always get it right. It tends to over-estimate the cost of parallel merge join, especially as the number of threads increases. We can force a merge join with a query hint:
UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n
OPTION (MERGE JOIN);This produces a plan that does not require as much memory (because merge join does not need a hash table):
The problematic sort is still there, because merge join only preserves the order of its join keys (tempId, n) but the clustered keys are (tempId, n, sporder). You may find the merge join plan performs no better than the hash join plan.
Nested Loops Join
We can also try a nested loops join:
UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n
OPTION (LOOP JOIN);The plan for this query is:
This query plan is considered the worst by the optimizer's costing model, but it does have some very desirable features. First, nested loops join does not require a memory grant. Second, it can preserve the key order from the
Persons table so that an explicit sort is not needed. You may find this plan performs relatively well, perhaps even good enough.Parallel Nested Loops
The big drawback with the nested loops plan is that it runs on a single thread. It is likely this query benefits from parallelism, but the optimizer decides there is no advantage in doing that here. This is not necessarily correct either. Unfortunately, there is no built-in query hint to get a parallel plan, but there is an undocumented way:
```
UPDATE t1
SET t1.HHID = t2.HHID
FROM dbo.persons AS t1
INNER JOIN dbo.households AS t2
ON t1.tempId = t2.tempId AND t1.n = t2.n
OPTION (LOOP JOIN,
Code Snippets
CREATE TABLE dbo.households
(
tempId integer NOT NULL,
n integer NOT NULL,
HHID integer IDENTITY NOT NULL,
CONSTRAINT [UQ dbo.households HHID]
UNIQUE NONCLUSTERED (HHID),
CONSTRAINT [PK dbo.households tempId, n]
PRIMARY KEY CLUSTERED (tempId, n)
);
CREATE TABLE dbo.persons
(
tempId integer NOT NULL,
sporder integer NOT NULL,
n integer NOT NULL,
PERID integer IDENTITY NOT NULL,
HHID integer NOT NULL,
CONSTRAINT [UQ dbo.persons HHID]
UNIQUE NONCLUSTERED (PERID),
CONSTRAINT [PK dbo.persons tempId, n, sporder]
PRIMARY KEY CLUSTERED (tempId, n, sporder)
);UPDATE STATISTICS dbo.persons
WITH
ROWCOUNT = 5239842,
PAGECOUNT = 100000;
UPDATE STATISTICS dbo.households
WITH
ROWCOUNT = 1928783,
PAGECOUNT = 25000;UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n;UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n
OPTION (MERGE JOIN);UPDATE P
SET HHID = H.HHID
FROM dbo.households AS H
JOIN dbo.persons AS P
ON P.tempId = H.tempId
AND P.n = H.n
OPTION (LOOP JOIN);Context
StackExchange Database Administrators Q#47134, answer score: 20
Revisions (0)
No revisions yet.