gotchaCritical
Unexpected scans during delete operation using WHERE IN
Viewed 0 times
deleteduringwhereoperationscansunexpectedusing
Problem
I've got a query like the following:
tblFEStatsBrowsers has got 553 rows.
tblFEStatsPaperHits has got 47.974.301 rows.
tblFEStatsBrowsers:
tblFEStatsPaperHits:
There's a clustered index on tblFEStatsPaperHits that does not include BrowserID. Performing the inner query will thus require a full table scan of tblFEStatsPaperHits - which is totally OK.
Currently, a full scan is executed for each row in tblFEStatsBrowsers, meaning I've got 553 full table scans of tblFEStatsPaperHits.
Rewriting to just a WHERE EXISTS doesn't change the plan:
However, as suggested by Adam Machanic, adding a HASH JOIN option does result in the optimal execution plan (just a single scan of tblFEStatsPaperHits):
Now this isn't as much a question of how to fix this - I can either use the OPTION (HASH JOIN) or create a temp table manually. I'm more wondering why the query optimizer would ever use the plan it currently does.
Since
DELETE FROM tblFEStatsBrowsers WHERE BrowserID NOT IN (
SELECT DISTINCT BrowserID FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID IS NOT NULL
)tblFEStatsBrowsers has got 553 rows.
tblFEStatsPaperHits has got 47.974.301 rows.
tblFEStatsBrowsers:
CREATE TABLE [dbo].[tblFEStatsBrowsers](
[BrowserID] [smallint] IDENTITY(1,1) NOT NULL,
[Browser] [varchar](50) NOT NULL,
[Name] [varchar](40) NOT NULL,
[Version] [varchar](10) NOT NULL,
CONSTRAINT [PK_tblFEStatsBrowsers] PRIMARY KEY CLUSTERED ([BrowserID] ASC)
)tblFEStatsPaperHits:
CREATE TABLE [dbo].[tblFEStatsPaperHits](
[PaperID] [int] NOT NULL,
[Created] [smalldatetime] NOT NULL,
[IP] [binary](4) NULL,
[PlatformID] [tinyint] NULL,
[BrowserID] [smallint] NULL,
[ReferrerID] [int] NULL,
[UserLanguage] [char](2) NULL
)There's a clustered index on tblFEStatsPaperHits that does not include BrowserID. Performing the inner query will thus require a full table scan of tblFEStatsPaperHits - which is totally OK.
Currently, a full scan is executed for each row in tblFEStatsBrowsers, meaning I've got 553 full table scans of tblFEStatsPaperHits.
Rewriting to just a WHERE EXISTS doesn't change the plan:
DELETE FROM tblFEStatsBrowsers WHERE NOT EXISTS (
SELECT * FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID = tblFEStatsBrowsers.BrowserID
)However, as suggested by Adam Machanic, adding a HASH JOIN option does result in the optimal execution plan (just a single scan of tblFEStatsPaperHits):
DELETE FROM tblFEStatsBrowsers WHERE NOT EXISTS (
SELECT * FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID = tblFEStatsBrowsers.BrowserID
) OPTION (HASH JOIN)Now this isn't as much a question of how to fix this - I can either use the OPTION (HASH JOIN) or create a temp table manually. I'm more wondering why the query optimizer would ever use the plan it currently does.
Since
Solution
"I'm more wondering why the query optimizer would ever use the plan it currently does."
To put it another way, the question is why the following plan looks cheapest to the optimizer, compared with the alternatives (of which there are many).
The inner side of the join is essentially running a query of the following form for each correlated value of
Note that the estimated number of rows is 185,220 (not 289,013) since the equality comparison implicitly excludes
Now let's add a
The estimated cost is now 0.00452 units. The addition of the Top physical operator sets a row goal of 1 row at the Top operator. The question then becomes how to derive a 'row goal' for the Clustered Index Scan; that is, how many rows should the scan expect to process before one row matches the
The statistical information available shows 166 distinct
The estimated cost of scanning the expected 166 rows is not very large (even executed 339 times, once for each change of
In practice, it is very unlikely that the first 166 rows scanned from the index (in whatever order they happen to be returned) will contain one each of the possible
It is also interesting to note that the Top operator in the execution plan is not associated with the Anti Semi Join (in particular the 'short-circuiting' Martin mentions). We can start to see where the Top comes from by first disabling an exploration rule called GbAggToConstScanOrTop:
That plan has an estimated cost of 364.912, and shows that the Top replaced a Group By Aggregate (grouping by the correlated column
That plan has an estimated cost of 40729.3 units.
Without the transformation from Group By to Top, the optimizer 'naturally' chooses a hash join plan with
And without the MAXDOP 1 restriction, a parallel plan:
Another way to 'fix' the original query would be to create the missing index on
I wrote more about this in Row Goals, Part 4: The Anti Join Anti Pattern.
To put it another way, the question is why the following plan looks cheapest to the optimizer, compared with the alternatives (of which there are many).
The inner side of the join is essentially running a query of the following form for each correlated value of
BrowserID:DECLARE @BrowserID smallint;
SELECT
tfsph.BrowserID
FROM dbo.tblFEStatsPaperHits AS tfsph
WHERE
tfsph.BrowserID = @BrowserID
OPTION (MAXDOP 1);Note that the estimated number of rows is 185,220 (not 289,013) since the equality comparison implicitly excludes
NULL (unless ANSI_NULLS is OFF). The estimated cost of the above plan is 206.8 units.Now let's add a
TOP (1) clause:DECLARE @BrowserID smallint;
SELECT TOP (1)
tfsph.BrowserID
FROM dbo.tblFEStatsPaperHits AS tfsph
WHERE
tfsph.BrowserID = @BrowserID
OPTION (MAXDOP 1);The estimated cost is now 0.00452 units. The addition of the Top physical operator sets a row goal of 1 row at the Top operator. The question then becomes how to derive a 'row goal' for the Clustered Index Scan; that is, how many rows should the scan expect to process before one row matches the
BrowserID predicate?The statistical information available shows 166 distinct
BrowserID values (1/[All Density] = 1/0.006024096 = 166). Costing assumes that the distinct values are distributed uniformly over the physical rows, so the row goal on the Clustered Index Scan is set to 166.302 (accounting for the change in table cardinality since the sampled statistics were gathered).The estimated cost of scanning the expected 166 rows is not very large (even executed 339 times, once for each change of
BrowserID) - the Clustered Index Scan shows an estimated cost of 1.3219 units, showing the scaling effect of the row goal. The unscaled operator costs for I/O and CPU are shown as 153.931, and 52.8698 respectively:In practice, it is very unlikely that the first 166 rows scanned from the index (in whatever order they happen to be returned) will contain one each of the possible
BrowserID values. Nevertheless, the DELETE plan is costed at 1.40921 units total, and is selected by the optimizer for that reason. Bart Duncan shows another example of this type in a recent post titled Row Goals Gone Rogue.It is also interesting to note that the Top operator in the execution plan is not associated with the Anti Semi Join (in particular the 'short-circuiting' Martin mentions). We can start to see where the Top comes from by first disabling an exploration rule called GbAggToConstScanOrTop:
DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');That plan has an estimated cost of 364.912, and shows that the Top replaced a Group By Aggregate (grouping by the correlated column
BrowserID). The aggregate is not due to the redundant DISTINCT in the query text: it is an optimization that can be introduced by two exploration rules, LASJNtoLASJNonDist and LASJOnLclDist. Disabling those two as well produces this plan:DBCC RULEOFF ('LASJNtoLASJNonDist');
DBCC RULEOFF ('LASJOnLclDist');
DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('LASJNtoLASJNonDist');
DBCC RULEON ('LASJOnLclDist');
DBCC RULEON ('GbAggToConstScanOrTop');That plan has an estimated cost of 40729.3 units.
Without the transformation from Group By to Top, the optimizer 'naturally' chooses a hash join plan with
BrowserID aggregation before the anti semi join:DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');And without the MAXDOP 1 restriction, a parallel plan:
Another way to 'fix' the original query would be to create the missing index on
BrowserID that the execution plan reports. Nested loops work best with when the inner side is indexed. Estimating cardinality for semi joins is challenging at the best of times. Not having proper indexing (the large table doesn't even have a unique key!) will not help at all.I wrote more about this in Row Goals, Part 4: The Anti Join Anti Pattern.
Code Snippets
DECLARE @BrowserID smallint;
SELECT
tfsph.BrowserID
FROM dbo.tblFEStatsPaperHits AS tfsph
WHERE
tfsph.BrowserID = @BrowserID
OPTION (MAXDOP 1);DECLARE @BrowserID smallint;
SELECT TOP (1)
tfsph.BrowserID
FROM dbo.tblFEStatsPaperHits AS tfsph
WHERE
tfsph.BrowserID = @BrowserID
OPTION (MAXDOP 1);DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');DBCC RULEOFF ('LASJNtoLASJNonDist');
DBCC RULEOFF ('LASJOnLclDist');
DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('LASJNtoLASJNonDist');
DBCC RULEON ('LASJOnLclDist');
DBCC RULEON ('GbAggToConstScanOrTop');DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers
WHERE BrowserID NOT IN
(
SELECT DISTINCT BrowserID
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');Context
StackExchange Database Administrators Q#18637, answer score: 61
Revisions (0)
No revisions yet.