gotchasqlMajor
Why does a subquery reduce the row estimate to 1?
Viewed 0 times
estimatewhythereducesubquerydoesrow
Problem
Consider the following contrived but simple query:
I would expect the final row estimate for this query to be equal to the number of rows in the
Why does this happen? What can I do about it?
It's very easy to reproduce this issue with the right syntax. Here is one set of table definitions that will do it:
db fiddle link.
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2)
END AS ID2
FROM X_HEAP;I would expect the final row estimate for this query to be equal to the number of rows in the
X_HEAP table. Whatever I'm doing in the subquery shouldn't matter for the row estimate because it cannot filter out any rows. However, on SQL Server 2016 I see the row estimate reduced to 1 because of the subquery:Why does this happen? What can I do about it?
It's very easy to reproduce this issue with the right syntax. Here is one set of table definitions that will do it:
CREATE TABLE dbo.X_HEAP (ID INT NOT NULL)
CREATE TABLE dbo.X_OTHER_TABLE (ID INT NOT NULL);
CREATE TABLE dbo.X_OTHER_TABLE_2 (ID INT NOT NULL);
INSERT INTO dbo.X_HEAP WITH (TABLOCK)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values;
CREATE STATISTICS X_HEAP__ID ON X_HEAP (ID) WITH FULLSCAN;db fiddle link.
Solution
This definitely seems like unintended behavior. It is true that cardinality estimates do not need to be consistent at each step of a plan but this is a relatively simple query plan and the final cardinality estimate is inconsistent with what the query is doing. Such a low cardinality estimate could result in poor choices for join types and access methods for other tables downstream in a more complicated plan.
Through trial and error we can come up with a few similar queries for which the issue does not appear:
We can also come up with more queries for which the issue appears:
There appears to be a pattern: if there is an expression within the
If I write the query against a table with a clustered index the rules change somewhat. We can use the same data:
This query has a 1000 row final estimate:
But this query has a 1 row final estimate:
To dig into this further we can use the undocumented trace flag 2363 to get information about how the query optimizer performed selectivity calculations. I found it helpful to pair that trace flag with the undocumented trace flag 8606. TF 2363 seems to give selectivity computations for both the simplified tree and the tree after project normalization. Having both trace flags enabled makes it clear which calculations apply to which tree.
Let's try it for the original query posted in the question:
Here is part of the part of output which I think is relevant along with some comments:
Now let's try it for a similar query that doesn't have the issue. I'm going to use this one:
Debug output at the very end:
```
Plan for computation:
CSelCalcColumnInInterval
Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
Pass-through selectivity: 1
Stats collection generated:
CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)
CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)
CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
Through trial and error we can come up with a few similar queries for which the issue does not appear:
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT -1)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID = 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END AS ID2
FROM dbo.X_HEAP;We can also come up with more queries for which the issue appears:
SELECT
ID
, CASE
WHEN ID = 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT -1)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END AS ID2
FROM dbo.X_HEAP;There appears to be a pattern: if there is an expression within the
CASE that is not expected to be executed and the result expression is a subquery against a table then the row estimate falls to 1 after that expression.If I write the query against a table with a clustered index the rules change somewhat. We can use the same data:
CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))
INSERT INTO dbo.X_CI WITH (TABLOCK)
SELECT * FROM dbo.X_HEAP;
UPDATE STATISTICS X_CI WITH FULLSCAN;This query has a 1000 row final estimate:
SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
END
FROM dbo.X_CI;But this query has a 1 row final estimate:
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END
FROM dbo.X_CI;To dig into this further we can use the undocumented trace flag 2363 to get information about how the query optimizer performed selectivity calculations. I found it helpful to pair that trace flag with the undocumented trace flag 8606. TF 2363 seems to give selectivity computations for both the simplified tree and the tree after project normalization. Having both trace flags enabled makes it clear which calculations apply to which tree.
Let's try it for the original query posted in the question:
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2)
END AS ID2
FROM X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);Here is part of the part of output which I think is relevant along with some comments:
Plan for computation:
CSelCalcColumnInInterval -- this is the type of calculator used
Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID -- this is the column used for the calculation
Pass-through selectivity: 0 -- all rows are expected to have a true value for the case expression
Stats collection generated:
CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- the row estimate after the join will still be 1000
CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)
CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)
...
Plan for computation:
CSelCalcColumnInInterval
Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
Pass-through selectivity: 1 -- no rows are expected to have a true value for the case expression
Stats collection generated:
CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter) -- the row estimate after the join will still be 1
CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- here is the row estimate after the previous join
CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)
CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)
CStCollBaseTable(ID=3, CARD=1 TBL: X_OTHER_TABLE_2)Now let's try it for a similar query that doesn't have the issue. I'm going to use this one:
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT -1)
END AS ID2
FROM dbo.X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);Debug output at the very end:
```
Plan for computation:
CSelCalcColumnInInterval
Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
Pass-through selectivity: 1
Stats collection generated:
CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)
CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)
CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
Code Snippets
SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT -1)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID < 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
WHEN ID >= 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END AS ID2
FROM dbo.X_HEAP;SELECT
ID
, CASE
WHEN ID < 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
WHEN ID >= 500
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT -1)
END AS ID2
FROM dbo.X_HEAP;
SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END AS ID2
FROM dbo.X_HEAP;CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))
INSERT INTO dbo.X_CI WITH (TABLOCK)
SELECT * FROM dbo.X_HEAP;
UPDATE STATISTICS X_CI WITH FULLSCAN;SELECT
ID
, CASE
WHEN ID = 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
END
FROM dbo.X_CI;SELECT
ID
, CASE
WHEN ID <> 0
THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE)
ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2)
END
FROM dbo.X_CI;Context
StackExchange Database Administrators Q#171654, answer score: 24
Revisions (0)
No revisions yet.