HiveBrain v1.2.0
Get Started
← Back to all entries
gotchasqlMajor

Why does a subquery reduce the row estimate to 1?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
estimatewhythereducesubquerydoesrow

Problem

Consider the following contrived but simple query:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END AS ID2
FROM X_HEAP;


I would expect the final row estimate for this query to be equal to the number of rows in the X_HEAP table. Whatever I'm doing in the subquery shouldn't matter for the row estimate because it cannot filter out any rows. However, on SQL Server 2016 I see the row estimate reduced to 1 because of the subquery:

Why does this happen? What can I do about it?

It's very easy to reproduce this issue with the right syntax. Here is one set of table definitions that will do it:

CREATE TABLE dbo.X_HEAP (ID INT NOT NULL)
CREATE TABLE dbo.X_OTHER_TABLE (ID INT NOT NULL);
CREATE TABLE dbo.X_OTHER_TABLE_2 (ID INT NOT NULL);

INSERT INTO dbo.X_HEAP WITH (TABLOCK)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values;

CREATE STATISTICS X_HEAP__ID ON X_HEAP (ID) WITH FULLSCAN;


db fiddle link.

Solution

This definitely seems like unintended behavior. It is true that cardinality estimates do not need to be consistent at each step of a plan but this is a relatively simple query plan and the final cardinality estimate is inconsistent with what the query is doing. Such a low cardinality estimate could result in poor choices for join types and access methods for other tables downstream in a more complicated plan.

Through trial and error we can come up with a few similar queries for which the issue does not appear:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;


We can also come up with more queries for which the issue appears:

SELECT 
  ID
, CASE
    WHEN ID = 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;


There appears to be a pattern: if there is an expression within the CASE that is not expected to be executed and the result expression is a subquery against a table then the row estimate falls to 1 after that expression.

If I write the query against a table with a clustered index the rules change somewhat. We can use the same data:

CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))

INSERT INTO dbo.X_CI WITH (TABLOCK)
SELECT * FROM dbo.X_HEAP;

UPDATE STATISTICS X_CI WITH FULLSCAN;


This query has a 1000 row final estimate:

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
  END
FROM dbo.X_CI;


But this query has a 1 row final estimate:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END
FROM dbo.X_CI;


To dig into this further we can use the undocumented trace flag 2363 to get information about how the query optimizer performed selectivity calculations. I found it helpful to pair that trace flag with the undocumented trace flag 8606. TF 2363 seems to give selectivity computations for both the simplified tree and the tree after project normalization. Having both trace flags enabled makes it clear which calculations apply to which tree.

Let's try it for the original query posted in the question:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END AS ID2
FROM X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);


Here is part of the part of output which I think is relevant along with some comments:

Plan for computation:

  CSelCalcColumnInInterval -- this is the type of calculator used

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID -- this is the column used for the calculation

Pass-through selectivity: 0 -- all rows are expected to have a true value for the case expression

Stats collection generated: 

  CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- the row estimate after the join will still be 1000

      CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)

      CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)

...

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 1 -- no rows are expected to have a true value for the case expression

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter) -- the row estimate after the join will still be 1

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- here is the row estimate after the previous join

          CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)

          CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)

      CStCollBaseTable(ID=3, CARD=1 TBL: X_OTHER_TABLE_2)


Now let's try it for a similar query that doesn't have the issue. I'm going to use this one:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);


Debug output at the very end:

```
Plan for computation:

CSelCalcColumnInInterval

Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 1

Stats collection generated:

CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)

CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)

CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)

Code Snippets

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;
SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;
CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))

INSERT INTO dbo.X_CI WITH (TABLOCK)
SELECT * FROM dbo.X_HEAP;

UPDATE STATISTICS X_CI WITH FULLSCAN;
SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
  END
FROM dbo.X_CI;
SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END
FROM dbo.X_CI;

Context

StackExchange Database Administrators Q#171654, answer score: 24

Revisions (0)

No revisions yet.