Actual and Estimated rows differ greatly
Problem
The full Actual plan is here.
Prior to executing the plan (because I'm debugging a poorly functioning plan) I have this block of variable assignments:
DECLARE @Days INT = 180
DECLARE @DateRangeFrom DateTime = DATEADD(d, -@Days, getDate())
DECLARE @DateRangeTo DateTime = getDate()
DECLARE @FacilityID INT = 1010
DECLARE @Answer0 INT = 1879
DECLARE @Answer1 INT = 1949
DECLARE @Answer1SetID INT = 1607
DECLARE @Answer2 INT = 1907
DECLARE @Answer2SetID INT = 1593

My first problem is with the lookup I'm performing on the IRItemAnswer_Info table (Node ID 19). It's spilling to tempdb, which already starts the query off on the wrong foot.
It's referencing the IRItemAnswerInfo_DGItemID_AnswerSourceID index, which is the correct index, as I'm matching on DGItemID and AnswerSourceID and getting back IncidentID. The index is created as:

CREATE NONCLUSTERED INDEX IRItemAnswerInfo_DGItemID_AnswerSourceID
ON dbo.IRItemAnswer_Info (DGItemID, AnswerSourceID)
INCLUDE ([IncidentID], [AnswerBoolean])

However, the Estimated Rows for the query is 53,459 and the Actual Rows is 969,812.
I just finished forcing new statistics via

UPDATE STATISTICS IRItemAnswer_Info IRItemAnswerInfo_DGItemID_AnswerSourceID WITH FULLSCAN

and it made no difference.

DBCC SHOW_STATISTICS ('IRItemAnswer_Info', 'DGItemID') for DGItemID = 1949 has EQ_ROWS as 1,063,536, and DBCC SHOW_STATISTICS ('IRItemAnswer_Info', 'AnswerSourceID') for AnswerSourceID = 1607 has EQ_ROWS as 970,079.

The database is running compatibility level 140 (SQL Server 2017). We would run 2019, but there are issues we need to correct in the stored procedures before we can do that.
What should be the next thing I look at?
I chose the worst-performing output, which uses the most common values.
IRItemAnswer_Info is a table containing user-defined answers to associate to an event, where DGItemID=1949 is one of the most common questions (almost every event has one), and where AnswerSourceID=1607 is …

Solution
As discussed in the related Q & A How does SQL Server know predicates are correlated?, SQL Server assumes by default that predicates are completely independent.
It only has detailed statistical information (histograms) on the single leading column, even where multi-column indexes or statistics are used. The question then is how to combine two statistics histograms from two separate predicates.
For example, say you have a query with WHERE c1 = x AND c2 = y. The selectivity of c1 = x is computed to be 0.2 from histogram information. The selectivity of c2 = y is computed to be 0.1 from a separate histogram.

What is the selectivity of the two predicates together? 0.2? 0.1? 0.2 x 0.1? Somewhere in between?
Without specific additional information, SQL Server has to make an educated guess. The original default was to assume complete independence. The newer cardinality estimation framework uses exponential backoff (the 'somewhere in between' option).
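As a rough illustration (a sketch, not SQL Server's actual code), the two combining strategies can be written out in Python, using the hypothetical 0.2 and 0.1 selectivities from the example above:

```python
def combine_independence(sels):
    """Original CE default: assume full independence and multiply."""
    result = 1.0
    for s in sels:
        result *= s
    return result

def combine_backoff(sels):
    """Newer CE: exponential backoff. The most selective predicate counts
    in full; each subsequent one is dampened by a further square root
    (exponents 1, 1/2, 1/4, 1/8, ...)."""
    result = 1.0
    for i, s in enumerate(sorted(sels)):
        result *= s ** (0.5 ** i)
    return result

sels = [0.2, 0.1]
print(combine_independence(sels))  # independence: 0.2 * 0.1
print(combine_backoff(sels))       # backoff: 0.1 * sqrt(0.2), the 'in between' answer
```

Note how backoff (about 0.045) lands between the independence answer (0.02) and the full-correlation answer (0.1).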
Your case is slightly different, in that you have two equality tests on columns in a multi-column index, which comes with multi-column statistics. These are not so grand as they might sound. We still only get a histogram on the leading column, but the statistics object does contain average density information for multiple columns.
For example an index on (a,b,c) would provide density information for (a), (a,b), and (a,b,c). This frequency information does capture something about correlation, but it is a single number at each level. This means a frequency-based estimate will always produce the same estimate given the same number of columns.
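To make the density idea concrete, here is a small Python sketch (with made-up rows, not SQL Server internals) computing the average density for each index-key prefix, i.e. 1 / (number of distinct values of that prefix), which is the 'All density' figure DBCC SHOW_STATISTICS reports:

```python
def avg_density(rows, cols):
    """1 / number of distinct values of the given column prefix."""
    distinct = {tuple(row[c] for c in cols) for row in rows}
    return 1.0 / len(distinct)

# Hypothetical rows covered by an index on (a, b, c)
rows = [
    {"a": 1, "b": 10, "c": 100},
    {"a": 1, "b": 10, "c": 101},
    {"a": 1, "b": 11, "c": 100},
    {"a": 2, "b": 10, "c": 100},
]

# One density number per prefix: (a), (a, b), (a, b, c)
for prefix in [("a",), ("a", "b"), ("a", "b", "c")]:
    print(prefix, avg_density(rows, prefix))
```

A frequency-based estimate is then just table cardinality times the density for the matched prefix, which is why it gives the same number for any pair of values.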
SQL Server does produce a selectivity estimate from the multi-column frequency information, but it also computes selectivity from the individual column histograms (where available). The histogram estimate assumes independence, and does not use exponential backoff.
The server chooses the histogram-based estimate if it is a higher selectivity than the frequency-based estimate. This appears to be the case in your example.
Plan for computation:
CSelCalcPointPredsFreqBased
Distinct value calculation:
CDVCPlanLeaf
1 Multi-Column Stats, 0 Single-Column Stats, 0 Guesses
Individual selectivity calculations:
CSelCalcColumnInInterval
Column: QCOL: [IIAI].DGItemID
CSelCalcColumnInInterval
Column: QCOL: [IIAI].AnswerSourceID
Loaded histogram for column QCOL: [IIAI].DGItemID from stats with id 2
Loaded histogram for column QCOL: [IIAI].AnswerSourceID from stats with id 3
Cardinality using multi-column statistics 5.45574e-07 and
with independence assumption 0.00231336.
Picking cardinality 0.00231336
Selectivity: 0.00231336
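The 'picking' step in the trace output above is just a comparison of the two candidate selectivities; a minimal sketch:

```python
multi_column_sel = 5.45574e-07  # from the multi-column density information
independence_sel = 0.00231336   # product of the single-column histogram selectivities

# The larger (less selective) candidate wins, matching 'Picking cardinality
# 0.00231336' in the trace output
picked = max(multi_column_sel, independence_sel)
print(picked)  # 0.00231336
```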
From information in the question, the individual selectivities are:
- DGItemID = 1949: 1063536 rows out of 19299400
- AnswerSourceID = 1607: 970079 rows out of 19299400
Assuming independence, for AND we multiply those selectivities, then multiply by the full table cardinality to produce the row estimate:

19299400 * (1063536/19299400) * (970079/19299400) = 53458.3427124
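That arithmetic is easy to check (the EQ_ROWS figures are the ones given in the question):

```python
table_rows = 19299400
dgitem_rows = 1063536        # EQ_ROWS for DGItemID = 1949
answer_source_rows = 970079  # EQ_ROWS for AnswerSourceID = 1607

sel1 = dgitem_rows / table_rows
sel2 = answer_source_rows / table_rows

# Independence assumption: multiply the selectivities, scale by table size
estimate = table_rows * sel1 * sel2
print(estimate)  # roughly 53458.34, matching the plan's 53,459 estimate
```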
There are a number of internal model variations that approach the task in different ways. Only a few are publicly documented and exposed via hints or trace flags.
Ordinarily, it would seem that the following hint could be helpful:
SELECT
COUNT_BIG(*)
FROM [VaxxTracker].[dbo].[IRItemAnswer_Info] AS iiai1
WHERE
iiai1.DGItemID = 1949
AND iiai1.AnswerSourceID = 1607
OPTION (USE HINT('ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES'));

Documentation
- 'ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES'
Causes SQL Server to generate a plan using minimum selectivity when estimating AND predicates for filters to account for full correlation. This hint name is equivalent to trace flag 4137 when used with cardinality estimation model of SQL Server 2012 (11.x) and earlier versions, and has similar effect when trace flag 9471 is used with cardinality estimation model of SQL Server 2014 (12.x) or higher.
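Under that full-correlation assumption, the combined selectivity is simply the minimum of the individual selectivities. A quick sketch with the numbers from this question shows it would land almost exactly on the actual row count:

```python
table_rows = 19299400
sels = [1063536 / table_rows, 970079 / table_rows]  # DGItemID, AnswerSourceID

# Full correlation: every row matching the more selective predicate is
# assumed to also match the other, so take the minimum selectivity
min_sel_estimate = table_rows * min(sels)
print(round(min_sel_estimate))  # 970079, close to the 969,812 actual rows
```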
Unfortunately, that hint does not apply when cardinality estimation started with a frequency-based calculation, using multi-column statistics.
You might get a better result in your case using the original CE model:
USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')

Context
StackExchange Database Administrators Q#305220, answer score: 6