patternMinor
How does SQL estimate the number of rows in a less than < Predicate
Viewed 0 times
estimaterowsthenumbersqlthanpredicatelessdoeshow
Problem
I have been doing some testing to try to better understand how SQL Server uses a histogram to estimate the number of rows that will match an equality predicate and also a predicate
Given I am using the AdventureWorks2016 OLTP database
If can understand SQL Server's estimation process for = and > predicates:
Then I can see the histogram for the column
The below screenshot is the top end of the histogram where I have run my tests:
The following query will estimate 6 rows as the value in the predicate is a RANGE_HI_KEY so uses the EQ_ROWS for that bucket:
The following will estimate 1.36 rows as it is not a RANGE_HI_KEY so uses the AVG_RANGE_ROWS for the bucket it falls in:
The following "greater than" query will estimate 130 rows which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863
A similar query below, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 and appears to use the same method as above
This all makes sense up to now so my testing moved onto a "less than" query
for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:
RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.
How does SQL Server use the histogram to estimate the nu
Given I am using the AdventureWorks2016 OLTP database
If can understand SQL Server's estimation process for = and > predicates:
/* update stats with fullscan first */
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCANThen I can see the histogram for the column
TransactionHistory.QuantityDBCC SHOW_STATISTICS (
'Production.TransactionHistory',
'Quantity')The below screenshot is the top end of the histogram where I have run my tests:
The following query will estimate 6 rows as the value in the predicate is a RANGE_HI_KEY so uses the EQ_ROWS for that bucket:
SELECT *
FROM Production.TransactionHistory
WHERE Quantity = 2863The following will estimate 1.36 rows as it is not a RANGE_HI_KEY so uses the AVG_RANGE_ROWS for the bucket it falls in:
SELECT *
FROM Production.TransactionHistory
WHERE Quantity = 2862The following "greater than" query will estimate 130 rows which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863
SELECT *
FROM Production.TransactionHistory
WHERE Quantity > 2863A similar query below, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 and appears to use the same method as above
SELECT *
FROM Production.TransactionHistory
WHERE Quantity > 2870This all makes sense up to now so my testing moved onto a "less than" query
SELECT *
FROM Production.TransactionHistory
WHERE Quantity < 490for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:
RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.
How does SQL Server use the histogram to estimate the nu
Solution
for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:
RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.
You're 13 short, not 11: 109,579 - 109,566 = 13.
The general idea, as shown in my related answer is to use linear interpolation within the partial step, assuming uniformity.
In your case:
So the question is how many of those 23
This computation gives 13.00595.
The
The whole thing is a modification of applying the fraction of the range you are asking for versus the range covered by the histogram step. Without excluding the unmatched value, it would be simply
RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.
You're 13 short, not 11: 109,579 - 109,566 = 13.
The general idea, as shown in my related answer is to use linear interpolation within the partial step, assuming uniformity.
In your case:
So the question is how many of those 23
RANGE_ROWS do we expect to match the predicate < 490 when they are assumed to be distributed uniformly within the histogram step with RANGE_HI_KEY 500:DECLARE
@ARR float = 23e0 / 6e0, -- AVG_RANGE_ROWS
@DRR float = 6e0, -- DISTINCT_RANGE_ROWS
@PR float = 490 - 470, -- predicate range
@SR float = 499 - 470 -- whole step range (excluding high key)
SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;This computation gives 13.00595.
The
-1 factors account for using < which is assumed to exclude a DISTINCT_RANGE_ROW row. When <= is used, that row is assumed to match the predicate.The whole thing is a modification of applying the fraction of the range you are asking for versus the range covered by the histogram step. Without excluding the unmatched value, it would be simply
@PR/@SR.Code Snippets
DECLARE
@ARR float = 23e0 / 6e0, -- AVG_RANGE_ROWS
@DRR float = 6e0, -- DISTINCT_RANGE_ROWS
@PR float = 490 - 470, -- predicate range
@SR float = 499 - 470 -- whole step range (excluding high key)
SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;Context
StackExchange Database Administrators Q#312816, answer score: 6
Revisions (0)
No revisions yet.