HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

How does SQL estimate the number of rows in a less than < Predicate

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
estimaterowsthenumbersqlthanpredicatelessdoeshow

Problem

I have been doing some testing to try to better understand how SQL Server uses a histogram to estimate the number of rows that will match an equality predicate and also a predicate

Given I am using the AdventureWorks2016 OLTP database

If can understand SQL Server's estimation process for = and > predicates:

/* update stats with fullscan first */    
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN


Then I can see the histogram for the column TransactionHistory.Quantity

DBCC SHOW_STATISTICS (
    'Production.TransactionHistory', 
    'Quantity')


The below screenshot is the top end of the histogram where I have run my tests:

The following query will estimate 6 rows as the value in the predicate is a RANGE_HI_KEY so uses the EQ_ROWS for that bucket:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity = 2863


The following will estimate 1.36 rows as it is not a RANGE_HI_KEY so uses the AVG_RANGE_ROWS for the bucket it falls in:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity = 2862


The following "greater than" query will estimate 130 rows which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity > 2863


A similar query below, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 and appears to use the same method as above

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity > 2870


This all makes sense up to now so my testing moved onto a "less than" query

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity < 490


for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:

RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.

How does SQL Server use the histogram to estimate the nu

Solution

for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:

RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.

You're 13 short, not 11: 109,579 - 109,566 = 13.

The general idea, as shown in my related answer is to use linear interpolation within the partial step, assuming uniformity.

In your case:

So the question is how many of those 23 RANGE_ROWS do we expect to match the predicate < 490 when they are assumed to be distributed uniformly within the histogram step with RANGE_HI_KEY 500:

DECLARE
    @ARR float = 23e0 / 6e0, -- AVG_RANGE_ROWS
    @DRR float = 6e0,        -- DISTINCT_RANGE_ROWS
    @PR float = 490 - 470,   -- predicate range
    @SR float = 499 - 470    -- whole step range (excluding high key)

SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;


This computation gives 13.00595.

The -1 factors account for using < which is assumed to exclude a DISTINCT_RANGE_ROW row. When <= is used, that row is assumed to match the predicate.

The whole thing is a modification of applying the fraction of the range you are asking for versus the range covered by the histogram step. Without excluding the unmatched value, it would be simply @PR/@SR.

Code Snippets

DECLARE
    @ARR float = 23e0 / 6e0, -- AVG_RANGE_ROWS
    @DRR float = 6e0,        -- DISTINCT_RANGE_ROWS
    @PR float = 490 - 470,   -- predicate range
    @SR float = 499 - 470    -- whole step range (excluding high key)

SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;

Context

StackExchange Database Administrators Q#312816, answer score: 6

Revisions (0)

No revisions yet.