patternsqlModerate
Сardinality estimation of partially covering range predicates
Viewed 0 times
predicatesсardinalityrangepartiallycoveringestimation
Problem
At the moment I'm trying to figure out how SQL Server evaluates the cardinality of range predicates that partially cover the histogram step.
On the Internet, at cardinality-estimation-for-and-for-intra-step-statistics-value I came across a similar question and Paul White gave a rather interesting answer to it.
According to Paul's answer, the formulas for estimating cardinality for the predicates >= and > (in this case, I am only interested in the Cardinality estimator model of at least 120) are as follows:
For >:
For >=:
I tested the application of these formulas on the [Production].[TransactionHistory] table of the AdventureWorks2014 database based on the range predicate using the TransactionDate column and datetime range between '20140614' and '20140618'.
The statistics for the histogram step of this range are as follows:
According to the formula, I calculated the cardinality for the following query:
The calculation was performed using the following code:
After calculati
On the Internet, at cardinality-estimation-for-and-for-intra-step-statistics-value I came across a similar question and Paul White gave a rather interesting answer to it.
According to Paul's answer, the formulas for estimating cardinality for the predicates >= and > (in this case, I am only interested in the Cardinality estimator model of at least 120) are as follows:
For >:
Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * (F * (DISTINCT_RANGE_ROWS - 1)))For >=:
Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * ((F * (DISTINCT_RANGE_ROWS - 1)) + 1))I tested the application of these formulas on the [Production].[TransactionHistory] table of the AdventureWorks2014 database based on the range predicate using the TransactionDate column and datetime range between '20140614' and '20140618'.
The statistics for the histogram step of this range are as follows:
According to the formula, I calculated the cardinality for the following query:
SELECT COUNT(1)
FROM [AdventureWorks2014].[Production].[TransactionHistory]
WHERE [TransactionDate] BETWEEN '20140615 00:00:00.000' AND '20140616 00:00:00.000'The calculation was performed using the following code:
DECLARE @predStart DATETIME = '20140615 00:00:00.000'
DECLARE @predEnd DATETIME = '20140616 00:00:00.000'
DECLARE @stepStart DATETIME = '20140614 00:00:00.000'
DECLARE @stepEnd DATETIME = '20140618 00:00:00.000'
DECLARE @predRange FLOAT = DATEDIFF(ms, @predStart, @predEnd)
DECLARE @stepRange FLOAT = DATEDIFF(ms, @stepStart, @stepEnd)
DECLARE @F FLOAT = @predRange / @stepRange;
DECLARE @avg_range_rows FLOAT = 100.3333
DECLARE @distinct_range_rows INT = 3
DECLARE @EQ_ROWS INT = 0
SELECT @F AS 'F'
--for new cardinality estimator
SELECT @EQ_ROWS + @avg_range_rows * (@F * (@distinct_range_rows - 1) + 1) AS [new_card]After calculati
Solution
SQL Server uses different calculations in different situations. Your example is different from the linked Q & A because your range is entirely contained within a step; it does not cross a step boundary. It is also an interval with two ends rather than one. Writing
Interval with two boundaries, within a single step
The formula is modified to perform linear interpolation within the step for the number of distinct values expected, and reflect that two range endpoints are now specified (and assumed to exist within the histogram step) rather than one.
Using the histogram steps given in the question:
For the query with
...giving 225.75. Changing
Both results match those given in the question.
BETWEEN is the same as writing two separate predicates with >= and <=.Interval with two boundaries, within a single step
The formula is modified to perform linear interpolation within the step for the number of distinct values expected, and reflect that two range endpoints are now specified (and assumed to exist within the histogram step) rather than one.
Using the histogram steps given in the question:
For the query with
BETWEEN '20140615' AND '20140616', the calculation is:DECLARE
@Q1 float = CONVERT(float, CONVERT(datetime, '2014-06-15')),
@Q2 float = CONVERT(float, CONVERT(datetime, '2014-06-16')),
@K1 float = CONVERT(float, CONVERT(datetime, '2014-06-14')),
@K2 float = CONVERT(float, CONVERT(datetime, '2014-06-18')),
@RANGE_ROWS float = 301,
@DISTINCT_RANGE_ROWS float = 3;
DECLARE
@S1 float = (@Q1 - @K1) / (@K2 - @K1),
@S2 float = (@Q2 - @K1) / (@K2 - @K1);
DECLARE
@F float = @S2 - @S1;
DECLARE
@AVG_RANGE_ROWS float = @RANGE_ROWS / @DISTINCT_RANGE_ROWS;
SELECT
@AVG_RANGE_ROWS * ((@F * (@DISTINCT_RANGE_ROWS - 2)) + 2);...giving 225.75. Changing
@Q2 from '20140616' to '20140617' gives a result of 250.833.Both results match those given in the question.
Code Snippets
DECLARE
@Q1 float = CONVERT(float, CONVERT(datetime, '2014-06-15')),
@Q2 float = CONVERT(float, CONVERT(datetime, '2014-06-16')),
@K1 float = CONVERT(float, CONVERT(datetime, '2014-06-14')),
@K2 float = CONVERT(float, CONVERT(datetime, '2014-06-18')),
@RANGE_ROWS float = 301,
@DISTINCT_RANGE_ROWS float = 3;
DECLARE
@S1 float = (@Q1 - @K1) / (@K2 - @K1),
@S2 float = (@Q2 - @K1) / (@K2 - @K1);
DECLARE
@F float = @S2 - @S1;
DECLARE
@AVG_RANGE_ROWS float = @RANGE_ROWS / @DISTINCT_RANGE_ROWS;
SELECT
@AVG_RANGE_ROWS * ((@F * (@DISTINCT_RANGE_ROWS - 2)) + 2);Context
StackExchange Database Administrators Q#249057, answer score: 14
Revisions (0)
No revisions yet.