patternsqlMinor
SQL server Cardinality Estimation for LIKE query
Viewed 0 times
cardinalitysqlquerylikeforserverestimation
Problem
I have this statistics histogram vector for my non-clustered index made on LastName column of a table named AspNetUsers.
If I run a query as
Similarly, to my understanding if i do query as
How does this Cardinality Estimation is works with a LIKE Query? Does it use different formulae for estimating number of rows when doing a LIKE query?
If I run a query as
SELECT FROM dbo.AspNetUsers WHERE LastName = 'Baker' it returns 6 rows as estimated rows, cause Baker is the RANGE_HI_KEY of the one of the step so the EQ_ROWS value is my Estimated rows count. Similarly, If i run a query as SELECT FROM dbo.AspNetUsers WHERE LastName = 'Bacilia', it returns 1 row as estimated rows, cause Bacilia fells in to 'Baker' step range, so the AVG_RAGE_ROWS value of that step is my estimated rows count.Similarly, to my understanding if i do query as
SELECT * FROM dbo.AspNetUsers WHERE LastName LIKE 'Ba%' it matches 2 steps (Baker and Batagoda), so it should return 27 + 51 (RANGE_ROWS) + 6 + 4 (EQ_ROWS) = 88. But it returns 99 rows as estimation.How does this Cardinality Estimation is works with a LIKE Query? Does it use different formulae for estimating number of rows when doing a LIKE query?
Solution
Does it use different formulae for estimating number of rows when
doing a LIKE query?
Yes.
I don't know much about the gory details but see the mention of "string summary statistics" here
the statistics object contains string summary statistics to improve
the cardinality estimates for query predicates that use the LIKE
operator; for example, WHERE ProductName LIKE '%Bike'. String summary
statistics are stored separately from the histogram and are created on
the first key column of the statistics object when it is of type char,
varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, or
ntext..
I don't have your sample data but just gave it a whirl with the player_overviews_unindexed tennis player dataset.
The relevant part of the histogram for that was
The statistics only show definitively that the range will contain at least 170 rows (12 + 71 + 6 + 46 + 4 + 26 + 5). And at most 294 when the end ranges are taken into account.
When I indexed the
There are lots of players with surnames (Babcock, Bahrami, Baer, Backe, Baghdatis, 7 * Baileys etc) that fall into that range but it has no way of knowing that from the histogram.
Presumably the string summary statistics does capture the distribution better here than is possible just from the histogram.
doing a LIKE query?
Yes.
I don't know much about the gory details but see the mention of "string summary statistics" here
the statistics object contains string summary statistics to improve
the cardinality estimates for query predicates that use the LIKE
operator; for example, WHERE ProductName LIKE '%Bike'. String summary
statistics are stored separately from the histogram and are created on
the first key column of the statistics object when it is of type char,
varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, or
ntext..
I don't have your sample data but just gave it a whirl with the player_overviews_unindexed tennis player dataset.
The relevant part of the histogram for that was
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
----------------------------------------------------------------------------------
Aubone 40 4 37 1.081081
Baker 79 12 60 1.316667
Barker 71 6 55 1.290909
Barton 46 4 25 1.84
Bates 26 5 20 1.3
Becker 45 6 35 1.285714The statistics only show definitively that the range will contain at least 170 rows (12 + 71 + 6 + 46 + 4 + 26 + 5). And at most 294 when the end ranges are taken into account.
- 79 last_name > 'Aubone' and last_name
- 12 last_name = 'Baker'
- 71 last_name > 'Baker' and last_name
- 6 last_name = 'Barker'
- 46 last_name > 'Barker' and last_name
- 4 last_name = 'Barton'
- 26 last_name > 'Barton' and last_name
- 5 last_name = 'Bates'
- 45 last_name > 'Bates' and last_name
When I indexed the
last_name column the last_name LIKE N'Ba%' predicate gets converted to an index seek on last_name >= N'Ba' AND last_name = N'Ba' AND last_name
- The estimate for
last_name LIKE N'Ba%' was 235.935 rows
- The estimate for
LEFT(last_name, 2) = 'Ba' was 171.317 rows.
- The actual number of rows returned was 241 rows.
For LEFT potentially it is just adding the AVG_RANGE_ROWS value of 1.316667 to that minimum of 170.
When doing the simple index range seek it looks like it just looks at the size of the range of the end histogram steps and the range of these that the query would select and does some interpolation based on that.
So there were 79 rows with last_name > 'Aubone' and last_name = 'Ba' - though in reality the numbers were 34 and 45` respectively.There are lots of players with surnames (Babcock, Bahrami, Baer, Backe, Baghdatis, 7 * Baileys etc) that fall into that range but it has no way of knowing that from the histogram.
Presumably the string summary statistics does capture the distribution better here than is possible just from the histogram.
Code Snippets
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
----------------------------------------------------------------------------------
Aubone 40 4 37 1.081081
Baker 79 12 60 1.316667
Barker 71 6 55 1.290909
Barton 46 4 25 1.84
Bates 26 5 20 1.3
Becker 45 6 35 1.285714Context
StackExchange Database Administrators Q#323067, answer score: 4
Revisions (0)
No revisions yet.