HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

SQL server Cardinality Estimation for LIKE query

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
cardinalitysqlquerylikeforserverestimation

Problem

I have this statistics histogram vector for my non-clustered index made on LastName column of a table named AspNetUsers.

If I run a query as SELECT FROM dbo.AspNetUsers WHERE LastName = 'Baker' it returns 6 rows as estimated rows, cause Baker is the RANGE_HI_KEY of the one of the step so the EQ_ROWS value is my Estimated rows count. Similarly, If i run a query as SELECT FROM dbo.AspNetUsers WHERE LastName = 'Bacilia', it returns 1 row as estimated rows, cause Bacilia fells in to 'Baker' step range, so the AVG_RAGE_ROWS value of that step is my estimated rows count.

Similarly, to my understanding if i do query as SELECT * FROM dbo.AspNetUsers WHERE LastName LIKE 'Ba%' it matches 2 steps (Baker and Batagoda), so it should return 27 + 51 (RANGE_ROWS) + 6 + 4 (EQ_ROWS) = 88. But it returns 99 rows as estimation.

How does this Cardinality Estimation is works with a LIKE Query? Does it use different formulae for estimating number of rows when doing a LIKE query?

Solution

Does it use different formulae for estimating number of rows when
doing a LIKE query?

Yes.

I don't know much about the gory details but see the mention of "string summary statistics" here

the statistics object contains string summary statistics to improve
the cardinality estimates for query predicates that use the LIKE
operator; for example, WHERE ProductName LIKE '%Bike'. String summary
statistics are stored separately from the histogram and are created on
the first key column of the statistics object when it is of type char,
varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, or
ntext..

I don't have your sample data but just gave it a whirl with the player_overviews_unindexed tennis player dataset.

The relevant part of the histogram for that was

RANGE_HI_KEY    RANGE_ROWS     EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
----------------------------------------------------------------------------------
       Aubone            40           4                    37         1.081081
        Baker            79          12                    60         1.316667
       Barker            71           6                    55         1.290909
       Barton            46           4                    25             1.84
        Bates            26           5                    20              1.3
       Becker            45           6                    35         1.285714


The statistics only show definitively that the range will contain at least 170 rows (12 + 71 + 6 + 46 + 4 + 26 + 5). And at most 294 when the end ranges are taken into account.

  • 79 last_name > 'Aubone' and last_name



  • 12 last_name = 'Baker'



  • 71 last_name > 'Baker' and last_name



  • 6 last_name = 'Barker'



  • 46 last_name > 'Barker' and last_name



  • 4 last_name = 'Barton'



  • 26 last_name > 'Barton' and last_name



  • 5 last_name = 'Bates'



  • 45 last_name > 'Bates' and last_name



When I indexed the last_name column the last_name LIKE N'Ba%' predicate gets converted to an index seek on last_name >= N'Ba' AND last_name = N'Ba' AND last_name

  • The estimate for last_name LIKE N'Ba%' was 235.935 rows



  • The estimate for LEFT(last_name, 2) = 'Ba' was 171.317 rows.



  • The actual number of rows returned was 241 rows.



For
LEFT potentially it is just adding the AVG_RANGE_ROWS value of 1.316667 to that minimum of 170.

When doing the simple index range seek it looks like it just looks at the size of the range of the end histogram steps and the range of these that the query would select and does some interpolation based on that.

So there were 79 rows with
last_name > 'Aubone' and last_name = 'Ba' - though in reality the numbers were 34 and 45` respectively.

There are lots of players with surnames (Babcock, Bahrami, Baer, Backe, Baghdatis, 7 * Baileys etc) that fall into that range but it has no way of knowing that from the histogram.

Presumably the string summary statistics does capture the distribution better here than is possible just from the histogram.

Code Snippets

RANGE_HI_KEY    RANGE_ROWS     EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
----------------------------------------------------------------------------------
       Aubone            40           4                    37         1.081081
        Baker            79          12                    60         1.316667
       Barker            71           6                    55         1.290909
       Barton            46           4                    25             1.84
        Bates            26           5                    20              1.3
       Becker            45           6                    35         1.285714

Context

StackExchange Database Administrators Q#323067, answer score: 4

Revisions (0)

No revisions yet.