
SQL Server Primary Key Column Statistics Histogram Suggests Duplicate Values


Problem

I have a statistic on a Primary Key column in a table. When I update the statistic with the default options:

```
UPDATE STATISTICS dbo.MyTable PK__MyTable__CB394B3946083350;
```

I get a histogram as follows (abridged)

```
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
3400002201 0 1 0 1
3400009992 18103.04 1 7790 2.323882
3400040033 26083.68 1 26080 1.000144
3400050456 13029.09 1 10422 1.250153
3400087676 26083.68 1 26080 1.000144
3400103858 19556.38 1 16181 1.208602
3400126866 13029.09 1 13029 1
3400162832 39138.27 1 35965 1.088232
3400213115 45665.56 1 45641 1.000547
...
```

Solution

This is partially explained in KB3202425 - Improvement: Improves the query performance for SQL Server 2016 by changing the use of histograms on UNIQUE columns:

Improvement

In the current cardinality estimation model of Microsoft SQL Server 2016, uniqueness isn't explained when histograms are scaled. This may cause unique columns having a frequency that's greater than 1. The method for dealing with this in RTM is to ignore the histogram completely on UNIQUE columns. This may cause poor estimates when the distribution of the column is non-uniform.

After this update is installed, the histogram will be used properly if it has all step values within 10% of unique.

Notes

This change will be enabled only if trace flag 4199 is enabled, if the database scope of database configuration QUERY_OPTIMIZER_HOTFIXES is set to ON, or if query hint ENABLE_QUERY_OPTIMIZER_HOTFIXES is used.

The ENABLE_QUERY_OPTIMIZER_HOTFIXES query hint is available starting in SQL Server 2016 SP1.
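
For reference, the three enablement options mentioned in the KB look like this in T-SQL (dbo.MyTable stands in for your table, and the COUNT query is just an arbitrary example statement):

```
-- Database-wide: enable optimizer hotfixes for the current database
ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = ON;

-- Per-query: the USE HINT form (SQL Server 2016 SP1 and later)
SELECT COUNT_BIG(*)
FROM dbo.MyTable AS MT
OPTION (USE HINT ('ENABLE_QUERY_OPTIMIZER_HOTFIXES'));

-- Instance-wide alternative: trace flag 4199
-- DBCC TRACEON (4199, -1);
```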

As so often with modern KBs, you have to imagine the meaning they're really trying to convey:

uniqueness isn't explained when histograms are scaled

One could read that as meaning sampled statistics histograms (the only kind that need to be 'scaled' up to table cardinality) don't reflect any uniqueness constraints, which is the heart of your question.
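
If that reading is right, a FULLSCAN update sidesteps the scaling entirely, since nothing is sampled. A sketch, using the statistic name from the question:

```
-- FULLSCAN reads every row, so the histogram is built from exact
-- counts and no scaling is involved.
UPDATE STATISTICS dbo.MyTable PK__MyTable__CB394B3946083350 WITH FULLSCAN;

-- An explicitly sampled update produces the scaled kind of
-- histogram the KB is talking about.
UPDATE STATISTICS dbo.MyTable PK__MyTable__CB394B3946083350 WITH SAMPLE 10 PERCENT;
```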

An interesting difference occurs if you set the database compatibility level to 120 or earlier. This obviously affects more than just statistics sampling. For example, on SQL Server 2019 CU 19 using the 2010 Stack Overflow sample database:

```
USE StackOverflow2010;
GO
ALTER DATABASE StackOverflow2010 SET COMPATIBILITY_LEVEL = 150;
GO
UPDATE STATISTICS dbo.Posts (PK_Posts__Id);
GO
SELECT TOP (5) 
    H.step_number,
    H.range_high_key,
    H.range_rows,
    H.equal_rows,
    H.distinct_range_rows,
    H.average_range_rows
FROM sys.tables AS T
JOIN sys.schemas AS SCH
    ON SCH.[schema_id] = T.[schema_id]
JOIN sys.indexes AS I
    ON I.[object_id] = T.[object_id]
JOIN sys.stats AS S
    ON S.[object_id] = T.[object_id]
    AND S.stats_id = I.index_id
CROSS APPLY sys.dm_db_stats_histogram
(
    S.[object_id],
    S.stats_id
) AS H
WHERE
    SCH.[name] = N'dbo'
    AND T.[name] = N'Posts'
    AND I.[type_desc] = N'CLUSTERED'
ORDER BY
    H.step_number ASC;
```


```
step_number  range_high_key  range_rows  equal_rows  distinct_range_rows  average_range_rows
-----------  --------------  ----------  ----------  -------------------  ------------------
1            467             0           1           0                    1
2            471             111.9387    1           3                    37.31289
3            594             167.908     1           122                  1.376295
4            5071            1735.05     1           1735                 1
5            11035           3078.314    1           3078                 1
```

Replacing compatibility level 150 in the above script with 120:

```
step_number  range_high_key  range_rows  equal_rows  distinct_range_rows  average_range_rows
-----------  --------------  ----------  ----------  -------------------  ------------------
1            467             0           1           0                    1
2            51352           25401.67    1           25402                1
3            115718          42914.28    1           42888                1.000605
4            141859          14267.46    1           14267                1
5            161001          14267.46    1           14267                1
```

Notice range_rows and distinct_range_rows are much closer to being equal.

On reflection, that's probably because CL 120 didn't allow parallel statistics updates, but I'll leave the example up anyway. You can see similar differences under CL 150 if you mess around with the optional MAXDOP clause for UPDATE STATISTICS.
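
For example, forcing a serial statistics update (the MAXDOP option for UPDATE STATISTICS is available from SQL Server 2016 SP2 onward):

```
-- Serial sampled update; compare the resulting histogram with the
-- default (potentially parallel) sampled update under CL 150.
UPDATE STATISTICS dbo.Posts (PK_Posts__Id) WITH MAXDOP = 1;
```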

As I say right at the start, this is a partial explanation. Only someone with access to the source code and internal design documentation could tell you exactly how and why things have changed.

Presumably, this is all part of the ongoing work to make selectivity estimation more consistent and easier to maintain. I think it's fair to say it hasn't been an unmitigated success story so far.

Related Q & A:

  • Badly-formed histogram causes bad estimates on Nested Loop
  • Statistics histogram AVG_RANGE_ROWS discrepancy


Context

StackExchange Database Administrators Q#324987, answer score: 6
