HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Bizarre density results in sampled statistics

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
sampleddensitystatisticsresultsbizarre

Problem

A NC-index gets a totally different statistical distribution when estimated with sampling vs fullscan; the sampled one having a bizarre density vector. This results in poor execution plans.

I have a table of ~27M rows, with a non null FK-column supported by a nonclustered index. The table is clustered on its primary key. Both columns are varchar.

A fullscan statistics update for our FK-column gives a normal looking density vector:

All density Average Length  Columns
6,181983E-08    45,99747    INSTANCEELEMENTID
3,615442E-08    95,26874    INSTANCEELEMENTID, ID


That is, we're expected to read about 1.7 rows for each distinct INSTANCELEMENTID we are joining with.

A typical bin from the histogram looks like this:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR          133053      10      71366               1,679318


However, if we do a sampled update (using the default sample number which is 230k rows for this table) things take a turn to the bizarre:

4,773657E-06    45,99596    INSTANCEELEMENTID
3,702179E-08    95,30183    INSTANCEELEMENTID, ID


The density on INSTANCEELEMENTID is now two orders of magnitude larger. (The density for both columns however has been estimated to a quite acceptable value).

A typical bin from the histogram now looks like this;

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS     DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR          143870,4    766,2573    1247                115,3596
ZOTZOT          131560,7    1           969                 135,7092


which is a completely different distribution. Note that the INSTANCEELEMENTID with the highest number of associated IDs has 12, the most common number is 1. It's also very strange that some bins get EQ_ROWS = 1, this happens to about 10% of the bins.

There is no "unlucky" draw of strange rows that could contribute to this.

Am I reading the histogram correctly? Doesn't it look like the sampling has somehow scaled EQ_ROWS, DIS

Solution

I've seen this same density problem on some of the nonclustered indexes on the largest databases that I have access to. First I'll start with a few observations that I've made about histograms and density calculations:

  • SQL Server is able to use the primary key on the table to infer something about the density of both columns. This means that the density that includes the PK columns will usually be very accurate.



  • The density calculation for the first column in the statistics is consistent with the histogram. If the histogram does not model the data well then the density may be off.



  • To create the histogram the StatMan function makes inferences about the data that is missing. The behavior can change depending on the data type of the column.



For one way to look at the problem, suppose that you sample 100 rows from a 10000 row table and you get 100 distinct values. One guess at what the rest of the data in the table is that there are 10000 unique values. Another guess is that there are 100 distinct values but each of the values are repeated 100 times. The second guess may seem unreasonable to you, which I will agree with. However, how do you balance the two approaches when the sampled data comes back unevenly distributed? There is some set of algorithms developed for this by Microsoft contained in the StatMan function. The algorithms may not work for all data disruptions and all sample levels.

Let's go through a relatively simple example. I'm going to use VARCHAR columns like in your table to see some of the same behavior. However, I'll just add one skewed value to the table. I'm testing against SQL Server 2016 SP1. Start with 100k rows with 100k unique values for the FK column:

DROP TABLE IF EXISTS X_STATS_SMALL;

CREATE TABLE X_STATS_SMALL (
ID VARCHAR(10) NOT NULL, 
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID)
);
-- insert 100k rows
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.GetNums(100000);

CREATE INDEX IX_X_STATS_SMALL ON X_STATS_SMALL (FK);

-- get sampled stats
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;


Here are some samples from the statistics:

╔═════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠═════════════╬════════════════╬═════════╣
║ 1.00001E-05 ║ 4.888205       ║ FK      ║
║ 1.00001E-05 ║ 9.77641        ║ FK, ID  ║
╚═════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 1005         ║ 0          ║ 1       ║ 0                   ║ 1              ║
║ 10648        ║ 665.0898   ║ 1       ║ 664                 ║ 1.002173       ║
║ 10968        ║ 431.6008   ║ 1       ║ 432                 ║ 1              ║
║ 11182        ║ 290.0924   ║ 1       ║ 290                 ║ 1              ║
║ 1207         ║ 445.7517   ║ 1       ║ 446                 ║ 1              ║
║ ...          ║ ...        ║ ...     ║ ...                 ║ ...            ║
║ 99989        ║ 318.3941   ║ 1       ║ 318                 ║ 1              ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝


For evenly distributed data with one unique value per row we get an accurate density, even with a VARCHAR histogram column and a sample size of 14294 rows.

Now let's add a skewed value and update statistics again:

-- add 70k rows with a FK value of '35000'
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N + 100000 , '35000',  REPLICATE('Z', 900)
FROM dbo.GetNums(70000);

UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;


With a sample size of 17010 rows the density of the first column is smaller than it should be:

```
╔══════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802 ║ FK ║
║ 5.882353E-06 ║ 10.28007 ║ FK, ID ║
╚══════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10978 ║ 956.9945 ║ 1 ║ 138 ║ 6.954391 ║
║ 11472 ║ 621.0283 ║ 1 ║ 89 ║ 6.941863 ║
║ 1179 ║ 315.6046 ║ 1 ║ 46 ║ 6.907561 ║
║ 11909 ║ 91.62713 ║ 1 ║ 14 ║ 6.74198 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 35000 ║ 376.6893 ║ 69195.05 ║ 54 ║ 6.918834 ║
║ ... ║ ... ║ ... ║ ... ║ ...

Code Snippets

DROP TABLE IF EXISTS X_STATS_SMALL;

CREATE TABLE X_STATS_SMALL (
ID VARCHAR(10) NOT NULL, 
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID)
);
-- insert 100k rows
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.GetNums(100000);

CREATE INDEX IX_X_STATS_SMALL ON X_STATS_SMALL (FK);

-- get sampled stats
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;
╔═════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠═════════════╬════════════════╬═════════╣
║ 1.00001E-05 ║ 4.888205       ║ FK      ║
║ 1.00001E-05 ║ 9.77641        ║ FK, ID  ║
╚═════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 1005         ║ 0          ║ 1       ║ 0                   ║ 1              ║
║ 10648        ║ 665.0898   ║ 1       ║ 664                 ║ 1.002173       ║
║ 10968        ║ 431.6008   ║ 1       ║ 432                 ║ 1              ║
║ 11182        ║ 290.0924   ║ 1       ║ 290                 ║ 1              ║
║ 1207         ║ 445.7517   ║ 1       ║ 446                 ║ 1              ║
║ ...          ║ ...        ║ ...     ║ ...                 ║ ...            ║
║ 99989        ║ 318.3941   ║ 1       ║ 318                 ║ 1              ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝
-- add 70k rows with a FK value of '35000'
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N + 100000 , '35000',  REPLICATE('Z', 900)
FROM dbo.GetNums(70000);

UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;
╔══════════════╦════════════════╦═════════╗
║ All density  ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802       ║ FK      ║
║ 5.882353E-06 ║ 10.28007       ║ FK, ID  ║
╚══════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS  ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039        ║ 0          ║ 1        ║ 0                   ║ 1              ║
║ 10978        ║ 956.9945   ║ 1        ║ 138                 ║ 6.954391       ║
║ 11472        ║ 621.0283   ║ 1        ║ 89                  ║ 6.941863       ║
║ 1179         ║ 315.6046   ║ 1        ║ 46                  ║ 6.907561       ║
║ 11909        ║ 91.62713   ║ 1        ║ 14                  ║ 6.74198        ║
║ ...          ║ ...        ║ ...      ║ ...                 ║ ...            ║
║ 35000        ║ 376.6893   ║ 69195.05 ║ 54                  ║ 6.918834       ║
║ ...          ║ ...        ║ ...      ║ ...                 ║ ...            ║
║ 99966        ║ 325.7854   ║ 1        ║ 47                  ║ 6.909731       ║
╚══════════════╩════════════╩══════════╩═════════════════════╩════════════════╝
DROP TABLE IF EXISTS X_STATS_LARGE;

CREATE TABLE X_STATS_LARGE (
ID VARCHAR(10) NOT NULL,
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID));

INSERT INTO X_STATS_LARGE WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.Getnums(1000000);

CREATE INDEX IX_X_STATS_LARGE ON X_STATS_LARGE (FK);

-- get sampled stats
UPDATE STATISTICS X_STATS_LARGE IX_X_STATS_LARGE;

Context

StackExchange Database Administrators Q#167754, answer score: 4

Revisions (0)

No revisions yet.