patternsqlMinor
Bizarre density results in sampled statistics
Viewed 0 times
sampleddensitystatisticsresultsbizarre
Problem
A NC-index gets a totally different statistical distribution when estimated with sampling vs fullscan; the sampled one having a bizarre density vector. This results in poor execution plans.
I have a table of ~27M rows, with a non null FK-column supported by a nonclustered index. The table is clustered on its primary key. Both columns are varchar.
A fullscan statistics update for our FK-column gives a normal looking density vector:
That is, we're expected to read about 1.7 rows for each distinct
A typical bin from the histogram looks like this:
However, if we do a sampled update (using the default sample number which is 230k rows for this table) things take a turn to the bizarre:
The density on
A typical bin from the histogram now looks like this;
which is a completely different distribution. Note that the
There is no "unlucky" draw of strange rows that could contribute to this.
Am I reading the histogram correctly? Doesn't it look like the sampling has somehow scaled EQ_ROWS, DIS
I have a table of ~27M rows, with a non null FK-column supported by a nonclustered index. The table is clustered on its primary key. Both columns are varchar.
A fullscan statistics update for our FK-column gives a normal looking density vector:
All density Average Length Columns
6,181983E-08 45,99747 INSTANCEELEMENTID
3,615442E-08 95,26874 INSTANCEELEMENTID, IDThat is, we're expected to read about 1.7 rows for each distinct
INSTANCELEMENTID we are joining with.A typical bin from the histogram looks like this:
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR 133053 10 71366 1,679318However, if we do a sampled update (using the default sample number which is 230k rows for this table) things take a turn to the bizarre:
4,773657E-06 45,99596 INSTANCEELEMENTID
3,702179E-08 95,30183 INSTANCEELEMENTID, IDThe density on
INSTANCEELEMENTID is now two orders of magnitude larger. (The density for both columns however has been estimated to a quite acceptable value).A typical bin from the histogram now looks like this;
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR 143870,4 766,2573 1247 115,3596
ZOTZOT 131560,7 1 969 135,7092which is a completely different distribution. Note that the
INSTANCEELEMENTID with the highest number of associated IDs has 12, the most common number is 1. It's also very strange that some bins get EQ_ROWS = 1, this happens to about 10% of the bins.There is no "unlucky" draw of strange rows that could contribute to this.
Am I reading the histogram correctly? Doesn't it look like the sampling has somehow scaled EQ_ROWS, DIS
Solution
I've seen this same density problem on some of the nonclustered indexes on the largest databases that I have access to. First I'll start with a few observations that I've made about histograms and density calculations:
For one way to look at the problem, suppose that you sample 100 rows from a 10000 row table and you get 100 distinct values. One guess at what the rest of the data in the table is that there are 10000 unique values. Another guess is that there are 100 distinct values but each of the values are repeated 100 times. The second guess may seem unreasonable to you, which I will agree with. However, how do you balance the two approaches when the sampled data comes back unevenly distributed? There is some set of algorithms developed for this by Microsoft contained in the
Let's go through a relatively simple example. I'm going to use
Here are some samples from the statistics:
For evenly distributed data with one unique value per row we get an accurate density, even with a
Now let's add a skewed value and update statistics again:
With a sample size of 17010 rows the density of the first column is smaller than it should be:
```
╔══════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802 ║ FK ║
║ 5.882353E-06 ║ 10.28007 ║ FK, ID ║
╚══════════════╩════════════════╩═════════╝
╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10978 ║ 956.9945 ║ 1 ║ 138 ║ 6.954391 ║
║ 11472 ║ 621.0283 ║ 1 ║ 89 ║ 6.941863 ║
║ 1179 ║ 315.6046 ║ 1 ║ 46 ║ 6.907561 ║
║ 11909 ║ 91.62713 ║ 1 ║ 14 ║ 6.74198 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 35000 ║ 376.6893 ║ 69195.05 ║ 54 ║ 6.918834 ║
║ ... ║ ... ║ ... ║ ... ║ ...
- SQL Server is able to use the primary key on the table to infer something about the density of both columns. This means that the density that includes the PK columns will usually be very accurate.
- The density calculation for the first column in the statistics is consistent with the histogram. If the histogram does not model the data well then the density may be off.
- To create the histogram the
StatManfunction makes inferences about the data that is missing. The behavior can change depending on the data type of the column.
For one way to look at the problem, suppose that you sample 100 rows from a 10000 row table and you get 100 distinct values. One guess at what the rest of the data in the table is that there are 10000 unique values. Another guess is that there are 100 distinct values but each of the values are repeated 100 times. The second guess may seem unreasonable to you, which I will agree with. However, how do you balance the two approaches when the sampled data comes back unevenly distributed? There is some set of algorithms developed for this by Microsoft contained in the
StatMan function. The algorithms may not work for all data disruptions and all sample levels.Let's go through a relatively simple example. I'm going to use
VARCHAR columns like in your table to see some of the same behavior. However, I'll just add one skewed value to the table. I'm testing against SQL Server 2016 SP1. Start with 100k rows with 100k unique values for the FK column:DROP TABLE IF EXISTS X_STATS_SMALL;
CREATE TABLE X_STATS_SMALL (
ID VARCHAR(10) NOT NULL,
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID)
);
-- insert 100k rows
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.GetNums(100000);
CREATE INDEX IX_X_STATS_SMALL ON X_STATS_SMALL (FK);
-- get sampled stats
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;Here are some samples from the statistics:
╔═════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠═════════════╬════════════════╬═════════╣
║ 1.00001E-05 ║ 4.888205 ║ FK ║
║ 1.00001E-05 ║ 9.77641 ║ FK, ID ║
╚═════════════╩════════════════╩═════════╝
╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 1005 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10648 ║ 665.0898 ║ 1 ║ 664 ║ 1.002173 ║
║ 10968 ║ 431.6008 ║ 1 ║ 432 ║ 1 ║
║ 11182 ║ 290.0924 ║ 1 ║ 290 ║ 1 ║
║ 1207 ║ 445.7517 ║ 1 ║ 446 ║ 1 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 99989 ║ 318.3941 ║ 1 ║ 318 ║ 1 ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝For evenly distributed data with one unique value per row we get an accurate density, even with a
VARCHAR histogram column and a sample size of 14294 rows.Now let's add a skewed value and update statistics again:
-- add 70k rows with a FK value of '35000'
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N + 100000 , '35000', REPLICATE('Z', 900)
FROM dbo.GetNums(70000);
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;With a sample size of 17010 rows the density of the first column is smaller than it should be:
```
╔══════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802 ║ FK ║
║ 5.882353E-06 ║ 10.28007 ║ FK, ID ║
╚══════════════╩════════════════╩═════════╝
╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10978 ║ 956.9945 ║ 1 ║ 138 ║ 6.954391 ║
║ 11472 ║ 621.0283 ║ 1 ║ 89 ║ 6.941863 ║
║ 1179 ║ 315.6046 ║ 1 ║ 46 ║ 6.907561 ║
║ 11909 ║ 91.62713 ║ 1 ║ 14 ║ 6.74198 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 35000 ║ 376.6893 ║ 69195.05 ║ 54 ║ 6.918834 ║
║ ... ║ ... ║ ... ║ ... ║ ...
Code Snippets
DROP TABLE IF EXISTS X_STATS_SMALL;
CREATE TABLE X_STATS_SMALL (
ID VARCHAR(10) NOT NULL,
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID)
);
-- insert 100k rows
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.GetNums(100000);
CREATE INDEX IX_X_STATS_SMALL ON X_STATS_SMALL (FK);
-- get sampled stats
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;╔═════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠═════════════╬════════════════╬═════════╣
║ 1.00001E-05 ║ 4.888205 ║ FK ║
║ 1.00001E-05 ║ 9.77641 ║ FK, ID ║
╚═════════════╩════════════════╩═════════╝
╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 1005 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10648 ║ 665.0898 ║ 1 ║ 664 ║ 1.002173 ║
║ 10968 ║ 431.6008 ║ 1 ║ 432 ║ 1 ║
║ 11182 ║ 290.0924 ║ 1 ║ 290 ║ 1 ║
║ 1207 ║ 445.7517 ║ 1 ║ 446 ║ 1 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 99989 ║ 318.3941 ║ 1 ║ 318 ║ 1 ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝-- add 70k rows with a FK value of '35000'
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N + 100000 , '35000', REPLICATE('Z', 900)
FROM dbo.GetNums(70000);
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;╔══════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802 ║ FK ║
║ 5.882353E-06 ║ 10.28007 ║ FK, ID ║
╚══════════════╩════════════════╩═════════╝
╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039 ║ 0 ║ 1 ║ 0 ║ 1 ║
║ 10978 ║ 956.9945 ║ 1 ║ 138 ║ 6.954391 ║
║ 11472 ║ 621.0283 ║ 1 ║ 89 ║ 6.941863 ║
║ 1179 ║ 315.6046 ║ 1 ║ 46 ║ 6.907561 ║
║ 11909 ║ 91.62713 ║ 1 ║ 14 ║ 6.74198 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 35000 ║ 376.6893 ║ 69195.05 ║ 54 ║ 6.918834 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 99966 ║ 325.7854 ║ 1 ║ 47 ║ 6.909731 ║
╚══════════════╩════════════╩══════════╩═════════════════════╩════════════════╝DROP TABLE IF EXISTS X_STATS_LARGE;
CREATE TABLE X_STATS_LARGE (
ID VARCHAR(10) NOT NULL,
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID));
INSERT INTO X_STATS_LARGE WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.Getnums(1000000);
CREATE INDEX IX_X_STATS_LARGE ON X_STATS_LARGE (FK);
-- get sampled stats
UPDATE STATISTICS X_STATS_LARGE IX_X_STATS_LARGE;Context
StackExchange Database Administrators Q#167754, answer score: 4
Revisions (0)
No revisions yet.