HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

How are the number of Histogram steps decided in Statistics

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
numberthearedecidedhistogramstatisticshowsteps

Problem

How are the number of histogram steps decided in Statistics in SQL Server?

Why is it restricted to 200 steps even though my key column has more than 200 distinct values? Is there any deciding factor?

Demo

Schema definition

CREATE TABLE histogram_step
  (
     id   INT IDENTITY(1, 1),
     name VARCHAR(50),
     CONSTRAINT pk_histogram_step PRIMARY KEY (id)
  )


Inserting 100 records into my table

INSERT INTO histogram_step
            (name)
SELECT TOP 100 name
FROM   sys.syscolumns


Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)


Histogram steps:

```
+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 |
| 11 | 1 | 1 | 1 | 1 |
| 13 | 1 | 1 | 1 | 1 |
| 15 | 1 | 1 | 1 | 1 |
| 17 | 1 | 1 | 1 | 1 |
| 19 | 1 | 1 | 1 | 1 |
| 21 | 1 | 1 | 1 | 1 |
| 23 | 1 | 1 | 1 | 1 |
| 25 | 1 | 1 | 1 | 1 |
| 27 | 1 | 1 |

Solution

I'm going to limit this post to discussing single column statistics because it'll already be pretty lengthy and you're interested in how SQL Server buckets the data into histogram steps. For multi column statistics the histogram is only created on the leading column.

When SQL Server determines that a statistics update is needed it kicks off a hidden query that reads either all of a table's data or a sample of the table's data. You can view these queries with extended events. There is a function called StatMan within SQL Server that is involved with creating the histograms. For simple statistics objects there are at least two different types of StatMan queries (there are different queries for quick stat updates and I suspect that the incremental stats feature on partitioned tables also uses a different query).

The first one just grabs all of the data from the table without any filtering. You can see this when the table is very small or you gather stats with the FULLSCAN option:

CREATE TABLE X_SHOW_ME_STATMAN (N INT);
CREATE STATISTICS X_STAT_X_SHOW_ME_STATMAN ON X_SHOW_ME_STATMAN (N);

-- after gathering stats with 1 row in table
SELECT StatMan([SC0]) FROM
(
    SELECT TOP 100 PERCENT [N] AS [SC0] 
    FROM [dbo].[X_SHOW_ME_STATMAN] WITH (READUNCOMMITTED)
    ORDER BY [SC0] 
) AS _MS_UPDSTATS_TBL 
OPTION (MAXDOP 16);


SQL Server picks the automatic sample size based on the size of the table (I think that it's both number of rows and pages in the table). If a table is too big then the automatic sample size falls below 100%. Here's what I got for the same table with 1M rows:

-- after gathering stats with 1 M rows in table
SELECT StatMan([SC0], [SB0000]) FROM 
(
    SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] 
    FROM 
    (
        SELECT [N] AS [SC0] 
        FROM [dbo].[X_SHOW_ME_STATMAN] TABLESAMPLE SYSTEM (6.666667e+001 PERCENT) WITH (READUNCOMMITTED) 
    ) AS _MS_UPDSTATS_TBL_HELPER 
    ORDER BY [SC0], [SB0000] 
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1);


TABLESAMPLE is documented but StatMan and step_direction are not. here SQL Server samples around 66.6% of the data from the table to create the histogram. What this means is that you could get a different number of histogram steps when updating stats (without FULLSCAN) on the same data. I've never observed this in practice but I don't see why it wouldn't be possible.

Let's run a few tests on simple data to see how the stats change over time. Below is some test code that I wrote to insert sequential integers into a table, gather stats after each insert, and save information about the stats into a results table. Let's start with just inserting 1 row at a time up to 10000. Test bed:

DECLARE
@stats_id INT,
@table_object_id INT,
@rows_per_loop INT = 1,
@num_of_loops INT = 10000,
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE X_STATS_RESULTS;

    SET @table_object_id = OBJECT_ID ('X_SEQ_NUM');
    SELECT @stats_id = stats_id FROM sys.stats
    WHERE OBJECT_ID = @table_object_id
    AND name = 'X_STATS_SEQ_INT_FULL';

    SET @loop_num = 0;
    WHILE @loop_num < @num_of_loops
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_NUM WITH (TABLOCK)
        SELECT @rows_per_loop * (@loop_num - 1) + N FROM dbo.GetNums(@rows_per_loop);

        UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN; -- can comment out FULLSCAN as needed

        INSERT INTO X_STATS_RESULTS WITH (TABLOCK)
        SELECT 'X_STATS_SEQ_INT_FULL', @rows_per_loop * @loop_num, rows_sampled, steps 
        FROM sys.dm_db_stats_properties(@table_object_id, @stats_id);
        END;
END;


For this data the number of histogram steps quickly increases to 200 (it first hits the max number of steps with 397 rows), stays at 199 or 200 until 1485 rows are in the table, then slowly decreases until the histogram only has 3 or 4 steps. Here's a graph of all of the data:

Here's the the histogram looks like for 10k rows:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           1       0                   1
9999            9997        1       9997                1
10000           0           1       0                   1


Is it a problem that the histogram only has 3 steps? It looks like information is preserved from our point of view. Note that because the datatype is an INTEGER we can figure out how many rows are in the table for each integer from 1 - 10000. Typically SQL Server can figure this out too, although there are some cases in which this doesn't quite work out. See this SE post for an example of this.

What do you think will happen if we delete a single row from the table and update stats? Ideally we'd get another histogram step to show that the missing integer is no longer in the table.

```
DELETE FROM X_SEQ_NUM
WHERE X_NUM = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATIST

Code Snippets

CREATE TABLE X_SHOW_ME_STATMAN (N INT);
CREATE STATISTICS X_STAT_X_SHOW_ME_STATMAN ON X_SHOW_ME_STATMAN (N);

-- after gathering stats with 1 row in table
SELECT StatMan([SC0]) FROM
(
    SELECT TOP 100 PERCENT [N] AS [SC0] 
    FROM [dbo].[X_SHOW_ME_STATMAN] WITH (READUNCOMMITTED)
    ORDER BY [SC0] 
) AS _MS_UPDSTATS_TBL 
OPTION (MAXDOP 16);
-- after gathering stats with 1 M rows in table
SELECT StatMan([SC0], [SB0000]) FROM 
(
    SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] 
    FROM 
    (
        SELECT [N] AS [SC0] 
        FROM [dbo].[X_SHOW_ME_STATMAN] TABLESAMPLE SYSTEM (6.666667e+001 PERCENT) WITH (READUNCOMMITTED) 
    ) AS _MS_UPDSTATS_TBL_HELPER 
    ORDER BY [SC0], [SB0000] 
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1);
DECLARE
@stats_id INT,
@table_object_id INT,
@rows_per_loop INT = 1,
@num_of_loops INT = 10000,
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE X_STATS_RESULTS;

    SET @table_object_id = OBJECT_ID ('X_SEQ_NUM');
    SELECT @stats_id = stats_id FROM sys.stats
    WHERE OBJECT_ID = @table_object_id
    AND name = 'X_STATS_SEQ_INT_FULL';

    SET @loop_num = 0;
    WHILE @loop_num < @num_of_loops
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_NUM WITH (TABLOCK)
        SELECT @rows_per_loop * (@loop_num - 1) + N FROM dbo.GetNums(@rows_per_loop);

        UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN; -- can comment out FULLSCAN as needed

        INSERT INTO X_STATS_RESULTS WITH (TABLOCK)
        SELECT 'X_STATS_SEQ_INT_FULL', @rows_per_loop * @loop_num, rows_sampled, steps 
        FROM sys.dm_db_stats_properties(@table_object_id, @stats_id);
        END;
END;
RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           1       0                   1
9999            9997        1       9997                1
10000           0           1       0                   1
DELETE FROM X_SEQ_NUM
WHERE X_NUM  = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

DELETE FROM X_SEQ_NUM
WHERE X_NUM  IN (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

Context

StackExchange Database Administrators Q#159790, answer score: 17

Revisions (0)

No revisions yet.