HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Trim_reason = DICTIONARY_SIZE on Clustered ColumnStore index

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
trim_reasonclusteredcolumnstoredictionary_sizeindex

Problem

On a partitioned table with 17 billion rows, I'm getting a very high number of row_groups.

I've heard that I should try to get 1.048.576 rows in each rowgroup.
How do I achieve that?

The trim_reason is DICTIONARY_SIZE, as reported by sys.dm_db_column_store_row_group_physical_stats.

I get on average 100.000 rows per rowgroup, when my batch job is inserting 1,048,576 rows per batch.

I'm fairly new to columnstore indexes, but the only reference I can find to DICTIONARY_SIZE is for string columns. My table only has bit, bigint, and numeric columns.

My table looks like this (column names obscured), and I know that a lot of the columns contain mostly NULLs. If there are values, they are likely to vary a lot (large standard deviation). It is vibration data.

CREATE TABLE fct.MeasureV2
(
MeasureV2ID bigint NOT NULL IDENTITY(1, 1),
DT1ModelID int NOT NULL CONSTRAINT DF_MeasureV2_DT1ModelID DEFAULT ((0)),
DT1TID int NOT NULL CONSTRAINT DF_MeasureV2_DT1TID DEFAULT ((0)),
TimeStampUTC datetime NOT NULL,
AChkSum bigint NULL,
OChkSum bigint NULL,
SChkSum bigint NULL,
ATp numeric (18, 10) NULL,
ATpAvg numeric (18, 10) NULL,
AWDAbsAvg numeric (18, 10) NULL,
G1TMF numeric (18, 10) NULL,
G2Ps_1TMF numeric (18, 10) NULL,
G2Ps_2TMF numeric (18, 10) NULL,
/* about 600 more numeric (18, 10) NULL columns */
WdSdv numeric (18, 10) NULL,
UpdatedDate datetime NOT NULL,
UpdatedID bigint NULL,
IsCorrected bit NOT NULL CONSTRAINT DF_MeasureV2_IsCorrected DEFAULT ((0)),
TRATransformJobID int NOT NULL
) ON PS_FctMeasure (TimeStampUTC)
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_MeasureV2 ON fct.MeasureV2 ON PS_FctMeasure (TimeStampUTC)
GO


There are a lot of columns, but the alternative attribute-value table would contain >200 billion rows, and be almost impossible to query. This format is slow, but not too slow.

We've also tried SPARSE columns, but that took a lot of disk space.

Solution

I've heard that I should try to get 1,048,576 rows in each row_group.

That's an ideal, but it's not always achievable in practice. It's not often a primary factor you should concentrate on in any case.

...the only reference I can find to DICTIONARY_SIZE is for string columns

Dictionary pressure is easier to demo with long string columns, that's all.

Dictionary encoding is available for all data types. The alternative encoding, value-based, is only available for numeric types (in the broadest sense).

Value encoding works by scaling and rebasing numeric values. For example, the values {1100, 1200, 1300} can be stored as (0, 1, 2) by first scaling by a factor of 0.01 to give {11, 12, 13}, then rebasing from 11 to give {0, 1, 2}.

With dictionary encoding, each unique value is stored in a dictionary and assigned an integer id. The segment data then references id numbers in the dictionary instead of the original values.

When dictionaries are used, a primary dictionary stores values for the entire column. An additional secondary dictionary may be created for each segment of the column (i.e. per rowgroup).

A dictionary has a maximum in-memory size of 16MB. It is compressed when persistently stored. The size of rowgroup is automatically reduced during columnstore building if a dictionary becomes full.

Note though, the same secondary dictionary may be referenced by multiple row groups if the same thread was responsible for building them. If that dictionary hits its maximum size, all rowgroups referencing it will also be limited in size.

If sharing secondary dictionaries is causing your rowgroup size limitation, you can try rebuilding the columnstore index with global trace flag 11612 enabled:

DBCC TRACEON (11612, -1);


This trace flag prevents secondary dictionary sharing.

You can find the type of encoding used in sys.column_store_segments.

The type of dictionary and the data it is associated with can be found from sys.column_store_dictionaries.

With a fixed schema and data set, there's little you can do to affect the type of encoding and compression scheme SQL Server uses.

Even if you can change the schema or data set, it's hard to say what you should do, given the information provided and without knowing the types of query you run. It's possible your data isn't optimally suited to columnstore compression, but that doesn't mean it's not one of the better solutions. You might look into using a data lake for example. If you want design advice, I suggest you ask a new question about that, with as much relevant detail as you are able to share.

If your columnstore is already well-partitioned and performing well enough in practice, I wouldn't worry too much about the smaller rowgroups due to dictionary size limits.

Avoiding delta stores, achieving partition/rowgroup elimination at runtime (where possible), and aggregate pushdown is usually much more important.

Code Snippets

DBCC TRACEON (11612, -1);

Context

StackExchange Database Administrators Q#316179, answer score: 8

Revisions (0)

No revisions yet.