HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

How does case-insensitive collation work?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
casehowinsensitiveworkdoescollation

Problem

Default collation type in SQL Server allows for indexing against case insensitive strings yet the case of the data is persisted. How does this actually work? I'm looking for the actual nuts and bolts, bits and bytes, or a good resource that explains it in detail.

create table casetest (fruitnames nvarchar(50) not null);
create unique index IX_fruitnames on casetest(fruitnames);

insert into casetest values ('apples');
insert into casetest values ('Pears');
-- this insert fails
insert into casetest values ('pears');

-- this yields 'Pears' as a result
select * from casetest (forceseek) where fruitnames = 'PEARS'

update casetest set fruitnames = 'pears' where fruitnames = 'pEArs'

-- this yields 'pears' as a result
select * from casetest (forceseek) where fruitnames = 'PEARS'


Questions About SQL Server Collations You Were Too Shy to Ask by Robert Sheldon covers how to use collation. It does not cover how collation works. I'm interested in how an index can be efficiently created/queried not caring about case, while simultaneously storing case data.

Solution

indexing against case insensitive strings yet the case of the data is persisted. How does this actually work?

This is actually not a SQL Server specific behavior, it's just how these things work in general.

So, the data is the data. If you are speaking about an index specifically, the data needs to be stored as it is else it would require a look-up in the main table each time to get the actual value, and there would be no possibility of a covering index (at least not for string types).

The data, either in the table/clustered index or non-clustered index, does not contain any collation / sorting info. It is simply data. The collation (locale/culture rules and sensitivities) is just meta data attached to the column and used when a sort operation is called (unless overridden by a COLLATE clause), which would include the creation/rebuild of an index. The rules defined by a non-binary collation are used to generate sort-keys, which are binary representations of the string (sort keys are unnecessary in binary collations). These binary representations incorporate all of the locale/culture rules and selected sensitivities. The sort-keys are used to place the records in their proper order, but are not themselves stored in the index or table. They aren't stored (at least I haven't seen these values in the index and was told that they aren't stored) because:

  • They aren't truly needed for sorting since they would merely be in the same order as the rows in the table or index anyway. But, the physical order of the index is just sorting, not comparison.



  • While storing them might make comparisons faster, it would also make the index larger as the minimum size for a single character is 5 bytes, and that's just "overhead" (of the sort key structure). Most characters are 2 bytes each, plus 1 byte if there's an accent, plus 1 byte if it's upper-case. For example, "e" is a 7-byte key, "E" and "é" are both 8 bytes, and "É" is a 9-byte key. Hence, not worth storing these in the end.



There are two types of collations: SQL Server and Windows.

SQL Server

SQL Server collations (those with names starting with SQL_) are the older, pre-SQL Server 2000 way of sorting/comparing (even though SQL_Latin1_General_CP1_CI_AS is still the installation default on US English OSes, quite sadly). In this older, simplistic, non-Unicode model, each combination of locale, code page, and the various sensitivities are given a static mapping of each of the characters in that code page. Each character is assigned a value (i.e. sort weight) to denote how it equates with the others. Comparisons in this model appear to do a two-pass operation:

  • First, it removes all accents (such that " ü " becomes " u "), expands characters like " Æ " into " A " and " E ", then does an initial sort so that words are in a natural order (how you would expect to find them in a dictionary).



  • Then, it goes character by character to determine equality based on these underlying values per each character. This second part is what mustaccio is describing in his answer.



The only sensitivities that can be adjusted in these collations are: "case" and "accent" ("width", "kana type" and "variation selector" are not available). Also, none of these collations support Supplementary Characters (which makes sense as those are Unicode-specific and these collations only apply to non-Unicode data).

This approach applies only to non-Unicode VARCHAR data. Each unique combination of locale, code page, case-sensitivity, and accent-sensitivity has a specific "sort ID", which you can see in the following example:

SELECT COLLATIONPROPERTY(N'SQL_Latin1_General_CP1_CI_AS', 'SortID'), -- 52
COLLATIONPROPERTY(N'SQL_Latin1_General_CP1_CS_AS', 'SortID'), -- 51
COLLATIONPROPERTY(N'Latin1_General_100_CI_AS', 'SortID'); -- 0


The only difference between the first two collations is the case-sensitivity. The third collation is a Windows collation and so does not have a static mapping table.

Also, these collations should sort and compare faster than the Windows collations due to being simple lookups for character to sort weight. However, these collations are also far less functional and should generally be avoided if at all possible.

Windows

Windows collations (those with names not starting with SQL_) are the newer (starting in SQL Server 2000) way of sorting/comparing. In this newer, complex, Unicode model, each combination of locale, code page, and the various sensitivities are not given a static mapping. For one thing, there are no code pages in this model. This model assigns a default sort value to each character, and then each locale/culture can re-assign sort values to any number of characters. This allows multiple cultures to use the same characters in different ways. This does have the affect of allowing for multiple languages to be sorted naturally using the same collation if they do not use the same characters (and if one of them does not need to re-assi

Context

StackExchange Database Administrators Q#249715, answer score: 27

Revisions (0)

No revisions yet.