HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlModerate

Accent Sensitive Sort

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
sortaccentsensitive

Problem

Why do these two SELECT statements result in a different sort order?

USE tempdb;
CREATE TABLE dbo.OddSort 
(
    id INT IDENTITY(1,1) PRIMARY KEY
    , col1 NVARCHAR(2)
    , col2 NVARCHAR(2)
);
GO
INSERT dbo.OddSort (col1, col2) 
VALUES (N'e', N'eA')
    , (N'é', N'éB')
    , (N'ë', N'ëC')
    , (N'è', N'èD')
    , (N'ê', N'êE')
    , (N'ē', N'ēF');
GO

SELECT * 
FROM dbo.OddSort 
ORDER BY col1 COLLATE Latin1_General_100_CS_AS;


╔════╦══════╦══════╗
║ id ║ col1 ║ col2 ║
╠════╬══════╬══════╣
║ 1 ║ e ║ eA ║
║ 2 ║ é ║ éB ║
║ 4 ║ è ║ èD ║ -- should be id 3?
║ 5 ║ ê ║ êE ║
║ 3 ║ ë ║ ëC ║
║ 6 ║ ē ║ ēF ║
╚════╩══════╩══════╝

SELECT * 
FROM dbo.OddSort 
ORDER BY col2 COLLATE Latin1_General_100_CS_AS;


╔════╦══════╦══════╗
║ id ║ col1 ║ col2 ║
╠════╬══════╬══════╣
║ 1 ║ e ║ eA ║
║ 2 ║ é ║ éB ║
║ 3 ║ ë ║ ëC ║
║ 4 ║ è ║ èD ║
║ 5 ║ ê ║ êE ║
║ 6 ║ ē ║ ēF ║
╚════╩══════╩══════╝

Solution

The behavior you are seeing here is due, in a general sense, to the fact that the Unicode Collation Algorithm (UCA) allows for complex, multi-level sorting. More specifically:

-
Sorting is not Comparison:

Determining whether two strings are the same or different is fairly straight forward (given a particular locale/language and set of sensitivities). But determining the order of 2 or more strings can be highly complex.

-
Sorting is done in a series of steps, with each step applied to the entire string, not character by character:

  • Standard: sort base characters (regardless of accent and case differences)



  • IF Accent-sensitive, apply accent / diacritic weights



  • IF Case-sensitive, apply casing weights



When you sort by col1 (single character), it first determines that all characters have the same weight since they are all "e". Next, it applies the accent / diacritic weights. There are no casing differences so the third step wouldn't change anything. So the only differences are in step 2, which is why there is a preferred order for those rows based on col1.

When you sort by col2 (two characters), it first determines that each row has a different weight since both characters are used to determine the sort weight (e.g. "ea", "eb", etc). Next, it applies the accent / diacritic weights. There are no casing differences so the third step wouldn't change anything. So there are differences in steps 1 and 2 this time. But since the differences in step 1 have already been applied to each string before the weights of step 2 are considered, the weights from step 2 don't have any effect on the ordering; they would only apply if weights from step 1 for two or more rows were the same.

The following adaptation of the sample code from the question hopefully illustrates the sorting behavior described above. I added some additional rows and an additional column to help show the impact of the Collation being case-sensitive (since the original sample data is all lower-case):

SETUP

USE [tempdb];

-- DROP TABLE dbo.OddSort;
CREATE TABLE dbo.OddSort
(
    id INT IDENTITY(1,1) PRIMARY KEY,
    col1 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
    col2 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
    col3 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS
);
GO

INSERT dbo.OddSort (col1, col2, col3)
VALUES (N'e', N'eA', N'e A')
     , (N'ê', N'êE', N'ê E')
     , (N'é', N'éH', N'é H')
     , (N'ë', N'ëC', N'ë C')
     , (N'E', N'EG', N'E G')
     , (N'Ë', N'ëh', N'ë h')
     , (N'è', N'èD', N'è D')
     , (N'é', N'éB', N'é B')
     , (N'ë', N'ëH', N'ë H')
     , (N'ē', N'ēF', N'ē F');


TEST 1

SELECT [id], [col1], UNICODE([col1]) AS [CodePoint]
FROM dbo.OddSort 
ORDER BY col1;


Returns:

id col1 CodePoint
1 e 101
5 E 69
8 é 233
3 é 233
7 è 232
2 ê 234
4 ë 235
9 ë 235
6 Ë 203
10 ē 275


What we can see in the results above:

  • The Code Point is not determining the sort order



  • The non-accented characters are sorted before the accented characters (within the same letter: f would still come after all of these). Clearly, accent weights are applied before case weights.



  • Lower-case sorts before upper-case within the same accented (or non-accented) character (i.e. the e then E, and the ë then Ë). This tailoring is used by most of the Windows Collations, while most of the SQL Server Collations sort upper-case first.



TEST 2

SELECT [id], [col2]
FROM dbo.OddSort
ORDER BY col2;


Returns:

id col2
1 eA
8 éB
4 ëC
7 èD
2 êE
10 ēF
5 EG
3 éH
6 ëh
9 ëH


What we can see in the results above:

  • First-level sorting truly is the base characters. If it were accents / diacritics then the ëC (id = 4), ēF (id = 10), and EG (id = 5) rows would not be where they are. If it were casing, then the EG (id = 5) row would not be where it is.



  • Second-level sorting truly is the accents / diacritics. This explains why the last three rows are éH -> ëh -> ëH instead of ëh -> éH -> ëH (i.e. IDs 3 -> 6 -> 9 instead of 6 -> 3 -> 9).



  • Third-level sorting truly is the casing. This is why the last 2 rows are ëh -> ëH, since lower-case sorts first.



TEST 3

SELECT [id], [col3]
FROM dbo.OddSort
ORDER BY col3;


Returns:

id col3
1 e A
8 é B
4 ë C
7 è D
2 ê E
10 ē F
5 E G
3 é H
6 ë h
9 ë H


What we can see in the results above:

  • The sort order is exactly the same as in Test 2. The only difference in the test values here is that there is a space between each character, removing the possibility of contextual rules. Hence, we know that the reason for the difference in sort order for col2 in the question is again due to "Multi-Level Comparison", and not "Contextual Sensitivity".



Additional notes:

-
With regarding to getting the exact rules, that is not as easy as it should be. The problem with getting concrete explanations of these rules

Code Snippets

USE [tempdb];

-- DROP TABLE dbo.OddSort;
CREATE TABLE dbo.OddSort
(
    id INT IDENTITY(1,1) PRIMARY KEY,
    col1 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
    col2 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
    col3 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS
);
GO

INSERT dbo.OddSort (col1, col2, col3)
VALUES (N'e', N'eA', N'e A')
     , (N'ê', N'êE', N'ê E')
     , (N'é', N'éH', N'é H')
     , (N'ë', N'ëC', N'ë C')
     , (N'E', N'EG', N'E G')
     , (N'Ë', N'ëh', N'ë h')
     , (N'è', N'èD', N'è D')
     , (N'é', N'éB', N'é B')
     , (N'ë', N'ëH', N'ë H')
     , (N'ē', N'ēF', N'ē F');
SELECT [id], [col1], UNICODE([col1]) AS [CodePoint]
FROM dbo.OddSort 
ORDER BY col1;

Context

StackExchange Database Administrators Q#199853, answer score: 17

Revisions (0)

No revisions yet.