HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlMajor

Why does a comparison between 'tr' & 'tR' fail on a SQL Server with Vietnamese_CI_AI collation?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
vietnamese_ci_aiwhysqlwithfailcomparisonbetweendoescollationserver

Problem

There seems to be something special about 'tR' in Vietnamese collation. Appreicate if anyone who knows about it can explain in simple terms.
This issue was discovered during the installation of our product on a "Vietnamese" collated SQL Server. One of the tables in the schema has 'tR' in it's name but a stored procedure is referencing the table in all lower case 'tr'. And this reference fails.

I guess this situation is analogous to '阝' matching 'ss' in other collations.

Here is a reproduction:
select case when 'tr' = 'tR' COLLATE SQL_Latin1_General_CP1_CI_AS then 'match' else 'no match' end
select case when 'tr' = 'tR' COLLATE Vietnamese_CI_AI then 'match' else 'no match' end
select case when 'tr' = 'TR' COLLATE Vietnamese_CI_AI then 'match' else 'no match' end


Results:

-----
match

--------
no match

-----
match


The second T-SQL produces the mismatch. Other combinations of 't' and 'R' do not.

Solution

Given that this behavior is present in the newer version of that collation, and that combinations such as "fr" and "fR" do match (as expected), it could only be culture-specific linguistic rules for that combination of characters.
SELECT CASE WHEN 'tr' = 'tR' COLLATE Vietnamese_100_CI_AI THEN 'Y' ELSE 'N' END;
-- N

SELECT CASE WHEN 'fr' = 'fR' COLLATE Vietnamese_100_CI_AI THEN 'Y' ELSE 'N' END;
-- Y


I found the rule in the sort weight file **. It's the combination of "tr" that's special (in Vietnamese), not "tR". It seems that Vietnamese has certain letter combinations that combine to form a single character, such as the "CH" and "LL" combinations in Spanish. So, the following are valid combinations for the "character" that "T" + "R" combine to form in Vietnamese:

  • tr



  • Tr



  • TR



The combination of "tR" does not form the "TR" character, most likely because that is an unnatural capitalization that more so implies separation of words, such as with Pascal / Camel -casing (e.g. "ChatRoom" and "chatRoom", respectively, as opposed to "tRogdor the Burninator") (educated guess on my part).

The following example shows that the "tr" combination sorts after "tz":
SELECT *
FROM (VALUES (N'Atra'), (N'Atz'), (N'Aua'), (N'Ata'), (N'AtR')) tmp(col)
ORDER BY tmp.[col] COLLATE Vietnamese_100_CI_AI ASC
/*
Ata
AtR
Atz
Atra
Aua
*/


Those results are due to "tr" combining to form a single character that naturally sorts after "t". Meaning, the sorting algorithm sees the following:
Character #
1 | 2 | 3
-------------
A | t | a
A | t | R
A | t | z
A | tr | a
A | u | a


Vietnamese has other two-letter combinations that work the same way as "TR" (i.e. case-insensitive: tr == Tr == TR <> tR ):

  • CH



  • GI



  • KH



  • NG



  • NH



  • PH



  • QU



  • TH



For more info on working with collations / encodings / Unicode, please visit my site: Collations Info

** A sort weight file contains code points and their respective weight values which are separated into categories such as diacritic weight, case weight, etc. There are usually sections that describe transformations, such a combining two code points to be a single weight for a particular culture/locale (e.g. Vietnamese). There can be mappings for decomposing pre-composed characters into individual characters, etc. Please see: Accessing the Windows Sorting Weight Table.

NOTE: Microsoft provides several sort weight files since they have been updated over the years as new versions of Windows and Windows Server have been released. Please keep in mind that none of those files is a 100% match for the rules that SQL Server uses. The closest we get is the Windows Server 2008 Sorting Weight Table.txt file, which should closely match the behavior of the version 100 collations (i.e. those with _100_ in their names).

Context

StackExchange Database Administrators Q#298499, answer score: 31

Revisions (0)

No revisions yet.