gotchasqlModerate
What is difference between Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8?
Viewed 0 times
whatarabic_100_cs_as_ks_ws_sc_utf8differencebetweenlatin1_general_100_cs_as_ks_ws_sc_utf8and
Problem
Starting from SQL Server 2019, it supports UTF-8 as collation. However, according to the following queries:
both return code page
Are there any differences between using
SELECT COLLATIONPROPERTY('Arabic_100_CS_AS_KS_WS_SC_UTF8', 'CodePage')
SELECT COLLATIONPROPERTY('Latin1_General_100_CS_AS_KS_WS_SC_UTF8', 'CodePage');both return code page
65001 which is Unicode in Windows. Also all new _UTF8 collations use code page 65001:SELECT * FROM sys.fn_helpcollations() WHERE name LIKE '%_UTF8';Are there any differences between using
Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8 as collation?Solution
Yes, all
though not all programs and fonts will work seamlessly with it.
For
As far as differences between
Looking through the "Windows Server 2008 Sorting Weight Table" file (which is what the version
That being said, I did find an example of a difference in behavior for Arabic characters when using an Urdu collation as those collations do have a few modifications to the default sort weights (9 registered in the "Windows Server 2008 Sorting Weight Table" file).
Looking at the "Teh Marbuta" character (U+0629), it has a weight of 29 in the default table (i.e. the table used for US English / Latin1), which has a lower sort weight than the "Peheh" character (U+06A6), which has a default weight of 137. The 41 indicates which "script" the character is in, and these are both Arabic characters. However, the Urdu collations modify the sort weight of "Teh Marbuta" (U+0629) to be 183, which then has a higher sort weight than "Peheh" (U+06A6), still being 137.
If we sort those two characters using
All three of the queries shown above return the following results:
However, when we switch to an
returns:
Finally, keep in mind that while rare to run into this, collations can also affect upper/lower -case mappings. I believe this is confined to only the
returns:
_UTF8 collations use code page 65001 as that is the code page for UTF-8. You can even use 65001 in a DOS / Command Window via:chcp 65001though not all programs and fonts will work seamlessly with it.
For
_UTF8 collations, the code page is not controlled by the culture (i.e. Latin1_General vs Arabic) as it is for non-_UTF8 collations because code pages indicate the specific 8-bit encoding used for VARCHAR data (i.e. 8-bit character data). For non-Unicode 8-bit encodings, the culture is often tied to the code page which is the character set (e.g. Latin1 is code page Windows-1252 which has different characters in the 128-255 range than Windows-1255 which is the code page for Hebrew). But for UTF-8, it is the 8-bit encoding for the singular, all-encompassing character set that is Unicode.As far as differences between
Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8 go, it would really only be the culture-specific rules for sorting and comparison of various characters. Of course, those two languages do not really share any characters, but there can still be differences in how some code points are handled.Looking through the "Windows Server 2008 Sorting Weight Table" file (which is what the version
_100_ collations are mostly based on, from what I'm told), I can't find any sort/comparison differences between those two collations. So, they are likely the same in terms of behavior. However, they are not the same in the sense that they still have a different LCID (the locale/culture identifier) so converting their values to non-UTF8 VARCHAR could result in data loss/corruption, and any process/functionality looking at the collation to determine some other behavior might behave differently.That being said, I did find an example of a difference in behavior for Arabic characters when using an Urdu collation as those collations do have a few modifications to the default sort weights (9 registered in the "Windows Server 2008 Sorting Weight Table" file).
Looking at the "Teh Marbuta" character (U+0629), it has a weight of 29 in the default table (i.e. the table used for US English / Latin1), which has a lower sort weight than the "Peheh" character (U+06A6), which has a default weight of 137. The 41 indicates which "script" the character is in, and these are both Arabic characters. However, the Urdu collations modify the sort weight of "Teh Marbuta" (U+0629) to be 183, which then has a higher sort weight than "Peheh" (U+06A6), still being 137.
-- Default
0x0629 41 29 2 2 ;Arabic Teh Marbuta -- ة
0x06a6 41 137 2 2 ;Arabic Peheh -- ڦ
-- Urdu modifications
0x0629 41 183 2 2 ;Teh Marbuta -- ةIf we sort those two characters using
Latin1_General_100_CS_AS_KS_WS_SC_UTF8 or Arabic_100_CS_AS_KS_WS_SC_UTF8, we should get the default behavior. And, even if we use a Yakut collation, which uses the Cyrillic script and has its own modifications to the default sort weights, it doesn't modify either of these Arabic characters, hence they should behave the same as when using a Latin1_General or Arabic collation:SELECT *
FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Latin1_General_100_CS_AS_KS_WS_SC_UTF8 ASC
SELECT *
FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8 ASC
SELECT *
FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Yakut_100_CS_AS_KS_WS_SC_UTF8 ASC
All three of the queries shown above return the following results:
ID TheChar
1 ة
2 ڦHowever, when we switch to an
Urdu collation, the order of those two characters does indeed change:SELECT *
FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Urdu_100_CS_AS_SC_UTF8 ASC
returns:
ID TheChar
2 ڦ
1 ةFinally, keep in mind that while rare to run into this, collations can also affect upper/lower -case mappings. I believe this is confined to only the
Azeri_* and Turkish collations, and only for the letters 'i' and 'I' (those cultures have a dotted upper-case 'I' and a dotless lower-case 'i'), but still best to be aware of the potential:SELECT UPPER(N'i' COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8) AS [Arabic],
UPPER(N'i' COLLATE Turkish_100_CS_AS_KS_WS_SC_UTF8) AS [Turkish],
UPPER(N'i' COLLATE Azeri_Cyrillic_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Cyrillic],
UPPER(N'i' COLLATE Azeri_Latin_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Latin];
returns:
Arabic Turkish Azeri_Cyrillic Azeri_Latin
I İ İ İCode Snippets
-- Default
0x0629 41 29 2 2 ;Arabic Teh Marbuta -- ة
0x06a6 41 137 2 2 ;Arabic Peheh -- ڦ
-- Urdu modifications
0x0629 41 183 2 2 ;Teh Marbuta -- ةID TheChar
1 ة
2 ڦID TheChar
2 ڦ
1 ةArabic Turkish Azeri_Cyrillic Azeri_Latin
I İ İ İContext
StackExchange Database Administrators Q#294387, answer score: 11
Revisions (0)
No revisions yet.