gotchasqlModerate

What is difference between Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8?

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

whatarabic_100_cs_as_ks_ws_sc_utf8differencebetweenlatin1_general_100_cs_as_ks_ws_sc_utf8and

Problem

Starting from SQL Server 2019, it supports UTF-8 as collation. However, according to the following queries:

SELECT COLLATIONPROPERTY('Arabic_100_CS_AS_KS_WS_SC_UTF8', 'CodePage')
SELECT COLLATIONPROPERTY('Latin1_General_100_CS_AS_KS_WS_SC_UTF8', 'CodePage');

both return code page 65001 which is Unicode in Windows. Also all new _UTF8 collations use code page 65001:

SELECT * FROM sys.fn_helpcollations() WHERE name LIKE '%_UTF8';

Are there any differences between using Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8 as collation?

Solution

Yes, all _UTF8 collations use code page 65001 as that is the code page for UTF-8. You can even use 65001 in a DOS / Command Window via:

chcp 65001

though not all programs and fonts will work seamlessly with it.

For _UTF8 collations, the code page is not controlled by the culture (i.e. Latin1_General vs Arabic) as it is for non-_UTF8 collations because code pages indicate the specific 8-bit encoding used for VARCHAR data (i.e. 8-bit character data). For non-Unicode 8-bit encodings, the culture is often tied to the code page which is the character set (e.g. Latin1 is code page Windows-1252 which has different characters in the 128-255 range than Windows-1255 which is the code page for Hebrew). But for UTF-8, it is the 8-bit encoding for the singular, all-encompassing character set that is Unicode.

As far as differences between Arabic_100_CS_AS_KS_WS_SC_UTF8 and Latin1_General_100_CS_AS_KS_WS_SC_UTF8 go, it would really only be the culture-specific rules for sorting and comparison of various characters. Of course, those two languages do not really share any characters, but there can still be differences in how some code points are handled.

Looking through the "Windows Server 2008 Sorting Weight Table" file (which is what the version _100_ collations are mostly based on, from what I'm told), I can't find any sort/comparison differences between those two collations. So, they are likely the same in terms of behavior. However, they are not the same in the sense that they still have a different LCID (the locale/culture identifier) so converting their values to non-UTF8 VARCHAR could result in data loss/corruption, and any process/functionality looking at the collation to determine some other behavior might behave differently.

That being said, I did find an example of a difference in behavior for Arabic characters when using an Urdu collation as those collations do have a few modifications to the default sort weights (9 registered in the "Windows Server 2008 Sorting Weight Table" file).

Looking at the "Teh Marbuta" character (U+0629), it has a weight of 29 in the default table (i.e. the table used for US English / Latin1), which has a lower sort weight than the "Peheh" character (U+06A6), which has a default weight of 137. The 41 indicates which "script" the character is in, and these are both Arabic characters. However, the Urdu collations modify the sort weight of "Teh Marbuta" (U+0629) to be 183, which then has a higher sort weight than "Peheh" (U+06A6), still being 137.

-- Default
0x0629  41  29  2   2   ;Arabic Teh Marbuta -- ة
0x06a6  41  137 2   2   ;Arabic Peheh       -- ڦ

-- Urdu modifications
0x0629  41  183 2   2   ;Teh Marbuta        -- ة

If we sort those two characters using Latin1_General_100_CS_AS_KS_WS_SC_UTF8 or Arabic_100_CS_AS_KS_WS_SC_UTF8, we should get the default behavior. And, even if we use a Yakut collation, which uses the Cyrillic script and has its own modifications to the default sort weights, it doesn't modify either of these Arabic characters, hence they should behave the same as when using a Latin1_General or Arabic collation:

SELECT *
FROM   (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Latin1_General_100_CS_AS_KS_WS_SC_UTF8 ASC

SELECT *
FROM   (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8 ASC

SELECT *
FROM   (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Yakut_100_CS_AS_KS_WS_SC_UTF8 ASC

All three of the queries shown above return the following results:

ID    TheChar
1     ة
2     ڦ

However, when we switch to an Urdu collation, the order of those two characters does indeed change:

SELECT *
FROM   (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar)
ORDER BY tmp.[TheChar] COLLATE Urdu_100_CS_AS_SC_UTF8 ASC

returns:

ID    TheChar
2     ڦ
1     ة

Finally, keep in mind that while rare to run into this, collations can also affect upper/lower -case mappings. I believe this is confined to only the Azeri_* and Turkish collations, and only for the letters 'i' and 'I' (those cultures have a dotted upper-case 'I' and a dotless lower-case 'i'), but still best to be aware of the potential:

SELECT UPPER(N'i' COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8) AS [Arabic],
   UPPER(N'i' COLLATE Turkish_100_CS_AS_KS_WS_SC_UTF8) AS [Turkish],
   UPPER(N'i' COLLATE Azeri_Cyrillic_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Cyrillic],
   UPPER(N'i' COLLATE Azeri_Latin_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Latin];

returns:

Arabic   Turkish   Azeri_Cyrillic   Azeri_Latin
I        İ         İ                İ

Code Snippets

-- Default
0x0629  41  29  2   2   ;Arabic Teh Marbuta -- ة
0x06a6  41  137 2   2   ;Arabic Peheh       -- ڦ

-- Urdu modifications
0x0629  41  183 2   2   ;Teh Marbuta        -- ة

ID    TheChar
1     ة
2     ڦ

ID    TheChar
2     ڦ
1     ة

Arabic   Turkish   Azeri_Cyrillic   Azeri_Latin
I        İ         İ                İ

Context

StackExchange Database Administrators Q#294387, answer score: 11

Revisions (0)

No revisions yet.