patternsqlMinor
What is the collation used while comparing Unicode string literals in SQL Server 2019?
Viewed 0 times
literalsthewhilecomparingwhatsqlused2019unicodecollation
Problem
My understanding is that the collation for comparing Unicode string literals is determined by the database collation.
My database is using
When I compare N'ß' to 'ss', I expect the comparison to fail. But it does not. I am trying to figure out why. Here is the reproduction:
Output:
I expect 'ß' and 'ss' to match but not if my database is at SQL_Latin1_General_CP1_CI_AS. I wanted to check if there was a gap in my understanding which currently stands as "For Nvarchar() or Nchar() datatype, the SQL Server grabs the collation from the database setting". This translates to "my database is at SQL_Latin1_General_CP1_CI_AS and thus the comparison should fail. If it is succeeding, what collation is the SQL Server using? And why?
It is still not
My database is using
SQL_Latin1_General_CP1_CI_AS collation.When I compare N'ß' to 'ss', I expect the comparison to fail. But it does not. I am trying to figure out why. Here is the reproduction:
set nocount on
go
use tempdb
go
SELECT
@@version as SqlServerVersion,
CONVERT(nvarchar(128), SERVERPROPERTY('collation')) as SqlServerCollation,
DB_NAME() AS DatabaseName
,DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS CollationUsedBySQLServerDatabase
GO
declare @ss varchar(255) = 'ss'
declare @Nscharfess nvarchar(255) = N'ß'
declare @scharfess varchar(255) = 'ß'
select case when @Nscharfess = @ss then 'Unicode : Strings match' else 'Unicode : Strings do not match' end,
case when @scharfess = @ss then 'SQL_Latin1_General_CP1_CI_AS : Strings match' else 'SQL_Latin1_General_CP1_CI_AS : Strings do not match' endOutput:
SqlServerVersion
Microsoft SQL Server 2019 (RTM) - 15.0.2000.5 (X64)
Sep 24 2019 13:48:23
Copyright (C) 2019 Microsoft Corporation
Standard Edition (64-bit) on Windows 10 Enterprise 10.0 (Build 19045: ) (Hypervisor)
SqlServerCollation
SQL_Latin1_General_CP1_CI_AS
DatabaseName
tempdb
CollationUsedBySQLServerDatabase
SQL_Latin1_General_CP1_CI_AS
Unicode : Strings match
SQL_Latin1_General_CP1_CI_AS : Strings do not match
Completion time: 2024-02-15T00:16:59.1968376-08:00
I expect 'ß' and 'ss' to match but not if my database is at SQL_Latin1_General_CP1_CI_AS. I wanted to check if there was a gap in my understanding which currently stands as "For Nvarchar() or Nchar() datatype, the SQL Server grabs the collation from the database setting". This translates to "my database is at SQL_Latin1_General_CP1_CI_AS and thus the comparison should fail. If it is succeeding, what collation is the SQL Server using? And why?
It is still not
Solution
My understanding is that the collation for comparing Unicode string literals is determined by the database collation.
Yes, this is correct. In fact, the local database's default collation is used for comparing both Unicode and non-Unicode string literals (and variables / parameters). This covers the following datatypes:
When I compare N'ß' to 'ss', I expect the comparison to fail. But it does not.
The difference in your two comparisons is that the non-Unicode comparison, when using a SQL Server collation (i.e. one with a name starting with
The SQL Server collations, when comparing or sorting non-Unicode data, use a set of obsolete rules (Microsoft refers to these rules as "Sort Orders") that are far more limited than the Unicode rules. These obsolete rules should not set expectations (unless your use-case requires that legacy behavior).
To put it differently, when working with Unicode data, regardless of Windows or non-Windows collation, Unicode sorting and comparison rules are used. Likewise, when working with non-Unicode data, IF you are using a Windows collation, it will be the Unicode sorting and comparison rules that are used. However, when working with non-Unicode data, IF you are using a SQL Server collation, then the simplistic Sort Orders will be used for sorting and comparison.
So, what's the difference between Unicode rules and Sort Orders, and which one is "correct"? Well, we need to first look into why (linguistically), and how (technically), "ß" equates to "ss".
WHY
According to the Wikipedia article for ß
In German orthography, the letter ß, called Eszett (IPA: [ɛsˈtsɛt]) or scharfes S (IPA: [ˌʃaʁfəs ˈʔɛs], "sharp S"), represents the /s/ phoneme in Standard German when following long vowels and diphthongs. The letter-name Eszett combines the names of the letters of ⟨s⟩ (Es) and ⟨z⟩ (Zett) in German. The character's Unicode names in English are sharp s and eszett. The Eszett letter is used only in German, and can be typographically replaced with the double-s digraph ⟨ss⟩, if the ß-character is unavailable. In the 20th century, the ß-character was replaced with ss in the spelling of Swiss Standard German (Switzerland and Liechtenstein), while remaining Standard German spelling in other varieties of the German language.
HOW
Unicode collating is far more complex than the simple Sort Order mappings and allows for a variety of sorting / comparison rules. This includes "expansions", which will translate a character such as "ß" into two Latin "s" characters (i.e. "ss").
Expansions are not something that the non-Unicode Sort Order functionality can do. Not even the two German Sort Orders can make this comparison:
returns no rows.
CORRECTNESS
Given the above info, we can see that: A) "ß == ss" is a valid linguistic comparison (at least in German), B) this comparison can be made using Unicode rules (which is why the comparison succeeds when working with Unicode data as well as non-Unicode data using a Windows Collation), and C) the comparison cannot be made when working with non-Unicode data using a SQL Server Collation.
So, why does ß == ss, even when using a non-German collation? And, should it be this way?
Rather than having a distinct set of rules per each locale (i.e. the Sort Order approach), Unicode collating starts with a baseline default set of rules that govern all locales. This default set of rules covers US English plus rules from most/all other locales. If a particular locale needs a different rule, it can override the defaults. This is how, even if you are using a Hebrew, Chinese, etc collation (i.e. languages that do not use Latin characters), comparisons operating on English characters will still work correctly for the US English rules.
The "rule" allowing ß to equal ss is a default rule, and only the Hungarian Technical locale overrides this rule. This is why ß is equal to ss for all non-binary Windows collations that are not "Hungarian Technical".
EXPECTATIONS
While ß is not an English character, having it equate to ss is just as valid or invalid as any other rule for non-English languages when using characters not found in the culture of the collation being used.
Given that the Unicode rules are far more accurate for capturing the complexities of most (all?) languages, and that the older Sort Order approach is too simplistic and obs
Yes, this is correct. In fact, the local database's default collation is used for comparing both Unicode and non-Unicode string literals (and variables / parameters). This covers the following datatypes:
CHAR, VARCHAR (which covers '...' string literals), NCHAR, and NVARCHAR (which covers N'...' string literals). Also, TEXT and NTEXT, but don't use those as they are deprecated and less functional/performant than their replacements: VARCHAR(MAX) and NVARCHAR(MAX), respectively).When I compare N'ß' to 'ss', I expect the comparison to fail. But it does not.
The difference in your two comparisons is that the non-Unicode comparison, when using a SQL Server collation (i.e. one with a name starting with
SQL_), is not using Unicode comparison rules, whereas if you were using a Windows collation, it would use the same comparison rules for both Unicode and non-Unicode comparisons.The SQL Server collations, when comparing or sorting non-Unicode data, use a set of obsolete rules (Microsoft refers to these rules as "Sort Orders") that are far more limited than the Unicode rules. These obsolete rules should not set expectations (unless your use-case requires that legacy behavior).
To put it differently, when working with Unicode data, regardless of Windows or non-Windows collation, Unicode sorting and comparison rules are used. Likewise, when working with non-Unicode data, IF you are using a Windows collation, it will be the Unicode sorting and comparison rules that are used. However, when working with non-Unicode data, IF you are using a SQL Server collation, then the simplistic Sort Orders will be used for sorting and comparison.
So, what's the difference between Unicode rules and Sort Orders, and which one is "correct"? Well, we need to first look into why (linguistically), and how (technically), "ß" equates to "ss".
WHY
According to the Wikipedia article for ß
In German orthography, the letter ß, called Eszett (IPA: [ɛsˈtsɛt]) or scharfes S (IPA: [ˌʃaʁfəs ˈʔɛs], "sharp S"), represents the /s/ phoneme in Standard German when following long vowels and diphthongs. The letter-name Eszett combines the names of the letters of ⟨s⟩ (Es) and ⟨z⟩ (Zett) in German. The character's Unicode names in English are sharp s and eszett. The Eszett letter is used only in German, and can be typographically replaced with the double-s digraph ⟨ss⟩, if the ß-character is unavailable. In the 20th century, the ß-character was replaced with ss in the spelling of Swiss Standard German (Switzerland and Liechtenstein), while remaining Standard German spelling in other varieties of the German language.
HOW
Unicode collating is far more complex than the simple Sort Order mappings and allows for a variety of sorting / comparison rules. This includes "expansions", which will translate a character such as "ß" into two Latin "s" characters (i.e. "ss").
Expansions are not something that the non-Unicode Sort Order functionality can do. Not even the two German Sort Orders can make this comparison:
SELECT 'German-PhoneBook: 1141' WHERE 'ß' = 'ss' COLLATE SQL_EBCDIC1141_CP1_CS_AS;
SELECT 'German-PhoneBook: 273' WHERE 'ß' = 'ss' COLLATE SQL_EBCDIC273_CP1_CS_AS;
returns no rows.
CORRECTNESS
Given the above info, we can see that: A) "ß == ss" is a valid linguistic comparison (at least in German), B) this comparison can be made using Unicode rules (which is why the comparison succeeds when working with Unicode data as well as non-Unicode data using a Windows Collation), and C) the comparison cannot be made when working with non-Unicode data using a SQL Server Collation.
So, why does ß == ss, even when using a non-German collation? And, should it be this way?
Rather than having a distinct set of rules per each locale (i.e. the Sort Order approach), Unicode collating starts with a baseline default set of rules that govern all locales. This default set of rules covers US English plus rules from most/all other locales. If a particular locale needs a different rule, it can override the defaults. This is how, even if you are using a Hebrew, Chinese, etc collation (i.e. languages that do not use Latin characters), comparisons operating on English characters will still work correctly for the US English rules.
The "rule" allowing ß to equal ss is a default rule, and only the Hungarian Technical locale overrides this rule. This is why ß is equal to ss for all non-binary Windows collations that are not "Hungarian Technical".
EXPECTATIONS
While ß is not an English character, having it equate to ss is just as valid or invalid as any other rule for non-English languages when using characters not found in the culture of the collation being used.
Given that the Unicode rules are far more accurate for capturing the complexities of most (all?) languages, and that the older Sort Order approach is too simplistic and obs
Code Snippets
Latin1_General_CI_AS : Strings matchContext
StackExchange Database Administrators Q#335859, answer score: 8
Revisions (0)
No revisions yet.