HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

N'Șc' considered duplicate key of N'C' using Latin1_General_CI_AS collation

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
duplicatelatin1_general_ci_asconsideredusingcollationkey

Problem

I have a table with a unique key that includes an NVARCHAR(50) column (correct or not, but is there). So, when trying to insert Șc or C (doesn't matter the order of the insert) it breaks on the 2nd insert due to collation issues. Here is the error:


(1 row(s) affected) Msg 2601, Level 14, State 1, Line 16
Cannot insert duplicate key row in object 'dbo.testT' with unique index 'IX_TestT'. The duplicate key value is (C).

Select returns:

Database default collation is Latin1_General_CI_AS. Spent some time looking on how to solve it, without changing too much the already existing structure, but can't find a way to getting working. Tried different collations and combinations, everything fails. Read (here and here) about character expansions and so on, still stuck. Here is a sample code that I'm using to replicate the issue, feel free to modify and recommend anything that could help solve this.

CREATE TABLE testT (
    [Default_Collation]     [NVARCHAR] (50) COLLATE DATABASE_DEFAULT,
    [Latin1_General_CI_AS]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AS,
    [Latin1_General_CI_AI]  [NVARCHAR] (50) COLLATE Latin1_General_CI_AI,
    [SQL_Collation]         [NVARCHAR] (50) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE UNIQUE CLUSTERED INDEX [IX_TestT] ON [dbo].[testT] ([Default_Collation])
ON [PRIMARY]
GO

INSERT INTO testT
SELECT  N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc',  --COLLATE Latin1_General_CI_AS
        N'Șc'   --COLLATE Latin1_General_CI_AS

INSERT INTO testT
SELECT  N'C'    --COLLATE Latin1_General_CI_AS 
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE Latin1_General_CI_AS
        ,N'C'   --COLLATE SQL_Latin1_General_CP1_CI_AS

SELECT * FROM testT;

DROP TABLE testT;

Solution

The problem is that the old SQL Server Collations (i.e. those with names starting with SQL_) and the first two versions of the Windows Collations (the 80 series which came with SQL Server 2000 and do not have a version number in the name, and the 90 series which came with SQL Server 2005) are missing the sort weights for a great many characters. This was mostly corrected starting with the 100 series Collations that came with SQL Server 2008.

As you can see in the below examples, the Ș character matches an empty string when using non-binary, version 80 or 90 Collations (and SQL Server Collations) since they both have the same sort weight: 0. Nothing. Nada. This means that when comparing N'Șc' with N'C' (using pre-series 100 Collations), you are really comparing N'c' with N'C' (test #1):

SELECT 1 WHERE N'Șc' = N'C' COLLATE Latin1_General_CS_AS;
-- no result (due to "c" and "C" being different case)

SELECT 2 WHERE N'Ș' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;
SELECT 3 WHERE N'Ș' = N'' COLLATE Latin1_General_CI_AS;

SELECT 4 WHERE N'Ș' = N'' COLLATE Latin1_General_BIN2;
-- no result (due to "Ș" still being a code point and empty string has no code points)

SELECT 5 WHERE N'Ș' = N'' COLLATE Latin1_General_100_CI_AS;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

SELECT 6 WHERE N'Ș' = N'' COLLATE Chinese_PRC_CI_AI;
SELECT 7 WHERE N'Ș' = N'' COLLATE Chinese_PRC_90_CI_AI;

SELECT 8 WHERE N'Ș' = N'' COLLATE Indic_General_90_CI_AI;
SELECT 9 WHERE N'Ș' = N'' COLLATE Indic_General_100_CI_AI;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)


So, unfortunately you will need to drop the PK, alter the column to have a 100-level Collation (e.g. Latin1_General_100_CI_AS_SC), and then recreate the PK. Please note that the difference in that suggested Collation from the current Collation is both the 100 and the _SC at the end, which allows it to properly handle supplementary characters.

This does not mean that you need to:

  • change the Collations of other tables (unless they have the same setup of NVARCHAR in the PK)



  • change the Database's default Collation. The main issue with not changing the DB's Collation is that there will be a behavior difference between doing table.column = N'Ș' and @variable = N'Ș' since variables and string literals use the Database's default Collation.



For more examples of this behavior, please see the "Supplementary Characters" section of the following blog post of mine:

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)

Code Snippets

SELECT 1 WHERE N'Șc' = N'C' COLLATE Latin1_General_CS_AS;
-- no result (due to "c" and "C" being different case)

SELECT 2 WHERE N'Ș' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;
SELECT 3 WHERE N'Ș' = N'' COLLATE Latin1_General_CI_AS;

SELECT 4 WHERE N'Ș' = N'' COLLATE Latin1_General_BIN2;
-- no result (due to "Ș" still being a code point and empty string has no code points)

SELECT 5 WHERE N'Ș' = N'' COLLATE Latin1_General_100_CI_AS;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

SELECT 6 WHERE N'Ș' = N'' COLLATE Chinese_PRC_CI_AI;
SELECT 7 WHERE N'Ș' = N'' COLLATE Chinese_PRC_90_CI_AI;

SELECT 8 WHERE N'Ș' = N'' COLLATE Indic_General_90_CI_AI;
SELECT 9 WHERE N'Ș' = N'' COLLATE Indic_General_100_CI_AI;
-- no result (due to "Ș" finally having a sort weight in 100 series Collations)

Context

StackExchange Database Administrators Q#207722, answer score: 11

Revisions (0)

No revisions yet.