patternsqlMajor
Why these characters are all equal in SQL Server?
Viewed 0 times
whytheseallareequalsqlservercharacters
Problem
I just don't get it. See this SQL query:
Based on transitive relation, it means that SQL Server considers them all to be the same character.
However, in other environments, say for example C#, they're not the same.
What I'm confused about is:
This of course results in tremendous problems, because I'm working on a text-processing application and data comes almost from everywhere and I need to normalize text before processing it.
If I know the reason of difference, I might find a solution to handle it.
Thank you.
select nchar(65217) -- ﻁ
select nchar(65218) -- ﻂ
select nchar(65219) -- ﻃ
select nchar(65220) -- ﻄ
if nchar(65217) = nchar(65218)
print 'equal'
if nchar(65217) = nchar(65219)
print 'equal'
if nchar(65217) = nchar(65220)
print 'equal'Based on transitive relation, it means that SQL Server considers them all to be the same character.
However, in other environments, say for example C#, they're not the same.
What I'm confused about is:
- How string comparison works in SQL Server
- Why comparison doesn't behave the same on one machine, and one platform, but different environments
- These 4 characters represent one human-understandable character. Why they are so abundant in Unicode character map?
This of course results in tremendous problems, because I'm working on a text-processing application and data comes almost from everywhere and I need to normalize text before processing it.
If I know the reason of difference, I might find a solution to handle it.
Thank you.
Solution
All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and Non-Unicode data.
SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (
The example below illustrates the differences between Windows and binary collation with the Teth character:
Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character
SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (
_BIN suffix) use the underlying code points to compare so equality comparisons return not-equal if the code points differ regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics for the more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS whereas legacy collations consider only single characters.The example below illustrates the differences between Windows and binary collation with the Teth character:
CREATE TABLE dbo.WindowsColationExample
(
Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
);
CREATE TABLE dbo.BinaryColationExample
(
Character1 nchar(1) COLLATE Arabic_100_BIN
, Character2 nchar(1) COLLATE Arabic_100_BIN
, Character3 nchar(1) COLLATE Arabic_100_BIN
, Character4 nchar(1) COLLATE Arabic_100_BIN
);
INSERT INTO dbo.BinaryColationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT INTO dbo.WindowsColationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
character1 = character2
OR character1 = character3
OR character1 = character4
OR character2 = character3
OR character2 = character4
OR character3 = character4;
--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character
ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).Code Snippets
CREATE TABLE dbo.WindowsColationExample
(
Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
);
CREATE TABLE dbo.BinaryColationExample
(
Character1 nchar(1) COLLATE Arabic_100_BIN
, Character2 nchar(1) COLLATE Arabic_100_BIN
, Character3 nchar(1) COLLATE Arabic_100_BIN
, Character4 nchar(1) COLLATE Arabic_100_BIN
);
INSERT INTO dbo.BinaryColationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT INTO dbo.WindowsColationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
character1 = character2
OR character1 = character3
OR character1 = character4
OR character2 = character3
OR character2 = character4
OR character3 = character4;
--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;Context
StackExchange Database Administrators Q#100643, answer score: 29
Revisions (0)
No revisions yet.