HiveBrain v1.2.0
Get Started
← Back to all entries
gotchasqlModerate

Why does searching for LIKE N'%�%' match any Unicode character and = N'�' match many?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whysearchinganymatchlikeunicodecharacterfordoesmany

Problem

DECLARE @T TABLE(
  Col NCHAR(1));

INSERT INTO @T
VALUES      (N'A'),
            (N'B'),
            (N'C'),
            (N'Ƕ'),
            (N'Ƿ'),
            (N'Ǹ');


SELECT *
FROM   @T
WHERE  Col LIKE N'%�%'


Returns

Col
A
B
C
Ƕ
Ƿ
Ǹ


SELECT *
FROM   @T
WHERE  Col = N'�'


Returns

Col
Ƕ
Ƿ
Ǹ


Generating every possible double byte "character" with the below shows that the = version matches 21,229 of them and the LIKE N'%�%' version all of them (I've tried a few non binary collations with the same result).

WITH T(I, N)
AS 
(
SELECT TOP 65536 ROW_NUMBER() OVER (ORDER BY @@SPID),
                 NCHAR(ROW_NUMBER() OVER (ORDER BY @@SPID))
FROM master..spt_values v1, 
     master..spt_values v2
)
SELECT I, N 
FROM T
WHERE N = N'�'


Anyone able to shed any light as to what is going on here?

Using COLLATE Latin1_General_BIN then matches the single character NCHAR(65533) - but the question is to understand what rules it uses in the other case. What's special about those 21,229 characters that match the = and why does everything match the wildcard like? I presume there is some reason behind it that I am missing.

nchar(65534) [and 21k others] work just as well as nchar(65533). The question could have been phrased using nchar(502) equally as - it behaves the same both as LIKE N'%Ƕ%' (matches everything) and in the = case. That's probably quite a big clue.

Changing the SELECT in the last query to SELECT I, N, RANK() OVER(ORDER BY N) shows that SQL Server can't rank the characters. It seems that any character not handled by the collation is considered equivalent.

A database with a Latin1_General_100_CS_AS collation produces 5840 matches. Latin1_General_100_CS_AS cuts down the = matches quite considerably, but doesn't change the LIKE behaviour. It does seem like there is a pot of characters that has got smaller in later collations that all compare equal and are ignored in wildcard `LIKE

Solution

How one "character" (which can be comprised of multiple Code Points: surrogate pairs, combining characters, etc) compares to another is based on a rather complex set of rules. It is so complex due to needing to account for all the various (and sometimes "wacky") rules found in all of the languages represented in the Unicode specification. This system applies to non-binary Collations for all NVARCHAR data, and for VARCHAR data that is using a Windows Collation and not a SQL Server Collation (one starting with SQL_). This system does not apply to VARCHAR data using a SQL Server Collation as those use simple mappings.

Most of the rules are defined in the Unicode Collation Algorithm (UCA). Some of those rules, and some not covered in the UCA, are:

  • The default ordering / weight given in the allkeys.txt file (noted below)



  • Which sensitivities and options are being used (e.g. is it case sensitive or insensitive?, and if sensitive, then is it upper-case first or lower-case first?)



  • Any locale-based overrides.



  • The version of the Unicode standard is being used.



  • The "human" factor (i.e. Unicode is a specification, not software, and is thus left up to each vendor to implement it)



I emphasized that final point regarding the human factor to hopefully make it clear that one should not expect SQL Server to always behave 100% according to the specification.

The overriding factor here is the weighting given to each Code Point, and the fact that multiple Code Points can share the same weight specification. You can find the basic weights (no locale-specific overrides) here (I believe the 100 series of Collations is Unicode v 5.0 -- informal confirmation in the comments on the Microsoft Connect item):

http://www.unicode.org/Public/UCA/5.0.0/allkeys.txt

The Code Point in question – U+FFFD – is defined as:

FFFD ; [*0F12.0020.0002.FFFD] # REPLACEMENT CHARACTER


That notation is defined in section 9.1 Allkeys File Format of the UCA:

:= ';' +
:= +
:= "[" "." "." ("." )? "]"
:= "*" | "."

Collation elements marked with a "*" are variable.


That last line is important as the Code Point we are looking at has a specification that indeed starts with "*". In section 3.6 Variable Weighting there are four possible behaviors defined, based on Collation configuration values that we have no direct access to (these are hard-coded into the Microsoft implementation of each Collation, such as whether case-sensitive uses lower-case first or upper-case first, a property which is different between VARCHAR data using SQL_ Collations and all other variations).

I do not have time to do the full research into which paths are taken and to infer which options are being used such that a more solid proof can be given, but it is safe to say that within each Code Point specification, whether or not something is considered "equal" is not going to always use the full specification. In this case, we have "0F12.0020.0002.FFFD" and most likely it is just levels 2 and 3 that are being used (i.e. .0020.0002.). Doing a "Count" in Notepad++ for ".0020.0002." finds 12,581 matches (including supplementary characters that we have not been dealing with yet). Doing a "Count" on "[" returns 4049 matches. Doing a RegEx "Find" / "Count" using a pattern of \[\\d{4}\.0020\.0002 returns 832 matches. So somewhere in this combination, plus possibly some other rules I am not seeing, plus some Microsoft-specific implementation details, is the full explanation of this behavior. And to be clear, the behavior is the same for all of the matching characters as they all match each other as they all have the same weight once the rules are applied (meaning, this question could have been asked about any one of them, not necessarily Mr. ).

You can see with the query below and changing the COLLATE clause as per the results below the query how the various sensitivities work across the two versions of Collations:

;WITH cte AS
(
SELECT TOP (65536) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1 AS [Num]
FROM [master].sys.columns col
CROSS JOIN [master].sys.objects obj
)
SELECT cte.Num AS [Decimal],
CONVERT(VARBINARY(2), cte.Num) AS [Hex],
NCHAR(cte.Num) AS [Character]
FROM cte
WHERE NCHAR(cte.Num) = NCHAR(0xFFFD) COLLATE Latin1_General_100_CS_AS_WS --N'�'
ORDER BY cte.Num;


The various counts of matching characters at different collations is below.

Latin1_General_100_CS_AS_WS = 5840
Latin1_General_100_CS_AS = 5841 (The "extra" character is U+3000)
Latin1_General_100_CI_AS = 5841
Latin1_General_100_CI_AI = 6311

Latin1_General_CS_AS_WS = 21,229
Latin1_General_CS_AS = 21,230
Latin1_General_CI_AS = 21,230
Latin1_General_CI_AI = 21,537


In all of the collations listed above N'' = N'�' also evaluates to true.

UPDATE

I was able to do a little more research and here is what I found:

How it "probably" should work

Using the I

Context

StackExchange Database Administrators Q#171331, answer score: 13

Revisions (0)

No revisions yet.