patternsqlMajor
Why are non-digits LIKE [0-9]?
Viewed 0 times
whynonaredigitslike
Problem
My server's default collation is Latin1_General_CI_AS, as determined by this query:
I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate
Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.
Filtering digits produces non-digit caracters
I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.
The following statement creates a temporary table with 256 rows, one for each code point in the current code page:
Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of
I would expect to be able to filter on the Symbol column to find digit characters using a LIKE predicate and specifying the range of characters '0' thru '9':
It produces a surprising output:
```
CodePoint Symbol
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55
SELECT SERVERPROPERTY('Collation') AS Collation;I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate
LIKE '[0-9]'.Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.
Filtering digits produces non-digit caracters
I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.
The following statement creates a temporary table with 256 rows, one for each code point in the current code page:
WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0),
P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R),
P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R),
P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R),
Tally(Number) AS (
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM P3
)
SELECT Number AS CodePoint, CHAR(Number) AS Symbol
INTO #CodePage
FROM Tally
WHERE Number >= 0 AND Number <= 255;Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of
SELECT CodePoint, Symbol FROM #CodePage:0
1
2
...
32
33 !
34 "
35 #
...
48 0
49 1
50 2
...
65 A
66 B
67 C
...
253 ý
254 þ
255 ÿI would expect to be able to filter on the Symbol column to find digit characters using a LIKE predicate and specifying the range of characters '0' thru '9':
SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0-9]';It produces a surprising output:
```
CodePoint Symbol
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55
Solution
[0-9] is not some type of regular expression defined to just match digits. Any range in a
LIKE pattern matches characters between the start and end character according to collation sort order.SELECT CodePoint,
Symbol,
RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER BY Symbol COLLATE Latin1_General_CI_ASReturns
CodePoint Symbol Rnk
-------------------- ------ --------------------
48 0 1
188 ¼ 2
189 ½ 3
190 ¾ 4
185 ¹ 5
49 1 5
50 2 7
178 ² 7
179 ³ 9
51 3 9
52 4 11
53 5 12
54 6 13
55 7 14
56 8 15
57 9 16So you get these results because under your default collation these characters sort after
0 but before 9. It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between
0 and 1.You could also use a set rather than a range. To avoid
2 matching ² you would need a CS collationSELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_ASCode Snippets
SELECT CodePoint,
Symbol,
RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER BY Symbol COLLATE Latin1_General_CI_ASCodePoint Symbol Rnk
-------------------- ------ --------------------
48 0 1
188 ¼ 2
189 ½ 3
190 ¾ 4
185 ¹ 5
49 1 5
50 2 7
178 ² 7
179 ³ 9
51 3 9
52 4 11
53 5 12
54 6 13
55 7 14
56 8 15
57 9 16SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_ASContext
StackExchange Database Administrators Q#34730, answer score: 26
Revisions (0)
No revisions yet.