HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMajor

How to check for Non-Ascii Characters

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
nonasciiforhowcharacterscheck

Problem

What is the best way to check if a VARCHAR field has Non-Ascii Characters?

CHAR(1) through CHAR(31) and CHAR(127) through CHAR(255).

I tried using PATINDEX and have run into the following issue.

Checking the lower range worked correctly.

SELECT *      
FROM mbrnotes      
WHERE PATINDEX('%[' + CHAR(1)+ '-' +CHAR(31)+']%',LINE_TEXT) > 0


My data had three records with 0x1E and all three where returned.

But when I check just the upper range:

SELECT *      
FROM mbrnotes      
WHERE PATINDEX('%[' + CHAR(127)+ '-' +CHAR(255)+']%',LINE_TEXT) > 0


It returns close to all the records in the table (table count 170737 and returned count 170735) and since my data did not have any values in this range I would think it should have returned no records.

Solution

Ranges in the pattern syntax use the sorting rules of your collation.

Use a binary collate clause so the range is ordered by character code.

(I also changed it to LIKE as I find that more obvious than PATINDEX > 0)

SELECT *
FROM mbrnotes
WHERE LINE_TEXT COLLATE Latin1_General_100_BIN2 
     LIKE '%[' + CHAR(0)+ '-' +CHAR(31) + CHAR(127)+ '-' +CHAR(255)+']%'


If you actually want to see the offending characters and you are on a version with the TRANSLATE function you can use something like the below

DECLARE @WhiteListedCharacters NVARCHAR(1000)
= ' !"#$%&''()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+ '[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

SELECT text,
    REPLACE(
        TRANSLATE(
            text,
            @WhiteListedCharacters COLLATE Latin1_General_100_BIN2, 
            REPLICATE(
                LEFT(@WhiteListedCharacters,1),
                LEN(@WhiteListedCharacters))), 
        LEFT(@WhiteListedCharacters,1), 
        '') AS BadChars
FROM   sys.messages
WHERE  language_id = 1038


You can then use that result in a second call to TRANSLATE to preserve only the "good" characters.

DECLARE @WhiteListedCharacters NVARCHAR(1000)
= ' !"#$%&''()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+ '[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

SELECT text,
    BadChars,
    Cleaned = REPLACE(
                  TRANSLATE(
                      text  COLLATE Latin1_General_100_BIN2,
                      BadChars,
                      REPLICATE(N'ψ', LEN(BadChars))
                      ),
              N'ψ', N'')
FROM sys.messages
CROSS APPLY
(
SELECT REPLACE(
           TRANSLATE(
               text,
               @WhiteListedCharacters COLLATE Latin1_General_100_BIN2,
               REPLICATE(
                   LEFT(@WhiteListedCharacters,1),
                   LEN(@WhiteListedCharacters + '-') - 1)),
               LEFT(@WhiteListedCharacters,1), 
        '') AS BadChars
) ca
WHERE language_id = 1038

Code Snippets

SELECT *
FROM mbrnotes
WHERE LINE_TEXT COLLATE Latin1_General_100_BIN2 
     LIKE '%[' + CHAR(0)+ '-' +CHAR(31) + CHAR(127)+ '-' +CHAR(255)+']%'
DECLARE @WhiteListedCharacters NVARCHAR(1000)
= ' !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+ '[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

SELECT text,
    REPLACE(
        TRANSLATE(
            text,
            @WhiteListedCharacters COLLATE Latin1_General_100_BIN2, 
            REPLICATE(
                LEFT(@WhiteListedCharacters,1),
                LEN(@WhiteListedCharacters))), 
        LEFT(@WhiteListedCharacters,1), 
        '') AS BadChars
FROM   sys.messages
WHERE  language_id = 1038
DECLARE @WhiteListedCharacters NVARCHAR(1000)
= ' !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+ '[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

SELECT text,
    BadChars,
    Cleaned = REPLACE(
                  TRANSLATE(
                      text  COLLATE Latin1_General_100_BIN2,
                      BadChars,
                      REPLICATE(N'ψ', LEN(BadChars))
                      ),
              N'ψ', N'')
FROM sys.messages
CROSS APPLY
(
SELECT REPLACE(
           TRANSLATE(
               text,
               @WhiteListedCharacters COLLATE Latin1_General_100_BIN2,
               REPLICATE(
                   LEFT(@WhiteListedCharacters,1),
                   LEN(@WhiteListedCharacters + '-') - 1)),
               LEFT(@WhiteListedCharacters,1), 
        '') AS BadChars
) ca
WHERE language_id = 1038

Context

StackExchange Database Administrators Q#138350, answer score: 27

Revisions (0)

No revisions yet.