patternsqlMinor
Query to find rows containing ASCII characters in a given range
Viewed 0 times
containingrowsrangequeryasciifindcharactersgiven
Problem
I am using some scripts from another topic, but the accepted answer isn't working for all my data scenarios. I would have asked my question on the original How to check for Non-Ascii Characters post, but I don't have enough reputation to comment or up-vote as of yet.
Questions:
My Testing
I created SQL Fiddle with sample data, the stored procedure from one of the answers, and queries to demonstrate the issue.
Query 1: sample_table
Query 2: The other answer by John shows the "bad dash" row containing char(150):
Query 3: The accepted answer by Martin Smith returns no results:
Conclusion
Unfortunately, I often need to find characters within (or outside of) a range in databases I can't create stored procedures in. I'd really like to find a fix for the accepted answer or a simple script that wouldn't require creation of any objects (including temp tables).
Any suggestions? Thanks in advance.
EDIT 1: The solution cannot modify or add any objects or settings in the database. I'm looking for a self-contained query that will select rows with one or more characters in a range between two
Questions:
- Why is the accepted answer not working for char(150)?
- Is there a way to fix the accepted answer?
My Testing
I created SQL Fiddle with sample data, the stored procedure from one of the answers, and queries to demonstrate the issue.
Query 1: sample_table
-- Note: The "bad dash" row has char(150)
SELECT * FROM sample_table;
+-------------------+
| DataColumn |
+-------------------+
| test - good dash |
| test – bad dash |
+-------------------+Query 2: The other answer by John shows the "bad dash" row containing char(150):
SELECT dbo.Find_Invalid_Chars(DataColumn) [Invalid Characters]
FROM sample_table
WHERE dbo.Find_Invalid_Chars(DataColumn) IS NOT NULL;
+----------------------+
| Invalid Characters |
+----------------------+
| test [150] bad dash |
+----------------------+Query 3: The accepted answer by Martin Smith returns no results:
SELECT DataColumn AS [Bad Data]
FROM sample_table
WHERE DataColumn LIKE '%[' + CHAR(127)+ '-' +CHAR(255)+']%' COLLATE Latin1_General_100_BIN2;
+------------+
| [Bad Data] |
+------------+
-- No rows returned.Conclusion
Unfortunately, I often need to find characters within (or outside of) a range in databases I can't create stored procedures in. I'd really like to find a fix for the accepted answer or a simple script that wouldn't require creation of any objects (including temp tables).
Any suggestions? Thanks in advance.
EDIT 1: The solution cannot modify or add any objects or settings in the database. I'm looking for a self-contained query that will select rows with one or more characters in a range between two
CHAR() numbers, regardless of the ASCII or Extended ASCII number suppSolution
Why is the accepted answer not working for char(150)?
Actually, it does. The problem is your test is bad / invalid. You test column,
The quick fix to see that it does work is simply to change the
The following might help explain why making the test column
As you can see in the results below the query, the "bad dash", which was
P.S. Please keep in mind that the function
P.P.S. ALSO, I just noticed a slight difference in logic between the function and the query (i.e. the two answers from the linked question):
Given:
-
The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications.
and:
-
The data can be in either VARCHAR or NVARCHAR.
I would say that:
-
You don't want to convert
-
It will likely be more reliable to look for characters not in a specific "valid" range as opposed to those in a specific invalid range, especially when dealing with
-
You could get away with a single query if the "valid" range is always between values 0 and 127 (since those values are the same in both cases). But if you need to specify va
Actually, it does. The problem is your test is bad / invalid. You test column,
DataColumn, is using NVARCHAR instead of VARCHAR. The character itself works in both datatypes, but the behavior is different due to how it is being used in each case:- In the
Find_Invalid_Chars()function (i.e. the "other" answer), the string is being converted back intoVARCHARsince that is the datatype of the input parameter for that function. In this case it works as expected (although I believe it can be done much more efficiently than that loop, but that's for another time ;-)
- In the
LIKEquery (i.e. the "accepted" answer), the expanded and concatenated result of'%[' + CHAR(127)+ '-' +CHAR(255)+']%'is actually converted intoNVARCHARsince that is the datatype of the column it is being compared to (andNVARCHARhas a higher datatype precedence), hence thatLIKEfunction is not behaving as expected: either theCHAR(255)character maps to a different code point, and/or theCHAR(150)character in the column itself maps to a different code point (theCHAR(127)character does not change as it's in the standard ASCII range). In either case, the conversion toNVARCHARis causing the numerical value of the "En Dash" character ("–") to no longer be within that range. Meaning, theLIKEfunction is looking for values,y, between127andx(wherex>= 128), andyfor the "En Dash" character is now >x. Whereas inVARCHAR,x= 255 andy= 150.
The quick fix to see that it does work is simply to change the
NVARCHAR datatype of the DataColumn column to be VARCHAR (yes, just remove the initial "N"), then re-build the schema, then execute, and the LIKE query will behave as expected.The following might help explain why making the test column
NVARCHAR caused the LIKE query to not match the row:SELECT UNICODE(CHAR(127)) AS [CHAR(127)],
UNICODE(CHAR(150)) AS [CHAR(150)],
UNICODE(CHAR(255)) AS [CHAR(255)];
/*
CHAR(127) CHAR(150) CHAR(255)
127 8211 255
*/
As you can see in the results below the query, the "bad dash", which was
CHAR(150) became NCHAR(8211) when stored in the NVARCHAR column. And, since that predicate is using a binary collation (generally the correct thing to do in this scenario), it was looking at the code points / values, not the characters. Hence, the LIKE clause was looking for characters with values between 127 and 255, and 8211 is typically not in that range ;-).P.S. Please keep in mind that the function
CHAR(150) can return different characters, or even NULL, based on the default collation of the database in which you execute that function. This is because VARCHAR data is based on code pages, and those are determined by the collation, and the collation being used when executing the CHAR() function is the default collation of the active / current database. This affects values 128 - 255. Values 0 - 127 will always return the same characters, regardless of collation, as those are the standard ASCII character set and are the same across all of the code pages supported in SQL Server (though not in all code pages in general).P.P.S. ALSO, I just noticed a slight difference in logic between the function and the query (i.e. the two answers from the linked question):
CHAR(127) is considered good / valid in the Find_Invalid_Chars() function, yet it's considered bad / invalid in the LIKE query. If it were me, I would consider CHAR(127) valid as it's part of the standard ASCII character set. But, you need to decide what you consider it. Just be aware of that difference in case you do need to adjust the LIKE syntax a little.Given:
-
The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications.
and:
-
The data can be in either VARCHAR or NVARCHAR.
I would say that:
-
You don't want to convert
NVARCHAR source data into VARCHAR as there might be "best fit" mappings that translate invalid source characters into valid characters, but one or more of your software applications might not use "best fit" mappings.SELECT NCHAR(178) AS [Unicode], -- Superscript 2 (U+00B2)
CONVERT(VARCHAR(5), NCHAR(178)
COLLATE SQL_Latin1_General_CP1_CI_AS) AS [CodePage-1252],
CONVERT(VARCHAR(5), NCHAR(178)
COLLATE Turkmen_100_CI_AS) AS [CodePage-1250]
/*
Unicode CodePage-1252 CodePage-1250
² ² 2
*/
-
It will likely be more reliable to look for characters not in a specific "valid" range as opposed to those in a specific invalid range, especially when dealing with
NVARCHAR which holds a lot more than 256 characters.-
You could get away with a single query if the "valid" range is always between values 0 and 127 (since those values are the same in both cases). But if you need to specify va
Context
StackExchange Database Administrators Q#251740, answer score: 5
Revisions (0)
No revisions yet.