snippetsqlMinor
How to find values with multiple consecutive upper case characters
Viewed 0 times
casehowwithcharactersmultiplefindvaluesupperconsecutive
Problem
For a project that I am working on, I need to identify values that have incorrect uppercase.
For example, I need it to identify the following types of values:
Ideas I have tried so far:
(which gives results for rows that are all upper case, such as MR JOHN SMITH)
and
which picks up some rows too.
But I doubt this is the most efficient way to go about this.
For example, I need it to identify the following types of values:
Mr JOHN Smith
MR John Smith
Mr John SMITH
Mr JOhn SmithIdeas I have tried so far:
Select * from table where Name = upper(Name)
collate SQL_Latin1_General_CP1_CS_AS(which gives results for rows that are all upper case, such as MR JOHN SMITH)
and
Select * from table where right(Name,3) = upper(right(Name,3))
collate SQL_Latin1_General_CP1_CS_ASwhich picks up some rows too.
But I doubt this is the most efficient way to go about this.
Solution
This question is far more complicated than it appears to be on the surface (hence the longer-than-most-would-expect answer). If the strings being searched were codes (postal codes, ISO country codes, ISO state codes, SKUs, etc) or something where the characters used were a limited subset of all possible letters of all languages, then this would be fairly straight-forward. But when dealing with people's names, then no such luck.
All of the following example code and test cases can be found on Pastebin ( Searching
for Case-Sensitive patterns in SQL Server ). The SQL posted on Pastebin includes
additional test cases and additional examples related to various points mentioned below.
Very simplistically you can do the following, which does work with the sample data. I added two test cases to help determine whether or not this (or any) method works as it is data that should not match.
Please note:
-
I am not using a collation that starts with
-
I am using a collation in the
-
I am not using a binary collation (i.e. ending in
INSERT INTO @SampleData (Name) VALUES (N'Mr üLala Jones');
All of the following example code and test cases can be found on Pastebin ( Searching
for Case-Sensitive patterns in SQL Server ). The SQL posted on Pastebin includes
additional test cases and additional examples related to various points mentioned below.
Very simplistically you can do the following, which does work with the sample data. I added two test cases to help determine whether or not this (or any) method works as it is data that should not match.
SET NOCOUNT ON;
DECLARE @SampleData TABLE
(
Name NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
);
INSERT INTO @SampleData (Name) VALUES (N'Mr JOHN Smith');
INSERT INTO @SampleData (Name) VALUES (N'MR John Smith');
INSERT INTO @SampleData (Name) VALUES (N'Mr John SMITH');
INSERT INTO @SampleData (Name) VALUES (N'Mr JOhn Smith');
INSERT INTO @SampleData (Name) VALUES (N'mr john smith'); -- all lower-case
INSERT INTO @SampleData (Name) VALUES (N'Mr John Smith'); -- proper-case
SELECT *
FROM @SampleData
WHERE Name COLLATE Latin1_General_100_CS_AS
LIKE N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%';
Please note:
-
I am not using a collation that starts with
SQL_ as those have been deprecated (or perhaps more accurately: obsoleted) since SQL Server 2000 was released.-
I am using a collation in the
100 series as that is the most recent, and should be available starting in SQL Server 2008. Not all collations have a 100 series, but it is best to use the most recent one, which might be 90.-
I am not using a binary collation (i.e. ending in
_BIN2 or even _BIN though the _BIN collations have been deprecated since SQL Server 2005 was released, so only use _BIN2 if needing a binary collation). Binary collations are not "case sensitive"; they are "byte sensitive" (_BIN) / "code point sensitive" (_BIN2) and those are very different concepts. Binary collations only appear to be case sensitive when working with (or testing) US-English-only characters. Any characters with accents can potentially be handled incorrectly when using binary collations. For example:PRINT N'U' + NCHAR(0x0308) +
N'
For a more in-depth look into this topic, please see my blog post: No, Binary Collations are not Case-Sensitive.
-
Instead of using a single character range (i.e. the [A-Z] that works in LIKE and PATINDEX) I added each of the 26 letters. This is required in order for the preferred _CS (i.e. case sensitive) collations to work properly. If you specify a range of characters via the [A-Z] syntax, then it will appear to not respect the case sensitive collation. That, however, is a false interpretation of the behavior. Case sensitive comparisons can be tricky since they, by design, work differently between equality comparisons (e.g. WHERE N'zbz' LIKE N'%B%' or WHERE N'b' = N'B') and range comparisons (e.g. WHERE N'zbz' LIKE N'%[A-C]%'). Range comparisons are more like sorting. And, case sensitive sorting does not handle all of one case before the other case. Think in terms of how dictionaries are ordered (and in fact, the Microsoft documentation even uses the term "dictionary ordering" to describe non-binary collations): they don't place all words starting with A-Z prior to words starting with a-z. They group upper and lower case of each letter together: A,a,B,b,C,c, and so on, not A,B,C,a,b,c. Hence, assuming that upper-case are first (and which one comes first depends on if you are using a SQL_ or non-SQL_ collation), a range of [A-C] equates to [AaBbC], and if lower-case are first, that same range equates to [AbBcC]. In both cases, the lower-case "b" is validly in that range.
However, this approach mainly works if the data you are searching for is guaranteed to only ever have the 26 US-English characters in it, and never ever any other characters found in most other languages, especially those with accents / diacritical marks. Given that your data contains names, you cannot guarantee that, even if working in the US since people live here who came from somewhere else (or an ancestor did), or sign up for services hosted here even though they live anywhere in the world.
For example, add the following two test cases to the example code above:
INSERT INTO @SampleData (Name) VALUES (N'Mr ÜLala Jones');INSERT INTO @SampleData (Name) VALUES (N'Mr üLala Jones');
Running the code again will not yield new results. The second line added uses a lower-case ü so it shouldn't match anyway. But the first line uses an upper-case Ü` Code Snippets
(?:\p{Lu}\p{M}*){2}Context
StackExchange Database Administrators Q#122612, answer score: 7
Revisions (0)
No revisions yet.