patternsqlModerate
Capitalize first letter with exceptions
Viewed 0 times
exceptionswithfirstcapitalizeletter
Problem
I need to properly format some European addresses.
One of the steps is to capitalize the first letter, but avoiding some specific words, like 'on', 'upon', 'von', 'van', 'di', 'in', 'sul'.
Therefore, although my skills are scarce, I thought it was a good idea to use a RegEx-based function.
After some googling I found this here:
That seems to search for a sequence of a 'non-letter' + a lowercase 'letter'
OK, I think I've understood how it works, and I modify it to best match my needs.
I supposed it was better to search for a space or ' or - and a lowercase letter, therefore I changed it to
Then, after several attempt, I built this RegEx on regexr.com, that seems to catch the required sequence:
But when I put that into the above function, the result is not as expected.
What's wrong?
One of the steps is to capitalize the first letter, but avoiding some specific words, like 'on', 'upon', 'von', 'van', 'di', 'in', 'sul'.
Therefore, although my skills are scarce, I thought it was a good idea to use a RegEx-based function.
After some googling I found this here:
CREATE FUNCTION InitialCap
(
@String nvarchar(max)
)
RETURNS nvarchar(max)
AS
BEGIN
DECLARE @Position INT;
SELECT
@String = STUFF(LOWER(@String),1,1,UPPER(LEFT(@String,1))) COLLATE Latin1_General_Bin,
@Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);
WHILE @Position > 0
SELECT
@String = STUFF(@String,@Position,2,UPPER(SUBSTRING(@String,@Position,2))) COLLATE Latin1_General_Bin,
@Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);
RETURN @String;
ENDThat seems to search for a sequence of a 'non-letter' + a lowercase 'letter'
[^A-Za-z][a-z]OK, I think I've understood how it works, and I modify it to best match my needs.
I supposed it was better to search for a space or ' or - and a lowercase letter, therefore I changed it to
[\s'-][\w]Then, after several attempt, I built this RegEx on regexr.com, that seems to catch the required sequence:
[\s](?!di\s|in\s|sul\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]But when I put that into the above function, the result is not as expected.
What's wrong?
Solution
SQL Server does not internally support Regular Expressions.
If you want / need Regular Expressions in SQL Server, you need to use SQLCLR. You can either code this yourself, or use pre-built functions such as the ones available in SQL# (which I wrote). Most of the RegEx functions are available in the Free version. I think you could use RegEx_Matches to return a result set of words that are not in your exclude list, and then combine that with the String_ToTitleCase4k function (also available in the Free version) to do the InitCap.
For example:
Returns:
The reason that it doesn't work entirely correctly is due to your regular expression being incorrect:
UPDATE:
I was able to fix your regular expression by changing it to be as follows:
Main differences from the original:
New output:
UPDATE 2:
If the desire is to have a list of exclude words that is updateable without needing to update the function containing the regular expression, that is fairly easy to accomplish by doing the following:
-
Create a table to hold the exclusion words:
-
Populate / manage-over-time the words in that table:
-
In whatever code you have that does this data-cleansing, dynamically build the regular expression from the table of words to exclude:
`DECLARE @Expression NVARCHAR(4000) = N'\b(?!(?:';
SELECT @Expression += ex.[Word] + N'|'
FROM @ExcludeWords ex
SET @Expression = LEFT(@Expression, LEN(@Expression) - 1) + N')\b)\w+';
SELECT @Expression; -- JUST FOR DEBUG
--\b(?!(?:di|in|sull|on|upon|von|u
LIKE and PATINDEX both support very limited wildcards, including a single-character range match [...] or exclude [^...] that are of similar syntax to RegEx, and are to a degree functionally similar, but are certainly not RegEx.If you want / need Regular Expressions in SQL Server, you need to use SQLCLR. You can either code this yourself, or use pre-built functions such as the ones available in SQL# (which I wrote). Most of the RegEx functions are available in the Free version. I think you could use RegEx_Matches to return a result set of words that are not in your exclude list, and then combine that with the String_ToTitleCase4k function (also available in the Free version) to do the InitCap.
For example:
DECLARE @Input NVARCHAR(MAX) =
N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
@Expression NVARCHAR(4000) =
N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';
-- show matches for debugging
SELECT word.[StartPos],
word.[EndPos],
word.[Value] AS [Original],
SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;
SELECT @Input = STUFF(@Input,
word.[StartPos],
((word.[EndPos] - word.[StartPos]) + 1),
SQL#.String_ToTitleCase4k(word.[Value], N'')
COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;
SELECT @Input AS [Fixed];Returns:
StartPos EndPos Original TitleCased
-------- ------ -------- ----------
11 12 s S
21 22 o O
32 33 s S
37 38 s S
46 47 g G
Fixed
-------------------------
santacroce Sull'arno O'sullivan SuLL SUlLiVan Gsantacroce
The reason that it doesn't work entirely correctly is due to your regular expression being incorrect:
- It only matches a single letter.
- It won't exclude any of the fragments if they are at the end of the string, but that might be ok if that never occurs in actual use.
- It won't include the first word of the string (due to the required white-space to the left), but that might be ok if that never occurs in actual use.
UPDATE:
I was able to fix your regular expression by changing it to be as follows:
\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+
Main differences from the original:
- I am using
\b(word boundary) instead of\s(white-space) as it handles beginning and ending of line/string. It also doesn't capture the white-space, which\sdoes, and if it isn't visually obvious, each of the matched strings above is prefixed with the space that matched. While that space wouldn't affect the replacement as it would still be a space, it did prevent the first word in the group from matching, unless the entire string was preceded by some white-space. In the case of using this with addresses, if they always start with a number then perhaps there will always be that preceding space, but best not to include it in the match.
- I added the
+(one or more) quantifier to the\wso that it will pick up more than just the first character
- I simplified the exclusion list by moving the common
\bat the end of each fragment to just outside of a new, inner non-capturing group. This is a non-functional difference. It just makes it easier to read and deal with.
New output:
StartPos EndPos Original TitleCased
-------- ------ -------- ----------
1 10 santacroce Santacroce
17 20 arno Arno
22 22 o O
24 31 sullivan Sullivan
38 45 sUlLiVan Sullivan
47 57 gsantacroce Gsantacroce
Fixed
-------------------------
Santacroce sull'Arno O'Sullivan suLL Sullivan Gsantacroce
UPDATE 2:
If the desire is to have a list of exclude words that is updateable without needing to update the function containing the regular expression, that is fairly easy to accomplish by doing the following:
-
Create a table to hold the exclusion words:
CREATE TABLE #ExcludeWords ([Word] NVARCHAR(50) NOT NULL);
-
Populate / manage-over-time the words in that table:
INSERT INTO #ExcludeWords ([Word]) VALUES
(N'di'), (N'in'), (N'sull'), (N'on'), (N'upon'), (N'von'), (N'uber'), (N'ueber');
-
In whatever code you have that does this data-cleansing, dynamically build the regular expression from the table of words to exclude:
`DECLARE @Expression NVARCHAR(4000) = N'\b(?!(?:';
SELECT @Expression += ex.[Word] + N'|'
FROM @ExcludeWords ex
SET @Expression = LEFT(@Expression, LEN(@Expression) - 1) + N')\b)\w+';
SELECT @Expression; -- JUST FOR DEBUG
--\b(?!(?:di|in|sull|on|upon|von|u
Code Snippets
DECLARE @Input NVARCHAR(MAX) =
N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
@Expression NVARCHAR(4000) =
N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';
-- show matches for debugging
SELECT word.[StartPos],
word.[EndPos],
word.[Value] AS [Original],
SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;
SELECT @Input = STUFF(@Input,
word.[StartPos],
((word.[EndPos] - word.[StartPos]) + 1),
SQL#.String_ToTitleCase4k(word.[Value], N'')
COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;
SELECT @Input AS [Fixed];Context
StackExchange Database Administrators Q#192601, answer score: 10
Revisions (0)
No revisions yet.