patternsqlModerate

Capitalize first letter with exceptions

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

exceptionswithfirstcapitalizeletter

Problem

I need to properly format some European addresses.
One of the steps is to capitalize the first letter, but avoiding some specific words, like 'on', 'upon', 'von', 'van', 'di', 'in', 'sul'.
Therefore, although my skills are scarce, I thought it was a good idea to use a RegEx-based function.

After some googling I found this here:

CREATE FUNCTION InitialCap
(
    @String nvarchar(max)
)
    RETURNS nvarchar(max)
AS
    BEGIN 
        DECLARE @Position INT;

        SELECT 
            @String   = STUFF(LOWER(@String),1,1,UPPER(LEFT(@String,1))) COLLATE Latin1_General_Bin,
            @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        WHILE @Position > 0
            SELECT 
                @String   = STUFF(@String,@Position,2,UPPER(SUBSTRING(@String,@Position,2))) COLLATE Latin1_General_Bin,
                @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        RETURN @String;
    END

That seems to search for a sequence of a 'non-letter' + a lowercase 'letter'

[^A-Za-z][a-z]

OK, I think I've understood how it works, and I modify it to best match my needs.

I supposed it was better to search for a space or ' or - and a lowercase letter, therefore I changed it to

[\s'-][\w]

Then, after several attempt, I built this RegEx on regexr.com, that seems to catch the required sequence:

[\s](?!di\s|in\s|sul\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]

But when I put that into the above function, the result is not as expected.

What's wrong?

Solution

SQL Server does not internally support Regular Expressions. LIKE and PATINDEX both support very limited wildcards, including a single-character range match [...] or exclude [^...] that are of similar syntax to RegEx, and are to a degree functionally similar, but are certainly not RegEx.

If you want / need Regular Expressions in SQL Server, you need to use SQLCLR. You can either code this yourself, or use pre-built functions such as the ones available in SQL# (which I wrote). Most of the RegEx functions are available in the Free version. I think you could use RegEx_Matches to return a result set of words that are not in your exclude list, and then combine that with the String_ToTitleCase4k function (also available in the Free version) to do the InitCap.

For example:

DECLARE @Input NVARCHAR(MAX) =
               N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
        @Expression NVARCHAR(4000) =
               N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';

-- show matches for debugging
SELECT word.[StartPos],
       word.[EndPos],
       word.[Value] AS [Original],
       SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM   SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;

SELECT @Input = STUFF(@Input,
                      word.[StartPos],
                      ((word.[EndPos] - word.[StartPos]) + 1),
                      SQL#.String_ToTitleCase4k(word.[Value], N'')
                                                COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;

SELECT @Input AS [Fixed];

Returns:

StartPos    EndPos    Original    TitleCased
--------    ------    --------    ----------
11          12        s           S
21          22        o           O
32          33        s           S
37          38        s           S
46          47        g           G

Fixed
-------------------------
santacroce Sull'arno O'sullivan SuLL SUlLiVan Gsantacroce

The reason that it doesn't work entirely correctly is due to your regular expression being incorrect:

It only matches a single letter.

It won't exclude any of the fragments if they are at the end of the string, but that might be ok if that never occurs in actual use.

It won't include the first word of the string (due to the required white-space to the left), but that might be ok if that never occurs in actual use.

UPDATE:

I was able to fix your regular expression by changing it to be as follows:

\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+

Main differences from the original:

I am using \b (word boundary) instead of \s (white-space) as it handles beginning and ending of line/string. It also doesn't capture the white-space, which \s does, and if it isn't visually obvious, each of the matched strings above is prefixed with the space that matched. While that space wouldn't affect the replacement as it would still be a space, it did prevent the first word in the group from matching, unless the entire string was preceded by some white-space. In the case of using this with addresses, if they always start with a number then perhaps there will always be that preceding space, but best not to include it in the match.

I added the + (one or more) quantifier to the \w so that it will pick up more than just the first character

I simplified the exclusion list by moving the common \b at the end of each fragment to just outside of a new, inner non-capturing group. This is a non-functional difference. It just makes it easier to read and deal with.

New output:

StartPos    EndPos    Original      TitleCased
--------    ------    --------      ----------
1           10        santacroce    Santacroce
17          20        arno          Arno
22          22        o             O
24          31        sullivan      Sullivan
38          45        sUlLiVan      Sullivan
47          57        gsantacroce   Gsantacroce

Fixed
-------------------------
Santacroce sull'Arno O'Sullivan suLL Sullivan Gsantacroce

UPDATE 2:

If the desire is to have a list of exclude words that is updateable without needing to update the function containing the regular expression, that is fairly easy to accomplish by doing the following:

-
Create a table to hold the exclusion words:

CREATE TABLE #ExcludeWords ([Word] NVARCHAR(50) NOT NULL);

-
Populate / manage-over-time the words in that table:

INSERT INTO #ExcludeWords ([Word]) VALUES
   (N'di'), (N'in'), (N'sull'), (N'on'), (N'upon'), (N'von'), (N'uber'), (N'ueber');

-
In whatever code you have that does this data-cleansing, dynamically build the regular expression from the table of words to exclude:

`DECLARE @Expression NVARCHAR(4000) = N'\b(?!(?:';

SELECT @Expression += ex.[Word] + N'|'
FROM @ExcludeWords ex

SET @Expression = LEFT(@Expression, LEN(@Expression) - 1) + N')\b)\w+';

SELECT @Expression; -- JUST FOR DEBUG
--\b(?!(?:di|in|sull|on|upon|von|u

Code Snippets

DECLARE @Input NVARCHAR(MAX) =
               N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
        @Expression NVARCHAR(4000) =
               N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';

-- show matches for debugging
SELECT word.[StartPos],
       word.[EndPos],
       word.[Value] AS [Original],
       SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM   SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input = STUFF(@Input,
                      word.[StartPos],
                      ((word.[EndPos] - word.[StartPos]) + 1),
                      SQL#.String_ToTitleCase4k(word.[Value], N'')
                                                COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input AS [Fixed];

Context

StackExchange Database Administrators Q#192601, answer score: 10

Revisions (0)

No revisions yet.