HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Replace a sequential set of numbers with special character

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
withnumbersreplacecharactersequentialspecialset

Problem

I have a varchar(200) column that contains entries such as,

ABC123124_A12312
ABC123_A1212
ABC123124_B12312
AC123124_AD12312
A12312_123
etc..

I want to replace a sequence of numbers with a single * so that I can group the distinct non-numeric patterns in the table.

The result for this set would be
ABC_A
ABC_B
AC_AD
A_

I have written the following primitive query below, it works correctly, but takes a long time to run on a huge table.

I need help with rewriting or editing it to improve it's performance. SQL Server 2014

```
-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '' with just a single ''
SELECT REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE(SampleID, '0', '*'),
'1', '*'),
'2', '*'),
'3', '*'),
'4', '*'),
'5', '*'),
'6', '*'),
'7', '*'),
'8', '*'),
'9', '*')
, '', '~') -- replace each occurrence of '' with '~' (token plus asterisk)
, '~', '') -- replace in the result of the previous step each occurrence of '~' (asterisk plus token) with '' (an empty string)
, '~', '') -- re

Solution

Two factors are important for performance:

-
Reduce the number of string operations.

You may find it is possible to implement what you need using e.g. CHARINDEX and PATINDEX to find the start and end of groups, rather than performing very many REPLACE operations on the whole string each time.

-
Use the cheapest collation that provides correct results.

Binary collations are the cheapest. SQL collations (on non-Unicode data only) are a little more expensive. Windows collations are much more expensive.

For example:

DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('ABC123124_B12312'),
    ('AC123124_AD12312'),
    ('A12312_123'),
    ('999ABC888DEF');


SELECT
    T.SampleID,
    Pattern =
    (
        SELECT
            CASE
                WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
                WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
                ELSE ''
            END
        FROM dbo.Numbers AS N
        OUTER APPLY
        (
            SELECT 
                SUBSTRING(Bin.string, N.n, 1),
                SUBSTRING(Bin.string, N.n + 1, 1)
        ) AS Chars (prev, this)
        WHERE
            N.n BETWEEN 1 AND LEN(Bin.string)
        ORDER BY N.n
        FOR XML PATH ('')
    )
FROM @T AS T
OUTER APPLY (VALUES('

db<>fiddle demo

That example relies on a permanent table of numbers. If needed, a table sufficient for varchar(200) is:

-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);

-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);


If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:

(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),


Users of SQL Server 2017 or later can leverage the built-in TRANSLATE function, which may perform better than the nested REPLACE calls.

You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?

Using the SQL# library, a complete solution would be:

SELECT 
    T.SampleID,
    SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;


Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all. + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);


db<>fiddle demo

That example relies on a permanent table of numbers. If needed, a table sufficient for varchar(200) is:

%%CODEBLOCK_2%%

If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:

%%CODEBLOCK_3%%

Users of SQL Server 2017 or later can leverage the built-in TRANSLATE function, which may perform better than the nested REPLACE calls.

You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?

Using the SQL# library, a complete solution would be:

%%CODEBLOCK_4%%

Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all.

Code Snippets

DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('ABC123124_B12312'),
    ('AC123124_AD12312'),
    ('A12312_123'),
    ('999ABC888DEF');
SELECT
    T.SampleID,
    Pattern =
    (
        SELECT
            CASE
                WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
                WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
                ELSE ''
            END
        FROM dbo.Numbers AS N
        OUTER APPLY
        (
            SELECT 
                SUBSTRING(Bin.string, N.n, 1),
                SUBSTRING(Bin.string, N.n + 1, 1)
        ) AS Chars (prev, this)
        WHERE
            N.n BETWEEN 1 AND LEN(Bin.string)
        ORDER BY N.n
        FOR XML PATH ('')
    )
FROM @T AS T
OUTER APPLY (VALUES('$' + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);
-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);

-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);
(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),
SELECT 
    T.SampleID,
    SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;

Context

StackExchange Database Administrators Q#262705, answer score: 12

Revisions (0)

No revisions yet.