patternsqlModerate
Replace a sequential set of numbers with special character
Viewed 0 times
withnumbersreplacecharactersequentialspecialset
Problem
I have a varchar(200) column that contains entries such as,
etc..
I want to replace a sequence of numbers with a single
The result for this set would be
I have written the following primitive query below, it works correctly, but takes a long time to run on a huge table.
I need help with rewriting or editing it to improve it's performance. SQL Server 2014
```
-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '' with just a single ''
SELECT REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE(SampleID, '0', '*'),
'1', '*'),
'2', '*'),
'3', '*'),
'4', '*'),
'5', '*'),
'6', '*'),
'7', '*'),
'8', '*'),
'9', '*')
, '', '~') -- replace each occurrence of '' with '~' (token plus asterisk)
, '~', '') -- replace in the result of the previous step each occurrence of '~' (asterisk plus token) with '' (an empty string)
, '~', '') -- re
ABC123124_A12312ABC123_A1212ABC123124_B12312AC123124_AD12312A12312_123etc..
I want to replace a sequence of numbers with a single
* so that I can group the distinct non-numeric patterns in the table.The result for this set would be
ABC_A ABC_BAC_ADA_I have written the following primitive query below, it works correctly, but takes a long time to run on a huge table.
I need help with rewriting or editing it to improve it's performance. SQL Server 2014
```
-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '' with just a single ''
SELECT REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE(SampleID, '0', '*'),
'1', '*'),
'2', '*'),
'3', '*'),
'4', '*'),
'5', '*'),
'6', '*'),
'7', '*'),
'8', '*'),
'9', '*')
, '', '~') -- replace each occurrence of '' with '~' (token plus asterisk)
, '~', '') -- replace in the result of the previous step each occurrence of '~' (asterisk plus token) with '' (an empty string)
, '~', '') -- re
Solution
Two factors are important for performance:
-
Reduce the number of string operations.
You may find it is possible to implement what you need using e.g.
-
Use the cheapest collation that provides correct results.
Binary collations are the cheapest. SQL collations (on non-Unicode data only) are a little more expensive. Windows collations are much more expensive.
For example:
db<>fiddle demo
That example relies on a permanent table of numbers. If needed, a table sufficient for
%%CODEBLOCK_2%%
If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:
%%CODEBLOCK_3%%
Users of SQL Server 2017 or later can leverage the built-in
You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?
Using the SQL# library, a complete solution would be:
%%CODEBLOCK_4%%
Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all.
-
Reduce the number of string operations.
You may find it is possible to implement what you need using e.g.
CHARINDEX and PATINDEX to find the start and end of groups, rather than performing very many REPLACE operations on the whole string each time.-
Use the cheapest collation that provides correct results.
Binary collations are the cheapest. SQL collations (on non-Unicode data only) are a little more expensive. Windows collations are much more expensive.
For example:
DECLARE @T table
(
SampleID varchar(200) NOT NULL UNIQUE
);
INSERT @T
(SampleID)
VALUES
('ABC123124_A12312'),
('ABC123_A1212'),
('ABC123124_B12312'),
('AC123124_AD12312'),
('A12312_123'),
('999ABC888DEF');SELECT
T.SampleID,
Pattern =
(
SELECT
CASE
WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
ELSE ''
END
FROM dbo.Numbers AS N
OUTER APPLY
(
SELECT
SUBSTRING(Bin.string, N.n, 1),
SUBSTRING(Bin.string, N.n + 1, 1)
) AS Chars (prev, this)
WHERE
N.n BETWEEN 1 AND LEN(Bin.string)
ORDER BY N.n
FOR XML PATH ('')
)
FROM @T AS T
OUTER APPLY (VALUES('
db<>fiddle demo
That example relies on a permanent table of numbers. If needed, a table sufficient for varchar(200) is:
-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
-- Destination column type integer NOT NULL
ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);
-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);
If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:
(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),
Users of SQL Server 2017 or later can leverage the built-in TRANSLATE function, which may perform better than the nested REPLACE calls.
You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?
Using the SQL# library, a complete solution would be:
SELECT
T.SampleID,
SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;
Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all. + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);db<>fiddle demo
That example relies on a permanent table of numbers. If needed, a table sufficient for
varchar(200) is:%%CODEBLOCK_2%%
If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:
%%CODEBLOCK_3%%
Users of SQL Server 2017 or later can leverage the built-in
TRANSLATE function, which may perform better than the nested REPLACE calls.You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?
Using the SQL# library, a complete solution would be:
%%CODEBLOCK_4%%
Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all.
Code Snippets
DECLARE @T table
(
SampleID varchar(200) NOT NULL UNIQUE
);
INSERT @T
(SampleID)
VALUES
('ABC123124_A12312'),
('ABC123_A1212'),
('ABC123124_B12312'),
('AC123124_AD12312'),
('A12312_123'),
('999ABC888DEF');SELECT
T.SampleID,
Pattern =
(
SELECT
CASE
WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
ELSE ''
END
FROM dbo.Numbers AS N
OUTER APPLY
(
SELECT
SUBSTRING(Bin.string, N.n, 1),
SUBSTRING(Bin.string, N.n + 1, 1)
) AS Chars (prev, this)
WHERE
N.n BETWEEN 1 AND LEN(Bin.string)
ORDER BY N.n
FOR XML PATH ('')
)
FROM @T AS T
OUTER APPLY (VALUES('$' + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
-- Destination column type integer NOT NULL
ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);
-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),SELECT
T.SampleID,
SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;Context
StackExchange Database Administrators Q#262705, answer score: 12
Revisions (0)
No revisions yet.