gotchasqlMinor
Why does REGEX_REPLACE return the whole string of non-matches?
Viewed 0 times
matcheswhytheregex_replacewholereturnnondoesstring
Problem
I am using MySQL 8 and have a Regex that needs to return a Regex group from a table column of string data.
DB-Fiddle
Table:
SQL:
Regex expected output here (regex101.com)
The Regex contains a group, whose REGEX replace is used to return to the SELECT.
The MySQL manual states:
Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.
Expected outcome:
id_output
47099085T
73654109H
Actual outcome:
id_output
47099085T
73654109H
8.30781719e-05
0113:11:19%2000:54:17.042828927Z
So for the two rows that do not match the sub group, I am expecting the return on the REGEX_REPLACE to be NULL, because there isn't any match. But it instead returns the whole column's data.
Why?
How can I edit the REGEXP to only return matching outcomes?
DB-Fiddle
Table:
CREATE TABLE `table_name`(
id_column VARCHAR(300)
);
INSERT INTO `table_name` VALUES
('47099085T'),
('D73654109H'),
('8.30781719e-05'),
('0113:11:19%2000:54:17.042828927Z');SQL:
SELECT REGEXP_REPLACE(id_column,
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') as id_output
FROM `table_name`Regex expected output here (regex101.com)
The Regex contains a group, whose REGEX replace is used to return to the SELECT.
> "47099085T" // matches and returns value in quotes.
> D"73654109H" // matches and returns value in quotes.
> 8.30781719e-05 // unmatched
> 0113:11:19%2000:54:17.042828927Z // unmatchedThe MySQL manual states:
REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.
Expected outcome:
id_output
47099085T
73654109H
Actual outcome:
id_output
47099085T
73654109H
8.30781719e-05
0113:11:19%2000:54:17.042828927Z
So for the two rows that do not match the sub group, I am expecting the return on the REGEX_REPLACE to be NULL, because there isn't any match. But it instead returns the whole column's data.
Why?
How can I edit the REGEXP to only return matching outcomes?
Solution
The REPLACE function pretty much does the same thing
Basically, both REPLACE and REGEXP_REPLACE would simply return the original string if there was nothing to replace in the original string.
Just imagine your query written as an UPDATE
That would be quite destructive if you want REGEXP_REPLACE to return nothing. In light of this, it is a very good thing that both REPLACE and REGEXP_REPLACE would return the original string.
SUGGESTION : Use a WHERE Clause
The db-fiddle Query is even better WHERE cluase
What about the MySQL Documentation ???
Your citation of the MySQL Documentation points to the parameters of REGEXP_REPLACE being NULL. It makes no mention of the return value of REGEXP_REPLACE.
UPDATE 2023-01-20 10:50 EST
REGEXP_REPLACE should not return the same value in any instance. The only thing I can think of is if you are running an older version of MySQL 8.0. There is a character set bug mentioned on that documentation page:
Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)
You might need to experiment with REGEXP_LIKE in the WHERE clause and see (I am not a fan of REGEXP stuff).
Basically, both REPLACE and REGEXP_REPLACE would simply return the original string if there was nothing to replace in the original string.
Just imagine your query written as an UPDATE
UPDATE table_name SET id_column =
REGEXP_REPLACE(id_column,'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') ;That would be quite destructive if you want REGEXP_REPLACE to return nothing. In light of this, it is a very good thing that both REPLACE and REGEXP_REPLACE would return the original string.
SUGGESTION : Use a WHERE Clause
SELECT REGEXP_REPLACE(id_column,
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1')
as id_output FROM `table_name` WHERE id_column <>
REGEXP_REPLACE(id_column,
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1');The db-fiddle Query is even better WHERE cluase
SELECT id_column
FROM `table_name`
WHERE id_column REGEXP
'[^\\.0-9][0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]'lWhat about the MySQL Documentation ???
Your citation of the MySQL Documentation points to the parameters of REGEXP_REPLACE being NULL. It makes no mention of the return value of REGEXP_REPLACE.
UPDATE 2023-01-20 10:50 EST
REGEXP_REPLACE should not return the same value in any instance. The only thing I can think of is if you are running an older version of MySQL 8.0. There is a character set bug mentioned on that documentation page:
Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)
You might need to experiment with REGEXP_LIKE in the WHERE clause and see (I am not a fan of REGEXP stuff).
Code Snippets
UPDATE table_name SET id_column =
REGEXP_REPLACE(id_column,'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') ;SELECT REGEXP_REPLACE(id_column,
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1')
as id_output FROM `table_name` WHERE id_column <>
REGEXP_REPLACE(id_column,
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1');SELECT id_column
FROM `table_name`
WHERE id_column REGEXP
'[^\\.0-9][0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]'lContext
StackExchange Database Administrators Q#322356, answer score: 4
Revisions (0)
No revisions yet.