HiveBrain v1.2.0
Get Started
← Back to all entries
gotchasqlMinor

Why does REGEX_REPLACE return the whole string of non-matches?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
matcheswhytheregex_replacewholereturnnondoesstring

Problem

I am using MySQL 8 and have a Regex that needs to return a Regex group from a table column of string data.

DB-Fiddle

Table:

CREATE TABLE `table_name`(
    id_column VARCHAR(300)
);
INSERT INTO `table_name` VALUES
('47099085T'),
('D73654109H'),
('8.30781719e-05'),
('0113:11:19%2000:54:17.042828927Z');


SQL:

SELECT REGEXP_REPLACE(id_column,
        '[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') as id_output
 FROM `table_name`


Regex expected output here (regex101.com)

The Regex contains a group, whose REGEX replace is used to return to the SELECT.

> "47099085T"                         // matches and returns value in quotes.
> D"73654109H"                        // matches and returns value in quotes.
> 8.30781719e-05                      // unmatched
> 0113:11:19%2000:54:17.042828927Z    // unmatched


The MySQL manual states:

REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])


Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.

Expected outcome:

id_output

47099085T

73654109H

Actual outcome:

id_output

47099085T

73654109H

8.30781719e-05

0113:11:19%2000:54:17.042828927Z

So for the two rows that do not match the sub group, I am expecting the return on the REGEX_REPLACE to be NULL, because there isn't any match. But it instead returns the whole column's data.
Why?

How can I edit the REGEXP to only return matching outcomes?

Solution

The REPLACE function pretty much does the same thing

Basically, both REPLACE and REGEXP_REPLACE would simply return the original string if there was nothing to replace in the original string.

Just imagine your query written as an UPDATE

UPDATE table_name SET id_column =
REGEXP_REPLACE(id_column,'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') ;


That would be quite destructive if you want REGEXP_REPLACE to return nothing. In light of this, it is a very good thing that both REPLACE and REGEXP_REPLACE would return the original string.
SUGGESTION : Use a WHERE Clause

SELECT REGEXP_REPLACE(id_column, 
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1')
as id_output FROM `table_name` WHERE id_column <>
REGEXP_REPLACE(id_column, 
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1');


The db-fiddle Query is even better WHERE cluase

SELECT id_column
FROM `table_name` 
WHERE id_column REGEXP
'[^\\.0-9][0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]'l


What about the MySQL Documentation ???

Your citation of the MySQL Documentation points to the parameters of REGEXP_REPLACE being NULL. It makes no mention of the return value of REGEXP_REPLACE.
UPDATE 2023-01-20 10:50 EST

REGEXP_REPLACE should not return the same value in any instance. The only thing I can think of is if you are running an older version of MySQL 8.0. There is a character set bug mentioned on that documentation page:

Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)

You might need to experiment with REGEXP_LIKE in the WHERE clause and see (I am not a fan of REGEXP stuff).

Code Snippets

UPDATE table_name SET id_column =
REGEXP_REPLACE(id_column,'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1') ;
SELECT REGEXP_REPLACE(id_column, 
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1')
as id_output FROM `table_name` WHERE id_column <>
REGEXP_REPLACE(id_column, 
'[^\\.0-9]([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$1');
SELECT id_column
FROM `table_name` 
WHERE id_column REGEXP
'[^\\.0-9][0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]'l

Context

StackExchange Database Administrators Q#322356, answer score: 4

Revisions (0)

No revisions yet.