patternsqlMinor
PostgreSQL - Index efficiently on REGEX_REPLACE()
Viewed 0 times
postgresqlindexefficientlyregex_replace
Problem
I have a query which is designed to loop and search addresses for duplicates, the query uses REGEX_REPLACE. I am trying to index on the regex as on doing an explain and its doing a sequential scan on the user_property table with a filter on the regex
```
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\
```
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\
Solution
I found this an interesing question +1!
In order to answer, I went down the old-fashioned route and did some testing.
All of the code below can be found on the fiddle here. The strategy employed uses
and populate with a few records:
and now, the secret sauce - indexing those
and just to check (always check):
Result:
First, we run this:
-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.
-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.
The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without
From the documentation here:
-
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types.
but turning this variable off discourages the planner from using one
if there are other methods available. The default is on.
[Emphasis mine] - also, see the discussion below.
and then we run:
Result:
Excellent - the result we want - an
Then, we run:
where the data requested is entirely contained in the index (of the
We have an
Interestingly, when I rerun this (at the end):
Result:
I can only surmise (a nice word for "guess"!) that at this point, the optimiser knows that the table is in the memory buffer and that a
I've seen this before, quite why it uses the index for my first query and not for this second one is a mystery to me - the source code is a bit above my pay grade I'm afraid.
In order to answer, I went down the old-fashioned route and did some testing.
All of the code below can be found on the fiddle here. The strategy employed uses
GENERATED columns (Manual). You could also use expression (aka functional) indexes - see note at the bottom of this answer).CREATE TABLE test
(
t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
address TEXT,
post_code TEXT,
add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);and populate with a few records:
INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');and now, the secret sauce - indexing those
GENERATED columns:CREATE INDEX my_field_regexp_idx ON test (add_bis);
CREATE INDEX post_code_regexp_idx ON test(p_c_bis);and just to check (always check):
SELECT * FROM test;Result:
t_id address post_code add_bis p_c_bis
1 The Building, Apt 13, Flr 6 abc123 The BUILDING, Apt 13, Flr 6 ABC123
2 The Building, Apt 45, Flr 8 abc456 The BUILDING, Apt 45, Flr 8 ABC456
3 The Building, Apt 45, Flr 9 abc789 The BUILDING, Apt 45, Flr 9 ABC789First, we run this:
SET enable_seqscan = OFF;-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.
-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.
The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without
enable_seqscan = OFF, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.From the documentation here:
-
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types.
It is impossible to suppress sequential scans entirely,but turning this variable off discourages the planner from using one
if there are other methods available. The default is on.
[Emphasis mine] - also, see the discussion below.
and then we run:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
*
FROM
test
WHERE add_bis LIKE 'The BUILDING';Result:
QUERY PLAN
Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)
Output: t_id, address, post_code, add_bis, p_c_bis
Index Cond: (test.add_bis = 'The BUILDING'::text)
Filter: (test.add_bis ~~ 'The BUILDING'::text)
Buffers: shared read=1
Planning:
Buffers: shared hit=22
Planning Time: 0.150 ms
Execution Time: 0.042 msExcellent - the result we want - an
Index Scan:Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)Then, we run:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
p_c_bis
FROM test
WHERE p_c_bis = 'ABC123';where the data requested is entirely contained in the index (of the
REGEXP_REPLACE()ed column) and et voilà:Index Only Scan using post_code_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=32) (actual time=0.039..0.040 rows=1 loops=1)
Output: p_c_bis
Index Cond: (test.p_c_bis = 'ABC123'::text)
Heap Fetches: 1
Buffers: shared hit=1 read=1
Planning Time: 0.043 ms
Execution Time: 0.064 msWe have an
Index Only Scan which is what we wanted! There's another example of an Index Only Scan in the fiddle.Interestingly, when I rerun this (at the end):
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
*
FROM test
WHERE add_bis LIKE 'The BUILDING%';Result:
Seq Scan on public.test (cost=10000000000.00..10000000001.04 rows=1 width=132) (actual time=0.008..0.009 rows=3 loops=1)
Output: t_id, address, post_code, add_bis, p_c_bis
Filter: (test.add_bis ~~ 'The BUILDING%'::text)
Buffers: shared hit=1
Planning Time: 0.038 ms
Execution Time: 0.027 msI can only surmise (a nice word for "guess"!) that at this point, the optimiser knows that the table is in the memory buffer and that a
Seq Scan will be the fastest/least-costly option no matter what (see the very high cost (cost=100000000000)). This chimes nicely with the documentation above ("It is impossible to suppress sequential scans entirely").I've seen this before, quite why it uses the index for my first query and not for this second one is a mystery to me - the source code is a bit above my pay grade I'm afraid.
- As with all answers on StackExchange, there is no substi
Code Snippets
CREATE TABLE test
(
t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
address TEXT,
post_code TEXT,
add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');CREATE INDEX my_field_regexp_idx ON test (add_bis);
CREATE INDEX post_code_regexp_idx ON test(p_c_bis);SELECT * FROM test;t_id address post_code add_bis p_c_bis
1 The Building, Apt 13, Flr 6 abc123 The BUILDING, Apt 13, Flr 6 ABC123
2 The Building, Apt 45, Flr 8 abc456 The BUILDING, Apt 45, Flr 8 ABC456
3 The Building, Apt 45, Flr 9 abc789 The BUILDING, Apt 45, Flr 9 ABC789Context
StackExchange Database Administrators Q#320792, answer score: 3
Revisions (0)
No revisions yet.