HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL - Index efficiently on REGEX_REPLACE()

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlindexefficientlyregex_replace

Problem

I have a query which is designed to loop and search addresses for duplicates, the query uses REGEX_REPLACE. I am trying to index on the regex as on doing an explain and its doing a sequential scan on the user_property table with a filter on the regex

```
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\

Solution

I found this an interesing question +1!

In order to answer, I went down the old-fashioned route and did some testing.

All of the code below can be found on the fiddle here. The strategy employed uses GENERATED columns (Manual). You could also use expression (aka functional) indexes - see note at the bottom of this answer).

CREATE TABLE test
(
  t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  address TEXT,
  post_code TEXT,
  add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
  p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);


and populate with a few records:

INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');


and now, the secret sauce - indexing those GENERATED columns:

CREATE INDEX my_field_regexp_idx ON test (add_bis);

CREATE INDEX post_code_regexp_idx ON test(p_c_bis);


and just to check (always check):

SELECT * FROM test;


Result:

t_id    address     post_code   add_bis     p_c_bis
1   The Building, Apt 13, Flr 6     abc123  The BUILDING, Apt 13, Flr 6     ABC123
2   The Building, Apt 45, Flr 8     abc456  The BUILDING, Apt 45, Flr 8     ABC456
3   The Building, Apt 45, Flr 9     abc789  The BUILDING, Apt 45, Flr 9     ABC789


First, we run this:

SET enable_seqscan = OFF;


-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.

-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.

The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without enable_seqscan = OFF, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.

From the documentation here:

-
enable_seqscan (boolean)

Enables or disables the query planner's use of sequential scan plan types. It is impossible to suppress sequential scans entirely,
but turning this variable off discourages the planner from using one
if there are other methods available. The default is on.

[Emphasis mine] - also, see the discussion below.

and then we run:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT 
  *
FROM
  test
WHERE add_bis LIKE 'The BUILDING';


Result:

QUERY PLAN
Index Scan using address_regexp_idx on public.test  (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)
  Output: t_id, address, post_code, add_bis, p_c_bis
  Index Cond: (test.add_bis = 'The BUILDING'::text)
  Filter: (test.add_bis ~~ 'The BUILDING'::text)
  Buffers: shared read=1
Planning:
  Buffers: shared hit=22
Planning Time: 0.150 ms
Execution Time: 0.042 ms


Excellent - the result we want - an Index Scan:

Index Scan using address_regexp_idx on public.test  (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)


Then, we run:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
  p_c_bis
FROM test
WHERE p_c_bis = 'ABC123';


where the data requested is entirely contained in the index (of the REGEXP_REPLACE()ed column) and et voilà:

Index Only Scan using post_code_regexp_idx on public.test  (cost=0.13..8.15 rows=1 width=32) (actual time=0.039..0.040 rows=1 loops=1)
  Output: p_c_bis
  Index Cond: (test.p_c_bis = 'ABC123'::text)
  Heap Fetches: 1
  Buffers: shared hit=1 read=1
Planning Time: 0.043 ms
Execution Time: 0.064 ms


We have an Index Only Scan which is what we wanted! There's another example of an Index Only Scan in the fiddle.

Interestingly, when I rerun this (at the end):

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
  *
FROM test
WHERE add_bis LIKE 'The BUILDING%';


Result:

Seq Scan on public.test  (cost=10000000000.00..10000000001.04 rows=1 width=132) (actual time=0.008..0.009 rows=3 loops=1)
  Output: t_id, address, post_code, add_bis, p_c_bis
  Filter: (test.add_bis ~~ 'The BUILDING%'::text)
  Buffers: shared hit=1
Planning Time: 0.038 ms
Execution Time: 0.027 ms


I can only surmise (a nice word for "guess"!) that at this point, the optimiser knows that the table is in the memory buffer and that a Seq Scan will be the fastest/least-costly option no matter what (see the very high cost (cost=100000000000)). This chimes nicely with the documentation above ("It is impossible to suppress sequential scans entirely").

I've seen this before, quite why it uses the index for my first query and not for this second one is a mystery to me - the source code is a bit above my pay grade I'm afraid.

  • As with all answers on StackExchange, there is no substi

Code Snippets

CREATE TABLE test
(
  t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  address TEXT,
  post_code TEXT,
  add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
  p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);
INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');
CREATE INDEX my_field_regexp_idx ON test (add_bis);

CREATE INDEX post_code_regexp_idx ON test(p_c_bis);
SELECT * FROM test;
t_id    address     post_code   add_bis     p_c_bis
1   The Building, Apt 13, Flr 6     abc123  The BUILDING, Apt 13, Flr 6     ABC123
2   The Building, Apt 45, Flr 8     abc456  The BUILDING, Apt 45, Flr 8     ABC456
3   The Building, Apt 45, Flr 9     abc789  The BUILDING, Apt 45, Flr 9     ABC789

Context

StackExchange Database Administrators Q#320792, answer score: 3

Revisions (0)

No revisions yet.