Why does a SELECT (JOIN on regex) get 30x slower with +1 row (query plan stays the same)? Lack of memory?
Problem
Here is my query, it takes 2.7 sec:

```
SELECT t2.id, t1.id FROM sample t1
INNER JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex;
```

The associated EXPLAIN (ANALYSE, BUFFERS) output:

```
Nested Loop  (cost=0.00..1108.30 rows=356 width=8) (actual time=56.130..2740.906 rows=33 loops=1)
  Join Filter: (t1.data ~* regex_set.regex)
  Rows Removed by Join Filter: 65967
  Buffers: local hit=18
  ->  Seq Scan on sample t1  (cost=0.00..38.59 rows=2159 width=36) (actual time=0.014..1.534 rows=2000 loops=1)
        Buffers: local hit=17
  ->  Materialize  (cost=0.00..1.08 rows=33 width=36) (actual time=0.000..0.004 rows=33 loops=2000)
        Buffers: local hit=1
        ->  Limit  (cost=0.00..0.59 rows=33 width=36) (actual time=0.005..0.014 rows=33 loops=1)
              Buffers: local hit=1
              ->  Seq Scan on regex_set  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.004..0.008 rows=33 loops=1)
                    Buffers: local hit=1
Planning time: 0.129 ms
Execution time: 2740.952 ms
```

The same query with limit set to 32 takes only 0.25 sec with the same plan (there is nothing special with row 33: tested limit 32 offset 1, had the same result as below):

```
Nested Loop  (cost=0.00..1075.88 rows=345 width=8) (actual time=5.871..255.315 rows=32 loops=1)
  Join Filter: (t1.data ~* regex_set.regex)
  Rows Removed by Join Filter: 63968
  Buffers: local hit=18
  ->  Seq Scan on sample t1  (cost=0.00..38.59 rows=2159 width=36) (actual time=0.008..0.498 rows=2000 loops=1)
        Buffers: local hit=17
  ->  Materialize  (cost=0.00..1.05 rows=32 width=36) (actual time=0.000..0.002 rows=32 loops=2000)
        Buffers: local hit=1
        ->  Limit  (cost=0.00..0.57 rows=32 width=36) (actual time=0.003..0.013 rows=32 loops=1)
              Buffers: local hit=1
              ->  Seq Scan on regex_set  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.003..0.006 rows=32 loops=1)
                    Buffers: local hit=1
Planning time: 0.109 ms
Execution time: 255.374 ms
```

Here is the initialisation script I used for these tests.

```
DO $
```
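One thing worth noticing in the two plans above: both do nearly the same amount of filter work (the join filter fires once per sample-row/regex pair), so the 10x runtime gap cannot be explained by row counts alone. A quick sanity check on the "Rows Removed by Join Filter" figures:

```python
# Verify the "Rows Removed by Join Filter" numbers from the two plans:
# the nested loop evaluates t1.data ~* t2.regex once per (row, regex)
# pair, and every regex matched exactly one row in this test data.
sample_rows = 2000  # actual rows from the Seq Scan on sample

for regex_rows, matched in [(33, 33), (32, 32)]:
    evaluations = sample_rows * regex_rows  # filter runs once per pair
    removed = evaluations - matched
    print(f"limit {regex_rows}: {evaluations} evaluations, {removed} removed")
```

This reproduces 65967 and 63968 removed rows, i.e. roughly 66k vs. 64k regex evaluations — a ~3% difference in work for a ~10x difference in time.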
Solution
Thanks for the full example. With that and a little help from "perf" and "gdb", I traced the problem down to this:

```
src/backend/utils/adt/regexp.c:#define MAX_CACHED_RES 32
```

Once you are trying to hold 33 regexps in working memory and accessing them in a predictable cycle, each one pushes out of the cache the one that will be needed next, and so every regexp is recompiled every time it is needed. This recompilation is slow.

I don't know why 32 was chosen for that value, but clearly the cache can't be made unlimited in size. Maybe it would be nice if this were user-settable rather than compiled in.

With this knowledge, it is easy to tweak the query to change the order of the join execution so that one regexp is tested to completion before moving on to the next one:

```
SELECT t2.id, t1.id FROM sample t1
RIGHT JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex
```

However, if there are regexps in t2 which match no rows in t1, then this query will start returning null-extended results for those rows, which is a change in behavior from your existing query. And if you try to add a simple WHERE clause to filter out those extra rows, then the planner will see through your trick and go back to using the slow join order. You could try to use a non-transparent equivalent to filter them out:

```
SELECT t2.id, t1.id FROM sample t1
RIGHT JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex
WHERE coalesce(t1.id, -5) != -5
```

(assuming real t1.id are always positive)

But perhaps the best solution to this problem is to build a speciality index which supports regexp queries:

```
create extension pg_trgm;
create index on sample using gin (data gin_trgm_ops);
```

The prospect of using that index will inherently require the faster join order to be used, and the actual use of the index will make the query faster as well.
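The cliff-edge at 32-vs-33 patterns can be reproduced with any LRU-style cache. Below is a minimal Python sketch using `functools.lru_cache` as a stand-in for PostgreSQL's compiled-regex cache (the real cache's replacement policy differs in detail, but the cyclic-access pathology is the same): cycling through exactly as many keys as there are slots hits the cache after the first pass, while one extra key evicts precisely the entry that will be needed next, so every lookup misses.

```python
from functools import lru_cache

def run_cycles(n_patterns, passes=10, cache_size=32):
    """Cycle n_patterns distinct keys through an LRU cache of cache_size slots."""
    @lru_cache(maxsize=cache_size)
    def compile_pattern(i):   # stand-in for expensive regex compilation
        return i
    for _ in range(passes):
        for i in range(n_patterns):
            compile_pattern(i)
    return compile_pattern.cache_info()

print(run_cycles(32))  # 32 misses, 288 hits: each pattern compiled once
print(run_cycles(33))  # 330 misses, 0 hits: recompiled on every access
```

This mirrors the observed behavior: at `limit 32` each regexp is compiled once and reused for all 2000 rows, while at `limit 33` the cache thrashes and every one of the ~66,000 filter evaluations pays the compilation cost.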
Context
StackExchange Database Administrators Q#217193, answer score: 4