Why does a SELECT (JOIN on regex) get 30x slower with +1 row (query plan stays the same)? Lack of memory?
Problem
Here is my query, it takes 2.7 sec:

```
SELECT t2.id, t1.id FROM sample t1
INNER JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex;
```

The associated EXPLAIN (ANALYSE, BUFFERS) output:

```
Nested Loop  (cost=0.00..1108.30 rows=356 width=8) (actual time=56.130..2740.906 rows=33 loops=1)
  Join Filter: (t1.data ~* regex_set.regex)
  Rows Removed by Join Filter: 65967
  Buffers: local hit=18
  ->  Seq Scan on sample t1  (cost=0.00..38.59 rows=2159 width=36) (actual time=0.014..1.534 rows=2000 loops=1)
        Buffers: local hit=17
  ->  Materialize  (cost=0.00..1.08 rows=33 width=36) (actual time=0.000..0.004 rows=33 loops=2000)
        Buffers: local hit=1
        ->  Limit  (cost=0.00..0.59 rows=33 width=36) (actual time=0.005..0.014 rows=33 loops=1)
              Buffers: local hit=1
              ->  Seq Scan on regex_set  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.004..0.008 rows=33 loops=1)
                    Buffers: local hit=1
Planning time: 0.129 ms
Execution time: 2740.952 ms
```

The same query with limit set to 32 takes only 0.25 sec with the same plan (there is nothing special with row 33: tested limit 32 offset 1, had the same result as below):

```
Nested Loop  (cost=0.00..1075.88 rows=345 width=8) (actual time=5.871..255.315 rows=32 loops=1)
  Join Filter: (t1.data ~* regex_set.regex)
  Rows Removed by Join Filter: 63968
  Buffers: local hit=18
  ->  Seq Scan on sample t1  (cost=0.00..38.59 rows=2159 width=36) (actual time=0.008..0.498 rows=2000 loops=1)
        Buffers: local hit=17
  ->  Materialize  (cost=0.00..1.05 rows=32 width=36) (actual time=0.000..0.002 rows=32 loops=2000)
        Buffers: local hit=1
        ->  Limit  (cost=0.00..0.57 rows=32 width=36) (actual time=0.003..0.013 rows=32 loops=1)
              Buffers: local hit=1
              ->  Seq Scan on regex_set  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.003..0.006 rows=32 loops=1)
                    Buffers: local hit=1
Planning time: 0.109 ms
Execution time: 255.374 ms
```

Here is the initialisation script I used for these tests.

```
DO $
```
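One thing worth noticing in the two plans above: both do nearly the same amount of filter work (the join filter fires once per sample-row/regex pair), so the 10x runtime gap cannot be explained by row counts alone. A quick sanity check on the "Rows Removed by Join Filter" figures:

```python
# Verify the "Rows Removed by Join Filter" numbers from the two plans:
# the nested loop evaluates t1.data ~* t2.regex once per (row, regex)
# pair, and every regex matched exactly one row in this test data.
sample_rows = 2000  # actual rows from the Seq Scan on sample

for regex_rows, matched in [(33, 33), (32, 32)]:
    evaluations = sample_rows * regex_rows  # filter runs once per pair
    removed = evaluations - matched
    print(f"limit {regex_rows}: {evaluations} evaluations, {removed} removed")
```

This reproduces 65967 and 63968 removed rows, i.e. roughly 66k vs. 64k regex evaluations — a ~3% difference in work for a ~10x difference in time.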
Solution
Thanks for the full example. With that and a little help from "perf" and "gdb", I traced the problem down to this:

```
src/backend/utils/adt/regexp.c:#define MAX_CACHED_RES 32
```

Once you are trying to hold 33 regexps in working memory and accessing them in a predictable cycle, each one pushes out of the cache the one that will be needed next, and so every regexp is recompiled every time it is needed. This recompilation is slow.

I don't know why 32 was chosen for that value, but clearly the cache can't be made unlimited in size. Maybe it would be nice if this were user-settable rather than compiled in.

With this knowledge, it is easy to tweak the query to change the order of the join execution so that one regexp is tested to completion before moving on to the next one:

```
SELECT t2.id, t1.id FROM sample t1
RIGHT JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex
```

However, if there are regexps in t2 which match no rows in t1, then this query will start returning null-extended results for those rows, which is a change in behavior from your existing query. And if you try to add a simple WHERE clause to filter out those extra rows, then the planner will see through your trick and go back to using the slow join order. You could try to use a non-transparent equivalent to filter them out:

```
SELECT t2.id, t1.id FROM sample t1
RIGHT JOIN (select * from regex_set limit 33) t2 ON t1.data ~* t2.regex
WHERE coalesce(t1.id, -5) != -5
```

(assuming real t1.id are always positive)

But perhaps the best solution to this problem is to build a speciality index which supports regexp queries:

```
create extension pg_trgm;
create index on sample using gin (data gin_trgm_ops);
```

The prospect of using that index will inherently require the faster join order to be used, and the actual use of the index will make the query faster as well.
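The cliff-edge at 32-vs-33 patterns can be reproduced with any LRU-style cache. Below is a minimal Python sketch using `functools.lru_cache` as a stand-in for PostgreSQL's compiled-regex cache (the real cache's replacement policy differs in detail, but the cyclic-access pathology is the same): cycling through exactly as many keys as there are slots hits the cache after the first pass, while one extra key evicts precisely the entry that will be needed next, so every lookup misses.

```python
from functools import lru_cache

def run_cycles(n_patterns, passes=10, cache_size=32):
    """Cycle n_patterns distinct keys through an LRU cache of cache_size slots."""
    @lru_cache(maxsize=cache_size)
    def compile_pattern(i):   # stand-in for expensive regex compilation
        return i
    for _ in range(passes):
        for i in range(n_patterns):
            compile_pattern(i)
    return compile_pattern.cache_info()

print(run_cycles(32))  # 32 misses, 288 hits: each pattern compiled once
print(run_cycles(33))  # 330 misses, 0 hits: recompiled on every access
```

This mirrors the observed behavior: at `limit 32` each regexp is compiled once and reused for all 2000 rows, while at `limit 33` the cache thrashes and every one of the ~66,000 filter evaluations pays the compilation cost.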
Context
StackExchange Database Administrators Q#217193, answer score: 4