patternsqlMinor
Postgres LIKE query using a GiST index is just as slow as a full scan
Viewed 0 times
fullscanpostgresjustquerygistlikeslowusingindex
Problem
What I have is a very simple database that stores paths, extensions and names of files from UNC shares. For testing, I inserted about 1,5 mio rows and the below query uses a GiST index as you see, but still it takes 5 seconds to return. Expected would be a few (like 100) milliseconds.
When using
I also have the same setup for the
What I already tried cannot be written here in short words. Whatever I do, it gets slow starting from about 1 mio rows. As there is basically never anything deleted, of course vacuuming and reindexing does not help at all.
I cannot really use any other type of search than
Is my expectation that this should work in around 100 milliseconds even for many million more rows that wrong?
Further information
DB create script
Re-try without any text or other index at all, 1.7 mio unique entries
Try with Triagram gin index:
```
CREATE INDEX IF NOT EXISTS name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS parentpath_gin_idx ON residentfiles USING
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE 'somevalue'When using
%% in the query, it takes the not that long, even when using sequential scan (?!)EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE '%a%'I also have the same setup for the
name (filename) column, when executing a similar query on that one, it only takes half of the time, even when using %%:EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE name LIKE '%a%'What I already tried cannot be written here in short words. Whatever I do, it gets slow starting from about 1 mio rows. As there is basically never anything deleted, of course vacuuming and reindexing does not help at all.
I cannot really use any other type of search than
LIKE %% and a GIN or GiST index because I need to be able to find any character in the columns of interest, not only "words for a specific human language".Is my expectation that this should work in around 100 milliseconds even for many million more rows that wrong?
Further information
DB create script
Re-try without any text or other index at all, 1.7 mio unique entries
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Seq Scan on residentfiles (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1)
Filter: ((name)::text ~~ '%12345%'::text)
Rows Removed by Filter: 1604780+
Planning time: 0.596 ms
Execution time: 3318.595 msTry with Triagram gin index:
```
CREATE INDEX IF NOT EXISTS name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS parentpath_gin_idx ON residentfiles USING
Solution
In my opinion if you do not share how you did the test, it's very hard to give you an answer. Let see an example of what I mean. Sorry for I used a postgres 11 but the conclusions are the same:
This is a new db, there is nothing running against the instance:
Inserting 500000 rows using a very simple loop.
Now I restart the instance to have a clean buffer cache, then I run the explain for the first select twice, to see how cache is "disturbing" our results:
It's clear that the first run need to retrive rows from disk (read=237) while the second one need only to access buffer cache (shared hit=232, NO reads).
Now let's do the same for the second select, restart the instance and run the explain twice:
As you can see, now the reads come from disk the first run and from buffer the second one. Things are even more complicated because of OS buffers. Is is possible to clean OS cache and re-run everithing having back different results:
Look at column buff/cache as it drop down from 2879628 to 266008. Now run the explain again (twice for any select):
```
postg
This is a new db, there is nothing running against the instance:
test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEXInserting 500000 rows using a very simple loop.
test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
count
--------
500000
(1 row)
test=# select * from test_trgmidx limit 20;
col1 | col2
------------+------------
ABCD1EFGH | abcd1efgh
ABCD2EFGH | abcd2efgh
ABCD3EFGH | abcd3efgh
ABCD4EFGH | abcd4efgh
ABCD5EFGH | abcd5efgh
ABCD6EFGH | abcd6efgh
ABCD7EFGH | abcd7efgh
ABCD8EFGH | abcd8efgh
ABCD9EFGH | abcd9efgh
ABCD10EFGH | abcd10efgh
ABCD11EFGH | abcd11efgh
ABCD12EFGH | abcd12efgh
ABCD13EFGH | abcd13efgh
ABCD14EFGH | abcd14efgh
ABCD15EFGH | abcd15efgh
ABCD16EFGH | abcd16efgh
ABCD17EFGH | abcd17efgh
ABCD18EFGH | abcd18efgh
ABCD19EFGH | abcd19efgh
ABCD20EFGH | abcd20efgh
(20 rows)Now I restart the instance to have a clean buffer cache, then I run the explain for the first select twice, to see how cache is "disturbing" our results:
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=19 read=237
Planning Time: 0.303 ms
Execution Time: 4.934 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)It's clear that the first run need to retrive rows from disk (read=237) while the second one need only to access buffer cache (shared hit=232, NO reads).
Now let's do the same for the second select, restart the instance and run the explain twice:
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared read=3676
Planning Time: 0.719 ms
Execution Time: 94.942 ms
(7 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared hit=3676
Planning Time: 0.081 ms
Execution Time: 65.878 ms
(7 rows)As you can see, now the reads come from disk the first run and from buffer the second one. Things are even more complicated because of OS buffers. Is is possible to clean OS cache and re-run everithing having back different results:
# free
total used free shared buff/cache available
Mem: 7914604 929920 4105056 93960 2879628 6748994
Swap: 4063228 0 4063228
# echo 3 > /proc/sys/vm/drop_caches
# free
total used free shared buff/cache available
Mem: 7914604 802204 6846392 93960 266008 6951156
Swap: 4063228 0 4063228
#Look at column buff/cache as it drop down from 2879628 to 266008. Now run the explain again (twice for any select):
```
postg
Code Snippets
test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEXtest=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
count
--------
500000
(1 row)
test=# select * from test_trgmidx limit 20;
col1 | col2
------------+------------
ABCD1EFGH | abcd1efgh
ABCD2EFGH | abcd2efgh
ABCD3EFGH | abcd3efgh
ABCD4EFGH | abcd4efgh
ABCD5EFGH | abcd5efgh
ABCD6EFGH | abcd6efgh
ABCD7EFGH | abcd7efgh
ABCD8EFGH | abcd8efgh
ABCD9EFGH | abcd9efgh
ABCD10EFGH | abcd10efgh
ABCD11EFGH | abcd11efgh
ABCD12EFGH | abcd12efgh
ABCD13EFGH | abcd13efgh
ABCD14EFGH | abcd14efgh
ABCD15EFGH | abcd15efgh
ABCD16EFGH | abcd16efgh
ABCD17EFGH | abcd17efgh
ABCD18EFGH | abcd18efgh
ABCD19EFGH | abcd19efgh
ABCD20EFGH | abcd20efgh
(20 rows)test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=19 read=237
Planning Time: 0.303 ms
Execution Time: 4.934 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared read=3676
Planning Time: 0.719 ms
Execution Time: 94.942 ms
(7 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared hit=3676
Planning Time: 0.081 ms
Execution Time: 65.878 ms
(7 rows)# free
total used free shared buff/cache available
Mem: 7914604 929920 4105056 93960 2879628 6748994
Swap: 4063228 0 4063228
# echo 3 > /proc/sys/vm/drop_caches
# free
total used free shared buff/cache available
Mem: 7914604 802204 6846392 93960 266008 6951156
Swap: 4063228 0 4063228
#Context
StackExchange Database Administrators Q#227848, answer score: 5
Revisions (0)
No revisions yet.