HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Postgres LIKE query using a GiST index is just as slow as a full scan

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fullscanpostgresjustquerygistlikeslowusingindex

Problem

What I have is a very simple database that stores paths, extensions and names of files from UNC shares. For testing, I inserted about 1,5 mio rows and the below query uses a GiST index as you see, but still it takes 5 seconds to return. Expected would be a few (like 100) milliseconds.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE 'somevalue'


When using %% in the query, it takes the not that long, even when using sequential scan (?!)

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE '%a%'


I also have the same setup for the name (filename) column, when executing a similar query on that one, it only takes half of the time, even when using %%:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  name LIKE '%a%'


What I already tried cannot be written here in short words. Whatever I do, it gets slow starting from about 1 mio rows. As there is basically never anything deleted, of course vacuuming and reindexing does not help at all.
I cannot really use any other type of search than LIKE %% and a GIN or GiST index because I need to be able to find any character in the columns of interest, not only "words for a specific human language".

Is my expectation that this should work in around 100 milliseconds even for many million more rows that wrong?
Further information

DB create script

Re-try without any text or other index at all, 1.7 mio unique entries

EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Seq Scan on residentfiles  (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1)
  Filter: ((name)::text ~~ '%12345%'::text)
  Rows Removed by Filter: 1604780+
Planning time: 0.596 ms
Execution time: 3318.595 ms


Try with Triagram gin index:

```
CREATE INDEX IF NOT EXISTS name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS parentpath_gin_idx ON residentfiles USING

Solution

In my opinion if you do not share how you did the test, it's very hard to give you an answer. Let see an example of what I mean. Sorry for I used a postgres 11 but the conclusions are the same:

This is a new db, there is nothing running against the instance:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEX


Inserting 500000 rows using a very simple loop.

test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
 count
--------
 500000
(1 row)

test=# select * from test_trgmidx limit 20;
col1        |    col2
------------+------------
 ABCD1EFGH  | abcd1efgh
 ABCD2EFGH  | abcd2efgh
 ABCD3EFGH  | abcd3efgh
 ABCD4EFGH  | abcd4efgh
 ABCD5EFGH  | abcd5efgh
 ABCD6EFGH  | abcd6efgh
 ABCD7EFGH  | abcd7efgh
 ABCD8EFGH  | abcd8efgh
 ABCD9EFGH  | abcd9efgh
 ABCD10EFGH | abcd10efgh
 ABCD11EFGH | abcd11efgh
 ABCD12EFGH | abcd12efgh
 ABCD13EFGH | abcd13efgh
 ABCD14EFGH | abcd14efgh
 ABCD15EFGH | abcd15efgh
 ABCD16EFGH | abcd16efgh
 ABCD17EFGH | abcd17efgh
 ABCD18EFGH | abcd18efgh
 ABCD19EFGH | abcd19efgh
 ABCD20EFGH | abcd20efgh
(20 rows)


Now I restart the instance to have a clean buffer cache, then I run the explain for the first select twice, to see how cache is "disturbing" our results:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30   rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
  Output: col1, col2
  Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
  Buffers: shared hit=19 read=237
 Planning Time: 0.303 ms
 Execution Time: 4.934 ms
(6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
 Output: col1, col2
 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)


It's clear that the first run need to retrive rows from disk (read=237) while the second one need only to access buffer cache (shared hit=232, NO reads).
Now let's do the same for the second select, restart the instance and run the explain twice:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared read=3676
 Planning Time: 0.719 ms
 Execution Time: 94.942 ms
(7 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.081 ms
 Execution Time: 65.878 ms
(7 rows)


As you can see, now the reads come from disk the first run and from buffer the second one. Things are even more complicated because of OS buffers. Is is possible to clean OS cache and re-run everithing having back different results:

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      929920     4105056       93960     2879628     6748994
Swap:       4063228           0     4063228

# echo 3 > /proc/sys/vm/drop_caches

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      802204     6846392       93960      266008     6951156
Swap:       4063228           0     4063228
#


Look at column buff/cache as it drop down from 2879628 to 266008. Now run the explain again (twice for any select):

```
postg

Code Snippets

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEX
test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
 count
--------
 500000
(1 row)

test=# select * from test_trgmidx limit 20;
col1        |    col2
------------+------------
 ABCD1EFGH  | abcd1efgh
 ABCD2EFGH  | abcd2efgh
 ABCD3EFGH  | abcd3efgh
 ABCD4EFGH  | abcd4efgh
 ABCD5EFGH  | abcd5efgh
 ABCD6EFGH  | abcd6efgh
 ABCD7EFGH  | abcd7efgh
 ABCD8EFGH  | abcd8efgh
 ABCD9EFGH  | abcd9efgh
 ABCD10EFGH | abcd10efgh
 ABCD11EFGH | abcd11efgh
 ABCD12EFGH | abcd12efgh
 ABCD13EFGH | abcd13efgh
 ABCD14EFGH | abcd14efgh
 ABCD15EFGH | abcd15efgh
 ABCD16EFGH | abcd16efgh
 ABCD17EFGH | abcd17efgh
 ABCD18EFGH | abcd18efgh
 ABCD19EFGH | abcd19efgh
 ABCD20EFGH | abcd20efgh
(20 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30   rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
  Output: col1, col2
  Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
  Buffers: shared hit=19 read=237
 Planning Time: 0.303 ms
 Execution Time: 4.934 ms
(6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
 Output: col1, col2
 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared read=3676
 Planning Time: 0.719 ms
 Execution Time: 94.942 ms
(7 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.081 ms
 Execution Time: 65.878 ms
(7 rows)
# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      929920     4105056       93960     2879628     6748994
Swap:       4063228           0     4063228

# echo 3 > /proc/sys/vm/drop_caches

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      802204     6846392       93960      266008     6951156
Swap:       4063228           0     4063228
#

Context

StackExchange Database Administrators Q#227848, answer score: 5

Revisions (0)

No revisions yet.