HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to query efficiently from Postgres to select special words?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
efficientlypostgresquerywordsspecialhowselectfrom

Problem

Let's say that I have a table called words with very many records.

Columns are id and name.

In the words table I have for example:

'systematic', 'سلام','gear','synthesis','mysterious', etc.


NB: we have utf8 words, too.

How to query efficiently to see which words include letters 's', 'm' and 'e' (all of them)?

The output would be:

systematic,mysterious


I have no idea how to do such a thing. It should be efficient because our server would suffer otherwise.

Solution

An easy approach would be to consider the array of letters corresponding to each word, and search inside that with the @> (contains) array operator. This works independantly of the letters positions as shown in the example from the manual, i.e. ARRAY[1,4,3] @> ARRAY[3,1] is true.

This array can be easily obtained with regexp_split_to_array(name, '').

[EDIT: per @Erwin's answer, string_to_array(name, NULL) is faster so better use that. It's a drop-in replacement in the rest of the answer]

Here's a demo that first materializes the array as a column in a test table containing a mix of english and french words (~511000 rows, average length=13 characters), and then a second test table without adding the array as a column.

=> CREATE TABLE tstword AS
    SELECT word_id as id,
    wordtext as name,
    regexp_split_to_array(wordtext, '') as arr FROM words;


To find a relatively large number of words:

=> select count(*) from tstword where arr @> array['s','m','e'];
 count 
-------
 42268
(1 row)

Time: 268.809 ms


This does a sequential scan as shown by EXPLAIN ANALYZE:

explain analyze select name from tstword where arr @> array['s','m','e'];
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on tstword  (cost=0.00..17554.46 rows=21256 width=14) (actual time=0.020..268.525 rows=42268 loops=1)
   Filter: (arr @> '{s,m,e}'::text[])
   Rows Removed by Filter: 468729
 Total runtime: 269.927 ms
(4 rows)

Time: 270.414 ms


But we can index the array with a GIN index:

CREATE INDEX idx_tst on tstword using gin(arr);


And then it's quite faster:

explain analyze select name from tstword where arr @> array['s','m','e'];
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tstword  (cost=252.74..11815.73 rows=21256 width=14) (actual time=46.378..60.203 rows=42268 loops=1)
   Recheck Cond: (arr @> '{s,m,e}'::text[])
   ->  Bitmap Index Scan on idx_tst  (cost=0.00..247.42 rows=21256 width=0) (actual time=45.202..45.202 rows=42268 loops=1)
         Index Cond: (arr @> '{s,m,e}'::text[])
 Total runtime: 61.677 ms
(5 rows)

Time: 70.185 ms


We might even avoid materializing the array as a column by directly indexing the expression, since postgres supports functional indexes.

create table tstword2 as select word_id as id,wordtext as name from words;
create index idx_tst2 on tstword2  using gin(regexp_split_to_array(name, ''));


Then the search has to be made with the exact same expression, and the index gets used:

explain analyze select name from tstword2 where regexp_split_to_array(name, '') @> array['s','m','e'];
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tstword2  (cost=40.00..44.02 rows=1 width=14) (actual time=39.390..48.435 rows=42268 loops=1)
   Recheck Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
   ->  Bitmap Index Scan on idx_tst2  (cost=0.00..40.00 rows=1 width=0) (actual time=39.053..39.053 rows=42268 loops=1)
         Index Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
 Total runtime: 49.748 ms
(5 rows)

Time: 50.193 ms


See GiST and GIN Index Types in the manual for caveats on these types of indexes.

Code Snippets

=> CREATE TABLE tstword AS
    SELECT word_id as id,
    wordtext as name,
    regexp_split_to_array(wordtext, '') as arr FROM words;
=> select count(*) from tstword where arr @> array['s','m','e'];
 count 
-------
 42268
(1 row)

Time: 268.809 ms
explain analyze select name from tstword where arr @> array['s','m','e'];
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on tstword  (cost=0.00..17554.46 rows=21256 width=14) (actual time=0.020..268.525 rows=42268 loops=1)
   Filter: (arr @> '{s,m,e}'::text[])
   Rows Removed by Filter: 468729
 Total runtime: 269.927 ms
(4 rows)

Time: 270.414 ms
CREATE INDEX idx_tst on tstword using gin(arr);
explain analyze select name from tstword where arr @> array['s','m','e'];
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tstword  (cost=252.74..11815.73 rows=21256 width=14) (actual time=46.378..60.203 rows=42268 loops=1)
   Recheck Cond: (arr @> '{s,m,e}'::text[])
   ->  Bitmap Index Scan on idx_tst  (cost=0.00..247.42 rows=21256 width=0) (actual time=45.202..45.202 rows=42268 loops=1)
         Index Cond: (arr @> '{s,m,e}'::text[])
 Total runtime: 61.677 ms
(5 rows)

Time: 70.185 ms

Context

StackExchange Database Administrators Q#55164, answer score: 5

Revisions (0)

No revisions yet.