snippetsqlMinor
How to query efficiently from Postgres to select special words?
Viewed 0 times
efficientlypostgresquerywordsspecialhowselectfrom
Problem
Let's say that I have a table called
Columns are
In the
NB: we have utf8 words, too.
How to query efficiently to see which words include letters
The output would be:
I have no idea how to do such a thing. It should be efficient because our server would suffer otherwise.
words with very many records.Columns are
id and name.In the
words table I have for example:'systematic', 'سلام','gear','synthesis','mysterious', etc.NB: we have utf8 words, too.
How to query efficiently to see which words include letters
's', 'm' and 'e' (all of them)?The output would be:
systematic,mysteriousI have no idea how to do such a thing. It should be efficient because our server would suffer otherwise.
Solution
An easy approach would be to consider the array of letters corresponding to each word, and search inside that with the
This array can be easily obtained with
[EDIT: per @Erwin's answer,
Here's a demo that first materializes the array as a column in a test table containing a mix of english and french words (~511000 rows, average length=13 characters), and then a second test table without adding the array as a column.
To find a relatively large number of words:
This does a sequential scan as shown by EXPLAIN ANALYZE:
But we can index the array with a GIN index:
And then it's quite faster:
We might even avoid materializing the array as a column by directly indexing the expression, since postgres supports functional indexes.
Then the search has to be made with the exact same expression, and the index gets used:
See GiST and GIN Index Types in the manual for caveats on these types of indexes.
@> (contains) array operator. This works independantly of the letters positions as shown in the example from the manual, i.e. ARRAY[1,4,3] @> ARRAY[3,1] is true.This array can be easily obtained with
regexp_split_to_array(name, '').[EDIT: per @Erwin's answer,
string_to_array(name, NULL) is faster so better use that. It's a drop-in replacement in the rest of the answer]Here's a demo that first materializes the array as a column in a test table containing a mix of english and french words (~511000 rows, average length=13 characters), and then a second test table without adding the array as a column.
=> CREATE TABLE tstword AS
SELECT word_id as id,
wordtext as name,
regexp_split_to_array(wordtext, '') as arr FROM words;To find a relatively large number of words:
=> select count(*) from tstword where arr @> array['s','m','e'];
count
-------
42268
(1 row)
Time: 268.809 msThis does a sequential scan as shown by EXPLAIN ANALYZE:
explain analyze select name from tstword where arr @> array['s','m','e'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on tstword (cost=0.00..17554.46 rows=21256 width=14) (actual time=0.020..268.525 rows=42268 loops=1)
Filter: (arr @> '{s,m,e}'::text[])
Rows Removed by Filter: 468729
Total runtime: 269.927 ms
(4 rows)
Time: 270.414 msBut we can index the array with a GIN index:
CREATE INDEX idx_tst on tstword using gin(arr);And then it's quite faster:
explain analyze select name from tstword where arr @> array['s','m','e'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tstword (cost=252.74..11815.73 rows=21256 width=14) (actual time=46.378..60.203 rows=42268 loops=1)
Recheck Cond: (arr @> '{s,m,e}'::text[])
-> Bitmap Index Scan on idx_tst (cost=0.00..247.42 rows=21256 width=0) (actual time=45.202..45.202 rows=42268 loops=1)
Index Cond: (arr @> '{s,m,e}'::text[])
Total runtime: 61.677 ms
(5 rows)
Time: 70.185 msWe might even avoid materializing the array as a column by directly indexing the expression, since postgres supports functional indexes.
create table tstword2 as select word_id as id,wordtext as name from words;
create index idx_tst2 on tstword2 using gin(regexp_split_to_array(name, ''));Then the search has to be made with the exact same expression, and the index gets used:
explain analyze select name from tstword2 where regexp_split_to_array(name, '') @> array['s','m','e'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tstword2 (cost=40.00..44.02 rows=1 width=14) (actual time=39.390..48.435 rows=42268 loops=1)
Recheck Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
-> Bitmap Index Scan on idx_tst2 (cost=0.00..40.00 rows=1 width=0) (actual time=39.053..39.053 rows=42268 loops=1)
Index Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
Total runtime: 49.748 ms
(5 rows)
Time: 50.193 msSee GiST and GIN Index Types in the manual for caveats on these types of indexes.
Code Snippets
=> CREATE TABLE tstword AS
SELECT word_id as id,
wordtext as name,
regexp_split_to_array(wordtext, '') as arr FROM words;=> select count(*) from tstword where arr @> array['s','m','e'];
count
-------
42268
(1 row)
Time: 268.809 msexplain analyze select name from tstword where arr @> array['s','m','e'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on tstword (cost=0.00..17554.46 rows=21256 width=14) (actual time=0.020..268.525 rows=42268 loops=1)
Filter: (arr @> '{s,m,e}'::text[])
Rows Removed by Filter: 468729
Total runtime: 269.927 ms
(4 rows)
Time: 270.414 msCREATE INDEX idx_tst on tstword using gin(arr);explain analyze select name from tstword where arr @> array['s','m','e'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tstword (cost=252.74..11815.73 rows=21256 width=14) (actual time=46.378..60.203 rows=42268 loops=1)
Recheck Cond: (arr @> '{s,m,e}'::text[])
-> Bitmap Index Scan on idx_tst (cost=0.00..247.42 rows=21256 width=0) (actual time=45.202..45.202 rows=42268 loops=1)
Index Cond: (arr @> '{s,m,e}'::text[])
Total runtime: 61.677 ms
(5 rows)
Time: 70.185 msContext
StackExchange Database Administrators Q#55164, answer score: 5
Revisions (0)
No revisions yet.