patternsqlMajor
Best index for similarity function
Viewed 0 times
functionforindexsimilaritybest
Problem
So I have this table with 6.2 millions records and I have to perform search queries with similarity for one for the column.
The queries can be:
More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).
Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.
I can't figure out how I should index my column to make it blazing fast.
EDIT:
Here is the postgres version:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
Here is the table definition:
```
Table "public.lca_test"
Column | Type | Modifiers | Storage | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('lca_test_id_seq'::regclass) | plain | |
raw_id | integer | | plain | |
year | integer |
The queries can be:
SELECT "lca_test".* FROM "lca_test"
WHERE (similarity(job_title, 'sales executive') > 0.6)
AND worksite_city = 'los angeles'
ORDER BY salary ASC LIMIT 50 OFFSET 0More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).
Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.
EXPLAIN ANALYZE of the previously mentioned query gives me this:Limit (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
-> Index Scan using index_lca_test_on_salary on lca_test (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
>>>> Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
>>>> Rows Removed by Filter: 6330130 Total runtime: 33487.802 ms
Total runtime: 33487.802 msI can't figure out how I should index my column to make it blazing fast.
EDIT:
Here is the postgres version:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
Here is the table definition:
```
Table "public.lca_test"
Column | Type | Modifiers | Storage | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('lca_test_id_seq'::regclass) | plain | |
raw_id | integer | | plain | |
year | integer |
Solution
Worth mentioning that you installed the additional module
Similarity operator
Whatever else you do, use the similarity operator
To get the desired minimum similarity of
(In Postgres 9.6 or older use the deprecated
The setting stays for the rest of your session until reset. Check with:
(Used to be
Simple case
Just getting the best matches in column
To also include an equality condition on
Then:
Query:
`
pg_trgm, which provides the similarity() function.Similarity operator
%Whatever else you do, use the similarity operator
% instead of the expression (similarity(job_title, 'sales executive') > 0.6). Index support is bound to operators in Postgres, not to functions.To get the desired minimum similarity of
0.6, set the GUC parameter:SET pg_trgm.similarity_threshold = 0.6; -- once per session(In Postgres 9.6 or older use the deprecated
SELECT set_limit(0.6);)The setting stays for the rest of your session until reset. Check with:
SHOW pg_trgm.similarity_threshold;(Used to be
SELECT show_limit();)Simple case
Just getting the best matches in column
job_title for the given string would be a plain case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);To also include an equality condition on
worksite_city you would need the additional module btree_gist. Run (once per DB):CREATE EXTENSION btree_gist;Then:
CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);Query:
SET pg_trgm.similarity_threshold = 0.6 -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY (job_title 'sales executive')
LIMIT 50;`
being the "distance" operator:
one minus the similarity() value.
Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - when combining the two columns like you did.
Your case
However, your query sorts by salary, not by distance or similarity, which is something else entirely. Now we can use both GIN and GiST index, and GIN will be faster. (Even more so in later version with major improvements to GIN indexes - upgrade hint!)
Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);
Query:
SET pg_trgm.similarity_threshold = 0.6; -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY salary
LIMIT 50; -- OFFSET 0
Again, this also works (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.
Further reading:
- How is LIKE implemented?
- Finding similar strings with PostgreSQL quickly
- Pattern matching with LIKE, SIMILAR TO or regular expressions
Asides
-
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?
-
You have some dubious data types:
employement_start_date | character varying
employement_end_date | character varying
Seems like those should be date`. Etc.Code Snippets
SET pg_trgm.similarity_threshold = 0.6; -- once per sessionSHOW pg_trgm.similarity_threshold;CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);CREATE EXTENSION btree_gist;CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);Context
StackExchange Database Administrators Q#103821, answer score: 29
Revisions (0)
No revisions yet.