patternsqlMajor

Best index for similarity function

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

functionforindexsimilaritybest

Problem

So I have this table with 6.2 millions records and I have to perform search queries with similarity for one for the column.
The queries can be:

SELECT  "lca_test".* FROM "lca_test"
 WHERE (similarity(job_title, 'sales executive') > 0.6)
 AND worksite_city = 'los angeles' 
 ORDER BY salary ASC LIMIT 50 OFFSET 0

More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).

Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.

EXPLAIN ANALYZE of the previously mentioned query gives me this:

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
>>>> Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
>>>> Rows Removed by Filter: 6330130 Total runtime: 33487.802 ms
Total runtime: 33487.802 ms

Solution

Worth mentioning that you installed the additional module pg_trgm, which provides the similarity() function.
Similarity operator %

Whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, set the GUC parameter:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

(In Postgres 9.6 or older use the deprecated SELECT set_limit(0.6);)

The setting stays for the rest of your session until reset. Check with:

SHOW pg_trgm.similarity_threshold;

(Used to be SELECT show_limit();)
Simple case

Just getting the best matches in column job_title for the given string would be a plain case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):

CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);

To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SET pg_trgm.similarity_threshold = 0.6  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title  'sales executive')
LIMIT  50;

 being the "distance" operator:

one minus the

similarity()

 value.

Postgres can also combine two separate indexes, a plain btree index on

worksite_city, and a separate GiST index on job_title

, but the multicolumn index should be fastest - when combining the two columns like you did.
Your case

However, your query sorts by

salary

, not by distance or similarity, which is something else entirely. Now we can use both GIN and GiST index, and GIN will be faster. (Even more so in later version with major improvements to GIN indexes - upgrade hint!)

Similar story for the additional equality check on

worksite_city: install the additional module btree_gin

. Run (once per DB):

CREATE EXTENSION btree_gin;


Then:

CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);


Query:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50; -- OFFSET 0


Again, this also works (less efficiently) with the simpler index you already have (

"index_lcas_job_title_trigram"

), possibly in combination with other indexes. The best solution depends on the complete picture.

Further reading:

How is LIKE implemented?



Finding similar strings with PostgreSQL quickly



Pattern matching with LIKE, SIMILAR TO or regular expressions



Asides

- 
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?

- 
You have some dubious data types:

employement_start_date | character varying
  employement_end_date   | character varying


Seems like those should be

date`. Etc.

Code Snippets

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SHOW pg_trgm.similarity_threshold;

CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);

CREATE EXTENSION btree_gist;

CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);

Context

StackExchange Database Administrators Q#103821, answer score: 29

Revisions (0)

No revisions yet.