patternsqlMinor
Slow query times for similarity searches with pg_trgm indices
Viewed 0 times
indiceswithqueryslowpg_trgmfortimessimilaritysearches
Problem
We added two pg_trgm indices to a table, to enable fuzzy searching by either email address or name, as we need to find users by name, or email addresses which have been misspelled during signup (e.g. "@gmail.con").
However, doing a ranked search on either of these indices is very slow in the vast majority of cases. i.e. with an increased timeout, a query might return in 60 seconds, on very rare occasions as fast as 15 seconds, but usually queries will time out.
This particular table has over 25 million rows, and is constantly queried, updated, and inserted into (the mean time for each is under 2ms). The setup is PostgreSQL 9.6.6 running on an RDS db.m4.large instance with general purpose SSD storage, and more-or-less default parameters. The pg_trgm extension is version 1.3.
Queries:
-
-
These queries do not need to be run very often (dozens of times a day), but they should be based on the current table state, and ideally return within around 10 seconds.
Schema:
```
=> \d+ users
Table "public.users"
Column | Type | Collation | Nullable | Default | Storage
-------------------+-----------------------------+-----------+----------+---------+----------
id | uuid | | not null | | plain
email | citext | | not null | | extended
email_is_verified | boolean | | not null | | plain
first_name | text
ANALYZE was run after index creation.However, doing a ranked search on either of these indices is very slow in the vast majority of cases. i.e. with an increased timeout, a query might return in 60 seconds, on very rare occasions as fast as 15 seconds, but usually queries will time out.
pg_trgm.similarity_threshold is the default value of 0.3, but bumping this up to 0.8 didn't seem to make a difference.This particular table has over 25 million rows, and is constantly queried, updated, and inserted into (the mean time for each is under 2ms). The setup is PostgreSQL 9.6.6 running on an RDS db.m4.large instance with general purpose SSD storage, and more-or-less default parameters. The pg_trgm extension is version 1.3.
Queries:
-
SELECT *
FROM users
WHERE email % 'chris@example.com'
ORDER BY email 'chris@example.com' LIMIT 10;-
SELECT *
FROM users
WHERE (first_name || ' ' || last_name) % 'chris orr'
ORDER BY (first_name || ' ' || last_name) 'chris orr' LIMIT 10;These queries do not need to be run very often (dozens of times a day), but they should be based on the current table state, and ideally return within around 10 seconds.
Schema:
```
=> \d+ users
Table "public.users"
Column | Type | Collation | Nullable | Default | Storage
-------------------+-----------------------------+-----------+----------+---------+----------
id | uuid | | not null | | plain
email | citext | | not null | | extended
email_is_verified | boolean | | not null | | plain
first_name | text
Solution
You might be able to get better performance with
In the specific case of the email column, you might be better to split them into username and domain, and then query for similar username with exact domain and vice versa. Then the extreme prevalence of the major cloud email providers is less likely to pollute the indexes with trigrams which add little information.
Finally what is the use case for this? Knowing why you need to run these queries could lead to better suggestions. In particular, why would you need to do a similarity search on emails, once they have been verified as being deliverable and going to the correct person? Perhaps you could build a partial index on only the subset of emails which have not been verified yet?
gin_trgm_ops rather than gist_trgm_ops. Which is better is pretty unpredictable, it is sensitive to the distribution of text patterns and lengths in your data and in your query terms. You pretty much just have to try it and see how it works for you. One thing is that the GIN method will be quite sensitive to pg_trgm.similarity_threshold, unlike the GiST method. It will also depend on what version of pg_trgm you have. If you started with an older version of PostgreSQL but updated it with pg_upgrade, you might not have the latest version. The planner does no better at predicting which index type is superior than we can do. So to test it, you can't just create both, you have to drop the other one, to force the planner to use the one you want.In the specific case of the email column, you might be better to split them into username and domain, and then query for similar username with exact domain and vice versa. Then the extreme prevalence of the major cloud email providers is less likely to pollute the indexes with trigrams which add little information.
Finally what is the use case for this? Knowing why you need to run these queries could lead to better suggestions. In particular, why would you need to do a similarity search on emails, once they have been verified as being deliverable and going to the correct person? Perhaps you could build a partial index on only the subset of emails which have not been verified yet?
Context
StackExchange Database Administrators Q#208346, answer score: 2
Revisions (0)
No revisions yet.