gotchasqlMinor
Postgres trigram match acting strange for specific characters
Viewed 0 times
actingpostgresmatchstrangeforspecificcharacterstrigram
Problem
I'm using pg_trgm to run a query, but I'm getting a lot of 1 matches when searching with a symbol difference. I have the following query:
In
They all return with a similarity match of 1. Do I need to escape the '$' or something similar?
SELECT my_column, similarity('$ Hello', my_column) AS sml
FROM my_table
WHERE my_column % '$ Hello'
ORDER BY sml DESC, my_column;In
my_table, I have the following:- Hello
? Hello
| Hello
$ Hello
! Hello
!? HelloThey all return with a similarity match of 1. Do I need to escape the '$' or something similar?
Solution
Other answers clarified that trigram similarity is based on alphanumeric characters only. That's why all your examples match 100 %.
You may still be able to make use of a trigram GiST or GIN index and establish your desired sort order with additional
The boolean expression
You can do more, depends on your exact use case and requirements. Example:
Within the same trigram similarity, this sorts exact matches first, then strings containing the exact phrase. And within each subgroup matches the shorter Levenshtein distance first. Alphabetical as final tiebreaker. Related:
Last but not least, you tagged full text search. But your example is based on trigram similarity (provided by the additional module
But punctuation characters are considered noise and stripped in FTS all the same. Same "problem".
It starts with the "default" parser (currently the only one), parsing all these as noise to begin with ...
You may still be able to make use of a trigram GiST or GIN index and establish your desired sort order with additional
ORDER BY expressions. For your demonstrated case:SELECT my_column, similarity('$ Hello', my_column) AS sml
FROM my_table
WHERE my_column % '$ Hello'
ORDER BY sml DESC
, my_column <> '$ Hello' -- !
, my_column;The boolean expression
my_column <> '$ Hello' evaluates to FALSE, TRUE, or NULL - with this sort order. So the exact match (considering all characters) comes first. And this query can still use a trigram index. A trigram GiST index would (still) even support a "nearest neighbor" (KNN) search with LIMIT. Related:- Search in 300 million addresses with pg_trgm
- Optimizing a postgres similarity query (pg_trgm + gin index)
You can do more, depends on your exact use case and requirements. Example:
...
ORDER BY sml DESC
, my_column <> '$ Hello'
, my_column !~ '\$ Hello' -- note $ escaped with \$
, levenshtein(my_column , '$ Hello')
, my_column;Within the same trigram similarity, this sorts exact matches first, then strings containing the exact phrase. And within each subgroup matches the shorter Levenshtein distance first. Alphabetical as final tiebreaker. Related:
- Pattern matching with LIKE, SIMILAR TO or regular expressions
- PostgreSQL 8.4 no levenshtein fuzzymatching?
- Escape function for regular expression or LIKE patterns
Last but not least, you tagged full text search. But your example is based on trigram similarity (provided by the additional module
pg_trgm), which is a largely different concept with completely separate infrastructure and operators. You may want actual full text search instead (and phrase search with that?):- How to search hyphenated words in PostgreSQL full text search?
But punctuation characters are considered noise and stripped in FTS all the same. Same "problem".
ts_debug() shows how your text search configuration classifies identified tokens in a given string (simple configuration in the example).SELECT * FROM ts_debug('simple', '? Hello %&/( 123');It starts with the "default" parser (currently the only one), parsing all these as noise to begin with ...
Code Snippets
SELECT my_column, similarity('$ Hello', my_column) AS sml
FROM my_table
WHERE my_column % '$ Hello'
ORDER BY sml DESC
, my_column <> '$ Hello' -- !
, my_column;...
ORDER BY sml DESC
, my_column <> '$ Hello'
, my_column !~ '\$ Hello' -- note $ escaped with \$
, levenshtein(my_column , '$ Hello')
, my_column;SELECT * FROM ts_debug('simple', '? Hello %&/( 123');Context
StackExchange Database Administrators Q#242356, answer score: 6
Revisions (0)
No revisions yet.