HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

How exactly does trigram word-similarity work?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
similaritywordworkdoeshowtrigramexactly

Problem

The docs on the word_similarity function say:


Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string).

But they are a bit vague about what the result should be if the 'first string' contains more than one word. What is the explanation for the following results:

select word_similarity('foo bar','foo bar baz');


| word_similarity |
| :-------------- |
| 1 |

select word_similarity('baz bar','foo bar baz');


| word_similarity |
| :-------------- |
| 1 |

select word_similarity('baz foo','foo bar baz');


| word_similarity |
| :-------------- |
| 0.8 |

dbfiddle here

Solution

It doesn't work like that, from the a mailing list thread in 2017-2018


We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.
Possible solution could be forcing this extent boundaries to be at word
boundaries. However, it would become less convenient to search for part
of word. And we already have users adopt this feature.

The solution was to release pg_trgm 1.4 with new features which should drop with PostgreSQL 11. It will provide strict_word_similarity which will force extent boundaries to be at word boundaries. In addition there will be two new operators ` and <<% which will do strict-word-extent operations on the index.

For more information see the docs on the development version of
pg_trgm`

Context

StackExchange Database Administrators Q#184716, answer score: 4

Revisions (0)

No revisions yet.