HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlModerate

How to search hyphenated words in PostgreSQL full text search?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlfullsearchwordstexthyphenatedhow

Problem

I have to search for hyphenated words like 'good-morning', 'good-evening', etc.

My query is:

select id, ts_headline(content,
                       to_tsquery('english','good-morning'),
                       'HighlightAll=true MaxFragments=100 FragmentDelimiter=

When executing this query I also get results of 'good' and 'morning' separately. But I want exactly matching words and fragments.

(For ts_content I used the same default config english to create the tsvector.)

How can I search such hyphenated words in PostgreSQL full text search?) from table where ts_content @@ to_tsquery('english','good-morning');


When executing this query I also get results of 'good' and 'morning' separately. But I want exactly matching words and fragments.

(For ts_content I used the same default config english to create the tsvector.)

How can I search such hyphenated words in PostgreSQL full text search?

Solution

The key word here is phrase search, introduced with Postgres 9.6.

Use the "FOLLOWED BY" operator ` or one of the related operators. Or better yet, use the function phraseto_tsquery() to generate your tsquery.

Quoting the manual, it ...

produces
tsquery that searches for a phrase, ignoring punctuation

And:

phraseto_tsquery behaves much like plainto_tsquery, except that it
inserts the
(FOLLOWED BY) operator between surviving words instead
of the
& (AND) operator. Also, stop words are not simply discarded,
but are accounted for by inserting
operators rather than
operators. This function is useful when searching for exact lexeme
sequences, since the FOLLOWED BY operators check lexeme order not just
the presence of all the lexemes.

Your query would work like this:

select id
     , ts_headline(content, phraseto_tsquery('english', 'good-morning')
                          , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=

phraseto_tsquery('english', 'good-morning') generates this tsquery:

'good-morn'  'good'  'morn'


Since "good-morning" is identified as
asciihword (hyphenated ASCII word), the stemmed complete word is added before the components. The manual:

It is possible for the parser to produce overlapping tokens from the
same piece of text. As an example, a hyphenated word will be reported
both as the entire word and as each component: (followed by an example)

to_tsvector() basically does the same on the other end, so everything matches up. This allows for fine-grained options with hyphenated words. The above only finds "good-morning" with a hyphen (or variants stemming to the same). To find all strings with "good" followed by "morn" (or variants stemming to the same) use phraseto_tsquery('english','good morning') generating this tsquery: 'good' 'morn'

OTOH, you can enforce exact matches by adding another filter like:

...
AND content ~* 'good-morning'  -- case insensitive regexp match


Or:

...
AND content ILIKE '%good-morning%'


Seems a bit redundant to the human eye, but this way you get fast full text index support and exact matches.

The latter is mostly equivalent, but different (fewer) characters have special meaning in the
LIKE pattern and might need escaping. Related:

  • PostgreSQL: Regular Expression escape function



  • Pattern matching with LIKE, SIMILAR TO or regular expressions



Example to demonstrate the operator
:

phraseto_tsquery('english', 'Juliet and the Licks') generates this tsquery:

'juliet'  'lick'


meaning that lick must be the third lexeme after juliet`.) from tbl where ts_content @@ phraseto_tsquery('english','good-morning');


phraseto_tsquery('english', 'good-morning') generates this tsquery:

%%CODEBLOCK_1%%

Since "good-morning" is identified as
asciihword (hyphenated ASCII word), the stemmed complete word is added before the components. The manual:

It is possible for the parser to produce overlapping tokens from the
same piece of text. As an example, a hyphenated word will be reported
both as the entire word and as each component: (followed by an example)

to_tsvector() basically does the same on the other end, so everything matches up. This allows for fine-grained options with hyphenated words. The above only finds "good-morning" with a hyphen (or variants stemming to the same). To find all strings with "good" followed by "morn" (or variants stemming to the same) use phraseto_tsquery('english','good morning') generating this tsquery: 'good' 'morn'

OTOH, you can enforce exact matches by adding another filter like:

%%CODEBLOCK_2%%

Or:

%%CODEBLOCK_3%%

Seems a bit redundant to the human eye, but this way you get fast full text index support and exact matches.

The latter is mostly equivalent, but different (fewer) characters have special meaning in the
LIKE pattern and might need escaping. Related:

  • PostgreSQL: Regular Expression escape function



  • Pattern matching with LIKE, SIMILAR TO or regular expressions



Example to demonstrate the operator
:

phraseto_tsquery('english', 'Juliet and the Licks') generates this tsquery:

%%CODEBLOCK_4%%

meaning that lick must be the third lexeme after juliet`.

Code Snippets

select id
     , ts_headline(content, phraseto_tsquery('english', 'good-morning')
                          , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from   tbl 
where  ts_content @@ phraseto_tsquery('english','good-morning');
'good-morn' <-> 'good' <-> 'morn'
...
AND content ~* 'good-morning'  -- case insensitive regexp match
...
AND content ILIKE '%good-morning%'
'juliet' <3> 'lick'

Context

StackExchange Database Administrators Q#204588, answer score: 10

Revisions (0)

No revisions yet.