HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Why tsquery containing a hyphen isn't found in full text search?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
containingwhyfullsearchtextfoundtsqueryhyphenisn

Problem

Trying to figure out, why the below query isn't returning any results:

select id from story where tsv_story_text @@ to_tsquery('CumEx-Files2.0');


The term exists in the headline of a story in the database (PostgreSQL 11.12) like:
CumEx-Files 2.0 investigation reveals €150bn tax fraud


The only way I could get the query to work was to replace the hyphen with a dash in both the vector and the query which worked, so I tried debugging the phrase, but it seems like the hyphen is but a space symbol - I thought it has to do something with the numerals in the phrase, or is it?

SELECT * FROM ts_debug('english', 'CumEx-Files2.0');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |   lexemes
-----------------+---------------------------------+-------------+----------------+--------------+-------------
 asciihword      | Hyphenated word, all ASCII      | CumEx-Files | {english_stem} | english_stem | {cumex-fil}
 hword_asciipart | Hyphenated word part, all ASCII | CumEx       | {english_stem} | english_stem | {cumex}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | Files       | {english_stem} | english_stem | {file}
 blank           | Space symbols                   |           | {}             |              |
 float           | Decimal notation                | 2.0         | {simple}       | simple       | {2.0}

Solution

There was change of behavior in Postgres 14. (A fix, really, but not back-patched, because change of behavior ...)

The release notes for Postgres 14:

-
Fix to_tsquery() and websearch_to_tsquery() to properly parse query text containing discarded tokens (Alexander Korotkov)

Certain discarded tokens, like underscore, caused the output of these functions to produce incorrect tsquery output, e.g., both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class pg') used to output ( 'pg' & 'class' ) 'pg', but now both output 'pg' 'class' 'pg'.

That's exactly what happens in your case. Compare the results in the following two fiddles:

db<>fiddle here - Postgres 11

db<>fiddle here - Postgres 14

So your query:

SELECT to_tsvector('english', 'CumEx-Files 2.0 investigation reveals €150bn tax fraud') @@ to_tsquery('english', 'CumEx-Files2.0');


... now works as expected in Postgres 14 or later (returns true). Consider upgrading.

Code Snippets

SELECT to_tsvector('english', 'CumEx-Files 2.0 investigation reveals €150bn tax fraud') @@ to_tsquery('english', 'CumEx-Files<->2.0');

Context

StackExchange Database Administrators Q#302750, answer score: 4

Revisions (0)

No revisions yet.