HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Postgres full text search with unaccent and inflection (conjugation, etc.)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fullsearchpostgreswithtextinflectionconjugationandunaccentetc

Problem

I want to be able to search unaccented phrases in an inflected (Polish) language in Postgres.

Say, if a document contains robiłem, the lexeme should be robić (the infinivite). Its forms are robię, robił, robiła and so on. I want to be able to find it, for example, with a phrase robie which is unaccented robię.

What I did is I started out with a perfectly well working polish text search config

CREATE TEXT SEARCH DICTIONARY polish_ispell (
TEMPLATE = pg_catalog.ispell,
dictfile = 'polish', afffile = 'polish', stopwords = 'polish' );


Then I tried to extend it to include the unaccent.

create extension unaccent;
create text search configuration polish_unaccented (copy = polish);
ALTER TEXT SEARCH CONFIGURATION polish_unaccented   ALTER MAPPING FOR hword, 
hword_part, word WITH unaccen, polish_ispell, simple, ;


Sadly, lexems are not created correctly with this config:

select to_tsvector('polish_unaccented' ,'robił');

'robil':1


The lexem should be of course:

'robić':1


So the below cant return true (and that's what I need I think):

select to_tsvector('polish_unaccented','robić') @@ to_tsquery('polish_unaccented','robie');


I've googled but did not find any documents showing how to really configure Postgres for my case. The docs only show the lame 'Hôtels' example, which is not a 'lexemed' word.

Cheers

Solution

Wow, this was fun. So I wrote a program to do this for you called pg_hunspell

pg_hunspell pl PL polish

SELECT to_tsvector('polish' ,'robił');
 to_tsvector 
-------------
 'robić':1
(1 row)

Code Snippets

pg_hunspell pl PL polish

SELECT to_tsvector('polish' ,'robił');
 to_tsvector 
-------------
 'robić':1
(1 row)

Context

StackExchange Database Administrators Q#176471, answer score: 3

Revisions (0)

No revisions yet.