snippetsqlMinor
How to store and query strings matching prefix or suffix in PostgreSQL?
Viewed 0 times
suffixpostgresqlprefixquerystorehowandstringsmatching
Problem
Say I have 1 million words each in 100 different languages (so 100 million words, each with alphabets ranging from 10 characters to 200 characters, although Chinese has 80k "characters" alone! And I would want to support it on Chinese too). These words are stored in the database for a variety of other reasons, so we should ideally use them from the database if possible.
What would you add to a
To do prefix queries in PostgreSQL, (without indexes I think?), you would do something like:
For searching all words starting with
Already, I am considering, based on this answer, using 26 indexes for the English language to be able to find all words that unscramble some input (so
Realistically, most languages have less than 200k words, and it would probably start out with less than 20k words. I am talking way more than the English language, but Tibetan ལྷ་
What would you add to a
words table (assuming the words.text property has the string from any of 100 different languages) to allow for prefix/suffix matching queries? Maybe we have different tables for different languages if absolutely necessary.To do prefix queries in PostgreSQL, (without indexes I think?), you would do something like:
select * from words where text like 'cal%'For searching all words starting with
cal. But is that the most efficient/scalable approach for this size of dataset? I don't have too much experience with these types of queries in my career. If not, what is the recommended, standard, ideal, or otherwise efficient approach (in terms of data size and query performance)? What sorts of indexes would you need to add to make this efficient? If PostgreSQL can't do this efficiently (or SQL for that matter), then at a high level what is the preferred solution? I know that in-memory tries are a good solution, but is that that much better than a PostgreSQL solution (if it is at all possible)?Already, I am considering, based on this answer, using 26 indexes for the English language to be able to find all words that unscramble some input (so
CAUDK finds duck, amongst other things), but not sure if a similar solution could be extended to prefix/suffix searches. Plus, that also doesn't appear that efficient (in terms of data size), to handle larger alphabets, and other types of queries.Realistically, most languages have less than 200k words, and it would probably start out with less than 20k words. I am talking way more than the English language, but Tibetan ལྷ་
Solution
Prefix search in most cases is as efficient as string search, but stops earlier.
For suffix search you have to store your word reversed (desrever) and then search for reversed suffix (de% instead of %ed).
For anagram/scramble search you have to store your word ordered alphabetically (anagram -> aaagmnr)
For suffix search you have to store your word reversed (desrever) and then search for reversed suffix (de% instead of %ed).
For anagram/scramble search you have to store your word ordered alphabetically (anagram -> aaagmnr)
Context
StackExchange Database Administrators Q#313922, answer score: 4
Revisions (0)
No revisions yet.