HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

What is an efficient data structure for prefix matching?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
whatefficientforstructureprefixdatamatching

Problem

I'm looking for an data structure that supports efficient random prefix matching queries (pattern) over a previously known set of words (dictionary). The dictionary is expected to contain about 10,000 words of varying length (I haven't calculated average word length, but I don't expect any word to be more than 80 characters long). Prefix matching in this case would be equivalent to words[i].toLowerCase().startsWith(pattern.toLowerCase()).

A Trie is an obvious choice, and provides linear time search corresponding to the length of the pattern. However, I'm confused whether a Suffix Tree, or a Suffix Array, might provide any improvements over a Trie. It seems a Suffix whatever is commonly used for one text, not multiple. I also have no requirement for supporting the various cool use cases (longest common prefix, or number of times a prefix occurs etc) that Suffix whatever can efficiently support.

I'm looking for a recommendation on which data structure to use, and why. Tries use a lot of space, and if I end up using one, I'd consider compress the branches with nodes with outdegree 1.

For the duplicate button happy readers, I've read this and this question, none seemed directly relevant.

Example:

Dictionary: [banana, apple, banter, orange]

Pattern: ban
Return: banana (any match)

Pattern: grapes
Return: null

Solution

A trie is asymptotically optimal for this. No data structure can achieve better asymptotic running time.

If you care about constant factors, the only way to know what will be optimal is to try multiple approaches and benchmark them. Standard theoretical running time analysis is not reliable at predicting constant factors.

Another data structure to consider is to store every prefix of each word in a hashtable. This will increase the space usage by about 10x (if the average word length is 10) but might speed up lookups in practice. The asymptotic running time will remain the same. You'll have to decide whether you're willing to trade off space for time.

Suffix trees and suffix arrays don't seem relevant. You're asking about prefix matching, not suffix matching or matching in the middle of the string.

Context

StackExchange Computer Science Q#104771, answer score: 6

Revisions (0)

No revisions yet.