HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Infix search in millions of strings

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
infixstringssearchmillions

Problem

Let's say we have millions of strings (each of them

-
I'm looking for something working on strings even without meaning (the pattern qys should allow to find the string uyiuqysidi among millions of other strings), on DNA (pattern ATTG found in GGATCATTGAAGG), on sequences (subsequence 1, 4, 8 found in sequence 7, 2, 1, 4, 8, 19, 32), etc.

Solution

Let $\$$ be a symbol not in the alphabet, and let $\{ w_i \}_{i \in \mathbb{N}}$ be the strings you are searching from. Construct the string $S = w_1 \circ \$ \circ w_2 \circ \$ \circ \dots \circ w_n$ and use Ukkonen's algorithm to construct a suffix tree for $S$. You are now able to retrieve all $m$ occurrences of a pattern $P$ in time $\Theta(m + |P|)$, with $\sum_i|w_i|$ preprocessing time.

Context

StackExchange Computer Science Q#90928, answer score: 3

Revisions (0)

No revisions yet.