HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Efficient algorithm for edit distance for short sequences

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
editshortefficientdistancealgorithmforsequences

Problem

I have an application that needs to compute billions of levenshtein distance between pairs of strings. The strings are short (70 in length) DNA sequences, consisting only of 4 characters. Also it can be assumed that one of the strings is fixed, i.e., we are comparing one fixed string to a billion other strings.

I know that the dynamic programming implementation of the levenshtein distance is $\mathcal{O}(m n)$, would like to know if there are any room for improvement. I found these two algorithms:

  • $\mathcal{O}(n + d^2)$ algorithm, in which $d$ is the edit distance by Berghel et al. However I can't make the assumption that $d$ is small so it might not give any advantage



  • $log(n)^{\mathcal{O}(1/\epsilon)}$ approximation in $n^{1+\epsilon}$ time by Andoni et al. But I have two concerns regarding this:



  • Is this algorithm also fast in practice?



  • Does $log(n)^{\mathcal{O}(1/\epsilon)}$ mean the computed edit distance in worst case is $log(n)^{\mathcal{O}(1/\epsilon)}$ times the actual one? In that case it's too much.



Do you know of any other algorithm/idea/approach that might be applicable?

Solution

One approach is to build a Levenshtein automaton for the fixed string (see, e.g., here). Given a string $x$ and a distance $D$, you can build a DFA that recognizes all strings that are at distance $\le D$ from $x$. Thus, you can test whether a string is close to $x$ in $O(n)$ time, where $n$ is the length of the string. I'm not sure what the space requirements are to store the DFA (they are linear in $m,n$, but might be exponential in $D$).

Alternatively, you could use an "early-out" algorithm for computing the edit distance. You mentioned that you are only interested in the edit distance if it is less than some threshold $D$. There is an "early-out" algorithm for computing the edit distance whose running time is $O(\max(n,m) \times D)$, which computes the edit distance if it is $\le D$ or else outputs "too big" if it is $>D$. Basically, you do the standard dynamic programming algorithm for the edit distance, but only compute the elements of the matrix that are $\le D$ away from the diagonal. In your case this might or might not be better than the other alternatives.

Context

StackExchange Computer Science Q#79358, answer score: 3

Revisions (0)

No revisions yet.