HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

How fast can we identifiy almost-duplicates in a list of strings?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
fastcanalmosthowidentifiyliststringsduplicates

Problem

I'm having trouble figuring out the upper bound running time for this scenario:

Input:

  • $N$ number of strings



  • $M$ upper bound of string length



  • $T$ threshold for edit distance (2 strings with a Damerau-Levenshtein edit distance lower than $T$ are considered "duplicate")



Expected values

$N \approx 1,000,000$

$M \approx 200$

$T \leq 2$

The algorithm should do the following:

For each string in the list, find any other items in that list that have an edit distance smaller than some threshold, and mark them as "duplicate" (e.g. add them to some other list that tracks duplicates)

I'm having trouble calculating the upper bound for the optimal solution

What is the most optimal upper bound for such an algorithm?

I guess first I need to understand what is the best algorithm for the edit distance itself (is it $O(M \cdot N)$?) then it's simply $N^2$ times that, right? so it is surely "slower" than $N^2$ but my question is how slower?

Solution

Comparing two strings

Algorithms for computing the Levenshtein edit distance between a pair of strings can be found in the Wikipedia page on the edit distance. As that page explains, the running time for computing the edit distance between two strings of length $M$ is $O(M^2)$ time and $O(M)$ space. If you only care about whether the edit distance is $\le T$ or $>T$, then the running time can be reduced to $O(MT)$ time and $O(M)$ space.

Algorithms for computing the (full) Damerau-Levenshtein distance are linked to on the corresponding Wikipedia page. As Wikipedia explains, the (full) Damerau-Levenshtein distance can be computed in $O(M^2)$ time and $O(M^2)$ space; the algorithm is found in the Lowrance and Wagner paper cited there, or in Appendix A of the Bard paper cited there.

Wikipedia also shows how to compute what is known as restricted Damerau-Levenshtein edit distance; that is considerably easier than computing the full Damerau-Levenshtein distance, and the restricted Damerau-Levenshtein edit distance can be computed using a simple modification to the standard dynamic programming algorithm for the Levenshtein edit distance. The running time is $O(M^2)$, and it can be reduced to $O(MT)$ time if you only care about whether the distance is $\le T$ or $>T$.

Wikipedia also claims that for spelling correction and typo detection in natural language, the restricted Damerau-Levenshtein edit distance rarely differs from the full Damerau-Levenshtein distance, so you might as well use the restricted distance as it is easier to understand and implement. I have no personal knowledge of this; just passing along what is claimed in Wikipedia.

I recommend reading Appendix A of the Bard paper cited at Wikipedia for further implementation details.

Comparing all pairs: naive computation of pairwise distances

Therefore, the naive algorithm of computing the edit distance between each pair of strings will take $O(N^2 MT)$ time and $O(M)$ space.

However, for real-world problems of the sort you mentioned, this naive algorithm is typically not optimal. If $T$ is fairly small, there are much better algorithms; the best algorithm depends heavily on $T$. But since you haven't specified $T$ in your question, you haven't provided us enough information to suggest a specific algorithm.

Comparing all pairs: better data structures

A related problem is: given the set of $N$ strings, compute a data structure that allows us to quickly find the closest one in the set to any given string (under edit distance as our measure of "closeness"). There is lots of theoretical work on this problem, e.g., using BK-trees, Levenshtein automata, and many other techniques. See especially Efficient map data structure supporting approximate lookup and https://cstheory.stackexchange.com/q/4165/5038 for an entry into the literature.

In your case, this might be overkill, though.

The case where $T=1$: sorting

In the special case where $T=1$, there is a clean algorithm. In other words, we want to consider two strings $X,Y$ if their edit distance is $1$ (i.e., $X$ can be transformed into $Y$ with a single insertion, deletion, or substitution, transposition).

Notice that if $d(X,Y)=1$, then the single difference must appear in either the first half of the string or in the last half of the string. In other words, either the first half of $X$ matches the first half of $Y$ (they have a long matching prefix), or the second half of $X$ matches the second half of $Y$ (they have a long matching suffix).

This immediately suggests a nice two-stage algorithm:

-
Sort the entire list of strings. For each string $X$, enumerate all other strings $Y$ that agree with $Y$ in its first half and check whether any of them are at edit distance $1$ from $X$. This can be done through a linear scan of a short contiguous stretch of the sorted list, and the start and end of that stretch can be found quickly with binary search.

-
Next, reverse all the strings and do the same thing again.

(Note that you need to be a little bit careful about the definition of "half" above, to take into account that $X,Y$ might differ in their length. However these pesky details can be easily worked out with a little bit of care.)

The space requirement is $O(N)$. Sorting takes $O(N \lg N)$ time. If you're lucky and the strings are fairly random, then hopefully each string will have only a small number of other candidates that match in the first half or last half. If this is the case, the running time might be in the vicinity $O(N \lg N)$.

Does this generalize to $T>1$? Maybe, but it gets messier.

One way to try to generalize is to split the string into $T+1$ equal-length segments, and note that if $d(X,Y)\le T$, then there must be some segment where they match. However, this runs into messy details because the length of $X$ might differ from the length of $Y$. Therefore, if $T>1$, I don't recommend this approach.

Another way to try to generalize is to break each st

Context

StackExchange Computer Science Q#27539, answer score: 3

Revisions (0)

No revisions yet.