HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

Data structure or algorithm for quickly finding differences between strings

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
quicklystringsalgorithmdifferencesbetweenforstructurefindingdata

Problem

I have an array of 100,000 strings, all of length $k$. I want to compare each string to every other string to see if any two strings differ by 1 character. Right now, as I add each string to the array, I'm checking it against every string already in the array, which has a time complexity of $\frac{n(n-1)}{2} k$.

Is there a data structure or algorithm that can compare strings to each other faster than what I'm already doing?

Some additional information:

-
Order matters: abcde and xbcde differ by 1 character, while abcde and edcba differ by 4 characters.

-
For each pair of strings that differ by one character, I will be removing one of those strings from the array.

-
Right now, I'm looking for strings that differ by only 1 character, but it would be nice if that 1 character difference could be increased to, say, 2, 3, or 4 characters. However, in this case, I think efficiency is more important than the ability to increase the character-difference limit.

-
$k$ is usually in the range of 20-40.

Solution

My solution is similar to j_random_hacker's but uses only a single hash set.

I would create a hash set of strings.
For each string in the input, add to the set $k$ strings.
In each of these strings replace one of the letters with a special character, not found in any of the strings.
While you add them, check that they are not already in the set. If they are then you have two strings that only differ by (at most) one character.

An example with strings 'abc', 'adc'

For abc we add 'bc', 'ac' and 'ab*'

For adc we add 'dc', 'ac' and 'ad*'

When we add 'a*c' the second time we notice it is already in the set, so we know that there are two strings that only differ by one letter.

The total running time of this algorithm is $O(n*k^2)$.
This is because we create $k$ new strings for all $n$ strings in the input. For each of those strings we need to calculate the hash, which typically takes $O(k)$ time.

Storing all the strings takes $O(n*k^2)$ space.

Further improvements

We can improve the algorithm further by not storing the modified strings directly but instead storing an object with a reference to the original string and the index of the character that is masked. This way we do not need to create all of the strings and we only need $O(n*k)$ space to store all of the objects.

You will need to implement a custom hash function for the objects. We can take the Java implementation as an example, see the java documentation. The java hashCode multiplies the unicode value of each character with $31^{k-i}$ (with $k$ the string length and $i$ the one-based index of the character. Note that each altered string only differs by one character from the original. We can easily compute the contribution of that character to the hash code. We can subtract that and add our masking character instead. This takes $O(1)$ to compute. This allows us to bring the total running time down to $O(n*k)$

Context

StackExchange Computer Science Q#93467, answer score: 15

Revisions (0)

No revisions yet.