HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Modified Z-Algorithm for string similarity with one char difference

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withcharmodifiedalgorithmonedifferenceforsimilaritystring

Problem

I have modified a Z-Algorithm for string similarity in order to match strings which differ in 0 or 1 char.

However, it seems much slower than original Z-Algorithm although from my point of view there is only a minor different and it also has a O(n) time complexity.

Input is a string in which a pattern is going to be found. So the goal is to find parts in string where pattern is present. The goal is also to find parts in string where the pattern is 'almost present'. So peep, leek and peek are all 'present' in string = "blahpeekblahpeepblahleek" for pattern = "peek". The output are indices in which matches are found. So for my example the output is "4 12 20".

modified Z-Algorithm:

static String stringSimilarity(String string, String pattern ) {
String s = pattern + "$" + string;
int len = s.length();
int patternLen = pattern .length();

int[] z = new int[len];
int l, r, i;
StringBuilder result = new StringBuilder();

l = r = patternLen;      
i = patternLen + 1;      

while (i = patternLen)
        result.append(i - patternLen - 1).append(" ");

    if (i + z[i] - 1 > r) {     
        l = i;          
        r = i + z[i] - 1;
    }

    i += 1;
}

return result.toString();
}


Z-Algorithm body (code is not entirely mine but I adopted it while mine was not that readable):

while (i = patternLen)
        result.append(i - patternLen- 1).append(" ");

    //O(1)
    if (i + z[i] - 1 > r) {     
        l = i;      
        r = i + z[i] - 1;
    }       

    i += 1;     //O(1)
}


Can you please check the complexity of my modified Z-Algorithm?

From my point of view it is the same, however, tests on Hackerank show timeout.

Solution

Why it is slow

In your modified Z-algorithm, you start computing the Z array from i = patternLen + 1. This means that for the whole pattern, your Z array will have value 0. This defeats the purpose of the Z-algorithm. Even if you removed the part where you ignore one mismatched character, this will have changed your search from a \$O(n)\$ search to a \$O(n*m)\$ search, where \$m\$ is the length of the pattern.

For example, I did a test where I searched a string containing 1 million 'a' in a row for a pattern containing 1000 'a' in a row. This took 4 seconds. When I changed the code to start at the beginning of the pattern (i.e. the real Z-algorithm), it took 0.08 seconds.

But...

You can't just start at i=1 and expect to fix the problem. Your "ignore one mismatch" code requires that you start over from the beginning every time because you can't use partial substring matches like you can with the Z-algorithm. The reason is you may have skipped over a mismatch, so the value in the Z array doesn't necessary correspond to the substring that you matched.

So in other words, perhaps Z-algortithm is not well suited for this problem, or at least you will need to modify it in a different way.

Context

StackExchange Code Review Q#142884, answer score: 3

Revisions (0)

No revisions yet.