HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaModerate

Collection of exact string matching algorithms in Java

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
exactcollectionjavaalgorithmsstringmatching

Problem

(See the next iteration.)

I have this small collection of exact string matching algorithms:

  • Knuth-Morris-Pratt algorithm,



  • Finite automaton matcher,



  • Rabin-Karp algorithm,



  • Z algorithm.



The main research question was comparing performance of all the algorithms in two different settings:

  • Input which degrades performance of the naïve string matcher (String.indexOf), where both the text and the pattern are of the format aaaa...aab,



  • Both text and the pattern are random.



The results are:

[WORST CASE OF String.indexOf]
String.indexOf in 6976 milliseconds.
Knuth-Morris-Pratt matcher in 56 milliseconds.
Finite automaton matcher in 98 milliseconds.
Rabin-Karp matcher in 74 milliseconds.
Z matcher in 89 milliseconds.

[RANDOM STRINGS]
[SEED: 1446895693075]
String.indexOf in 238 milliseconds.
Knuth-Morris-Pratt matcher in 303 milliseconds.
Finite automaton matcher in 371 milliseconds.
Rabin-Karp matcher in 909 milliseconds.
Z matcher in 705 milliseconds.


net.coderodde.string.matching.StringMatchers.java:

`package net.coderodde.string.matching;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/**
* This class contains different string matching algorithms as static methods.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Nov 7, 2015)
*/
public final class StringMatchers {

public static final int NOT_FOUND_INDEX = -1;

public static final class KnuthMorrisPrattMatcher {

/**
* This static method implements the
*
* Knuth-Morris-Pratt pattern matching algorithm that runs in time
* {@code O(n + m)}, where {@code n} is the length of the text to search
* and {@code m} is the length of the pattern to search.
*
* @param text the text to search in.
* @param pattern the pattern to search for.
* @param startIndex the character index from which to start matching.

Solution

The Matcher implementations

You group all these matchers in a StringMatchers class. Other than grouping the matchers together, this class adds no real benefit.
All macher's match() methods are static. Yet all have the same match() methods. This just aches to be refactored to all matchers implementing the same interface.

public interface StringMatcher {

    int NOT_FOUND_INDEX = -1;

    int match(String text, String pattern, int startIndex);

    default int match(String text, String pattern) {
        return match(text, pattern, 0);
    }
}


The StringMatchers class can still house the 4 implementations, but rather than exposing the implementation classes, we'll make the classes private and add factory methods to the StringMatchers class.

public final class StringMatchers {

    public static StringMatcher knuthMorrisPrattMatcher() {
        return new KnuthMorrisPrattMatcher();
    }

    public static StringMatcher automatonMatcher() {
        return new AutomatonMatcher();
    }

    public static StringMatcher rabinKarpMatcher() {
        return new RabinKarpMatcher();
    }

    public static StringMatcher zMatcher() {
        return new ZMatcher();
    }

    private static final class KnuthMorrisPrattMatcher implements StringMatcher { ... }

    private static final class AutomatonMatcher implements StringMatcher { ... }

    private static final class RabinKarpMatcher implements StringMatcher { ... }

    private static final class ZMatcher implements StringMatcher { ...}
}


Ultimately you could even refactor this to an enum.

In all algorithms you use single letter variable names. As a rule this is not done. For well documented algorithms it could be permissable if the published algorithm defines these variables. I've checked the KnuthMorrisPrattMatcher, and found that the referenced material uses different variable names. So either correct that, or use meaningful variable names. Otherwise, in 6 months time, even you will be scratching your head when reviewing this code.

The StringMatchers test

All StringMatcher implementations should pass the same tests. Since they now have a common interface, we can test against the interface.
We could make an abstract test class and 4 subclasses to test each implementation.

public abstract class StringMatchersTest {

    abstract StringMatcher getMatcherToTest();

    @Test
    public void testMatcher() {
        assertEquals(0, getMatcherToTest().match("acacaga", "acacaga"));
        assertEquals(-1, getMatcherToTest().match("aaa", "aaaa"));
        assertEquals(0,  getMatcherToTest().match("aaaa", "aaaa"));
        assertEquals(-1, getMatcherToTest().match("aaaa", "bb"));
        assertEquals(1,  getMatcherToTest().match("abbb", "bb"));
        assertEquals(2,  getMatcherToTest().match("abcc", "cc"));

        assertEquals(5, getMatcherToTest().match("aaaaaaab", "aab"));
        assertEquals(4, getMatcherToTest().match("ababababaca", "ababaca"));

        assertTrue("".indexOf("") == getMatcherToTest().match("", ""));
        assertTrue("".indexOf("a") == getMatcherToTest().match("", "a"));
        assertTrue("a".indexOf("") == getMatcherToTest().match("a", ""));
        assertTrue("hello".indexOf("ello", -2) ==  getMatcherToTest().match("hello", "ello", -2));

        assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 5));
        assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 4));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 3));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 2));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 1));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", 0));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", -1));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", -2));
    }

}

public class KnuthMorrisPrattMatcherTest extends StringMatchersTest {
    @Override
    StringMatcher getMatcherToTest() {
        return StringMatchers.knuthMorrisPrattMatcher();
    }
}


Or (even better) we can use the Parameterized Junit runner :

@RunWith(Parameterized.class)
public class StringMatchersTest {

    @Parameterized.Parameters
    public static List getParameters() {
        return asList(
                new Object[]{ StringMatchers.knuthMorrisPrattMatcher() },
                new Object[]{ StringMatchers.automatonMatcher() },
                new Object[]{ StringMatchers.rabinKarpMatcher() },
                new Object[]{ StringMatchers.zMatcher() }
        );
    }

    private final StringMatcher matcher;

    public StringMatchersTest(StringMatcher matcher) {
        this.matcher = matcher;
    }

    @Test
    public void testMatcher() { ... }
}


Either way, you'll want to make a separate test method for each assertion. That way it's immediately clear which specific test case fails.

PerformanceDemo

You measure elapsed time using `System.currentTimeMillis

Code Snippets

public interface StringMatcher {

    int NOT_FOUND_INDEX = -1;

    int match(String text, String pattern, int startIndex);

    default int match(String text, String pattern) {
        return match(text, pattern, 0);
    }
}
public final class StringMatchers {

    public static StringMatcher knuthMorrisPrattMatcher() {
        return new KnuthMorrisPrattMatcher();
    }

    public static StringMatcher automatonMatcher() {
        return new AutomatonMatcher();
    }

    public static StringMatcher rabinKarpMatcher() {
        return new RabinKarpMatcher();
    }

    public static StringMatcher zMatcher() {
        return new ZMatcher();
    }

    private static final class KnuthMorrisPrattMatcher implements StringMatcher { ... }

    private static final class AutomatonMatcher implements StringMatcher { ... }

    private static final class RabinKarpMatcher implements StringMatcher { ... }

    private static final class ZMatcher implements StringMatcher { ...}
}
public abstract class StringMatchersTest {

    abstract StringMatcher getMatcherToTest();

    @Test
    public void testMatcher() {
        assertEquals(0, getMatcherToTest().match("acacaga", "acacaga"));
        assertEquals(-1, getMatcherToTest().match("aaa", "aaaa"));
        assertEquals(0,  getMatcherToTest().match("aaaa", "aaaa"));
        assertEquals(-1, getMatcherToTest().match("aaaa", "bb"));
        assertEquals(1,  getMatcherToTest().match("abbb", "bb"));
        assertEquals(2,  getMatcherToTest().match("abcc", "cc"));

        assertEquals(5, getMatcherToTest().match("aaaaaaab", "aab"));
        assertEquals(4, getMatcherToTest().match("ababababaca", "ababaca"));

        assertTrue("".indexOf("") == getMatcherToTest().match("", ""));
        assertTrue("".indexOf("a") == getMatcherToTest().match("", "a"));
        assertTrue("a".indexOf("") == getMatcherToTest().match("a", ""));
        assertTrue("hello".indexOf("ello", -2) ==  getMatcherToTest().match("hello", "ello", -2));

        assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 5));
        assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 4));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 3));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 2));
        assertEquals(3, getMatcherToTest().match("aabaab", "aab", 1));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", 0));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", -1));
        assertEquals(0, getMatcherToTest().match("aabaab", "aab", -2));
    }

}

public class KnuthMorrisPrattMatcherTest extends StringMatchersTest {
    @Override
    StringMatcher getMatcherToTest() {
        return StringMatchers.knuthMorrisPrattMatcher();
    }
}
@RunWith(Parameterized.class)
public class StringMatchersTest {

    @Parameterized.Parameters
    public static List<Object[]> getParameters() {
        return asList(
                new Object[]{ StringMatchers.knuthMorrisPrattMatcher() },
                new Object[]{ StringMatchers.automatonMatcher() },
                new Object[]{ StringMatchers.rabinKarpMatcher() },
                new Object[]{ StringMatchers.zMatcher() }
        );
    }

    private final StringMatcher matcher;

    public StringMatchersTest(StringMatcher matcher) {
        this.matcher = matcher;
    }

    @Test
    public void testMatcher() { ... }
}

Context

StackExchange Code Review Q#110097, answer score: 12

Revisions (0)

No revisions yet.