patternjavaModerate
Collection of exact string matching algorithms in Java
Viewed 0 times
exactcollectionjavaalgorithmsstringmatching
Problem
(See the next iteration.)
I have this small collection of exact string matching algorithms:
The main research question was comparing performance of all the algorithms in two different settings:
The results are:
net.coderodde.string.matching.StringMatchers.java:
`package net.coderodde.string.matching;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
/**
* This class contains different string matching algorithms as static methods.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Nov 7, 2015)
*/
public final class StringMatchers {
public static final int NOT_FOUND_INDEX = -1;
public static final class KnuthMorrisPrattMatcher {
/**
* This static method implements the
*
* Knuth-Morris-Pratt pattern matching algorithm that runs in time
* {@code O(n + m)}, where {@code n} is the length of the text to search
* and {@code m} is the length of the pattern to search.
*
* @param text the text to search in.
* @param pattern the pattern to search for.
* @param startIndex the character index from which to start matching.
I have this small collection of exact string matching algorithms:
- Knuth-Morris-Pratt algorithm,
- Finite automaton matcher,
- Rabin-Karp algorithm,
- Z algorithm.
The main research question was comparing performance of all the algorithms in two different settings:
- Input which degrades performance of the naïve string matcher (
String.indexOf), where both the text and the pattern are of the formataaaa...aab,
- Both text and the pattern are random.
The results are:
[WORST CASE OF String.indexOf]
String.indexOf in 6976 milliseconds.
Knuth-Morris-Pratt matcher in 56 milliseconds.
Finite automaton matcher in 98 milliseconds.
Rabin-Karp matcher in 74 milliseconds.
Z matcher in 89 milliseconds.
[RANDOM STRINGS]
[SEED: 1446895693075]
String.indexOf in 238 milliseconds.
Knuth-Morris-Pratt matcher in 303 milliseconds.
Finite automaton matcher in 371 milliseconds.
Rabin-Karp matcher in 909 milliseconds.
Z matcher in 705 milliseconds.
net.coderodde.string.matching.StringMatchers.java:
`package net.coderodde.string.matching;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
/**
* This class contains different string matching algorithms as static methods.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Nov 7, 2015)
*/
public final class StringMatchers {
public static final int NOT_FOUND_INDEX = -1;
public static final class KnuthMorrisPrattMatcher {
/**
* This static method implements the
*
* Knuth-Morris-Pratt pattern matching algorithm that runs in time
* {@code O(n + m)}, where {@code n} is the length of the text to search
* and {@code m} is the length of the pattern to search.
*
* @param text the text to search in.
* @param pattern the pattern to search for.
* @param startIndex the character index from which to start matching.
Solution
The Matcher implementations
You group all these matchers in a
All macher's
The
Ultimately you could even refactor this to an
In all algorithms you use single letter variable names. As a rule this is not done. For well documented algorithms it could be permissable if the published algorithm defines these variables. I've checked the
The StringMatchers test
All StringMatcher implementations should pass the same tests. Since they now have a common interface, we can test against the interface.
We could make an abstract test class and 4 subclasses to test each implementation.
Or (even better) we can use the
Either way, you'll want to make a separate test method for each assertion. That way it's immediately clear which specific test case fails.
PerformanceDemo
You measure elapsed time using `System.currentTimeMillis
You group all these matchers in a
StringMatchers class. Other than grouping the matchers together, this class adds no real benefit.All macher's
match() methods are static. Yet all have the same match() methods. This just aches to be refactored to all matchers implementing the same interface.public interface StringMatcher {
int NOT_FOUND_INDEX = -1;
int match(String text, String pattern, int startIndex);
default int match(String text, String pattern) {
return match(text, pattern, 0);
}
}The
StringMatchers class can still house the 4 implementations, but rather than exposing the implementation classes, we'll make the classes private and add factory methods to the StringMatchers class.public final class StringMatchers {
public static StringMatcher knuthMorrisPrattMatcher() {
return new KnuthMorrisPrattMatcher();
}
public static StringMatcher automatonMatcher() {
return new AutomatonMatcher();
}
public static StringMatcher rabinKarpMatcher() {
return new RabinKarpMatcher();
}
public static StringMatcher zMatcher() {
return new ZMatcher();
}
private static final class KnuthMorrisPrattMatcher implements StringMatcher { ... }
private static final class AutomatonMatcher implements StringMatcher { ... }
private static final class RabinKarpMatcher implements StringMatcher { ... }
private static final class ZMatcher implements StringMatcher { ...}
}Ultimately you could even refactor this to an
enum.In all algorithms you use single letter variable names. As a rule this is not done. For well documented algorithms it could be permissable if the published algorithm defines these variables. I've checked the
KnuthMorrisPrattMatcher, and found that the referenced material uses different variable names. So either correct that, or use meaningful variable names. Otherwise, in 6 months time, even you will be scratching your head when reviewing this code.The StringMatchers test
All StringMatcher implementations should pass the same tests. Since they now have a common interface, we can test against the interface.
We could make an abstract test class and 4 subclasses to test each implementation.
public abstract class StringMatchersTest {
abstract StringMatcher getMatcherToTest();
@Test
public void testMatcher() {
assertEquals(0, getMatcherToTest().match("acacaga", "acacaga"));
assertEquals(-1, getMatcherToTest().match("aaa", "aaaa"));
assertEquals(0, getMatcherToTest().match("aaaa", "aaaa"));
assertEquals(-1, getMatcherToTest().match("aaaa", "bb"));
assertEquals(1, getMatcherToTest().match("abbb", "bb"));
assertEquals(2, getMatcherToTest().match("abcc", "cc"));
assertEquals(5, getMatcherToTest().match("aaaaaaab", "aab"));
assertEquals(4, getMatcherToTest().match("ababababaca", "ababaca"));
assertTrue("".indexOf("") == getMatcherToTest().match("", ""));
assertTrue("".indexOf("a") == getMatcherToTest().match("", "a"));
assertTrue("a".indexOf("") == getMatcherToTest().match("a", ""));
assertTrue("hello".indexOf("ello", -2) == getMatcherToTest().match("hello", "ello", -2));
assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 5));
assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 4));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 3));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 2));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 1));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", 0));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", -1));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", -2));
}
}
public class KnuthMorrisPrattMatcherTest extends StringMatchersTest {
@Override
StringMatcher getMatcherToTest() {
return StringMatchers.knuthMorrisPrattMatcher();
}
}Or (even better) we can use the
Parameterized Junit runner :@RunWith(Parameterized.class)
public class StringMatchersTest {
@Parameterized.Parameters
public static List getParameters() {
return asList(
new Object[]{ StringMatchers.knuthMorrisPrattMatcher() },
new Object[]{ StringMatchers.automatonMatcher() },
new Object[]{ StringMatchers.rabinKarpMatcher() },
new Object[]{ StringMatchers.zMatcher() }
);
}
private final StringMatcher matcher;
public StringMatchersTest(StringMatcher matcher) {
this.matcher = matcher;
}
@Test
public void testMatcher() { ... }
}Either way, you'll want to make a separate test method for each assertion. That way it's immediately clear which specific test case fails.
PerformanceDemo
You measure elapsed time using `System.currentTimeMillis
Code Snippets
public interface StringMatcher {
int NOT_FOUND_INDEX = -1;
int match(String text, String pattern, int startIndex);
default int match(String text, String pattern) {
return match(text, pattern, 0);
}
}public final class StringMatchers {
public static StringMatcher knuthMorrisPrattMatcher() {
return new KnuthMorrisPrattMatcher();
}
public static StringMatcher automatonMatcher() {
return new AutomatonMatcher();
}
public static StringMatcher rabinKarpMatcher() {
return new RabinKarpMatcher();
}
public static StringMatcher zMatcher() {
return new ZMatcher();
}
private static final class KnuthMorrisPrattMatcher implements StringMatcher { ... }
private static final class AutomatonMatcher implements StringMatcher { ... }
private static final class RabinKarpMatcher implements StringMatcher { ... }
private static final class ZMatcher implements StringMatcher { ...}
}public abstract class StringMatchersTest {
abstract StringMatcher getMatcherToTest();
@Test
public void testMatcher() {
assertEquals(0, getMatcherToTest().match("acacaga", "acacaga"));
assertEquals(-1, getMatcherToTest().match("aaa", "aaaa"));
assertEquals(0, getMatcherToTest().match("aaaa", "aaaa"));
assertEquals(-1, getMatcherToTest().match("aaaa", "bb"));
assertEquals(1, getMatcherToTest().match("abbb", "bb"));
assertEquals(2, getMatcherToTest().match("abcc", "cc"));
assertEquals(5, getMatcherToTest().match("aaaaaaab", "aab"));
assertEquals(4, getMatcherToTest().match("ababababaca", "ababaca"));
assertTrue("".indexOf("") == getMatcherToTest().match("", ""));
assertTrue("".indexOf("a") == getMatcherToTest().match("", "a"));
assertTrue("a".indexOf("") == getMatcherToTest().match("a", ""));
assertTrue("hello".indexOf("ello", -2) == getMatcherToTest().match("hello", "ello", -2));
assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 5));
assertEquals(-1, getMatcherToTest().match("aabaab", "aab", 4));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 3));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 2));
assertEquals(3, getMatcherToTest().match("aabaab", "aab", 1));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", 0));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", -1));
assertEquals(0, getMatcherToTest().match("aabaab", "aab", -2));
}
}
public class KnuthMorrisPrattMatcherTest extends StringMatchersTest {
@Override
StringMatcher getMatcherToTest() {
return StringMatchers.knuthMorrisPrattMatcher();
}
}@RunWith(Parameterized.class)
public class StringMatchersTest {
@Parameterized.Parameters
public static List<Object[]> getParameters() {
return asList(
new Object[]{ StringMatchers.knuthMorrisPrattMatcher() },
new Object[]{ StringMatchers.automatonMatcher() },
new Object[]{ StringMatchers.rabinKarpMatcher() },
new Object[]{ StringMatchers.zMatcher() }
);
}
private final StringMatcher matcher;
public StringMatchersTest(StringMatcher matcher) {
this.matcher = matcher;
}
@Test
public void testMatcher() { ... }
}Context
StackExchange Code Review Q#110097, answer score: 12
Revisions (0)
No revisions yet.