patternjavaMinor
Aho-Corasick for multiple exact string matching in Java
Viewed 0 times
exactjavaformultipleahostringmatchingcorasick
Problem
I have this program that compares performance of two algorithm for multiple exact string matching: Given a set of patterns and the text, find in the text all occurrences of any pattern. For example:
Above,
The first algorithm is Aho-Corasick, and the other one is brute force.
Note, however, that this implementations considers only
AbstractMultipleExactStringMatcher.java:
```
package net.coderodde.patternmatching;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
/**
* This interface defines the API for multiple exact string matching algorithms.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Jan 1, 2016)
*/
public abstract class AbstractMultipleExactStringMatcher {
public abstract List match(String text, String... patterns);
protected String[] filterPatterns(String[] patterns) {
Set filter = new HashSet<>(Arrays.asList(patterns));
return filter.toArray(new String[filter.size()]);
}
/**
* This class represents a match.
*/
public static final class MatchingResult {
/**
* The index of the pattern being matched.
*/
public final int patternIndex;
/**
* The index of the last character in a pattern indexed by
* {@code patternIndex}.
*/
public final int concludingIndex;
public MatchingResult(int patternIndex, int concludingIndex) {
this.patternIndex = patternIndex;
this.concludingIndex = concludingIndex;
}
@Override
public boolean equals(Object o) {
if (o == null) {
return false;
}
if (!getClass().equals(o.getClass())) {
return false;
Text: habababa
Patterns: aba, haAbove,
aba occurs 3 times, and ha occurs one time (matches are allowed to overlap).The first algorithm is Aho-Corasick, and the other one is brute force.
Note, however, that this implementations considers only
a, b, c, ..., x, y, z to be the alphabet. Perhaps, I will address this limitation in later posts. AbstractMultipleExactStringMatcher.java:
```
package net.coderodde.patternmatching;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
/**
* This interface defines the API for multiple exact string matching algorithms.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Jan 1, 2016)
*/
public abstract class AbstractMultipleExactStringMatcher {
public abstract List match(String text, String... patterns);
protected String[] filterPatterns(String[] patterns) {
Set filter = new HashSet<>(Arrays.asList(patterns));
return filter.toArray(new String[filter.size()]);
}
/**
* This class represents a match.
*/
public static final class MatchingResult {
/**
* The index of the pattern being matched.
*/
public final int patternIndex;
/**
* The index of the last character in a pattern indexed by
* {@code patternIndex}.
*/
public final int concludingIndex;
public MatchingResult(int patternIndex, int concludingIndex) {
this.patternIndex = patternIndex;
this.concludingIndex = concludingIndex;
}
@Override
public boolean equals(Object o) {
if (o == null) {
return false;
}
if (!getClass().equals(o.getClass())) {
return false;
Solution
I tested your code on a set of 100k patterns and an alphabet of 2.5k symbols and automaton construction took ages.
After a closer look, I found that the failure function construction part keeps iterating over the whole alphabet, where instead it could have just used the children of the current trie node -- which are typically like an order of magnitude fewer in number than the size of the alphabet :) (to me it was two orders of magnitude)
This optimization sped up the automaton construction by the corresponding orders of magnitude.
After a closer look, I found that the failure function construction part keeps iterating over the whole alphabet, where instead it could have just used the children of the current trie node -- which are typically like an order of magnitude fewer in number than the size of the alphabet :) (to me it was two orders of magnitude)
This optimization sped up the automaton construction by the corresponding orders of magnitude.
Context
StackExchange Code Review Q#115624, answer score: 5
Revisions (0)
No revisions yet.