HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Aho-Corasick for multiple exact string matching in Java

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
exactjavaformultipleahostringmatchingcorasick

Problem

I have this program that compares performance of two algorithm for multiple exact string matching: Given a set of patterns and the text, find in the text all occurrences of any pattern. For example:

Text:     habababa
Patterns: aba, ha


Above, aba occurs 3 times, and ha occurs one time (matches are allowed to overlap).

The first algorithm is Aho-Corasick, and the other one is brute force.

Note, however, that this implementations considers only a, b, c, ..., x, y, z to be the alphabet. Perhaps, I will address this limitation in later posts.

AbstractMultipleExactStringMatcher.java:

```
package net.coderodde.patternmatching;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
* This interface defines the API for multiple exact string matching algorithms.
*
* @author Rodion "rodde" Efremov
* @version 1.6 (Jan 1, 2016)
*/
public abstract class AbstractMultipleExactStringMatcher {

public abstract List match(String text, String... patterns);

protected String[] filterPatterns(String[] patterns) {
Set filter = new HashSet<>(Arrays.asList(patterns));
return filter.toArray(new String[filter.size()]);
}

/**
* This class represents a match.
*/
public static final class MatchingResult {

/**
* The index of the pattern being matched.
*/
public final int patternIndex;

/**
* The index of the last character in a pattern indexed by
* {@code patternIndex}.
*/
public final int concludingIndex;

public MatchingResult(int patternIndex, int concludingIndex) {
this.patternIndex = patternIndex;
this.concludingIndex = concludingIndex;
}

@Override
public boolean equals(Object o) {
if (o == null) {
return false;
}

if (!getClass().equals(o.getClass())) {
return false;

Solution

I tested your code on a set of 100k patterns and an alphabet of 2.5k symbols and automaton construction took ages.

After a closer look, I found that the failure function construction part keeps iterating over the whole alphabet, where instead it could have just used the children of the current trie node -- which are typically like an order of magnitude fewer in number than the size of the alphabet :) (to me it was two orders of magnitude)

This optimization sped up the automaton construction by the corresponding orders of magnitude.

Context

StackExchange Code Review Q#115624, answer score: 5

Revisions (0)

No revisions yet.