HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Karp-Rabin string matching in Java

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rabinjavastringmatchingkarp

Problem

I wrote an implementation of the Karp-Rabin string matching algorithm in Java 7, based on the discussion in Section 32.2 of Introduction to Algorithms (CLRS). Clearly, I need more experience with numeric programming, because I spent two days wrestling with overflow issues and floating point error, but I believe it works now.

Of course, anything goes, but I'm especially looking for feedback on the following points:

  • Are there any cases left where it would fail?



  • How can I improve the design?



  • How can I make the code more readable?



  • How can I make the code more robust?



  • Did I miss any glaring performance traps?



Here's the class constructor and the match method, which do the bulk of the work.

public class KarpRabin {
    final long q, b;
    public static final int NONE = -1;
    Map digitMap;

    public KarpRabin(String alphabet) {
        digitMap = new HashMap<>();
        for (int i = 0; i = q) {
            throw new IllegalArgumentException("b cannot be larger than " + q);
        }
    }

    public int match(String pattern, String text) {
        // Returns location of first match of pattern in text.
        if (pattern.length() > text.length()) {
            throw new IllegalArgumentException("Pattern longer than text.");
        }
        int[] T = digitValue(text);
        int[] P = digitValue(pattern);
        long h, p, t;
        p = t = 0;
        h = expt(b, pattern.length()-1).mod(BigInteger.valueOf(q)).longValue(); // See note [7].

        // Calculate fingerprint of pattern and of first
        // pattern.length-length group in text.
        for (int i = 0; i = 0 : "t is below 0, has value " + t;
       }
       // See note [4].
       if (p == t) {
           if (pattern.equals(text.substring(
               text.length() - pattern.length(), text.length())))
               return text.length() - pattern.length();
       }
       return NONE;
}


Here's a gist with the rest of the code, including the private helper methods.

A

Solution

Clearly, I need more experience with numeric programming, because I spent two days wrestling with overflow issues and floating point error

I'd suggest to just avoid it all. Use mod \$2^{32}\$ arithmetic, ignore overflow, forget floating point and forget BigInteger.

I'm afraid, the problem is not the code, but the algorithm. I can see that the Wikipedia really describes all the strange things concerning modular arithmetic, but I'm pretty sure that you took it too literally.

For a rolling hash, all you need is something like

h = BASE_MULTIPLIER * h + inChar + OUT_MULTIPLIER * outChar;


where inChar is the incoming character, BASE_MULTIPLIER is any number co-prime to \$2^{32}\$ (i.e., odd) and OUT_MULTIPLIER is its \$n\$-th power mod \$2^{32}\$, where \$n\$ is the length of the sliding window.

If you insist on using a prime instead of \$2^{32}\$, then you can do it all without BigInteger as long as the prime fits in an int (use long and compute modulus after each operation).

If you're still unsatisfied with the hash function, then there are simple ways to improve it (just ask).

throw new IllegalArgumentException("Pattern longer than text.");


What's illegal here? It's just not there, cf. "x".contains("abc").

throw new IllegalArgumentException("b cannot be larger than " + q);


It can... IIUIC, it's just hashing, so there's no reason to whine because of collisions as long as they're rare.


How can I make the code more robust?

Don't require to specify the alphabet and similar things. It's not necessary.


Did I miss any glaring performance traps?

Both % and BigInteger are performance killers.


Because of the way I implemented the algorithm, supporting general Unicode strings in UCS-2 or UTF-16 would require doing base-65,536 arithmetic.

If you implemented it my way, there'd be no problem.


There's no hard limit on the size, but larger alphabets mean more BigInteger math.

Do no BigInteger at all.


Although the current code is Java 7, I plan to switch to Java 8 in the near future

Just don't. If you see something where Java 8 could obviously help a lot, go for it. Otherwise, I wouldn't care. I'm not saying that Java 8 is no big improvement, it's just that sticking with Java 7 or 6 makes your program more usable.

Code Snippets

h = BASE_MULTIPLIER * h + inChar + OUT_MULTIPLIER * outChar;
throw new IllegalArgumentException("Pattern longer than text.");
throw new IllegalArgumentException("b cannot be larger than " + q);

Context

StackExchange Code Review Q#69678, answer score: 4

Revisions (0)

No revisions yet.