HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Karp-Rabin with bitwise hash

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rabinwithhashbitwisekarp

Problem

As a Rags-to-Riches re-implementation of the Karp-Rabin question here, from @tsleyson, I thought it would be interesting to understand the algorithm better, and to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search term's hash, does it double-check that the search actually matches. Because the hash is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").

The description for the algorithm suggests using a 'base' prime number as the foundation for a hashing algorithm that allows you to 'rotate' an out-of-scope character off of the hash, and then add a new member to it. As you 'roll' the hash down the search text, if the hash matches the hash of your pattern, you can then double-check to see if it is just a collision, or a real match.

Instead of using a system requiring repeated multiplications by prime numbers, I decided to use a computationally simpler bit-shifting algorithm (though the logic may be more complicated, the computations should be simpler). In essence, the algorithm I use uses a long value (64-bit) as a hash, and it rotates characters on to the hash, and uses XOR bitwise logic to 'add' or 'remove' the characters.

I am looking for a review of the usability, and any performance suggestions.

```
import java.util.Arrays;

public class KarpRabin {

private static final int[] EMPTY = new int[0];

// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 3

Solution

to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.

This is an optimization, I'm not sure about. It's a bit faster (1 instead of 3 or 4 cycles latency, but probably the same throughput on modern Intel), but the hash quality is worse.

This assumes the use of rotations optimized by the JIT, which is not the case for the reviewed code. It's quite possible that a hand-made rotation is worse than multiplication.

You need no prime, any odd number would do.

// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 31;


For int, the rotation distance of 31 would be equivalent to -1, which is not good as the next char uses about the same bits. For long, it's not bad.

You're speaking about primes, but the shift is modulo 64 and there's no such thing as primes in the ring Z/64. Again, what you need is a number co-prime to 64, i.e., odd. This assures that it takes 64 operations to land in the same position.

I don't think 31 is very good, as after two operations, you get 62, i.e., -2. So using three ASCII chars, a collision can be easily produced as their bits overlap.

When sticking with ASCII, 7 should be fine, but for general strings, I'd suggest at least 16. FWIW, with 17, you're guaranteed that there'll no collisions for up 3 steps. After 4 steps, it totals to 68, i.e., 4, which is not perfect, so I'd go for 19 instead. A measurement with real data could tell us more, while I guess, the answer is that it doesn't matter much.

(base >> (BITS - SHIFT))


Wrong! Use Long.rotateLeft. It does exactly the same, but is clearer and much faster as it's an intrinsic. Whenever the JIT sees Long.rotateLeft, it uses the rotation instruction; when it sees an equivalent code, it doesn't care.

Actually, you need no BITS as the distance is taken modulo 64 (guaranteed in Java, undefined in C).

public static boolean match(final String pattern, final String text)


See my comment on the linked CR concerning indexOf and contains.

Concerning naming, you're looking for a pattern in a text, that's fine, but maybe searching a needle in a haystack would be even clearer.

if (pattern == null || text == null || pattern.length() > text.length()) {
    return false;
}


Don't allow nulls. Just throw as every null means a problem and the earlier you fail, the better. See Guava's arguments.

I'd probably go for

if (pattern.length() >= text.length()) {
    return pattern.equals(text);
}


and simply let it blow on any null.

Someone less lazy would add explicit

import static com.google.common.base.Preconditions.checkNotNull;

checkNotNull(pattern);
checkNotNull(text);


int[] possibles =


I'd call it result.

int pos = 0;
while ((pos = kr.next()) >= 0) {


This should be a for loop.


added in a performance test

Benchmarking Java is hard, have a look at JMH.

Code Snippets

// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 31;
(base << SHIFT | base >>> (BITS - SHIFT))
public static boolean match(final String pattern, final String text)
if (pattern == null || text == null || pattern.length() > text.length()) {
    return false;
}
if (pattern.length() >= text.length()) {
    return pattern.equals(text);
}

Context

StackExchange Code Review Q#70108, answer score: 4

Revisions (0)

No revisions yet.