patternjavaMinor
Karp-Rabin with bitwise hash
Viewed 0 times
rabinwithhashbitwisekarp
Problem
As a Rags-to-Riches re-implementation of the Karp-Rabin question here, from @tsleyson, I thought it would be interesting to understand the algorithm better, and to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.
Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search term's hash, does it double-check that the search actually matches. Because the hash is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").
The description for the algorithm suggests using a 'base' prime number as the foundation for a hashing algorithm that allows you to 'rotate' an out-of-scope character off of the hash, and then add a new member to it. As you 'roll' the hash down the search text, if the hash matches the hash of your pattern, you can then double-check to see if it is just a collision, or a real match.
Instead of using a system requiring repeated multiplications by prime numbers, I decided to use a computationally simpler bit-shifting algorithm (though the logic may be more complicated, the computations should be simpler). In essence, the algorithm I use uses a long value (64-bit) as a hash, and it rotates characters on to the hash, and uses XOR bitwise logic to 'add' or 'remove' the characters.
I am looking for a review of the usability, and any performance suggestions.
```
import java.util.Arrays;
public class KarpRabin {
private static final int[] EMPTY = new int[0];
// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 3
Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search term's hash, does it double-check that the search actually matches. Because the hash is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").
The description for the algorithm suggests using a 'base' prime number as the foundation for a hashing algorithm that allows you to 'rotate' an out-of-scope character off of the hash, and then add a new member to it. As you 'roll' the hash down the search text, if the hash matches the hash of your pattern, you can then double-check to see if it is just a collision, or a real match.
Instead of using a system requiring repeated multiplications by prime numbers, I decided to use a computationally simpler bit-shifting algorithm (though the logic may be more complicated, the computations should be simpler). In essence, the algorithm I use uses a long value (64-bit) as a hash, and it rotates characters on to the hash, and uses XOR bitwise logic to 'add' or 'remove' the characters.
I am looking for a review of the usability, and any performance suggestions.
```
import java.util.Arrays;
public class KarpRabin {
private static final int[] EMPTY = new int[0];
// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 3
Solution
to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.
This is an optimization, I'm not sure about. It's a bit faster (1 instead of 3 or 4 cycles latency, but probably the same throughput on modern Intel), but the hash quality is worse.
This assumes the use of rotations optimized by the JIT, which is not the case for the reviewed code. It's quite possible that a hand-made rotation is worse than multiplication.
You need no prime, any odd number would do.
For
You're speaking about primes, but the shift is modulo 64 and there's no such thing as primes in the ring
I don't think 31 is very good, as after two operations, you get 62, i.e., -2. So using three ASCII chars, a collision can be easily produced as their bits overlap.
When sticking with ASCII, 7 should be fine, but for general strings, I'd suggest at least 16. FWIW, with 17, you're guaranteed that there'll no collisions for up 3 steps. After 4 steps, it totals to 68, i.e., 4, which is not perfect, so I'd go for 19 instead. A measurement with real data could tell us more, while I guess, the answer is that it doesn't matter much.
Wrong! Use
Actually, you need no
See my comment on the linked CR concerning
Concerning naming, you're looking for a
Don't allow
I'd probably go for
and simply let it blow on any
Someone less lazy would add explicit
I'd call it
This should be a
added in a performance test
Benchmarking Java is hard, have a look at JMH.
This is an optimization, I'm not sure about. It's a bit faster (1 instead of 3 or 4 cycles latency, but probably the same throughput on modern Intel), but the hash quality is worse.
This assumes the use of rotations optimized by the JIT, which is not the case for the reviewed code. It's quite possible that a hand-made rotation is worse than multiplication.
You need no prime, any odd number would do.
// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 31;For
int, the rotation distance of 31 would be equivalent to -1, which is not good as the next char uses about the same bits. For long, it's not bad.You're speaking about primes, but the shift is modulo 64 and there's no such thing as primes in the ring
Z/64. Again, what you need is a number co-prime to 64, i.e., odd. This assures that it takes 64 operations to land in the same position.I don't think 31 is very good, as after two operations, you get 62, i.e., -2. So using three ASCII chars, a collision can be easily produced as their bits overlap.
When sticking with ASCII, 7 should be fine, but for general strings, I'd suggest at least 16. FWIW, with 17, you're guaranteed that there'll no collisions for up 3 steps. After 4 steps, it totals to 68, i.e., 4, which is not perfect, so I'd go for 19 instead. A measurement with real data could tell us more, while I guess, the answer is that it doesn't matter much.
(base >> (BITS - SHIFT))Wrong! Use
Long.rotateLeft. It does exactly the same, but is clearer and much faster as it's an intrinsic. Whenever the JIT sees Long.rotateLeft, it uses the rotation instruction; when it sees an equivalent code, it doesn't care.Actually, you need no
BITS as the distance is taken modulo 64 (guaranteed in Java, undefined in C).public static boolean match(final String pattern, final String text)See my comment on the linked CR concerning
indexOf and contains.Concerning naming, you're looking for a
pattern in a text, that's fine, but maybe searching a needle in a haystack would be even clearer.if (pattern == null || text == null || pattern.length() > text.length()) {
return false;
}Don't allow
nulls. Just throw as every null means a problem and the earlier you fail, the better. See Guava's arguments.I'd probably go for
if (pattern.length() >= text.length()) {
return pattern.equals(text);
}and simply let it blow on any
null.Someone less lazy would add explicit
import static com.google.common.base.Preconditions.checkNotNull;
checkNotNull(pattern);
checkNotNull(text);int[] possibles =I'd call it
result.int pos = 0;
while ((pos = kr.next()) >= 0) {This should be a
for loop.added in a performance test
Benchmarking Java is hard, have a look at JMH.
Code Snippets
// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
// long bit mask.
private static final int SHIFT = 31;(base << SHIFT | base >>> (BITS - SHIFT))public static boolean match(final String pattern, final String text)if (pattern == null || text == null || pattern.length() > text.length()) {
return false;
}if (pattern.length() >= text.length()) {
return pattern.equals(text);
}Context
StackExchange Code Review Q#70108, answer score: 4
Revisions (0)
No revisions yet.