How do we find the optimal modulus q in Rabin-Karp algorithm?

Problem

I asked a similar question here on the Rabin-Karp algorithm. My present question is: how do we find the best $q$ (i.e., the modulus)? What is the criterion? We need to choose a $q$ that is quick to compute with and also results in fewer spurious hits, right?

How do we ensure these things?

Solution

The Rabin-Karp algorithm looks for a substring by computing a rolling hash of the form $h(a_{n - 1} a_{n - 2} \ldots a_0) = \sum_{0 \le k \le n - 1} a_k q^k$ for a prime $q$. Note that the algorithm works just the same if $h$ is computed modulo $2^w$, where $w$ is the word size in bits, so using e.g. unsigned int in C (whose arithmetic wraps around in exactly this way) would be wise. We'd like $q^{n - 1}$ to be not too large, so that the needed term $q^{n - 1} a_{n - 1}$ doesn't overflow and all computations stay in ordinary unsigned integers. Clearly $n$ (the length of the pattern) is the limiting factor here, so you want a smallish prime. One option is to use the largest prime $q$ such that, with $A$ the largest character value (for UTF-8 it's essentially $2^8$),
$$
2^w > q^{n - 1} A
$$
This guarantees no collisions, but is very restrictive.
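To get a feel for how restrictive this bound is, here is a minimal C sketch that computes the largest prime $q$ satisfying it. The function name largest_base and its helpers are made up for illustration; the sketch assumes $n \ge 2$ and $w \le 64$, and uses plain trial division, which is only reasonable for small results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if q^e <= limit, checked without ever overflowing. */
static bool power_fits(uint64_t q, unsigned e, uint64_t limit)
{
    uint64_t p = 1;
    for (unsigned i = 0; i < e; i++) {
        if (p > limit / q)      /* p * q would exceed limit */
            return false;
        p *= q;
    }
    return true;
}

/* Trial-division primality test (fine for the small values expected here). */
static bool is_prime(uint64_t x)
{
    if (x < 2)
        return false;
    for (uint64_t d = 2; d <= x / d; d++)
        if (x % d == 0)
            return false;
    return true;
}

/* Hypothetical helper: largest prime q with q^(n-1) * max_char < 2^w,
 * or 0 if not even q = 2 satisfies the bound. */
uint64_t largest_base(unsigned n, uint64_t max_char, unsigned w)
{
    assert(n >= 2 && max_char >= 1 && w >= 1 && w <= 64);

    uint64_t max_word = (w == 64) ? UINT64_MAX : ((UINT64_C(1) << w) - 1);
    uint64_t limit = max_word / max_char;   /* need q^(n-1) <= limit */

    if (!power_fits(2, n - 1, limit))
        return 0;

    /* Binary search for the largest integer q whose power still fits. */
    uint64_t lo = 2, hi = limit;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;
        if (power_fits(mid, n - 1, limit))
            lo = mid;
        else
            hi = mid - 1;
    }

    /* Step down to the nearest prime; 2 is prime, so this terminates. */
    while (!is_prime(lo))
        lo--;
    return lo;
}
```

Under these assumptions, with $w = 32$ and $A = 2^8$, a pattern of length 5 allows at most $q = 61$, a pattern of length 9 allows only $q = 7$, and from length 25 on no prime fits at all (the function returns 0).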
Perhaps just taking $q = 3$ or $q = 5$ is enough (the hash value is still a full word, so collisions should be rare anyway). As these have just two bits set (they are Fermat primes; the next ones are 17 and 257), the compiler (or even the programmer) may replace the multiplication by the constant prime with a shift and an add, but whether that is a net gain depends on the architecture...
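As an illustration of the $q = 3$ variant, here is a rough C sketch of a single-pattern search where the hash simply wraps modulo $2^{32}$ and the multiplication by 3 is written as a shift and an add. The names rk_find and times_q are invented for this example, and a modern compiler would likely make the same transformation on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply by the base q = 3 as shift-and-add: 3x = (x << 1) + x.
 * The result wraps modulo 2^32 when stored back into uint32_t. */
static inline uint32_t times_q(uint32_t x)
{
    return (x << 1) + x;
}

/* Hypothetical helper: index of the first occurrence of pat in text,
 * or -1 if there is none. */
ptrdiff_t rk_find(const char *text, size_t tlen, const char *pat, size_t plen)
{
    assert(text != NULL && pat != NULL);
    if (plen == 0)
        return 0;
    if (plen > tlen)
        return -1;

    uint32_t hp = 0;    /* hash of the pattern */
    uint32_t ht = 0;    /* hash of the current text window */
    uint32_t top = 1;   /* q^(plen-1) mod 2^32 */
    for (size_t i = 0; i < plen; i++) {
        hp = times_q(hp) + (unsigned char)pat[i];
        ht = times_q(ht) + (unsigned char)text[i];
        if (i > 0)
            top = times_q(top);
    }

    for (size_t i = 0; ; i++) {
        /* Equal hashes may be a spurious hit, so confirm with memcmp. */
        if (hp == ht && memcmp(text + i, pat, plen) == 0)
            return (ptrdiff_t)i;
        if (i + plen >= tlen)
            return -1;
        /* Roll: drop the leading character, shift, append the next one. */
        ht = times_q(ht - top * (unsigned char)text[i])
             + (unsigned char)text[i + plen];
    }
}
```

For instance, rk_find("hello world", 11, "world", 5) would return 6 under this sketch. The memcmp check is what keeps the result correct even when two different windows happen to share a hash value.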

Note that the cited article says this makes sense only for multiple pattern search, as there are faster alternatives for single patterns.

To really answer, you'd have to run tests with representative haystacks and patterns.
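Such a test could start from something like the following C sketch, which counts spurious hits (windows whose hash equals the pattern's hash although the text differs) for a given base $q$, again with the hash wrapping modulo $2^{32}$. The function name spurious_hits is an assumption made for this example, and timing the search would still have to be measured separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of windows of text whose hash matches the
 * hash of pat while the window itself is different from pat. */
size_t spurious_hits(const char *text, size_t tlen,
                     const char *pat, size_t plen, uint32_t q)
{
    assert(text != NULL && pat != NULL && plen >= 1);
    if (plen > tlen)
        return 0;

    uint32_t hp = 0;    /* hash of the pattern */
    uint32_t ht = 0;    /* hash of the current text window */
    uint32_t top = 1;   /* q^(plen-1) mod 2^32 */
    for (size_t i = 0; i < plen; i++) {
        hp = hp * q + (unsigned char)pat[i];
        ht = ht * q + (unsigned char)text[i];
        if (i > 0)
            top *= q;
    }

    size_t spurious = 0;
    for (size_t i = 0; ; i++) {
        /* Hash agrees but the characters do not: a spurious hit. */
        if (hp == ht && memcmp(text + i, pat, plen) != 0)
            spurious++;
        if (i + plen >= tlen)
            break;
        /* Roll the window one position to the right. */
        ht = (ht - top * (unsigned char)text[i]) * q
             + (unsigned char)text[i + plen];
    }
    return spurious;
}
```

As a sanity check, the degenerate base $q = 1$ (not a prime, just a baseline where the hash is the sum of the characters) reports one spurious hit for the pattern "ab" in "abba", because "ba" has the same sum, while $q = 3$ reports none. Running the same function over representative haystacks and patterns for each candidate $q$ gives the comparison the answer asks for.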

Context

StackExchange Computer Science Q#10174, answer score: 5
