What is the reasoning behind magic constants in hash code calculations found in programming practice?
Problem
In real-world programming, we frequently need to compute hash codes for complicated objects. The main desired properties are that the values should
- be deterministic and
- have few collisions.
Let's assume we have access to robust hashing for basic data types such as integers, strings, etc. How do we combine multiple such hash codes to a single one, say for an array or a struct with several properties?
I immediately went for XOR-ing all "child" codes, figuring that the "spread" of the original hash function should be maintained, roughly.
Apparently, I'm not alone.
However, if you google around, you find other approaches.
- Here an answerer proposes adding hash codes after perturbing them by some multiplication.
- This asker uses XOR but multiplies the individual codes with some numbers, apparently consciously choosing "large" primes.
- Here, the author states that "[things] have low values (favour the lower bits), so they should be combined with a multiplication to spread them across a wider bit range of the hash code".
The reasons given, if any, seem vacuous to me. Does multiplying already "random" values with a (small) constant really serve any purpose? Does it matter if the factor is a prime (which everybody seems to assume)?
Solution
XOR is not a good method, because then the hash of $(a,b)$ will be equal to the hash of $(b,a)$. Also, the hash of $(a,a,c)$ will be equal to the hash of $(b,b,c)$ and to the hash of just $c$. That's not good.
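These degeneracies are easy to demonstrate. A minimal Python sketch, using the built-in `hash` as the per-element hash and a hypothetical `combine_xor` helper for the flawed approach:

```python
def combine_xor(*parts):
    """Combine element hashes by XOR -- the flawed approach."""
    h = 0
    for p in parts:
        h ^= hash(p)
    return h

# Order is lost: (a, b) and (b, a) always collide.
assert combine_xor("a", "b") == combine_xor("b", "a")

# Repeated elements cancel: (a, a, c) collides with c and with (b, b, c).
assert combine_xor("a", "a", "c") == combine_xor("c")
assert combine_xor("a", "a", "c") == combine_xor("b", "b", "c")
```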
A better method is to make the hash of $(a,b)$ be equal to $H(H(a) || H(b))$, where $H$ is your hash function on basic objects (ints, strings, etc.) and where I am using $||$ to represent concatenation of bit strings. This avoids all of those degeneracies.
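A minimal sketch of this construction, using SHA-256 as a stand-in for $H$ and `repr` to serialize basic objects (both are illustrative choices, not prescribed by the answer):

```python
import hashlib

def H(obj):
    """Stand-in hash for basic objects: SHA-256 of a serialization."""
    return hashlib.sha256(repr(obj).encode()).digest()

def combine_concat(*parts):
    """Hash of a tuple computed as H(H(a) || H(b) || ...)."""
    return hashlib.sha256(b"".join(H(p) for p in parts)).digest()

# None of the XOR degeneracies occur (barring SHA-256 collisions).
assert combine_concat("a", "b") != combine_concat("b", "a")
assert combine_concat("a", "a", "c") != combine_concat("c")
```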
The particular constructions you mention can be viewed as instances of that approach:
- The first link basically proposes using $H(a) + 37 \times H(b) \bmod 2^{64}$. That's equivalent to the construction $H'(H(a) || H(b))$ where $H'$ is a standard Rabin-Karp hash function.
- The second link is $1013 \times H(a) \oplus 1009 \times H(b) \bmod 2^{64}$. That has the form $H''(H(a) || H(b))$ where $H''(x||y) = c_1x \oplus c_2y$, which is also a reasonable hash function (though it takes a little bit of algebra to prove it).
- The third link is $(c \times H(a)) \oplus H(b) \bmod 2^{64}$ where $c$ is a prime. That's the same as the previous one, but with $c_2=1$. It also turns out to be a reasonable hash function.
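The three constructions, side by side, in a small Python sketch (the constants 37, 1013, and 1009 come from the links above; $c = 31$ in the third is an arbitrary illustrative prime):

```python
MASK = (1 << 64) - 1  # reduce mod 2**64 by masking

def combine_rk(ha, hb):
    """First construction: H(a) + 37 * H(b) mod 2**64 (Rabin-Karp style)."""
    return (ha + 37 * hb) & MASK

def combine_two_consts(ha, hb):
    """Second construction: 1013 * H(a) XOR 1009 * H(b) mod 2**64."""
    return ((1013 * ha) & MASK) ^ ((1009 * hb) & MASK)

def combine_one_const(ha, hb, c=31):
    """Third construction: (c * H(a)) XOR H(b) mod 2**64, i.e. c_2 = 1."""
    return ((c * ha) & MASK) ^ hb

# All three distinguish (a, b) from (b, a), unlike plain XOR.
assert combine_rk(1, 2) == 75       # 1 + 37*2
assert combine_rk(2, 1) == 39       # 2 + 37*1
assert combine_two_consts(1, 2) != combine_two_consts(2, 1)
assert combine_one_const(1, 2) != combine_one_const(2, 1)
```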
Does the constant multiplier (e.g., 37) need to be prime? No, it only needs to be relatively prime to the modulus. If the modulus is a power of two (as in these cases), it only needs to be odd. A prime is sufficient but not necessary. To obtain provably good results, the constant multiplier should be chosen at random; in practice, we fix a particular value and hope that will be close enough for practical purposes.
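The "odd, not necessarily prime" condition can be checked directly: an odd constant has a modular inverse mod $2^{64}$, so multiplying by it permutes the 64-bit values and destroys no information, while an even constant does not. A quick check (Python 3.8+ for modular inverses via `pow`):

```python
M = 1 << 64

# 37 is odd: invertible mod 2**64, so multiplication by 37 is a bijection.
assert (37 * pow(37, -1, M)) % M == 1

# 15 is odd but composite, and works just as well: primality is not needed.
assert (15 * pow(15, -1, M)) % M == 1

# An even constant is NOT invertible mod 2**64 (it discards high bits).
try:
    pow(36, -1, M)
    even_invertible = True
except ValueError:
    even_invertible = False
assert not even_invertible
```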
Another reasonable method is to use $H_1(a) + H_2(b)$ or $H_1(a) \oplus H_2(b)$ as the hash of $(a,b)$, where $H_1,H_2$ are two independently chosen hash functions. This also works, but it is a little more annoying to implement, because it requires us to have multiple hash functions.
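One common way to get "multiple" hash functions in practice is to key a single strong hash with different seeds. A sketch, using seeded SHA-256 as a stand-in for independently drawn $H_1, H_2$ (a genuinely 2-universal family would be sampled differently):

```python
import hashlib

MASK = (1 << 64) - 1

def make_hash(seed):
    """A hash function keyed by a seed (SHA-256 as an illustrative stand-in)."""
    def h(obj):
        data = seed.to_bytes(8, "big") + repr(obj).encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return h

H1, H2 = make_hash(1), make_hash(2)

def combine_indep(a, b):
    """Hash of (a, b) as H1(a) + H2(b) mod 2**64."""
    return (H1(a) + H2(b)) & MASK

# Because H1 and H2 differ, swapping the elements changes the result.
assert combine_indep("a", "b") != combine_indep("b", "a")
```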
Is there some theory here? Yes, you can prove that if $H_1,H_2$ are drawn uniformly at random from a 2-universal hash family, and $+$ is a group operator, then $H_1(a)+H_2(b)$ will be 2-universal. The same is true for $H_1(H_2(a)||H_2(b))$. I am not sure what one can prove about $H(H(a)||H(b))$; I suspect it's not provably good in a generic sense, but is normally fine in practice.
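For concreteness, the classic Carter-Wegman construction $h(x) = ((ax + b) \bmod p) \bmod m$, with $p$ prime and $a, b$ drawn at random, is a 2-universal family for integer keys below $p$. A sketch of drawing two such functions and combining them with addition mod $m$ (the specific prime and range are illustrative):

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; integer keys must be smaller than P

def draw_2universal(m):
    """Draw h(x) = ((a*x + b) mod P) mod m from the Carter-Wegman family."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

m = 1 << 32
h1, h2 = draw_2universal(m), draw_2universal(m)

# Combining independently drawn h1, h2 with a group operation (+ mod m)
# yields a 2-universal hash for the pair.
hv = (h1(12345) + h2(67890)) % m
assert 0 <= hv < m
```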
Context
StackExchange Computer Science Q#77570, answer score: 6