HiveBrain v1.2.0

What is the reasoning behind magic constants in hash code calculations found in programming practice?

Submitted by: @import:stackexchange-cs

Problem

In real-world programming, we frequently need to compute hash codes for complicated objects. The main desired properties are that the values should

  • be deterministic and

  • have few collisions.
Let's assume we have access to robust hashing for basic data types such as integers, strings, etc. How do we combine multiple such hash codes to a single one, say for an array or a struct with several properties?

I immediately went for XOR-ing all "child" codes, figuring that the "spread" of the original hash function should be maintained, roughly.
Apparently, I'm not alone.

However, if you google around, you find other approaches.

  • Here an answerer proposes adding hash codes after perturbing them by some multiplication.

  • This asker uses XOR but multiplies the individual codes by some numbers, apparently consciously choosing "large" primes.

  • Here, the author states that "[things] have low values (favour the lower bits), so they should be combined with a multiplication to spread them across a wider bit range of the hash code".

The reasons given, if any, seem vacuous to me. Does multiplying already "random" values with a (small) constant really serve any purpose? Does it matter if the factor is a prime (which everybody seems to assume)?

Solution

XOR is not a good method, because then the hash of $(a,b)$ will be equal to the hash of $(b,a)$. Also, the hash of $(a,a,c)$ will be equal to the hash of $(b,b,c)$ and to the hash of just $c$. That's not good.
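These collisions are easy to demonstrate; a minimal Python sketch, using the built-in hash purely as a stand-in for a base hash function:

```python
def xor_combine(*parts):
    """Combine child hashes by XOR -- the approach questioned above."""
    result = 0
    for p in parts:
        result ^= hash(p)
    return result

# Order is lost: (a, b) and (b, a) collide.
assert xor_combine("a", "b") == xor_combine("b", "a")

# Repeated elements cancel out: (a, a, c) hashes like just (c,),
# because hash(a) XOR hash(a) == 0.
assert xor_combine("a", "a", "c") == xor_combine("c")
```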

A better method is to make the hash of $(a,b)$ be equal to $H(H(a) || H(b))$, where $H$ is your hash function on basic objects (ints, strings, etc.) and where I am using $||$ to represent concatenation of bit strings. This avoids all of those degeneracies.
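A sketch of this construction, with SHA-256 standing in for $H$ (any solid base hash would do):

```python
import hashlib

def H(data: bytes) -> bytes:
    """Base hash on byte strings; SHA-256 is a stand-in for a generic H."""
    return hashlib.sha256(data).digest()

def combine(a: bytes, b: bytes) -> bytes:
    """Hash of the pair (a, b) computed as H(H(a) || H(b))."""
    return H(H(a) + H(b))

# Order now matters: H(a) || H(b) and H(b) || H(a) are different inputs
# (with overwhelming probability), so the results differ.
assert combine(b"a", b"b") != combine(b"b", b"a")
```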

The particular constructions you mention can be viewed as instances of that approach:

  • The first link basically proposes using $H(a) + 37 \times H(b) \bmod 2^{64}$. That's equivalent to the construction $H'(H(a) || H(b))$ where $H'$ is a standard Rabin-Karp hash function.

  • The second link is $1013 \times H(a) \oplus 1009 \times H(b) \bmod 2^{64}$. That has the form $H''(H(a) || H(b))$ where $H''(x||y) = c_1x \oplus c_2y$, which is also a reasonable hash function (though it takes a little bit of algebra to prove it).

  • The third link is $(c \times H(a)) \oplus H(b) \bmod 2^{64}$ where $c$ is a prime. That's the same as the previous one, but with $c_2=1$. It also turns out to be a reasonable hash function.
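As a sketch, the three constructions in 64-bit arithmetic (the constant 31 in the third form is an illustrative choice, not necessarily the one from the link):

```python
M = 2**64  # all arithmetic modulo 2^64, i.e. 64-bit wraparound

def combine_add(ha: int, hb: int) -> int:
    """First form: H(a) + 37*H(b) mod 2^64 (a Rabin-Karp style hash)."""
    return (ha + 37 * hb) % M

def combine_xor_scaled(ha: int, hb: int) -> int:
    """Second form: 1013*H(a) XOR 1009*H(b), each product reduced mod 2^64."""
    return ((1013 * ha) % M) ^ ((1009 * hb) % M)

def combine_xor_one_side(ha: int, hb: int) -> int:
    """Third form: (c*H(a)) XOR H(b) with c prime; c = 31 is illustrative."""
    return ((31 * ha) % M) ^ hb

# Unlike plain XOR, none of these are symmetric in their arguments:
assert combine_add(1, 2) != combine_add(2, 1)
```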

Does the constant multiplier (e.g., 37) need to be prime? No, it only needs to be relatively prime to the modulus. If the modulus is a power of two (as in these cases), it only needs to be odd. A prime is sufficient but not necessary. To obtain provably good results, the constant multiplier should be chosen at random; in practice, we fix a particular value and hope that will be close enough for practical purposes.
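The "odd, not necessarily prime" point can be checked directly: a modular inverse of $c$ modulo $2^{64}$ exists exactly when $c$ is odd, so multiplication by an odd constant is a bijection and loses no information, while an even constant collapses distinct inputs.

```python
M = 2**64

# Any odd constant is invertible mod 2^64 -- prime or not.
inv = pow(39, -1, M)          # 39 is odd and composite (Python 3.8+)
assert (39 * inv) % M == 1    # multiplication by 39 is a bijection

# An even constant is not invertible and collapses distinct inputs:
assert (38 * (M // 2)) % M == (38 * 0) % M  # both products are 0 mod 2^64
```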

Another reasonable method is to use $H_1(a) + H_2(b)$ or $H_1(a) \oplus H_2(b)$ as the hash of $(a,b)$, where $H_1,H_2$ are two independently chosen hash functions. This also works, but is a little more annoying to implement, because it requires us to have multiple hash functions.
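A sketch of this variant, using keyed BLAKE2b to stand in for independently chosen $H_1, H_2$ (the keys below are arbitrary):

```python
import hashlib

def keyed_hash(key: bytes):
    """Derive a distinct hash function from a key -- a stand-in for
    drawing H1, H2 independently from a hash family."""
    def h(data: bytes) -> int:
        digest = hashlib.blake2b(data, key=key, digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return h

H1 = keyed_hash(b"key-one")
H2 = keyed_hash(b"key-two")

def combine(a: bytes, b: bytes) -> int:
    """Hash of the pair (a, b) as H1(a) XOR H2(b)."""
    return H1(a) ^ H2(b)

# Because H1 and H2 are different functions, swapping the pair
# changes the result (with overwhelming probability).
assert combine(b"a", b"b") != combine(b"b", b"a")
```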

Is there some theory here? Yes, you can prove that if $H_1,H_2$ are drawn uniformly at random from a 2-universal hash family, and $+$ is a group operator, then $H_1(a)+H_2(b)$ will be 2-universal. The same is true for $H_1(H_2(a)||H_2(b))$. I am not sure what one can prove about $H(H(a)||H(b))$; I suspect it's not provably good in a generic sense, but is normally fine in practice.

Context

StackExchange Computer Science Q#77570, answer score: 6
