How do you find a hash function that respects a custom equality function?
Problem
I've been tasked with hashing arbitrary types in C++, with the caveat that A == B implies hash(A) == hash(B), even if equality of A and B is determined by a custom equality function ==.
For simplicity we can assume that == is an equivalence relation. For example, the expected behavior of the hash function on std::vectors is as follows. Given
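#include <vector>
using namespace std;
// The element type was lost in the original post; int is assumed here.
vector<int> A = vector<int>();
vector<int> B = vector<int>();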
A == B will be true because == is overloaded for std::vector to mean equality of the underlying data. Correspondingly, hash(A) == hash(B) should also be true.
I can't simply hash the addresses of A and B as integers, because A == B but hash(&A) != hash(&B) in general. I've thought of one solution, but I wonder if it's optimal. It seems terribly inefficient. The solution is to build the hash function as new values are hashed:
#include <cstdlib>
#include <unordered_map>
using namespace std;

// Key is assumed to be a template parameter. The map's template arguments
// were lost in the original post and are assumed to be <Key, int>; note that
// this map itself already needs a hash for Key.
template <typename Key>
class Hasher {
public:
    unordered_map<Key, int> hashedKeys;
    int max_hash;

    Hasher(int max_hash) {
        this->max_hash = max_hash;
    }

    int hash(const Key& key) {
        // If key has already been hashed, use that hash_value
        if (hashedKeys.count(key) == 1) {
            return hashedKeys[key];
        }
        // For pairs of saved (Key key, int hash_value)
        for (auto it = hashedKeys.begin(); it != hashedKeys.end(); ++it) {
            // If an equal key has been inserted, just use its hash_value
            if (key == it->first) {
                int hash_value = it->second;  // copy first: insert may rehash
                hashedKeys.insert({key, hash_value});
                return hash_value;  // use hash value of equal Key
            }
        }
        // If no other Keys equal this one, randomly hash it, and save
        int hash_value = rand() % max_hash;
        hashedKeys.insert({key, hash_value});
        return hash_value;
    }
};
I could do some extra bookkeeping to ensure that inequivalent Keys are less likely to be mapped to the same hash by the random assignment, but that'
Solution
The way I can think of to do this is by some sort of normalization: that is, you need to find a function $f$ such that, if $\equiv$ is your custom equality and $==$ is the normal C++ (or whatever language you use) equality, for all $x,y$, we have $x \equiv y$ if and only if $f(x)==f(y)$. We call $f(x)$ the normal form of $x$.
Then, the trick is, instead of computing hashes, you compute hashes of normal forms.
Hash functions are specifically designed to produce large changes in output for small changes in input: that's what makes them well suited to hashtables and cryptography. So there's not likely a way to make a hash function that is invariant over some custom equality, except to have it compute on normal forms.
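To make the normal-form idea concrete, here is a minimal sketch under an assumed custom equality: two strings are equivalent if they match ignoring ASCII case. The names normal_form, CaseInsensitiveEqual, CaseInsensitiveHash and CaseInsensitiveSet are hypothetical and chosen only for illustration. Lowercasing plays the role of $f$, and the hash is just std::hash applied to $f(x)$.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Assumed custom equality: strings are equivalent if equal ignoring ASCII case.
// normal_form is f: x is equivalent to y exactly when f(x) == f(y).
std::string normal_form(const std::string& s) {
    std::string out = s;
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// The custom equality, expressed through the normal form.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        return normal_form(a) == normal_form(b);
    }
};

// Hash the normal form rather than the raw value, so equivalent keys
// always receive the same hash.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        return std::hash<std::string>{}(normal_form(s));
    }
};

// A standard container that respects the custom equality.
using CaseInsensitiveSet =
    std::unordered_set<std::string, CaseInsensitiveHash, CaseInsensitiveEqual>;
```

With this, inserting "Hello" into a CaseInsensitiveSet and then looking up "HELLO" finds the element, because both keys share the normal form "hello". For other equalities the hard part is finding a suitable $f$; the hashing step stays the same.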
What you've described might work from a correctness point of view, but there are a few things to consider:
- It is not hashing. That is, a hash is essentially a function that takes some variable-sized data and produces a fixed size output (in your case, an int). You haven't designed a function at all, you've just defined a way to assign random identifiers to input.
- You lose all the advantages of hashing. One main use case of hashing is quickly comparing two things. If you hash a bunch of things ahead of time, then you can very quickly check whether any two of them are definitely different, and if their hashes are the same, you know with high probability that they are actually the same. With your version, you get fast comparison, but to compute all your hashes you'll already have compared on the order of $n^2$ unhashed pairs at least once, so you will never save work.
The other thing hashes are useful for is indexing complex data in a data structure. That is, you convert your key into a hash, and each time you do a key lookup, you compute the hash and use it to find the key in a data structure, possibly a tree or hashtable. With yours, you end up doing $n$ comparisons each time you look up the hash key, which means you'd be better off just using an unordered list as your data structure and searching through it each time, comparing each element to the key you're looking for.
Using unique identifiers instead of hashes is a fine way to index data, but then you definitely do NOT want to generate random identifiers, since there's still a risk of a collision. Usually you'd just keep a counter and generate one plus the last identifier each time you allocate a new one.
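The counter-based alternative could look roughly like the following sketch. IdAssigner and id_of are hypothetical names, and Equal stands for whatever custom equality you have (assumed to be an equivalence relation). Identifiers never collide, but each call still costs up to $n$ comparisons, which is exactly the cost discussed above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: gives each equivalence class of keys a unique,
// sequential identifier instead of a random one.
template <typename Key, typename Equal>
class IdAssigner {
public:
    explicit IdAssigner(Equal equal) : equal_(std::move(equal)) {}

    // Returns the identifier of the class containing key. If no equivalent
    // key has been seen, allocates the next counter value. This does one
    // comparison per known class, so it is linear in the number of classes.
    std::size_t id_of(const Key& key) {
        for (std::size_t id = 0; id < representatives_.size(); ++id) {
            if (equal_(key, representatives_[id])) {
                return id;
            }
        }
        representatives_.push_back(key);
        return representatives_.size() - 1;  // one plus the last identifier
    }

private:
    Equal equal_;
    std::vector<Key> representatives_;  // position in the vector is the identifier
};
```

Storing one representative per class means the vector index doubles as the counter, so no separate bookkeeping is needed to keep identifiers distinct.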
Context
StackExchange Computer Science Q#116263, answer score: 5