patterncsharpMinor
GetHashCode of a Dictionary
gethashcodedictionarystackoverflow
Problem
There is a huge amount of documentation about using and generating hash codes for objects that will go into a dictionary, but very little about the hash code of a dictionary itself.
No, the default .NET dictionary doesn't override the Object.GetHashCode() method, so two different dictionaries with the same keys and the same values will generally have different hash codes.
Taking inspiration from this Stack Overflow question, I made this implementation in my object.
The object is a simple wrapper over a dictionary; it takes values and stores each one together with the number of times it has been added (the values are the dictionary's keys, the occurrence counts are its values).
The hash code starts at zero (for an empty dictionary) and is recalculated and stored every time a value is added, which avoids enumerating the whole dictionary every time the hash code is requested.
What do you think about it?
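private int hashcode = 0;
public bool Add(T item)
{
    // Use "+ 1" rather than "++" here: a post-increment on the right-hand
    // side would assign the old count back, so the count would never grow.
    values[item] = values.ContainsKey(item) ? values[item] + 1 : 1;
    unchecked
    {
        hashcode += 486187739 * item.GetHashCode();
        hashcode += 982451653 * values[item];
    }
    return true;
}
Solution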
Hashes do not need to be unique; a good hash function has a good distribution and as few collisions as possible.
What are the implications of this? Given two dictionaries \$a\$ and \$b\$ with \$a = b\$ and a hash function \$\operatorname{f_{hash}}(D)\$, you can assert that \$\operatorname{f_{hash}}(a) = \operatorname{f_{hash}}(b)\$; for \$a \neq b\$ the same equality may also hold. With this in mind the default behavior may be considered wrong, but the framework simply can't provide an effective, generic-enough function. (Also note that the documentation explicitly advises against overriding GetHashCode() if the object is mutable, which a dictionary is, because its hash may be used by an outer container.)
Choosing the better hash function may not be simple, because the choice has to balance complexity, performance and distribution. Some knowledge of the elements you want to hash will also help you pick the best one (in your case: where do those numbers come from? Is there any rationale behind them to reach a better distribution?)
Let's start with a very simple case: the hash is the number of items in the collection, \$\operatorname{f_{hash}}(D) = \#(D)\$. It's better than the default implementation and it's pretty fast, but it's still a terrible hash function (a huge number of collisions). More often than not, this was all I needed.
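As a rough illustration of the count-as-hash idea, here is a minimal sketch. The class name CountHashedDictionary and its members are hypothetical (they are not from the question), and it assumes the wrapper also defines content-based equality, since that is what makes a content-based hash meaningful.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: equality is by content, hash code is the entry count.
public sealed class CountHashedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();

    public void Add(TKey key, TValue value)
    {
        _items.Add(key, value);
    }

    public override bool Equals(object obj)
    {
        var other = obj as CountHashedDictionary<TKey, TValue>;
        if (other == null || other._items.Count != _items.Count) return false;
        foreach (var pair in _items)
        {
            TValue otherValue;
            if (!other._items.TryGetValue(pair.Key, out otherValue)) return false;
            if (!EqualityComparer<TValue>.Default.Equals(pair.Value, otherValue)) return false;
        }
        return true;
    }

    // Equal dictionaries always have equal counts, so the hash contract holds.
    // The price: all dictionaries of the same size collide.
    public override int GetHashCode()
    {
        return _items.Count;
    }
}
```

Note that the hash still changes whenever an item is added, so the usual caveat about mutable objects stored in an outer hash-based container applies here too.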
Any other implementation depends on context: are the items immutable (or do you hash only the keys)? In that case you can keep a running hash when inserting and removing items (assuming the hash function you use can be reverted, so you can subtract the hash of an item that has just been removed).
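A running hash of this kind could look like the sketch below. It assumes keys and values are immutable, and it uses plain unchecked addition as the combining step because addition is order-independent and can be undone by subtraction. The class name RunningHashDictionary, the EntryHash helper and the multiplier 397 are illustrative choices, not something the question prescribes; Equals is left out for brevity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper that keeps its hash code up to date on every change.
public sealed class RunningHashDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private int _hash; // 0 for the empty dictionary

    // Mixes the key hash and the value hash into one per-entry hash.
    private static int EntryHash(TKey key, TValue value)
    {
        unchecked
        {
            int k = EqualityComparer<TKey>.Default.GetHashCode(key);
            int v = EqualityComparer<TValue>.Default.GetHashCode(value);
            return (k * 397) ^ v;
        }
    }

    public void Add(TKey key, TValue value)
    {
        _items.Add(key, value); // throws on a duplicate key before the hash is touched
        unchecked { _hash += EntryHash(key, value); }
    }

    public bool Remove(TKey key)
    {
        TValue value;
        if (!_items.TryGetValue(key, out value)) return false;
        _items.Remove(key);
        unchecked { _hash -= EntryHash(key, value); } // revert what Add contributed
        return true;
    }

    public void Clear()
    {
        _items.Clear();
        _hash = 0;
    }

    public override int GetHashCode()
    {
        return _hash;
    }
}
```

Because integer addition wraps around and commutes, the result does not depend on insertion order, and removing an entry restores exactly the hash the dictionary had before that entry was added.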
What I sometimes do for a dictionary of immutable items (unless the dictionary holds thousands of items, or you need the hash in a very performance-critical function) is to calculate the hash only when required and cache it; the cache is invalidated when you add or remove an element. Proof of concept (here I override just Add() but you have to do it also for Remove() and Clear()):
private int? _hash; // null until computed, and again after every change
public override int GetHashCode() {
    if (_hash == null)
        _hash = CalculateHashCode();
    return _hash.Value;
}
// Assumes a base class with a virtual Add(); Dictionary<TKey, TValue>.Add()
// is not virtual, so with the built-in type wrap the dictionary instead.
public override void Add(TKey key, TValue value) {
    base.Add(key, value);
    _hash = null;
}
Note that if items are mutable then you can't cache the calculated hash, and it must be recomputed each time it is required (unless they're also observable objects...)
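For completeness, the proof of concept can be turned into a compilable sketch by wrapping a Dictionary<TKey, TValue> rather than inheriting from it. Everything named here is an assumption made for illustration: the class name CachedHashDictionary, the order-independent sum inside CalculateHashCode(), and the HashComputations counter, which exists only to make the caching observable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: the hash is computed lazily and cached until the next change.
public sealed class CachedHashDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private int? _hash; // null means "not computed yet" or "invalidated"

    // Diagnostic only: how many times the full enumeration actually ran.
    public int HashComputations { get; private set; }

    public void Add(TKey key, TValue value)
    {
        _items.Add(key, value);
        _hash = null;
    }

    public bool Remove(TKey key)
    {
        bool removed = _items.Remove(key);
        if (removed) _hash = null;
        return removed;
    }

    public void Clear()
    {
        _items.Clear();
        _hash = null;
    }

    public override int GetHashCode()
    {
        if (_hash == null)
            _hash = CalculateHashCode();
        return _hash.Value;
    }

    // Order-independent sum of per-entry hashes (one possible choice among many).
    private int CalculateHashCode()
    {
        HashComputations++;
        int hash = 0;
        foreach (var pair in _items)
        {
            unchecked
            {
                int k = EqualityComparer<TKey>.Default.GetHashCode(pair.Key);
                int v = EqualityComparer<TValue>.Default.GetHashCode(pair.Value);
                hash += (k * 397) ^ v;
            }
        }
        return hash;
    }
}
```

Repeated calls to GetHashCode() between modifications cost one null check; the enumeration runs again only after Add(), a successful Remove(), or Clear().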
Which hash function? That's a hard choice: your magic numbers must be chosen to give a good distribution (if you know anything about the objects stored in the dictionary). As a starting point you may use the simplest solution (unless you have some knowledge about the distribution of the hashes of the elements): even \$\operatorname{xor}(\{d_1,\ldots,d_n\})\$ is good enough (you will use as many bits as are used by the hashes of the items/keys in your dictionary). It has the advantage of being fast; if you just need a quick comparison of two dictionaries (or lists) then you probably want the fastest algorithm, not the best one (what a vague concept).
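The XOR approach can be sketched in a few lines. The helper class DictionaryHash and its two methods are hypothetical names; XorOfKeys hashes only the keys, while XorOfEntries first mixes each key hash with its value hash (the multiplier 397 is an arbitrary illustrative choice) so that values also influence the result.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers: order-independent XOR hashes over a dictionary's content.
public static class DictionaryHash
{
    // XOR of the key hashes only; values are ignored.
    public static int XorOfKeys<TKey, TValue>(IEnumerable<KeyValuePair<TKey, TValue>> entries)
    {
        int hash = 0;
        foreach (var pair in entries)
            hash ^= EqualityComparer<TKey>.Default.GetHashCode(pair.Key);
        return hash;
    }

    // XOR of per-entry hashes, each mixing the key hash with the value hash.
    public static int XorOfEntries<TKey, TValue>(IEnumerable<KeyValuePair<TKey, TValue>> entries)
    {
        int hash = 0;
        foreach (var pair in entries)
        {
            unchecked
            {
                int k = EqualityComparer<TKey>.Default.GetHashCode(pair.Key);
                int v = EqualityComparer<TValue>.Default.GetHashCode(pair.Value);
                hash ^= (k * 397) ^ v;
            }
        }
        return hash;
    }
}
```

XOR is commutative, so enumeration order does not matter, and it is its own inverse, so it also works as a running hash. It collides easily, though: for int keys, whose hash code is the value itself, the keys 1, 2 and 3 XOR to 0, the same result as an empty dictionary.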
The next step may be to use a popular general-purpose (non-cryptographic!) hash function to balance performance and quality; you may take a look at CityHash (developed by Google). It has good performance and it's easy to implement (plus: it comes in different sizes, so you may even decide to devote 16 bits to keys and 16 bits to values). Its inputs are the hashes of the keys/values (so don't forget that those hashes also have to be of good quality).
To summarize: too many ifs, right? That's why the do-nothing path was a good choice...
Code Snippets
public override int GetHashCode() {
    if (_hash == null)
        _hash = CalculateHashCode();
    return _hash.Value; // _hash is an int?, so unwrap it
}
public override void Add(TKey key, TValue value) {
    base.Add(key, value);
    _hash = null;
}
Context
StackExchange Code Review Q#141978, answer score: 4