gotchaMinor
Difference between properties of good hash function: uniformity and randomness
Viewed 0 times
propertiesfunctionhashrandomnessdifferencebetweengoodanduniformity
Problem
I did go through Korth's book of DBMS and I got these definitions:
Uniformity:
Each bucket is assigned the same number of search-key values from the set of all possible values.
Randomness:
Each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
These definitions seem similar to me. Can anyone please differentiate, I did search on net but can't understand.
Uniformity:
Each bucket is assigned the same number of search-key values from the set of all possible values.
Randomness:
Each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
These definitions seem similar to me. Can anyone please differentiate, I did search on net but can't understand.
Solution
Uniformity is about potential values, while randomness is about actual values.
For example, suppose you make a very simple hash function that takes the first byte of a string, resulting in 256 buckets. It is uniform, because the number of possible strings starting with a particular byte doesn't depend on. But it is not random for most data sets: at least 12.5% of the values consist of nonprintable characters (control characters) that most data sets don't contain, some characters like
Randomness is defined with respect to a specific data set. For example, if the data set consists of SHA-1 hashes of typical strings, then taking the first byte satisfies the randomness property. If the data set consists of people's surnames, it doesn't. If the data set consists of SHA-1 hashes strings that are deliberately chosen so that their SHA-1 hash starts with the byte 0, then taking the first byte doesn't satisfy randomness either.
In fact, there is no hash function that satisfies randomness for all datasets: for any hash function (with at least two buckets), a data set consisting solely of values that hash to bucket 0 violates randomness. The best you can hope is hash functions that satisfy randomness for datasets that are not deliberately crafted to violate randomness. This is possible in practice if the hash function includes a random factor which is based on a secret seed.
For example, suppose you make a very simple hash function that takes the first byte of a string, resulting in 256 buckets. It is uniform, because the number of possible strings starting with a particular byte doesn't depend on. But it is not random for most data sets: at least 12.5% of the values consist of nonprintable characters (control characters) that most data sets don't contain, some characters like
)]} rarely appear at the beginning of a string, an English dictionary contains more words beginning with a than with z, etc.Randomness is defined with respect to a specific data set. For example, if the data set consists of SHA-1 hashes of typical strings, then taking the first byte satisfies the randomness property. If the data set consists of people's surnames, it doesn't. If the data set consists of SHA-1 hashes strings that are deliberately chosen so that their SHA-1 hash starts with the byte 0, then taking the first byte doesn't satisfy randomness either.
In fact, there is no hash function that satisfies randomness for all datasets: for any hash function (with at least two buckets), a data set consisting solely of values that hash to bucket 0 violates randomness. The best you can hope is hash functions that satisfy randomness for datasets that are not deliberately crafted to violate randomness. This is possible in practice if the hash function includes a random factor which is based on a secret seed.
Context
StackExchange Computer Science Q#59441, answer score: 4
Revisions (0)
No revisions yet.