patternMinor
Why are hash map look-ups assumed to be $O(1)$ on average
Viewed 0 times
assumedwhymaparelookaveragehashups
Problem
To look up a key in a hash map you have to
Hash calculation takes at least $O(l)$ operations when the hashes are $l$-bit-numbers.
When using an index (like a binary tree) for each bucket, finding an entry within a bucket that contains $k$ entries can be done in $O(\log k)$. With $n$ being the total number of entries in the hash map and $m$ being the number of buckets, $k$ averages to $n/m$. Due to $m=2^l$ we thus get
$O(\log k) = O(\log n/m) = O(\log n - \log m) = O(\log n - l)$.
Combining these two runtimes one gets a total look-up time of $O(l + \log n - l) = O(\log n)$, which conforms to the intuition that a lookup in a collection with $n$ entries is not possible below $O(\log n)$ operations.
In short, it is generally assumed that $l$ and $k$ are both constant with regard to $n$. But if you fix $l$ then $k$ grows with $n$.
Am I missing something here?
- calculate its hash
- find the entry in the resulting hash bucket
Hash calculation takes at least $O(l)$ operations when the hashes are $l$-bit-numbers.
When using an index (like a binary tree) for each bucket, finding an entry within a bucket that contains $k$ entries can be done in $O(\log k)$. With $n$ being the total number of entries in the hash map and $m$ being the number of buckets, $k$ averages to $n/m$. Due to $m=2^l$ we thus get
$O(\log k) = O(\log n/m) = O(\log n - \log m) = O(\log n - l)$.
Combining these two runtimes one gets a total look-up time of $O(l + \log n - l) = O(\log n)$, which conforms to the intuition that a lookup in a collection with $n$ entries is not possible below $O(\log n)$ operations.
In short, it is generally assumed that $l$ and $k$ are both constant with regard to $n$. But if you fix $l$ then $k$ grows with $n$.
Am I missing something here?
Solution
Because we generally use the RAM model of computation with uniform cost model when computing the running time of operations on a hash table, and the RAM model with uniform cost states that the time to do a single operation on an entire machine word is $O(1)$. Also, we generally assume that the hash value fits within a single machine word.
Thus, the running time of computing a hash value is not $O(l)$, but rather $O(1)$ [assuming both the value being hashed and the hash value fit within one word, or a constant number of words].
Moreover, when you choose the hash function and size of the hashtable appropriately, the expected value of $k$ is $O(1)$. In particular, the number of buckets is not fixed, but increases as the number of items in the hashtable grows. The number of buckets it is usually chosen to be some function of $n$; say $m = 4n$, or something like that. In any case, we usually choose $m$ so that $n/m = O(1)$. Therefore, the (expected) running time to find an item within the bucket is $O(1)$.
Therefore, the total (expected) running time is $O(1)+O(1) = O(1)$.
So why do we use the RAM model with uniform cost model? Because it's often a better match to reality than other alternatives.
Thus, the running time of computing a hash value is not $O(l)$, but rather $O(1)$ [assuming both the value being hashed and the hash value fit within one word, or a constant number of words].
Moreover, when you choose the hash function and size of the hashtable appropriately, the expected value of $k$ is $O(1)$. In particular, the number of buckets is not fixed, but increases as the number of items in the hashtable grows. The number of buckets it is usually chosen to be some function of $n$; say $m = 4n$, or something like that. In any case, we usually choose $m$ so that $n/m = O(1)$. Therefore, the (expected) running time to find an item within the bucket is $O(1)$.
Therefore, the total (expected) running time is $O(1)+O(1) = O(1)$.
So why do we use the RAM model with uniform cost model? Because it's often a better match to reality than other alternatives.
Context
StackExchange Computer Science Q#61030, answer score: 8
Revisions (0)
No revisions yet.