HiveBrain v1.2.0

Why is the Java HashMap load factor 0.75?

Submitted by: @import:stackexchange-cs

Problem

I can't understand why the Java HashMap load factor is 0.75. If I understand correctly, the formula for the load factor is n/m, where n is the number of keys and m is the number of positions in the hash table.

Since HashMap uses buckets (i.e., linked lists) to store values, it is possible to have a load factor > 1 without any problem, so it's not clear to me why the load factor is set to 0.75.
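(For reference, Java's HashMap does let you choose a different load factor at construction time via the two-argument constructor. A minimal sketch, with hypothetical values:)

```java
import java.util.HashMap;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // The default load factor is 0.75, but it can be overridden:
        // here we ask for 16 initial buckets and a load factor of 0.5.
        HashMap<String, Integer> map = new HashMap<>(16, 0.5f);
        map.put("a", 1);
        // The table resizes once the entry count exceeds loadFactor * capacity.
        System.out.println(map.size()); // 1
    }
}
```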

Solution

I don't know the answer, but I can walk you through what might be going through the mind of someone designing such a data structure.

Assuming a "good" hash function, and that $n$ is large enough, then for a given load factor $\lambda$, the probability that a bucket contains exactly $k$ elements in it is given by the Poisson distribution:

$$P(k,\lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$
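This formula is easy to evaluate numerically. A small sketch (a hypothetical helper, not part of Java's HashMap) computes $P(k,\lambda)$ directly:

```java
public class PoissonDemo {
    // P(k, lambda) = e^(-lambda) * lambda^k / k!
    static double poisson(int k, double lambda) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // Proportion of buckets containing exactly one element:
        System.out.printf("P(1, 1.00) = %.3f%n", poisson(1, 1.0));   // ~0.368
        System.out.printf("P(1, 0.75) = %.3f%n", poisson(1, 0.75));  // ~0.354
    }
}
```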

So, for example, if $\lambda=1$, then the proportion of buckets containing exactly one element (i.e. those for which no chaining is needed; I'm not sure exactly how Java's HashMap actually implements it) is:

$$P(1,1) = \frac{1}{e} \approx 36.8\%$$

For $\lambda = 0.75$, it is:

$$P(1,0.75) \approx 35.4\%$$

Not much difference. However, this is measured per bucket, not per item. The proportion of successful queries that do not require traversing a chain (and are therefore loop-free) is $P(1,\lambda)/\lambda$. For $\lambda=1$, that is $36.8\%$, but for $\lambda=0.75$, it is $47.2\%$.

So by reducing $\lambda$ from $1$ to $0.75$, the proportion of successful queries that do not need to traverse a chain increases from a little over a third to a little under a half.
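(Note that $P(1,\lambda)/\lambda$ simplifies to $e^{-\lambda}$, which makes the chain-free fraction easy to tabulate. A quick sketch under the same Poisson assumption:)

```java
public class ChainFreeDemo {
    // Fraction of successful lookups that land in a singleton bucket:
    // P(1, lambda) / lambda = (e^(-lambda) * lambda / 1!) / lambda = e^(-lambda)
    static double chainFreeFraction(double lambda) {
        return Math.exp(-lambda);
    }

    public static void main(String[] args) {
        System.out.printf("lambda = 1.00: %.1f%%%n", 100 * chainFreeFraction(1.0));   // 36.8%
        System.out.printf("lambda = 0.75: %.1f%%%n", 100 * chainFreeFraction(0.75));  // 47.2%
    }
}
```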

But it might use more memory. How much more?

I am going to assume some kind of optimised representation where the hash table is an array of $m$ words, an empty bucket is represented as a null pointer, a bucket with a single element is represented as a pointer to the object, and a bucket with more than one element is represented as a pointer to a linked list. I will assume that a linked list node is 3 words in size: the object header (which Java semantics requires), a pointer to the item, and a "next" pointer.

So you can think of the overhead cost of an empty bucket as being $1$ word, a bucket with one item in it being $1$ word, and a bucket with $k>1$ items in it being $1+3k$ words.

I'll let you work out the details for yourself, but if $\lambda=1$, I calculate the overhead to be about $2.896$ words per stored item, and for $\lambda=0.75$, it's about $2.916$ words per stored item. That's less than a $0.7\%$ difference in memory overhead.
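(If you sum the per-bucket costs over the Poisson distribution, the expected overhead works out to $1/\lambda + 3(1 - e^{-\lambda})$ words per stored item; that closed form is my own working, not from the answer, but a quick sketch reproduces the figures quoted:)

```java
public class OverheadDemo {
    // Every bucket costs 1 word (the array slot), and a bucket with k >= 2
    // items costs 3 extra words per item for the linked-list nodes.
    // Summing 3k * P(k, lambda) over k >= 2 gives 3 * lambda * (1 - e^(-lambda)),
    // so the expected overhead per bucket is 1 + 3 * lambda * (1 - e^(-lambda)).
    // Dividing by lambda items per bucket gives words per stored item:
    static double overheadPerItem(double lambda) {
        return 1.0 / lambda + 3.0 * (1.0 - Math.exp(-lambda));
    }

    public static void main(String[] args) {
        System.out.printf("lambda = 1.00: %.3f words/item%n", overheadPerItem(1.0));   // ~2.896
        System.out.printf("lambda = 0.75: %.3f words/item%n", overheadPerItem(0.75));  // ~2.916
    }
}
```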

(Exercise: Under these assumptions, for what value of $\lambda$ is the memory overhead minimised? I will say that it's greater than $\frac{1}{2}$ and less than $1$. It's not $0.75$; if it was, that would be part of my answer!)

So it does seem that under these assumptions, you would get significantly more efficient lookup and only pay a tiny increase in memory overhead. And that's just one of many tradeoffs that you could look at.

My guess is that the Java library implementers performed some back-of-the-envelope calculations similar to this, and some experiments to be certain, and found that this was acceptable.

Context

StackExchange Computer Science Q#149496, answer score: 18
