gotchaMinor
Why does the following function distribute things in a binomial distribution?
Viewed 0 times
distributewhythethingsfunctionbinomialfollowingdoesdistribution
Problem
I was reading the following FDS paper:
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-75.pdf
and it says that the the following hash function does not distribute things uniformly (instead it creates a binomial distribution):
$$i = (hash(g + t)) \pmod n$$
while the following did distribute things uniformly:
$$i = (hash(g) + t) \pmod n$$
Why does the above distribute things evenly while the other one doesn't?
g = is the global UID for a blob
t = tract number. (tract is the measure/units of reads and writes to a blob).
n = is the number of tract servers or a multiple of the tract servers.
hash = SHA-1 (according to the paper)
Paper Reference:
Title: Flat Datacenter Storage
Author(s): Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, Yutaka Suzue
Institution(s): Microsoft Research, Univeristy of Texas at Austin
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-75.pdf
and it says that the the following hash function does not distribute things uniformly (instead it creates a binomial distribution):
$$i = (hash(g + t)) \pmod n$$
while the following did distribute things uniformly:
$$i = (hash(g) + t) \pmod n$$
Why does the above distribute things evenly while the other one doesn't?
g = is the global UID for a blob
t = tract number. (tract is the measure/units of reads and writes to a blob).
n = is the number of tract servers or a multiple of the tract servers.
hash = SHA-1 (according to the paper)
Paper Reference:
Title: Flat Datacenter Storage
Author(s): Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, Yutaka Suzue
Institution(s): Microsoft Research, Univeristy of Texas at Austin
Solution
The second function is easier to explain: it sends tract $t$ to position $t + h \pmod{n}$, where $h = hash(g)$. The function $t \mapsto t + h \pmod{n}$ is injective (one-to-one), and so every index $i$ gets mapped exactly once (in fact, by $i - h \pmod{n}$).
For the first function, we can think of $t \mapsto hash(g+t)$ as a random function (this is a common assumption whenever hash functions are involved; the fact that we use SHA-1 is completely irrelevant). Therefore for each fixed index $i$, the probability that $t$ is mapped to $i$ is $1/n$, and the total number of $t$s mapped to $i$ is distributed $\mathrm{Bin}(1/n,1) \approx \mathrm{Po}(1)$, that is it has binomial distribution which is approximated well by a Poisson distribution with mean $1$.
For the first function, we can think of $t \mapsto hash(g+t)$ as a random function (this is a common assumption whenever hash functions are involved; the fact that we use SHA-1 is completely irrelevant). Therefore for each fixed index $i$, the probability that $t$ is mapped to $i$ is $1/n$, and the total number of $t$s mapped to $i$ is distributed $\mathrm{Bin}(1/n,1) \approx \mathrm{Po}(1)$, that is it has binomial distribution which is approximated well by a Poisson distribution with mean $1$.
Context
StackExchange Computer Science Q#23153, answer score: 4
Revisions (0)
No revisions yet.