patternMinor
What is the complexity of a variation of the Coupon collector's problem?
Viewed 0 times
problemthecollectorwhatcouponvariationcomplexity
Problem
I need to know the complexity of the following algorithm:
Draw elements from a set of size $m$, one by one, randomly, with replacement, until coming across $n$ different elements from the set ($n\le m$). We do not care what are the numbers drawn, and halt once $n$ different numbers were seen during this process.
This is a variation of the Coupon collector's problem, where one continues picking until seeing all $m$ elements of the set.
Example:
We draw numbers from the set $\{1,2,3,4,5\}$, $n=3$, $m=5$. The output
$\{1,5,2\}$ is produced by the following series of draws:
$1,1,5,1,5,2$
from the pool $\{1,2,3,4,5\}$. In this case, we halt after $6$ operations.
What is the expected time complexity of this algorithm?
Draw elements from a set of size $m$, one by one, randomly, with replacement, until coming across $n$ different elements from the set ($n\le m$). We do not care what are the numbers drawn, and halt once $n$ different numbers were seen during this process.
This is a variation of the Coupon collector's problem, where one continues picking until seeing all $m$ elements of the set.
Example:
We draw numbers from the set $\{1,2,3,4,5\}$, $n=3$, $m=5$. The output
$\{1,5,2\}$ is produced by the following series of draws:
$1,1,5,1,5,2$
from the pool $\{1,2,3,4,5\}$. In this case, we halt after $6$ operations.
What is the expected time complexity of this algorithm?
Solution
When $m$ is much larger than $n$, the expected number of trials is basically linear in $n$. We can make this more precise, as shown below.
Let $T_n$ be the random variable which counts the number of trials up to seeing $n$ different elements, where the elements are picked uniformly at random from a set of size $m$.
You can write $T_n$ as the sum of geometric random variables
$$T_n=1+G\left(1-\frac{1}{m}\right)+G\left(1-\frac{2}{m}\right)+...+G\left(1-\frac{n-1}{m}\right)$$
where $G(p)$ is a geometric random variable with parameter $p$ (number of trails up to the first success, where success happens with probability $p$ (I leave it to you to prove the equivalence).
Using linearity of expectation:
$$\begin{align*}
\mathbb{E}\left[T_n\right]&=\sum\limits_{i=0}^{n-1}\mathbb{E}\left[G\left(1-\frac{i}{m}\right)\right]\\
&=\sum\limits_{i=0}^{n-1}\frac{m}{m-i}\\
&=m\left(\frac{1}{m}+\frac{1}{m-1}+...+\frac{1}{m-n+1}\right)\\
&=m\left(H_m-H_{m-n}\right)
\end{align*}$$
where $H_k$ is the harmonic sum, which satisfies $\frac{1}{2(k+1)}\le H_k-\ln k - \gamma\le \frac{1}{2k}$, where $\gamma$ is the Euler–Mascheroni constant (see Wolfram's page on harmonic numbers). Using these bounds for harmonic numbers we have:
$$\begin{align*}
\mathbb{E}\left[T_n\right]&\le m\left(\frac{1}{2m}+\ln m +\gamma -\left(\frac{1}{2(m-n+1)}+\ln(m-n)+\gamma\right)\right)\\
&\le m\left(\frac{1}{2m}+\ln\frac{m}{m-n}\right)=\frac{1}{2}+m\ln\frac{m}{m-n}.
\end{align*}$$
Note that $m\ln\frac{m}{m-n}=m\ln\left(1+\frac{n}{m-n}\right)\approx n$ when $n\ll m$, so for $m$ much larger than $n$, the expected number of trials is linear in $n$.
Let $T_n$ be the random variable which counts the number of trials up to seeing $n$ different elements, where the elements are picked uniformly at random from a set of size $m$.
You can write $T_n$ as the sum of geometric random variables
$$T_n=1+G\left(1-\frac{1}{m}\right)+G\left(1-\frac{2}{m}\right)+...+G\left(1-\frac{n-1}{m}\right)$$
where $G(p)$ is a geometric random variable with parameter $p$ (number of trails up to the first success, where success happens with probability $p$ (I leave it to you to prove the equivalence).
Using linearity of expectation:
$$\begin{align*}
\mathbb{E}\left[T_n\right]&=\sum\limits_{i=0}^{n-1}\mathbb{E}\left[G\left(1-\frac{i}{m}\right)\right]\\
&=\sum\limits_{i=0}^{n-1}\frac{m}{m-i}\\
&=m\left(\frac{1}{m}+\frac{1}{m-1}+...+\frac{1}{m-n+1}\right)\\
&=m\left(H_m-H_{m-n}\right)
\end{align*}$$
where $H_k$ is the harmonic sum, which satisfies $\frac{1}{2(k+1)}\le H_k-\ln k - \gamma\le \frac{1}{2k}$, where $\gamma$ is the Euler–Mascheroni constant (see Wolfram's page on harmonic numbers). Using these bounds for harmonic numbers we have:
$$\begin{align*}
\mathbb{E}\left[T_n\right]&\le m\left(\frac{1}{2m}+\ln m +\gamma -\left(\frac{1}{2(m-n+1)}+\ln(m-n)+\gamma\right)\right)\\
&\le m\left(\frac{1}{2m}+\ln\frac{m}{m-n}\right)=\frac{1}{2}+m\ln\frac{m}{m-n}.
\end{align*}$$
Note that $m\ln\frac{m}{m-n}=m\ln\left(1+\frac{n}{m-n}\right)\approx n$ when $n\ll m$, so for $m$ much larger than $n$, the expected number of trials is linear in $n$.
Context
StackExchange Computer Science Q#66236, answer score: 6
Revisions (0)
No revisions yet.