gotchaMinor

Maximum coverage 1/2-approximation algorithm: why does the central lemma hold?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

whythemaximumapproximationcentralalgorithmholddoeslemmacoverage

Problem

I am looking for an approximation algorithm for the Maximum Coverage problem and a proof of its approximation ratio. As approximation algorithm I use the greedy algorithm which chooses the set that maximizes the number of new elements at each turn (same as the one proposed in the Wikipedia article). However it is not as trivial to reach a result about the approximation ratio of this greedy approach.

So I did some research and found these lecture notes that describe some useful concepts for this problem and also proves a lemma. Now let me iterate what is said in these lecture notes before we get to the lemma.

Maximum Coverage problem
Given a set of sets $U=\{S_{1},\dots,S_{m}\}$ and a number $k$, choose (at most) $k$ sets such that the union of the sets is maximized, or $\max_{I\subseteq U}\vert \cup_{j\in I}S_{j}\vert$ for some index set $I$. The sets may have elements in common. Let $\mathbf{OPT}$ denote the optimal solution, i.e. numbers at most covered by $k$ sets, let $a_{i}$ denote the number of newly covered elements in the $i$:th iteration by the greedy algorithm, let $b_{i}$ be the total number of covered elements in the $i$:th iteration, i.e. $b_{i}=\sum_{j=1}^{i}a_{i}$, and let $c_{i}$ be the number of uncovered elements at the $i$:th iteration, i.e. $c_{i}=\mathbf{OPT}-b_{i}$. Moreover, $a_{0}=0$, $b_{0}=0$ and $c_{0}=\mathbf{OPT}$.

Lemma: The number of newly covered elements at the $(i+1)$:th iteration (i.e. $a_{i+1}$) is always greater than or equal to $(1/k)\cdot$number of uncovered elements at the $i$:th iteration, or $a_{i+1}\geq c_{i}/k$.

Proof: Optimal solution covers $\mathbf{OPT}$ elements at $k$ iterations. That means, at each iteration there should be some sets in $U$ whose size is greater than or equal to the $(1/k)$ of the remaining uncovered elements, i.e., $c_{i}/k$. Otherwise, it was impossible to cover $\mathbf{OPT}$ many elements at $k$ steps by the optimal solution. Since the approximation algorithm is greedy, i.e., choosing always

Solution

Let $S_{O_1},\ldots,S_{O_k}$ be an optimal solution, and let $O = S_{O_1} \cup \cdots \cup S_{O_k}$.

At the $i$'th iteration, suppose that the elements covered by the sets chosen the algorithm form the set $T$. Let $S'_{O_i} = S_{O_i} \setminus T$ and $O' = O \setminus T$, that is, we have removed from the optimal solution all elements already covered. Since
$$
|O'| = |S'_{O_1} \cup \cdots \cup S'_{O_k}| \leq |S'_{O_1}| + \cdots + |S'_{O_k}| \leq k \max_{i=1}^k |S'_{O_i}|,
$$
we see that among the sets $S_{O_1},\ldots,S_{O_k}$, at least one (the one maximizing $\max_i |S'_{O_i}|$ covers at least $|O'|/k$ uncovered elements.

The greedy algorithm chooses a set covering at least as many uncovered elements. This allows us to obtain a lower bound on its approximation ratio, as follows. Let $T_i$ be the union of the sets chosen in the first $i$ steps of the greedy algorithm. The starting point is $T_0 = \emptyset$. At step $i$, the argument above shows that we cover at least $|O'|/k$ new elements. Since $|O'| = |O \setminus T_{i-1}| \geq |O| - |T_{i-1}|$, this shows that
$$
|T_i| \geq |T_{i-1}| + (|O| - |T_{i-1}|)/k = \left(1 - \frac{1}{k}\right) |T_{i-1}| + \frac{|O|}{k}.
$$
In particular,
$$
\begin{align}
&|T_1| \geq \frac{|O|}{k} \\
&|T_2| \geq \frac{|O|}{k} + \left(1 - \frac{1}{k}\right) \frac{|O|}{k} \\
&|T_3| \geq \frac{|O|}{k} + \left(1 - \frac{1}{k}\right) \frac{|O|}{k} + \left(1 - \frac{1}{k}\right)^2 \frac{|O|}{k}
\end{align}
$$
and so on. At the end of the algorithm we get a set $T_k$ satisfying
$$
\begin{align}
|T_k| &\geq \frac{|O|}{k} \left[1 + \left(1 - \frac{1}{k}\right) + \cdots + \left(1 - \frac{1}{k}\right)^{k-1}\right] \\ &=
\frac{|O|}{k} \frac{1 - \left(1 - \frac{1}{k}\right)^k}{1 - \left(1 - \frac{1}{k}\right)} \\ &=
\left(1 - \left(1 - \frac{1}{k}\right)^k\right) |O|.
\end{align}
$$
Now it's time for some calculus:
$$
\ln \left(1 - \frac{1}{k}\right) = - \frac{1}{k} - \frac{1}{2k^2} - \cdots \leq -\frac{1}{k},
$$
and so
$$
\left(1 - \frac{1}{k}\right)^k = \exp\left[k\ln \left(1 - \frac{1}{k}\right)\right] \leq \exp (-1) = 1/e,
$$
implying that
$$
|T_k| \geq \left(1 - \frac{1}{e}\right) |O|.
$$
In other words, the greedy algorithm gives a $(1-1/e)$-approximation (this is better than your claimed $1/2$). It turns out that $1-1/e$ is tight for the greedy algorithm, and moreover, then $1-1/e$ cannot be improved by a polynomial time algorithm unless $\mathsf{P} = \mathsf{NP}$ (and unconditionally in the value oracle model).

Context

StackExchange Computer Science Q#124664, answer score: 2

Revisions (0)

No revisions yet.