HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Heavy hitters problem linear scan and auxiliary space

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
heavylinearproblemspacescanhittersandauxiliary

Problem

I'm looking at Lecture #2 from this Stanford course http://web.stanford.edu/class/cs168/index.html, let me introduce the HH problem:

We have an array of $n$ elements and we want to know which elements appear at least $\frac{n}{k}$ times, for a given $k$. You can think of $n$ as being pretty large and $k$ small.

Proposition There's no algorithm that solves the Heavy Hitters problem in one pass while using a sublinear amount of space.

An intuition for why this holds is given in the notes:

  • for $k=\frac{n}{2}$ we're looking at elements that appear at least twice in the array.



  • suppose the array is of the form $x_1, x_2, \ldots , x_{n-1}, y$ , say $S=\{x_i \mid 1\leq i\leq n\}$, then basically the HH problem reduces to a membership problem, is $y\in S$?



  • we can solve this problem by saving $S$ as a hash table, and the idea is that we can't solve all the memebership problems (for say $n/2$ elements) using sublinear space.



They suggest using the PHP but I don't see how that applies in this context, any idea?

I can see why the statement is true but I'm looking for a formal proof of it.

Solution

Here is the idea, taking as an example the case $k = n/2$ which you mention. Also, let's assume that a word is $\log n$ bits, and that each element consists of $C$ words. In total there are $n^C$ possible elements, forming the universe $U$.

Suppose that an algorithm uses $m$ words of space. For each subset $S$ of $U$ of size $n-1$, let $T(S)$ denote the contents of memory after the algorithm has read the first the elements of $S$ in order. We claim that $T(S_1) \neq T(S_2)$ if $S_1 \neq S_2$. Indeed, suppose without loss of generality that $S_1$ isn't a subset of $S_2$, and let $x \in S_1 \setminus S_2$. If the last element that the algorithm reads is $x$ then it should output $x$ if it had read the elements in $S_1$, and it should output nothing if it has read the elements in $S_2$.

Since there are $\binom{|U|}{n-1} = \binom{n^C}{n-1}$ different sets $S$ and $n^m$ different memory settings, we conclude that $\binom{n^C}{n-1} \leq n^m$. Since $\binom{a}{b} \geq (a/b)^b$, we conclude that $(n^{C-1})^n \lesssim n^m$, and so $(C-1)n \leq m$ or $m = \Omega(n)$ (for $C > 1$).

Context

StackExchange Computer Science Q#57051, answer score: 3

Revisions (0)

No revisions yet.