HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Distinct elements count of huge multiset

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
distinctmultisetelementshugecount

Problem

I know that HyperLogLog can approximate the distinct elements count of a huge multiset but I was wondering if it was possible, using a method I saw mentioned on an IRC channel, to get an exact answer while still using significantly less space than a traditional approach.

Would the following work to compute the exact (not approximate) distinct elements count of a huge multiset?

The idea is to use a Bloom filter twice, first processing the data one way, and then in reverse and additionally using a map (whose size is much smaller than the huge multiset -- this property is provided by the Bloom filter).

1st pass
create an empty bloom filter
create an empty map
for each element from the multiset
    if already in the bloom add to the map with key element / value 0
    else add to the bloom


Second pass is nearly identical but elements from the huge multiset are processed in reverse (multiset doesn't need to be an ordered set: it simply needs to have its elements iterated from the end to its beginning).

2nd pass
    empty the bloom filter (but not the set)
    for each element from the multiset (but processed this time in reverse order)
        if already in the bloom add to the map with key element / value 0
        else add to the bloom


At this point the map contains a list of potentially, but not necessarily, clashing entries in the multiset. So the entries are all set to zero.

3rd pass
   cnt = 0
   for each element in the multiset
       if present in the map, increment value in that map for that key
       else increment cnt


Exact cardinality is now equal to "cnt + nb entries in the map whose value is 1 or greater".

This requires three passes but uses space identical to a Bloom filter plus a map to count the x% of entries the Bloom filter couldn't answer if they were present or not.

I hacked a quick proof of concept and did some generative testing and it seemed to give the correct answer but I'm really not sure.

Is it possible to

Solution

Let me simplify your third step: count the number of elements in your multiset which are not in the map, and add to it the number of elements in the map.

Suppose that your elements are $x_1,\ldots,x_n$. For a given hash $h$, let $x_{i_1},\ldots,x_{i_\ell}$ be the elements hashing to $h$ (in order). If $\ell = 1$, then the elements won't be added to the map in the first or second step. Otherwise, elements $x_{i_2},\ldots,x_{i_\ell}$ will be added to the map in the first step, and elements $x_{i_{\ell-1}},\ldots,x_{i_1}$ will be added to the map in the second step.

In total, this shows that after the first two steps, the map will contain all elements whose hash appears more than once.

After the third step, cnt will contain the number of elements whose hash appears exactly once. All of these elements are distinct, and also distinct from the elements in the map. You then add to cnt the number of elements present in the map, and so obtain the number of distinct elements.

Context

StackExchange Computer Science Q#106475, answer score: 3

Revisions (0)

No revisions yet.