HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Ways to perform "batch" Approximate Member Queries efficiently

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
efficientlyapproximatewaysbatchmemberperformqueries

Problem

In this problem, I'm first given n number of values which I have to store in a space efficient manner. Then I'm given m number of values which I have to check if they were added to the data structure before (such that after the operation, I know which value was a member, and which weren't).

I could use a Bloom Filter and simply iterate through m, but because m is considerably large, I wanted to see if there are more efficient ways. One potentially exploitable property of the data is that most values will not be present. I'm thinking I could construct an "aggregated query" which tells me if any of the value is present or not in one operation, and do something like a binary search, but wanted to see if there is already something out there.

Some additional details:

  • I can tolerate both false positives and false negatives (with predictable error rate)



  • n is very large (trillions)



  • m is considerably large (millions)



  • Only very few values in m will be members (ca. 1 in 100K)



  • The data structure can be immutable (no need for frequent writes)

Solution

If the query results are mostly false, the answer will be returned in $O(1)$ on average. (In traditional Bloom filters, negative results are faster than positive ones.)

This might be slow, however, since the lookups are random and have bad cache behavior. There are a few ways to fix this.

I'd suggest:

-
Use a blocked bloom filter (from Cache-, Hash- and Space-Efficient Bloom Filters) in which each item is mapped to a single cache-line-sized Bloom filter.

-
With the hash function that is used to map each item to a cache-line-sized Bloom filter, sort your queries in hash-value order.

-
Do the queries as usual, but in order.

Another approach is:

-
Use a split bloom filter, in which instead of setting $k$ values from a single array on each insert, set one value from each of $k$ arrays on insert.

-
After doing the inserts, use your first hash function to sort the values you wish to query in hash value order.

-
Query the first of your $k$ arrays for these values. Only a constant fraction will remain - $O(m)$ will be eliminated.

-
Take the remainder and repeat the last two steps, but using the second hash function, and the second bit array. Repeat $k$ times.

Sorting is expensive in internal memory, but it might be cheaper than all the random accesses. It depends on your machine and your data sizes.

Context

StackExchange Computer Science Q#47846, answer score: 3

Revisions (0)

No revisions yet.