patternMinor
Ways to perform "batch" Approximate Member Queries efficiently
Viewed 0 times
efficientlyapproximatewaysbatchmemberperformqueries
Problem
In this problem, I'm first given
I could use a Bloom Filter and simply iterate through
Some additional details:
n number of values which I have to store in a space efficient manner. Then I'm given m number of values which I have to check if they were added to the data structure before (such that after the operation, I know which value was a member, and which weren't).I could use a Bloom Filter and simply iterate through
m, but because m is considerably large, I wanted to see if there are more efficient ways. One potentially exploitable property of the data is that most values will not be present. I'm thinking I could construct an "aggregated query" which tells me if any of the value is present or not in one operation, and do something like a binary search, but wanted to see if there is already something out there.Some additional details:
- I can tolerate both false positives and false negatives (with predictable error rate)
nis very large (trillions)
mis considerably large (millions)
- Only very few values in
mwill be members (ca. 1 in 100K)
- The data structure can be immutable (no need for frequent writes)
Solution
If the query results are mostly false, the answer will be returned in $O(1)$ on average. (In traditional Bloom filters, negative results are faster than positive ones.)
This might be slow, however, since the lookups are random and have bad cache behavior. There are a few ways to fix this.
I'd suggest:
-
Use a blocked bloom filter (from Cache-, Hash- and Space-Efficient Bloom Filters) in which each item is mapped to a single cache-line-sized Bloom filter.
-
With the hash function that is used to map each item to a cache-line-sized Bloom filter, sort your queries in hash-value order.
-
Do the queries as usual, but in order.
Another approach is:
-
Use a split bloom filter, in which instead of setting $k$ values from a single array on each insert, set one value from each of $k$ arrays on insert.
-
After doing the inserts, use your first hash function to sort the values you wish to query in hash value order.
-
Query the first of your $k$ arrays for these values. Only a constant fraction will remain - $O(m)$ will be eliminated.
-
Take the remainder and repeat the last two steps, but using the second hash function, and the second bit array. Repeat $k$ times.
Sorting is expensive in internal memory, but it might be cheaper than all the random accesses. It depends on your machine and your data sizes.
This might be slow, however, since the lookups are random and have bad cache behavior. There are a few ways to fix this.
I'd suggest:
-
Use a blocked bloom filter (from Cache-, Hash- and Space-Efficient Bloom Filters) in which each item is mapped to a single cache-line-sized Bloom filter.
-
With the hash function that is used to map each item to a cache-line-sized Bloom filter, sort your queries in hash-value order.
-
Do the queries as usual, but in order.
Another approach is:
-
Use a split bloom filter, in which instead of setting $k$ values from a single array on each insert, set one value from each of $k$ arrays on insert.
-
After doing the inserts, use your first hash function to sort the values you wish to query in hash value order.
-
Query the first of your $k$ arrays for these values. Only a constant fraction will remain - $O(m)$ will be eliminated.
-
Take the remainder and repeat the last two steps, but using the second hash function, and the second bit array. Repeat $k$ times.
Sorting is expensive in internal memory, but it might be cheaper than all the random accesses. It depends on your machine and your data sizes.
Context
StackExchange Computer Science Q#47846, answer score: 3
Revisions (0)
No revisions yet.