snippetjavaMinor
Counting Bloom Filter
Viewed 0 times
bloomfiltercounting
Problem
I'm working on writing a Java class library which includes the following implementation of a counting bloom filter:
Is there anything here that can reasonably be improved?
package sj224.lib.util;
import java.util.function.Predicate;
import java.util.Random;
public class BloomFilter implements Predicate{
private final int size;
private int count;
private final int bpe;
private final int[] data;
public int hashCode(){
long t=count;
for(int i=0;i source){
this(64);
for(V i:source)insert(i);
}
public boolean remove(V v){
int[] a=hash(v);
for(int i=0;i<size;i++){
if(data[i]<a[i])return false;
}
for(int i=0;i<size;i++){
data[i]-=a[i];
}
count--;
return true;
}
public void clear(){
for(int i=0;i<size;i++)data[i]=0;
}
public void insert(V v){
int[] a=hash(v);
for(int i=0;i<size;i++){
data[i]+=a[i];
}
count++;
}
public boolean contains(V v){
int[] a=hash(v);
for(int i=0;i<size;i++){
if(data[i]<a[i])return false;
}
return true;
}
public boolean test(V v){
return contains(v);
}
}Is there anything here that can reasonably be improved?
Solution
Typo
This probably happened when you cut and pasted your code, but your
Operations taking O(size) time, could be faster
Your add, remove, and contains operations all do approximately the same thing. They create a hash where the hash contains up to
The important thing here is that
Operations using 32x the required space
Also, when you create your array of bits, you use one int per bit even though you only ever set the int to 1 or 0. If size is very large, such as 100 million, you could be using 400 MB when you only need to be using 12.5 MB.
Putting it all together
I rewrote your functions, and also combined all of their functionality into the hashing function since all 3 functions were so similar. In the new code, the loops only run through
Addendum: Testing the fastest way to hash using maaartinus's method
Maaartinus showed a fast way to hash using just 2 hash functions. I decided to test out 4 variants:
Note, the following is all done in C:
Variant 1
All variants use the following setup. The thing that varies will be in the loop:
Variant 2
Variant 3
Variant 4
Timing results
Running each variant for 1 billion iterations (results in seconds):
However, this is a bit misleading because there was nothing in the loop except for the hash calculation. If I add in the part where we use the hash on each loop:
Then the timings all become much closer:
So it appears that at large sizes, the memory manipulation of the bit array will be the limiting factor and not the hash calculation. So you can choose from the ab
This probably happened when you cut and pasted your code, but your
contains function had a few lines switched in their ordering.Operations taking O(size) time, could be faster
Your add, remove, and contains operations all do approximately the same thing. They create a hash where the hash contains up to
bpe bits set in an array of size bits. Then they iterate over the size bits.The important thing here is that
bpe is much smaller than size. For any size larger than 64, bpe will be sqrt(size). Since only bpe bits are set, you only need to take O(bpe) time instead of O(size) time.Operations using 32x the required space
Also, when you create your array of bits, you use one int per bit even though you only ever set the int to 1 or 0. If size is very large, such as 100 million, you could be using 400 MB when you only need to be using 12.5 MB.
Putting it all together
I rewrote your functions, and also combined all of their functionality into the hashing function since all 3 functions were so similar. In the new code, the loops only run through
bpe iterations instead of size iterations. I also packed the bits to minimize the space used./**
* Performs one of three operations: add, remove, or contains.
*
* @param v The element to add/remove/test.
* @param operation 1 = add, -1 = remove, 0 = contains
* @return If operation is "contains", returns true
* if v is contained in the bloom filter, and
* false if it is not. If operation is
* anything else, returns true.
*/
private boolean hashOperation(V v, int operation)
{
Random r = new Random(v.hashCode());
int[] bitsFound = new int[(size+31)/32];
for(int i=0;i> 5;
int bit = (1 << (bitNum & 31));
if ((bitsFound[bitIndex] & bit) == 0) {
// New bit. Perform operation.
bitsFound[bitIndex] |= bit;
if (operation == 0) {
if(data[bitNum] == 0)
return false;
} else {
data[bitNum] += operation;
}
}
}
return true;
}
public boolean remove(V v)
{
if (contains(v)) {
hashOperation(v, -1);
count--;
return true;
}
return false;
}
public void insert(V v)
{
hashOperation(v, 1);
count++;
}
public boolean contains(V v)
{
return hashOperation(v, 0);
}Addendum: Testing the fastest way to hash using maaartinus's method
Maaartinus showed a fast way to hash using just 2 hash functions. I decided to test out 4 variants:
- Using a power of 2 size (should be fastest)
- Using modulus operator every loop (should be slowest)
- Using a compare + subtract
- Doing the subtract without using any compare
Note, the following is all done in C:
Variant 1
All variants use the following setup. The thing that varies will be in the loop:
uint32_t hash, hash1, hash2, size;
size = rand();
hash1 = rand() % size;
hash2 = rand() % size;
hash = hash1;
uint32_t sizeMask = 0x3fffffff; // Variant 1 only
for (i=0;i<iterations;i++) {
hash = (hash + hash2) & (sizeMask);
}Variant 2
for (i=0;i<iterations;i++) {
hash = (hash + hash2) % size;
}Variant 3
// If size is size)
hash -= size;
}
// This works for all cases (and has the same runtime as the one above)
for (i=0;i hash2)
hash += hash2;
else
hash -= (size - hash2);
}Variant 4
// In C, >> is not guaranteed to do an arithmetic shift, so
// we use a logical shift instead. The compiler optimizes this code
// to do an arithmetic shift anyways. In java, change the ">>" here
// to a ">>>".
for (i=0;i> 31)) & size;
}
// In java, where >> is arithmetic shift, you could write it like this.
for (i=0;i> 31) & size;
}Timing results
Running each variant for 1 billion iterations (results in seconds):
Variant 1 (& power of 2): 0.58
Variant 2 (% size) : 6.56
Variant 3 (if, subtract): 1.25
Variant 4 (no if) : 1.44
However, this is a bit misleading because there was nothing in the loop except for the hash calculation. If I add in the part where we use the hash on each loop:
bits[hash>>5] |= (1 << (hash & 31));Then the timings all become much closer:
Size = 1 billion
Variant 1 (& power of 2): 15.92
Variant 2 (% size) : 18.92
Variant 3 (if, subtract): 17.48
Variant 4 (no if) : 16.76
Size = 4 million
Variant 1 (& power of 2): 2.48
Variant 2 (% size) : 7.11
Variant 3 (if, subtract): 2.58
Variant 4 (no if) : 2.52
Size = 256K
Variant 1 (& power of 2): 1.24
Variant 2 (% size) : 6.64
Variant 3 (if, subtract): 1.79
Variant 4 (no if) : 1.95
So it appears that at large sizes, the memory manipulation of the bit array will be the limiting factor and not the hash calculation. So you can choose from the ab
Code Snippets
/**
* Performs one of three operations: add, remove, or contains.
*
* @param v The element to add/remove/test.
* @param operation 1 = add, -1 = remove, 0 = contains
* @return If operation is "contains", returns true
* if v is contained in the bloom filter, and
* false if it is not. If operation is
* anything else, returns true.
*/
private boolean hashOperation(V v, int operation)
{
Random r = new Random(v.hashCode());
int[] bitsFound = new int[(size+31)/32];
for(int i=0;i<bpe;i++){
int bitNum = r.nextInt(size);
int bitIndex = bitNum >> 5;
int bit = (1 << (bitNum & 31));
if ((bitsFound[bitIndex] & bit) == 0) {
// New bit. Perform operation.
bitsFound[bitIndex] |= bit;
if (operation == 0) {
if(data[bitNum] == 0)
return false;
} else {
data[bitNum] += operation;
}
}
}
return true;
}
public boolean remove(V v)
{
if (contains(v)) {
hashOperation(v, -1);
count--;
return true;
}
return false;
}
public void insert(V v)
{
hashOperation(v, 1);
count++;
}
public boolean contains(V v)
{
return hashOperation(v, 0);
}uint32_t hash, hash1, hash2, size;
size = rand();
hash1 = rand() % size;
hash2 = rand() % size;
hash = hash1;
uint32_t sizeMask = 0x3fffffff; // Variant 1 only
for (i=0;i<iterations;i++) {
hash = (hash + hash2) & (sizeMask);
}for (i=0;i<iterations;i++) {
hash = (hash + hash2) % size;
}// If size is < 0x80000000
for (i=0;i<iterations;i++) {
hash += hash2;
if (hash > size)
hash -= size;
}
// This works for all cases (and has the same runtime as the one above)
for (i=0;i<iterations;i++) {
if (size - hash > hash2)
hash += hash2;
else
hash -= (size - hash2);
}// In C, >> is not guaranteed to do an arithmetic shift, so
// we use a logical shift instead. The compiler optimizes this code
// to do an arithmetic shift anyways. In java, change the ">>" here
// to a ">>>".
for (i=0;i<iterations;i++) {
hash += hash2;
hash -= (-((size - hash) >> 31)) & size;
}
// In java, where >> is arithmetic shift, you could write it like this.
for (i=0;i<iterations;i++) {
hash += hash2;
hash -= ((size - hash) >> 31) & size;
}Context
StackExchange Code Review Q#90218, answer score: 4
Revisions (0)
No revisions yet.