snippetjavaMinor

Counting Bloom Filter

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

bloomfiltercounting

Problem

I'm working on writing a Java class library which includes the following implementation of a counting bloom filter:

package sj224.lib.util;

import java.util.function.Predicate;
import java.util.Random;

public class BloomFilter implements Predicate{
    private final int size;
    private int count;
    private final int bpe;
    private final int[] data;
    public int hashCode(){
        long t=count;
        for(int i=0;i source){
        this(64);
        for(V i:source)insert(i);
    }
    public boolean remove(V v){
        int[] a=hash(v);
        for(int i=0;i<size;i++){
            if(data[i]<a[i])return false;
        }
        for(int i=0;i<size;i++){
            data[i]-=a[i];
        }
        count--;
        return true;
    }
    public void clear(){
        for(int i=0;i<size;i++)data[i]=0;
    }
    public void insert(V v){
        int[] a=hash(v);
        for(int i=0;i<size;i++){
            data[i]+=a[i];
        }
        count++;
    }
    public boolean contains(V v){
        int[] a=hash(v);
        for(int i=0;i<size;i++){
            if(data[i]<a[i])return false;
        }
        return true;
    }

    public boolean test(V v){
        return contains(v);
    }
}

Is there anything here that can reasonably be improved?

Solution

Typo

This probably happened when you cut and pasted your code, but your contains function had a few lines switched in their ordering.

Operations taking O(size) time, could be faster

Your add, remove, and contains operations all do approximately the same thing. They create a hash where the hash contains up to bpe bits set in an array of size bits. Then they iterate over the size bits.

The important thing here is that bpe is much smaller than size. For any size larger than 64, bpe will be sqrt(size). Since only bpe bits are set, you only need to take O(bpe) time instead of O(size) time.

Operations using 32x the required space

Also, when you create your array of bits, you use one int per bit even though you only ever set the int to 1 or 0. If size is very large, such as 100 million, you could be using 400 MB when you only need to be using 12.5 MB.

Putting it all together

I rewrote your functions, and also combined all of their functionality into the hashing function since all 3 functions were so similar. In the new code, the loops only run through bpe iterations instead of size iterations. I also packed the bits to minimize the space used.

/**
 * Performs one of three operations: add, remove, or contains.
 *
 * @param        v                The element to add/remove/test.
 * @param        operation        1 = add, -1 = remove, 0 = contains
 * @return                        If operation is "contains", returns true
 *                                if v is contained in the bloom filter, and
 *                                false if it is not.  If operation is
 *                                anything else, returns true.
 */
private boolean hashOperation(V v, int operation)
{
    Random r         = new Random(v.hashCode());
    int[]  bitsFound = new int[(size+31)/32];

    for(int i=0;i> 5;
        int bit      = (1 << (bitNum & 31));
        if ((bitsFound[bitIndex] & bit) == 0) {
            // New bit.  Perform operation.
            bitsFound[bitIndex] |= bit;
            if (operation == 0) {
                if(data[bitNum] == 0)
                    return false;
            } else {
                data[bitNum] += operation;
            }
        }
    }
    return true;
}

public boolean remove(V v)
{
    if (contains(v)) {
        hashOperation(v, -1);
        count--;
        return true;
    }
    return false;
}

public void insert(V v)
{
    hashOperation(v, 1);
    count++;
}

public boolean contains(V v)
{
    return hashOperation(v, 0);
}

Addendum: Testing the fastest way to hash using maaartinus's method

Maaartinus showed a fast way to hash using just 2 hash functions. I decided to test out 4 variants:

Using a power of 2 size (should be fastest)

Using modulus operator every loop (should be slowest)

Using a compare + subtract

Doing the subtract without using any compare

Note, the following is all done in C:

Variant 1

All variants use the following setup. The thing that varies will be in the loop:

uint32_t hash, hash1, hash2, size;

size  = rand();
hash1 = rand() % size;
hash2 = rand() % size;

hash = hash1;
uint32_t sizeMask = 0x3fffffff;  // Variant 1 only
for (i=0;i<iterations;i++) {
    hash = (hash + hash2) & (sizeMask);
}

Variant 2

for (i=0;i<iterations;i++) {
    hash = (hash + hash2) % size;
}

Variant 3

// If size is  size)
        hash -= size;
}

// This works for all cases (and has the same runtime as the one above)
for (i=0;i hash2)
        hash += hash2;
    else
        hash -= (size - hash2);
}

Variant 4

// In C, >> is not guaranteed to do an arithmetic shift, so
// we use a logical shift instead.  The compiler optimizes this code
// to do an arithmetic shift anyways.  In java, change the ">>" here
// to a ">>>".
for (i=0;i> 31)) & size;
}

// In java, where >> is arithmetic shift, you could write it like this.
for (i=0;i> 31) & size;
}

Timing results

Running each variant for 1 billion iterations (results in seconds):

Variant 1 (& power of 2): 0.58
Variant 2 (% size)      : 6.56
Variant 3 (if, subtract): 1.25
Variant 4 (no if)       : 1.44

However, this is a bit misleading because there was nothing in the loop except for the hash calculation. If I add in the part where we use the hash on each loop:

bits[hash>>5] |= (1 << (hash & 31));

Then the timings all become much closer:

Size = 1 billion
Variant 1 (& power of 2): 15.92
Variant 2 (% size)      : 18.92
Variant 3 (if, subtract): 17.48
Variant 4 (no if)       : 16.76

Size = 4 million
Variant 1 (& power of 2): 2.48
Variant 2 (% size)      : 7.11
Variant 3 (if, subtract): 2.58
Variant 4 (no if)       : 2.52

Size = 256K
Variant 1 (& power of 2): 1.24
Variant 2 (% size)      : 6.64
Variant 3 (if, subtract): 1.79
Variant 4 (no if)       : 1.95

So it appears that at large sizes, the memory manipulation of the bit array will be the limiting factor and not the hash calculation. So you can choose from the ab

Code Snippets

/**
 * Performs one of three operations: add, remove, or contains.
 *
 * @param        v                The element to add/remove/test.
 * @param        operation        1 = add, -1 = remove, 0 = contains
 * @return                        If operation is "contains", returns true
 *                                if v is contained in the bloom filter, and
 *                                false if it is not.  If operation is
 *                                anything else, returns true.
 */
private boolean hashOperation(V v, int operation)
{
    Random r         = new Random(v.hashCode());
    int[]  bitsFound = new int[(size+31)/32];

    for(int i=0;i<bpe;i++){
        int bitNum   = r.nextInt(size);
        int bitIndex = bitNum >> 5;
        int bit      = (1 << (bitNum & 31));
        if ((bitsFound[bitIndex] & bit) == 0) {
            // New bit.  Perform operation.
            bitsFound[bitIndex] |= bit;
            if (operation == 0) {
                if(data[bitNum] == 0)
                    return false;
            } else {
                data[bitNum] += operation;
            }
        }
    }
    return true;
}

public boolean remove(V v)
{
    if (contains(v)) {
        hashOperation(v, -1);
        count--;
        return true;
    }
    return false;
}

public void insert(V v)
{
    hashOperation(v, 1);
    count++;
}

public boolean contains(V v)
{
    return hashOperation(v, 0);
}

uint32_t hash, hash1, hash2, size;

size  = rand();
hash1 = rand() % size;
hash2 = rand() % size;

hash = hash1;
uint32_t sizeMask = 0x3fffffff;  // Variant 1 only
for (i=0;i<iterations;i++) {
    hash = (hash + hash2) & (sizeMask);
}

for (i=0;i<iterations;i++) {
    hash = (hash + hash2) % size;
}

// If size is < 0x80000000
for (i=0;i<iterations;i++) {
    hash += hash2;
    if (hash > size)
        hash -= size;
}

// This works for all cases (and has the same runtime as the one above)
for (i=0;i<iterations;i++) {
    if (size - hash > hash2)
        hash += hash2;
    else
        hash -= (size - hash2);
}

// In C, >> is not guaranteed to do an arithmetic shift, so
// we use a logical shift instead.  The compiler optimizes this code
// to do an arithmetic shift anyways.  In java, change the ">>" here
// to a ">>>".
for (i=0;i<iterations;i++) {
    hash += hash2;
    hash -= (-((size - hash) >> 31)) & size;
}

// In java, where >> is arithmetic shift, you could write it like this.
for (i=0;i<iterations;i++) {
    hash += hash2;
    hash -= ((size - hash) >> 31) & size;
}

Context

StackExchange Code Review Q#90218, answer score: 4

Revisions (0)

No revisions yet.