snippetjavaMinor
Bloom filter implementation
Viewed 0 times
implementationbloomfilter
Problem
This is an implementation of Bloom filter with
I'm also verifying complexity: \$O(1)\$
```
public class BloomFilter {
private final BitSet bitSet;
private final int hashFunctionCount;
private final MessageDigest md5Digest;
/**
* Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from the total size of
* the Bloom and the number of expected elements.
*
* @param bitSetSize
* defines how many bits should be used in total for the filter.
* @param expectedNumberOfElements
* defines the maximum number of elements the filter is expected to contain.
* @throws NoSuchAlgorithmException
*/
public BloomFilter(int bitSetSize, int expectedNumberOfElements) throws NoSuchAlgorithmException {
bitSet = new BitSet(bitSetSize);
/*
* The natural logarithm is the logarithm to the base e, where e is an irrational and
* transcendental constant approximately equal to 2.718281828.
*/
hashFunctionCount = (int) Math.round((bitSetSize / (double)expectedNumberOfElements) * Math.log(2.0));
md5Digest = java.security.MessageDigest.getInstance("MD5");
}
/**
* Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The
* digest function is called until the required number of int's are produced. For each call to digest a salt
* is prepended to the data. The salt is increased by 1 for each call.
*
* @param data specifies input data.
* @param hashes number of hashes/int's to produce.
* @return array of int-sized hashes
*/
private int[] createHashes(byte[] data) {
int[] result = new int[hashFunctionCount];
int k = 0;
byte salt = 0;
while (k c) {
for (E
add and contain functionality. I'm looking for code review, best practices, optimizations etc.I'm also verifying complexity: \$O(1)\$
```
public class BloomFilter {
private final BitSet bitSet;
private final int hashFunctionCount;
private final MessageDigest md5Digest;
/**
* Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from the total size of
* the Bloom and the number of expected elements.
*
* @param bitSetSize
* defines how many bits should be used in total for the filter.
* @param expectedNumberOfElements
* defines the maximum number of elements the filter is expected to contain.
* @throws NoSuchAlgorithmException
*/
public BloomFilter(int bitSetSize, int expectedNumberOfElements) throws NoSuchAlgorithmException {
bitSet = new BitSet(bitSetSize);
/*
* The natural logarithm is the logarithm to the base e, where e is an irrational and
* transcendental constant approximately equal to 2.718281828.
*/
hashFunctionCount = (int) Math.round((bitSetSize / (double)expectedNumberOfElements) * Math.log(2.0));
md5Digest = java.security.MessageDigest.getInstance("MD5");
}
/**
* Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The
* digest function is called until the required number of int's are produced. For each call to digest a salt
* is prepended to the data. The salt is increased by 1 for each call.
*
* @param data specifies input data.
* @param hashes number of hashes/int's to produce.
* @return array of int-sized hashes
*/
private int[] createHashes(byte[] data) {
int[] result = new int[hashFunctionCount];
int k = 0;
byte salt = 0;
while (k c) {
for (E
Solution
-
Where's the class-level documentation?
-
will need renaming if you change your hash function. Better to call it
-
Great, but would it be useful to have a function to estimate an optimal filter size?
-
isn't very useful. You could say under what conditions it will throw that exception, but basically the conditions come down to "the JRE violates the spec, which says that MD5 will be supported", so it would be more sensible to catch the
-
As rolfl commented, this is a useless comment. Logarithms and
-
seems like a lot of implementation detail for the javadoc. I would be inclined to inline
(on the assumption that the use of buckets is documented at the class level, either directly or via literature reference).
-
doesn't seem to correspond to a parameter of the method.
-
confused me. I had to read the code to understand the comment, which rather defeats the purpose. Why are you splitting it?
-
is something which could be pulled out into a general-purpose bit bashing library, as a function
-
seems more complicated to me than
and frankly you can safely use
because there's no widely used MessageDigest whose output isn't a multiple of 32 bits.
-
Two things: firstly, the fact that it uses the object's
-
is a bit suspect. You could get different results with the same test case on different machines, or even on the same machine in different locales.
Where's the class-level documentation?
-
private final MessageDigest md5Digest;will need renaming if you change your hash function. Better to call it
hasher?-
* Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from the total size of
* the Bloom and the number of expected elements.Great, but would it be useful to have a function to estimate an optimal filter size?
-
* @throws NoSuchAlgorithmExceptionisn't very useful. You could say under what conditions it will throw that exception, but basically the conditions come down to "the JRE violates the spec, which says that MD5 will be supported", so it would be more sensible to catch the
NoSuchAlgorithmException and rethrow it wrapped in a RuntimeException. Then just include a unit test to make sure you haven't typoed "MD5".-
/*
* The natural logarithm is the logarithm to the base e, where e is an irrational and
* transcendental constant approximately equal to 2.718281828.
*/As rolfl commented, this is a useless comment. Logarithms and
e are far more basic concepts than Bloom filters, but you haven't tried to define Bloom filters anywhere. A useful comment here would be a literature reference to justify the estimate.-
* Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The
* digest function is called until the required number of int's are produced. For each call to digest a salt
* is prepended to the data. The salt is increased by 1 for each call.seems like a lot of implementation detail for the javadoc. I would be inclined to inline
getBitIndex into this method, rename it getBuckets, and simplify the javadoc to something like* Hashes the data to generate a set of filter buckets.(on the assumption that the use of buckets is documented at the class level, either directly or via literature reference).
-
* @param hashes number of hashes/int's to produce.doesn't seem to correspond to a parameter of the method.
-
* we divide an array into blocks of 4 for example:
* - 100 length digest array is broken into pieces of 25
*
* But i advances by 4, not by 25.confused me. I had to read the code to understand the comment, which rather defeats the purpose. Why are you splitting it?
* The MessageDigest output has far more entropy than we need, so
* we break it into 4-byte chunks and use each chunk to compute a
* bucket index.-
int h = 0;
// 4 bits are consumed for a single hash.
for (int j = (i*4); j < (i*4)+4; j++) {
h <<= 8;
h |= ((int) digest[j]) & 0xFF;
}is something which could be pulled out into a general-purpose bit bashing library, as a function
toIntBigEndian(byte[] buf, int off). Also, that cast is unnecessary: bytes and shorts are automatically promoted to int by any arithmetic or bitwise operation.-
for (int i = 0; i < digest.length/4 && k < hashFunctionCount; i++) {seems more complicated to me than
for (int i = 0; i + 3 < digest.length && k < hashFunctionCount; i += 4) {and frankly you can safely use
for (int i = 0; i < digest.length && k < hashFunctionCount; i += 4) {because there's no widely used MessageDigest whose output isn't a multiple of 32 bits.
-
* Adds an object to the Bloom filter. The output from the object's
* toString() method is used as input to the hash functions.Two things: firstly, the fact that it uses the object's
toString() as its identity is something that should be mentioned at the class level. Secondly, it strikes me as a bad idea anyway. If you're only working with strings, make that explicit by taking strings instead of genericising the class. Alternatively, add a field for a "serialiser" object which converts E to byte[], with the default being one that works via toString().-
element.toString().getBytes()is a bit suspect. You could get different results with the same test case on different machines, or even on the same machine in different locales.
Code Snippets
private final MessageDigest md5Digest;* Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from the total size of
* the Bloom and the number of expected elements.* @throws NoSuchAlgorithmException/*
* The natural logarithm is the logarithm to the base e, where e is an irrational and
* transcendental constant approximately equal to 2.718281828.
*/* Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The
* digest function is called until the required number of int's are produced. For each call to digest a salt
* is prepended to the data. The salt is increased by 1 for each call.Context
StackExchange Code Review Q#41589, answer score: 2
Revisions (0)
No revisions yet.