snippetrubyMinor
Bloom filter implementation using a BitArray
Viewed 0 times
bitarraybloomfilterusingimplementation
Problem
I am following the Wikipedia definition of the Bloom filter. My implementation uses the
The test:
BitArray I created here. Here is an alternative implementation by Ilya Grigorik.# n = Number of items in the filter
# p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
# m = Number of bits in the filter
# k = Number of hash functions
require_relative '../bitarray/bit_array'
require 'digest'
class BloomFilter
def initialize k: 1, m: 10
@k = k
@m = m
@n = 0
@b = BitArray.new
end
# Probability of a false positive based on formula in wikipedia
def false_positive_rate
fp_rate = (1-(1-1/@m.to_f)**(@k.to_f*@n.to_f))**@k.to_f
p fp_rate
fp_rate.round(2)
end
def insert item
@n += 1
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
@b.set position
end
self
end
def include? item
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
return @b.get(position) == 1
end
return false
end
endThe test:
require 'minitest/autorun'
require_relative 'bloom_filter'
class BloomFilterTest < MiniTest::Unit::TestCase
# Verified probability formula and tests from http://hur.st/bloomfilter?n=1&p=0.01
def test_stats
assert_equal 0.01, BloomFilter.new(k: 7, m: 10).insert("a").false_positive_rate
b = BloomFilter.new(k: 2, m: 29)
10.times { |n| b.insert(n) }
assert_equal 0.25, b.false_positive_rate
end
def test_include
b = BloomFilter.new(k: 2, m: 100)
b.insert "The Trumpet of the Swan"
assert_equal true, b.include?("The Trumpet of the Swan")
refute b.include?("E. B. White")
end
endSolution
-
You're using
-
You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)
-
In this case, I'm OK with using single-letter variables like
-
-
Why does
-
Your
This also means that your false positive rate is at best
$$1 - (1 - \frac{1}{m})^n$$
because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.
For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\$, will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.
Making only only minimal changes, you could do this:
Now, it'll only return if a bit is zero, otherwise it'll keep going.
-
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.
If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.
Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.
-
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions
You'll note that I'm using string interpolation; IMO it's clearer than manually calling
-
Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from
(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can re
You're using
assert_equal to compare something with true - which is exactly what the basic assert method already does. Just like you use refute to check that something's false, assert checks that something's true.-
You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)
-
In this case, I'm OK with using single-letter variables like
k and m, since that appears to be conventional enough, and they're hard to give simple, yet descriptive names. However, I'd also add copious comments to explain exactly what the heck they mean. I had to go to Wikipedia to find out. (Edit: Sorry, comments already there, just not by the #initialize method, where I expected them to be.)-
p fp_rate - don't leave print calls in final code (unless that code's purpose is to print, of course).-
Why does
#false_positive_rate round its output? Rounding is a presentation concern; the rate just is what it is, decimals and all. I'm thinking you're maybe rounding so it passes your tests, but in that case you're writing your code backwards.-
Your
#include? method is broken. It'll return (from the method; not just the block) at the first bit it checks - it won't check all of them. So if the first bit it checks is zero, you get false - which is correct! But if the first bit is 1, you immediate get true even though none of the other k-1 bits have been checked.This also means that your false positive rate is at best
$$1 - (1 - \frac{1}{m})^n$$
because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.
For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\$, will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.
Making only only minimal changes, you could do this:
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
return false if @b.get(position).zero?
end
trueNow, it'll only return if a bit is zero, otherwise it'll keep going.
-
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.
If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.
Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.
-
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions
def item_bit_positions item
@k.times.map do |n|
hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
hash % @m
end
endYou'll note that I'm using string interpolation; IMO it's clearer than manually calling
to_s and concatenating. The results the same, though, but that result may not be what you want (see below)-
Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from
Object (i.e. everything). It's what's used to check for object equality, and it's hexadecimal number you see printed if you print an object that doesn't have a #to_s method.#hash is just a (signed) integer. You could do something like:def item_bit_positions item
@k.times.map { |n| item.hash % (@m-n) }
end(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can re
Code Snippets
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
return false if @b.get(position).zero?
end
truedef item_bit_positions item
@k.times.map do |n|
hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
hash % @m
end
enddef item_bit_positions item
@k.times.map { |n| item.hash % (@m-n) }
endclass BloomFilter
# This reader is mostly here for test purposes,
# which isn't too pretty, I'm afraid. At least
# it's a pretty harmless addition
attr_reader :bit_array
# TODO: Explain what k and m mean!
def initialize(k = 1, m = 10)
k, m = k.to_i, m.to_i
raise RangeError, "k must be smaller than m" if k >= m
raise ArgumentError, "k and m must be positive integers" if k < 1
@k = k
@m = m
@bit_array = 0
end
def insert(item)
@bit_array |= digest(item)
self
end
def include?(item)
hash = digest(item)
@bit_array & hash == hash
end
alias_method :member?, :include?
private
# WARNING: This method is flawed! See answer
def digest(item)
hash = item.hash
@k.times.map do |n|
bit = (hash % (@m-n))
2**bit
end.inject(&:|)
end
end
gem 'minitest'
require 'minitest/autorun'
class BloomFilterTest < Minitest::Test
def test_initialize_with_valid_args
assert BloomFilter.new(1, 2)
end
def test_initialize_with_invalid_args
assert_raises RangeError do
BloomFilter.new(2, 2)
end
end
def test_initialize_with_negative_args
assert_raises ArgumentError do
BloomFilter.new(-2, 2)
end
end
def test_insert_uses_the_items_hash
item = mock_item(123)
instance = BloomFilter.new
instance.insert(item)
item.verify
end
def test_insert_changes_the_bit_array
instance = BloomFilter.new
previous = instance.bit_array
instance.insert(Object.new)
refute_equal previous, instance.bit_array
end
def test_include_uses_the_items_hash
item = mock_item(123)
instance = BloomFilter.new
instance.include?(item)
item.verify
end
def test_include_returns_true_for_member
item = mock_item(123, 2)
instance = BloomFilter.new
instance.insert(item)
assert instance.include?(item)
end
def test_include_returns_false_for_nonmember
member = mock_item(123)
non_member = mock_item(122)
instance = BloomFilter.new(3, 10)
instance.insert(member)
refute instance.include?(non_member)
end
# An intentionally failing test
def test_digest_always_sets_k_bits
k = 3
item = mock_item(720) # multiple of 10, 8 and 9
instance = BloomFilter.new(k, 10)
hash = instance.send(:digest, item) # testing private method
assert_equal k, count_bits(hash), "Note: This test intentionally fails!"
end
# A helper method (mock object factory)
def mock_item(hash, calls = 1)
mock = Minitest::Mock.new
calls.times { mock.expect(:hash, hash) }
mock
end
def count_bits(number)
count = 0
until number == 0
count += (number & 1)
number = number >> 1
end
count
end
endmask = 0
k.times do |n|
hash = calc_hash(n, item) # calculate a big 'ol hash (assume this method exists)
mask = mask << f # shift our current mask f places
mask |= 2**(hash % f) # set a bit somewhere in the lower f places
endContext
StackExchange Code Review Q#67401, answer score: 2
Revisions (0)
No revisions yet.