HiveBrain v1.2.0
Get Started
← Back to all entries
snippetrubyMinor

Bloom filter implementation using a BitArray

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
bitarraybloomfilterusingimplementation

Problem

I am following the Wikipedia definition of the Bloom filter. My implementation uses the BitArray I created here. Here is an alternative implementation by Ilya Grigorik.

# n = Number of items in the filter
# p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
# m = Number of bits in the filter
# k = Number of hash functions

require_relative '../bitarray/bit_array'
require 'digest'

class BloomFilter

  def initialize k: 1, m: 10
    @k = k
    @m = m
    @n = 0
    @b = BitArray.new
  end

  # Probability of a false positive based on formula in wikipedia
  def false_positive_rate
    fp_rate = (1-(1-1/@m.to_f)**(@k.to_f*@n.to_f))**@k.to_f
    p fp_rate
    fp_rate.round(2)
  end 

  def insert item
    @n += 1
    @k.times do |n|
      hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
      position = hash_val % @m
      @b.set position
    end  
    self
  end   

  def include? item
    @k.times do |n|
      hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
      position = hash_val % @m
      return @b.get(position) == 1
    end     
    return false
  end  

end


The test:

require 'minitest/autorun'
require_relative 'bloom_filter'

class BloomFilterTest < MiniTest::Unit::TestCase

  # Verified probability formula and tests from http://hur.st/bloomfilter?n=1&p=0.01
  def test_stats
    assert_equal 0.01, BloomFilter.new(k: 7, m: 10).insert("a").false_positive_rate
    b = BloomFilter.new(k: 2, m: 29)
    10.times { |n| b.insert(n) }
    assert_equal 0.25, b.false_positive_rate
  end 

  def test_include
    b = BloomFilter.new(k: 2, m: 100)
    b.insert "The Trumpet of the Swan"
    assert_equal true, b.include?("The Trumpet of the Swan")
    refute b.include?("E. B. White")   
  end  

end

Solution

-
You're using assert_equal to compare something with true - which is exactly what the basic assert method already does. Just like you use refute to check that something's false, assert checks that something's true.

-
You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)

-
In this case, I'm OK with using single-letter variables like k and m, since that appears to be conventional enough, and they're hard to give simple, yet descriptive names. However, I'd also add copious comments to explain exactly what the heck they mean. I had to go to Wikipedia to find out. (Edit: Sorry, comments already there, just not by the #initialize method, where I expected them to be.)

-
p fp_rate - don't leave print calls in final code (unless that code's purpose is to print, of course).

-
Why does #false_positive_rate round its output? Rounding is a presentation concern; the rate just is what it is, decimals and all. I'm thinking you're maybe rounding so it passes your tests, but in that case you're writing your code backwards.

-
Your #include? method is broken. It'll return (from the method; not just the block) at the first bit it checks - it won't check all of them. So if the first bit it checks is zero, you get false - which is correct! But if the first bit is 1, you immediate get true even though none of the other k-1 bits have been checked.

This also means that your false positive rate is at best

$$1 - (1 - \frac{1}{m})^n$$

because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.

For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\$, will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.

Making only only minimal changes, you could do this:

@k.times do |n|
  hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
  position = hash_val % @m
  return false if @b.get(position).zero?
end     
true


Now, it'll only return if a bit is zero, otherwise it'll keep going.

-
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.

If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.

Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.

-
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions

def item_bit_positions item
   @k.times.map do |n|
     hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
     hash % @m
   end
 end


You'll note that I'm using string interpolation; IMO it's clearer than manually calling to_s and concatenating. The results the same, though, but that result may not be what you want (see below)

-
Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from Object (i.e. everything). It's what's used to check for object equality, and it's hexadecimal number you see printed if you print an object that doesn't have a #to_s method.

#hash is just a (signed) integer. You could do something like:

def item_bit_positions item
   @k.times.map { |n| item.hash % (@m-n) }
 end


(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can re

Code Snippets

@k.times do |n|
  hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
  position = hash_val % @m
  return false if @b.get(position).zero?
end     
true
def item_bit_positions item
   @k.times.map do |n|
     hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
     hash % @m
   end
 end
def item_bit_positions item
   @k.times.map { |n| item.hash % (@m-n) }
 end
class BloomFilter
  # This reader is mostly here for test purposes,
  # which isn't too pretty, I'm afraid. At least
  # it's a pretty harmless addition
  attr_reader :bit_array

  # TODO: Explain what k and m mean!
  def initialize(k = 1, m = 10)
    k, m = k.to_i, m.to_i
    raise RangeError, "k must be smaller than m" if k >= m
    raise ArgumentError, "k and m must be positive integers" if k < 1
    @k = k
    @m = m
    @bit_array = 0
  end

  def insert(item)
    @bit_array |= digest(item)
    self
  end

  def include?(item)
    hash = digest(item)
    @bit_array & hash == hash
  end
  alias_method :member?, :include?

  private

  # WARNING: This method is flawed! See answer
  def digest(item)
    hash = item.hash
    @k.times.map do |n|
      bit = (hash % (@m-n))
      2**bit
    end.inject(&:|)
  end
end



gem 'minitest'
require 'minitest/autorun'

class BloomFilterTest < Minitest::Test
  def test_initialize_with_valid_args
    assert BloomFilter.new(1, 2)
  end

  def test_initialize_with_invalid_args
    assert_raises RangeError do
      BloomFilter.new(2, 2)
    end
  end

  def test_initialize_with_negative_args
    assert_raises ArgumentError do
      BloomFilter.new(-2, 2)
    end
  end

  def test_insert_uses_the_items_hash
    item = mock_item(123)
    instance = BloomFilter.new
    instance.insert(item)
    item.verify
  end

  def test_insert_changes_the_bit_array
    instance = BloomFilter.new
    previous = instance.bit_array
    instance.insert(Object.new)
    refute_equal previous, instance.bit_array
  end

  def test_include_uses_the_items_hash
    item = mock_item(123)
    instance = BloomFilter.new
    instance.include?(item)
    item.verify
  end

  def test_include_returns_true_for_member
    item = mock_item(123, 2)
    instance = BloomFilter.new
    instance.insert(item)
    assert instance.include?(item)
  end

  def test_include_returns_false_for_nonmember
    member = mock_item(123)
    non_member = mock_item(122)
    instance = BloomFilter.new(3, 10)
    instance.insert(member)
    refute instance.include?(non_member)
  end

  # An intentionally failing test
  def test_digest_always_sets_k_bits
    k = 3
    item = mock_item(720) # multiple of 10, 8 and 9
    instance = BloomFilter.new(k, 10)
    hash = instance.send(:digest, item) # testing private method
    assert_equal k, count_bits(hash), "Note: This test intentionally fails!"
  end

  # A helper method (mock object factory)
  def mock_item(hash, calls = 1)
    mock = Minitest::Mock.new
    calls.times { mock.expect(:hash, hash) }
    mock
  end

  def count_bits(number)
    count = 0
    until number == 0
      count += (number & 1)
      number = number >> 1
    end
    count
  end
end
mask = 0
k.times do |n|
  hash = calc_hash(n, item) # calculate a big 'ol hash (assume this method exists)
  mask = mask << f          # shift our current mask f places
  mask |= 2**(hash % f)     # set a bit somewhere in the lower f places
end

Context

StackExchange Code Review Q#67401, answer score: 2

Revisions (0)

No revisions yet.