HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Python bloom filter

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
bloomfilterpython

Problem

I was hoping to get some feedback on my bloom filter implementation using mmh3. mmh3 is a hashing library based on the murmur hash which is a fast non-cryptographically secure hashing algorithm.

I tested it with some simple words and it seems to be working.
import hashlib

import bitarray
import mmh3

def calc_optimal_hash_func(lenght_of_entries):
m = (-lenght_of_entries * math.log(0.01)) / ((math.log(2)) ** 2)
k = (m / lenght_of_entries) * math.log(2)

return int(m), int(math.ceil(k))

def lookup(string, bit_array, seeds):
for seed in range(seeds):
result = mmh3.hash(string, seed) % len(bit_array)
# print "seed", seed, "pos", result, "->", bit_array[result]
if bit_array[result] == False:
return string, "Def not in dictionary"

return string, "Probably in here"

def load_words():
data = []
word_loc = '/usr/share/dict/words'

with open(word_loc, 'r') as f:
for word in f.readlines():
data.append(word.strip())

return data

def get_bit_array():
words = load_words()
w_length = len(words)
array_length, seeds = calc_optimal_hash_func(w_length)
bit_array = array_length * bitarray.bitarray('0')

for word in words:
try:
for seed in range(seeds):
# print "word", word, "seed", seed, "pos", result, "->", bit_array[result]
pos = mmh3.hash(word, seed) % array_length
bit_array[pos] = True
except:
pass

return seeds, bit_array

if __name__ == '__main__':
seeds, ba = get_bit_array()
print(lookup('badwordforsure', ba, seeds))
print(lookup('cat', ba, seeds))
print(lookup('hello', ba, seeds))
print(lookup('jsalj', ba, seeds))

Solution

Sadly I'm no really into Bloom filter, so those aspects needs to be reviewed by someone else. However, there are some idiomatic stuff I would like to address in your code.

First let me mention some style issues (mostly my preferences, and not really large issues):

  • Consider specifying imports – You are using limited parts of bitarray and mmh3, so you could consider using from bitarray import bitarray and from mmh3 import hash. However, this is based on personal preferences, and using mmh3.hash() does clearly indicate which hash function you're using.



  • Do you need the hashlib import? – It doesn't seem like you using anything from it. Is it needed?



  • Don't use more parentheses than needed – In calc_optimal_hash_func() you use a lot of parentheses. Are really all of those needed? It seems a little too much, and I'd prefer not to use that many, as it kind of clutters up the formulas to some extent. This is still more of a general advice, though.



  • Don't test for == False – Use the not operator instead, and do if not bit_array[result]:. It simply make more sense.



  • Variable naming – Naming a variable result, or bit_array, or string doesn't convey anything about the purpose of the variable. These and some of the other could benefit from better naming.



  • Comment on the non-obvious stuff – It would be nice to see some comments describing what actually happens in your code. What kind of a lookup do you do?



  • Mostly good spacing – Most of your code is reasonable easy to read, but I don't like the alignment at start of get_bit_array(). I think it would be better to use the default way of words = load_words() and so on.



Code smells

In addition to the smaller style issues, there are some bigger code smells in your code which I would like to address:

  • load_words() hides a global constant – It always loads /usr/share/dict/words, which kind of removes the need for a function. It would make a little more sense if you had load_words(dictionary_file).



  • load_words() reads the whole file into memory – If I'm not mistaken, part of the reason you're wanting to use a Bloom Filter is to verify membership within a really large data structure. When you load the entire thing into memory, there is no need for the filter, you'd be better off checking for membership in the array directly!



  • Why the try...except around the mmh3.hash()? – This seems a little strange, as you pass the catch all the time. Does this serve some unknown purpose? If so, it should have been documented. And if not, it should be removed.



-
Consider making a class, instead of functions – Having get_bit_array() return two variables, which you need to shuffle around later on, makes me think that this would better be served with a class. Imaging something like the following main section:

bloom_filter = BloomFilter('/usr/share/dict/words')

print(bloom_filter.lookup('badwordforsure'))
print(bloom_filter.lookup('cat'))
print(bloom_filter.lookup('hello'))
print(bloom_filter.lookup('jsalj'))


This would also expose what you're filtering towards, and it would allow for better interface and handling in general. It does seem like most of the functions are only used once, with the exception of the lookup function.

-
Join load_words() and get_bit_array() – In order to avoid keeping the entire dictionary in memory, I would build the bit_array directly when reading the file. This would really ease the memory requirements of your code.

Alternative implementation

Here is an implementation were I've taken most of these advice into account:

```
import mmh3
import bitarray
import math

class BloomFilter:
"""By building a bit array based upon a dictionary, this class
allows for probable membership, and certain non-membership of
any lookups within the filter."""

def __init__(self, dictionary_file):
"""Based on the dictionary_file, builds a bit array to
be used for testing membership within the file for a given
percentage, and accurate non-membership."""

# Skip file to get number of words
number_words = sum(1 for line in open(dictionary_file))

# Initialize some variables
self.calc_optimal_hash(number_words)
self.bit_array = self.array_length * bitarray.bitarray('0')

# Reread file, and build bit array
with (open(dictionary_file, 'r')) as dict_file:
for word in dict_file.readlines():
for seed in range(self.seed_count):
hashIndex = mmh3.hash(word, seed) % self.array_length
self.bit_array[hashIndex] = True

def calc_optimal_hash(self, number_words):
"""Calculate array_length and seed_count."""

# If I'm mistaken in precedence, re-add parentheses :-)
m = -number_words * math.log(0.01) / math.log(2) ** 2
k = m / number_words * math.log(2)

self.array_length = int(m)
self.seed_count = int(

Code Snippets

bloom_filter = BloomFilter('/usr/share/dict/words')

print(bloom_filter.lookup('badwordforsure'))
print(bloom_filter.lookup('cat'))
print(bloom_filter.lookup('hello'))
print(bloom_filter.lookup('jsalj'))
import mmh3
import bitarray
import math

class BloomFilter:
    """By building a bit array based upon a dictionary, this class
    allows for probable membership, and certain non-membership of
    any lookups within the filter."""

    def __init__(self, dictionary_file):
        """Based on the dictionary_file, builds a bit array to 
        be used for testing membership within the file for a given
        percentage, and accurate non-membership."""

        # Skip file to get number of words
        number_words = sum(1 for line in open(dictionary_file))

        # Initialize some variables
        self.calc_optimal_hash(number_words)
        self.bit_array = self.array_length * bitarray.bitarray('0')

        # Reread file, and build bit array 
        with (open(dictionary_file, 'r')) as dict_file:
            for word in dict_file.readlines():
                for seed in range(self.seed_count):
                    hashIndex = mmh3.hash(word, seed) % self.array_length
                    self.bit_array[hashIndex] = True


    def calc_optimal_hash(self, number_words):
        """Calculate array_length and seed_count."""

        # If I'm mistaken in precedence, re-add parentheses :-)
        m = -number_words * math.log(0.01) / math.log(2) ** 2
        k = m / number_words * math.log(2)

        self.array_length = int(m)
        self.seed_count = int(math.ceil(k))


    def probable_member(self, word):
        """Test whether word probably is in the dictionary, or
        are surely not in the dictionary."""

        for seed in range(self.seed_count):
            candidateHash = mmh3.hash(word, seed) % self.array_length
            if not self.bit_array[candidateHash]:
                return False

        return True


    def lookup(self, word):
        """Test whether word probably is in the dictionary, or
        are surely not in the dictionary."""

        if self.probable_member(word):
            return '"{}" is most likely in dictionary'.format(word)
        else:
            return '"{}" is not in dictionary'.format(word)


def main():

   bloom_filter = BloomFilter('/usr/share/dict/words')

   print(bloom_filter.lookup('badwordforsure'))
   print(bloom_filter.lookup('cat'))
   print(bloom_filter.lookup('hello'))
   print(bloom_filter.lookup('jsalj'))

main()

Context

StackExchange Code Review Q#162064, answer score: 3

Revisions (0)

No revisions yet.