snippetpythonMinor
Python bloom filter
Viewed 0 times
bloomfilterpython
Problem
I was hoping to get some feedback on my bloom filter implementation using
I tested it with some simple words and it seems to be working.
mmh3. mmh3 is a hashing library based on the murmur hash which is a fast non-cryptographically secure hashing algorithm.I tested it with some simple words and it seems to be working.
import hashlib
import bitarray
import mmh3
def calc_optimal_hash_func(lenght_of_entries):
m = (-lenght_of_entries * math.log(0.01)) / ((math.log(2)) ** 2)
k = (m / lenght_of_entries) * math.log(2)
return int(m), int(math.ceil(k))
def lookup(string, bit_array, seeds):
for seed in range(seeds):
result = mmh3.hash(string, seed) % len(bit_array)
# print "seed", seed, "pos", result, "->", bit_array[result]
if bit_array[result] == False:
return string, "Def not in dictionary"
return string, "Probably in here"
def load_words():
data = []
word_loc = '/usr/share/dict/words'
with open(word_loc, 'r') as f:
for word in f.readlines():
data.append(word.strip())
return data
def get_bit_array():
words = load_words()
w_length = len(words)
array_length, seeds = calc_optimal_hash_func(w_length)
bit_array = array_length * bitarray.bitarray('0')
for word in words:
try:
for seed in range(seeds):
# print "word", word, "seed", seed, "pos", result, "->", bit_array[result]
pos = mmh3.hash(word, seed) % array_length
bit_array[pos] = True
except:
pass
return seeds, bit_array
if __name__ == '__main__':
seeds, ba = get_bit_array()
print(lookup('badwordforsure', ba, seeds))
print(lookup('cat', ba, seeds))
print(lookup('hello', ba, seeds))
print(lookup('jsalj', ba, seeds))
Solution
Sadly I'm no really into Bloom filter, so those aspects needs to be reviewed by someone else. However, there are some idiomatic stuff I would like to address in your code.
First let me mention some style issues (mostly my preferences, and not really large issues):
Code smells
In addition to the smaller style issues, there are some bigger code smells in your code which I would like to address:
-
Consider making a class, instead of functions – Having
This would also expose what you're filtering towards, and it would allow for better interface and handling in general. It does seem like most of the functions are only used once, with the exception of the
-
Join
Alternative implementation
Here is an implementation were I've taken most of these advice into account:
```
import mmh3
import bitarray
import math
class BloomFilter:
"""By building a bit array based upon a dictionary, this class
allows for probable membership, and certain non-membership of
any lookups within the filter."""
def __init__(self, dictionary_file):
"""Based on the dictionary_file, builds a bit array to
be used for testing membership within the file for a given
percentage, and accurate non-membership."""
# Skip file to get number of words
number_words = sum(1 for line in open(dictionary_file))
# Initialize some variables
self.calc_optimal_hash(number_words)
self.bit_array = self.array_length * bitarray.bitarray('0')
# Reread file, and build bit array
with (open(dictionary_file, 'r')) as dict_file:
for word in dict_file.readlines():
for seed in range(self.seed_count):
hashIndex = mmh3.hash(word, seed) % self.array_length
self.bit_array[hashIndex] = True
def calc_optimal_hash(self, number_words):
"""Calculate array_length and seed_count."""
# If I'm mistaken in precedence, re-add parentheses :-)
m = -number_words * math.log(0.01) / math.log(2) ** 2
k = m / number_words * math.log(2)
self.array_length = int(m)
self.seed_count = int(
First let me mention some style issues (mostly my preferences, and not really large issues):
- Consider specifying imports – You are using limited parts of
bitarrayandmmh3, so you could consider usingfrom bitarray import bitarrayandfrom mmh3 import hash. However, this is based on personal preferences, and usingmmh3.hash()does clearly indicate which hash function you're using.
- Do you need the
hashlibimport? – It doesn't seem like you using anything from it. Is it needed?
- Don't use more parentheses than needed – In
calc_optimal_hash_func()you use a lot of parentheses. Are really all of those needed? It seems a little too much, and I'd prefer not to use that many, as it kind of clutters up the formulas to some extent. This is still more of a general advice, though.
- Don't test for
== False– Use thenotoperator instead, and doif not bit_array[result]:. It simply make more sense.
- Variable naming – Naming a variable
result, orbit_array, orstringdoesn't convey anything about the purpose of the variable. These and some of the other could benefit from better naming.
- Comment on the non-obvious stuff – It would be nice to see some comments describing what actually happens in your code. What kind of a
lookupdo you do?
- Mostly good spacing – Most of your code is reasonable easy to read, but I don't like the alignment at start of
get_bit_array(). I think it would be better to use the default way ofwords = load_words()and so on.
Code smells
In addition to the smaller style issues, there are some bigger code smells in your code which I would like to address:
load_words()hides a global constant – It always loads/usr/share/dict/words, which kind of removes the need for a function. It would make a little more sense if you hadload_words(dictionary_file).
load_words()reads the whole file into memory – If I'm not mistaken, part of the reason you're wanting to use a Bloom Filter is to verify membership within a really large data structure. When you load the entire thing into memory, there is no need for the filter, you'd be better off checking for membership in the array directly!
- Why the
try...exceptaround themmh3.hash()? – This seems a little strange, as youpassthe catch all the time. Does this serve some unknown purpose? If so, it should have been documented. And if not, it should be removed.
-
Consider making a class, instead of functions – Having
get_bit_array() return two variables, which you need to shuffle around later on, makes me think that this would better be served with a class. Imaging something like the following main section:bloom_filter = BloomFilter('/usr/share/dict/words')
print(bloom_filter.lookup('badwordforsure'))
print(bloom_filter.lookup('cat'))
print(bloom_filter.lookup('hello'))
print(bloom_filter.lookup('jsalj'))This would also expose what you're filtering towards, and it would allow for better interface and handling in general. It does seem like most of the functions are only used once, with the exception of the
lookup function. -
Join
load_words() and get_bit_array() – In order to avoid keeping the entire dictionary in memory, I would build the bit_array directly when reading the file. This would really ease the memory requirements of your code.Alternative implementation
Here is an implementation were I've taken most of these advice into account:
```
import mmh3
import bitarray
import math
class BloomFilter:
"""By building a bit array based upon a dictionary, this class
allows for probable membership, and certain non-membership of
any lookups within the filter."""
def __init__(self, dictionary_file):
"""Based on the dictionary_file, builds a bit array to
be used for testing membership within the file for a given
percentage, and accurate non-membership."""
# Skip file to get number of words
number_words = sum(1 for line in open(dictionary_file))
# Initialize some variables
self.calc_optimal_hash(number_words)
self.bit_array = self.array_length * bitarray.bitarray('0')
# Reread file, and build bit array
with (open(dictionary_file, 'r')) as dict_file:
for word in dict_file.readlines():
for seed in range(self.seed_count):
hashIndex = mmh3.hash(word, seed) % self.array_length
self.bit_array[hashIndex] = True
def calc_optimal_hash(self, number_words):
"""Calculate array_length and seed_count."""
# If I'm mistaken in precedence, re-add parentheses :-)
m = -number_words * math.log(0.01) / math.log(2) ** 2
k = m / number_words * math.log(2)
self.array_length = int(m)
self.seed_count = int(
Code Snippets
bloom_filter = BloomFilter('/usr/share/dict/words')
print(bloom_filter.lookup('badwordforsure'))
print(bloom_filter.lookup('cat'))
print(bloom_filter.lookup('hello'))
print(bloom_filter.lookup('jsalj'))import mmh3
import bitarray
import math
class BloomFilter:
"""By building a bit array based upon a dictionary, this class
allows for probable membership, and certain non-membership of
any lookups within the filter."""
def __init__(self, dictionary_file):
"""Based on the dictionary_file, builds a bit array to
be used for testing membership within the file for a given
percentage, and accurate non-membership."""
# Skip file to get number of words
number_words = sum(1 for line in open(dictionary_file))
# Initialize some variables
self.calc_optimal_hash(number_words)
self.bit_array = self.array_length * bitarray.bitarray('0')
# Reread file, and build bit array
with (open(dictionary_file, 'r')) as dict_file:
for word in dict_file.readlines():
for seed in range(self.seed_count):
hashIndex = mmh3.hash(word, seed) % self.array_length
self.bit_array[hashIndex] = True
def calc_optimal_hash(self, number_words):
"""Calculate array_length and seed_count."""
# If I'm mistaken in precedence, re-add parentheses :-)
m = -number_words * math.log(0.01) / math.log(2) ** 2
k = m / number_words * math.log(2)
self.array_length = int(m)
self.seed_count = int(math.ceil(k))
def probable_member(self, word):
"""Test whether word probably is in the dictionary, or
are surely not in the dictionary."""
for seed in range(self.seed_count):
candidateHash = mmh3.hash(word, seed) % self.array_length
if not self.bit_array[candidateHash]:
return False
return True
def lookup(self, word):
"""Test whether word probably is in the dictionary, or
are surely not in the dictionary."""
if self.probable_member(word):
return '"{}" is most likely in dictionary'.format(word)
else:
return '"{}" is not in dictionary'.format(word)
def main():
bloom_filter = BloomFilter('/usr/share/dict/words')
print(bloom_filter.lookup('badwordforsure'))
print(bloom_filter.lookup('cat'))
print(bloom_filter.lookup('hello'))
print(bloom_filter.lookup('jsalj'))
main()Context
StackExchange Code Review Q#162064, answer score: 3
Revisions (0)
No revisions yet.