HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

BK Tree implementation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
implementationtreestackoverflow

Problem

I'm working on a system that can search for similar images. This involves being able to search by edit-distance, and as such I've implemented a special data-structure called a BK tree. Some notes here.

The core idea here is to be able to search for items by edit distance, in this case Hamming distance between discrete items. My hashes for this system are stored as 64 bit integers, but fundamentally they can be considered bit-fields, and are treated as such.

In any event, I have some test code that seems to work fairly well, and I'd greatly appreciate some input.

``
from libc.stdint cimport uint64_t

# Compute number of bits that are not common between
a and b.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):

cdef uint64_t x
cdef int tot

tot = 0

x = (a ^ b)
while x > 0:
tot += x & 1
x >>= 1
return tot

cdef class BkHammingNode(object):

cdef uint64_t nodeHash
cdef set nodeData
cdef dict children

def __init__(self, nodeHash, nodeData):
self.nodeData = set((nodeData, ))
self.children = {}
self.nodeHash = nodeHash

# Insert phash
nodeHash into tree, with the associated data nodeData
cpdef insert(self, uint64_t nodeHash, nodeData):

# If the current node has the same has as the data we're inserting,
# add the data to the current node's data set
if nodeHash == self.nodeHash:
self.nodeData.add(nodeData)
return

# otherwise, calculate the edit distance between the new phash and the current node's hash,
# and either recursively insert the data, or create a new child node for the phash
distance = hamming(self.nodeHash, nodeHash)
if not distance in self.children:
self.children[distance] = BkHammingNode(nodeHash, nodeData)
else:
self.children[distance].insert(nodeHash, nodeData)

# Remove node with hash
nodeHash` and accompanying

Solution

I haven't really checked for correctness, but here are some general style points.
I haven't taken the time to actually understand the algorithm here; these are
surface-level opinions. If you want deeper analysis of the code correctness, I suggest you give me a testing harness (just something to pyximport and run the code), so I get a look at how it's used.

I don't have substantial criticism of the code; it's mostly really neat and readable.

First: docstrings. Instead of:

# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):


write

cdef uint64_t hamming(uint64_t a, uint64_t b):
    """
    Compute number of bits that are not common between `a` and `b`.
    return value is a plain integer
    """


If you're actually targetting 3.4 only, don't write (object) in the inheritance list.
If you're potentially wanting 2.x compatibility as well, though, it's good to keep it.

You have

cdef set nodeData
    cdef dict children


It's worth noting that these aren't much faster than bog-standard untyped attributes,
but as long as you're aware they are a good datatype. This is actually why I expect
PyPy will give better speed advantages.

Instead of set((nodeData, )) just write self.nodeData = {nodeData}.

In remove you have cdef uint64_t deleted = 0, but only add to it once. Giving
this a type is pointless, especially because your return casts it
to a Python type. A similar but lesser concern exists for moved.

Instead of self.children.pop(selfDist) do del self.children[selfDist]. pop is
meant for if you use the return value.

You write:

ret = set()

        if selfDist <= distance:
            ret |= set(self.nodeData)


It seems to me that

ret = set()

        if selfDist <= distance
            ret = set(self.nodeData)


would be better. If not, try ret.update(self.nodeData).

You don't need to call .keys() here:

for key in self.children.keys():


This:

if key = selfDist-distance:


should just be

if selfDist-distance <= key <= selfDist+distance:


or even

if abs(key - selfDist) <= distance:


I don't know how this would work with selfDist being a uint64_t and key being a Python int, but if you just remove the cdef uint64_t it'll likely work.

You should probably avoid class variables like root = None and just add them to
__init__. Class variables are basically global variables,
and although having an immutable global as fallback works, it goes counter to how
I'd expect the language to be used.

Code Snippets

# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):
cdef uint64_t hamming(uint64_t a, uint64_t b):
    """
    Compute number of bits that are not common between `a` and `b`.
    return value is a plain integer
    """
cdef set nodeData
    cdef dict children
ret = set()

        if selfDist <= distance:
            ret |= set(self.nodeData)
ret = set()

        if selfDist <= distance
            ret = set(self.nodeData)

Context

StackExchange Code Review Q#67895, answer score: 2

Revisions (0)

No revisions yet.