patternpythonMinor
BK Tree implementation
Viewed 0 times
implementationtreestackoverflow
Problem
I'm working on a system that can search for similar images. This involves being able to search by edit-distance, and as such I've implemented a special data-structure called a BK tree. Some notes here.
The core idea here is to be able to search for items by edit distance, in this case Hamming distance between discrete items. My hashes for this system are stored as 64 bit integers, but fundamentally they can be considered bit-fields, and are treated as such.
In any event, I have some test code that seems to work fairly well, and I'd greatly appreciate some input.
``
The core idea here is to be able to search for items by edit distance, in this case Hamming distance between discrete items. My hashes for this system are stored as 64 bit integers, but fundamentally they can be considered bit-fields, and are treated as such.
In any event, I have some test code that seems to work fairly well, and I'd greatly appreciate some input.
``
from libc.stdint cimport uint64_t
# Compute number of bits that are not common between a and b.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):
cdef uint64_t x
cdef int tot
tot = 0
x = (a ^ b)
while x > 0:
tot += x & 1
x >>= 1
return tot
cdef class BkHammingNode(object):
cdef uint64_t nodeHash
cdef set nodeData
cdef dict children
def __init__(self, nodeHash, nodeData):
self.nodeData = set((nodeData, ))
self.children = {}
self.nodeHash = nodeHash
# Insert phash nodeHash into tree, with the associated data nodeData
cpdef insert(self, uint64_t nodeHash, nodeData):
# If the current node has the same has as the data we're inserting,
# add the data to the current node's data set
if nodeHash == self.nodeHash:
self.nodeData.add(nodeData)
return
# otherwise, calculate the edit distance between the new phash and the current node's hash,
# and either recursively insert the data, or create a new child node for the phash
distance = hamming(self.nodeHash, nodeHash)
if not distance in self.children:
self.children[distance] = BkHammingNode(nodeHash, nodeData)
else:
self.children[distance].insert(nodeHash, nodeData)
# Remove node with hash nodeHash` and accompanyingSolution
I haven't really checked for correctness, but here are some general style points.
I haven't taken the time to actually understand the algorithm here; these are
surface-level opinions. If you want deeper analysis of the code correctness, I suggest you give me a testing harness (just something to
I don't have substantial criticism of the code; it's mostly really neat and readable.
First: docstrings. Instead of:
write
If you're actually targetting 3.4 only, don't write
If you're potentially wanting 2.x compatibility as well, though, it's good to keep it.
You have
It's worth noting that these aren't much faster than bog-standard untyped attributes,
but as long as you're aware they are a good datatype. This is actually why I expect
PyPy will give better speed advantages.
Instead of
In
this a type is pointless, especially because your return casts it
to a Python type. A similar but lesser concern exists for
Instead of
meant for if you use the return value.
You write:
It seems to me that
would be better. If not, try
You don't need to call
This:
should just be
or even
I don't know how this would work with
You should probably avoid class variables like
and although having an immutable global as fallback works, it goes counter to how
I'd expect the language to be used.
I haven't taken the time to actually understand the algorithm here; these are
surface-level opinions. If you want deeper analysis of the code correctness, I suggest you give me a testing harness (just something to
pyximport and run the code), so I get a look at how it's used.I don't have substantial criticism of the code; it's mostly really neat and readable.
First: docstrings. Instead of:
# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):write
cdef uint64_t hamming(uint64_t a, uint64_t b):
"""
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
"""If you're actually targetting 3.4 only, don't write
(object) in the inheritance list.If you're potentially wanting 2.x compatibility as well, though, it's good to keep it.
You have
cdef set nodeData
cdef dict childrenIt's worth noting that these aren't much faster than bog-standard untyped attributes,
but as long as you're aware they are a good datatype. This is actually why I expect
PyPy will give better speed advantages.
Instead of
set((nodeData, )) just write self.nodeData = {nodeData}.In
remove you have cdef uint64_t deleted = 0, but only add to it once. Givingthis a type is pointless, especially because your return casts it
to a Python type. A similar but lesser concern exists for
moved.Instead of
self.children.pop(selfDist) do del self.children[selfDist]. pop ismeant for if you use the return value.
You write:
ret = set()
if selfDist <= distance:
ret |= set(self.nodeData)It seems to me that
ret = set()
if selfDist <= distance
ret = set(self.nodeData)would be better. If not, try
ret.update(self.nodeData).You don't need to call
.keys() here:for key in self.children.keys():This:
if key = selfDist-distance:should just be
if selfDist-distance <= key <= selfDist+distance:or even
if abs(key - selfDist) <= distance:I don't know how this would work with
selfDist being a uint64_t and key being a Python int, but if you just remove the cdef uint64_t it'll likely work.You should probably avoid class variables like
root = None and just add them to__init__. Class variables are basically global variables,and although having an immutable global as fallback works, it goes counter to how
I'd expect the language to be used.
Code Snippets
# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):cdef uint64_t hamming(uint64_t a, uint64_t b):
"""
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
"""cdef set nodeData
cdef dict childrenret = set()
if selfDist <= distance:
ret |= set(self.nodeData)ret = set()
if selfDist <= distance
ret = set(self.nodeData)Context
StackExchange Code Review Q#67895, answer score: 2
Revisions (0)
No revisions yet.