patternpythonMinor
NumPy array as quasi-hash table
Viewed 0 times
quasinumpyarrayhashtable
Problem
Motivation: I have a large number of large read-only dicts with large string keys mapped to tuples of two floats. They are taking up a large amount of heap space.
Proposed solution: a 3xn NumPy array, with the first column being the hash value of the key, and sorted by that column. Only contains and
I realize that this solution is \$O(log(n))\$ rather than \$O(c)\$ but I assume that this isn't that big a problem for my application as there aren't that many lookups per request and my primary problem is memory constraints.
I'm not sure how well
Proposed solution: a 3xn NumPy array, with the first column being the hash value of the key, and sorted by that column. Only contains and
getitem ops are required for my needs.I realize that this solution is \$O(log(n))\$ rather than \$O(c)\$ but I assume that this isn't that big a problem for my application as there aren't that many lookups per request and my primary problem is memory constraints.
I'm not sure how well
searchsorted performs here though, or if I should try to change the stride somehow so that the first column is contiguous or something like that. I should also probably be using a record array.import numpy
class FCD(object):
def __init__(self, d):
data = []
for key, value in d.iteritems():
data.append((hash(key), value[0], value[1]))
data.sort()
self.data = numpy.array(data)
def __getitem__(self, key):
hkey = hash(key)
ix = numpy.searchsorted(self.data[:, 0], hkey)
if ix < len(self.data) and self.data[ix][0] == hkey:
return self.data[ix][1:]
else:
raise KeyError
def __contains__(self, key):
hkey = hash(key)
ix = numpy.searchsorted(self.data[:, 0], hkey)
return ix < len(self.data) and self.data[ix][0] == hkeySolution
I think this looks pretty efficient. Let me add a few more minor, detailed comments (more than 1 year after the question was asked; I hope they are still useful):
-
After you convert to a NumPy array, the keys are longer guaranteed to be unique (if someone re-uses your code in a different context, etc). This will cause
to make sure users and readers of your code know that the first column of your array must be unique.
-
I agree that record arrays (aka structured arrays) would be better. Your code would be more readable because instead of things like:
you could write:
(but please use more descriptive names for your context than
Also, and more importantly: without using a structured array, your hash keys will be stored as floats instead of ints. That will take extra memory, and make future calls to
-
Your
-
You are duplicating the
-
Very minor, but the idiom is always
-
After you convert to a NumPy array, the keys are longer guaranteed to be unique (if someone re-uses your code in a different context, etc). This will cause
searchsorted to fail. You might want to add a check such as:assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0])to make sure users and readers of your code know that the first column of your array must be unique.
-
I agree that record arrays (aka structured arrays) would be better. Your code would be more readable because instead of things like:
self.data[ix][1:]you could write:
self.data[ix]['Float_1'], self.data[ix]['Float_2'](but please use more descriptive names for your context than
'Float_1 etc.!)Also, and more importantly: without using a structured array, your hash keys will be stored as floats instead of ints. That will take extra memory, and make future calls to
searchsorted and __eq slightly less efficient.-
Your
__contains__ method seems overly stringent to me, which would result in it being slower. Do you need to call searcshorted in __contains__ at all? Wouldn't something like this suffice and be more efficient?def __contains__(self, key):
hkey = hash(key)
return np.in1d(self.data[:, 0], hkey).any()-
You are duplicating the
__contains__ check in your __getitem__ code. I'd just say it this way:if __contains__(self, key): return self.data[ix][1:]-
Very minor, but the idiom is always
import numpy as np. It's probably best to use that so you can do np.searchsorted etc. instead of typing numpy.searchsorted as that's what most NumPy users do.Code Snippets
assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0])self.data[ix][1:]self.data[ix]['Float_1'], self.data[ix]['Float_2']def __contains__(self, key):
hkey = hash(key)
return np.in1d(self.data[:, 0], hkey).any()if __contains__(self, key): return self.data[ix][1:]Context
StackExchange Code Review Q#47022, answer score: 3
Revisions (0)
No revisions yet.