HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

NumPy array as quasi-hash table

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
quasinumpyarrayhashtable

Problem

Motivation: I have a large number of large read-only dicts with large string keys mapped to tuples of two floats. They are taking up a large amount of heap space.

Proposed solution: a 3xn NumPy array, with the first column being the hash value of the key, and sorted by that column. Only contains and getitem ops are required for my needs.

I realize that this solution is \$O(log(n))\$ rather than \$O(c)\$ but I assume that this isn't that big a problem for my application as there aren't that many lookups per request and my primary problem is memory constraints.

I'm not sure how well searchsorted performs here though, or if I should try to change the stride somehow so that the first column is contiguous or something like that. I should also probably be using a record array.

import numpy

class FCD(object):

  def __init__(self, d):
    data = []
    for key, value in d.iteritems():
      data.append((hash(key), value[0], value[1]))
    data.sort()
    self.data = numpy.array(data)

  def __getitem__(self, key):
    hkey = hash(key)
    ix = numpy.searchsorted(self.data[:, 0], hkey)
    if ix < len(self.data) and self.data[ix][0] == hkey:
      return self.data[ix][1:]
    else:
      raise KeyError

  def __contains__(self, key):
    hkey = hash(key)
    ix = numpy.searchsorted(self.data[:, 0], hkey)
    return ix < len(self.data) and self.data[ix][0] == hkey

Solution

I think this looks pretty efficient. Let me add a few more minor, detailed comments (more than 1 year after the question was asked; I hope they are still useful):

-
After you convert to a NumPy array, the keys are longer guaranteed to be unique (if someone re-uses your code in a different context, etc). This will cause searchsorted to fail. You might want to add a check such as:

assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0])


to make sure users and readers of your code know that the first column of your array must be unique.

-
I agree that record arrays (aka structured arrays) would be better. Your code would be more readable because instead of things like:

self.data[ix][1:]


you could write:

self.data[ix]['Float_1'], self.data[ix]['Float_2']


(but please use more descriptive names for your context than 'Float_1 etc.!)

Also, and more importantly: without using a structured array, your hash keys will be stored as floats instead of ints. That will take extra memory, and make future calls to searchsorted and __eq slightly less efficient.

-
Your __contains__ method seems overly stringent to me, which would result in it being slower. Do you need to call searcshorted in __contains__ at all? Wouldn't something like this suffice and be more efficient?

def __contains__(self, key):
    hkey = hash(key)
    return np.in1d(self.data[:, 0], hkey).any()


-
You are duplicating the __contains__ check in your __getitem__ code. I'd just say it this way:

if __contains__(self, key): return self.data[ix][1:]


-
Very minor, but the idiom is always import numpy as np. It's probably best to use that so you can do np.searchsorted etc. instead of typing numpy.searchsorted as that's what most NumPy users do.

Code Snippets

assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0])
self.data[ix][1:]
self.data[ix]['Float_1'], self.data[ix]['Float_2']
def __contains__(self, key):
    hkey = hash(key)
    return np.in1d(self.data[:, 0], hkey).any()
if __contains__(self, key): return self.data[ix][1:]

Context

StackExchange Code Review Q#47022, answer score: 3

Revisions (0)

No revisions yet.