HiveBrain v1.2.0

Fastest static associative map

Submitted by: @import:stackexchange-cs
Tags: associative-map, fastest, static

Problem

Let's say that I wanted to build a key-to-value associative map where the only requirement is that lookups are fast.

Once built, the associative map would not allow inserts, deletes, or modifications, so it is a static data structure.

What would the best algorithm be for making and using such a thing?

I'm interested to know if there is a provable "best solution", or if it's provably NP-Hard or similar.

Here are some ideas of my own:

Minimal Perfect Hash

My best idea would be to use minimal perfect hashing: find a hash function that maps the $N$ known inputs to $[0,N)$, then use the resulting value as an index into an array.

I would want to find the computationally cheapest (average time) hashing algorithm for my data set.

However, finding a minimal perfect hash function is already a challenging problem, even before asking for the computationally cheapest one.
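To make the idea concrete, here is a minimal sketch (not from the original post) of one standard way to build such a function: the "hash-and-displace" construction, which buckets the keys with a first-level hash and then searches per-bucket seed values until every key lands in a distinct slot of $[0,N)$. The key set is made up for illustration.

```python
# Hash-and-displace sketch of a minimal perfect hash: each lookup costs
# two cheap hashes, and the only stored state is one seed per bucket.
import hashlib

def _h(key: str, seed: int, n: int) -> int:
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n

def build_mph(keys):
    n = len(keys)
    # First level: distribute keys into n buckets.
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[_h(k, 0, n)].append(k)
    # Place the largest buckets first; they are the hardest to fit.
    order = sorted(range(n), key=lambda b: -len(buckets[b]))
    seeds = [0] * n
    taken = [False] * n
    for b in order:
        bucket = buckets[b]
        if not bucket:
            continue
        seed = 1
        while True:
            slots = [_h(k, seed, n) for k in bucket]
            # Accept this seed only if all slots are distinct and free.
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                for s in slots:
                    taken[s] = True
                seeds[b] = seed
                break
            seed += 1
    return seeds

def lookup_index(key, seeds, n):
    # Two hashes per lookup: find the bucket, then the final slot.
    return _h(key, seeds[_h(key, 0, n)], n)

keys = ["/usr/bin/ls", "/usr/bin/cp", "/usr/bin/mv", "/etc/hosts"]
seeds = build_mph(keys)
print(sorted(lookup_index(k, seeds, len(keys)) for k in keys))  # [0, 1, 2, 3]
```

Real libraries (CHD, BBHash, etc.) refine this with compact seed encodings, but the lookup cost is the same shape: a constant number of hash evaluations plus one array read.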

Feature Based Indexing

Another thought I have would be to look at my input and find bits that differ between items. For instance, file paths may share many of the same characters, especially if they are absolute paths into the same deep folder.

I could find where the bits are that matter and make a tree of objects out of them.

The challenge here, I think, is that I would ideally want a balanced tree, and it might be hard to test all permutations of the bits that actually matter in order to make the tree as balanced as possible.

Ideally, my hope is that the entire tree could go away, and I could instead take the bits that matter and derive some expression like "XOR bit 2 with bit 3 and add bit 5" that yields an index into an array.
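The tree-free version of this idea can be sketched directly: find the bit positions where the keys differ, then search for a small subset of those positions whose values, concatenated, already distinguish every key. That subset is the "equation" in the question, read off as a direct index with no tree at all. This is a brute-force illustration with made-up paths, not a scalable construction.

```python
# Search for a few bit positions that alone separate every key.
from itertools import combinations

def key_bit(key: bytes, pos: int) -> int:
    return (key[pos // 8] >> (pos % 8)) & 1

def discriminating_positions(keys):
    # Only look at bits present in every key, and keep those that vary.
    width = min(len(k) for k in keys) * 8
    return [p for p in range(width)
            if len({key_bit(k, p) for k in keys}) > 1]

def find_index_bits(keys, max_bits=8):
    cand = discriminating_positions(keys)
    for r in range(1, max_bits + 1):
        for subset in combinations(cand, r):
            patterns = {tuple(key_bit(k, p) for p in subset) for k in keys}
            if len(patterns) == len(keys):
                return subset  # these bits alone distinguish all keys
    return None

keys = [b"/home/app/cfg.json", b"/home/app/log.txt",
        b"/home/app/run.sh", b"/home/app/db.bin"]
bits = find_index_bits(keys)
print(bits is not None)  # True: a handful of bit positions suffice
```

Note the catch the question anticipates: the bit patterns distinguish the keys, but they need not be dense in $[0,N)$, so on their own they give a sparse table rather than a minimal one.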

Solution

Here's a way to think about the feature-based approach: you select a set of candidate features, then look for the smallest decision tree that assigns each of the $N$ keys a different index in $[0,N)$ using only those features. This is effectively a way of building a perfect hash, where the hash function is expressed as a decision tree applied to the feature vector constructed from the input.

Building the smallest decision tree is NP-hard in general, but there are heuristic algorithms that do a reasonable job. You might look at the ID3 algorithm, for instance; in many settings it does a pretty good job of building a small (though not necessarily minimal) decision tree. You can compare that to other ways of building perfect hash functions and see which is computationally more efficient for your particular data set. Probably the only way to know is to try it out; I'm not sure there is any theory that will give you a definitive answer, as the answer probably depends on characteristics of your dataset that will be hard to model in a theoretical setting.
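The heuristic the answer describes can be sketched in a few lines: greedily grow a decision tree over the keys' bits, at each node choosing the bit whose split is closest to half/half (for a binary split this is the ID3-style maximum-information-gain choice), and number the leaves $0..N-1$ so the tree acts as a perfect hash into $[0,N)$. This is an illustrative toy, not the ID3 algorithm in full.

```python
# Greedy decision tree as a perfect hash: split on the most balanced
# bit until every leaf holds exactly one key, then number the leaves.
def bit(key: bytes, pos: int) -> int:
    return (key[pos // 8] >> (pos % 8)) & 1

def build_tree(keys):
    counter = [0]  # next leaf index to hand out

    def go(keys):
        if len(keys) == 1:
            idx = counter[0]
            counter[0] += 1
            return ("leaf", idx)
        width = min(len(k) for k in keys) * 8

        def balance(p):
            ones = sum(bit(k, p) for k in keys)
            # Bits that don't split this node at all get the worst score.
            return abs(ones - len(keys) / 2) if 0 < ones < len(keys) else float("inf")

        best = min(range(width), key=balance)
        zeros = [k for k in keys if bit(k, best) == 0]
        ones = [k for k in keys if bit(k, best) == 1]
        return ("node", best, go(zeros), go(ones))

    return go(keys)

def tree_lookup(tree, key):
    while tree[0] == "node":
        _, pos, zero, one = tree
        tree = one if bit(key, pos) else zero
    return tree[1]

keys = [b"alpha", b"beta", b"gamma", b"delta"]
tree = build_tree(keys)
print(sorted(tree_lookup(tree, k) for k in keys))  # [0, 1, 2, 3]
```

A perfectly balanced tree gives $\lceil \log_2 N \rceil$ bit tests per lookup; the greedy choice doesn't guarantee that minimum, which is exactly the gap between this heuristic and the NP-hard optimal tree.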

Context

StackExchange Computer Science Q#63920, answer score: 6

Revisions (0)

No revisions yet.