HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Hash table versus binary search lookup for unchanging data

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
searchversusunchanginghashbinaryforlookupdatatable

Problem

Let's say I have some static, unchanging data (no adds, modifies or deletes) which is looked up by a string value, and that I'm looking to minimize size in memory while also trying to minimize lookup time.

It seems to me that keeping the strings sorted in a list and then doing a binary search is always going to be better than a hash table since the hash table has memory overhead to store info about the buckets, and lookups ought to be slower unless I have a lot of buckets.

Is my reasoning correct? Are there any better data structures or algorithms for this situation?

Implementation details

My specific goal is to have a data structure which associates data entries with strings.

The data structure is meant to be friendly for disk serialization - where i can read the whole file into memory with one allocation and one disk read, and then do "pointer fixup" to make the data structure valid and usable.

This data structure is static (constant) and does not change. It's read only data, so the search for data entry by string function is the only operation that matters for execution time.

The size in memory of the data structure is also important, but is not as important as search time.

Below are the two options I described above, as I see them implemented, but let me know if you see anything that could be improved, or if there is something better I could do that is completely different!

Binary Search

This one is pretty simple. I'm leaving out details like fourcc in header, file versioning and proper syntax for avoiding padding in structs since they are out of scope of the algorithmic questions.

In the file, i would have a uint32 specifying how many entries there were.

Then, I would have the entries $0...n$, which might look like:

struct SEntry
{
  char *string;
  uint32 data; // or whatever data was associated with the string
}


Immediately after the last data entry, the file would then contain the (null terminated) string for the first entry, then th

Solution

Have you considered "perfect hashing"? Basically you chose the hashfunction such that not collisions occur (every bucket has at most one entry). This is usually quite memory efficient, because you simply have a an array of values, no whole memory pages or linked list of objects to manage buckets (if you would do that). The plain array structure also makes it quite fast.

If you are concerned about memory efficiency you may also want to look into critbit-trees / binary tries. They store only the bits of the key that differ from other keys (more or less). This is reasonably fast and also quite memory efficient, especially with long keys, such as uint256 or arbitrary length. An example implementation is here (my code).

I'm not sure whether these structures fit your requirements, but I think they are worth mentioning.

Context

StackExchange Computer Science Q#50360, answer score: 3

Revisions (0)

No revisions yet.