patternMinor

Building static hash table with particular collisions

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

hash-tables cs hash stackoverflow data-structures efficiency

withbuildinghashcollisionsparticulartablestatic

Problem

Is there efficient algorithm to encode keys in hash function with provided collisions?

By efficient I mean with low-ish runtime of lookup operation (taking constants into account) and realistic time of finding such function.

Keys are floating point values - I can compare them for equality as they are raw, unprocessed. Equality holds, and there are no round-offs.

If there is better chance for algorithm on integers - I am still interested.

I did not included specific scheme for building perfect hash - it might be the wrong way or different scheme supports such change.

Let $h(x)$ be hash function (the one to be found)

$X$ be set of sets of float values that $\forall x \in X_i, h(x) = c_i$

$\forall X_i, |X_i| \in [2, 20]$

Let $Y$ be set of float values that there are no collisions (it does not collide with any $X_i$)

For example, I have set $X = \{\{1, 5.5, 24\}, \{7, 20\}\}$ and the set $Y = \{4, 25, 31.2, ...\}$

$h(1) = h(5.5) = h(24) = c_0$

$h(7) = h(20) = c_1$

$h(4) = c_2$

$h(25) = c_3$

$h(31.2) = c_4$

$c_i = c_j \iff i = j$

$X$ set containts colliding tuples, $Y$ contains values that collide with nothing.

$Y$ is about 85-95% of input, $X$ occupies the rest.

If for example $X$ has values in tuples of multiplicities {3, 5, 2, 2, 20, 14, 3}, this are 49 values, but 7 distinct buckets and Y has 451 elements, these are 451 distinct buckets.

In the hashtable there will be at least 458 buckets (this is perfect, and preffered), but might be a bit bigger up to 1%.

All possible inputs are given, $h(x)$ will never execute for value that was not in the input.

Order of the values from input does not matter, $h(x)$ does not need to preserve order, does not need to be monotone.
TLDR; Below are previous attempts, some background etc.

Background

In the initial phase I gather data from the device, this phase ends when for given settings I have collected all possible outputs.

This data is collected as floating point numbers. I calculate outputs based on what

Solution

The easiest way is to construct a static hash table $T$ containing all the collisions, in the following form: for each set of keys $S$ which are supposed to map to the same value, single out some $x \in S$, and put all other $y \in S$ in the table with an entry stating "$x$".

Now take a good hash function $h$, and construct a new one as follows:

On input $x$, check whether $x \in T$.

If $x \notin T$, return $h(x)$.

If $x \in T$, return $h(T[x])$.

If $h$ were a uniformly random function, then this would result in a uniformly random function conditioned on your collisions.

There are some optimizations possible here:

The hash function used to access $T$ can be weak.

The table $T$ might be too large to fit in the cache, so accessing $T$ might be slow. We can add a Bloom filter that does fit in cache to ameliorate this.

Context

StackExchange Computer Science Q#53094, answer score: 7

Revisions (0)

No revisions yet.