HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Redistributing a set of uniformly distributed numbers to an arbitrarily defined shape

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
redistributingarbitrarilyuniformlynumbersdistributeddefinedshapeset

Problem

Lets say I have a random number generator that spits out uniform numbers from 0 to 1

Next, I have a shape defined by a series of vertices, like { [0, 0.4], [0.5, 0.2], [1, 0.4] }

In those vertices, the first value determines the position and second the distribution weight in that point (interpolated in between).

Now I need a magic function to redistribute the uniform random numbers into a distribution that would statistically match the shape of the vertices provided.

Essentially, if I were to run 10000 random numbers through the function, sort them into buckets like { [0 - 0.1], [0.1 - 0.2] ... } and count the buckets so that [x = x, y = count], it should produce a shape similar to if I were to use the original shape as [x, y].

To illustrate

ms paint at its finest.

Anyways, the question - I'm looking for an algorithm to build that magic function.

The shapes that I want to feed it would be based on real world statistics which often don't fit any of the neat looking distributions like gauss or normal, I've seen all kinds of wonky lines.

Solution

Here is a simple algorithm. Suppose that the shape is given by a sequence
$$ (x_0,y_0), \ldots, (x_n,y_n). $$
Define $A_i = (x_i - x_{i-1}) \cdot (y_{i-1} + y_i)$ for $1 \leq i \leq n$, and let $A = \sum_{i=1}^n A_i$, $\alpha_i = A_i/A$, and $\beta_i = \sum_{j=1}^i \alpha_i$.


Sample a value $x$ uniformly on $[0,1]$, and determine $1 \leq i \leq n$ such that $\beta_{i-1} \leq x

This is because the area below the density curve between $x_{i-1}$ and $x_i$ is $(x_i - x_{i-1}) \frac{y_{i-1}+y_i}{2}$.

Consider now the function
$$
\begin{align*}
\varphi_i(t) &= \int_{x_{i-1}}^t y_{i-1} + (y_i-y_{i-1}) \frac{t-x_{i-1}}{x_i-x_{i-1}} \, dt \\ &=
y_{i-1} (t-x_{i-1}) + (y_i - y_{i-1})\frac{(t-x_{i-1})^2}{2(x_i-x_{i-1})}
\end{align*}
$$
and its normalized cousin
$$
F_i(t) = \frac{\varphi_i(t)}{\varphi_i(x_i)} =
\frac{2y_{i-1}}{y_{i-1}+y_i} \frac{t-x_{i-1}}{x_i-x_{i-1}} + \frac{y_i-y_{i-1}}{y_i+y_{i+1}} \left(\frac{t-x_{i-1}}{x_i-x_{i-1}}\right)^2,
$$
which is the CDF of your density function restricted to the interval $[x_{i-1},x_i]$.


Output $F_i^{-1}(c)$.

Since $c \sim U(0,1)$, the resulting element has CDF $F_i$, and so has the correct distribution. It remains to explicitly invert $F_i$. Let $a = \frac{y_i-y_{i-1}}{y_i+y_{i-1}}$, $b = \frac{2y_{i-1}}{y_i+y_{i-1}}$, and $s = \frac{t-x_{i-1}}{x_i-x_{i-1}}$ (note that $0 \leq s \leq 1$). If $F(t) = c$ then $as^2 + bs = c$, and so we can calculate the output $t$ as
$$
s = \frac{-b + \sqrt{b^2+4ac}}{2a}, \qquad
t = x_{i-1} + (x_i - x_{i-1}) s.
$$
The formula for $s$ simplifies to
$$
s = \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}.
$$
We can thus rephrase the second step as


Output
$$ x_{i-1} + (x_i - x_{i-1}) \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}. $$

Another option for the second step is rejection sampling. Let $A$ be the area below the rectangle bounded by $(x_{i-1},0),(x_{i-1},y_{i-1}),(x_i,y_i),(x_i,0)$, and let $R$ be the upper bounding rectangle $(x_{i-1},0),(x_{i-1},y),(x_i,y),(x_i,0)$, where $y = \max(y_{i-1},y_i)$. Generate random points $(x,y)$ in $R$ repeatedly until one of them belongs to $A$, and output its ordinate $x$. Since the area of $A$ is at least half the area of $R$, on average this requires at most two samples.

In the first step you have to determine $i$ such that $\beta_{i-1} \leq x < \beta_i$. While a naive implementation would take time $O(n)$, or $O(\log n)$ using binary search trees, you can use the alias method to reduce sampling time to $O(1)$.

Context

StackExchange Computer Science Q#66133, answer score: 2

Revisions (0)

No revisions yet.