patternMinor
Redistributing a set of uniformly distributed numbers to an arbitrarily defined shape
Viewed 0 times
redistributingarbitrarilyuniformlynumbersdistributeddefinedshapeset
Problem
Lets say I have a random number generator that spits out uniform numbers from 0 to 1
Next, I have a shape defined by a series of vertices, like { [0, 0.4], [0.5, 0.2], [1, 0.4] }
In those vertices, the first value determines the position and second the distribution weight in that point (interpolated in between).
Now I need a magic function to redistribute the uniform random numbers into a distribution that would statistically match the shape of the vertices provided.
Essentially, if I were to run 10000 random numbers through the function, sort them into buckets like { [0 - 0.1], [0.1 - 0.2] ... } and count the buckets so that [x = x, y = count], it should produce a shape similar to if I were to use the original shape as [x, y].
To illustrate
ms paint at its finest.
Anyways, the question - I'm looking for an algorithm to build that magic function.
The shapes that I want to feed it would be based on real world statistics which often don't fit any of the neat looking distributions like gauss or normal, I've seen all kinds of wonky lines.
Next, I have a shape defined by a series of vertices, like { [0, 0.4], [0.5, 0.2], [1, 0.4] }
In those vertices, the first value determines the position and second the distribution weight in that point (interpolated in between).
Now I need a magic function to redistribute the uniform random numbers into a distribution that would statistically match the shape of the vertices provided.
Essentially, if I were to run 10000 random numbers through the function, sort them into buckets like { [0 - 0.1], [0.1 - 0.2] ... } and count the buckets so that [x = x, y = count], it should produce a shape similar to if I were to use the original shape as [x, y].
To illustrate
ms paint at its finest.
Anyways, the question - I'm looking for an algorithm to build that magic function.
The shapes that I want to feed it would be based on real world statistics which often don't fit any of the neat looking distributions like gauss or normal, I've seen all kinds of wonky lines.
Solution
Here is a simple algorithm. Suppose that the shape is given by a sequence
$$ (x_0,y_0), \ldots, (x_n,y_n). $$
Define $A_i = (x_i - x_{i-1}) \cdot (y_{i-1} + y_i)$ for $1 \leq i \leq n$, and let $A = \sum_{i=1}^n A_i$, $\alpha_i = A_i/A$, and $\beta_i = \sum_{j=1}^i \alpha_i$.
Sample a value $x$ uniformly on $[0,1]$, and determine $1 \leq i \leq n$ such that $\beta_{i-1} \leq x
This is because the area below the density curve between $x_{i-1}$ and $x_i$ is $(x_i - x_{i-1}) \frac{y_{i-1}+y_i}{2}$.
Consider now the function
$$
\begin{align*}
\varphi_i(t) &= \int_{x_{i-1}}^t y_{i-1} + (y_i-y_{i-1}) \frac{t-x_{i-1}}{x_i-x_{i-1}} \, dt \\ &=
y_{i-1} (t-x_{i-1}) + (y_i - y_{i-1})\frac{(t-x_{i-1})^2}{2(x_i-x_{i-1})}
\end{align*}
$$
and its normalized cousin
$$
F_i(t) = \frac{\varphi_i(t)}{\varphi_i(x_i)} =
\frac{2y_{i-1}}{y_{i-1}+y_i} \frac{t-x_{i-1}}{x_i-x_{i-1}} + \frac{y_i-y_{i-1}}{y_i+y_{i+1}} \left(\frac{t-x_{i-1}}{x_i-x_{i-1}}\right)^2,
$$
which is the CDF of your density function restricted to the interval $[x_{i-1},x_i]$.
Output $F_i^{-1}(c)$.
Since $c \sim U(0,1)$, the resulting element has CDF $F_i$, and so has the correct distribution. It remains to explicitly invert $F_i$. Let $a = \frac{y_i-y_{i-1}}{y_i+y_{i-1}}$, $b = \frac{2y_{i-1}}{y_i+y_{i-1}}$, and $s = \frac{t-x_{i-1}}{x_i-x_{i-1}}$ (note that $0 \leq s \leq 1$). If $F(t) = c$ then $as^2 + bs = c$, and so we can calculate the output $t$ as
$$
s = \frac{-b + \sqrt{b^2+4ac}}{2a}, \qquad
t = x_{i-1} + (x_i - x_{i-1}) s.
$$
The formula for $s$ simplifies to
$$
s = \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}.
$$
We can thus rephrase the second step as
Output
$$ x_{i-1} + (x_i - x_{i-1}) \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}. $$
Another option for the second step is rejection sampling. Let $A$ be the area below the rectangle bounded by $(x_{i-1},0),(x_{i-1},y_{i-1}),(x_i,y_i),(x_i,0)$, and let $R$ be the upper bounding rectangle $(x_{i-1},0),(x_{i-1},y),(x_i,y),(x_i,0)$, where $y = \max(y_{i-1},y_i)$. Generate random points $(x,y)$ in $R$ repeatedly until one of them belongs to $A$, and output its ordinate $x$. Since the area of $A$ is at least half the area of $R$, on average this requires at most two samples.
In the first step you have to determine $i$ such that $\beta_{i-1} \leq x < \beta_i$. While a naive implementation would take time $O(n)$, or $O(\log n)$ using binary search trees, you can use the alias method to reduce sampling time to $O(1)$.
$$ (x_0,y_0), \ldots, (x_n,y_n). $$
Define $A_i = (x_i - x_{i-1}) \cdot (y_{i-1} + y_i)$ for $1 \leq i \leq n$, and let $A = \sum_{i=1}^n A_i$, $\alpha_i = A_i/A$, and $\beta_i = \sum_{j=1}^i \alpha_i$.
Sample a value $x$ uniformly on $[0,1]$, and determine $1 \leq i \leq n$ such that $\beta_{i-1} \leq x
This is because the area below the density curve between $x_{i-1}$ and $x_i$ is $(x_i - x_{i-1}) \frac{y_{i-1}+y_i}{2}$.
Consider now the function
$$
\begin{align*}
\varphi_i(t) &= \int_{x_{i-1}}^t y_{i-1} + (y_i-y_{i-1}) \frac{t-x_{i-1}}{x_i-x_{i-1}} \, dt \\ &=
y_{i-1} (t-x_{i-1}) + (y_i - y_{i-1})\frac{(t-x_{i-1})^2}{2(x_i-x_{i-1})}
\end{align*}
$$
and its normalized cousin
$$
F_i(t) = \frac{\varphi_i(t)}{\varphi_i(x_i)} =
\frac{2y_{i-1}}{y_{i-1}+y_i} \frac{t-x_{i-1}}{x_i-x_{i-1}} + \frac{y_i-y_{i-1}}{y_i+y_{i+1}} \left(\frac{t-x_{i-1}}{x_i-x_{i-1}}\right)^2,
$$
which is the CDF of your density function restricted to the interval $[x_{i-1},x_i]$.
Output $F_i^{-1}(c)$.
Since $c \sim U(0,1)$, the resulting element has CDF $F_i$, and so has the correct distribution. It remains to explicitly invert $F_i$. Let $a = \frac{y_i-y_{i-1}}{y_i+y_{i-1}}$, $b = \frac{2y_{i-1}}{y_i+y_{i-1}}$, and $s = \frac{t-x_{i-1}}{x_i-x_{i-1}}$ (note that $0 \leq s \leq 1$). If $F(t) = c$ then $as^2 + bs = c$, and so we can calculate the output $t$ as
$$
s = \frac{-b + \sqrt{b^2+4ac}}{2a}, \qquad
t = x_{i-1} + (x_i - x_{i-1}) s.
$$
The formula for $s$ simplifies to
$$
s = \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}.
$$
We can thus rephrase the second step as
Output
$$ x_{i-1} + (x_i - x_{i-1}) \frac{\sqrt{(1-c) y_{i-1}^2 + c y_i^2} - y_{i-1}}{y_i-y_{i-1}}. $$
Another option for the second step is rejection sampling. Let $A$ be the area below the rectangle bounded by $(x_{i-1},0),(x_{i-1},y_{i-1}),(x_i,y_i),(x_i,0)$, and let $R$ be the upper bounding rectangle $(x_{i-1},0),(x_{i-1},y),(x_i,y),(x_i,0)$, where $y = \max(y_{i-1},y_i)$. Generate random points $(x,y)$ in $R$ repeatedly until one of them belongs to $A$, and output its ordinate $x$. Since the area of $A$ is at least half the area of $R$, on average this requires at most two samples.
In the first step you have to determine $i$ such that $\beta_{i-1} \leq x < \beta_i$. While a naive implementation would take time $O(n)$, or $O(\log n)$ using binary search trees, you can use the alias method to reduce sampling time to $O(1)$.
Context
StackExchange Computer Science Q#66133, answer score: 2
Revisions (0)
No revisions yet.