HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Randomly generate a list with predetermined mean

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withmeanrandomlygeneratepredeterminedlist

Problem

Aim: randomly generate \$n\$ numbers between \$a\$ and \$b\$ with (roughly) mean \$m\$.

Problem: The simple beginner's code I wrote becomes very inefficient as \$m\$ moves away from \$\frac{a + b}{2}\$.

import random
import numpy as np
import time

low = 0
high = 100

data = 50

def lst_w_mean(target):
    HT = [x for x in range(low, high+1)]
    target_mean = 0

    while target_mean <> int(target):
        outcomes = [random.choice(HT) for x in range(data)]
        target_mean = np.mean(outcomes)
    return outcomes

t1 = time.clock()
data_new = lst_w_mean(54)
print "elapsed time: ", time.clock() - t1
print data_new

t1 = time.clock()
data_new = lst_w_mean(32)
print "elapsed time: ", time.clock() - t1
print data_new


Works pretty decently for means close to the the mean of HT (e.g., means between between 40 and 60). Takes for ever (without generating a run-time error) beyond that narrow range (roughly 28 minutes for mean =30)

%run 
elapsed time for mean = 54:  0.00734800000009
[46, 82, 6, 0, 90, 73, 99, 2, 34, 88, 51, 14, 89, 72, 40, 8, 97, 44, 46, 45, 89, 10, 22, 52, 96, 98, 43, 52, 59, 74, 52, 54, 64, 66, 21, 71, 92, 34, 76, 33, 26, 36, 53, 74, 64, 85, 59, 26, 69, 24]

elapsed time for mean = 30:  1695.210358
[67, 43, 30, 78, 67, 33, 13, 24, 27, 1, 2, 4, 4, 2, 49, 1, 41, 16, 6, 9, 6, 90, 1, 31, 52, 91, 5, 11, 4, 94, 2, 74, 60, 38, 8, 2, 10, 20, 92, 25, 17, 36, 12, 9, 7, 22, 16, 72, 44, 32]


Reformulating HT with something like:

if target < 40:
 high = 70


etc., so that the target mean is closer to the mean of HT does not generate realistic data besides being an ad hoc solution. And yes, I would like what I call "realistic data" to have a "high" standard deviation perhaps with a number of "outliers".

The process speeds up significantly as the value of data diminishes, so maybe the better approach is to try to use recursion with smaller data values (sort of like in merge-sort)?

Solution

Review

The official Python style-guide recommends using lower_case for variables (your variable HT violates this).

I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.

The main code of a module should be wrapped in a if __name__ == "__main__": guard. This allows you to do in another script:

from random_with_mean import lst_w_mean


without executing the tests.

I would try to find a different name for the function, without too many abbreviations.

Alternative approach

I would use a completely different approach to this.

First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range (0, 100) (similar to your example, but note the off-by-one error), with mean 50, We can just take a np.random.poisson(50) (which I will call P(50) from now on), ensuring to throw away all values which are larger than 100 and redraw them.

This also works when generating values in the range (0, 100) but with mean 10. Only when the mean becomes larger than the half-way point do we start getting into trouble. Suddenly we are starting to loose a larger and larger part of the tail to the upper boundary. To avoid this, we can just swap the problem around and declare the upper boundary to be zero and generate the values as high - P(high - target), so we generate how far the value is away from the boundary.

Similarly, when the lower bound is not zero, we need to shift the values upwards and generate low + P(target - low).

An implementation of this is:

import numpy as np

def lst_w_mean(low, high, num, target):
    swap = target > (high - low) / 2
    out = []
    to_generate = num
    while to_generate:
        if swap:
            x = high - np.random.poisson(high - target, size=to_generate)
            x = x[low <= x]  # x can't be larger than high by construction
        else:
            x = low + np.random.poisson(target - low, size=to_generate)
            x = x[x < high]  # x can't be smaller than low by construction
        out.extend(x)
        to_generate = num - len(out)
    return out

if __name__ == "__main__":
    print np.mean(lst_w_mean(0, 100, 50, 50))
    print np.mean(lst_w_mean(0, 100, 50, 10))
    print np.mean(lst_w_mean(0, 100, 50, 70))


Note that I always generate multiple values (enough to fill the out list if all were inside the boundaries) at the same time, then throw away all values outside of the boundaries and loop until the out list is the right size. Most of the time only one iteration is needed, because it can only go outside of the boundary on the side away from the closer boundary.

Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.

To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:

As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\$, as given as alternative parametrization on wikipedia:

\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$

which only holds when \$\sigma^2 < \mu(1 - \mu)\$.

With this, I ended up with the following code:

from __future__ import division

def lst_beta(low, high, num, mean):
    diff = high - low
    mu = (mean - low) / diff
    sigma = mu * (1 - mu)
    c = 1 / sigma - 1
    a, b = mu * c, (1 - mu) * c
    return low + diff * np.random.beta(a, b, size=num)


Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:

In comparison, this is what @Peilonrayz's algorithm produces for these means:

Edit:

Here is the code to generate the graphs:

import matplotlib.pyplot as plt

#function definitions here
def lst_w_mean(low, high, num, target):
    ...

plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
    plt.subplot(len(means), 1, i)
    l = lst_w_mean(0, 100, 5000, mean)
    plt.hist(l, bins=100, range=(0, 100))

plt.show()

Code Snippets

from random_with_mean import lst_w_mean
import numpy as np


def lst_w_mean(low, high, num, target):
    swap = target > (high - low) / 2
    out = []
    to_generate = num
    while to_generate:
        if swap:
            x = high - np.random.poisson(high - target, size=to_generate)
            x = x[low <= x]  # x can't be larger than high by construction
        else:
            x = low + np.random.poisson(target - low, size=to_generate)
            x = x[x < high]  # x can't be smaller than low by construction
        out.extend(x)
        to_generate = num - len(out)
    return out

if __name__ == "__main__":
    print np.mean(lst_w_mean(0, 100, 50, 50))
    print np.mean(lst_w_mean(0, 100, 50, 10))
    print np.mean(lst_w_mean(0, 100, 50, 70))
from __future__ import division

def lst_beta(low, high, num, mean):
    diff = high - low
    mu = (mean - low) / diff
    sigma = mu * (1 - mu)
    c = 1 / sigma - 1
    a, b = mu * c, (1 - mu) * c
    return low + diff * np.random.beta(a, b, size=num)
import matplotlib.pyplot as plt

#function definitions here
def lst_w_mean(low, high, num, target):
    ...

plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
    plt.subplot(len(means), 1, i)
    l = lst_w_mean(0, 100, 5000, mean)
    plt.hist(l, bins=100, range=(0, 100))

plt.show()

Context

StackExchange Code Review Q#149857, answer score: 5

Revisions (0)

No revisions yet.