snippetpythonMinor
Randomly generate a list with predetermined mean
Viewed 0 times
withmeanrandomlygeneratepredeterminedlist
Problem
Aim: randomly generate \$n\$ numbers between \$a\$ and \$b\$ with (roughly) mean \$m\$.
Problem: The simple beginner's code I wrote becomes very inefficient as \$m\$ moves away from \$\frac{a + b}{2}\$.
Works pretty decently for means close to the the mean of HT (e.g., means between between 40 and 60). Takes for ever (without generating a run-time error) beyond that narrow range (roughly 28 minutes for mean =30)
Reformulating HT with something like:
etc., so that the target mean is closer to the mean of HT does not generate realistic data besides being an ad hoc solution. And yes, I would like what I call "realistic data" to have a "high" standard deviation perhaps with a number of "outliers".
The process speeds up significantly as the value of
Problem: The simple beginner's code I wrote becomes very inefficient as \$m\$ moves away from \$\frac{a + b}{2}\$.
import random
import numpy as np
import time
low = 0
high = 100
data = 50
def lst_w_mean(target):
HT = [x for x in range(low, high+1)]
target_mean = 0
while target_mean <> int(target):
outcomes = [random.choice(HT) for x in range(data)]
target_mean = np.mean(outcomes)
return outcomes
t1 = time.clock()
data_new = lst_w_mean(54)
print "elapsed time: ", time.clock() - t1
print data_new
t1 = time.clock()
data_new = lst_w_mean(32)
print "elapsed time: ", time.clock() - t1
print data_newWorks pretty decently for means close to the the mean of HT (e.g., means between between 40 and 60). Takes for ever (without generating a run-time error) beyond that narrow range (roughly 28 minutes for mean =30)
%run
elapsed time for mean = 54: 0.00734800000009
[46, 82, 6, 0, 90, 73, 99, 2, 34, 88, 51, 14, 89, 72, 40, 8, 97, 44, 46, 45, 89, 10, 22, 52, 96, 98, 43, 52, 59, 74, 52, 54, 64, 66, 21, 71, 92, 34, 76, 33, 26, 36, 53, 74, 64, 85, 59, 26, 69, 24]
elapsed time for mean = 30: 1695.210358
[67, 43, 30, 78, 67, 33, 13, 24, 27, 1, 2, 4, 4, 2, 49, 1, 41, 16, 6, 9, 6, 90, 1, 31, 52, 91, 5, 11, 4, 94, 2, 74, 60, 38, 8, 2, 10, 20, 92, 25, 17, 36, 12, 9, 7, 22, 16, 72, 44, 32]Reformulating HT with something like:
if target < 40:
high = 70etc., so that the target mean is closer to the mean of HT does not generate realistic data besides being an ad hoc solution. And yes, I would like what I call "realistic data" to have a "high" standard deviation perhaps with a number of "outliers".
The process speeds up significantly as the value of
data diminishes, so maybe the better approach is to try to use recursion with smaller data values (sort of like in merge-sort)?Solution
Review
The official Python style-guide recommends using
I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.
The main code of a module should be wrapped in a
without executing the tests.
I would try to find a different name for the function, without too many abbreviations.
Alternative approach
I would use a completely different approach to this.
First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range
This also works when generating values in the range
Similarly, when the lower bound is not zero, we need to shift the values upwards and generate
An implementation of this is:
Note that I always generate multiple values (enough to fill the
Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.
To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:
As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\$, as given as alternative parametrization on wikipedia:
\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$
which only holds when \$\sigma^2 < \mu(1 - \mu)\$.
With this, I ended up with the following code:
Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:
In comparison, this is what @Peilonrayz's algorithm produces for these means:
Edit:
Here is the code to generate the graphs:
The official Python style-guide recommends using
lower_case for variables (your variable HT violates this).I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.
The main code of a module should be wrapped in a
if __name__ == "__main__": guard. This allows you to do in another script:from random_with_mean import lst_w_meanwithout executing the tests.
I would try to find a different name for the function, without too many abbreviations.
Alternative approach
I would use a completely different approach to this.
First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range
(0, 100) (similar to your example, but note the off-by-one error), with mean 50, We can just take a np.random.poisson(50) (which I will call P(50) from now on), ensuring to throw away all values which are larger than 100 and redraw them.This also works when generating values in the range
(0, 100) but with mean 10. Only when the mean becomes larger than the half-way point do we start getting into trouble. Suddenly we are starting to loose a larger and larger part of the tail to the upper boundary. To avoid this, we can just swap the problem around and declare the upper boundary to be zero and generate the values as high - P(high - target), so we generate how far the value is away from the boundary.Similarly, when the lower bound is not zero, we need to shift the values upwards and generate
low + P(target - low).An implementation of this is:
import numpy as np
def lst_w_mean(low, high, num, target):
swap = target > (high - low) / 2
out = []
to_generate = num
while to_generate:
if swap:
x = high - np.random.poisson(high - target, size=to_generate)
x = x[low <= x] # x can't be larger than high by construction
else:
x = low + np.random.poisson(target - low, size=to_generate)
x = x[x < high] # x can't be smaller than low by construction
out.extend(x)
to_generate = num - len(out)
return out
if __name__ == "__main__":
print np.mean(lst_w_mean(0, 100, 50, 50))
print np.mean(lst_w_mean(0, 100, 50, 10))
print np.mean(lst_w_mean(0, 100, 50, 70))Note that I always generate multiple values (enough to fill the
out list if all were inside the boundaries) at the same time, then throw away all values outside of the boundaries and loop until the out list is the right size. Most of the time only one iteration is needed, because it can only go outside of the boundary on the side away from the closer boundary.Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.
To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:
As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\$, as given as alternative parametrization on wikipedia:
\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$
which only holds when \$\sigma^2 < \mu(1 - \mu)\$.
With this, I ended up with the following code:
from __future__ import division
def lst_beta(low, high, num, mean):
diff = high - low
mu = (mean - low) / diff
sigma = mu * (1 - mu)
c = 1 / sigma - 1
a, b = mu * c, (1 - mu) * c
return low + diff * np.random.beta(a, b, size=num)Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:
In comparison, this is what @Peilonrayz's algorithm produces for these means:
Edit:
Here is the code to generate the graphs:
import matplotlib.pyplot as plt
#function definitions here
def lst_w_mean(low, high, num, target):
...
plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
plt.subplot(len(means), 1, i)
l = lst_w_mean(0, 100, 5000, mean)
plt.hist(l, bins=100, range=(0, 100))
plt.show()Code Snippets
from random_with_mean import lst_w_meanimport numpy as np
def lst_w_mean(low, high, num, target):
swap = target > (high - low) / 2
out = []
to_generate = num
while to_generate:
if swap:
x = high - np.random.poisson(high - target, size=to_generate)
x = x[low <= x] # x can't be larger than high by construction
else:
x = low + np.random.poisson(target - low, size=to_generate)
x = x[x < high] # x can't be smaller than low by construction
out.extend(x)
to_generate = num - len(out)
return out
if __name__ == "__main__":
print np.mean(lst_w_mean(0, 100, 50, 50))
print np.mean(lst_w_mean(0, 100, 50, 10))
print np.mean(lst_w_mean(0, 100, 50, 70))from __future__ import division
def lst_beta(low, high, num, mean):
diff = high - low
mu = (mean - low) / diff
sigma = mu * (1 - mu)
c = 1 / sigma - 1
a, b = mu * c, (1 - mu) * c
return low + diff * np.random.beta(a, b, size=num)import matplotlib.pyplot as plt
#function definitions here
def lst_w_mean(low, high, num, target):
...
plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
plt.subplot(len(means), 1, i)
l = lst_w_mean(0, 100, 5000, mean)
plt.hist(l, bins=100, range=(0, 100))
plt.show()Context
StackExchange Code Review Q#149857, answer score: 5
Revisions (0)
No revisions yet.