patternjavaMinor
Sampling from a distribution with given probabilities
Viewed 0 times
withsamplingprobabilitiesdistributionfromgiven
Problem
I have found myself needing to sample numbers from a given distribution and was wondering what the most efficient method of doing so is.
I have a set of
My approach is to create a new array,
This is my code so far (in Java):
I'm mostly unsure about the behavior of the
I have a set of
N elements 0 to N-1 and an array mapping those elements to the probability of their selection, pp.My approach is to create a new array,
cdf, and accumulate the entries of pp so that the first element of cdf is pp[0] and the last element is the sum of all the elements in pp, namely 1. Then, since this array is necessarily sorted, I choose a random number between 0 and 1 and perform a binary search in the array.This is my code so far (in Java):
public class Sample {
/**
* Selects q elements chosen at random according to the probabilities
* given in pp.
*
* @param pp The probabilites corresponding to each index
* (item 0 has probability pp[0] and so forth)
* @param q The number of elements to sample
* @return An array containing the elements sampled
*/
public static int [] sample(double [] pp, int q) {
double [] cdf = pp.clone();
for (int i = 1; i < cdf.length; i++)
cdf[i] += cdf[i - 1];
int [] values = new int[q];
for (int i = 0; i < q; i++) {
// binarySearch returns a negative number at the insertion point
// if the exact number isn't found. We don't expect to hit an
// exact match with Math.random(), so the Math.abs() call
// can't be too expensive.
values[i] = Math.abs(Arrays.binarySearch(cdf, Math.random()));
}
return values;
}
}I'm mostly unsure about the behavior of the
Arrays.binarySearch function. Am I at risk of missing the first/last element, or going out of bounds? The running time of this algorithm is \$O(N)\$ for the cdf construction, plus \$O(q \log N)\$ for the actual sampling. The second operation dominates (\$q\$ could be greater than \$N\$), so this is an \$O(q \log N)\Solution
Just a suggestion:
I think you can use better variable names than
This doesn't matter so much but increases the readability of the program.
I think you can use better variable names than
pp and q.cdf can be clonedArray, and the same for other variable names.This doesn't matter so much but increases the readability of the program.
Context
StackExchange Code Review Q#55409, answer score: 2
Revisions (0)
No revisions yet.