HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Sampling from a distribution with given probabilities

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withsamplingprobabilitiesdistributionfromgiven

Problem

I have found myself needing to sample numbers from a given distribution and was wondering what the most efficient method of doing so is.

I have a set of N elements 0 to N-1 and an array mapping those elements to the probability of their selection, pp.

My approach is to create a new array, cdf, and accumulate the entries of pp so that the first element of cdf is pp[0] and the last element is the sum of all the elements in pp, namely 1. Then, since this array is necessarily sorted, I choose a random number between 0 and 1 and perform a binary search in the array.

This is my code so far (in Java):

public class Sample {
    /**
     * Selects q elements chosen at random according to the probabilities
     * given in pp.
     *
     * @param pp  The probabilites corresponding to each index
     *            (item 0 has probability pp[0] and so forth)
     * @param q   The number of elements to sample
     * @return    An array containing the elements sampled
     */
    public static int [] sample(double [] pp, int q) {
        double [] cdf = pp.clone();
        for (int i = 1; i < cdf.length; i++)
            cdf[i] += cdf[i - 1];

        int [] values = new int[q];
        for (int i = 0; i < q; i++) {
            // binarySearch returns a negative number at the insertion point
            // if the exact number isn't found. We don't expect to hit an
            // exact match with Math.random(), so the Math.abs() call
            // can't be too expensive.
            values[i] = Math.abs(Arrays.binarySearch(cdf, Math.random()));
        }

        return values;
    }
}


I'm mostly unsure about the behavior of the Arrays.binarySearch function. Am I at risk of missing the first/last element, or going out of bounds? The running time of this algorithm is \$O(N)\$ for the cdf construction, plus \$O(q \log N)\$ for the actual sampling. The second operation dominates (\$q\$ could be greater than \$N\$), so this is an \$O(q \log N)\

Solution

Just a suggestion:

I think you can use better variable names than pp and q.

cdf can be clonedArray, and the same for other variable names.

This doesn't matter so much but increases the readability of the program.

Context

StackExchange Code Review Q#55409, answer score: 2

Revisions (0)

No revisions yet.