HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Randomly sampling a population and keeping means: tidy up, generalize, document?

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
keepingsamplinggeneralizemeansrandomlytidydocumentpopulationand

Problem

This is part from an answer to a Stack Overflow question. The OP needed a way to perform calculations on samples from a population, but was hitting memory errors due to keeping samples in memory.

The function is based on part of random.sample, but only the code branch using a set is present.

If we can tidy and comment this well enough, it might be worth publishing as a recipe at the Python Cookbook.

import random

def sampling_mean(population, k, times):
    # Part of this is lifted straight from random.py
    _int = int
    _random = random.random

    n = len(population)
    kf = float(k)
    result = []

    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"

    for t in xrange(times):
        selected = set()
        sum_ = 0
        selected_add = selected.add

        for i in xrange(k):
            j = _int(_random() * n)
            while j in selected:
                j = _int(_random() * n)
            selected_add(j)
            sum_ += population[j]

        # Partial result we're interested in
        mean = sum_/kf
        result.append(mean)
    return result

sampling_mean(x, 1000000, 100)


Maybe it'd be interesting to generalize it so you can pass a function that calculates the value you're interested in from the sample?

Solution

Making a generator version of random.sample() seems to be a much better idea:

from __future__ import division
from random import random
from math import ceil as _ceil, log as _log

def xsample(population, k):
    """A generator version of random.sample"""
    n = len(population)
    if not 0  5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            yield pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                yield population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            for x in sample(tuple(population), k):
                yield x


Taking a sampling mean then becomes trivial:

def sampling_mean(population, k, times):
    for t in xrange(times):
        yield sum(xsample(population, k))/k


That said, as a code review, not much can be said about your code as it is more or less taking directly from the Python source, which can be said to be authoritative. ;) It does have a lot of silly speed-ups that make the code harder to read.

Code Snippets

from __future__ import division
from random import random
from math import ceil as _ceil, log as _log

def xsample(population, k):
    """A generator version of random.sample"""
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"
    _int = int
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            yield pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                yield population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            for x in sample(tuple(population), k):
                yield x
def sampling_mean(population, k, times):
    for t in xrange(times):
        yield sum(xsample(population, k))/k

Context

StackExchange Code Review Q#217, answer score: 2

Revisions (0)

No revisions yet.