HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Is this code an efficient implementation of Reservoir Sampling?

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thissamplingefficientreservoircodeimplementation

Problem

I posted this an answer this to a question about drawing a random line from a file too large to put into memory. I hacked the below code together. In essence, this is what Reservoir Sampling does, in pseudo-code:

Scan over the 'tape'
    Put the first 'n' samples in a reservoir(of size n)
    // samples are lines of a document, numbers, whatever
    After the first 'n' :
        Pick a random number between 1 and NumberOfLinesCounted
        if the number is between 1 and n
            replace an existing line with a 1/n chance


Here is the code designed to scan over a document, with said sampling method:

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class reservoirSampling {

    public static void main(String[] args) throws FileNotFoundException, IOException{
        Sampler mySampler = new Sampler();
        List myList = mySampler.sampler(10);
        for(int index = 0;index sampler (int reservoirSize) throws FileNotFoundException, IOException
    {
        String currentLine=null;
        //reservoirList is where our selected lines stored
        List  reservoirList= new ArrayList(reservoirSize); 
        // we will use this counter to count the current line number while iterating
        int count=0; 

        Random ra = new Random();
        int randomNumber = 0;
        Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
        while (sc.hasNext())
        {
            currentLine = sc.next();
            count ++;
            if (count<=reservoirSize)
            {
                reservoirList.add(currentLine);
            }
            else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
            {
                reservoirList.set(randomNumber, currentLine);
            }
        }
        return reservoirList;
    }
}


I'm using something very similar on a project I am on at the moment (i.e. the code is close enough I can transfer any changes ea

Solution

Reviving an old, but really interesting post, that did not get a great opportunity for review the first time around....
What you've done well:

This code, for the most part, is excellent. As for your specific questions:

Is this:

-
efficient

Mostly, yes. The inefficiencies in here are all related to the IO, and the Scanner, not your code. A more low-level API may net you some performance benefits, but not likely. You are also (re)using the Random class well, and there should not be a problem there.

-
actually reservoir sampling (with equal odds of any line being drawn)?

Yes. This is a good implementation of the reservoir algorithm. The random arguments are good, and the substitutions should be fine.

What's concerning

-
The most significant beef is that you do not close the Scanner. Scanners can hold on to open file-handles, and, if the file really is big, this may have system-wide consequences. A try-catch-finally block would be appropriate (or, a try-with-resources if you're on Java7 now).

-
A second (smaller) issue is that reservoir sampling is designed to handle huge data in a streaming way. I know it is perhaps unrealistic to consider in your use-case, but multi-gigabyte files are not unexpected, and the likelihood of overflowing the count variable is not unrealistic. That would cause problems....

-
You should consider converting count and randomNumber to be long values, and only convert it back to an int if it is less-than reservoirSize

Finally....

Nice code to review. Thanks. Obviously, the new language features in more recent Java versions may be interesting to consider....

Context

StackExchange Code Review Q#16154, answer score: 7

Revisions (0)

No revisions yet.