patternjavaMinor
Is this code an efficient implementation of Reservoir Sampling?
Viewed 0 times
thissamplingefficientreservoircodeimplementation
Problem
I posted this an answer this to a question about drawing a random line from a file too large to put into memory. I hacked the below code together. In essence, this is what Reservoir Sampling does, in pseudo-code:
Here is the code designed to scan over a document, with said sampling method:
I'm using something very similar on a project I am on at the moment (i.e. the code is close enough I can transfer any changes ea
Scan over the 'tape'
Put the first 'n' samples in a reservoir(of size n)
// samples are lines of a document, numbers, whatever
After the first 'n' :
Pick a random number between 1 and NumberOfLinesCounted
if the number is between 1 and n
replace an existing line with a 1/n chanceHere is the code designed to scan over a document, with said sampling method:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
public class reservoirSampling {
public static void main(String[] args) throws FileNotFoundException, IOException{
Sampler mySampler = new Sampler();
List myList = mySampler.sampler(10);
for(int index = 0;index sampler (int reservoirSize) throws FileNotFoundException, IOException
{
String currentLine=null;
//reservoirList is where our selected lines stored
List reservoirList= new ArrayList(reservoirSize);
// we will use this counter to count the current line number while iterating
int count=0;
Random ra = new Random();
int randomNumber = 0;
Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
while (sc.hasNext())
{
currentLine = sc.next();
count ++;
if (count<=reservoirSize)
{
reservoirList.add(currentLine);
}
else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
{
reservoirList.set(randomNumber, currentLine);
}
}
return reservoirList;
}
}I'm using something very similar on a project I am on at the moment (i.e. the code is close enough I can transfer any changes ea
Solution
Reviving an old, but really interesting post, that did not get a great opportunity for review the first time around....
What you've done well:
This code, for the most part, is excellent. As for your specific questions:
Is this:
-
efficient
Mostly, yes. The inefficiencies in here are all related to the IO, and the Scanner, not your code. A more low-level API may net you some performance benefits, but not likely. You are also (re)using the Random class well, and there should not be a problem there.
-
actually reservoir sampling (with equal odds of any line being drawn)?
Yes. This is a good implementation of the reservoir algorithm. The random arguments are good, and the substitutions should be fine.
What's concerning
-
The most significant beef is that you do not close the
-
A second (smaller) issue is that reservoir sampling is designed to handle huge data in a streaming way. I know it is perhaps unrealistic to consider in your use-case, but multi-gigabyte files are not unexpected, and the likelihood of overflowing the
-
You should consider converting
Finally....
Nice code to review. Thanks. Obviously, the new language features in more recent Java versions may be interesting to consider....
What you've done well:
This code, for the most part, is excellent. As for your specific questions:
Is this:
-
efficient
Mostly, yes. The inefficiencies in here are all related to the IO, and the Scanner, not your code. A more low-level API may net you some performance benefits, but not likely. You are also (re)using the Random class well, and there should not be a problem there.
-
actually reservoir sampling (with equal odds of any line being drawn)?
Yes. This is a good implementation of the reservoir algorithm. The random arguments are good, and the substitutions should be fine.
What's concerning
-
The most significant beef is that you do not close the
Scanner. Scanners can hold on to open file-handles, and, if the file really is big, this may have system-wide consequences. A try-catch-finally block would be appropriate (or, a try-with-resources if you're on Java7 now).-
A second (smaller) issue is that reservoir sampling is designed to handle huge data in a streaming way. I know it is perhaps unrealistic to consider in your use-case, but multi-gigabyte files are not unexpected, and the likelihood of overflowing the
count variable is not unrealistic. That would cause problems....-
You should consider converting
count and randomNumber to be long values, and only convert it back to an int if it is less-than reservoirSizeFinally....
Nice code to review. Thanks. Obviously, the new language features in more recent Java versions may be interesting to consider....
Context
StackExchange Code Review Q#16154, answer score: 7
Revisions (0)
No revisions yet.