Single pass algorithm for finding the topX percent of items
Problem
I'm looking for a single-pass algorithm for finding the top X percent of floats in a stream where I do not know the total number ahead of time, but it is on the order of 5-30 million floats. It needs to be single-pass since the data is generated on the fly and I cannot recreate the exact stream a second time.
The algorithm I have so far is to keep a sorted list of the top X items that I've seen so far. As the stream continues I enlarge the list as needed. Then I use bisect_left to find the insertion point if needed.
Below is the algorithm I have so far:
```python
from bisect import bisect_left
from random import uniform
from itertools import islice

def data_gen(num):
    for _ in xrange(num):
        yield uniform(0,1)

def get_top_X_percent(iterable, percent = 0.01, min_guess = 1000):
    top_nums = sorted(list(islice(iterable, int(percent*min_guess)))) #get an initial guess
    for ind, val in enumerate(iterable, len(top_nums)):
        if int(percent*ind) > len(top_nums):
            top_nums.insert(0,None)
        newind = bisect_left(top_nums, val)
        if newind > 0:
            top_nums.insert(newind, val)
            top_nums.pop(0)
    return top_nums

if __name__ == '__main__':
    num = 1000000
    all_data = sorted(data_gen(num))
    result = get_top_X_percent(all_data)
    assert result[0] == all_data[-int(num*0.01)], 'Too far off, lowest num:%f' % result[0]
    print result[0]
```
In the real case the data does not come from any standard distribution (otherwise I could use some statistics knowledge).
Any suggestions would be appreciated.
Solution
```python
top_nums = sorted(list(islice(iterable, int(percent*min_guess)))) #get an initial guess
```
There is no reason to make a list out of it before you sort it.
```python
for ind, val in enumerate(iterable, len(top_nums))
```
I dislike abbreviations. I think it makes it harder to figure out what ind and val are doing.
```python
all_data = sorted(data_gen(num))
```
Why are you sorting your test data?
As I understand your problem, your code is wrong. It only works in your test case because you sort the incoming data.
Your algorithm regularly increases the size of the list of values. But when it does so, numbers that were thrown away earlier may be greater than the value you insert at that point. As a result, you cannot be sure you've actually ended up with the top 1%.
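To make that failure concrete, here is a small sketch rather than the original code verbatim: a Python 3 port of the question's function run on a strictly descending stream. Two things are assumptions of this port: `float("-inf")` stands in for the `None` placeholder (Python 3 cannot order `None` against floats), and the input is wrapped in `iter()` so it is consumed as a true stream. The name `get_top_x_percent_original` is made up for illustration.

```python
from bisect import bisect_left
from itertools import islice


def get_top_x_percent_original(iterable, percent=0.01, min_guess=1000):
    """Python 3 port of the question's approach (not a corrected version)."""
    iterator = iter(iterable)  # assumption: consume the input as a one-shot stream
    top_nums = sorted(islice(iterator, int(percent * min_guess)))
    for ind, val in enumerate(iterator, len(top_nums)):
        if int(percent * ind) > len(top_nums):
            # Grow the list; -inf replaces the original None placeholder.
            top_nums.insert(0, float("-inf"))
        newind = bisect_left(top_nums, val)
        if newind > 0:
            top_nums.insert(newind, val)
            top_nums.pop(0)
    return top_nums


# Descending stream: 99.0, 98.0, ..., 0.0. The true top 10% is 90.0..99.0.
stream = [float(v) for v in range(99, -1, -1)]
result = get_top_x_percent_original(stream, percent=0.1, min_guess=10)
expected = sorted(stream)[-10:]

# 98.0 arrives while the list has room for one item only, so it is dropped
# and can never be recovered when the list grows later.
print(98.0 in result, result == expected)
```

On unsorted input the list grows only after larger values have already been discarded, which is why the sorted test data in the question hides the problem.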
How should you fix it? If you can put an upper bound on the size of your input, then you can start with a list of sufficient size and then scale back at the end. Otherwise I don't think you can do it. The problem is that you cannot throw away any values, because there is no way to be sure you won't need them later.
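A minimal sketch of the upper-bound idea, written for Python 3. The function name `top_fraction` and the `max_items` parameter are hypothetical, and holding the candidates in a `heapq` min-heap is my choice of container, not something the question prescribes. It keeps enough values for the largest possible stream, then trims to the real percentage once the true count is known.

```python
import heapq
import math
import random


def top_fraction(stream, fraction, max_items):
    """Return the largest `fraction` of `stream`, sorted descending.

    `max_items` is an assumed upper bound on the stream length.
    """
    capacity = max(1, math.ceil(fraction * max_items))
    heap = []  # min-heap: heap[0] is the smallest value currently kept
    count = 0
    for value in stream:
        count += 1
        if len(heap) < capacity:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Replace the smallest kept value with the new, larger one.
            heapq.heapreplace(heap, value)
    if count > max_items:
        raise ValueError("stream was longer than the assumed upper bound")
    keep = int(fraction * count)  # scale back to the actual stream length
    return heapq.nlargest(keep, heap)


random.seed(0)
data = [random.random() for _ in range(10_000)]
top = top_fraction(iter(data), fraction=0.01, max_items=50_000)
print(len(top), top[-1])  # number kept and the cutoff value
```

With a bound of 30 million items and 1%, this holds at most 300,000 floats at any time, regardless of the order in which values arrive.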
You might consider using a heap. Python has a heapq module, including a function heapq.nlargest which does pretty much what you are doing (but uses a count rather than a percentage). A heap is pretty much a semi-sorted list and lets you do things like find/remove/replace the lowest value without the overhead of actually sorting.
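As a quick illustration of those `heapq` operations, the following Python 3 sketch uses a seeded variant of the question's `data_gen` (the `seed` parameter is added here only to make the run repeatable) and asks for a fixed count of 1000 values.

```python
import heapq
import random


def data_gen(num, seed=0):
    """Yield `num` uniform floats; seeded so repeated runs match."""
    rng = random.Random(seed)
    for _ in range(num):
        yield rng.uniform(0, 1)


# nlargest consumes the generator in one pass and returns the values
# sorted from largest to smallest.
count = 1000
largest = heapq.nlargest(count, data_gen(100_000))
cutoff = largest[-1]  # smallest value that still made the cut

# The basic heap operations mentioned above, on a tiny example.
heap = [5.0, 1.0, 3.0]
heapq.heapify(heap)                    # arrange the list as a min-heap in place
lowest = heap[0]                       # peek at the lowest value: 1.0
popped = heapq.heapreplace(heap, 4.0)  # remove the lowest, then push 4.0
```

Note that `heapq.nlargest` needs the count up front, so on its own it does not solve the unknown-length case.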
Code Snippets
```python
top_nums = sorted(list(islice(iterable, int(percent*min_guess)))) #get an initial guess
```
```python
for ind, val in enumerate(iterable, len(top_nums))
```
```python
all_data = sorted(data_gen(num))
```
Context
StackExchange Code Review Q#3429, answer score: 5