snippetMinor
What is the optimal strategy for filtering a large collection of items with multiple filter functions?
Viewed 0 times
theoptimalwhatwithcollectionlargeitemsformultiplestrategy
Problem
I have a large collection of items, and a list of independent filters (boolean functions). I want to find the collection of items that pass all of my filters as quickly as possible. This must involve looping over every item in the collection and applying each filter to each item. An item will be rejected early if any one of the filters fails during this process.
Each filter has a runtime that varies significantly. And, we know a-priori what percent of the collection each filter will filter out. Given this, how do I find the ordering I should apply my filters to each item with the fasted expected run-time overall?
For example, filter A runs in 5 ms and filters out 50% of the collection on average. Filter B runs in 1 ms and filters out 25% of the collection on average. From this, we know that ordering A,B gives 5 + 0.5 1 = 5.5 ms average runtime, and B,A gives 1 + 0.75 5 = 4.75 ms average runtime. So B,A is our fastest ordering.
I think this admits a dynamic programming solution (since fastest ordering is runtime of first filter + (1 - filter fraction) * (optimal ordering of the rest), but I was wondering if this problem has a name and has been studied before? Could somebody point me to any papers?
EDIT: also, I forgot that the probability of filtering and the run-times will be correlated between filters, so my analysis above is wrong and the problem is harder. I think I can solve this with enough time but I want to catch up on prior-work before I start, if there is any (it seems like it could be a common problem). I haven't had any luck finding anything so far though.
Each filter has a runtime that varies significantly. And, we know a-priori what percent of the collection each filter will filter out. Given this, how do I find the ordering I should apply my filters to each item with the fasted expected run-time overall?
For example, filter A runs in 5 ms and filters out 50% of the collection on average. Filter B runs in 1 ms and filters out 25% of the collection on average. From this, we know that ordering A,B gives 5 + 0.5 1 = 5.5 ms average runtime, and B,A gives 1 + 0.75 5 = 4.75 ms average runtime. So B,A is our fastest ordering.
I think this admits a dynamic programming solution (since fastest ordering is runtime of first filter + (1 - filter fraction) * (optimal ordering of the rest), but I was wondering if this problem has a name and has been studied before? Could somebody point me to any papers?
EDIT: also, I forgot that the probability of filtering and the run-times will be correlated between filters, so my analysis above is wrong and the problem is harder. I think I can solve this with enough time but I want to catch up on prior-work before I start, if there is any (it seems like it could be a common problem). I haven't had any luck finding anything so far though.
Solution
Independent filters
If the filters are independent, a greedy algorithm can be used to find the optimal order. Sort the filters by increasing value of $t/p$, where $t$ is the time to run the filter and $p$ is the fraction of items that are filtered out by the filter. This gives the optimal order.
To give some intuition for why this might be a plausible answer: ideally, we'd prefer filters that take less time; and ideally, we'd prefer filters that filter out a higher fraction of items. Intuitively, you can see how $t/p$ somehow incorporates both of those.
A formal proof of optimality involves an exchange argument. I won't detail the proof here, as it can be derived easily using standard techniques. For a detailed writeup of the proof, see Determine what is the best order for running filters on a dataset.
Correlated filters
If there can be correlations between the filters (i.e., they are not independent), the optimal order cannot be determined from the information provided without knowledge of the correlations.
Here's an example, based on making a small modification to your example. Suppose we have two filters, filter A and filter B. Suppose filter A runs in 4.5ms and filters out 50% on average; filter B runs in 1ms and filters out 25% on average. Also suppose that anything filtered out by filter B is also filtered out by filter A. Then the optimal order is to just do filter A. Your formulas give A,B: $4.5+0.5 \times 1 = 5.0$ and B,A: $1 + 0.75 \times 4.5 = 4.375$, so your formulas might make B,A look like the best choice -- but in fact just doing A alone is even better.
Unfortunately, in practice I would expect that filters might often be correlated. To deal with this, you'd need to know some information about the correlation structure. If you don't have any information on the correlation structure, one possible approach is to use the greedy strategy, recognizing that it might not be optimal for your particular situation. Another possible approach is to gather data on the correlation structure and then use that additional information to try to find a better order, though this may not admit as clean of an answer.
If the filters are independent, a greedy algorithm can be used to find the optimal order. Sort the filters by increasing value of $t/p$, where $t$ is the time to run the filter and $p$ is the fraction of items that are filtered out by the filter. This gives the optimal order.
To give some intuition for why this might be a plausible answer: ideally, we'd prefer filters that take less time; and ideally, we'd prefer filters that filter out a higher fraction of items. Intuitively, you can see how $t/p$ somehow incorporates both of those.
A formal proof of optimality involves an exchange argument. I won't detail the proof here, as it can be derived easily using standard techniques. For a detailed writeup of the proof, see Determine what is the best order for running filters on a dataset.
Correlated filters
If there can be correlations between the filters (i.e., they are not independent), the optimal order cannot be determined from the information provided without knowledge of the correlations.
Here's an example, based on making a small modification to your example. Suppose we have two filters, filter A and filter B. Suppose filter A runs in 4.5ms and filters out 50% on average; filter B runs in 1ms and filters out 25% on average. Also suppose that anything filtered out by filter B is also filtered out by filter A. Then the optimal order is to just do filter A. Your formulas give A,B: $4.5+0.5 \times 1 = 5.0$ and B,A: $1 + 0.75 \times 4.5 = 4.375$, so your formulas might make B,A look like the best choice -- but in fact just doing A alone is even better.
Unfortunately, in practice I would expect that filters might often be correlated. To deal with this, you'd need to know some information about the correlation structure. If you don't have any information on the correlation structure, one possible approach is to use the greedy strategy, recognizing that it might not be optimal for your particular situation. Another possible approach is to gather data on the correlation structure and then use that additional information to try to find a better order, though this may not admit as clean of an answer.
Context
StackExchange Computer Science Q#35502, answer score: 4
Revisions (0)
No revisions yet.