patternpythonMinor
Keep items based on list membership
Viewed 0 times
membershipkeepitemsbasedlist
Problem
I have list of tuples like so:
I need to only keep the tuples that contain:
Which I am currently doing using this statement:
This works but it is too slow for me since the actual
lst = [(106, 210, 108, 134, 134),
(106, 210, 108, 134, 210),
(106, 210, 108, 168, 268),
(106, 210, 108, 168, 671),
...]I need to only keep the tuples that contain:
keep = (106, 210, 108, 168)Which I am currently doing using this statement:
kept = [item for item in lst if set(keep) < set(item)]This works but it is too slow for me since the actual
list I am working with contains 2.6 million items. It takes about 1.5 seconds now. Is there a good way to significantly speed this up?Solution
Your code as it stand is on the border of being off-topic as it is a stub, however since the following actually works and seems to do the job, I'm going to answer it:
Firstly some comments on the code:
Alternate solution
You're asking for a faster solution, so let us analyze what is happening in your code currently:
The operations in here is \$O(N)\$ where N is the number in the list, and that you can't beat. The cost for each element is the creation of one or two sets, and the comparison of those sets, which is dependent on the size of the sets \$O(M)\$. In general since the \$N << M\$, the loop over the elements should be prominent.
This means, without changing the data structures there is not a whole lot to be gained, as you do need to loop through all elements, and you need to verify membership agains the
Another view on your solution is the readability of your code, and how to understand what is happening. The list comprehension is understandable, but the
This code expands the list comprehension into the double
The trick to this solution is that if the inner
Whilst I was writing this answer, the answer with using the
This was tested using the original
lst = [(106, 210, 108, 134, 134),
(106, 210, 108, 134, 210),
(106, 210, 108, 168, 268),
(106, 210, 108, 168, 671)]
keep = (106, 210, 108, 168)
kept = [item for item in lst if set(keep) < set(item)]
print keptFirstly some comments on the code:
- Please provide a fully working example – In general, if there in the question is not a fully working example code, it'll be shut down immediately. Please provide something like the above code segment.
- Name and comment – Providing good names and comment to illustrate what the code does is vital. Using anonymous name like
lst,item, and no comments, makes the code a lot harder to read.
- Do you keep
lstin memory? – In the text you state thatlsthas over 2.6 million entries, do you keep all of this in memory? That could possibly effect your running times severly depending on your hardware resources. You might want to look into some algorithm leaving it in file or in a database, or whatever suits your needs.
Alternate solution
You're asking for a faster solution, so let us analyze what is happening in your code currently:
- You loop through the entire in-memory
lstonce, and create a new set for each item
- This
set(item)is then compared to repeatedly createdset(keep)(could be optimised away by the compiler), to check which is lesser
- If lesser, the list comprehension keeps the
item
The operations in here is \$O(N)\$ where N is the number in the list, and that you can't beat. The cost for each element is the creation of one or two sets, and the comparison of those sets, which is dependent on the size of the sets \$O(M)\$. In general since the \$N << M\$, the loop over the elements should be prominent.
This means, without changing the data structures there is not a whole lot to be gained, as you do need to loop through all elements, and you need to verify membership agains the
keep list. If any optimization is to be performed it needs to address the actual comparison somehow.Another view on your solution is the readability of your code, and how to understand what is happening. The list comprehension is understandable, but the
set comparison is not obvious to me, at least, and I would have liked a comment. Or a rewrite, so let us attempt an rewrite for readability and see how it performs. This rewrite is using a Python specific concept to avoid a flag variable:keepers = []
for candidate in lst:
for keeper in keep:
if not keeper in candidate:
break
else:
keepers.append(candidate)This code expands the list comprehension into the double
for loop which is hidden in the list comprehension. The inner for loop breaks up the keep list into each element, and tests for memberships. The criteria for the candidate to be kept is that all elements should be members, so if any member is not present, we break out and start checking the next candidate.The trick to this solution is that if the inner
for loop doesn't complete naturally (aka no breaks), the else part is not executed. Try the following and play with it, to understand this mechanism:for i in range(5):
if i == 3:
break
else:
print "This didn't end naturally" # Not executed
for i in range(5):
if i == 7:
break
else:
print "The loop finished without breaking" # ExecutesWhilst I was writing this answer, the answer with using the
all concept came in, so I included that in some basic timeit tests to check for running times, and the result surprised me a little:Original method: 2.62324810028
Using all(): 3.84744811058
Double for loop: 1.84868502617This was tested using the original
lst duplicated a few times to get a slightly larger set, but it does show that using the double for loop like I did, is currently the faster solution, and it runs at about 33% faster than the original code, and the solution using all() is actually quite a bit slower.Code Snippets
lst = [(106, 210, 108, 134, 134),
(106, 210, 108, 134, 210),
(106, 210, 108, 168, 268),
(106, 210, 108, 168, 671)]
keep = (106, 210, 108, 168)
kept = [item for item in lst if set(keep) < set(item)]
print keptkeepers = []
for candidate in lst:
for keeper in keep:
if not keeper in candidate:
break
else:
keepers.append(candidate)for i in range(5):
if i == 3:
break
else:
print "This didn't end naturally" # Not executed
for i in range(5):
if i == 7:
break
else:
print "The loop finished without breaking" # ExecutesOriginal method: 2.62324810028
Using all(): 3.84744811058
Double for loop: 1.84868502617Context
StackExchange Code Review Q#162109, answer score: 7
Revisions (0)
No revisions yet.