patternpythonMinor
Duplicate-preserving collection intersection
Viewed 0 times
duplicatepreservingcollectionintersection
Problem
Python
Specifically among the others,
This is endlessly handy, but the huge downside of
What to do? Why, spin my own, of course!
How it works, because it might be a little cryptic:
Used as above:
Success!
How can I improve it? Is there a faster or more efficient way, or perhaps a less flimsy algorithm?
Assertions that should pass:
An inconsistency I don't understand, but which is definitely a bug:
I don't understand this in the slightest.
I wrote this
sets are magical things. Mutable, deduplicating data stores with handy operator overloads for &, ^ and |.Specifically among the others,
sets overload the & operator to do mutual membership tests like so:>>> set("aasd") & set("aabc")
{'a'}This is endlessly handy, but the huge downside of
sets is they do not allow duplicate entries because they're hash tables, so for instance a fuzzy-finder will be a little too fuzzy.What to do? Why, spin my own, of course!
def membertester(a, b):
"""a slow (iterative) dupe-preserving sorting subset tester,
for mutual membership tests (like set overloads &)"""
c = []
a, b = sorted(list(a)), sorted(list(b))
(l, maxl, s, minl) = {
True: (a, len(a), b, len(b)),
False: (b, len(b), a, len(a))
}[ len(a) >= len(b) ]
for i in range(maxl):
try:
if s.count(l[i]) >= l.count(l[i]):
c.append(a[i])
except IndexError:
pass
return cHow it works, because it might be a little cryptic:
- Retrieve the longer and shorter of the two sorted lists, and their lengths, using a switch statement.
- For each item in the longer range, take note of the item if the number of occurences of the same item in the short list is greater than or equal to the number of occurences of the item in the longer list.
Used as above:
>>> membertest("aasd", "aabc")
['a', 'a']Success!
How can I improve it? Is there a faster or more efficient way, or perhaps a less flimsy algorithm?
Assertions that should pass:
assert membertest("a" * 6, '') == []
assert membertest("abdeg", "rrertabasdddhjkdeg") == ['a', 'b', 'd', 'e', 'g']
assert membertest("abddeg", "rrertabasdddhjkdeg") == ['a', 'b', 'd', 'd', 'e', 'g']An inconsistency I don't understand, but which is definitely a bug:
>>> is_subset('10', '122')
['0']
>>> is_subset("ab", "acc")
['a']I don't understand this in the slightest.
I wrote this
Solution
A few quick comments:
-
As written, the code doesn’t always give the correct answer. Here are some cases I found that give what appear to be incorrect results:
If I’ve understood the intention of the program correctly, those are all wrong. If they’re correct, you need to do a better job of explaining the program.
-
The way you’ve written it is a bit fiddly – some comments interspersed with the program would make it much easier to read. That, and choosing better variable names – characters are cheap, don’t just use single letters.
For example, how about
-
The construction for assigning
-
You could assign the two remaining variables in a neater way:
or as an if statement if you wanted to be really explicit.
After a bit of tidying, this is what my version of the function looks like (note that it still has the same bugs as above):
Sidebar: how did I find those examples?
I’m not psychic, and I’m not good at spotting obscure bugs. Instead, I used Hypothesis, which is a powerful randomised, property-based testing library. Here’s the property I used:
Every mutual member of two lists must be a member of each list.
Sounds reasonable, right?
And here’s how I expressed that as a test:
Hypothesis tries hundreds of randomised examples, and checks whether this property holds – it found a bug on the second run. I highly recommend trying it.
Sidebar 2: what causes that bug?
Notice that in all of the examples, we have
with
With that in mind, you can make the code much cleaner:
I’ve been unable to find any bugs in this implementation.
-
As written, the code doesn’t always give the correct answer. Here are some cases I found that give what appear to be incorrect results:
>>> membertester('00000112', '000000011')
['2']
>>> print membertester('0001', '000/2')
['0', '0', '1']
>>> print membertester('002', '001/')
['0', '2']
>>> print membertester('10', '122')
['0']If I’ve understood the intention of the program correctly, those are all wrong. If they’re correct, you need to do a better job of explaining the program.
-
The way you’ve written it is a bit fiddly – some comments interspersed with the program would make it much easier to read. That, and choosing better variable names – characters are cheap, don’t just use single letters.
For example, how about
longer and shorter instead of l and s?-
The construction for assigning
(l, maxl, s, minl) is very weird, and it took me several reads to make sure I’d understood it correctly. This is too clever – I’d break it up into multiple statements. A few ways to do simplify it:- You don’t use the
minlvariable, so get rid of it. (If you meant to use it, is that a bug?)
- The for loop can be replaced by
for i, item in enumerate(longer), doing away with the need for themaxlvariable.
-
You could assign the two remaining variables in a neater way:
shorter, longer = sorted([a, b], key=len)or as an if statement if you wanted to be really explicit.
After a bit of tidying, this is what my version of the function looks like (note that it still has the same bugs as above):
def membertester(list_a, list_b):
"""a slow (iterative) dupe-preserving sorting subset tester,
for mutual membership tests (like set overloads &)"""
common = []
shorter, longer = sorted([list_a, list_b], key=len)
for idx, item in enumerate(longer):
try:
if shorter.count(item) >= longer.count(item):
common.append(list_a[idx])
except IndexError:
pass
return commonSidebar: how did I find those examples?
I’m not psychic, and I’m not good at spotting obscure bugs. Instead, I used Hypothesis, which is a powerful randomised, property-based testing library. Here’s the property I used:
Every mutual member of two lists must be a member of each list.
Sounds reasonable, right?
And here’s how I expressed that as a test:
from hypothesis import given, strategies as st
@given(st.text(), st.text())
def test_that_mutual_members_are_in_both_lists(list_a, list_b):
mutual_members = membertester(list_a, list_b)
for item in mutual_members:
assert item in list_a
assert item in list_bHypothesis tries hundreds of randomised examples, and checks whether this property holds – it found a bug on the second run. I highly recommend trying it.
Sidebar 2: what causes that bug?
Notice that in all of the examples, we have
len(b) > len(a). So we have longer = b, shorter = a. As you go through the for loop, you’re iterating over the elements of b, but you’re using a to choose which element gets marked as common. You want to replacec.append(a[i])with
c.append(l[i])With that in mind, you can make the code much cleaner:
def membertester(list_a, list_b):
"""a slow (iterative) dupe-preserving sorting subset tester,
for mutual membership tests (like set overloads &)"""
common = []
shorter, longer = sorted([list_a, list_b], key=len)
for item in longer:
if shorter.count(item) >= longer.count(item):
common.append(item)
return commonI’ve been unable to find any bugs in this implementation.
Code Snippets
>>> membertester('00000112', '000000011')
['2']
>>> print membertester('0001', '000/2')
['0', '0', '1']
>>> print membertester('002', '001/')
['0', '2']
>>> print membertester('10', '122')
['0']shorter, longer = sorted([a, b], key=len)def membertester(list_a, list_b):
"""a slow (iterative) dupe-preserving sorting subset tester,
for mutual membership tests (like set overloads &)"""
common = []
shorter, longer = sorted([list_a, list_b], key=len)
for idx, item in enumerate(longer):
try:
if shorter.count(item) >= longer.count(item):
common.append(list_a[idx])
except IndexError:
pass
return commonfrom hypothesis import given, strategies as st
@given(st.text(), st.text())
def test_that_mutual_members_are_in_both_lists(list_a, list_b):
mutual_members = membertester(list_a, list_b)
for item in mutual_members:
assert item in list_a
assert item in list_bc.append(a[i])Context
StackExchange Code Review Q#122266, answer score: 3
Revisions (0)
No revisions yet.