HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Duplicate-preserving collection intersection

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
duplicatepreservingcollectionintersection

Problem

Python sets are magical things. Mutable, deduplicating data stores with handy operator overloads for &, ^ and |.

Specifically among the others, sets overload the & operator to do mutual membership tests like so:

>>> set("aasd") & set("aabc")
{'a'}


This is endlessly handy, but the huge downside of sets is they do not allow duplicate entries because they're hash tables, so for instance a fuzzy-finder will be a little too fuzzy.

What to do? Why, spin my own, of course!

def membertester(a, b):
    """a slow (iterative) dupe-preserving sorting subset tester,
    for mutual membership tests (like set overloads &)"""
    c = []
    a, b = sorted(list(a)), sorted(list(b))
    (l, maxl, s, minl) = {
        True:  (a, len(a), b, len(b)),
        False: (b, len(b), a, len(a))
    }[ len(a) >= len(b) ]

    for i in range(maxl):
        try:
            if s.count(l[i]) >= l.count(l[i]):
                c.append(a[i])

        except IndexError:
            pass
    return c


How it works, because it might be a little cryptic:

  • Retrieve the longer and shorter of the two sorted lists, and their lengths, using a switch statement.



  • For each item in the longer range, take note of the item if the number of occurences of the same item in the short list is greater than or equal to the number of occurences of the item in the longer list.



Used as above:

>>> membertest("aasd", "aabc")
['a', 'a']


Success!

How can I improve it? Is there a faster or more efficient way, or perhaps a less flimsy algorithm?

Assertions that should pass:

assert membertest("a" * 6, '') == []
assert membertest("abdeg", "rrertabasdddhjkdeg") == ['a', 'b', 'd', 'e', 'g']
assert membertest("abddeg", "rrertabasdddhjkdeg") == ['a', 'b', 'd', 'd', 'e', 'g']


An inconsistency I don't understand, but which is definitely a bug:

>>> is_subset('10', '122')
['0']
>>> is_subset("ab", "acc")
['a']


I don't understand this in the slightest.

I wrote this

Solution

A few quick comments:

-
As written, the code doesn’t always give the correct answer. Here are some cases I found that give what appear to be incorrect results:

>>> membertester('00000112', '000000011')
['2']
>>> print membertester('0001', '000/2')
['0', '0', '1']
>>> print membertester('002', '001/')
['0', '2']
>>> print membertester('10', '122')
['0']


If I’ve understood the intention of the program correctly, those are all wrong. If they’re correct, you need to do a better job of explaining the program.

-
The way you’ve written it is a bit fiddly – some comments interspersed with the program would make it much easier to read. That, and choosing better variable names – characters are cheap, don’t just use single letters.

For example, how about longer and shorter instead of l and s?

-
The construction for assigning (l, maxl, s, minl) is very weird, and it took me several reads to make sure I’d understood it correctly. This is too clever – I’d break it up into multiple statements. A few ways to do simplify it:

  • You don’t use the minl variable, so get rid of it. (If you meant to use it, is that a bug?)



  • The for loop can be replaced by for i, item in enumerate(longer), doing away with the need for the maxl variable.



-
You could assign the two remaining variables in a neater way:

shorter, longer = sorted([a, b], key=len)


or as an if statement if you wanted to be really explicit.

After a bit of tidying, this is what my version of the function looks like (note that it still has the same bugs as above):

def membertester(list_a, list_b):
    """a slow (iterative) dupe-preserving sorting subset tester,
    for mutual membership tests (like set overloads &)"""
    common = []
    shorter, longer = sorted([list_a, list_b], key=len)

    for idx, item in enumerate(longer):
        try:
            if shorter.count(item) >= longer.count(item):
                common.append(list_a[idx])
        except IndexError:
            pass
    return common


Sidebar: how did I find those examples?

I’m not psychic, and I’m not good at spotting obscure bugs. Instead, I used Hypothesis, which is a powerful randomised, property-based testing library. Here’s the property I used:


Every mutual member of two lists must be a member of each list.

Sounds reasonable, right?

And here’s how I expressed that as a test:

from hypothesis import given, strategies as st

@given(st.text(), st.text())
def test_that_mutual_members_are_in_both_lists(list_a, list_b):
    mutual_members = membertester(list_a, list_b)
    for item in mutual_members:
        assert item in list_a
        assert item in list_b


Hypothesis tries hundreds of randomised examples, and checks whether this property holds – it found a bug on the second run. I highly recommend trying it.

Sidebar 2: what causes that bug?

Notice that in all of the examples, we have len(b) > len(a). So we have longer = b, shorter = a. As you go through the for loop, you’re iterating over the elements of b, but you’re using a to choose which element gets marked as common. You want to replace

c.append(a[i])


with

c.append(l[i])


With that in mind, you can make the code much cleaner:

def membertester(list_a, list_b):
    """a slow (iterative) dupe-preserving sorting subset tester,
    for mutual membership tests (like set overloads &)"""
    common = []
    shorter, longer = sorted([list_a, list_b], key=len)

    for item in longer:
        if shorter.count(item) >= longer.count(item):
            common.append(item)

    return common


I’ve been unable to find any bugs in this implementation.

Code Snippets

>>> membertester('00000112', '000000011')
['2']
>>> print membertester('0001', '000/2')
['0', '0', '1']
>>> print membertester('002', '001/')
['0', '2']
>>> print membertester('10', '122')
['0']
shorter, longer = sorted([a, b], key=len)
def membertester(list_a, list_b):
    """a slow (iterative) dupe-preserving sorting subset tester,
    for mutual membership tests (like set overloads &)"""
    common = []
    shorter, longer = sorted([list_a, list_b], key=len)

    for idx, item in enumerate(longer):
        try:
            if shorter.count(item) >= longer.count(item):
                common.append(list_a[idx])
        except IndexError:
            pass
    return common
from hypothesis import given, strategies as st

@given(st.text(), st.text())
def test_that_mutual_members_are_in_both_lists(list_a, list_b):
    mutual_members = membertester(list_a, list_b)
    for item in mutual_members:
        assert item in list_a
        assert item in list_b
c.append(a[i])

Context

StackExchange Code Review Q#122266, answer score: 3

Revisions (0)

No revisions yet.