HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Merge two list and discarding duplicates

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
mergediscardingtwoandlistduplicates

Problem

I am trying to implement a function that merges two ordered lists into a third (ordered) one, but duplicates have to be discarded (basically the last step of a mergesort).

I think that this code can be generalized to an arbitrary number or lists, by keeping a list of indices, rather than 2 explicit ones.

def merge_no_duplicates(list_1, list_2):
    i = j = 0
    import collections
    result = collections.deque()

    while i  1 and list_1[i] == result[-1]:
            i += 1
            continue
        if len(result) > 1 and list_2[j] == result[-1]:
            j += 1
            continue

        if list_1[i] <= list_2[j]:

            result.append(list_1[i])
            i += 1
        elif list_2[j] < list_1[i]:
            result.append(list_2[j])
            j += 1

    # we still need to consume part of list_2
    if i == len(list_1):
        while j < len(list_2):

            if list_2[j] == result[-1]:
                j += 1
                continue 

            result.append(list_2[j])
            j += 1

    # still need to consume part of list_1
    if j == len(list_2):
        while i < len(list_1):

            if list_1[i] == result[-1]:
                i += 1
                continue 

            result.append(list_1[i])
            i += 1
    return result


Possible improvements: factor out the repetitive parts, for example by using something similar to this helper function:

check_duplicates(my_list, index):
    if my_list[index] == result[-1]:
        index += 1


Unfortunately this won't work (index is a parameter, and will not influence the behavior of i or j). Is this possible at all?

Solution

Pick a good return type

A deque is a Double Ended QUEue. Deques are great for when you have to insert and erase from the front as well as the back. You never need this for your problem - all you ever need is to insert at the end. You just need a normal list.

Generate when you can

Rather than necessarily giving the entire result all at once, it's better to just yield the next element as you go. This is a simple change in your algorithm (just yield x instead of result.append(x)), but could have serious performance implications down the line if you have lots of large iterables. If the caller wants a full list, then can always explicitly write list(merge_no_duplicates(a, b, c)).

Use the standard when you can

There is a function that already merges sorted inputs into a single sorted output: heapq.merge. It will give you duplicates, but that seems like a much better starting point than writing everything from scratch:

def merge_no_duplicates(*iterables):
    last = object()

    for val in heapq.merge(*iterables):
        if val != last:
            last = val
            yield val


If you don't want to use heapq.merge, then you can at least use this framework to separate the "merge sorted iterables" concern from the "remove duplicates" concern.

Code Snippets

def merge_no_duplicates(*iterables):
    last = object()

    for val in heapq.merge(*iterables):
        if val != last:
            last = val
            yield val

Context

StackExchange Code Review Q#108171, answer score: 5

Revisions (0)

No revisions yet.