Find and process duplicates in list of lists

Submitted by: @import:stackexchange-codereview

Problem

I'm trying to merge the counts for duplicate items (URLs) in a list like this:

[['foo',1], ['bar',3],['foo',4]]


I came up with a function, but it's slow when I run it on 50k entries. I'd appreciate it if somebody could review it and suggest improvements.

def dedupe(data):
    ''' Finds duplicates in data and merges the counts '''
    result = []
    for row in data:
        url, count = row
        # list() keeps this subscriptable on Python 3, where filter() returns an iterator
        url_already_in_result = list(filter(lambda res_row: res_row[0] == url, result))
        if url_already_in_result:
            url_already_in_result[0][1] += count
        else:
            result.append(row)
    return result

def test_dedupe():
    data = [['foo',1], ['bar',3],['foo',4]]
    assert dedupe(data) == [['foo',5], ['bar',3]]

Solution

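Your version is slow because filter rescans the whole result list for every row, so the running time grows quadratically with the number of rows. It looks like you could use collections.Counter instead, ideally earlier in your code, at the point where you create the list of pairs you pass to dedupe. As is, you could use the following in your code (note that it returns (url, count) tuples rather than lists):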

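from collections import Counter

def dedupe(data):
    ''' Finds duplicates in data and merges the counts '''
    result = Counter()
    for url, count in data:
        # Missing keys start at 0, so no membership check is needed
        result[url] += count
    # list() so a list is also returned on Python 3, where items() is a view
    return list(result.items())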

>>> data = [['foo',1], ['bar',3],['foo',4]]
>>> dedupe(data)
[('foo', 5), ('bar', 3)]
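
If the list-of-lists output from the question matters (the original test compares against lists, not tuples), a variant along these lines might work. This is only a sketch: the names dedupe_rows and count_urls are hypothetical, and the first-seen ordering relies on dicts (and therefore Counter) preserving insertion order, which holds on Python 3.7 and later. count_urls illustrates the "use it earlier" suggestion under the assumption that the raw input is a plain iterable of URLs.

```python
from collections import Counter


def dedupe_rows(data):
    """Merge counts for repeated URLs, returning [url, count] lists."""
    totals = Counter()
    for url, count in data:
        totals[url] += count  # one hash lookup per row instead of a list scan
    # Counter keeps first-seen key order on Python 3.7+, so 'foo' stays first.
    return [[url, total] for url, total in totals.items()]


def count_urls(urls):
    """Hypothetical earlier step: count raw URLs directly, skipping the pairs."""
    return Counter(urls)


data = [['foo', 1], ['bar', 3], ['foo', 4]]
assert dedupe_rows(data) == [['foo', 5], ['bar', 3]]
# Unlike the original dedupe, the input rows are not modified in place.
assert data == [['foo', 1], ['bar', 3], ['foo', 4]]
```

Each row costs one dictionary lookup on average, so the whole pass is roughly linear in the number of rows, which should matter at 50k entries.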

Context

StackExchange Code Review Q#24458, answer score: 6
