Remove duplicates in list of lists where an item in each list is ignored when determining duplicity

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I have lines_list_copy of the form:

[['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ... ]

I need to remove all duplicate entries where a, b, c and d are identical; the value of e does not matter. For example, if

lines_list_copy = [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ['a1', 'b1', 'c1', 'd1', 'e1'], ['a1', 'b1', 'c1', 'd1', 'e2']]

then three entries count as the same, namely lines_list_copy[0], lines_list_copy[2] and lines_list_copy[3]. Any two of them need to be deleted, and what remains is the value of lines_list. Deleting any two of the three gives a valid output for lines_list.

lines_list_copy typically has more than 200000 entries and will realistically exceed 500000 with the amount of data we are collecting, so I need a fast way to remove duplicates. I found a way to remove all duplicates efficiently, but that method takes e into account and so doesn't give me what I need. Therefore, I first delete the e value in each list, like so:

for x in lines_list_copy:
  del x[cfg.TEXT_LOC_COL]
lines_list_copy = [list(x) for x in set(tuple(x) for x in lines_list_copy)]

After this, lines_list_copy is deduplicated the way I need it. All that remains is to re-add any one of the e values to each list. My double for loop is admittedly naive, and I didn't expect it to bring my program to a crawl.

for line_copy_ind in range(len(lines_list_copy)):
    for line_ind in range(len(lines_list)):
        if lines_list_copy[line_copy_ind][cfg.TIME_COL] == lines_list[line_ind][cfg.TIME_COL] and \
                len(lines_list_copy[line_copy_ind]) == 4:
            lines_list_copy[line_copy_ind].append(lines_list[line_ind][cfg.TEXT_LOC_COL])
lines_list = lines_list_copy

I looked into vectorizing and using filter, but I just can't seem to reverse engineer solutions to other problems and make them work for my problem of a

Solution

"Maybe there's an elegant way for me to instead not delete the e column and still remove duplicates efficiently without considering the e values?"

Your set-based approach almost works, and should be more efficient than the nested loops. Try storing only the first 4 items in the set, as opposed to the entire row:

def unique_by_first_n(n, coll):
    seen = set()
    for item in coll:
        compare = tuple(item[:n])    # Keep only the first `n` elements in the set
        if compare not in seen:
            seen.add(compare)
            yield item

filtered_list = list(unique_by_first_n(4, lines_list_copy))

Here I've adapted your set approach to consider only the first n items when comparing rows, but to yield the entire row if those first n items haven't been seen yet. The yield could also be replaced by first defining an empty output list and appending to it, like in your second solution.
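
As a minimal sketch of that list-based variant (the name unique_by_first_n_list is hypothetical and not from the question), the generator can be rewritten to build and return a list directly. The sample rows are the ones from the question:

```python
def unique_by_first_n_list(n, coll):
    """Eager variant: collect the unique rows in a list instead of yielding them."""
    seen = set()
    result = []
    for item in coll:
        key = tuple(item[:n])  # only the first `n` elements decide uniqueness
        if key not in seen:
            seen.add(key)
            result.append(item)  # keep the whole row, including the ignored column
    return result


rows = [
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a2', 'b2', 'c2', 'd2', 'e2'],
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a1', 'b1', 'c1', 'd1', 'e2'],
]
print(unique_by_first_n_list(4, rows))
# [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2']]
```

Both versions do one set lookup per row, so either should scale roughly linearly with the number of rows, unlike the nested loops.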

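We get around the del e problem by not deleting anything: the function simply returns every row whose first n columns have not been seen yet. Because rows are yielded in input order, the first row of each group of duplicates is the one that is kept.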

Context

StackExchange Code Review Q#114793, answer score: 6
