Remove duplicates in list of lists where an item in each list is ignored when determining duplicity
Problem
I have lines_list_copy of the form:

    [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ... ]

I needed to remove all duplicate entries where a, b, c, d are identical. Note that I don't care what value e has. For example, if

    lines_list_copy = [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ['a1', 'b1', 'c1', 'd1', 'e1'], ['a1', 'b1', 'c1', 'd1', 'e2']]

then three entries are the same, namely lines_list_copy[0], lines_list_copy[2] and lines_list_copy[3], and any two of them need to be deleted, which gives us the value of lines_list. Deleting any two results in a valid output for lines_list.

lines_list_copy typically has more than 200000 entries and will realistically exceed 500000 with the amount of data we are collecting, so I needed a fast way to remove duplicates. I found a way to efficiently remove all duplicates, but the method takes e into account and thus wouldn't give me what I need. Therefore, I delete the e value in each list first, like so:

    for x in lines_list_copy:
        del x[cfg.TEXT_LOC_COL]
    lines_list_copy = [list(x) for x in set(tuple(x) for x in lines_list_copy)]

After which I have lines_list_copy as I need it. All I need to do is re-add any one of the e values for each list. My double for loop is admittedly naive, and I didn't think it would bring my program to a crawl:

    for line_copy_ind in range(len(lines_list_copy)):
        for line_ind in range(len(lines_list)):
            if lines_list_copy[line_copy_ind][cfg.TIME_COL] == lines_list[line_ind][cfg.TIME_COL] and \
                    len(lines_list_copy[line_copy_ind]) == 4:
                lines_list_copy[line_copy_ind].append(lines_list[line_ind][cfg.TEXT_LOC_COL])
    lines_list = lines_list_copy

I looked into vectorizing and using filter, but I just can't seem to reverse engineer solutions to other problems and make them work for my problem. Maybe there's an elegant way for me to instead not delete the e column and still remove duplicates efficiently without considering the e values?

Solution
Your set-based approach almost works, and should be more efficient than the nested loops. Try storing only the first 4 items in the set, as opposed to the entire row:

    def unique_by_first_n(n, coll):
        seen = set()
        for item in coll:
            compare = tuple(item[:n])  # Keep only the first `n` elements in the set
            if compare not in seen:
                seen.add(compare)
                yield item

    filtered_list = list(unique_by_first_n(4, lines_list_copy))

Here I've adapted your set approach to consider only the first n items when comparing rows, but then to yield the entire row if it hasn't been seen yet. The yield could also be replaced by first defining an empty output list and appending to it, like in your second solution.

We get around the del e problem by not deleting anything, just returning all of the rows where the first n columns have not been seen yet.

Code Snippets
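The answer notes that the yield could be replaced by building an output list. A minimal sketch of that variant, run on the sample data from the question, might look like the following. The name unique_by_first_n_list is hypothetical and does not appear in the original answer; the sketch assumes every row has at least n items and that those items are hashable.

```python
def unique_by_first_n_list(n, rows):
    """Return the rows whose first `n` items have not been seen before.

    Hypothetical list-building variant of the answer's generator;
    the first occurrence of each key is the one that is kept.
    """
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[:n])  # only the first n columns take part in the comparison
        if key not in seen:
            seen.add(key)
            result.append(row)  # keep the whole row, including the ignored column
    return result


# Sample data from the question: rows 0, 2 and 3 share the same first four columns.
lines_list_copy = [
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a2', 'b2', 'c2', 'd2', 'e2'],
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a1', 'b1', 'c1', 'd1', 'e2'],
]
lines_list = unique_by_first_n_list(4, lines_list_copy)
print(lines_list)
# [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2']]
```

Like the generator version, this makes a single pass over the rows with one set lookup per row, so it avoids the nested loops from the question. It keeps whichever duplicate comes first, which the question says is acceptable.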
def unique_by_first_n(n, coll):
    seen = set()
    for item in coll:
        compare = tuple(item[:n])  # Keep only the first `n` elements in the set
        if compare not in seen:
            seen.add(compare)
            yield item

filtered_list = list(unique_by_first_n(4, lines_list_copy))

Context
StackExchange Code Review Q#114793, answer score: 6