patternpythonMinor

Finding line numbers of duplicate lines in a log file

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

filelinelogduplicatenumbersfindinglines

Problem

I have a log file with a million lines, and my task is to code something that outputs the lines that have duplicates, and the line numbers.

I've thought of two ways to approach this :

1) Use python's inner tools :

load my file's lines in a list,

load them in a dict in which the keys are the lines, and the values are the number of times they appear

for each element of my dict where value >= 2, output the element and the indexes of this element in my list

Here's my code :

def find_dupl(log):
    # put in list
    with open(log) as l:
        liste = l.readlines()

    # put in dict
    dico = dict()
    for i in liste:
        dico[i] = dico.get(i, 0) + 1

    output_dict = {}    
    for i in dico:
        if dico[i] > 1: # for dico's element where value >= 2
#            print(i,        # print the element
#                dico[i],    # how many times it appears
#                [a+1 for a, b in enumerate(liste) if b == i] # the lines where it appears
#                )
            output_dict[i] = [a+1 for a, b in enumerate(liste) if b == i]
    return(output_dict)

2) Go bruteforce, which is something like this :

load my file's lines in a list

for each element i of the list

check all the elements j after i

if i == j, output i, i's index and j's index

Option 1's code works fine and is fast (like 1 second on my (pretty average) computer), so my job is done and I haven't written the code for option 2. But I'm being curious on which one would be faster : I can see that option 2 is something like O(n^2) (or am I mistaken ?), but as I don't know the inner workings of lists and dicts (I didn't major in CS), I'm not really able to tell the O() of option 1.

I'm also curious about if there would be an even faster way (without modules) ?

Solution

To go through your programs:

-

Load the file into a list - \$O(n)\$ time and memory.

Go through the list and add to a dict - \$O(n)\$ time and memory.

-
Loop through the new dict: \$O(n)\$ time and memory,

if the item has duplicates, get the lines: \$O(n)\$ time, \$O(1)\$ memory.

In total \$O(n^2)\$ time, \$O(n)\$ memory.

And so it's \$O(n^2)\$ time, \$O(n)\$ memory.

-

Load the file into a list - \$O(n)\$ time and memory.

-
Loop through the file - \$O(n)\$ time, \$O(1)\$ memory.

Loop through the file for second index - \$O(n)\$ time and \$O(1)\$ memory.

In total \$O(n^2)\$ time, \$O(n)\$ memory.

It's \$O(n^2)\$ time, and \$O(n)\$ memory.

And so I'd change your program to remove the second loop through the file, to force this you can remove changing the file to a list.
You want to use what you are in (1), but to store a list of all the file indexes. After this you then want to filter all items that don't have a length greater than 1.
And so you should be able to get something like:

from collections import defaultdict

def find_dupl(log):
    output = defaultdict(list)
    with open(log) as f:
        for i, line in enumerate(f):
            output[line].append(i)
    return {k: v for k, v in output.items() if len(v) > 1}

This has \$O(n)\$ time and \$O(n)\$ memory. Yes it does use a module, but it's built in and makes your code much faster, and simpler.

Code Snippets

from collections import defaultdict

def find_dupl(log):
    output = defaultdict(list)
    with open(log) as f:
        for i, line in enumerate(f):
            output[line].append(i)
    return {k: v for k, v in output.items() if len(v) > 1}

Context

StackExchange Code Review Q#149566, answer score: 4

Revisions (0)

No revisions yet.