patternpythonMinor
Finding line numbers of duplicate lines in a log file
Viewed 0 times
filelinelogduplicatenumbersfindinglines
Problem
I have a log file with a million lines, and my task is to code something that outputs the lines that have duplicates, and the line numbers.
I've thought of two ways to approach this :
1) Use python's inner tools :
Here's my code :
2) Go bruteforce, which is something like this :
Option 1's code works fine and is fast (like 1 second on my (pretty average) computer), so my job is done and I haven't written the code for option 2. But I'm being curious on which one would be faster : I can see that option 2 is something like O(n^2) (or am I mistaken ?), but as I don't know the inner workings of lists and dicts (I didn't major in CS), I'm not really able to tell the O() of option 1.
I'm also curious about if there would be an even faster way (without modules) ?
I've thought of two ways to approach this :
1) Use python's inner tools :
- load my file's lines in a list,
- load them in a dict in which the keys are the lines, and the values are the number of times they appear
- for each element of my dict where value >= 2, output the element and the indexes of this element in my list
Here's my code :
def find_dupl(log):
# put in list
with open(log) as l:
liste = l.readlines()
# put in dict
dico = dict()
for i in liste:
dico[i] = dico.get(i, 0) + 1
output_dict = {}
for i in dico:
if dico[i] > 1: # for dico's element where value >= 2
# print(i, # print the element
# dico[i], # how many times it appears
# [a+1 for a, b in enumerate(liste) if b == i] # the lines where it appears
# )
output_dict[i] = [a+1 for a, b in enumerate(liste) if b == i]
return(output_dict)2) Go bruteforce, which is something like this :
- load my file's lines in a list
- for each element i of the list
- check all the elements j after i
- if i == j, output i, i's index and j's index
Option 1's code works fine and is fast (like 1 second on my (pretty average) computer), so my job is done and I haven't written the code for option 2. But I'm being curious on which one would be faster : I can see that option 2 is something like O(n^2) (or am I mistaken ?), but as I don't know the inner workings of lists and dicts (I didn't major in CS), I'm not really able to tell the O() of option 1.
I'm also curious about if there would be an even faster way (without modules) ?
Solution
To go through your programs:
-
-
Loop through the new dict: \$O(n)\$ time and memory,
if the item has duplicates, get the lines: \$O(n)\$ time, \$O(1)\$ memory.
In total \$O(n^2)\$ time, \$O(n)\$ memory.
And so it's \$O(n^2)\$ time, \$O(n)\$ memory.
-
-
Loop through the file - \$O(n)\$ time, \$O(1)\$ memory.
Loop through the file for second index - \$O(n)\$ time and \$O(1)\$ memory.
In total \$O(n^2)\$ time, \$O(n)\$ memory.
It's \$O(n^2)\$ time, and \$O(n)\$ memory.
And so I'd change your program to remove the second loop through the file, to force this you can remove changing the file to a list.
You want to use what you are in (1), but to store a list of all the file indexes. After this you then want to filter all items that don't have a length greater than 1.
And so you should be able to get something like:
This has \$O(n)\$ time and \$O(n)\$ memory. Yes it does use a module, but it's built in and makes your code much faster, and simpler.
-
- Load the file into a list - \$O(n)\$ time and memory.
- Go through the list and add to a dict - \$O(n)\$ time and memory.
-
Loop through the new dict: \$O(n)\$ time and memory,
if the item has duplicates, get the lines: \$O(n)\$ time, \$O(1)\$ memory.
In total \$O(n^2)\$ time, \$O(n)\$ memory.
And so it's \$O(n^2)\$ time, \$O(n)\$ memory.
-
- Load the file into a list - \$O(n)\$ time and memory.
-
Loop through the file - \$O(n)\$ time, \$O(1)\$ memory.
Loop through the file for second index - \$O(n)\$ time and \$O(1)\$ memory.
In total \$O(n^2)\$ time, \$O(n)\$ memory.
It's \$O(n^2)\$ time, and \$O(n)\$ memory.
And so I'd change your program to remove the second loop through the file, to force this you can remove changing the file to a list.
You want to use what you are in (1), but to store a list of all the file indexes. After this you then want to filter all items that don't have a length greater than 1.
And so you should be able to get something like:
from collections import defaultdict
def find_dupl(log):
output = defaultdict(list)
with open(log) as f:
for i, line in enumerate(f):
output[line].append(i)
return {k: v for k, v in output.items() if len(v) > 1}This has \$O(n)\$ time and \$O(n)\$ memory. Yes it does use a module, but it's built in and makes your code much faster, and simpler.
Code Snippets
from collections import defaultdict
def find_dupl(log):
output = defaultdict(list)
with open(log) as f:
for i, line in enumerate(f):
output[line].append(i)
return {k: v for k, v in output.items() if len(v) > 1}Context
StackExchange Code Review Q#149566, answer score: 4
Revisions (0)
No revisions yet.