Python compare every line in file with itself
Problem
I am implementing a statistical program, have run into a performance bottleneck, and am hoping the community can point me in the direction of an optimization.
I create a set for each row in a file and find the intersection of that set with the set of every other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop ($O(n^2)$), and the files coming into the program are typically just over 20,000 lines long. I have timed the algorithm: for files under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at my disposal and a fast 4-core Intel i7 processor. I have noticed no significant difference in memory use when copying list1 and using a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the with open mechanism reads/writes directly to the HDD, which is slower, but I noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype, or perhaps understand multiprocessing in Python better, and might be able to help me speed things up. I appreciate any help, and I hope my code isn't too poorly written.
```
import ast, sys, os, shutil

list1 = []
filterValue = 3
end = 0

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
```
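To illustrate the intended filtering on a toy input (made-up values): here, a row is dropped when it shares at least `filterValue` items with another row.

```python
import ast

# Hypothetical 3-line input file, one Python literal per line
rows = ["[1, 2, 3, 4]", "[1, 2, 3, 9]", "[7, 8]"]
sets = [set(ast.literal_eval(r)) for r in rows]

filterValue = 3
# The first two rows share {1, 2, 3} (3 items), so both are dropped;
# only {7, 8} survives the filter
kept = [s for s in sets
        if all(len(s & other) < filterValue
               for other in sets if other is not s)]
print(kept)
```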
Solution
On reading the file twice
Constructing data structures is costly. You do it once for each line in the file ($n$ times) and then do it again $n^2$ times. There is absolutely no need for that, as you can iterate over a single list twice:
```
>>> a = [[1], [2], [3]]
>>> for i in a:
...     for j in a:
...         print(i, j)
...
[1] [1]
[1] [2]
[1] [3]
[2] [1]
[2] [2]
[2] [3]
[3] [1]
[3] [2]
[3] [3]
```

That way, with only one list, you avoid spending time converting strings to data structures and can focus on your algorithm.
So just build the list once and use it any time you want.
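Applied to your code, that means parsing the file once and then looping over the same parsed list twice. A minimal sketch (assuming `path` stands in for your `arg2 + arg1` prefix):

```python
import ast

# Parse every row exactly once...
with open(path + "/file") as infile:
    list1 = [set(ast.literal_eval(row)) for row in infile]

# ...then compare rows within that single list: no second read,
# no repeated literal_eval
for set1 in list1:
    for set2 in list1:
        pass  # comparison logic goes here
```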
Use functions
Using functions lets you reuse your work more easily, simply by changing parameters. It also makes it easier to add code that handles command-line arguments, so you don't need to edit your code each time your input/output file changes.
A simple layout could look like:
```
def parse_input(filename):
    ...

def compute_output(output_file, data, filter_value):
    ...

def filter_file(path, filter_value=3, in_name='file', out_name='filteredSets'):
    data = parse_input('{}/{}'.format(path, in_name))
    with open('{}/{}{}'.format(path, out_name, filter_value), 'w') as out_file:
        if len(data) > 1:
            compute_output(out_file, data, filter_value)
        else:
            for datum in data:
                print(datum, file=out_file)

if __name__ == '__main__':
    filter_file(arg2 + arg1)
```

I'm assuming Python 3 here for the `print(datum, file=out_file)` line; change it accordingly if you are using Python 2. It will, however, change your output a bit, but more on that later. I'm also using a `for` loop in the `else` just in case there are no rows in `data`.
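Since the layout above takes its parameters as arguments, wiring in command-line handling is straightforward. A minimal sketch with `argparse` (the flag names here are just an illustration, not part of the proposed code):

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description='Filter out sets that overlap too much.')
    parser.add_argument('path', help='directory containing the input file')
    parser.add_argument('--filter-value', type=int, default=3)
    parser.add_argument('--in-name', default='file')
    parser.add_argument('--out-name', default='filteredSets')
    args = parser.parse_args()
    filter_file(args.path, args.filter_value, args.in_name, args.out_name)

if __name__ == '__main__':
    main()
```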
Separating concerns with functions lets you focus on being efficient at simple tasks. And if, later, you want to improve your code with `multiprocessing`, it will be easier to process the data in parallel (using `map_async`, for instance).
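For instance, here is a rough sketch of that parallelization (my illustration only: `compute_output_parallel` and the `_init`/`_keeps_row` helpers are made-up names, and each worker gets its own copy of the parsed rows through the pool initializer):

```python
from multiprocessing import Pool

_data = None
_filter_value = None

def _init(data, filter_value):
    # Runs once per worker process: stash the shared inputs
    global _data, _filter_value
    _data = data
    _filter_value = filter_value

def _keeps_row(i):
    # True if row i shares fewer than filter_value items with every other row
    return all(len(_data[i] & _data[j]) < _filter_value
               for j in range(len(_data)) if j != i)

def compute_output_parallel(output_file, data, filter_value, processes=4):
    with Pool(processes, initializer=_init, initargs=(data, filter_value)) as pool:
        for row, keep in zip(data, pool.map(_keeps_row, range(len(data)))):
            if keep:
                print(row, file=output_file)
```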
Iterate directly over items
Rather than iterating over indices and using them to retrieve items, iterate over the items themselves. Instead of:

```
for i in range(0, len(list1)):
    ...
    set2 = list1[i]
```

you should use:

```
for set2 in list1:
```

It's both easier to read and faster. If you truly need the index, you can still use `enumerate`:

```
for i, set2 in enumerate(list1):
```

But as far as eliminating tests against the same line in our product goes, you can check whether the two objects are the same:

```
for set1 in list1:
    for set2 in list1:
        if set1 is not set2:
            # Do something
```

Note that I used `is` instead of `==` to be sure that we skip only the very same line, and not another line with identical values:

```
>>> a = [[1], [1], [1]]
>>> for i in a:
...     for j in a:
...         print(i, j, i==j, i is j)
...
[1] [1] True True
[1] [1] True False
[1] [1] True False
[1] [1] True False
[1] [1] True True
[1] [1] True False
[1] [1] True False
[1] [1] True False
[1] [1] True True
```
Managing "stop-early" in loops
Python allows loops (`for` and `while`) an optional `else` clause. This clause executes only if the loop ran to completion without reaching a `break` (or a `return`, for what it's worth).
In your case, you can use that to your advantage in the inner `for` loop: instead of waiting to reach the end of the file for each line, you can `break` as soon as you find that two lines share more than 3 items. If you handle the writing of lines in the `else` clause of the inner `for` loop, you can rely on that `break` to not print a line that has too many items in common with another one.
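A minimal sketch of the pattern itself (toy values, unrelated to your data):

```python
# The else clause runs only when the for loop finishes without a break
for value in [1, 3, 5, 7]:
    if value % 2 == 0:
        print('found an even number, stopping early')
        break
else:
    print('no even number found')  # this is the branch taken here
```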
Miscellaneous
- You can use `os.path.join` to concatenate folders and filenames (see the snippet after this list).
- You can use `set1 & set2` as a shorthand for `set1.intersection(set2)`.
- You should write each import statement on its own line.
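A quick illustration of the first two tips (throwaway values):

```python
import os

# os.path.join inserts the platform's separator for you
print(os.path.join('some', 'path', 'filteredSets3'))  # some/path/filteredSets3 on POSIX

# & is the operator form of set.intersection
set1, set2 = {1, 2, 3}, {2, 3, 4}
print(set1 & set2 == set1.intersection(set2))  # True
```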
Proposed improvements
```
import os
from ast import literal_eval

def parse_input(filename):
    with open(filename) as f:
        data = [set(literal_eval(line)) for line in f]
    return data

def compute_output(output_file, data, filter_value):
    for set1 in data:
        for set2 in data:
            if set1 is set2:
                # Do not try to compare a row with itself
                continue
            if len(set1 & set2) >= filter_value:
                # Do not keep lines that have more than a few items
                # in common with any other line
                break
        else:
            # If the second for loop did not break, then keep the line
            print(set1, file=output_file)

def filter_file(path, filter_value=3, in_name='file', out_name='filteredSets'):
    data = parse_input(os.path.join(path, in_name))
    output_filename = os.path.join(path, '{}{}'.format(out_name, filter_value))
    with open(output_filename, 'w') as out_file:
        compute_output(out_file, data, filter_value)

if __name__ == '__main__':
    filter_file(arg2 + arg1)
```

You should have noticed that I changed the output format a bit: each line is now a `set` instead of a `list`. It's easy to change that by calling `print(list(set1), file=output_file)` instead.
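To make that format difference concrete (a tiny illustration):

```python
import sys

row = {1, 2, 3}
print(row, file=sys.stdout)        # {1, 2, 3}
print(list(row), file=sys.stdout)  # [1, 2, 3]
```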
Context
StackExchange Code Review Q#119862, answer score: 6