HiveBrain v1.2.0

Python compare every line in file with itself

Submitted by: @import:stackexchange-codereview

Problem

I am implementing a statistical program and have created a performance bottleneck and was hoping that I could obtain some help from the community to possibly point me in the direction of optimization.

I am creating a set for each row in a file and finding the intersection of that set with the set data of each other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop ($O(n^2)$) and the standard size of the files coming into the program is just over 20,000 lines. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.

I have 16 GB of RAM at my disposal and a fairly fast 4-core Intel i7 processor. I have noticed no significant difference in memory use by copying list1 and using a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the with open mechanism reads/writes directly to the HDD, which is slower, but noticed no difference when using two lists. In fact, the program rarely uses more than 1 GB of RAM during operation.

I am hoping that other people have used a certain datatype, or maybe understand multiprocessing in Python better, and might be able to help me speed things up. I appreciate any help and I hope my code isn't too poorly written.

```
import ast, sys, os, shutil

list1 = []
filterValue = 3
end = 0

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
```

Solution

On reading the file twice

Constructing data structures is costly. You do it once for each line in the file ($n$ times) and then do it again $n^2$ times. There is absolutely no need: you can build a single list and reuse it:

>>> a = [[1], [2], [3]]
>>> for i in a:
...     for j in a:
...         print(i, j)
...
[1] [1]
[1] [2]
[1] [3]
[2] [1]
[2] [2]
[2] [3]
[3] [1]
[3] [2]
[3] [3]


That way, with only one list, you avoid spending time converting strings to data structures and can focus on your algorithm.

So just build the list once and reuse it any time you want.

Use functions

It lets you reuse your work more easily, by changing parameters. You can also more easily add code to handle command-line arguments and not need to change your code each time you change your input/output file.

A simple layout could look like:

def parse_input(filename):
    ...

def compute_output(output_file, data, filter_value):
    ...

def filter_file(path, filter_value=3, in_name='file', out_name='filteredSets'):
    data = parse_input('{}/{}'.format(path, in_name))
    with open('{}/{}{}'.format(path, out_name, filter_value), 'w') as out_file:
        if len(data) > 1:
            compute_output(out_file, data, filter_value)
        else:
            for datum in data:
                print(datum, file=out_file)

if __name__ == '__main__':
    filter_file(arg2 + arg1)


I'm assuming Python 3 here for the print(datum, file=out_file) line; change it accordingly if you are using Python 2. It will, however, change your output a bit, but more on that later. I'm also using a for loop in the else branch just in case there are no rows in data.

Separating concerns with functions lets you focus on being efficient at simple tasks. And if you later want to improve your code with multiprocessing, it will be easier to process the data in parallel (using map_async, for instance).
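As a rough illustration of that parallel direction (a hypothetical sketch, not the final code: `keeps_row` and `parallel_filter` are made-up names, and each worker redundantly receives the whole list):

```python
from multiprocessing import Pool

def keeps_row(args):
    # A row is kept only if it shares fewer than filter_value items
    # with every *other* row in the data.
    i, data, filter_value = args
    return all(len(data[i] & data[j]) < filter_value
               for j in range(len(data)) if j != i)

def parallel_filter(data, filter_value=3, processes=4):
    # Decide each row independently, in parallel, then keep the winners.
    with Pool(processes) as pool:
        flags = pool.map(keeps_row,
                         [(i, data, filter_value) for i in range(len(data))])
    return [row for row, keep in zip(data, flags) if keep]
```

Note that this sketch loses the early-exit `break` discussed below, so it trades some per-row work for parallelism.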

Iterate directly over items

Rather than iterating over indices and using those indices to retrieve items, iterate over the items themselves. Instead of:

for i in range(0, len(list1)):
    ...
    set2 = list1[i]


You should use:

for set2 in list1:


It's both easier to read and faster. If you truly need the index, you can still use enumerate:

for i, set2 in enumerate(list1):


As for eliminating comparisons of a line against itself in our product, you can test whether the two objects are the same:

for set1 in list1:
    for set2 in list1:
        if set1 is not set2:
            # Do something


Note that I used is instead of == to be sure that we are performing tests on the same line and not on lines with identical values:

>>> a = [[1], [1], [1]]
>>> for i in a:
...     for j in a:
...         print(i, j, i==j, i is j)
...
[1] [1] True True
[1] [1] True False
[1] [1] True False
[1] [1] True False
[1] [1] True True
[1] [1] True False
[1] [1] True False
[1] [1] True False
[1] [1] True True


Managing "stop-early" in loops

Python allows an optional else clause on loops (for and while). This clause executes only if the loop ran to completion without reaching a break (or a return, for that matter).

In your case, you can use that to your advantage in the inner for loop: instead of scanning to the end of the file for each line, you can break as soon as you find that two lines share at least 3 items. If you handle writing lines in the else clause of the inner for loop, you can rely on that break to skip printing a line that has too many items in common with another one.
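A minimal, self-contained illustration of this for/else pattern (with made-up data):

```python
rows = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}]
filter_value = 3

kept = []
for set1 in rows:
    for set2 in rows:
        if set1 is not set2 and len(set1 & set2) >= filter_value:
            break           # too much overlap: discard set1, stop scanning
    else:
        kept.append(set1)   # inner loop never broke: keep the line
```

Here only `{7, 8, 9}` survives, since the first two rows share three items.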

Miscellaneous

  • You can use os.path.join to concatenate folders and filenames.



  • You can use set1 & set2 as a shorthand to set1.intersection(set2).



  • You should write each import statement on its own line.



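For instance (with a made-up directory name):

```python
import os

# os.path.join inserts the right separator for the platform
out_path = os.path.join('some_dir', 'filteredSets' + str(3))

# & is the operator form of set.intersection
a = {1, 2, 3}
b = {2, 3, 4}
assert (a & b) == a.intersection(b)
```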
Proposed improvements

import os
from ast import literal_eval

def parse_input(filename):
    with open(filename) as f:
        data = [set(literal_eval(line)) for line in f]
    return data

def compute_output(output_file, data, filter_value):
    for set1 in data:
        for set2 in data:
            if set1 is set2:
                # Do not try to compare a row with itself
                continue
            if len(set1 & set2) >= filter_value:
                # Do not keep lines that have more than a few items
                # in common with any other line
                break
        else:
            # If the second for loop did not break, then keep the line
            print(set1, file=output_file)

def filter_file(path, filter_value=3, in_name='file', out_name='filteredSets'):
    data = parse_input(os.path.join(path, in_name))
    output_filename = os.path.join(path, '{}{}'.format(out_name, filter_value))
    with open(output_filename, 'w') as out_file:
        compute_output(out_file, data, filter_value)

if __name__ == '__main__':
    filter_file(arg2 + arg1)


You should have noticed that I changed the output format a bit: each line is now a set instead of a list. It's easy to change that back by calling `print(list(set1), file=output_file)` instead.


Context

StackExchange Code Review Q#119862, answer score: 6
