HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Reading, writing and filtering a CSV file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
readingfilecsvwritingandfiltering

Problem

I have a CSV with 5+ million rows and I want to filter it:

with open("test1.csv") as csv_file:
    writer =csv.writer(outputfile, delimiter=',')
    for row in csv.reader(csv_file, delimiter=','):
        row = [int(x) for x in row]

        for i in range (1,11):
            if add_i (row) == True:
                break

            elif row not in (row_dup):
                row_dup.append(row)
                writer.writerow(row)


I have many functions, the simplest one being add_i which you can see in the snippet above.

if row[0:3] == range(row[0], row[0] +3 * i, i):
    return True


row_dup = [] is an empty array I use to store duplicates so that they don't get written to the file.

I want to improve the execution time of my script, but it takes hours. According to the calculation of a friend, it takes 530 hours or so. I'm not sure about that but regardless, it is still taking a long time.

How I tried to improve the speed of the script:

  • Using PyPy but I didn't notice a significant improvement.



  • Removing row = [int(x) for x in row] but then I'd have to write if [int(x) for x in row[0:4]] == range(int(row[0]), int(row[0]) + 4 * i, i):


it's annoying and there's still a for loop.

  • Removing for i in range (1,11) but then I would have to write the indexes myself.



-
Removing for i in range (1,11) and row = [int(x) for x in row] and doing it like so

if (int(row[0]) + 1 == int(row[1])) and int(row[1]) + 1 == int(row[2])
and int(row[2]) + 1 == int(row[3]) and int(row[3]) + 1 == int(row[4]):
    do something


This is a desperate plan and the most annoying. I get rid of two for loops and do everything by hand, increasing the index value from 1 to 11 by copy pasting.

-
Considered using Hadoop but the VPS I have are so much less powerful than my PC, low end box.

-
Considered using SQLite instead of CSV but not convinced it would make things much faster.

-
Considered writing it in C++, and in fact the first version

Solution

This is desperate plan and the most annoying, I get rid of 2 for loops and do everything by hand

I will take this question to be 'How can I remove the two inner for loops if possible?'.

First off, you change the row slice from 'the first three' to 'the first four' in your question. And so post all your code. Don't make us guess.
Anyway, this is what I think you want to give us:

import csv

def add_i(row, i):
    if row[0:3] == range(row[0], row[0] +3 * i, i):
        return True

row_dup = []

with open("out_file", "w") as out_file:
    writer =csv.writer(out_file, delimiter=',')
    with open("test1.csv") as csv_file:
        for row in csv.reader(csv_file, delimiter=','):
            row = [int(x) for x in row]

            for i in range (1,11):
                if add_i (row, i) == True:
                    break

                elif row not in (row_dup):
                    row_dup.append(row)
                    writer.writerow(row)


First, we don't want this to be the worst version of cp, and so we will use the for else statement.

for i in range(1,11):
    if add_i(row, i):
        break
else:
    if row not in row_dup:
        # ...


This makes it so that [2, 4, 6] gets removed as well. This is as you would have a bug, so that in add_i, range(2, 5, 1) would return False, and add it to the output.

we can make the for statement, just an if statement. This is as you use range(start, stop, step). I will use i, j and k for start, stop and step respectfully.
We can say that range returns this sort of structure [i, i+k, i+2k, ... i+((j-i)//k)k]. As you use range(row[0], row[0] + 3i, i) we know that is [i, i+k, i+2k] where i and k are row[0] and i respectfully.
As this is getting rather complicated. Here is how I would implement it.

def add_i(row, range):
    step = row[1] - row[0]
    if range[0] <= step < range[1]:
        return step == (row[2] - row[1])
    return False


This while not comparing lists uses the logic, as the difference between them all has to be step, and step has to be in the allowed range.

Now that we have gotten rid of one of the two internal for loops, we need to get rid of the other, for this to be what I would deem is the 'correct' answer.
This loop is a list comprehension, and if you have only 3 columns in all the rows, it would be a relatively fast and a really easy conversion. However, I am guessing that is not the case.
And so if we just remove it, we will have to manually convert the columns into numbers. We can just do this in add_i again.

def add_i(row, range):
    r1 = int(row[1])
    step = r1 - int(row[0])
    if range[0] <= step < range[1]:
        return step == (int(row[2]) - r1)
    return False


Edit, I thought I should make a benchmark for sets, and found it was very fast.

row_dup = set()

# ... in the for loop

row_ = ''.join(row)
if not add_i(row, (1, 11)) and row_ not in row_dup:
    row_dup.add(row_)


This allows for very fast look-ups. And only has a ~27% impact on speed. The way that we make it so that we can use sets, is by making the data hash-able. Strings are hash-able, and with the change in the program, where the data in the row is all strings, it allows you to simply str.join it all together.

And so we have an answer with only one for loop.

import csv

def add_i(row, range):
    r1 = int(row[1])
    step = r1 - int(row[0])
    if range[0] <= step < range[1]:
        return step == (int(row[2]) - r1)
    return False

row_dup = set()

with open("out_file", "w") as out_file:
    writer = csv.writer(out_file, delimiter=',')
    with open("test1.csv") as csv_file:
        for row in csv.reader(csv_file, delimiter=','):
            row_ = ''.join(row)
            if not add_i(row, (1, 11)) and row_ not in row_dup:
                row_dup.add(row_)
                writer.writerow(row)


This is roughly O(n). And you can add a list comprehension to allow more steps easily.

def add_i(row, range):
    r0 = int(row[0])
    step = int(row[1]) - r0
    if range[0] <= step < range[1]:
        return row[1:3] == [str(i) for i in range(r0 + step, r0 + 3*step, step)]
    return False


As you seem to want to look over more than just 3 columns, due to your example with four, using a list comprehension is probably the way how you would want to do this.

If you expand on this then you would want to change add_i to only take a row, and be called something like format_test. This way, you are testing if the row follows a format, and if it does not then you add it, if not already added, to the new file. You can also now tell what the main part of the function is doing better, as it is saying, 'if this row passes this test, add it to the out file.'

Also, if you want some benchmarks.
I used 100000 random 3 column csv lines, of numbers up to 10000.

New program with row_dup as type list:

real    3m37.109s
user    3m36.857s
sys     0m0.057s


New prog

Code Snippets

import csv

def add_i(row, i):
    if row[0:3] == range(row[0], row[0] +3 * i, i):
        return True

row_dup = []

with open("out_file", "w") as out_file:
    writer =csv.writer(out_file, delimiter=',')
    with open("test1.csv") as csv_file:
        for row in csv.reader(csv_file, delimiter=','):
            row = [int(x) for x in row]

            for i in range (1,11):
                if add_i (row, i) == True:
                    break


                elif row not in (row_dup):
                    row_dup.append(row)
                    writer.writerow(row)
for i in range(1,11):
    if add_i(row, i):
        break
else:
    if row not in row_dup:
        # ...
def add_i(row, range):
    step = row[1] - row[0]
    if range[0] <= step < range[1]:
        return step == (row[2] - row[1])
    return False
def add_i(row, range):
    r1 = int(row[1])
    step = r1 - int(row[0])
    if range[0] <= step < range[1]:
        return step == (int(row[2]) - r1)
    return False
row_dup = set()

# ... in the for loop

row_ = ''.join(row)
if not add_i(row, (1, 11)) and row_ not in row_dup:
    row_dup.add(row_)

Context

StackExchange Code Review Q#95511, answer score: 2

Revisions (0)

No revisions yet.