Reading, writing and filtering a CSV file
Problem
I have a CSV with 5+ million rows and I want to filter it:
```python
with open("test1.csv") as csv_file:
    writer = csv.writer(outputfile, delimiter=',')
    for row in csv.reader(csv_file, delimiter=','):
        row = [int(x) for x in row]
        for i in range(1, 11):
            if add_i(row) == True:
                break
            elif row not in row_dup:
                row_dup.append(row)
                writer.writerow(row)
```

I have many functions, the simplest one being `add_i`, which you can see used in the snippet above:

```python
if row[0:3] == range(row[0], row[0] + 3 * i, i):
    return True
```

`row_dup = []` is an empty array I use to store duplicates so that they don't get written to the file.

I want to improve the execution time of my script, but it takes hours. According to a friend's calculation, it takes 530 hours or so. I'm not sure about that, but regardless, it is still taking a long time.
How I tried to improve the speed of the script:
- Using PyPy, but I didn't notice a significant improvement.
- Removing `row = [int(x) for x in row]`, but then I'd have to write `if [int(x) for x in row[0:4]] == range(int(row[0]), int(row[0]) + 4 * i, i):`, which is annoying, and there's still a `for` loop.
- Removing `for i in range(1, 11)`, but then I would have to write the indexes myself.
- Removing both `for i in range(1, 11)` and `row = [int(x) for x in row]`, and doing it like so: `if (int(row[0]) + 1 == int(row[1])) and int(row[1]) + 1 == int(row[2]) and int(row[2]) + 1 == int(row[3]) and int(row[3]) + 1 == int(row[4]): do something`. This is a desperate plan and the most annoying: I get rid of two `for` loops and do everything by hand, increasing the index value from 1 to 11 by copy-pasting.
- Considered using Hadoop, but the VPS I have are much less powerful than my PC, low-end boxes.
- Considered using SQLite instead of CSV, but I'm not convinced it would make things much faster.
- Considered writing it in C++, and in fact the first version
Solution
> This is a desperate plan and the most annoying; I get rid of two `for` loops and do everything by hand.
I will take this question to be 'How can I remove the two inner for loops if possible?'.
First off, you change the row slice from 'the first three' to 'the first four' partway through your question, so please post all of your code; don't make us guess.
Anyway, this is what I think you want to give us:
```python
import csv

def add_i(row, i):
    if row[0:3] == range(row[0], row[0] + 3 * i, i):
        return True

row_dup = []
with open("out_file", "w") as out_file:
    writer = csv.writer(out_file, delimiter=',')
    with open("test1.csv") as csv_file:
        for row in csv.reader(csv_file, delimiter=','):
            row = [int(x) for x in row]
            for i in range(1, 11):
                if add_i(row, i) == True:
                    break
                elif row not in row_dup:
                    row_dup.append(row)
                    writer.writerow(row)
```

First, we don't want this to be the worst version of `cp`, and so we will use the `for`/`else` statement:

```python
for i in range(1, 11):
    if add_i(row, i):
        break
else:
    if row not in row_dup:
        # ...
```

This makes it so that `[2, 4, 6]` gets removed as well. Without it you would have a bug: in `add_i`, the comparison against `range(2, 5, 1)` returns `False` for `i = 1`, and the row would be added to the output before the larger steps were ever checked.
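To make the fix concrete, here is a minimal, self-contained sketch (assuming Python 3, so the `range` comparison is wrapped in `list()`; the function names follow the snippets above) showing how the `for`/`else` only keeps a row after every step has been ruled out:

```python
def add_i(row, i):
    # True when the first three values form an arithmetic run with step i;
    # list() is needed in Python 3, where range() is no longer a list
    return row[0:3] == list(range(row[0], row[0] + 3 * i, i))

def keep(row):
    for i in range(1, 11):
        if add_i(row, i):
            break
    else:
        # no break: no step in 1..10 matched, so the row should be kept
        return True
    return False

print(keep([2, 4, 6]))  # False: matches step 2, so it is filtered out
print(keep([2, 3, 5]))  # True: no constant step, so it is kept
```

With the original `if`/`elif` inside the loop, `[2, 4, 6]` would have been written out on the failed `i = 1` check before `i = 2` was ever tried.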
We can make the `for` statement just an `if` statement. This is because you use `range(start, stop, step)`; I will use `i`, `j` and `k` for start, stop and step respectively. We can say that `range` returns this sort of structure: `[i, i+k, i+2k, ..., i + ((j-i)//k)*k]`. As you use `range(row[0], row[0] + 3*i, i)`, we know that it is `[i, i+k, i+2k]`, where `i` and `k` are `row[0]` and `i` respectively.

As this is getting rather complicated, here is how I would implement it:

```python
def add_i(row, bounds):
    # bounds = (low, high): the allowed range for the step
    # (named bounds to avoid shadowing the builtin range)
    step = row[1] - row[0]
    if bounds[0] <= step < bounds[1]:
        return step == (row[2] - row[1])
    return False
```

While this does not compare lists, it uses the same logic: the difference between each pair of neighbours has to be `step`, and `step` has to be in the allowed range.

Now that we have gotten rid of one of the two internal `for` loops, we need to get rid of the other for this to be what I would deem the 'correct' answer.
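As a sanity check (a sketch with my own naming, assuming Python 3 and integer rows), the step-based test can be compared against the original range-comparison test to confirm they agree:

```python
def add_i_ranges(row, i):
    # the original comparison, wrapped in list() for Python 3
    return row[0:3] == list(range(row[0], row[0] + 3 * i, i))

def add_i_step(row, bounds):
    # the loop-free version: two subtractions instead of building ten ranges
    step = row[1] - row[0]
    if bounds[0] <= step < bounds[1]:
        return step == row[2] - row[1]
    return False

# both tests should agree for every row, whether or not it is a run
for row in ([2, 4, 6], [1, 2, 3], [2, 3, 5], [5, 5, 5], [10, 9, 8]):
    assert any(add_i_ranges(row, i) for i in range(1, 11)) == add_i_step(row, (1, 11))
```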
The other loop is the list comprehension that converts the row to ints. If you have only 3 columns in all the rows, removing it would be a relatively fast and really easy conversion; however, I am guessing that is not the case. And so if we just remove it, we will have to convert the columns into numbers manually. We can just do this in `add_i` again:

```python
def add_i(row, bounds):
    r1 = int(row[1])
    step = r1 - int(row[0])
    if bounds[0] <= step < bounds[1]:
        return step == (int(row[2]) - r1)
    return False
```

Edit: I thought I should make a benchmark for sets, and found they were very fast.
```python
row_dup = set()

# ... in the for loop
row_ = ''.join(row)
if not add_i(row, (1, 11)) and row_ not in row_dup:
    row_dup.add(row_)
```

This allows for very fast look-ups, and only has a ~27% impact on speed. The way we make it possible to use sets is by making the data hashable: strings are hashable, and since with this change the data in `row` stays all strings, you can simply `str.join` it all together.

And so we have an answer with only one `for` loop:
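One caveat worth noting (my observation, not from the original answer): `''.join` can map two different rows to the same key, because the column boundaries are lost. A `tuple` of the row is also hashable and avoids the collision:

```python
# hypothetical rows illustrating the choice of set key
rows = [['1', '23'], ['12', '3'], ['1', '23']]

joined = {''.join(r) for r in rows}   # both distinct rows become '123'
tupled = {tuple(r) for r in rows}     # tuples preserve the column split

print(len(joined))  # 1 -- ('1', '23') and ('12', '3') collide
print(len(tupled))  # 2 -- the two distinct rows stay distinct
```

In the full program this would mean testing `tuple(row) not in row_dup` and calling `row_dup.add(tuple(row))` instead of joining.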
```python
import csv

def add_i(row, bounds):
    r1 = int(row[1])
    step = r1 - int(row[0])
    if bounds[0] <= step < bounds[1]:
        return step == (int(row[2]) - r1)
    return False

row_dup = set()
with open("out_file", "w") as out_file:
    writer = csv.writer(out_file, delimiter=',')
    with open("test1.csv") as csv_file:
        for row in csv.reader(csv_file, delimiter=','):
            row_ = ''.join(row)
            if not add_i(row, (1, 11)) and row_ not in row_dup:
                row_dup.add(row_)
                writer.writerow(row)
```

This is roughly O(n), and you can add a list comprehension to allow more steps easily:
```python
def add_i(row, bounds):
    r0 = int(row[0])
    step = int(row[1]) - r0
    if bounds[0] <= step < bounds[1]:
        # the parameter must not be called range here,
        # since this line needs the builtin range
        return row[1:3] == [str(i) for i in range(r0 + step, r0 + 3 * step, step)]
    return False
```

As you seem to want to look over more than just 3 columns, judging by your example with four, a list comprehension like this is probably how you would want to do it.

If you expand on this, then you would want to change `add_i` to take only a row, and to be called something like `format_test`. This way you are testing whether the row follows a format, and if it does not, you add it (if not already added) to the new file. You can also now tell better what the main part of the program is doing, as it is saying: 'if this row passes this test, add it to the out file.'
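A sketch of that suggestion (the name `format_test` comes from the answer; the `bounds` default and the generalisation to any number of columns are my additions), testing whether an entire row of strings forms one arithmetic run:

```python
def format_test(row, bounds=(1, 11)):
    # True when every column continues the run started by the first two,
    # with a step inside the allowed bounds
    r0 = int(row[0])
    step = int(row[1]) - r0
    if bounds[0] <= step < bounds[1]:
        return row[1:] == [str(v) for v in range(r0 + step, r0 + len(row) * step, step)]
    return False

print(format_test(['2', '4', '6', '8']))  # True: a step-2 run over four columns
print(format_test(['2', '4', '6', '9']))  # False: the last column breaks the run
```

The main loop then reads as a single sentence: if the row fails `format_test` and has not been seen, write it out.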
Also, if you want some benchmarks: I used 100000 random 3-column CSV lines, of numbers up to 10000.

New program with `row_dup` as type `list`:

```
real 3m37.109s
user 3m36.857s
sys  0m0.057s
```

New prog
Context
StackExchange Code Review Q#95511, answer score: 2