Efficiently filter a large (100gb+) csv file (v2)
Problem
EDIT: This question is followed up by this question.
I'm in the process of filtering some very(!) large files (100gb+): I can't download files with a lower granularity. This is a follow-up from this question.
The problem is as follows: I need to filter large files that look like the following (3b+ rows):
```
TIC, Date, Time, Bid, Offer
AAPL, 20090901, 09:45, 145, 145.5
AAPL, 20090902, 09:45, 145, 145.5
AAPL, 20090903, 09:45, 145, 145.5
```
I filter based on TICKER+DATE combinations found in an external file. I have, on average, ~1,200 dates of interest per firm for ~700 firms. The large file contains all dates for the firms of interest, from which I want to extract only the few dates of interest. The big files are split by month (2013-01, 2013-02, etc.).
The dates-of-interest file looks like:
```
AAPL, 20090902
AAPL, 20090903
```
A few changes were made since the previous post:
- I used the csv module, as was suggested.
- I write the rows to be retained to disk after every 5 million rows.
- I iterate over the files using a try/except statement.
I'm currently at 6 minutes of processing time for 30 million rows (1% of the file); I tested a few files and it works properly. However, at about 3 billion rows per file, that puts it at ~10 hours for one 120 GB file. Seeing as I have about twelve files, I'm very curious whether I can get significant performance improvements by doing things differently.
Any tips are greatly appreciated.
```
import os
import datetime
import csv
import re

ROOT_DIR = "H:/ROOT_DIR/"
SOURCE_FILES = os.path.join(ROOT_DIR, "10. Intradayfiles (source)/")
EXPORT_DIR = os.path.join(ROOT_DIR, "11. CSV Export (step 1 Extract relevant firmdates)/")
DATES_FILE = os.path.join(ROOT_DIR, "10. Dates of interest/firm_date_of_interest.csv")

# Build the original date dict
# For example:
# d['AAPL'] is a list with ['20140901', '20140902', '20140903']
with open(DATES_FILE, "r") as csvfile:
    d = {}
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for line in reader:
        # The loop body was truncated in the copy; reconstructed from the
        # comment above: map each firm to its list of dates of interest.
        firm = line[0]
        d.setdefault(firm, []).append(line[1])
```
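The rest of the script did not survive the copy. For orientation, here is a minimal sketch of the filtering loop as described in the bullet points above (row-by-row lookup guarded by try/except, flushing retained rows every 5 million input rows); the one-output-file-per-source-file naming is an assumption, not something stated in the question:
```
# Sketch, not the original code: filter each monthly source file against d,
# flushing retained rows to disk every 5 million input rows.
for fname in os.listdir(SOURCE_FILES):
    kept = []
    with open(os.path.join(SOURCE_FILES, fname), "rb") as src, \
         open(os.path.join(EXPORT_DIR, fname), "wb") as dst:  # Python 2 csv style
        writer = csv.writer(dst)
        for i, row in enumerate(csv.reader(src), 1):
            try:
                if row[1] in d[row[0]]:   # TICKER+DATE of interest?
                    kept.append(row)
            except (KeyError, IndexError):
                continue                  # unknown firm or malformed row
            if i % 5000000 == 0:          # flush every 5 million rows
                writer.writerows(kept)
                kept = []
        writer.writerows(kept)            # flush the remainder
```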
Solution
You can make the reading of the file a generator function:
```
def getLines(filename, d):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            try:
                if row[1] in d[row[0]]:
                    yield row
            except KeyError:
                continue
            except IndexError:
                continue
```
This extracts the actual filter to a different function.
I would also suggest not accumulating the list of values to write out, but instead using the generator to create chunks directly.
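A minimal sketch of that chunked write, reusing the 5-million-row flush size from the question (`write_in_chunks` and the file names are illustrative, not part of the answer):
```
import csv
from itertools import islice

def write_in_chunks(rows, out_path, chunk_size=5000000):
    # Drain the generator in fixed-size chunks, so at most one chunk of
    # retained rows is ever held in memory.
    with open(out_path, "wb") as outfile:  # "wb": Python 2 csv style, as above
        writer = csv.writer(outfile)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            writer.writerows(chunk)

# For example, filtering one monthly file with the dict d built earlier:
write_in_chunks(getLines("2013-01.csv", d), "2013-01_filtered.csv")
```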
Context
StackExchange Code Review Q#88778, answer score: 3