
Efficiently filter a large (100gb+) csv file (v2)

Submitted by: @import:stackexchange-codereview
Tags: efficiently, file, csv, 100gb, filter, large

Problem

EDIT: This question is followed up by this question.

I'm in the process of filtering some very(!) large files (100gb+): I can't download files with a lower granularity. This is a followup from this question.

The problem is as follows: I need to filter large files that look like the following (3b+ rows).

TIC, Date, Time, Bid, Offer
AAPL, 20090901, 09:45, 145, 145.5 
AAPL, 20090902, 09:45, 145, 145.5
AAPL, 20090903, 09:45, 145, 145.5


I filter based on TICKER + DATE combinations found in an external file. I have, on average, ~1,200 dates of interest per firm for ~700 firms. The large file contains all dates for the firms of interest, from which I want to extract only the few dates of interest. The big files are split by month (2013-01, 2013-02, etc.). The external file of firm–date pairs looks like this:

AAPL, 20090902
AAPL, 20090903
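
Since the filter runs a membership test once per row of a multi-billion-row file, a dict of sets (O(1) lookup, versus O(n) for a list of dates) is one way to hold these pairs. A minimal Python 3 sketch, assuming the two-column layout above; `load_dates_of_interest` is a hypothetical helper name:

```python
import csv
from collections import defaultdict

def load_dates_of_interest(path):
    """Map each ticker to the set of dates to keep (hypothetical helper)."""
    dates = defaultdict(set)
    with open(path, newline="") as csvfile:
        # skipinitialspace handles the space after the comma in the sample data
        reader = csv.reader(csvfile, skipinitialspace=True)
        next(reader)  # skip the header row
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            dates[row[0]].add(row[1])
    return dates
```

The set makes `row[1] in d[row[0]]` constant-time regardless of how many dates a firm has.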


A few changes were made since the previous post:

  • I used the csv module, as was suggested.
  • I write the rows to be retained to disk after every 5m rows.
  • I iterate over the files using a try/except statement.
I'm currently at 6 minutes of processing time for 30 million rows (1% of the file); I tested a few files and it works properly. However, with about 3 billion rows per file, that puts it at ~10 hours for one 120gb file. Seeing as I have about twelve files, I'm very curious whether I can get significant performance improvements by doing things differently.

Any tips are greatly appreciated.

```
import os
import datetime
import csv
import re

ROOT_DIR = "H:/ROOT_DIR/"
SOURCE_FILES = os.path.join(ROOT_DIR, "10. Intradayfiles (source)/")
EXPORT_DIR = os.path.join(ROOT_DIR, "11. CSV Export (step 1 Extract relevant firmdates)/")
DATES_FILE = os.path.join(ROOT_DIR, "10. Dates of interest/firm_date_of_interest.csv")

# Build the original date dict
# For example:
# d['AAPL'] is a list with ['20140901', '20140902', '20140901']
with open(DATES_FILE, "r") as csvfile:
    d = {}
    reader = csv.reader(csvfile)
    reader.next()  # skip the header row (Python 2)
    for line in reader:
        firm = line[0]
        d.setdefault(firm, []).append(line[1])
# ... (the remainder of the script is truncated in the archived post)
```

Solution

You can make the reading of the file a generator function:

```
def getLines(filename, d):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            try:
                if row[1] in d[row[0]]:
                    yield row
            except KeyError:    # ticker not in the dates-of-interest dict
                continue
            except IndexError:  # malformed / short row
                continue
```

This extracts the actual filter to a different function.

I would also suggest not accumulating the list of rows to write out, but instead using the generator to create chunks directly.
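
One way to sketch that, using `itertools.islice` to pull fixed-size chunks from the generator; `write_in_chunks` is a hypothetical helper name, and the default chunk size mirrors the 5m-row flushes described above:

```python
import csv
from itertools import islice

def write_in_chunks(rows, dest, chunk_size=5000000):
    """Write an iterable of rows to dest in fixed-size chunks, so memory
    stays bounded even when rows is a generator over billions of lines."""
    rows = iter(rows)
    with open(dest, "w", newline="") as out:
        writer = csv.writer(out)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break  # generator exhausted
            writer.writerows(chunk)
```

It would be consumed as, e.g., `write_in_chunks(getLines(filename, d), export_path)`, replacing the accumulate-then-write step.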

Context

StackExchange Code Review Q#88778, answer score: 3
