HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

MemoryError in Python while combining multiple JSON files and outputting as single CSV

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
combiningwhileoutputtingcsvandmemoryerrorfilessinglepythonmultiple

Problem

I have a number of JSON files to combine and output as a single CSV (to load into R), with each JSON file at about 1.5GB. While doing a trial on 4-5 JSON files at 250MB each, the code works when I only use 2-3 files but chokes when the total file sizes get larger.

I'm running Python version 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on 8GB RAM and Windows 7 Professional 64 bit.

I'm a Python novice and have little experience with writing optimized code. I would appreciate guidance on optimizing my script below.

Python MemoryError

Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in 
    for line in file:
MemoryError
[Finished in 29.5s]


JSON to CSV conversion script

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)

csv = writer(out)

for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()

Solution

You're storing lots of results in lists, which could be streamed instead. Fortunately, using generators, you can make this ‘streaming’ relatively easy without changing too much of the structure of your code. Essentially, put each ‘step’ into a function, and then rather than appending to a list, yield each value. Then you’d have something that might look like this:

def load_json():
    for file in open_files:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    datum = json.loads(line)
                except:
                    pass
                else:
                    yield datum


Then you could replace

for tweet in tweets:


with

for tweet in load_json():


But this will still store the IDs into ids, texts into texts, etc. You could use a bunch of generators and zip them together with itertools.izip, but the better solution would be to extract the columns from the tweet when writing each line. Then (omitting the UTF-8 encoding piece, which you'd want to rewrite to work on dictionaries) you’d have

for tweet in load_json():
    csv.writerow((tweet['id'], tweet['text'], ...))


Lastly, since this is Code Review, you might consider putting all the keys you want to pull out into a list:

columns = ['id', 'text', ...]


Then your row-writing code can be simplified to

csv.writerow([tweet[key] for key in columns])


Rewriting it this way, you can then UTF-8 encode rather easily (using a helper function):

def encode_if_possible(value, codec):
    if hasattr(value, 'encode'):
        return value.encode(codec)
    else:
        return value

csv.writerow([encode_if_possible(tweet[key], 'utf-8') for key in columns])

Code Snippets

def load_json():
    for file in open_files:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    datum = json.loads(line)
                except:
                    pass
                else:
                    yield datum
for tweet in tweets:
for tweet in load_json():
for tweet in load_json():
    csv.writerow((tweet['id'], tweet['text'], ...))
columns = ['id', 'text', ...]

Context

StackExchange Code Review Q#49792, answer score: 7

Revisions (0)

No revisions yet.