patternpythonMinor
MemoryError in Python while combining multiple JSON files and outputting as single CSV
Viewed 0 times
combiningwhileoutputtingcsvandmemoryerrorfilessinglepythonmultiple
Problem
I have a number of JSON files to combine and output as a single CSV (to load into R), with each JSON file at about 1.5GB. While doing a trial on 4-5 JSON files at 250MB each, the code works when I only use 2-3 files but chokes when the total file sizes get larger.
I'm running Python version 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on 8GB RAM and Windows 7 Professional 64 bit.
I'm a Python novice and have little experience with writing optimized code. I would appreciate guidance on optimizing my script below.
Python MemoryError
JSON to CSV conversion script
I'm running Python version 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on 8GB RAM and Windows 7 Professional 64 bit.
I'm a Python novice and have little experience with writing optimized code. I would appreciate guidance on optimizing my script below.
Python MemoryError
Traceback (most recent call last):
File "C:\Users\...\tweetjson_to_csv.py", line 52, in
for line in file:
MemoryError
[Finished in 29.5s]JSON to CSV conversion script
# csv file that you want to save to
out = open("output.csv", "ab")
filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)
# change argument to the file you want to open
for file in open_files:
for line in file:
# only keep tweets and not the empty lines
if line.rstrip():
try:
tweets.append(json.loads(line))
except:
pass
for tweet in tweets:
ids.append(tweet["id_str"])
texts.append(tweet["text"])
time_created.append(tweet["created_at"])
retweet_counts.append(tweet["retweet_count"])
... ...
print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)
csv = writer(out)
for row in rows:
values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
csv.writerow(values)
out.close()Solution
You're storing lots of results in lists, which could be streamed instead. Fortunately, using generators, you can make this ‘streaming’ relatively easy without changing too much of the structure of your code. Essentially, put each ‘step’ into a function, and then rather than
Then you could replace
with
But this will still store the IDs into
Lastly, since this is Code Review, you might consider putting all the keys you want to pull out into a list:
Then your row-writing code can be simplified to
Rewriting it this way, you can then UTF-8 encode rather easily (using a helper function):
appending to a list, yield each value. Then you’d have something that might look like this:def load_json():
for file in open_files:
for line in file:
# only keep tweets and not the empty lines
if line.rstrip():
try:
datum = json.loads(line)
except:
pass
else:
yield datumThen you could replace
for tweet in tweets:with
for tweet in load_json():But this will still store the IDs into
ids, texts into texts, etc. You could use a bunch of generators and zip them together with itertools.izip, but the better solution would be to extract the columns from the tweet when writing each line. Then (omitting the UTF-8 encoding piece, which you'd want to rewrite to work on dictionaries) you’d havefor tweet in load_json():
csv.writerow((tweet['id'], tweet['text'], ...))Lastly, since this is Code Review, you might consider putting all the keys you want to pull out into a list:
columns = ['id', 'text', ...]Then your row-writing code can be simplified to
csv.writerow([tweet[key] for key in columns])Rewriting it this way, you can then UTF-8 encode rather easily (using a helper function):
def encode_if_possible(value, codec):
if hasattr(value, 'encode'):
return value.encode(codec)
else:
return value
csv.writerow([encode_if_possible(tweet[key], 'utf-8') for key in columns])Code Snippets
def load_json():
for file in open_files:
for line in file:
# only keep tweets and not the empty lines
if line.rstrip():
try:
datum = json.loads(line)
except:
pass
else:
yield datumfor tweet in tweets:for tweet in load_json():for tweet in load_json():
csv.writerow((tweet['id'], tweet['text'], ...))columns = ['id', 'text', ...]Context
StackExchange Code Review Q#49792, answer score: 7
Revisions (0)
No revisions yet.