HiveBrain v1.2.0
pattern · python · Minor

Printing out JSON data from Twitter as a CSV

Submitted by: @import:stackexchange-codereview
Tags: csv, printing, json, twitter, from, data, out

Problem

I'm extremely new to Python and the Twitter API, but I found an example online that walked me through the process. Now that I've been playing around with it for a while, I've begun to push the limits of what my laptop can handle in terms of processing power with my current code. I was hoping someone here could check over it and make recommendations on optimizing it.

Goal: Take JSON output from the Twitter streaming API and print out specific fields to a CSV file. (sys.argv is used to pass the input and output filenames.)

# import necessary modules
import json
import sys

# define a new variable for tweets
tweets=[]

# import tweets from JSON
for line in open(sys.argv[1]):
    try:
        tweets.append(json.loads(line))
    except:
        pass

# print the name of the file and number of tweets imported
print "File Imported:", str(sys.argv[1])
print "# Tweets Imported:", len(tweets)

# create a new variable for a single tweet
tweet=tweets[0]

# pull out various data from the tweets
tweet_id = [tweet['id'] for tweet in tweets]
tweet_text = [tweet['text'] for tweet in tweets]
tweet_time = [tweet['created_at'] for tweet in tweets]
tweet_author = [tweet['user']['screen_name'] for tweet in tweets]
tweet_author_id = [tweet['user']['id_str'] for tweet in tweets]
tweet_language = [tweet['lang'] for tweet in tweets]
tweet_geo = [tweet['geo'] for tweet in tweets]

#outputting to CSV
out = open(sys.argv[2], 'w')
print >> out, 'tweet_id, tweet_time, tweet_author, tweet_author_id,    tweet_language, tweet_geo, tweet_text'

rows = zip(tweet_id, tweet_time, tweet_author, tweet_author_id,    tweet_language, tweet_geo, tweet_text)

from csv import writer
csv = writer(out)

for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()

# print name of exported file
print "File Exported:", str(sys.argv[2])

Solution

- Put all import statements at the top.

- You might leak a file descriptor, since you call open(sys.argv[1]) without closing it. (Whether a leak actually occurs depends on the garbage collector of your Python implementation.) Best practice is to use a with block, which automatically closes the resources when it terminates.

with open(sys.argv[1]) as in_file, \
     open(sys.argv[2], 'w') as out_file:
    …


- It would be better to define your variables in the same order as they will appear in the CSV output.

- Rather than creating an empty list and appending to it in a loop, use a list comprehension.

tweets = [json.loads(line) for line in in_file]
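Note that this comprehension drops the original try/except guard, so a single malformed line would now abort the run. If you still want to skip bad lines, catch the specific exception rather than using a bare except. A minimal Python 3 sketch (parse_tweets is a hypothetical helper name, not part of the original code):

```python
import json

def parse_tweets(lines):
    """Yield parsed tweets, silently skipping lines that are not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue

tweets = list(parse_tweets(['{"id": 1}', 'not json', '{"id": 2}']))
print([t['id'] for t in tweets])  # [1, 2]
```

Catching ValueError (rather than everything) means genuine bugs, such as a KeyError later on, still surface instead of being swallowed.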


- You read all the tweets into an array of JSON objects, then slice the data "vertically" by attribute, then re-aggregate the data "horizontally". That's inefficient in terms of memory usage as well as cache locality.

- Unless you have a good reason, just transform one line of input at a time. (A good reason might be that you want to produce no output file at all if an error occurs while processing any line.)

import json
import sys
from csv import writer

with open(sys.argv[1]) as in_file, \
     open(sys.argv[2], 'w') as out_file:
    print >> out_file, 'tweet_id, tweet_time, tweet_author, tweet_author_id,    tweet_language, tweet_geo, tweet_text'
    csv = writer(out_file)
    tweet_count = 0

    for line in in_file:
        tweet_count += 1
        tweet = json.loads(line)

        # Pull out various data from the tweets
        row = (
            tweet['id'],                    # tweet_id
            tweet['created_at'],            # tweet_time
            tweet['user']['screen_name'],   # tweet_author
            tweet['user']['id_str'],        # tweet_author_id
            tweet['lang'],                  # tweet_language
            tweet['geo'],                   # tweet_geo
            tweet['text']                   # tweet_text
        )
        values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
        csv.writerow(values)

# print the name of the file and number of tweets imported
print "File Imported:", str(sys.argv[1])
print "# Tweets Imported:", tweet_count
print "File Exported:", str(sys.argv[2])
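The code above targets Python 2 (print statements, print >> redirection, manual .encode('utf8')). A hedged Python 3 translation, restructured as a function over file-like objects so it can be tested without sys.argv (the name tweets_to_csv is my own, not from the original):

```python
import csv
import io
import json

def tweets_to_csv(in_file, out_file):
    """Stream newline-delimited tweet JSON to CSV, one record at a time.

    In Python 3, csv.writer accepts Unicode text directly, so the manual
    .encode('utf8') step is no longer needed. Writing the header with the
    same csv.writer keeps header and data formatting consistent.
    """
    writer = csv.writer(out_file)
    writer.writerow(['tweet_id', 'tweet_time', 'tweet_author',
                     'tweet_author_id', 'tweet_language', 'tweet_geo',
                     'tweet_text'])
    count = 0
    for line in in_file:
        tweet = json.loads(line)
        writer.writerow([
            tweet['id'],
            tweet['created_at'],
            tweet['user']['screen_name'],
            tweet['user']['id_str'],
            tweet['lang'],
            tweet['geo'],
            tweet['text'],
        ])
        count += 1
    return count

# Exercise the function with in-memory streams instead of real files:
sample = ('{"id": 1, "created_at": "t", '
          '"user": {"screen_name": "a", "id_str": "9"}, '
          '"lang": "en", "geo": null, "text": "hi"}\n')
out = io.StringIO()
n = tweets_to_csv(io.StringIO(sample), out)
print(n)                               # 1
print(out.getvalue().splitlines()[1])  # 1,t,a,9,en,,hi
```

In real use you would call it as tweets_to_csv(in_file, out_file) inside the same with block shown above; csv.writer renders the None geo field as an empty cell.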


Context

StackExchange Code Review Q#44349, answer score: 8
