Processing large file in Python

Submitted by: @import:stackexchange-codereview

Problem

I have some code that calculates the "sentiment" of a Tweet.

The task starts with an AFINN file, that is a tab-separated list of around 2500 key-value pairs. I read this into a dict using the following function:

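import csv

# Register a tab-separated dialect for the AFINN file.
csv.register_dialect('tsv', delimiter='\t')

def processSentiments(file):
    with open(file) as sentiments:
        reader = csv.reader(sentiments, 'tsv')
        # Each row is a (word, score) pair; scores are stored as ints.
        return dict((word, int(sentiment)) for word, sentiment in reader)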


I'm happy with this code - it's pretty clear and I'm not worried about its speed and memory usage as the file is small.

Using the created dict I now need to process many tweets. Let's say I have 1 GB of them in JSON format, from the live stream.

What I need to do is:

  • read a tweet file, with a JSON tweet on each line
  • parse each tweet to a dict using json.loads
  • extract the text field from the tweet, giving the content of the tweet
  • for each word in the content, check if it has a sentiment
  • for each sentiment word in the tweet, calculate its value (from the AFINN dict) and sum across the tweet
  • store that number

So whilst I have a lot of tweet data, I only need a list of integers. This is obviously much smaller.

I want to be able to stream the data from the file and convert each line to its sentiment value in the most efficient way possible. Here's what I came up with:

import json

def extractSentiment(tweetText, sentimentsDict):
    return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

def processTweet(tweet, sentimentsDict):
    try:
        tweetJson = json.loads(tweet)
    except ValueError as e:
        raise Exception('Invalid tweet: ', tweet, e)
    tweetText = tweetJson.get('text', False)
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0

def processTweets(file, sentimentsDict):
    with open(file) as tweetFile:
        return [processTweet(tweet, sentimentsDict) for tweet in tweetFile]


So I am using a list comprehension to build the list of results.

Solution

As you probably suspected, the list comprehension in processTweets makes this non-streaming: it has to build the entire result list in memory before returning to the caller. That might be fine, as it's just a list of integers, one per tweet.

You can make this streaming by turning this method into a generator:

def processTweets(path, sentimentsDict):
    with open(path) as fh:
        for tweet in fh:
            yield processTweet(tweet, sentimentsDict)


Note that I renamed your variables (file -> path, tweetFile -> fh): file shadows a built-in name in Python 2 (and is usually syntax highlighted as such), and I like to use fh for throwaway file handles ;-)
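To make the difference concrete, here is a minimal, self-contained sketch of how a caller might consume the generator version. The sample data (sampleSentiments, sampleTweets) and the temporary file are made up for illustration, and extractSentiment and processTweet are trimmed-down stand-ins for your functions, not a replacement for them.

```python
import json
import os
import tempfile


def extractSentiment(tweetText, sentimentsDict):
    # Sum the scores of known words; unknown words count as 0.
    return sum(sentimentsDict.get(word, 0) for word in tweetText.split())


def processTweet(tweet, sentimentsDict):
    tweetText = json.loads(tweet).get('text')
    return extractSentiment(tweetText, sentimentsDict) if tweetText else 0


def processTweets(path, sentimentsDict):
    # Generator: yields one score per line, so only one tweet is held at a time.
    with open(path) as fh:
        for tweet in fh:
            yield processTweet(tweet, sentimentsDict)


# Made-up sample data, for illustration only.
sampleSentiments = {'good': 3, 'bad': -3}
sampleTweets = [
    {'text': 'good good day'},
    {'text': 'bad weather'},
    {'id': 1},  # no 'text' field
]

# Write one JSON document per line, like the tweet file in the question.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for tweet in sampleTweets:
        tmp.write(json.dumps(tweet) + '\n')

scores = processTweets(tmp.name, sampleSentiments)
firstScore = next(scores)   # reads and scores only the first line
remaining = list(scores)    # drains the rest and lets the file close
os.remove(tmp.name)
```

Nothing is read from the file until the caller asks for a value. If you do want the full list, wrapping the call in list(...) gives you the same result as your original version.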

You don't need the default False value in tweetJson.get('text', False). Without it, .get returns None when the key is missing, and None is falsy, so this works just fine:

tweetText = tweetJson.get('text')
if tweetText:
    return extractSentiment(tweetText, sentimentsDict)
else:
    return 0
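A quick way to convince yourself of this is the small check below. The two dicts are made-up stand-ins for parsed tweets, and hasText is a hypothetical helper used only for this illustration.

```python
def hasText(tweetJson):
    # dict.get returns None when the key is absent, and bool(None) is False.
    return bool(tweetJson.get('text'))


tweetWithText = {'text': 'hello world'}
tweetWithoutText = {'id': 1}   # made-up payload with no 'text' key
tweetWithEmptyText = {'text': ''}

missingValue = tweetWithoutText.get('text')
results = [hasText(t) for t in (tweetWithText, tweetWithoutText, tweetWithEmptyText)]
```

Note that an empty string is falsy too, so a tweet with empty text takes the same branch as one with no text field, exactly as it did with the False default.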


In processTweet, you catch ValueError and raise a generic Exception. Since you are not doing anything special to handle the ValueError, you could just let the error bubble up. It might also speed up the process a bit if Python doesn't need to enter a try/except block for every single tweet.
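As a rough sketch of what that could look like, processTweet below has no try/except at all, and the caller deals with a bad line once. scoreLines is a hypothetical helper and the input lines are made up. Stopping at the first bad line is just one possible policy, chosen here for brevity.

```python
import json


def processTweet(tweet, sentimentsDict):
    # No try/except here: a malformed line makes json.loads raise ValueError
    # (json.JSONDecodeError in Python 3 is a subclass of ValueError).
    tweetText = json.loads(tweet).get('text')
    if tweetText:
        return sum(sentimentsDict.get(word, 0) for word in tweetText.split())
    return 0


def scoreLines(lines, sentimentsDict):
    # Hypothetical caller: the error is handled once, outside the loop.
    scores = []
    try:
        for line in lines:
            scores.append(processTweet(line, sentimentsDict))
    except ValueError as e:
        print('Stopped at invalid tweet: %s' % e)
    return scores


# Made-up input: the second line is not valid JSON.
sampleLines = ['{"text": "good"}', 'not json', '{"text": "bad"}']
partialScores = scoreLines(sampleLines, {'good': 3, 'bad': -3})
```

Whether this is measurably faster is something you would have to time on your own data; the main gain is that the original ValueError and its traceback reach the caller unchanged.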

Context

StackExchange Code Review Q#56256, answer score: 6
