Processing large file in Python
Problem
I have some code that calculates the "sentiment" of a Tweet.
The task starts with an AFINN file, which is a tab-separated list of around 2500 key-value pairs. I read this into a dict using the following function:

    import csv

    csv.register_dialect('tsv', delimiter='\t')

    def processSentiments(file):
        with open(file) as sentiments:
            reader = csv.reader(sentiments, 'tsv')
            return dict((word, int(sentiment)) for word, sentiment in reader)

I'm happy with this code - it's pretty clear, and I'm not worried about its speed or memory usage as the file is small.
Using the created dict, I now need to process many tweets. Let's say I have 1 GB of them in JSON format, from the live stream.

What I need to do is:
- read a tweet file, with a JSON tweet on each line
- parse each tweet to a dict using json.loads
- extract the text field from the tweet, giving the content of the tweet
- for each word in the content, check if it has a sentiment
- for each sentiment word in the tweet, calculate its value (from the AFINN dict) and sum across the tweet
- store that number
So whilst I have a lot of tweet data, I only need a list of integers. This is obviously much smaller.
I want to be able to stream the data from the file and convert each line to its sentiment value in the most efficient way possible. Here's what I came up with:

    import json

    def extractSentiment(tweetText, sentimentsDict):
        return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

    def processTweet(tweet, sentimentsDict):
        try:
            tweetJson = json.loads(tweet)
        except ValueError as e:
            raise Exception('Invalid tweet: ', tweet, e)
        tweetText = tweetJson.get('text', False)
        if tweetText:
            return extractSentiment(tweetText, sentimentsDict)
        else:
            return 0

    def processTweets(file, sentimentsDict):
        with open(file) as tweetFile:
            return [processTweet(tweet, sentimentsDict) for tweet in tweetFile]

So I am using a for-comprehension
Solution
As you may have suspected, the list comprehension in processTweets makes this non-streaming and memory-hungry, because the entire result list has to be held in memory before it is returned to the caller. That might be fine, as it's just a list of integers.
You can make this streaming by turning this method into a generator:

    def processTweets(path, sentimentsDict):
        with open(path) as fh:
            for tweet in fh:
                yield processTweet(tweet, sentimentsDict)

Note that I renamed your variables (file -> path, tweetFile -> fh) because file shadows an existing Python class name (and is usually syntax-highlighted), and because I like to use fh for throwaway filehandles ;-)
You don't need the default False value in tweetJson.get('text', False). This will work just fine, because .get will return None, which is falsy:

    tweetText = tweetJson.get('text')
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0

In processTweet, you catch ValueError and raise an Exception. Since you are not doing anything special to handle the ValueError, you could just let the error bubble up, and it might speed up the process a bit if Python doesn't need to wrap the code within a try/except for every single tweet.
Code Snippets
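As a rough, self-contained sketch of how the generator version of processTweets might be consumed, the example below repeats the functions from the Solution and runs them against a small temporary file. The sample dict, the sample tweets, and the temporary file are hypothetical stand-ins for the AFINN dict and the real tweet file. It assumes one JSON object per line and lets a ValueError from json.loads propagate to the caller.

```python
import json
import os
import tempfile


def extractSentiment(tweetText, sentimentsDict):
    # Sum the scores of known words; unknown words count as 0.
    return sum(sentimentsDict.get(word, 0) for word in tweetText.split())


def processTweet(tweet, sentimentsDict):
    # A ValueError from json.loads is left to bubble up to the caller.
    tweetText = json.loads(tweet).get('text')
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    return 0


def processTweets(path, sentimentsDict):
    # Generator: yields one score per line, so only one tweet is parsed at a time.
    with open(path) as fh:
        for tweet in fh:
            yield processTweet(tweet, sentimentsDict)


# Hypothetical sample data standing in for the AFINN dict and the tweet file.
sample_sentiments = {'good': 3, 'bad': -3}
sample_tweets = [
    {'text': 'good good day'},
    {'text': 'bad day'},
    {'id': 1},  # no 'text' field
]

with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for tweet in sample_tweets:
        tmp.write(json.dumps(tweet) + '\n')

try:
    # Materialise the scores only if the full list is really needed...
    scores = list(processTweets(tmp.name, sample_sentiments))
    # ...otherwise aggregate directly without ever building the list.
    total = sum(processTweets(tmp.name, sample_sentiments))
finally:
    os.remove(tmp.name)
```

Because the file is opened inside the generator, it stays open until the generator is exhausted, so the caller should consume it fully (or close it explicitly) rather than abandon it halfway.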

    def processTweets(path, sentimentsDict):
        with open(path) as fh:
            for tweet in fh:
                yield processTweet(tweet, sentimentsDict)

    tweetText = tweetJson.get('text')
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0

Context
StackExchange Code Review Q#56256, answer score: 6