HiveBrain v1.2.0

Data analytics on static file of 50,000+ tweets


Problem

I'm trying to optimize the main loop portion of this code, as well as learn any "best practices" insights I can for all of the code. This script currently reads in one large file full of tweets (50MB to 1GB). It uses pandas to play with the data, and matplotlib to generate 2D graphs.

Currently, this does not scale well and consumes massive amounts of RAM. To help save on cost/VPS resources, I would like to refine this code (:

Example import file:

```
{"created_at":"Mon Jan 25 21:41:03 +0000 2016","id":691737570879918080,"id_str":"691737570879918080","text":"Suspect Named in Antarctica \"Billy\" Case #fakeheadlinebot #learntocode #makeatwitterbot #javascript","source":"\u003ca href=\"http:\/\/javascriptiseasy.com\" rel=\"nofollow\"\u003eJavaScript is Easy\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4382400263,"id_str":"4382400263","name":"JavaScript is Easy","screen_name":"javascriptisez","location":"Your Console","url":"http:\/\/javascriptiseasy.com","description":"Get learning!","protected":false,"verified":false,"followers_count":158,"friends_count":68,"listed_count":140,"favourites_count":11,"statuses_count":37545,"created_at":"Sat Dec 05 11:18:00 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"FFCC4D","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/673099606348070912\/xNxp4zOt_normal.jpg","profile_ima

Solution

If you don't need to know the exact number of loaded tweets in your main loop (and can omit the print call there), you can use a generator instead of a list. That way, the program loads and processes each line of the file just in time, instead of allocating a huge block of memory to store a list of all items.

```
import json

def load_tweets_data():
    with open(tweets_data_path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                try:
                    yield json.loads(line)
                except ValueError as e:  # malformed JSON line
                    print(e)
```


Note that I also eliminated your approach of reading only every other line, which is far too inflexible. I replaced it with a simple test for whether the line contains any non-whitespace characters.
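As a side note, the memory saving is easy to observe directly. This small standalone demo (not part of the original answer; the variable names are mine) compares a fully materialized list with an equivalent generator:

```python
import sys

# A list comprehension materializes every item up front...
squares_list = [n * n for n in range(1_000_000)]

# ...while a generator expression produces items only on demand.
squares_gen = (n * n for n in range(1_000_000))

# The generator object stays tiny no matter how many items it will yield.
print(sys.getsizeof(squares_list) > 100 * sys.getsizeof(squares_gen))  # True
```

The same principle applies to the tweet file: the generator holds one parsed line at a time rather than all 50,000+ tweets at once.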

You then have to modify your Playing with Loaded Data: # Populate/map DataFrame with data part as well, because each generator item can be consumed only once. That means you have to perform all analyses once per item, instead of iterating over all items once per analysis. It could look like this:

```
# Populate/map DataFrame with data
texts, langs, countries = [], [], []
for tweet in load_tweets_data():
    texts.append(tweet.get('text'))
    langs.append(tweet.get('lang'))
    place = tweet.get('place')
    countries.append(place.get('country') if place else None)

tweets = pd.DataFrame({'text': texts, 'lang': langs, 'country': countries})
```


As an alternative to the last `countries.append(...)` line inside the loop above, you could also use this (thanks to @oliverpool):

```
try:
    countries.append(tweet['place']['country'])
except (KeyError, TypeError):
    countries.append(None)
```


That's all you need to change to use generators instead of a huge list.

Alternatively, you could have placed the code to populate the DataFrame directly in the loop you use to read the file.
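That combined variant could look roughly like the following sketch. It is self-contained for illustration only: the two sample tweets and the `tweets_sample.json` path are hypothetical stand-ins for the real 50 MB+ file.

```python
import json
import pandas as pd

# Hypothetical stand-in for the real tweet file: one JSON object per line.
sample = [
    {"text": "hello", "lang": "en", "place": {"country": "Antarctica"}},
    {"text": "bonjour", "lang": "fr", "place": None},
]
with open('tweets_sample.json', 'w') as f:
    for obj in sample:
        f.write(json.dumps(obj) + '\n')

# Read the file and populate the DataFrame rows in a single loop.
rows = []
with open('tweets_sample.json') as f:
    for line in f:
        if line.strip():  # skip blank lines
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip malformed JSON lines
            place = tweet.get('place')
            rows.append({
                'text': tweet.get('text'),
                'lang': tweet.get('lang'),
                'country': place.get('country') if place else None,
            })

tweets = pd.DataFrame(rows, columns=['text', 'lang', 'country'])
print(tweets['country'].tolist())  # ['Antarctica', None]
```

This keeps only the current line and the accumulated rows in memory, with no intermediate list of parsed tweets.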

Oh, and please use a single # to start comments instead of ##.

Context

StackExchange Code Review Q#118073, answer score: 3
