HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

40 million tweets from 200k users

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
million200kuserstweetsfrom

Problem

mysql> select count(username) from users where no_tweets  > 50;
+-----------------+
| count(username) |
+-----------------+
|          282366 |
+-----------------+
1 row in set (5.41 sec)

mysql> select sum(no_tweets) from users where no_tweets > 50;
+----------------+
| sum(no_tweets) |
+----------------+
|       38569853 |
+----------------+
1 row in set (1.75 sec)


I have that many users, who have collectively tweeted that many tweets. My aim is to store them in a file and then find out what a user generally tweets about (for starters, my aim is to run vanilla LDA and see if it works well on short documents), but the problem is that I ran the python code like an hour back and it has still not finished even 2% of users. I have posted the python code below.

'''
The plan is : Find out all users  from the users database who have more than 50 tweets
Create a file for them in the directory passed as an argument with the same name as the username and write all the tweets in them 
'''
def fifty(cursor):
    ''' Find all users and return , all of them having more than 50 tweets'''
    cursor.execute("select username from users where no_tweets>50")
    return cursor.fetchall()

def connect():
    ''' Cursor for mySQLdb'''
    conn = MySQLdb.connect(host="localhost",user="rohit",passwd="passwd",db="Twitter")
    return conn.cursor()

def tweets(cursor,name):
    ''' All tweets by a given user'''
    cursor.execute("select tweets from tweets where username='"+name+"'")
    return cursor.fetchall()
import sys,os,MySQLdb
directory = sys.argv[1] #Directory to write the user files
cursor = connect()
rows = fifty(cursor) #Find all users who have more than 50 tweets

for i in rows:#For all users
    data = open(os.path.join(directory,i[0]),'w') #Open a file same as their name
    allTweets = tweets(cursor,i[0]) #Find all tweets by them
    for j in allTweets:
        data.write(j[0]+"\n") #Write them 
    data.close()


The problem is that the code is running too s

Solution

Yikes, you're running almost 3,000,000 individual queries. If you could do 4 a second (and you probably cannot) that is still a day!

How about just

select username, tweets from tweets order by username


Then of course you have to do some fancy footwork to switch from one file to the next when the user changes.

It should only take a few hours to run.

Code Snippets

select username, tweets from tweets order by username

Context

StackExchange Code Review Q#7645, answer score: 8

Revisions (0)

No revisions yet.