patternpythonMinor
40 million tweets from 200k users
Viewed 0 times
million200kuserstweetsfrom
Problem
mysql> select count(username) from users where no_tweets > 50;
+-----------------+
| count(username) |
+-----------------+
| 282366 |
+-----------------+
1 row in set (5.41 sec)
mysql> select sum(no_tweets) from users where no_tweets > 50;
+----------------+
| sum(no_tweets) |
+----------------+
| 38569853 |
+----------------+
1 row in set (1.75 sec)I have that many users, who have collectively tweeted that many tweets. My aim is to store them in a file and then find out what a user generally tweets about (for starters, my aim is to run vanilla LDA and see if it works well on short documents), but the problem is that I ran the python code like an hour back and it has still not finished even 2% of users. I have posted the python code below.
'''
The plan is : Find out all users from the users database who have more than 50 tweets
Create a file for them in the directory passed as an argument with the same name as the username and write all the tweets in them
'''
def fifty(cursor):
''' Find all users and return , all of them having more than 50 tweets'''
cursor.execute("select username from users where no_tweets>50")
return cursor.fetchall()
def connect():
''' Cursor for mySQLdb'''
conn = MySQLdb.connect(host="localhost",user="rohit",passwd="passwd",db="Twitter")
return conn.cursor()
def tweets(cursor,name):
''' All tweets by a given user'''
cursor.execute("select tweets from tweets where username='"+name+"'")
return cursor.fetchall()
import sys,os,MySQLdb
directory = sys.argv[1] #Directory to write the user files
cursor = connect()
rows = fifty(cursor) #Find all users who have more than 50 tweets
for i in rows:#For all users
data = open(os.path.join(directory,i[0]),'w') #Open a file same as their name
allTweets = tweets(cursor,i[0]) #Find all tweets by them
for j in allTweets:
data.write(j[0]+"\n") #Write them
data.close()The problem is that the code is running too s
Solution
Yikes, you're running almost 3,000,000 individual queries. If you could do 4 a second (and you probably cannot) that is still a day!
How about just
Then of course you have to do some fancy footwork to switch from one file to the next when the user changes.
It should only take a few hours to run.
How about just
select username, tweets from tweets order by usernameThen of course you have to do some fancy footwork to switch from one file to the next when the user changes.
It should only take a few hours to run.
Code Snippets
select username, tweets from tweets order by usernameContext
StackExchange Code Review Q#7645, answer score: 8
Revisions (0)
No revisions yet.