Reddit-scraping API bot

Submitted by: @import:stackexchange-codereview
Tags: api, reddit, scraping, bot

Problem

I would like to make my Reddit-scraping code faster, but I don't know how. I am using a deque instead of a list to improve append performance. Otherwise my code just calls the PRAW API and type-checks one attribute (I don't know why this is necessary, but without it I get type errors). Is there a faster way to do this? These functions are surprisingly slow: the mean execution time is high, and it also varies a lot from call to call.

For a given user, I'm trying to retrieve their most recent comments and posts. The code is pasted below, and I have two problems with it. First, it seems to run awfully slowly considering how little information I'm retrieving. Second, the variance in execution time from one function call to the next is surprisingly high. You'll see I have two sleep commands of 1 second each, so the minimum time for get_user_comments_and_posts() to execute is 2 seconds. Sometimes I see this, but sometimes I see 14 seconds! And when I print the output for a given function call, there doesn't seem to be anything special about the output for a short vs. a long execution time.

```
COMMENT_LIMIT = 2

@timeit
def get_user_comments_and_posts(self):
    time.sleep(TIME_SLEEP)
    self.get_comments()
    time.sleep(TIME_SLEEP)
    self.get_submissions()

@timeit
def get_comments(self):
    comments_list = deque([])
    comments = self.get_c_praw_call()
    for c in comments:
        if isinstance(c.subreddit, basestring):
            subreddit_name = c.subreddit
        else:
            subreddit_name = c.subreddit.display_name
        new_comment = comment(c.created_utc, c.ups, subreddit_name)
        comments_list.append(new_comment)
    self.comments = comments_list
    for commentD in self.comments:
        print commentD

@timeit
def get_c_praw_call(self):
    return self.praw_object.get_redditor(self.username).get_comments(limit=COMMENT_LIMIT)

@timeit
def get_submissions(self):
    submissions_li
```

Solution

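The issue probably isn't with your code. An operation like this is almost certainly I/O-bound, not CPU-bound. This means that the bulk of the time is spent waiting for the network to respond, not waiting for your CPU to process something locally. Network latency can differ a lot from one request to the next, which would also explain why your timings vary so much between calls.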

You can speed up a program like this by multithreading it. This strategy lets you keep several requests to Reddit open at the same time. I would expect the speed to scale almost linearly with the number of threads, up to some maximum (likely bounded by the PRAW API). For simple multithreading, check out the concurrent.futures module: https://docs.python.org/3.4/library/concurrent.futures.html
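As a rough illustration of the idea, here is a minimal Python 3 sketch using `ThreadPoolExecutor`. It does not call PRAW at all: `fetch_user_activity` is a hypothetical stand-in whose `time.sleep` simulates network latency, and the usernames and `max_workers` value are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_user_activity(username):
    """Hypothetical stand-in for a PRAW request; the sleep simulates network latency."""
    time.sleep(0.2)
    return username, ["comment by " + username, "post by " + username]


def fetch_all(usernames, max_workers=4):
    """Fetch activity for several users concurrently and return {username: items}."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs the calls on worker threads and yields
        # the results in the same order as the inputs.
        return dict(executor.map(fetch_user_activity, usernames))


if __name__ == "__main__":
    users = ["alice", "bob", "carol", "dave"]
    start = time.time()
    results = fetch_all(users)
    print("fetched %d users in %.2f s" % (len(results), time.time() - start))
```

Because the simulated requests spend their time waiting rather than computing, the four calls overlap instead of running back to back. With a real API, any rate limit the service enforces would cap the benefit.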

In case you have never dealt with multithreading, beware: if several threads store their results from the PRAW API in one shared object, access to that object has to be synchronized. Make sure each thread acquires a lock on the object before accessing it.
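A minimal sketch of that locking pattern, assuming the results are collected in a dictionary shared by all threads. `ResultStore`, `worker`, and `run` are hypothetical names for illustration, and the worker just generates placeholder strings instead of calling PRAW.

```python
import threading


class ResultStore(object):
    """Hypothetical container shared by all worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def add(self, username, item):
        # Only one thread at a time may modify the shared dictionary.
        with self._lock:
            self._items.setdefault(username, []).append(item)

    def snapshot(self):
        # Return a copy so callers never touch the shared state directly.
        with self._lock:
            return {user: list(items) for user, items in self._items.items()}


def worker(store, username, count):
    """Stand-in for a thread that stores results as they arrive."""
    for i in range(count):
        store.add(username, "item %d" % i)


def run(usernames, count=100):
    store = ResultStore()
    threads = [threading.Thread(target=worker, args=(store, name, count))
               for name in usernames]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.snapshot()
```

Keeping the lock inside the container means the calling code cannot forget to acquire it.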

EDIT: I just realized you are using Python 2, where concurrent.futures is not part of the standard library (a backport named futures is available on PyPI). You should take a look at the Queue module: https://docs.python.org/2/library/queue.html#module-Queue
The threading module will also be helpful: https://docs.python.org/2/library/threading.html
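One way to combine those two modules is a small pool of worker threads fed from a queue. The sketch below is an assumption about how that could look, not PRAW-specific code: `fake_fetch` is a hypothetical placeholder for the real request, and the import fallback is there because the module is named `Queue` in Python 2 and `queue` in Python 3.

```python
import threading
import time

try:
    import queue            # Python 3 name
except ImportError:
    import Queue as queue   # Python 2 name


def fake_fetch(username):
    """Hypothetical stand-in for a PRAW request; the sleep simulates latency."""
    time.sleep(0.1)
    return "activity for " + username


def worker(tasks, results):
    """Take usernames off the task queue until a None sentinel arrives."""
    while True:
        username = tasks.get()
        if username is None:
            break
        results.put((username, fake_fetch(username)))


def fetch_with_workers(usernames, num_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for name in usernames:
        tasks.put(name)
    for _ in threads:
        tasks.put(None)      # one sentinel per worker so each one stops
    for thread in threads:
        thread.join()
    # Every task produced exactly one result, so collect that many.
    return dict(results.get() for _ in usernames)
```

The queue objects handle their own locking, so the workers can pass results back through the second queue without an explicit lock.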

Also, a few other comments. isinstance is generally frowned upon in Python. Python subscribes to a philosophy called duck typing: instead of proceeding only if something is the correct type, we wrap the questionable statement in a try/except block and simply catch the error if there is one.

Lastly, you aren't going to realize any gains here from a deque; it will just consume more memory than a normal list. Appending to the end of a list is already cheap. A deque would be valuable if you were popping comments off the front of the list as well as the back, but in your case you only ever access comments_list to iterate over it (in that for loop).
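The duck-typing suggestion could look roughly like this. `FakeSubreddit` is a hypothetical stand-in for the object PRAW returns, and `subreddit_name` is an illustrative helper rather than part of the original code; the results go into a plain list.

```python
class FakeSubreddit(object):
    """Hypothetical stand-in for a PRAW subreddit object."""

    def __init__(self, display_name):
        self.display_name = display_name


def subreddit_name(subreddit):
    """Return the display name, whether given an object or a plain string."""
    try:
        return subreddit.display_name
    except AttributeError:
        # No display_name attribute: assume the value already is the name.
        return subreddit


# A plain list is enough when items are only appended and then iterated.
names = []
for value in ("python", FakeSubreddit("learnpython")):
    names.append(subreddit_name(value))
print(names)  # ['python', 'learnpython']
```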

Context

StackExchange Code Review Q#93348, answer score: 5
