Reddit bot to check reposts
Problem
I am a moderator of /r/sweepstakes on Reddit, which lets users post their referral links to contests/sweepstakes. One main rule is that a user is not allowed to post their link to a contest if another user has already done so. Checking for reposts is not so simple, since every referral link has a different URL (i.e. contest.com/?ref=Kevin and contest.com/?ref=Steve point at the same contest). I thought a good way to find a repost is to retrieve the title of the webpage (the `<title>` tag) and store it in a database along with some other vital information.
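The title-based dedup idea can be sketched without any network access. This is an illustration, not the bot's actual code: the bot uses BeautifulSoup, while this sketch uses only the stdlib `html.parser` so it runs anywhere, and `get_title` and the sample page are made-up names.

```python
# Sketch: two referral URLs that differ only in their ?ref= parameter usually
# serve a page with the same <title>, so the title can act as the dedup key.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title

# The same contest page, reached via two different referral links:
page = '<html><head><title>Win a Car!</title></head><body>...</body></html>'
assert get_title(page) == 'Win a Car!'
```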
It scans the subreddit every 15 minutes for new posts. For every post it does the following:

- See if we have already looked at the post by searching the DB for the pid (post ID). If we have it, skip and move on to the next post.
- Get the final URL using urllib, since some URLs (i.e. bit.ly links) redirect to another webpage.
- Get the title (the `<title>` tag) of the webpage using BeautifulSoup.
- Search the DB for the title. If the title is in the database, the submitted post is a repost, and we retrieve some information on the original post (permalink, submitter). We add this information to a string that will be sent to the moderators.
- If the submitted post's title does not already exist in the database, it is a unique post and we add it to the database.
- Once all posts have been processed, send the message of all reposts to the moderators for them to manually inspect.
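The per-post decision logic in the steps above can be sketched as pure bookkeeping, independent of Reddit and the network. All names here are hypothetical stand-ins (a set and a dict in place of the SQLite tables), not the bot's actual code:

```python
# Sketch of the per-post decision logic described in the steps above.
# seen_pids / titles stand in for the SQLite tables; names are made up.
def process_post(pid, title, seen_pids, titles):
    """Return 'skip', 'repost', or 'new' for one submission."""
    if pid in seen_pids:          # step 1: we already handled this post
        return 'skip'
    seen_pids.add(pid)
    if title in titles:           # step 4: same page title => repost
        return 'repost'           # original post info lives in titles[title]
    titles[title] = pid           # step 5: unique post, record it
    return 'new'

seen, titles = set(), {}
assert process_post('p1', 'Win a Car!', seen, titles) == 'new'
assert process_post('p2', 'Win a Car!', seen, titles) == 'repost'
assert process_post('p1', 'Win a Car!', seen, titles) == 'skip'
```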
I ran into a lot of issues, and they predominantly had to do with finding the final URL of the post and finding the page's title. To keep things simple, I may end up removing the function that resolves a URL's final destination, since it isn't very important. I also ran into ASCII/Unicode issues and kept getting UnicodeEncodeError/UnicodeDecodeError exceptions.
Suggestions on how to improve the code would be appreciated.
```
import traceback
import praw  # simple interface to the reddit API, also handles rate limiting of requests
import time
import sqlite3
```
Solution
Use modern versions of things

The most obvious thing here is: use Python 3. This will help massively with your Unicode problems, because Python 3 maintains a stricter separation between things that Python 2 conflated. In some cases, your errors will just be artifacts of Python 2's way of doing things, and will simply go away. In others, you'll get errors that give you a much better idea of what the problem is.

In general, the only reason to use Python 2 for new code these days is if you have to use one of an increasingly small number of libraries that hasn't been ported. You use three non-stdlib packages: requests and praw both support Python 3, which leaves BeautifulSoup. The fact that you are importing it as BeautifulSoup implies you're using bs3, which only works on Python 2.x and hasn't had an update since 2012. Upgrade to BeautifulSoup 4: it is actively maintained (at the time of this post, the last release was just shy of 4 weeks ago) and supports all current versions of Python.

Use requests

You import requests, but you also import urllib and urllib2. Of those, the easiest to use for what you want is requests, and the only one you actually use is urllib2.

General Pythonisms

```
e.code == 403 or e.code == 429
```

can be shortened to:

```
e.code in (403, 429)
```

In general, Python style prefers iteration to recursion. So, instead of retrying like this:

```
def resolve_redirects(url, tries):
    tries -= 1
    # Several lines of code unrelated to tries
    ...
    except urllib2.HTTPError, e:
        time.sleep(5)
        resolve_redirects(url, tries)
```

do this (also converted to use requests, and string formatting instead of concatenation):

```
def resolve_redirects(url, tries):
    for _ in range(tries):
        response = requests.get(url, headers=...)
        if response.status_code in (403, 429):
            print('HTTP Error: {}'.format(response.status_code))
            continue
        elif response.status_code != 200:
            # Generic error
            response.raise_for_status()
        else:
            return response
```

I've also removed your exception handling for generic errors in here, because I don't think this is the right place to handle them. Instead, let them bubble up to the main line and deal with them there.

This has a flow-on implication down here:

```
try:
    post_url = resolve_redirects(url, 3)
    effective_url = post_url.geturl()
except AttributeError:
    print "AttributeError: Post URL/Effective URL"
    continue
```

That AttributeError was almost certainly coming up because of your previous exception handling: you were printing the error and then ignoring it and continuing on, which made resolve_redirects return None by falling off the end. So now you can change this guard to except URLError: so it gives you a better idea of what's going on.

You should probably also rename post_url, since it's not really a URL any more (it's a Response, so for lack of a better name, let's call it post_response). This is the right place to handle that error, but instead of calling print here, consider using the logging module.

Above this:

```
submissions = list(subreddit.get_new(limit=MAXPOSTS))
```

There's no need to turn that result into a list. Anything you can pass to list you can also iterate over directly. Only bother turning it into a list if you need to iterate over it more than once (you don't).

```
url = post.url
domain = post.domain
```

Just use post.url and post.domain directly.

```
try:
    post_title = get_title(post_url).encode('utf-8').strip()
except UnicodeDecodeError:
    post_title = unicode(get_title(post_url).strip(), "utf-8")
except UnicodeEncodeError:
    print "UnicodeError: " + post.title
    continue
```

That is a lovely abomination. It looks like you're trying to handle the page being in an arbitrary encoding, and standardise it to UTF-8? If that's the case, do this:

```
title = get_title(post_response.text).strip().encode('utf8')
```

In Python 3, encode will not raise a UnicodeDecodeError, because someone realised that that was a little odd. Encoding to UTF-8 should not raise a UnicodeEncodeError, because there are no Unicode codepoints that UTF-8 can't encode.

If you're happy with the raw bytes in whatever encoding they happen to be, do this:

```
title = get_title(post_response.content).strip()
```

For reposts, you gradually build a string message to send to someone. It would be better (and probably a little faster) to build a list of the pertinent information:

```
reposts = []
for post in posts:
    ...
    if row:
        # There's a repost
        reposts.append((tuple of the things you currently make a string for))
    ...
if reposts:
    msg = 'Repost: [{}]({}) by /u/{}. Original: [Here]({}) by /u/{}.'
    msg = '\n\n'.join(msg.format(*post) for post in reposts)
    r.send_message(...)
```

Sqlite

Sqlite row objects can be accessed by column name: rename your row variable to repost, and you can refer to each column by name instead of by numeric index.
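A minimal sketch of that name-based access, using the stdlib sqlite3 module. The table and column names here are illustrative, not the bot's actual schema; note that name-based access requires setting the connection's row_factory to sqlite3.Row first.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # makes fetched rows addressable by column name
conn.execute('CREATE TABLE posts (pid TEXT, title TEXT, permalink TEXT)')
conn.execute("INSERT INTO posts VALUES ('p1', 'Win a Car!', '/r/sweepstakes/p1')")

repost = conn.execute('SELECT * FROM posts WHERE title = ?',
                      ('Win a Car!',)).fetchone()
print(repost['permalink'])   # /r/sweepstakes/p1 -- no need to remember indices
```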
Context
StackExchange Code Review Q#98552, answer score: 7