Reddit bot to check reposts
Problem
I am a moderator of /r/sweepstakes on Reddit, which lets users post their referral links to contests/sweepstakes. One main rule is that a user is not allowed to post their link to a contest if another user has already done so. Checking for reposts is not so simple, since every referral link has a different URL (i.e. contest.com/?ref=Kevin and contest.com/?ref=Steve point at the same contest). I thought a good way to find a repost is to retrieve the title of the webpage (the `<title>` tag) and store it in a database along with some other vital information.
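The title-based dedup idea can be sketched without any network access. This is an illustration, not the bot's actual code: the bot uses BeautifulSoup, while this sketch uses only the stdlib `html.parser` so it runs anywhere, and `get_title` and the sample page are made-up names.

```python
# Sketch: two referral URLs that differ only in their ?ref= parameter usually
# serve a page with the same <title>, so the title can act as the dedup key.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title

# The same contest page, reached via two different referral links:
page = '<html><head><title>Win a Car!</title></head><body>...</body></html>'
assert get_title(page) == 'Win a Car!'
```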
It scans the subreddit every 15 minutes for new posts. For every post it does the following:

- See if we have already looked at the post by searching the DB for the pid (post ID). If we have it, skip and move on to the next post.
- Get the final URL using urllib, since some URLs (i.e. bit.ly links) redirect to another webpage.
- Get the title (the `<title>` tag) of the webpage using BeautifulSoup.
- Search the DB for the title. If the title is in the database, the submitted post is a repost, and we retrieve some information on the original post (permalink, submitter). We add this information to a string that will be sent to the moderators.
- If the submitted post's title does not already exist in the database, it is a unique post and we add it to the database.
- Once all posts have been processed, send the message of all reposts to the moderators for them to manually inspect.
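The per-post decision logic in the steps above can be sketched as pure bookkeeping, independent of Reddit and the network. All names here are hypothetical stand-ins (a set and a dict in place of the SQLite tables), not the bot's actual code:

```python
# Sketch of the per-post decision logic described in the steps above.
# seen_pids / titles stand in for the SQLite tables; names are made up.
def process_post(pid, title, seen_pids, titles):
    """Return 'skip', 'repost', or 'new' for one submission."""
    if pid in seen_pids:          # step 1: we already handled this post
        return 'skip'
    seen_pids.add(pid)
    if title in titles:           # step 4: same page title => repost
        return 'repost'           # original post info lives in titles[title]
    titles[title] = pid           # step 5: unique post, record it
    return 'new'

seen, titles = set(), {}
assert process_post('p1', 'Win a Car!', seen, titles) == 'new'
assert process_post('p2', 'Win a Car!', seen, titles) == 'repost'
assert process_post('p1', 'Win a Car!', seen, titles) == 'skip'
```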
I ran into a lot of issues, and they predominantly had to do with finding the final URL of the post and finding the page's title. To keep things simple, I may end up removing the function that resolves a URL's final destination, since it isn't very important. I also ran into ASCII/Unicode issues and kept getting UnicodeEncodeError/UnicodeDecodeError exceptions.
Suggestions on how to improve the code would be appreciated.
```
import traceback
import praw  # simple interface to the reddit API, also handles rate limiting of requests
import time
import sqlite3
```
Solution
Use modern versions of things

The most obvious thing here is: use Python 3. This will help massively with your Unicode problems, because Python 3 maintains a stricter separation between things that Python 2 conflated. In some cases, your errors will just be artifacts of Python 2's way of doing things, and will simply go away. In others, you'll get errors that give you a much better idea of what the problem is.

In general, the only reason to use Python 2 for new code these days is if you have to use one of an increasingly small number of libraries that hasn't been ported. You use three non-stdlib packages: requests and praw both support Python 3, which leaves BeautifulSoup. The fact that you are importing it as BeautifulSoup implies you're using bs3, which only works on Python 2.x and hasn't had an update since 2012. Upgrade to BeautifulSoup 4: it is actively maintained (at the time of this post, the last release was just shy of 4 weeks ago) and supports all current versions of Python.

Use requests

You import requests, but you also import urllib and urllib2. Of those, the easiest to use for what you want is requests, and the only one you actually use is urllib2.

General Pythonisms

```
e.code == 403 or e.code == 429
```

can be shortened to:

```
e.code in (403, 429)
```

In general, Python style prefers iteration to recursion. So, instead of retrying like this:

```
def resolve_redirects(url, tries):
    tries -= 1
    # Several lines of code unrelated to tries
    ...
    except urllib2.HTTPError, e:
        time.sleep(5)
        resolve_redirects(url, tries)
```

do this (also converted to use requests, and string formatting instead of concatenation):

```
def resolve_redirects(url, tries):
    for _ in range(tries):
        response = requests.get(url, headers=...)
        if response.status_code in (403, 429):
            print('HTTP Error: {}'.format(response.status_code))
            continue
        elif response.status_code != 200:
            # Generic error
            response.raise_for_status()
        else:
            return response
```

I've also removed your exception handling for generic errors in here, because I don't think this is the right place to handle them. Instead, let them bubble up to the main line and deal with them there.

This has a flow-on implication down here:

```
try:
    post_url = resolve_redirects(url, 3)
    effective_url = post_url.geturl()
except AttributeError:
    print "AttributeError: Post URL/Effective URL"
    continue
```

That AttributeError was almost certainly coming up because of your previous exception handling: you were printing the error and then ignoring it and continuing on, which made resolve_redirects return None by falling off the end. So now you can change this guard to except URLError: so it gives you a better idea of what's going on.

You should probably also rename post_url, since it's not really a URL any more (it's a Response, so for lack of a better name, let's call it post_response). This is the right place to handle that error, but instead of calling print here, consider using the logging module.

Above this:

```
submissions = list(subreddit.get_new(limit=MAXPOSTS))
```

There's no need to turn that result into a list. Anything you can pass to list you can also iterate over directly. Only bother turning it into a list if you need to iterate over it more than once (you don't).

```
url = post.url
domain = post.domain
```

Just use post.url and post.domain directly.

```
try:
    post_title = get_title(post_url).encode('utf-8').strip()
except UnicodeDecodeError:
    post_title = unicode(get_title(post_url).strip(), "utf-8")
except UnicodeEncodeError:
    print "UnicodeError: " + post.title
    continue
```

That is a lovely abomination. It looks like you're trying to handle the page being in an arbitrary encoding, and standardise it to UTF-8? If that's the case, do this:

```
title = get_title(post_response.text).strip().encode('utf8')
```

In Python 3, encode will not raise a UnicodeDecodeError, because someone realised that that was a little odd. Encoding to UTF-8 should not raise a UnicodeEncodeError, because there are no Unicode codepoints that UTF-8 can't encode.

If you're happy with the raw bytes in whatever encoding they happen to be, do this:

```
title = get_title(post_response.content).strip()
```

For reposts, you gradually build a string message to send to someone. It would be better (and probably a little faster) to build a list of the pertinent information:

```
reposts = []
for post in posts:
    ...
    if row:
        # There's a repost
        reposts.append((tuple of the things you currently make a string for))
    ...
if reposts:
    msg = 'Repost: [{}]({}) by /u/{}. Original: [Here]({}) by /u/{}.'
    msg = '\n\n'.join(msg.format(*post) for post in reposts)
    r.send_message(...)
```

Sqlite

Sqlite row objects can be accessed by column name: rename your row variable to repost, and you can refer to each column by name instead of by numeric index.
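A minimal sketch of that name-based access, using the stdlib sqlite3 module. The table and column names here are illustrative, not the bot's actual schema; note that name-based access requires setting the connection's row_factory to sqlite3.Row first.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # makes fetched rows addressable by column name
conn.execute('CREATE TABLE posts (pid TEXT, title TEXT, permalink TEXT)')
conn.execute("INSERT INTO posts VALUES ('p1', 'Win a Car!', '/r/sweepstakes/p1')")

repost = conn.execute('SELECT * FROM posts WHERE title = ?',
                      ('Win a Car!',)).fetchone()
print(repost['permalink'])   # /r/sweepstakes/p1 -- no need to remember indices
```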
Context
StackExchange Code Review Q#98552, answer score: 7