HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Scraping my CS teacher's website, then emailing me when the site is updated

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
websitethescrapingemailingthenwhensiteupdatedteacher

Problem

I've been working on creating an individual final project for my python CS class that checks my teacher's website on a daily basis and determines if he's changed any of the web pages on his website since the last time the program ran or not.

I would really love some suggestions for improving my code, especially since it works now! I have added some functionality to it, so that it runs via a cron job on a cloud server and sends out an email when a page changes!

```
import requests ## downloads the html
from bs4 import BeautifulSoup ## parses the html
import filecmp ## compares files
import os, sys ## used for renaming files
import difflib ## used to see differences in link files
import smtplib ## used for sending email
from email.mime.multipart import MIMEMultipart ## used for areas of email such as subject, toaddr, fromaddr, etc.
from email.mime.text import MIMEText ## used for areas of email such as body, etc.

root_url = "https://sites.google.com"
index_url = root_url + "/site/csc110winter2015/home"

def get_site_links():
'''
Gets links from the website's list items' HTML elements
'''
response = requests.get(index_url)
soup = BeautifulSoup(response.text)
links = [a.attrs.get('href') for a in soup.select('li.topLevel a[href^=/site/csc110winter2015/]')]
return links

def try_read_links_file():
'''
Tries to read the links.txt file; if links.txt is found, then rename links.txt to previous_links.txt
'''
try:
os.rename("links.txt", "previous_links.txt")
write_links_file()
except (OSError, IOError):
print("No links.txt file exists; creating one now.")
write_links_file()
try_read_links_file()

def write_links_file():
'''
Writes the links.txt file from the website's links
'''
links = get_site_links()
with open("links.txt", mode='wt', encoding='utf-8') as out_file:
out_file.write('\n'.join(links))

def check_links():
'''
Checks to see if links ha

Solution

Bugs

The e-mail lists the pages whose contents have changed, but not the pages that were added or removed. Additions and deletions are merely printed to sys.stdout.

The files where the page contents are saved have filenames of the form previous_.site.csc110winter2015.somethingsomething␤.txt. The newline character preceding .txt is weird.

If the links are merely reordered, you'll see it reported as a removal and addition.

If try_read_links_file() is unable to create links.txt (due to directory permissions, for example), it will recurse infinitely.

Inefficiencies

You call check_pages() up to three times:

  • Once in main() for no apparent reason



  • Once in send_mail() in an apparent attempt to check whether any changes were detected. Bizarrely, this check is done after the SMTP handshake — why bother connecting to the SMTP server at all if you have nothing to send?



  • If you do decide to send mail, then you call check_pages() once more to incorporate the list of modified pages in the message body.



General critique

The technique you have employed is very file-centric. The five functions that you call from main() communicate with each other not by passing parameters and returning values, nor via global variables, but through the filesystem! This style of programming drastically complicates the code. Every function ends up being concerned with reading files, stripping newlines (if you remember to do so), mangling paths, and saving the results.

try_read_pages_files() is misleadingly named, as it actually also writes the files. Similarly, try_read_links_file() has side-effects that I wouldn't expect.

If you just want to detect whether content has changed, you need not save the entire website's contents. Storing a cryptographic checksum of each page will suffice. With that insight, you can summarize the entire website in a single file, one line per page.

It would be nicer to pass the whole initial URL to the program, rather than breaking it up into a root_url and an index_url. Also, appending the href values to the root_url makes a nasty assumption that all of the hrefs are absolute URLs. Use urllib.parse.urljoin() to resolve URLs instead.

In send_mail(), first compose the message, then send it. Avoid interleaving the two operations. You don't need multi-part MIME if all you are sending is a plain-text message.

In the suggested solution below, look at main() to see how functions should interact with each other.

```
from base64 import b64encode, b64decode
from bs4 import BeautifulSoup
from email.mime.text import MIMEText
from hashlib import sha256
from smtplib import SMTP
from urllib.parse import urljoin
from urllib.request import urlopen

def summarize_site(index_url):
'''
Return a dict that maps the URL to the SHA-256 sum of its page contents
for each link in the index_url.
'''
summary = {}
with urlopen(index_url) as index_req:
soup = BeautifulSoup(index_req.read())
links = [urljoin(index_url, a.attrs.get('href'))
for a in soup.select('li.topLevel a[href^=/site/csc110winter2015/]')]
for page in links:
# Ignore the sitemap page
if page == '/site/csc110winter2015/system/app/pages/sitemap/hierarchy':
continue
with urlopen(page) as page_req:
fingerprint = sha256()
soup = BeautifulSoup(page_req.read())
for div in soup.find_all('div', class_='sites-attachments-row'):
fingerprint.update(div.encode())
summary[page] = fingerprint.digest()
return summary

def save_site_summary(filename, summary):
with open(filename, 'wt', encoding='utf-8') as f:
for path, fingerprint in summary.items():
f.write("{} {}\n".format(b64encode(fingerprint).decode(), path))

def load_site_summary(filename):
summary = {}
with open(filename, 'rt', encoding='utf-8') as f:
for line in f:
fingerprint, path = line.rstrip().split(' ', 1)
summary[path] = b64decode(fingerprint)
return summary

def diff(old, new):
return {
'added': new.keys() - old.keys(),
'removed': old.keys() - new.keys(),
'modified': [page for page in set(new.keys()).intersection(old.keys())
if old[page] != new[page]],
}

def describe_diff(diff):
desc = []
for change in ('added', 'removed', 'modified'):
if not diff[change]:
continue
desc.append('The following page(s) have been {}:\n{}'.format(
change,
'\n'.join(' ' + path for path in sorted(diff[change]))
))
return '\n\n'.join(desc)

def send_mail(body):
## Compose the email
fromaddr = "Sending Email"
toaddr = "Receiving Email"
msg = MIMEText(body, 'plain')
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "Incoming CSC110 website changes!"

Code Snippets

from base64 import b64encode, b64decode
from bs4 import BeautifulSoup
from email.mime.text import MIMEText
from hashlib import sha256
from smtplib import SMTP
from urllib.parse import urljoin
from urllib.request import urlopen

def summarize_site(index_url):
    '''
    Return a dict that maps the URL to the SHA-256 sum of its page contents
    for each link in the index_url.
    '''
    summary = {}
    with urlopen(index_url) as index_req:
        soup = BeautifulSoup(index_req.read())
        links = [urljoin(index_url, a.attrs.get('href'))
                 for a in soup.select('li.topLevel a[href^=/site/csc110winter2015/]')]
        for page in links:
            # Ignore the sitemap page
            if page == '/site/csc110winter2015/system/app/pages/sitemap/hierarchy':
                continue    
            with urlopen(page) as page_req:
                fingerprint = sha256()
                soup = BeautifulSoup(page_req.read())
                for div in soup.find_all('div', class_='sites-attachments-row'):
                    fingerprint.update(div.encode())
                summary[page] = fingerprint.digest()
    return summary

def save_site_summary(filename, summary):
    with open(filename, 'wt', encoding='utf-8') as f:
        for path, fingerprint in summary.items():
            f.write("{} {}\n".format(b64encode(fingerprint).decode(), path))

def load_site_summary(filename):
    summary = {}
    with open(filename, 'rt', encoding='utf-8') as f:
        for line in f:
            fingerprint, path = line.rstrip().split(' ', 1)
            summary[path] = b64decode(fingerprint)
    return summary

def diff(old, new):
    return {
        'added': new.keys() - old.keys(),
        'removed': old.keys() - new.keys(),
        'modified': [page for page in set(new.keys()).intersection(old.keys())
                     if old[page] != new[page]],
    }

def describe_diff(diff):
    desc = []
    for change in ('added', 'removed', 'modified'):
        if not diff[change]:
            continue
        desc.append('The following page(s) have been {}:\n{}'.format(
            change,
            '\n'.join(' ' + path for path in sorted(diff[change]))
        ))
    return '\n\n'.join(desc)

def send_mail(body):
    ## Compose the email
    fromaddr = "Sending Email"
    toaddr = "Receiving Email"
    msg = MIMEText(body, 'plain')
    msg['From'] = fromaddr
    msg['To'] = toaddr
    msg['Subject'] = "Incoming CSC110 website changes!"

    ## Send it
    server = SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login("Sending Email", "Password")
    server.sendmail(fromaddr, toaddr, msg.as_string())
    server.quit()

def main(index_url, filename):
    summary = summarize_site(index_url)
    try:
        prev_summary = load_site_summary(filename)    
        if prev_summary:
            diff_description = describe_diff(diff(prev_summary, summary))
            if diff_description:
                

Context

StackExchange Code Review Q#84065, answer score: 5

Revisions (0)

No revisions yet.