HiveBrain v1.2.0

Object-oriented web scraping with Python

Submitted by: @import:stackexchange-codereview

Problem

I usually write function-only Python programs, but I have decided on an OOP approach (my first) for my current program, a web scraper. Here is the working code, which I am looking to have critiqued:

import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

class Page(object):
    def __init__(self, url):
        self.url = url

    def get_source(self):
        self.source = urllib2.urlopen(self.url).read()
        return self.source

class HubPage(Page):
    def has_vacancies(self):
        return not(any(text for text in NO_VACANCIES if text in self.source.lower()))

urls = []
with open('25.csv', 'rb') as spreadsheet:
    reader = csv.reader(spreadsheet)
    for row in reader:
        urls.append(row[0].strip())

for url in urls:
    page = HubPage(url)
    source = page.get_source()
    if page.has_vacancies():
        print 'Has vacancies'


Some context: HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, which will need some methods used only to extract data from individual job pages.

Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at what it does, namely fetching a page's source. Yet I notice that in my design, processing each URL takes a few orders of magnitude longer than what I typically observe.

  • Is it the fact that class instantiations are involved (unnecessarily, perhaps)?

  • Might the fact that HubPage is inherited be the cause?

  • Is the call to any() expensive?

Solution

Use top-level functions

Reading URLs from a file and checking each of them for vacancies is done at the top level of the file. You should enclose this code in functions and put the remaining top-level code under if __name__ == '__main__' to:

  • separate the intents;

  • reuse the code more easily;

  • avoid running code as a side effect of importing the module.



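A minimal sketch of that structure (Python 3 syntax; the function name and the command-line handling are illustrative, not part of the original code):

```python
import sys

def check_vacancies(filename):
    # placeholder body for illustration; the real logic lives in functions
    print('Checking', filename)

if __name__ == '__main__':
    # runs only when executed as a script, never when imported
    if len(sys.argv) > 1:
        check_vacancies(sys.argv[1])
```

With this layout, another module can `import` and reuse check_vacancies without triggering any file reads or network calls.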
Use filter

You are only interested in knowing which URLs have vacancies. The easiest way to do that is:

interesting_urls = filter(has_vacancies, urls)


with the proper definition of has_vacancies.

You also only print 'Has vacancies' when a given URL does, leaving the user to figure out which one it was. You might want to print the URL instead to make your script more usable:

print 'Following URLs have vacancies'
for url in filter(has_vacancies, urls):
    print url


Use generators

There is no need to store all the URLs you’ll be checking up-front. Yield them from a generator since only one is processed at a time:

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()


It also lets you parameterize the function with the name of the file to read, in case you have several files.

Do not force OOP when not required

You are not interested in storing state for each URL. You just want a simple question answered: "does this URL point to a page that advertises vacancies?" Thus a simple function will do for each URL:

def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    return not(any(text for text in NO_VACANCIES if text in source.lower()))


See what I did there? This is the exact function we need for the filter I introduced earlier.

Remove extraneous computation

Calling lower() on a string has a cost. Calling it once for every piece of text you are looking for in that string is even more costly. You should call it only once, after fetching the source of the URL.

Of lower interest, the original generator expression yields each matching piece of text and makes any() re-test its truthiness, after the test text in source has already produced the boolean you need. You also store source, self.url, self.source (and, to a lesser extent, page) without really needing them. But I've shown ways to avoid that.
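To make both points concrete, here are the two styles side by side on a sample string (they return the same result; only the work done differs):

```python
NO_VACANCIES = ['no vacancies', 'not hiring']

source = 'Sorry, NO VACANCIES at this time.'

# original style: source.lower() runs once per phrase, and any()
# re-tests the truthiness of the yielded phrase
slow = not any(text for text in NO_VACANCIES if text in source.lower())

# improved style: lowercase once, let `in` produce the boolean directly
lowered = source.lower()
fast = not any(text in lowered for text in NO_VACANCIES)
```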

Proposed improvements

import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    return not any(text in source for text in NO_VACANCIES)

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

def check_vacancies(filename):
    print 'Following URLs have vacancies'
    for url in filter(has_vacancies, read_urls(filename)):
        print url

if __name__ == '__main__':
    check_vacancies('25.csv')


Going further

You might be interested in trying the re module to perform the text matches; there may be a performance improvement.
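For instance, a single compiled alternation can test all phrases in one pass (a sketch; the helper name is illustrative and it takes the page source directly rather than a URL; whether it beats the any() loop depends on the data, so measure):

```python
import re

NO_VACANCIES = ['no vacancies', 'not hiring']

# one compiled pattern matching any of the phrases, case-insensitively,
# which also replaces the explicit lower() call
PATTERN = re.compile('|'.join(re.escape(text) for text in NO_VACANCIES),
                     re.IGNORECASE)

def source_has_vacancies(source):
    # hypothetical helper for illustration: no match means vacancies
    return PATTERN.search(source) is None
```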

Alternatively, the multiprocessing module's Pool provides a map_async method that runs the mapping in parallel across a pool of workers. You might want to change has_vacancies a bit and use that instead of filter to increase overall performance.
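A sketch of the parallel variant (the worker here is a stand-in that takes the page source directly instead of fetching a URL, so the example needs no network; Python 3 shown):

```python
from multiprocessing import Pool

NO_VACANCIES = ['no vacancies', 'not hiring']

def page_has_vacancies(source):
    # stand-in for the real URL-fetching function, for illustration
    return not any(text in source.lower() for text in NO_VACANCIES)

def filter_vacancies(sources):
    pool = Pool(2)
    # map_async returns immediately; get() blocks until workers finish
    flags = pool.map_async(page_has_vacancies, sources).get()
    pool.close()
    pool.join()
    return [s for s, ok in zip(sources, flags) if ok]
```

Note that the worker function must be defined at module top level so it can be pickled and sent to the pool's processes.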

You may also be interested in handling exceptions from urllib2, so that a sudden network failure or an invalid URL won't crash everything. Something simple will do:

def has_vacancies(url):
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        print 'Error processing', url
        return False
    else:
        source = source.lower()
    return not any(text in source for text in NO_VACANCIES)


Context

StackExchange Code Review Q#118594, answer score: 2
