Object-oriented web scraping with Python
Problem
I usually write function-only Python programs, but have decided on an OOP approach (my first) for my current program, a web scraper. Here is the working code, which I am looking to have critiqued:
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

class Page(object):
    def __init__(self, url):
        self.url = url

    def get_source(self):
        self.source = urllib2.urlopen(url).read()
        return self.source

class HubPage(Page):
    def has_vacancies(self):
        return not(any(text for text in NO_VACANCIES if text in self.source.lower()))

urls = []
with open('25.csv', 'rb') as spreadsheet:
    reader = csv.reader(spreadsheet)
    for row in reader:
        urls.append(row[0].strip())

for url in urls:
    page = HubPage(url)
    source = page.get_source()
    if page.has_vacancies():
        print 'Has vacancies'

Some context:
HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, with some methods that will be used only to extract data from individual job pages.

Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at doing what it does, namely fetching a page's source. Yet I notice that in my design, processing each URL takes a few orders of magnitude longer than what I typically observe.

- Is it the fact that class instantiations are involved (unnecessarily, perhaps)?
- Might the fact that HubPage is inherited be the cause?
- Is the call to any() expensive?
Solution
Use top-level functions
Reading URLs from a file and checking each of them for vacancies is done at the top level of the file. You should enclose this code in functions and put the remaining top-level code under if __name__ == '__main__' to:

- separate the intents;
- reuse the code more easily;
- not have random code execution when importing the module.
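As a minimal sketch of that layout (in Python 3 syntax; the answer's code is Python 2): all logic lives in functions, and the only top-level statement is the guard, so importing the module triggers nothing.

```python
def main():
    # all script logic lives here instead of at module top level
    print('checking vacancies...')

if __name__ == '__main__':
    # runs only when the file is executed as a script, not when imported
    main()
```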
Use filter

You are only interested in knowing which URLs have vacancies. The easiest way to do that is:

interesting_urls = filter(has_vacancies, urls)

with the proper definition of has_vacancies.

You also only print 'Has vacancies' when a given URL has some, letting the user try to figure out which one actually does. You might want to print the URL instead to make your script more usable:

print 'Following URLs have vacancies'
for url in filter(has_vacancies, urls):
    print url

Use generators
There is no need to store all the URLs you'll be checking up front. Yield them from a generator, since only one is processed at a time:

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

It also lets you parametrize the name of the file to read, in case you have several of them.
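To make the laziness concrete, here is a small sketch (Python 3 syntax, reading from an in-memory buffer standing in for the real spreadsheet): each row is parsed only when the consumer asks for the next URL.

```python
import csv
import io

def read_urls(csvfile):
    # yield the first column of each row, one row at a time
    reader = csv.reader(csvfile)
    for row in reader:
        yield row[0].strip()

# in-memory stand-in for the real CSV file
sample = io.StringIO('http://a.example/jobs,Acme\nhttp://b.example/jobs,Bmart\n')
gen = read_urls(sample)
first = next(gen)  # only the first row has been parsed at this point
```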
Do not force OOP when not required

You are not interested in storing state about each URL. You just want a simple question answered: "does this URL point to a page that provides vacancies?" Thus a simple function will do, for each URL:

def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    return not(any(text for text in NO_VACANCIES if text in source.lower()))

See what I did there? This is the exact function we need for the filter I introduced earlier.

Remove extraneous computation
Calling lower() on a string has its cost. Calling it once for every piece of text you are looking for in said string is even more costly. You should call it only once, after fetching the source of the URL.

Of lower interest, any is converting each piece of text to its truthy value after the test text in source has already returned True. You also store source, self.url, self.source (and, to a lesser extent, page) without really needing to. But I've shown ways to avoid that.

Proposed improvements
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    return not any(text in source for text in NO_VACANCIES)

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

def check_vacancies(filename):
    print 'Following URLs have vacancies'
    for url in filter(has_vacancies, read_urls(filename)):
        print url

if __name__ == '__main__':
    check_vacancies('25.csv')

Going further
You might be interested in trying the re module to perform the text matches; there may be a performance improvement.

Alternatively, the multiprocessing module provides a map_async performed in parallel by a pool of workers. You might want to change has_vacancies a bit and use that instead of filter to increase overall performance.

You may also be interested in handling exceptions from urllib2 so a sudden networking failure or an invalid URL won't crash everything. Something simple can do:

def has_vacancies(url):
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        print 'Error processing', url
        return False
    else:
        source = source.lower()
        return not any(text in source for text in NO_VACANCIES)
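The parallel variant suggested above can be sketched as follows (Python 3 syntax). Here multiprocessing.dummy provides the same Pool API backed by threads, which suits I/O-bound fetching; the predicate uses a hypothetical in-memory table standing in for the real network fetch.

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

NO_VACANCIES = ['no vacancies', 'not hiring']

# hypothetical stand-in for urllib2.urlopen(url).read(); a real version
# would fetch each page over the network
FAKE_SOURCES = {
    'http://a.example/jobs': 'We are hiring!',
    'http://b.example/jobs': 'Sorry, no vacancies at this time.',
}

def has_vacancies(url):
    source = FAKE_SOURCES.get(url, '').lower()
    return not any(text in source for text in NO_VACANCIES)

def check_vacancies(urls):
    # run the predicate across a small pool; map preserves input order
    with Pool(2) as pool:
        flags = pool.map(has_vacancies, urls)
    return [url for url, ok in zip(urls, flags) if ok]
```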
Context
StackExchange Code Review Q#118594, answer score: 2