Web crawlers for three image sites

Submitted by: @import:stackexchange-codereview

Problem

I'm very new to Python and only vaguely remember OOP from doing some Java a few years ago, so I don't know the best way to do this.

I've built a bunch of classes, each representing a crawler that scrapes images from a specific website. For example, for the website stocksnap I have a class StocksnapCrawler.

I have 9 of these crawler classes and it's awful. I know there must be a much better way of representing them, since they share a lot in common.

Here are three of these crawlers:

```
class MagdeleineCrawler:

    def __init__(self, crawler_db):
        self.current_page = crawler_db.current_page
        self.crawler_db = crawler_db

    def crawl(self):
        current_page = self.current_page
        print("Starting crawl on page " + str(current_page))
        while True:
            print("crawling page " + str(current_page))
            page_response = requests.get(
                'http://magdeleine.co/license/cc0/page/{}/'.format(current_page))
            page_soup = BeautifulSoup(page_response.text)
            image_links = [link["href"]
                           for link in page_soup.find_all('a', {'class': 'photo-link'})]

            for image_link in image_links:
                print("scraping image at " + image_link)
                response = requests.get(image_link)
                image_page_soup = BeautifulSoup(response.text)
                print('getting image source link')
                image_source_link = image_page_soup.find(
                    'a', {'class': 'download'})['href']

                # Get Tags
                print('getting tags')
                ul = image_page_soup.find('ul', {'class': 'tags'})
                if ul:
                    tag_links = ul.find_all('a', {'rel': 'tag'})
                    tag_names = [tag_link.string for tag_link in tag_links]
                    try:
                        tag_names.remove('editor\'s pick')
                    except:
                        pass

                thumbnail_url = im
```

Solution

Your first problem is that the crawl function contains almost all the code. One large function is harder to reuse, read and make changes to. If you break it up things will get a lot easier.

Think about each function as a task. You want them to do one thing each. For example, you could get all the image links as one function:

def get_links(self, page):
    # Use the page argument; current_page is not defined in this function
    page_response = requests.get(
        'http://magdeleine.co/license/cc0/page/{}/'.format(page))
    page_soup = BeautifulSoup(page_response.text)
    return [link["href"] for link in
            page_soup.find_all('a', {'class': 'photo-link'})]


But notice, you have almost the same process for all three classes shown here. The difference is the URL base that you're requesting from and the 'class' attribute. But those should both be attributes of the class. Then you could rewrite the function like this:

def get_links(self, page):
    # base_url and image_class are set per crawler class
    page_response = requests.get(self.base_url.format(page))
    page_soup = BeautifulSoup(page_response.text)
    return [link["href"] for link in
            page_soup.find_all('a', {'class': self.image_class})]


You understand the program and can likely form better names, but now this could be the same function in all three cases.
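
As a rough illustration of that idea, here is a minimal sketch in which the URL template and the link class are plain class attributes. It makes several assumptions that are not in the original code: the standard library's `html.parser` and `urllib` are swapped in for BeautifulSoup and requests so the sketch runs without extra packages, and the `fetch` method and `_LinkCollector` helper are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collect hrefs of <a> tags with a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


class MagdeleineCrawler:
    # The only site-specific parts: where to look and which links to keep.
    base_url = 'http://magdeleine.co/license/cc0/page/{}/'
    image_class = 'photo-link'

    def fetch(self, url):
        # Hypothetical method: all network access lives here so it can be
        # replaced when trying the class out offline.
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def get_links(self, page):
        html = self.fetch(self.base_url.format(page))
        collector = _LinkCollector(self.image_class)
        collector.feed(html)
        return collector.links
```

With this shape, `get_links` contains nothing specific to one site, so another crawler would only need different values for `base_url` and `image_class`.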

Similarly you could turn other parts into functions too:

def scrape_image(self, image_link, image_class, html_tag):
    print("scraping image at " + image_link)
    response = requests.get(image_link)
    image_page_soup = BeautifulSoup(response.text)
    print('getting image source link')
    image_source_link = image_page_soup.find(
        'a', {'class': image_class})[html_tag]
    # Return the link too; the soup is still needed to extract the tags
    return image_source_link, image_page_soup


You can then call this with each crawler's own attributes, like this:

def crawl(self):

    ...

    self.scrape_image(image_link, self.scrape_image_class,
                      self.scrape_html_tag)
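
To show how short `crawl` can become once the work is split into tasks, here is a toy sketch that runs without any network access. Everything in it is an assumption made for illustration: the `SketchCrawler` class, the in-memory `pages` dictionary standing in for the website, the simplified `scrape_image` signature, and the stop condition (an empty page ends the crawl), since the original loop's exit condition is not shown.

```python
class SketchCrawler:
    """Toy crawler: `pages` maps a page number to the image links on it."""

    def __init__(self, pages, current_page=1):
        self.pages = pages
        self.current_page = current_page

    def get_links(self, page):
        # Stand-in for the requests/BeautifulSoup version of get_links.
        return self.pages.get(page, [])

    def scrape_image(self, image_link):
        # Stand-in: the real method would download and parse the image page.
        return {"source": image_link}

    def crawl(self):
        results = []
        page = self.current_page
        while True:
            links = self.get_links(page)
            if not links:  # assumed stop condition: an empty page ends the crawl
                break
            for link in links:
                results.append(self.scrape_image(link))
            page += 1
        self.current_page = page
        return results


crawler = SketchCrawler({1: ["a.jpg", "b.jpg"], 2: ["c.jpg"]})
print(crawler.crawl())
```

The loop now only decides what happens in which order, while each helper does one task and can be changed or replaced on its own.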


Your get_tags function is trickier, as the commands are entirely different in each case. Here you could override the function instead. Did you learn about inheritance when you did OOP before? Inheritance is when one class takes on the attributes and methods of another and then adds to them. So you could have a Crawler class that holds the common attributes and the shared functions like scrape_image, and each site-specific class then adds its own features on top. Here's a quick template of how the Crawler might look:

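class Crawler:
    def __init__(self):
        pass  # set the common attributes here
    def crawl(self):
        pass  # the shared crawl loop
    def scrape_image(self, image_link, image_class, html_tag):
        pass  # the shared scraping code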


And now you make the MagdeleineCrawler. To inherit from Crawler just put it in brackets with the class definition.

class MagdeleineCrawler(Crawler):
    def __init__(self):
        Crawler.__init__(self) # Pass parameters to Crawler in here
    def get_tags(self):
        pass  # the site-specific tag code


This way you can mix shared functions that are driven by per-class attributes with individual functions defined separately for each class.
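
Putting the pieces together, a small sketch of the base-class-plus-override layout could look like the following. It is only one possible arrangement under stated assumptions: `page_url` and `process_image` are hypothetical helper names, `get_tags` is given a plain list of tag names instead of a parsed page to keep the example short, and a `SimpleNamespace` stands in for the real `crawler_db` object.

```python
from types import SimpleNamespace


class Crawler:
    """Shared behaviour; subclasses supply the site-specific parts."""

    def __init__(self, crawler_db, base_url):
        self.crawler_db = crawler_db
        self.current_page = crawler_db.current_page
        self.base_url = base_url

    def page_url(self):
        # Hypothetical helper: build the listing URL for the current page.
        return self.base_url.format(self.current_page)

    def get_tags(self, raw_tags):
        # Each site handles tags differently, so subclasses must override this.
        raise NotImplementedError("each site-specific crawler defines get_tags")

    def process_image(self, raw_tags):
        # Shared step that calls whichever get_tags the subclass provides.
        return {"page": self.current_page, "tags": self.get_tags(raw_tags)}


class MagdeleineCrawler(Crawler):
    def __init__(self, crawler_db):
        super().__init__(crawler_db,
                         'http://magdeleine.co/license/cc0/page/{}/')

    def get_tags(self, raw_tags):
        # The original code drops the "editor's pick" tag for this site.
        return [tag for tag in raw_tags if tag != "editor's pick"]


db = SimpleNamespace(current_page=4)  # stand-in for the real crawler_db
crawler = MagdeleineCrawler(db)
print(crawler.page_url())
print(crawler.process_image(["sky", "editor's pick"]))
```

Raising `NotImplementedError` in the base class is a design choice for this sketch: it makes a forgotten override fail loudly instead of silently returning no tags.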

Context

StackExchange Code Review Q#106350, answer score: 3
