Web crawlers for three image sites
Problem
I'm very new to Python and only vaguely remember OOP from doing some Java a few years ago, so I don't know what the best way to do this is.
I've built a bunch of classes that each represent a crawler that scrapes images from a specific website, e.g. for the website StockSnap I have a class StocksnapCrawler.
I have 9 of these crawler classes and it's awful. I know there must be a much better way of representing them; they share a lot in common.
Here are three of these crawlers:
```
class MagdeleineCrawler:
    def __init__(self, crawler_db):
        self.current_page = crawler_db.current_page
        self.crawler_db = crawler_db

    def crawl(self):
        current_page = self.current_page
        print("Starting crawl on page " + str(current_page))
        while True:
            print("crawling page " + str(current_page))
            page_response = requests.get(
                'http://magdeleine.co/license/cc0/page/{}/'.format(current_page))
            page_soup = BeautifulSoup(page_response.text)
            image_links = [link["href"]
                           for link in page_soup.find_all('a', {'class': 'photo-link'})]
            for image_link in image_links:
                print("scraping image at " + image_link)
                response = requests.get(image_link)
                image_page_soup = BeautifulSoup(response.text)
                print('getting image source link')
                image_source_link = image_page_soup.find(
                    'a', {'class': 'download'})['href']
                # Get Tags
                print('getting tags')
                ul = image_page_soup.find('ul', {'class': 'tags'})
                if ul:
                    tag_links = ul.find_all('a', {'rel': 'tag'})
                    tag_names = [tag_link.string for tag_link in tag_links]
                    try:
                        tag_names.remove('editor\'s pick')
                    except:
                        pass
                thumbnail_url = im
```
Solution
Your first problem is that the crawl function contains almost all the code. One large function is harder to reuse, read and make changes to. If you break it up, things will get a lot easier.
Think about each function as a task. You want them to do one thing each. For example, you could get all the image links as one function:
```
def get_links(self, page):
    page_response = requests.get(
        'http://magdeleine.co/license/cc0/page/{}/'.format(page))
    page_soup = BeautifulSoup(page_response.text)
    return [link["href"] for link in
            page_soup.find_all('a', {'class': 'photo-link'})]
```
But notice, you have almost the same process for all three classes shown here. The difference is the base URL that you're requesting from and the 'class' attribute. But those should both be attributes of the class. Then you could rewrite the function like this:
```
def get_links(self, page):
    page_response = requests.get(self.base_url.format(page))
    page_soup = BeautifulSoup(page_response.text)
    return [link["href"] for link in
            page_soup.find_all('a', {'class': self.image_class})]
```
You understand the program and can likely form better names, but now this could be the same function in all three cases.
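To make that concrete, here's a minimal sketch, assuming the `base_url` and `image_class` names used above (the `page_url` helper and the exact class layout are illustrative, not part of the original code), of how each subclass would carry its site-specific settings as class attributes:

```python
class Crawler:
    base_url = None      # page-URL template, filled in by each subclass
    image_class = None   # CSS class of the image links on a listing page

    def page_url(self, page):
        # Shared logic: every site builds its listing-page URL the same way.
        return self.base_url.format(page)


class MagdeleineCrawler(Crawler):
    # Adding a new site is now just two lines of configuration.
    base_url = 'http://magdeleine.co/license/cc0/page/{}/'
    image_class = 'photo-link'
```

Because the shared methods only read `self.base_url` and `self.image_class`, a tenth crawler is a couple of attribute declarations rather than another copy of the whole class.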
Similarly you could turn other parts into functions too:
```
def scrape_image(self, image_link, image_class, html_tag):
    print("scraping image at " + image_link)
    response = requests.get(image_link)
    image_page_soup = BeautifulSoup(response.text)
    print('getting image source link')
    image_source_link = image_page_soup.find(
        'a', {'class': image_class})[html_tag]
    return image_page_soup
```
You can then call this with an individual Crawler's attributes, like this:
```
def crawl(self):
    ...
    self.scrape_image(image_link, self.scrape_image_class,
                      self.scrape_html_tag)
```
Your get_tags function is trickier, as there are entirely different commands in different cases. But in this case you could override the function instead. Did you learn about inheritance when you did OOP before? Inheritance is basically when one class takes the attributes of another and then adds to them. So in this case, perhaps you have a Crawler class that has the common attributes as well as common similar functions like scrape_image, but then you add new features in each subclass. Here's a quick template of how the Crawler might look:
```
class Crawler:
    def __init__(self):
        ...

    def crawl(self):
        ...

    def scrape_image(self):
        ...
```
And now you make the MagdeleineCrawler. To inherit from Crawler, just put it in brackets in the class definition:
```
class MagdeleineCrawler(Crawler):
    def __init__(self):
        Crawler.__init__(self)  # Pass parameters to Crawler in here

    def get_tags(self):
        ...
```
This way you can do a mix of common similar functions that get passed attributes as well as defining individual functions for each class.
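Putting the pieces together, a runnable sketch of the overall shape might look like this. The method names and the tag-splitting logic are placeholders (the real `get_tags` would use BeautifulSoup to find the `<a rel="tag">` elements); the point is where shared behaviour lives versus where a subclass overrides it:

```python
class Crawler:
    base_url = None  # page-URL template, set by each subclass

    def __init__(self, start_page=1):
        self.current_page = start_page

    def get_tags(self, text):
        # Default behaviour: no tags. Sites that expose tags override this.
        return []


class MagdeleineCrawler(Crawler):
    base_url = 'http://magdeleine.co/license/cc0/page/{}/'

    def get_tags(self, text):
        # Site-specific override; this string splitting is a stand-in for
        # the real HTML parsing of the tag list.
        tags = [t.strip() for t in text.split(',') if t.strip()]
        if "editor's pick" in tags:
            tags.remove("editor's pick")  # cleaner than try/except: pass
        return tags


crawler = MagdeleineCrawler(start_page=2)
print(crawler.base_url.format(crawler.current_page))
# → http://magdeleine.co/license/cc0/page/2/
print(crawler.get_tags("nature, editor's pick, forest"))
# → ['nature', 'forest']
```

Note that `__init__` and `current_page` are inherited unchanged; only the genuinely site-specific parts appear in the subclass.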
Context
StackExchange Code Review Q#106350, answer score: 3