HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Amazon web scraper

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
amazonscraperweb

Problem

I am trying to improve my programming and programming design skills (poor at the moment). I created a small Amazon scraper program. It is a working program. I would be very grateful if you could review my design. What could be improved? Does my class design make sense?

main.py:

if __name__ == "__main__":
    from Scraper import Scraper
    scraper = Scraper("Schach", 19, 30)
    productLinks = scraper.scrapeProductLinks()
    for link in productLinks:
        try:
            product = scraper.scrapeProductDetails(link)
            price = product[1]
            alternative_price = product[2]
            reviews = product[3]
            if (price < 20) or (alternative_price < 20) or (reviews) < 30:
                continue
            scraper.saveProductListToCSV(product)
        except Exception as err:
            print(err)


Extractor.py:

```
from lxml import html as xmlhtml

class Extractor:
def __init__(self, html, link=""):
self.html = html
self.link = link
self.xml_tree = xmlhtml.fromstring(html)

def castPrice(self, price):
if not isinstance(price, str):
price = "-1"
return float(price.replace("EUR ", "").replace(",", "."))

def extractTitle(self):
try:
self.title = self.xml_tree.xpath('//*[@id="productTitle"]/text()')[0]
except IndexError:
return "NA"
return self.title

def extractPrice(self):
try:
price = self.xml_tree.xpath('//*[@id="priceblock_ourprice"]/text()')
alternative_price = self.xml_tree.xpath('//*[@id="buyNewSection"]/div/div/div/div[2]/a/div/div[2]/span[1]/span/text()')
if price:
price = price[0]
elif alternative_price:
price = alternative_price[0]
except IndexError:
return "-1"
return self.castPrice(price)

def extractCompetitionPrice(self):
try:
competition_price = self.xml_tree.

Solution

Starting in your main and following the path the code takes:

if __name__ == "__main__":
    #large block of code


This is something I usually don't like to see. I'd have expected your module to be executable from the outside, even if I'm not invoking it directly. Like this it's impossible.

I'd have expected instead:

def main:
    """ docstring """
    #large block of code

if __name__ == "__main__":
     main()


This allows me to call your program from somewhere else like:

from module import Main as AmazonScraper
AmazonScraper.main()


This gets me to my next point:

Import at the top of your file if you can. It's much simpler to keep track of your imports if you import at the top of your file, which is also the recommended practice in the official python styleguide (PEP-008)

The next thing referenced is your Scraper. In the main-method there are 4 things done with your scraper.

scraper = Scraper("Schach", 19, 30)
scaper.scrapeProductLinks()
scraper.scrapeProductDetails()
scraper.saveProductListToCSV(product)


You are violating naming conventions of pyton here. Methods are supposed to be named in snake_case (just like your variables are)

first you construct your scraper with a String and two numbers. The numbers seem to be completely arbitrary and I have absolutely no idea what they are there for. Numbers like that are called: Magic Numbers.

It's considered good practice to extract such magic numbers to appropriately named constants.

Then you have your scraper scrape_product_links which seems to give you a list. So far so good.

Next thing is, you have your scraper scrape_product_details, which seems to give you a fixed-size list or tuple. Also nothing extremely wrong with that. No wait. There is a piece missing.

You're directly after accessing the tuple's data with the numbers 1, 2 and 3. These are again magic numbers. You should instead define a class that can contain product data:

class Product
    def __init__(self, price, alt_price, reviews):
        self.price = price
        self.alt_price = alt_price
        self.reviews = reviews

    # getters for immutable properties


This would completely eliminate the magic numbers and allow you to extend the product class without having to fear breaking calling code.

And last but not least: You call your scraper to save_product_list_to_csv. And this is where I have to intervene.

You're violating single responsibility principles here. Your scraper should only scrape_* things. He mustn't be responsible for saving that data to some csv. That should be the job of something else.

On the other hand your Extractor is clean, simple and has a single responsibility. But there's one thing I'm completely missing overall:

Documentation:

It's critical to provide documentation alongside your code. You should document assumptions made and parameter responsibilities in all your classes and methods.

While that might sound dumb right now Mr. Maintainer (aka. Your future self) will thank you when they look at this code in 2 months and understand zilch of what happens where. It saves them headaches and a lot of time.

Overall:

  • You have no documentation whatsoever. That's bad.



  • You also do not respect some major python conventions. That's not quite as bad, but alas.



  • Your Scraper isn't quite clean, since he's responsible for "too much". That's meh.



  • Your code cannot be cleanly invoked from the outside. That's bad.



  • Your extractor is clean and rather simple to understand. That's good!



  • Your main is easy to follow. Your program is nicely abstracted (well when skipping the magic number stuff). That's also good!



In summary: You can definitely do better ;)

Code Snippets

if __name__ == "__main__":
    #large block of code
def main:
    """ docstring """
    #large block of code

if __name__ == "__main__":
     main()
from module import Main as AmazonScraper
AmazonScraper.main()
scraper = Scraper("Schach", 19, 30)
scaper.scrapeProductLinks()
scraper.scrapeProductDetails()
scraper.saveProductListToCSV(product)
class Product
    def __init__(self, price, alt_price, reviews):
        self.price = price
        self.alt_price = alt_price
        self.reviews = reviews

    # getters for immutable properties

Context

StackExchange Code Review Q#82721, answer score: 3

Revisions (0)

No revisions yet.