patternpythonModerate

Scraping new product data from an online store

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

newscrapingproductonlinestorefromdata

Problem

I get a lot done with Python scripts but always feel like I'm working with a messy desk when I'm using it. I assume this is because I am not coding in Python correctly. Coming from a C# and Ruby background, I've used Python mainly as a utility language writing together things that I need quickly - never really trying to refine my work - until now.

I want criticism as to what I'm doing that's making my Python code feel sloppy - I feel I will advance more quickly if I have such feedback, and, since I don't have a Python mentor, here I am.

The below script scrapes new product data from an online store. I typically run this script from my console simply by typing python my_scraper_script.py and it will scrape the data and upload it to my rails site that I have running locally (this is a hobby project of mine).

Some initial concerns:

Does it make sense to declare some of the methods here in this script and some of the methods in another script (scraper_tools) when some of those methods are shared by similar scripts?

My websiteProduct class does all of the work in the initializer part of the class, this feels dirty to me.

I have a bunch of variables declared in what feels like an arbitrary position within the file. Is this acceptable?

The part where this script actually gets started doing things is just hanging out in the open somewhere arbitrarily toward the end of the actual file, is this normal?

I'm sure there's much more awry - please fix me.

```
import json
import re
import urllib2
from scraper_tools import soupify, getHtml, buildJsonPostRequest

__author__ = 'ecnalyr'

def getImageLinkFromDiv(div):
"""Expects a BeautifulSoup div from {website's} new product page"""
try:
fullSrc = div.find('img')['src']
return fullSrc.replace("//", "http://") # removes the unnecessary // from the beginning of the string
except AttributeError:
return "There is no Image Link"

def getSkuFromDiv(div):
"""Expects a Beauti

Solution

import json
import re
import urllib2
from scraper_tools import soupify, getHtml, buildJsonPostRequest

__author__ = 'ecnalyr'

def getImageLinkFromDiv(div):

The python style guide recommends lowercase_with_underscores for function names

"""Expects a BeautifulSoup div from {website's} new product page"""
    try:
        fullSrc = div.find('img')['src']

Python convention is for local variables to be lowercase_with_underscores

return fullSrc.replace("//", "http://") # removes the unnecessary // from the beginning of the string

This will also make replacements throughout the entire string, not just the beginning. It's also not really removing it from the beginning. It also seems like a more generic thing and perhaps should be in a seperate normalize_url function.

except AttributeError:

It's rarely a good idea to catch an AttributeError, python will produce that errors for simple typos in your code so catching it here is likely to mask real bugs. I'm not sure which attribute access you are catching, but I'm guess its that fullSrc can be None? In that case you should really check if fullSrc is None:

return "There is no Image Link"

That's a terrible way to report an error. Either you should return None or an empty string or raise an exception. But returning a string with an error message is bad because the rest of the program won't treat it as special.

def getSkuFromDiv(div):
    """Expects a BeautifulSoup div from {website's} new product page"""
    try:
        baseUrl = div.find('a')['href']
        baseSku = re.search('\productId=(\d+_\d+)', baseUrl).group(0)

You should use .group(1) as that will return the part in the parens. Then you don't need to replace trick.

return baseSku.replace("productId=", "")
    except AttributeError:
        return "There is no Sku"

def encodeBrandNameToUTF8(brandAndName):
    return str(brandAndName.encode("utf-8"))

.encode() already returns a string, you don't need to stringify it again. The function also implies its specific to brand names, but its not.

def getBrandAndNameFromDiv(div):
    """Expects a BeautifulSoup div from {website's} new product page"""
    try:
        brandAndName = div.find('span', {'class': 'name'}).string
        return str(brandAndName)
    except UnicodeEncodeError:
        return encodeBrandNameToUTF8(brandAndName)

Just return encodeBrandNameToUTF8(brandAndName) the first time. You don't gain anything by trying and failing and trying again.

#        return "unicode encode error"
    except AttributeError:
        return "There is no brand and / or name"

def getPriceFromDiv(div):
    """Expects a BeautifulSoup div from {website's} new product page"""
    try:
        price = div.find('span', {'class': 'price'}).string

This line is very similiar to the line in the previous function. I'd write a function to share the similiarties.

return price.replace("$", "") # have to remove the dollar sign from the price

I don't like this because it implies that you are removing it from across the whole string rather then just the beginning. Instead I'd use return price.lstrip('$')

except AttributeError:
        return "There is no price"

def buildLinkFromSKU(sku):
    """Expects a SKU representing a product at {website}"""
    return productBaseUrl + sku

class websiteProduct:

Python style guide recomend CamelCase for class names

"""A product from {website's} created using a newItemDiv in a BeautifulSoup format"""
    def __init__(self, productDiv):
        self.price = str(getPriceFromDiv(productDiv))
        self.brandAndName = str(getBrandAndNameFromDiv(productDiv))

Python style guide recommends lowercase_with_underscores for object attributes.

self.sku = str(getSkuFromDiv(productDiv))
        self.imageLink = str(getImageLinkFromDiv(productDiv))
        self.link = str(buildLinkFromSKU(self.sku))

Stylistically, it'd make more sense for your get function to convert to strings. Especially since many of them already do.

self.json = json.dumps({'store':"Website Name", 'name':self.brandAndName, 'price':self.price, 'sku':self.sku,
                                'link':self.link, 'imageLink':self.imageLink})

    def getPrice(self):
        return self.price

    def getBrandAndName(self):
        return self.brandAndName

    def getLink(self):
        return self.link

    def getImageLink(self):
        return self.imageLink

    def getSku(self):
        return self.sku

    def getJson(self):
        return self.json

In python, we do not have get functions. Just access the attributes directly. If you need more complicated getter logic, you can use properties.

```
uploadProductsUrl = "http://MySite.com/products.json"
websiteNewItemLink = "http://www.website.com/newItems"
newItemSoup = soupify(getHtml(websiteNewItemLink))
newItems = newItemSoup.find_all('div', {'class': 'productNew'})
productBaseUrl = "

Code Snippets

import json
import re
import urllib2
from scraper_tools import soupify, getHtml, buildJsonPostRequest

__author__ = 'ecnalyr'


def getImageLinkFromDiv(div):

"""Expects a BeautifulSoup div from {website's} new product page"""
    try:
        fullSrc = div.find('img')['src']

return fullSrc.replace("//", "http://") # removes the unnecessary // from the beginning of the string

except AttributeError:

return "There is no Image Link"

Context

StackExchange Code Review Q#21290, answer score: 16

Revisions (0)

No revisions yet.