HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Parsing item page for an online shop

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
itemonlineparsingforpageshop

Problem

In this function I'm parsing the page of the item in the online shop. Some items lack the picture, some lack price etc, so there are few if-else checks.

My questions are:

  • How to get rid of these ugly if-else checks?



  • How I can "pythonize" my code further?



-
Is it correct? I.e when I manipulate the variable value using previously assigned value:

name = soup.find('span',{"class":"project_name"})
if name is not None:
    name = re.sub(r'[«,»]','', name.text)


Below is my full code:

def grab_properties(url,html):
    soup = BeautifulSoup(html)
    link = soup.find('iframe', {"src":"e.smth.com"})

    if link is not None:
        link = link.get("src")
        id = re.match(".*\[(\w+)\].*", link).group(1)
        if id is not None:
            if id not in ids:
                name = soup.find('span',{"class":"project_name"})
                if name is not None:
                    name = re.sub(r'[«,»]','', name.text)
                    if (soup.find('img',{"id":"item0image"}) is not None):
                        img_url = soup.find('img',{"id":"item0image"}).get('src')
                        price = soup.find('span',{"class":"project_price"})
                        if price is not None:
                            price = price.text
                            price = re.sub(u'[руб,\., ]','', price)
                        else:
                            price =u"Call us!"
                        return [id.encode('utf8'), name.encode('utf8'), url.encode('utf8'), img_url.encode('utf8'), price.encode('utf8')]

Solution

You cannot get rid of those if-else checks, but you can get rid of the nesting. For example:

if foo:
    if bar:
        return foo + bar


could be flattened to

if not foo:
    return None
if not bar:
    return None
return foo + bar


While this is is in fact longer, such code tends to be easier to read.

If we apply that to your code (and fix some style issues like putting a space after each comma), then we end up with:

def grab_properties(url, html):
    soup = BeautifulSoup(html)

    link = soup.find('iframe', {"src": "e.smth.com"})
    if link is None:
        return None
    link = link.get("src")

    id = re.match(".*\[(\w+)\].*", link).group(1)
    if (id is None) or (id in ids):
        return None

    name = soup.find('span', {"class": "project_name"})
    if name is None:
        return None
    name = re.sub(r'[«,»]', '', name.text)

    image_element = soup.find('img', {"id": "item0image"})
    if image_element is None:
        return None
    img_url = image_element.get('src')

    price = soup.find('span', {"class": "project_price"})
    if price is None:
        price = u"Call us!"
    else
        price = re.sub(u'[руб,\., ]', '', price.text)

    return [attr.encode('utf8') for atr in (id, name, url, img_url, price)]


Notice that I created the image_element variable to avoid searching for the same element multiple times, that I used a list comprehension to encode various strings to UTF-8, and that I used empty lines to separate the code for unrelated attributes, thus making the code easier to understand. I also swapped the if price is not None condition around to avoid a confusing negation.

We should now talk about regular expressions (see also the re library documentation). The […] is a character class. It matches any of the contained characters, so [abc] will match either a or b or c, but not the whole string abc. The comma is not special inside a character class, so [«,»] will match left or right guillemets, or a comma. To match just the angled quotation marks, use [«»]. Similarly, [руб,\., ] will match either р or у or б or a comma or a period or a space. Note that inside a character class, the period is not a metacharacter and does not have to be escaped. So that charclass would be equivalent to [р,у б.] (the order of characters does not matter). If you want to match alternative patterns, use the | regex operator: руб|[. ], which matches either the substring руб or a period or a space.

You could also consider precompiling your regular expressions, which might improve performance if you call your function very often:

RE_GUILLEMETS = re.compile('[«»]')

def ...:
    ...
    name = RE_GUILLEMETS.sub('', name.text)

Code Snippets

if foo:
    if bar:
        return foo + bar
if not foo:
    return None
if not bar:
    return None
return foo + bar
def grab_properties(url, html):
    soup = BeautifulSoup(html)

    link = soup.find('iframe', {"src": "e.smth.com"})
    if link is None:
        return None
    link = link.get("src")

    id = re.match(".*\[(\w+)\].*", link).group(1)
    if (id is None) or (id in ids):
        return None

    name = soup.find('span', {"class": "project_name"})
    if name is None:
        return None
    name = re.sub(r'[«,»]', '', name.text)

    image_element = soup.find('img', {"id": "item0image"})
    if image_element is None:
        return None
    img_url = image_element.get('src')

    price = soup.find('span', {"class": "project_price"})
    if price is None:
        price = u"Call us!"
    else
        price = re.sub(u'[руб,\., ]', '', price.text)

    return [attr.encode('utf8') for atr in (id, name, url, img_url, price)]
RE_GUILLEMETS = re.compile('[«»]')

def ...:
    ...
    name = RE_GUILLEMETS.sub('', name.text)

Context

StackExchange Code Review Q#49052, answer score: 5

Revisions (0)

No revisions yet.