patternpythonMinor
Parsing item page for an online shop
Viewed 0 times
itemonlineparsingforpageshop
Problem
In this function I'm parsing the page of the item in the online shop. Some items lack the picture, some lack price etc, so there are few
My questions are:
-
Is it correct? I.e when I manipulate the variable value using previously assigned value:
Below is my full code:
if-else checks.My questions are:
- How to get rid of these ugly
if-elsechecks?
- How I can "pythonize" my code further?
-
Is it correct? I.e when I manipulate the variable value using previously assigned value:
name = soup.find('span',{"class":"project_name"})
if name is not None:
name = re.sub(r'[«,»]','', name.text)Below is my full code:
def grab_properties(url,html):
soup = BeautifulSoup(html)
link = soup.find('iframe', {"src":"e.smth.com"})
if link is not None:
link = link.get("src")
id = re.match(".*\[(\w+)\].*", link).group(1)
if id is not None:
if id not in ids:
name = soup.find('span',{"class":"project_name"})
if name is not None:
name = re.sub(r'[«,»]','', name.text)
if (soup.find('img',{"id":"item0image"}) is not None):
img_url = soup.find('img',{"id":"item0image"}).get('src')
price = soup.find('span',{"class":"project_price"})
if price is not None:
price = price.text
price = re.sub(u'[руб,\., ]','', price)
else:
price =u"Call us!"
return [id.encode('utf8'), name.encode('utf8'), url.encode('utf8'), img_url.encode('utf8'), price.encode('utf8')]Solution
You cannot get rid of those if-else checks, but you can get rid of the nesting. For example:
could be flattened to
While this is is in fact longer, such code tends to be easier to read.
If we apply that to your code (and fix some style issues like putting a space after each comma), then we end up with:
Notice that I created the
We should now talk about regular expressions (see also the
You could also consider precompiling your regular expressions, which might improve performance if you call your function very often:
if foo:
if bar:
return foo + barcould be flattened to
if not foo:
return None
if not bar:
return None
return foo + barWhile this is is in fact longer, such code tends to be easier to read.
If we apply that to your code (and fix some style issues like putting a space after each comma), then we end up with:
def grab_properties(url, html):
soup = BeautifulSoup(html)
link = soup.find('iframe', {"src": "e.smth.com"})
if link is None:
return None
link = link.get("src")
id = re.match(".*\[(\w+)\].*", link).group(1)
if (id is None) or (id in ids):
return None
name = soup.find('span', {"class": "project_name"})
if name is None:
return None
name = re.sub(r'[«,»]', '', name.text)
image_element = soup.find('img', {"id": "item0image"})
if image_element is None:
return None
img_url = image_element.get('src')
price = soup.find('span', {"class": "project_price"})
if price is None:
price = u"Call us!"
else
price = re.sub(u'[руб,\., ]', '', price.text)
return [attr.encode('utf8') for atr in (id, name, url, img_url, price)]Notice that I created the
image_element variable to avoid searching for the same element multiple times, that I used a list comprehension to encode various strings to UTF-8, and that I used empty lines to separate the code for unrelated attributes, thus making the code easier to understand. I also swapped the if price is not None condition around to avoid a confusing negation.We should now talk about regular expressions (see also the
re library documentation). The […] is a character class. It matches any of the contained characters, so [abc] will match either a or b or c, but not the whole string abc. The comma is not special inside a character class, so [«,»] will match left or right guillemets, or a comma. To match just the angled quotation marks, use [«»]. Similarly, [руб,\., ] will match either р or у or б or a comma or a period or a space. Note that inside a character class, the period is not a metacharacter and does not have to be escaped. So that charclass would be equivalent to [р,у б.] (the order of characters does not matter). If you want to match alternative patterns, use the | regex operator: руб|[. ], which matches either the substring руб or a period or a space.You could also consider precompiling your regular expressions, which might improve performance if you call your function very often:
RE_GUILLEMETS = re.compile('[«»]')
def ...:
...
name = RE_GUILLEMETS.sub('', name.text)Code Snippets
if foo:
if bar:
return foo + barif not foo:
return None
if not bar:
return None
return foo + bardef grab_properties(url, html):
soup = BeautifulSoup(html)
link = soup.find('iframe', {"src": "e.smth.com"})
if link is None:
return None
link = link.get("src")
id = re.match(".*\[(\w+)\].*", link).group(1)
if (id is None) or (id in ids):
return None
name = soup.find('span', {"class": "project_name"})
if name is None:
return None
name = re.sub(r'[«,»]', '', name.text)
image_element = soup.find('img', {"id": "item0image"})
if image_element is None:
return None
img_url = image_element.get('src')
price = soup.find('span', {"class": "project_price"})
if price is None:
price = u"Call us!"
else
price = re.sub(u'[руб,\., ]', '', price.text)
return [attr.encode('utf8') for atr in (id, name, url, img_url, price)]RE_GUILLEMETS = re.compile('[«»]')
def ...:
...
name = RE_GUILLEMETS.sub('', name.text)Context
StackExchange Code Review Q#49052, answer score: 5
Revisions (0)
No revisions yet.