Getting data correctly from <span> tag with beautifulsoup and regex

Submitted by: @import:stackexchange-codereview

Problem

I am scraping an online shop page to extract the product price. The price appears in the following block:


₹ 999


I am using Beautiful Soup to get this tag and using a regular expression to get the price:

```
# -*- coding: utf8 -*-

import re
import requests
from bs4 import BeautifulSoup

hs18_test_urls = ['http://www.homeshop18.com/diary-wimpy-kid-hard-luck/author:jeff-kinney/isbn:9780141350677/books/juvenile-fiction/product:30926027/cid:10065/?it_category=HP&it_action=BO-R01001&it_label=HP-R01001-140121094603-30926027-PR-BO-RM-OT-RT01_BooksFlat40PercentOff-RL02-160120&it_value=0',
'http://www.homeshop18.com/apple-ipad-air-wi-fi-16gb-space-grey/computers-tablets/tablets/product:31228967/cid:16327/',
'http://www.homeshop18.com/reebok-men-black-yellow-sandals-j97184/footwear/men/product:30795219/cid:15067/',
'http://www.homeshop18.com/american-swan-women-shirt-pink/clothing/women/product:31225645/cid:15021/',
'http://www.homeshop18.com/diva-fashion-art-silk-saree-parrot-green/clothing/women/product:31514557/cid:15011/?it_category=hs18bot&it_action=recentlySoldProducts&it_label=31225645&it_value=1']

hs18_expected_test_results = [u'210', u'35900', u'1499', u'479', u'1199']

def get_homeshop18_product_meta(url):
    reg = ur'^ ₹? (\d+)'
    response = requests.get(url)
    if response.status_code == requests.codes.ok:
        soup = BeautifulSoup(response.text)
        product_name = soup.find('meta', {'property': 'og:title'})['content']
        product_url = soup.find('meta', {'property': 'og:url'})['content']
        product_img_url = soup.find('meta', {'property': 'og:image'})['content']
        product_price_tag_element = soup.find('span', {'id': 'hs18Price', 'itemprop': 'price'})
        product_price_match = re.match(reg, product_price_tag_element.text)
        if product_price_match:
            product_price = product_price_match.group(1)
        else:
            product_price
```

Solution

As it happens, there are positive answers to each of your questions:

-
With Beautiful Soup you can remove the WebRupee span entirely with replace_with():

webrupee_element = soup.find('span', {'class': 'WebRupee'})
webrupee_element.replace_with('')


Then product_price_tag_element.text will no longer contain the currency symbol.

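EDIT: It is simpler to search only inside the price element and remove every nested span. Note that this needs find_all(), because find() returns a single tag rather than a list of matches:

for wr in product_price_tag_element.find_all('span'):
    wr.replace_with('')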
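
The removal step can be sketched end to end as follows. This is a minimal Python 3 sketch, not the shop's actual page: the HTML string is a hypothetical stand-in built from the selectors used in the question (the hs18Price span with a nested WebRupee span), and it uses decompose() instead of replace_with('') to drop the nested tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors in the question;
# the real page may differ.
html = (
    '<span id="hs18Price" itemprop="price">'
    '<span class="WebRupee">\u20b9</span>\u00a0999'
    '</span>'
)

soup = BeautifulSoup(html, "html.parser")
price_tag = soup.find("span", {"id": "hs18Price", "itemprop": "price"})

# Drop every nested <span> (here, only the currency symbol).
for nested in price_tag.find_all("span"):
    nested.decompose()

# The remaining text still starts with a non-breaking space (U+00A0),
# so normalise it to a plain space and strip.
price_text = price_tag.get_text().replace("\u00a0", " ").strip()
print(price_text)
```

With the symbol removed from the tree, no regular expression is needed for this markup; a regex remains useful if the text can carry other leading or trailing content.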


-
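Your regex is not matching the value properly because the &nbsp; entity decodes to a non-breaking space (U+00A0), which the literal ' ' in your pattern does not match. Use the whitespace escape \s instead of ' ', followed by * so that the pattern also matches when the whitespace is absent, like reg = ur'^\s*₹?\s*(\d+)'. On Python 2, also pass the re.UNICODE flag, because without it \s matches only ASCII whitespace.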

-
Non-ASCII characters in Python unicode literals can be written with either the \u or the \U escape, which differ in length:


Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.

In your case, use the lower-case \u: the rupee sign ₹ is U+20B9, written \u20b9.

-
If you write the non-ASCII characters as unicode escapes, the source file contains only ASCII and you can remove the utf-8 coding declaration.
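
The regex and escape advice above can be checked with a short Python 3 sketch. The sample strings and the extract_price helper are hypothetical, chosen to mimic the text of the price element. In Python 3, str patterns are Unicode-aware by default, so \s matches a non-breaking space without any flag.

```python
import re

# \u20b9 is the rupee sign; \U000020b9 is the same code point in 8-digit form.
RUPEE = "\u20b9"
assert RUPEE == "\U000020b9"

# In a raw pattern string the \u20b9 escape is interpreted by the re module.
PRICE_RE = re.compile(r"^\s*\u20b9?\s*(\d+)")


def extract_price(text):
    """Return the leading integer price as a string, or None if absent."""
    match = PRICE_RE.match(text)
    return match.group(1) if match else None


# Hypothetical inputs: non-breaking space, plain spaces, bare number, no price.
samples = ["\u20b9\u00a0999", " \u20b9 210", "35900", "no price"]
results = [extract_price(s) for s in samples]
print(results)
```

Because the source contains only ASCII, this file needs no coding declaration.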


Context

StackExchange Code Review Q#40658, answer score: 10
