Getting data correctly from <span> tag with beautifulsoup and regex
Problem
I am scraping an online shop page and trying to extract the price shown on it. The price appears in a block that renders as:

₹ 999

I am using Beautiful Soup to get this tag and a regular expression to extract the price:
```
# -*- coding: utf8 -*-
import re
import requests
from bs4 import BeautifulSoup

hs18_test_urls = ['http://www.homeshop18.com/diary-wimpy-kid-hard-luck/author:jeff-kinney/isbn:9780141350677/books/juvenile-fiction/product:30926027/cid:10065/?it_category=HP&it_action=BO-R01001&it_label=HP-R01001-140121094603-30926027-PR-BO-RM-OT-RT01_BooksFlat40PercentOff-RL02-160120&it_value=0',
                  'http://www.homeshop18.com/apple-ipad-air-wi-fi-16gb-space-grey/computers-tablets/tablets/product:31228967/cid:16327/',
                  'http://www.homeshop18.com/reebok-men-black-yellow-sandals-j97184/footwear/men/product:30795219/cid:15067/',
                  'http://www.homeshop18.com/american-swan-women-shirt-pink/clothing/women/product:31225645/cid:15021/',
                  'http://www.homeshop18.com/diva-fashion-art-silk-saree-parrot-green/clothing/women/product:31514557/cid:15011/?it_category=hs18bot&it_action=recentlySoldProducts&it_label=31225645&it_value=1']
hs18_expected_test_results = [u'210', u'35900', u'1499', u'479', u'1199']

def get_homeshop18_product_meta(url):
    reg = ur'^ ₹? (\d+)'
    response = requests.get(url)
    if response.status_code == requests.codes.ok:
        soup = BeautifulSoup(response.text)
        product_name = soup.find('meta', {'property' : 'og:title'})['content']
        product_url = soup.find('meta', {'property' : 'og:url'})['content']
        product_img_url = soup.find('meta', {'property': 'og:image'})['content']
        product_price_tag_element = soup.find('span', {'id': 'hs18Price', 'itemprop': 'price'})
        product_price_match = re.match(reg, product_price_tag_element.text)
        if product_price_match:
            product_price = product_price_match.group(1)
        else:
            product_price
```
Solution
As it happens, there are positive answers to each of your questions:
- With Beautiful Soup you can remove the WebRupee span entirely with `replace_with()`: first `webrupee_element = soup.find('span', {'class': 'WebRupee'})`, then `webrupee_element.replace_with('')`. After that, the text value of `product_price_tag_element.text` will not contain the symbol. EDIT: Of course, it would be faster/better to search only inside the price element: `for wr in product_price_tag_element.find_all('span'): wr.replace_with('')`.
- Your regex is not matching the value properly because the whitespace character in the price text (possibly a non-breaking space) may not map directly to the regular `' '` character in your regex. You should use the whitespace escape sequence `\s` in your regex instead of `' '`, like `reg = ur'^\s*₹?\s*(\d+)\s*'`.
- Unicode string literals in Python should be escaped with either the `\u` or the `\U` escape, but they are different: "Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4." In your case you should use the lower-case `\u`.
- You can remove the utf-8/unicode declaration if you encode the values in the more Pythonic way of Unicode escapes.
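The regex and escape points above can be tried out with a small sketch. It is written for Python 3, which does not accept the `ur''` prefix used in the question, so it may need adjusting for the original Python 2 code. The sample strings and the helper name `extract_price` are assumptions made for illustration; the real page text may differ.

```python
import re

# Assumed sample strings, not taken from the real page.
# "\u20b9" is the Indian rupee sign and "\xa0" is a non-breaking space.
SAMPLES = [
    "\xa0\u20b9\xa0999",  # non-breaking spaces around the symbol
    " \u20b9 35900 ",     # ordinary spaces
    "1499",               # symbol already stripped from the text
]

# The question's pattern, with the symbol written as an escape.
# It only accepts a literal ASCII space before and after the symbol.
STRICT_RE = re.compile(r"^ \u20b9? (\d+)")

# \s* accepts any run of Unicode whitespace, including "\xa0".
# In Python 3 the re module itself interprets \u20b9 inside a raw string.
PRICE_RE = re.compile(r"^\s*\u20b9?\s*(\d+)\s*")


def extract_price(text):
    """Return the leading run of digits as a string, or None if absent."""
    match = PRICE_RE.match(text)
    return match.group(1) if match else None


prices = [extract_price(sample) for sample in SAMPLES]
print(prices)  # ['999', '35900', '1499']

# The strict pattern fails on the non-breaking-space sample.
print(STRICT_RE.match(SAMPLES[0]))  # None

# \u takes four hex digits and \U takes eight; both name the same code point.
print("\u20b9" == "\U000020b9")  # True
```

Because the symbol is written as an escape, the source file contains only ASCII and no longer depends on an encoding declaration.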
Code Snippets
webrupee_element = soup.find('span', {'class': 'WebRupee'})
webrupee_element.replace_with('')

for wr in product_price_tag_element.find_all('span'):
    wr.replace_with('')
Context
StackExchange Code Review Q#40658, answer score: 10