Getting rid of certain HTML tags
Problem
This code retrieves a small section of HTML and then strips every tag except the break tags.
It seems inefficient: a Beautiful Soup object does not support search and replace the way a Python string does, so I was forced to convert back and forth between a soup object and a string several times in order to use both string methods and Beautiful Soup methods. There must be a simpler way to do this without switching back and forth between soup objects and strings.
def whiteScrape(webaddress):
    url = (webaddress)
    ourUrl = opener.open(url).read()
    soup = BeautifulSoup(ourUrl)
    soup = soup.find("div", { "class" : "addy" })
    #below code will delete tags except /br
    soup = str(soup)
    soup = soup.replace('<br/>' , '^')
    soup = BeautifulSoup(soup)
    soup = (soup.get_text())
    soup = str(soup)
    soup=soup.replace('^' , '<br/>')
    return soup
The original HTML is something like:
blah blah
blahblah
blah
The output should be:
blah blah
blahblah
blah
Solution
By PEP 8, whiteScrape() should be renamed to white_scrape().
Variable names should be purposeful. Therefore, it's a bad idea to continually redefine soup. If someone were to ask you to explain what the variable soup contained, you would have a hard time explaining.
It seems like what you want to do is to stringify the children of the div(s) in question. One way to do that would be:
def white_scrape(url):
    page = opener.open(url).read()
    soup = BeautifulSoup(page)
    addr = soup.find('div', { 'class': 'addy' })
    return ''.join(str(child) for child in addr.children)
Note that this is not exactly equivalent to your original code. For example, if an address contains a nested element around text such as 123 1st St., this solution would preserve that element's tag and its contents, whereas your original code would discard the tag but keep its contents.
Code Snippets
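If the goal is the behavior of the original code (drop every tag except the break tags, keep the text), one possible way to avoid converting between strings and soup objects is to unwrap each unwanted tag in place. The following is only a minimal sketch under assumptions: it uses the bs4 package with the built-in html.parser, the function name strip_tags_except_br and the sample markup are hypothetical, and it takes an HTML string instead of fetching a URL.

```python
from bs4 import BeautifulSoup


def strip_tags_except_br(html, css_class="addy"):
    """Return the inner HTML of the first matching div, keeping only <br/> tags.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    addr = soup.find("div", {"class": css_class})
    if addr is None:
        return ""
    # find_all(True) returns a list of every descendant tag, so the tree
    # can be modified while looping over that list.
    for tag in addr.find_all(True):
        if tag.name != "br":
            tag.unwrap()  # remove the tag itself but keep its children
    return addr.decode_contents()


# Assumed sample input; the real page markup is not shown in the question.
sample = ('<div class="addy"><strong>123 1st St.</strong><br/>'
          'Springfield<br/><span>USA</span></div>')
print(strip_tags_except_br(sample))
```

Tag.unwrap() removes a tag but leaves its children in the tree, and decode_contents() renders the inner markup of the div, so no string placeholder such as '^' is needed.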
def white_scrape(url):
    page = opener.open(url).read()
    soup = BeautifulSoup(page)
    addr = soup.find('div', { 'class': 'addy' })
    return ''.join(str(child) for child in addr.children)
Context
StackExchange Code Review Q#60867, answer score: 5