HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Getting rid of certain HTML tags

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
ridgettingtagscertainhtml

Problem

This code simply returns a small section of HTML code and then gets rid of all tags except for break tags.

It seems inefficient because you cannot search and replace with a beautiful soup object as you can with a Python string, so I was forced to switch it back and forth from a beautiful soup object to a string several times so I could use string functions and beautiful soup functions. It seems that there must be a simpler way to do this without switching back and forth between soup objects and strings.

def whiteScrape(webaddress):
     url = (webaddress)
     ourUrl = opener.open(url).read()

     soup = BeautifulSoup(ourUrl)
     soup = soup.find("div", { "class" : "addy" })
     #below code will delete tags except /br
     soup = str(soup)
     soup = soup.replace('' , '^')
     soup = BeautifulSoup(soup)
     soup = (soup.get_text())
     soup = str(soup)
     soup=soup.replace('^' , '')

     return soup


The original HTML is something like:


blah blah
blahblah
blah


The output should be:

blah blah
blahblah
blah

Solution

By PEP 8, whiteScrape() should be renamed to white_scrape().

Variable names should be purposeful. Therefore, it's a bad idea to continually redefine soup. If someone were to ask you to explain what the variable soup contained, you would have a hard time explaining.

It seems like what you want to do is to stringify the children of div(s) in question. One way to do that would be:

def white_scrape(url):
    page = opener.open(url).read()
    soup = BeautifulSoup(page)
    addr = soup.find('div', { 'class': 'addy' })
    return ''.join(str(child) for child in addr.children)


Note that this is not exactly equivalent to your original code. For example, if an address contains an element, such as 123 1st St., this solution would preserve the `` tag and its contents, whereas your original code would discard the tag but keep its contents and strip the tag.

Code Snippets

def white_scrape(url):
    page = opener.open(url).read()
    soup = BeautifulSoup(page)
    addr = soup.find('div', { 'class': 'addy' })
    return ''.join(str(child) for child in addr.children)

Context

StackExchange Code Review Q#60867, answer score: 5

Revisions (0)

No revisions yet.