Adding a new class to an HTML tag and writing it back with Beautiful Soup

Submitted by: @import:stackexchange-codereview

Problem

I am working on an HTML document in which I need to add certain classes to some elements. In the following code, I am adding the class img-responsive.

def add_img_class1(img_tag):
    try:
        img_tag['class'] = img_tag['class']+' img-responsive'   
    except KeyError:
        img_tag['class'] = 'img-responsive'
    return img_tag

def add_img_class2(img_tag):
    if img_tag.has_attr('class'):
        img_tag['class'] = img_tag['class']+' img-responsive'
    else:
        img_tag['class'] = 'img-responsive'
    return img_tag

soup = BeautifulSoup(myhtml)
for img_tag in soup.find_all('img'):    
    img_tag = add_img_class1(img_tag) #or img_tag = add_img_class2(img_tag)

html = soup.prettify(soup.original_encoding)
with open("edited.html","wb") as file:
    file.write(html)


  • Both functions do the same thing, but one uses exceptions and the other uses has_attr from BS4. Which is better, and why?

  • Am I writing back to HTML the right way? Or should I convert the entire soup to UTF-8 (with string.encode('UTF-8')) and write that?

Solution

The second option is better because the possible failure is explicit. That said, in many cases in Python you should follow EAFP ("easier to ask forgiveness than permission") and go for the try statement. Either way, we can do better.
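As a rough illustration of the two styles being compared, here is a minimal sketch that uses a plain dict as a stand-in for a tag's attributes. The function names and the sample dicts are hypothetical and not part of Beautiful Soup; the point is only that both styles end up with the same result.

```python
# A plain dict stands in for a tag's attributes here (illustration only).

def add_class_eafp(attrs, name):
    """EAFP: try the lookup and handle the missing key afterwards."""
    try:
        attrs['class'] = attrs['class'] + ' ' + name
    except KeyError:
        attrs['class'] = name


def add_class_lbyl(attrs, name):
    """LBYL: check for the key before touching it."""
    if 'class' in attrs:
        attrs['class'] = attrs['class'] + ' ' + name
    else:
        attrs['class'] = name


with_class = {'class': 'thumb'}
without_class = {}
add_class_eafp(with_class, 'img-responsive')
add_class_lbyl(without_class, 'img-responsive')
```

Both versions repeat the assignment in two branches, which is the duplication that a lookup with a default value removes.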

get(key, default)

In BeautifulSoup, a tag can be treated like a dictionary of its attributes. This means you can write img_tag.get('class', '') to get the class if it exists, or the empty string if it doesn't.

def add_img_class(img_tag):
    img_tag['class'] = img_tag.get('class', '') + ' img-responsive'


You don't need to return img_tag: the function receives a reference to the tag object, so setting an attribute on it changes the tag inside the soup as well. Now that your function is a one-liner, you might as well use that line directly in the loop.

Multi-valued attributes

Note that the above code doesn't work! class is a multi-valued attribute in HTML4 and HTML5, so BeautifulSoup 4 (at least) returns a list instead of a string, and adding a string to that list raises a TypeError whenever the tag already has a class. The correct code becomes:

img_tag['class'] = img_tag.get('class', []) + ['img-responsive']


Which is nicer, as you don't have to worry about the extra space between the two values.
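To see the list-based version end to end, here is a small sketch. The helper name add_class, the sample HTML, and the check that skips a class that is already present are assumptions added for illustration; the parser is set to Python's built-in 'html.parser' so that no extra parser library is needed.

```python
from bs4 import BeautifulSoup


def add_class(tag, class_name):
    """Append class_name to the tag's class list unless it is already there."""
    classes = tag.get('class', [])  # a list for multi-valued attributes
    if class_name not in classes:
        tag['class'] = classes + [class_name]


# Hypothetical sample document: one img without a class, one with.
html = '<div><img src="a.png"><img class="thumb" src="b.png"></div>'
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img'):
    add_class(img, 'img-responsive')

result = [img['class'] for img in soup.find_all('img')]
```

After the loop, each img carries a list of classes, and Beautiful Soup joins them with spaces when the document is turned back into a string.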

Encoding

You don't need to convert to UTF-8 before writing the file back. What's wrong with  ?
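One possible way to write the document back, sketched here under the assumption of Python 3 and a UTF-8 output file: open the file in text mode, name the encoding there, and write the soup as a string. The sample markup and the temporary output path are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample document containing a non-ASCII character.
soup = BeautifulSoup('<p>caf\u00e9</p>', 'html.parser')

# Write to a temporary directory so the sketch has no side effects elsewhere.
path = os.path.join(tempfile.mkdtemp(), 'edited.html')

# Text mode with an explicit encoding: the file object does the encoding.
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(soup))

# Read the file back to confirm the text survived the round trip.
with open(path, encoding='utf-8') as f:
    round_trip = f.read()
```

With this approach the soup itself is never encoded by hand; the only place the encoding appears is the call to open.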

Context

StackExchange Code Review Q#31523, answer score: 14
