
Scraping HTML using Beautiful Soup

Submitted by: @import:stackexchange-codereview

Problem

I have written a script using Beautiful Soup to scrape some HTML, process it, and produce HTML again. However, I am not convinced by my code and am looking for improvements.

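Structure of my source HTML file:

...
...
<section id="article-section-1">
...
<div class="heading">
Some Heading 1
</div>
<div class="content">
This section can have p, img, or even div tags
</div>
</section>
...
...
<section id="article-section-2">
...
<div class="heading">
Some Heading
</div>
<div class="content">
This section can have p, img, or even div tags
</div>
</section>
...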

My code:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(myhtml)
all_sections = soup.find_all('section', id=re.compile("article-section-[0-9]"))
for section in all_sections:
    heading = str(section.find_all('div', class_="heading")[0].text).strip()
    contents_list = section.find_all('div', class_="content")[0].contents
    content = ''
    for i in contents_list:
        if i != '\n':
            content = content + str(i)
    print ''+heading+''+content+''


My code works perfectly without any issues so far, however, I don't find it pythonic. I believe that it could be done in a much better/simpler way.

  1. contents_list is a list that contains items like '\n'. I remove them with a loop over the list. Is there a better way?
  2. I am not interested in the article icon, so I ignore it in my script.
  3. I use the strip method to remove extra whitespace from the heading. Is there a better way?
  4. Other than newlines, the div element within content can contain anything, even nested divs. So far, I have run my script over a few pages and it seems to work. Is there anything here I need to take care of?
  5. Lastly, is there a better way to generate HTML files? Once I have scraped the data, I will work on generating HTML files. These files will ha

Solution

Regarding 1, you can use a list comprehension:

new_content = [c for c in old_content if c != '\n']


or, if old_content is a single string rather than a list, simply:

new_content = old_content.replace('\n', '')
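
To connect the list comprehension back to the loop in the question, here is a minimal sketch that filters and joins in one step. It uses a plain list of strings as a stand-in for the items of a tag's .contents list, and join_children is a hypothetical helper name, not part of Beautiful Soup.

```python
def join_children(children):
    """Concatenate child nodes as strings, skipping bare newline nodes."""
    # str() mirrors the original loop, which called str(i) on each child.
    return ''.join(str(child) for child in children if child != '\n')


# Plain strings stand in for the items of a Tag's .contents list here.
sample_children = ['\n', '<p>First paragraph</p>', '\n', '<img src="a.png"/>', '\n']
content = join_children(sample_children)
print(content)
```

The check child != '\n' only drops nodes that are exactly one newline. Testing str(child).strip() instead would also drop whitespace-only text nodes, if that is what you want.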


Regarding 5, if you are generating anything nontrivial, it will pay off to learn a template engine such as Jinja2. If that is too much, you can put a simple template in a text file and use a regex or even replace() to substitute the generic parts:

# template
<div class="some value">%FOO%</div>
<div class="some value">%BAR%</div>


and on the Python side:

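values = {"%FOO%": "the foos", "%BAR%": "the bars"}
# Read the template and close the file when done.
with open('template') as f:
    template = f.read()
# Substitute each placeholder with its value.
for k, v in values.items():
    template = template.replace(k, v)
print(template)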
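
The answer mentions a regex as an alternative to repeated replace() calls. Below is a minimal sketch of that idea, assuming placeholders always have the form %NAME% with upper-case names. The render function and the inline template string are hypothetical and chosen only for illustration.

```python
import re

# Matches placeholders such as %FOO% (assumed format: upper-case letters,
# digits and underscores between two percent signs).
PLACEHOLDER = re.compile(r'%([A-Z0-9_]+)%')


def render(template, values):
    """Replace each %NAME% with values['NAME'] in a single pass."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched rather than raising.
        return values.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, template)


template = '<div class="some value">%FOO%</div>\n<div class="some value">%BAR%</div>'
result = render(template, {'FOO': 'the foos', 'BAR': 'the bars'})
print(result)
```

Unlike the replace() loop, this scans the template once, so a substituted value that happens to contain %BAR% is not expanded again. Values are inserted verbatim, without HTML escaping.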

Code Snippets

new_content = [c for c in old_content if c != '\n']

new_content = old_content.replace('\n', '')

# template
<div class="some value">%FOO%</div>
<div class="some value">%BAR%</div>

values = {"%FOO%": "the foos", "%BAR%": "the bars"}
with open('template') as f:
    template = f.read()
for k, v in values.items():
    template = template.replace(k, v)
print(template)

Context

StackExchange Code Review Q#30992, answer score: 4
