Scraping HTML using Beautiful Soup
Problem
I have written a script using Beautiful Soup that scrapes some HTML, processes it, and produces HTML back. However, I am not satisfied with my code and I am looking for improvements.
Structure of my source HTML file:
...
...
Some Heading 1
This section can have p, img, or even div tags
...
...
Some Heading
This section can have p, img, or even div tags
...
My code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(myhtml)
all_sections = soup.find_all('section', id=re.compile("article-section-[0-9]"))
for section in all_sections:
    heading = str(section.find_all('div', class_="heading")[0].text).strip()
    contents_list = section.find_all('div', class_="content")[0].contents
    content = ''
    for i in contents_list:
        if i != '\n':
            content = content + str(i)
    print(''+heading+''+content+'')
My code works perfectly without any issues so far, however, I don't find it pythonic. I believe that it could be done in a much better/simpler way.
1. contents_list is a list that contains items like '\n'. I remove them with a loop over the list. Is there a better way?
2. I am not interested in the article icon, so I am ignoring it in my script.
3. I am using the strip method to remove extra whitespace in the heading. Is there a better way?
4. Other than newlines, the div element within content can contain anything, even nested divs. So far, I have run my script over a few pages and it seems to work. Is there anything here I need to take care of?
5. Lastly, is there a better way to generate HTML files? Once I have scraped the data, I will work on generating HTML files. These files will ha
Solution
Regarding 1, you can filter the list with a comprehension:
new_content = [c for c in old_content if c != '\n']
or, if the content is already a single string, simply:
new_content = old_content.replace('\n', '')
Regarding 5, if you are generating anything nontrivial, it will pay off to learn a template engine such as Jinja2. If that is too much, you can put a simple template in a text file and use a regex, or even replace(), to substitute the generic parts:
# template
<div class="some value">%FOO%</div>
<div class="some value">%BAR%</div>
and on the Python side:
values = {"%FOO%": "the foos", "%BAR%": "the bars"}
template = open('template').read()
for k, v in values.items():
    template = template.replace(k, v)
print(template)
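The placeholder-substitution idea can be tried without a template file. The following is only a minimal sketch, assuming Python 3: the inline TEMPLATE string, the render helper, and the use of html.escape on the values are illustrative additions, not part of the original answer.

```python
from html import escape

# Hypothetical inline template; the answer reads it from a text file instead.
TEMPLATE = (
    '<div class="some value">%FOO%</div>\n'
    '<div class="some value">%BAR%</div>\n'
)


def render(template, values):
    """Replace each placeholder key in `template` with its HTML-escaped value."""
    for placeholder, value in values.items():
        # escape() is an added precaution (an assumption, not in the answer):
        # it keeps characters such as "<" in a value from being read as markup.
        template = template.replace(placeholder, escape(value))
    return template


page = render(TEMPLATE, {"%FOO%": "the foos", "%BAR%": "a < b"})
print(page)
```

Escaping is only appropriate when the values are plain text. If a value is itself an HTML fragment, such as the scraped content of a div, it would have to be inserted unescaped, which is one reason a real template engine tends to pay off for anything nontrivial.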
Context
StackExchange Code Review Q#30992, answer score: 4