Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag element
Problem
I am trying to convert a BeautifulSoup4 HTML table to a list of lists, iterating over each Tag element and handling it accordingly.
I have an implementation that works at a surface level. However, the code is getting repetitive and needlessly complicated, and every time I try to improve it, I end up breaking the functionality. I need some guidance on tidying this up.
Ultimately, I separate each type of HTML tag for any given row cell. The goal is to re-format the contents of the tables into an Excel spreadsheet and do partial cell formatting (still a work in progress, using xlwt).
Note: I've left out as much of the parsing as possible, keeping just enough to give an idea.
```
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString


def handle_bs4_element(element):
    if isinstance(element, Tag):
        if len(element.contents) > 1:
            # Handle each element separately and return a list? What if more elements are nested? Recursive call?
            _res = []
            for e_content in element.contents:
                _res.append(handle_bs4_element(e_content))
            if len(_res) == 1:
                return _res[0]
            else:
                return _res
        else:
            tag_name = element.name
            if tag_name == 'td':
                _res = []
                for td_content in element.contents:
                    _res.append(handle_bs4_element(td_content))
                if len(_res) == 1:
                    return _res[0]
                else:
                    return _res
            elif tag_name in ('div', 'span'):
                # This will probably contain more nested tags...
                _res = []
                for td_content in element.contents:
                    _res.append(handle_bs4_element(td_content))
                if len(_res) == 1:
                    return _res[0]
                else:
                    return _res
            # elif tag_n ... (the listing is cut off here in the source)
```
Solution
Here is the list of things I would think about to improve:
- You are doubling the calls to handle_bs4_element() here:

  ```
  data.append([handle_bs4_element(rc) for rc in row_cells if handle_bs4_element(rc)])
  ```

  Instead, you can either allow "falsy" values for the row cells and filter them afterwards, or expand the loop:

  ```
  result = []
  for rc in row_cells:
      cell_text = handle_bs4_element(rc)
      if cell_text:
          result.append(cell_text)
  data.append(result)
  ```

- The DRY principle. There are several repeated blocks of code, like:

  ```
  if len(_res) == 1:
      return _res[0]
  else:
      return _res
  ```

- Using list comprehensions is not only more Pythonic, but usually also faster than an equivalent loop of append() calls. E.g. you can replace:

  ```
  _res = []
  for td_content in element.contents:
      _res.append(handle_bs4_element(td_content))
  ```

  with:

  ```
  _res = [handle_bs4_element(td_content) for td_content in element.contents]
  ```

- You can use the short if/else one-liner, replacing:

  ```
  if len(_res) == 1:
      return _res[0]
  else:
      return _res
  ```

  with:

  ```
  return _res[0] if len(_res) == 1 else _res
  ```

- Variable naming. _res should not start with an underscore. A leading underscore conventionally marks private class or instance attributes, not regular local variables. _res should probably be called result, or maybe cell_data?

- If you are going to have more of this kind of tag-specific processing logic, continuing to add it as yet another elif would hurt readability and would not scale well. Consider applying the "Extract Method" refactoring and defining a separate function for each of the cases.

- Instead of using the .contents list directly, look into using .get_text(), which collects an element's text, including the text of its children, recursively. Not sure if this is applicable to your problem.

- Or, instead of the .contents list, you can use the .children iterator.

As a side note, there is also a simpler way to parse HTML tables: pandas.read_html(), which loads an HTML table into a DataFrame. You can then easily dump the DataFrame into a list, into CSV, or into an Excel file directly. For example, the following code:

```
from pprint import pprint

import pandas as pd

df = pd.read_html('table_sample.html')[0]  # get the first parsed dataframe
pprint(df.values.tolist())
```

would automagically produce:

```
[[nan, 'Description', 'Col 1', 'Col 2', 'Col 3'],
 [1.0, 'Some paragraph text', 'x', '5', '2'],
 [2.0, 'HEADER 1', nan, nan, nan],
 [3.0, 'Some text: (1) Check out this Figure 1.0.', 'x', '2', '1'],
 [4.0, '(2) Some more text', 'x', '2', '1'],
 [5.0, '(3) Additional text', 'x', '2', '1'],
 [6.0, '(4) A bit more text', 'x', '2', '1'],
 [7.0, '(5) A span Figure 1.0 for edited text. At this point the span starts again', 'x', '2', '1'],
 [8.0, 'HEADER 2', nan, nan, nan],
 [9.0, 'Weird formatting, because Confluence', 'x', '4', '2'],
 [10.0, 'HEADER 3', nan, nan, nan],
 [11.0, 'A paragraph about header 3. This is just silly. Strong indeed.', 'x', '3', '3'],
 [12.0, 'Something about things or what not. Why is this in a span?', 'x', '2', '2'],
 [13.0, 'HEADER 4', nan, nan, nan],
 [14.0, 'Section 4 baby! Or header. Confluence formatting fun.', 'x', '2', '3'],
 [15.0, 'Pretty boring span of text', 'x', '2', '2'],
 [16.0, 'HEADER 5', nan, nan, nan],
 [17.0, 'A big paragraph describing more stuff. Super exciting.', 'x', '4', '2']]
```
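Going back to the BeautifulSoup route, the review points listed earlier could be combined roughly as follows. This is only a sketch under assumptions, not the asker's final code: the helper names (unwrap_single, handle_children, handle_text, handle_link, TAG_HANDLERS, table_to_rows) and the sample HTML are made up for illustration, and the special handling of `a` tags is a hypothetical example of a tag-specific case, since the original elif chain is cut off.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up sample table, only for demonstration.
SAMPLE_HTML = """
<table>
  <tr><th>Col 1</th><th>Col 2</th></tr>
  <tr><td><span>a</span><span>b</span></td><td> </td><td><a href="#">Figure 1.0</a></td></tr>
</table>
"""


def unwrap_single(items):
    """Return the only item of a one-element list, otherwise the list itself."""
    return items[0] if len(items) == 1 else items


def handle_text(element):
    """Return the whitespace-stripped text of a NavigableString."""
    return str(element).strip()


def handle_children(element):
    """Process each child once, drop empty results, unwrap single results."""
    results = [handle_bs4_element(child) for child in element.children]
    return unwrap_single([result for result in results if result])


def handle_link(element):
    """Hypothetical tag-specific case: flatten a link to its text."""
    return element.get_text(strip=True)


# One extracted function per tag instead of a growing elif chain.
# Tags not listed here fall back to handle_children.
TAG_HANDLERS = {
    "a": handle_link,
}


def handle_bs4_element(element):
    if isinstance(element, NavigableString):
        return handle_text(element)
    if isinstance(element, Tag):
        handler = TAG_HANDLERS.get(element.name, handle_children)
        return handler(element)
    return None


def table_to_rows(table):
    """Convert a <table> Tag to a list of lists, calling the handler once per cell."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [handle_bs4_element(cell) for cell in tr.find_all(["td", "th"], recursive=False)]
        rows.append([cell for cell in cells if cell])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
print(rows)
```

The dispatch dictionary keeps handle_bs4_element short: supporting another tag means writing one small function and adding one entry, rather than touching the recursion itself. Note that, like the expanded loop in the review, this drops empty cells, which may shift columns if the spreadsheet layout depends on cell positions.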
Context
StackExchange Code Review Q#154659, answer score: 4