
Convert a BeautifulSoup4 HTML table to a list of lists, iterating over each Tag element

Submitted by: @import:stackexchange-codereview

Problem

I am trying to convert a BeautifulSoup4 HTML table to a list of lists, iterating over each Tag element and handling it accordingly.

I have an implementation that works at a surface level using BeautifulSoup4. However, the code is getting repetitive and needlessly complicated, and every time I try to improve it, I end up breaking the functionality. I need some guidance on tidying this up.

Ultimately, I handle each type of HTML tag separately for any given row cell. The goal is to re-format the contents of the tables into an Excel spreadsheet and do partial cell formatting (still a work in progress, using xlwt).

Note: I've left out as much of the parsing as possible, keeping just enough to give an idea.

```
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString

def handle_bs4_element(element):
    if isinstance(element, Tag):
        if len(element.contents) > 1:
            # Handle each element separately and return a list? What if more elements are nested? Recursive call?
            _res = []
            for e_content in element.contents:
                _res.append(handle_bs4_element(e_content))
            if len(_res) == 1:
                return _res[0]
            else:
                return _res
        else:
            tag_name = element.name
            if tag_name == 'td':
                _res = []
                for td_content in element.contents:
                    _res.append(handle_bs4_element(td_content))
                if len(_res) == 1:
                    return _res[0]
                else:
                    return _res
            elif tag_name in ('div', 'span'):
                # This will probably contain more nested tags...
                _res = []
                for td_content in element.contents:
                    _res.append(handle_bs4_element(td_content))
                if len(_res) == 1:
                    return _res[0]
                else:
                    return _res
            elif tag_n
```

Solution

Here is a list of things I would consider improving:

- You are calling handle_bs4_element() twice for every cell here:

data.append([handle_bs4_element(rc) for rc in row_cells if handle_bs4_element(rc)])


Instead, you can either allow "falsy" values for the row cells and filter them afterwards, or expand the loop:

result = []
for rc in row_cells:
    cell_text = handle_bs4_element(rc)
    if cell_text:
        result.append(cell_text)
data.append(result)


- The DRY principle. There are several repeated blocks of code, like:

if len(_res) == 1:
    return _res[0]
else:
    return _res


- Using a list comprehension is not only more Pythonic, but usually also a little faster, since it avoids the repeated .append() lookups and calls. E.g. you can replace:

_res = []
for td_content in element.contents:
    _res.append(handle_bs4_element(td_content))


with:

_res = [handle_bs4_element(td_content) for td_content in element.contents]


- You can use a conditional expression (the one-line if/else), replacing:

if len(_res) == 1:
    return _res[0]
else:
    return _res


with:

return _res[0] if len(_res) == 1 else _res


- Variable naming. _res should not start with an underscore: a leading underscore conventionally marks private class or instance attributes, not regular local variables. _res should probably be called result, or maybe cell_data?

- If you are going to have more of this kind of tag-specific processing logic, adding yet another elif for each case would hurt readability and would not scale well. Consider applying the "Extract Method" refactoring and defining a separate function for each of the cases.

- Instead of using the .contents list directly, look into using .get_text(), which collects an element's text, including the text of all its descendants, recursively. I am not sure whether it is applicable to your problem.

- Or, instead of the .contents list, you can iterate over the .children generator.
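
Putting several of the points above together, a minimal sketch might look like the one below. It is only an illustration under assumptions: unwrap(), handle_children(), handle_link(), the TAG_HANDLERS table and the sample HTML are hypothetical names and data that do not appear in the question, and the handling of <a> tags is invented to show how a tag-specific case could be extracted into its own function.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def unwrap(items):
    """Return the sole item of a one-element list, otherwise the list itself."""
    return items[0] if len(items) == 1 else items


def handle_children(element):
    """Process each child exactly once and drop empty results."""
    results = (handle_bs4_element(child) for child in element.children)
    return unwrap([result for result in results if result])


def handle_link(element):
    """Hypothetical tag-specific handler: keep the link text and its target."""
    return (element.get_text(strip=True), element.get("href"))


# Hypothetical dispatch table; tags not listed here fall back to handle_children.
TAG_HANDLERS = {"a": handle_link}


def handle_bs4_element(element):
    if isinstance(element, NavigableString):
        return str(element).strip()
    if isinstance(element, Tag):
        handler = TAG_HANDLERS.get(element.name, handle_children)
        return handler(element)
    return None


# Made-up sample table, only for demonstration.
HTML = """
<table>
  <tr><td>1</td><td><div><span>Some text</span> <a href="#fig1">Figure 1.0</a></div></td></tr>
  <tr><td>2</td><td><p>HEADER 1</p></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

data = []
for row in soup.find_all("tr"):
    # Call the handler once per cell, then filter out the falsy results.
    cells = [handle_bs4_element(cell) for cell in row.find_all("td")]
    data.append([cell for cell in cells if cell])

# The .get_text() alternative: one flat string per cell, structure discarded.
flat = [
    [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]

print(data)
print(flat)
```

The dispatch dictionary is just one way to avoid a growing elif chain. Whether the nested result in data or the flat strings in flat is more useful depends on how much per-tag formatting has to survive into the spreadsheet.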

As a side note, there is also a simpler way to parse HTML tables: pandas.read_html(), which loads every HTML table it finds into a DataFrame and returns them as a list. You can then easily dump a DataFrame into a list, into CSV, or into an Excel file directly. For example, the following code:

from pprint import pprint

import pandas as pd

df = pd.read_html('table_sample.html')[0]  # get the first parsed dataframe
pprint(df.values.tolist())


Would automagically produce:

[[nan, 'Description', 'Col 1', 'Col 2', 'Col 3'],
 [1.0, 'Some paragraph text', 'x', '5', '2'],
 [2.0, 'HEADER 1', nan, nan, nan],
 [3.0, 'Some text: (1) Check out this Figure 1.0.', 'x', '2', '1'],
 [4.0, '(2) Some more text', 'x', '2', '1'],
 [5.0, '(3) Additional text', 'x', '2', '1'],
 [6.0, '(4) A bit more text', 'x', '2', '1'],
 [7.0, '(5) A span Figure 1.0 for  edited text. At this point the span starts again', 'x', '2', '1'],
 [8.0, 'HEADER 2', nan, nan, nan],
 [9.0, 'Weird formatting, because Confluence', 'x', '4', '2'],
 [10.0, 'HEADER 3', nan, nan, nan],
 [11.0, 'A paragraph about header 3.  This is just silly. Strong indeed.', 'x', '3', '3'],
 [12.0, 'Something about things or what not. Why is this in a span?', 'x', '2', '2'],
 [13.0, 'HEADER 4', nan, nan, nan],
 [14.0, 'Section 4 baby! Or header.  Confluence formatting fun.', 'x', '2', '3'],
 [15.0, 'Pretty boring span of text', 'x', '2', '2'],
 [16.0, 'HEADER 5', nan, nan, nan],
 [17.0, 'A big paragraph describing more stuff. Super exciting.', 'x', '4', '2']]
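
For a self-contained variant of the same idea, here is a hedged sketch that parses an inline table instead of a file. The sample HTML is made up, and the sketch assumes that one of the parsers pandas.read_html() relies on (lxml, or BeautifulSoup4 together with html5lib) is installed. Replacing NaN with None is an optional extra step, not something the code above does.

```python
from io import StringIO

import pandas as pd

# Made-up sample table; the first row uses <th>, so pandas treats it as the header.
HTML = """
<table>
  <tr><th>Description</th><th>Col 1</th></tr>
  <tr><td>Some text</td><td>5</td></tr>
  <tr><td>HEADER 1</td><td></td></tr>
</table>
"""

# read_html() returns a list of DataFrames, one per table found.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

# Optional: turn NaN (empty cells) into None before exporting as plain lists.
rows = df.astype(object).where(df.notna(), None).values.tolist()

print(list(df.columns))
print(rows)
```

Wrapping the markup in StringIO keeps the example independent of any file on disk. Passing a path such as 'table_sample.html', as in the snippet above, works the same way.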


Context

StackExchange Code Review Q#154659, answer score: 4
