HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Scrape an HTML table with python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withscrapepythonhtmltable

Problem

I think I'm on the right track, but ANY suggestions or critques are welcome. The program just scrapes an HTML table and prints it to stdout.

import sys
from urllib2 import urlopen, URLError
from argparse import ArgumentParser
from bs4 import BeautifulSoup

def parse_arguments():
    """ Process command line arguments """
    parser = ArgumentParser(description='Grabs tables from html')
    parser.add_argument('-u', '--url', help='url to grab from',
                        required=True)
    args = parser.parse_args()
    return args

def parse_rows(rows):
    """ Get data from rows """
    results = []
    for row in rows:
        table_headers = row.find_all('th')
        if table_headers:
            results.append([headers.get_text() for headers in table_headers])

        table_data = row.find_all('td')
        if table_data:
            results.append([data.get_text() for data in table_data])
    return results

def main():
    # Get arguments
    args = parse_arguments()
    if args.url:
        url = args.url

    # Make soup
    try:
        resp = urlopen(url)
    except URLError as e:
        print 'An error occured fetching %s \n %s' % (url, e.reason)   
        return 1
    soup = BeautifulSoup(resp.read())

    # Get table
    try:
        table = soup.find('table')
    except AttributeError as e:
        print 'No tables found, exiting'
        return 1

    # Get rows
    try:
        rows = table.find_all('tr')
    except AttributeError as e:
        print 'No table rows found, exiting'
        return 1

    # Get data
    table_data = parse_rows(rows)

    # Print data
    for i in table_data:
       print '\t'.join(i)

if __name__ == '__main__':
    status = main()
    sys.exit(status)

Solution

In general, your code is very well written. PEP8 compliance is very good, names make sense, you have docstrings, you use if __name__ == '__main__:.

Here's a few comments on what I would do differently:

# Get table
try:
    table = soup.find('table')
except AttributeError as e:
    print 'No tables found, exiting'
    return 1

# Get rows
try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'
    return 1


This is essentially the same code duplicated twice. The fact that you differentiate between not finding a table and not finding rows within it is nice, but ultimately useless, as a table with nothing in it might as well be no table at all.

Therefore, I propose changing it to detect a valid table, that is one with rows. This tidies up the code a bit like so:

try:
    table = soup.find('table')
    rows = table.find_all('tr')
except AttributeError as e:
    raise ValueError("No valid table found")


You'll notice I eliminated the print and the return 1. Errors should be errors, they shouldn't magically become print statements. The only exception to this is if this is supposed to be for end users, then you will want to retain the print. Secondly, sys.exit() is unnecessary as the script will end automatically when it reaches the end of the file. Therefore, all the return 1's are unnecessary and are extra fluff.

Lastly, let's talk about your comments. Most of them are useless.

# Make soup


Is that really the best description? You are retrieving a url and reading the HTML. THAT should be your comment.

# Get table

# Get rows


As I said, I think these two can be wonderfully merged. Also, I'm pretty sure the names are more than sufficient to understand that you are retrieving the table and the rows of the table.

# Get data
table_data = parse_rows(rows)


Hmm... what else could table_data be? This is a comment that seems pretty useless and adds un-needed fluff.

# Print data


This one I actually think should be more descriptive, such as:

# Print table data


As indicated in the comments, it is suggested that you remove all of your comments, as they are largely plain fluff considering your naming is so well done.

Another last thing to note is that

if args.url:


Is a completely useless check. It is guaranteed that args.url will always exist as per how the argparse module functions. If you made the url an optional parameter, that would be a different story, but then your program would be useless.

Code Snippets

# Get table
try:
    table = soup.find('table')
except AttributeError as e:
    print 'No tables found, exiting'
    return 1

# Get rows
try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'
    return 1
try:
    table = soup.find('table')
    rows = table.find_all('tr')
except AttributeError as e:
    raise ValueError("No valid table found")
# Make soup
# Get table

# Get rows
# Get data
table_data = parse_rows(rows)

Context

StackExchange Code Review Q#60769, answer score: 9

Revisions (0)

No revisions yet.