patternpythonMinor
Scrape an HTML table with python
Viewed 0 times
withscrapepythonhtmltable
Problem
I think I'm on the right track, but ANY suggestions or critques are welcome. The program just scrapes an HTML table and prints it to stdout.
import sys
from urllib2 import urlopen, URLError
from argparse import ArgumentParser
from bs4 import BeautifulSoup
def parse_arguments():
""" Process command line arguments """
parser = ArgumentParser(description='Grabs tables from html')
parser.add_argument('-u', '--url', help='url to grab from',
required=True)
args = parser.parse_args()
return args
def parse_rows(rows):
""" Get data from rows """
results = []
for row in rows:
table_headers = row.find_all('th')
if table_headers:
results.append([headers.get_text() for headers in table_headers])
table_data = row.find_all('td')
if table_data:
results.append([data.get_text() for data in table_data])
return results
def main():
# Get arguments
args = parse_arguments()
if args.url:
url = args.url
# Make soup
try:
resp = urlopen(url)
except URLError as e:
print 'An error occured fetching %s \n %s' % (url, e.reason)
return 1
soup = BeautifulSoup(resp.read())
# Get table
try:
table = soup.find('table')
except AttributeError as e:
print 'No tables found, exiting'
return 1
# Get rows
try:
rows = table.find_all('tr')
except AttributeError as e:
print 'No table rows found, exiting'
return 1
# Get data
table_data = parse_rows(rows)
# Print data
for i in table_data:
print '\t'.join(i)
if __name__ == '__main__':
status = main()
sys.exit(status)Solution
In general, your code is very well written. PEP8 compliance is very good, names make sense, you have docstrings, you use
Here's a few comments on what I would do differently:
This is essentially the same code duplicated twice. The fact that you differentiate between not finding a table and not finding rows within it is nice, but ultimately useless, as a table with nothing in it might as well be no table at all.
Therefore, I propose changing it to detect a valid table, that is one with rows. This tidies up the code a bit like so:
You'll notice I eliminated the print and the
Lastly, let's talk about your comments. Most of them are useless.
Is that really the best description? You are retrieving a url and reading the HTML. THAT should be your comment.
As I said, I think these two can be wonderfully merged. Also, I'm pretty sure the names are more than sufficient to understand that you are retrieving the table and the rows of the table.
Hmm... what else could
This one I actually think should be more descriptive, such as:
As indicated in the comments, it is suggested that you remove all of your comments, as they are largely plain fluff considering your naming is so well done.
Another last thing to note is that
Is a completely useless check. It is guaranteed that args.url will always exist as per how the
if __name__ == '__main__:.Here's a few comments on what I would do differently:
# Get table
try:
table = soup.find('table')
except AttributeError as e:
print 'No tables found, exiting'
return 1
# Get rows
try:
rows = table.find_all('tr')
except AttributeError as e:
print 'No table rows found, exiting'
return 1This is essentially the same code duplicated twice. The fact that you differentiate between not finding a table and not finding rows within it is nice, but ultimately useless, as a table with nothing in it might as well be no table at all.
Therefore, I propose changing it to detect a valid table, that is one with rows. This tidies up the code a bit like so:
try:
table = soup.find('table')
rows = table.find_all('tr')
except AttributeError as e:
raise ValueError("No valid table found")You'll notice I eliminated the print and the
return 1. Errors should be errors, they shouldn't magically become print statements. The only exception to this is if this is supposed to be for end users, then you will want to retain the print. Secondly, sys.exit() is unnecessary as the script will end automatically when it reaches the end of the file. Therefore, all the return 1's are unnecessary and are extra fluff.Lastly, let's talk about your comments. Most of them are useless.
# Make soupIs that really the best description? You are retrieving a url and reading the HTML. THAT should be your comment.
# Get table
# Get rowsAs I said, I think these two can be wonderfully merged. Also, I'm pretty sure the names are more than sufficient to understand that you are retrieving the table and the rows of the table.
# Get data
table_data = parse_rows(rows)Hmm... what else could
table_data be? This is a comment that seems pretty useless and adds un-needed fluff.# Print dataThis one I actually think should be more descriptive, such as:
# Print table dataAs indicated in the comments, it is suggested that you remove all of your comments, as they are largely plain fluff considering your naming is so well done.
Another last thing to note is that
if args.url:Is a completely useless check. It is guaranteed that args.url will always exist as per how the
argparse module functions. If you made the url an optional parameter, that would be a different story, but then your program would be useless.Code Snippets
# Get table
try:
table = soup.find('table')
except AttributeError as e:
print 'No tables found, exiting'
return 1
# Get rows
try:
rows = table.find_all('tr')
except AttributeError as e:
print 'No table rows found, exiting'
return 1try:
table = soup.find('table')
rows = table.find_all('tr')
except AttributeError as e:
raise ValueError("No valid table found")# Make soup# Get table
# Get rows# Get data
table_data = parse_rows(rows)Context
StackExchange Code Review Q#60769, answer score: 9
Revisions (0)
No revisions yet.