Scraping a table from Texas Dept. of Criminal Justice website
Problem
The script scrapes a table from the website mentioned, looks at the last two columns, collects that information, and then sorts it (returning the largest county, and the set of races with their counts). Basically, I just want to know in which ways I can make this script align more closely with the Zen of Python, and whether there are any optimizations I could make to the script to make it a little faster. I also know that Python's goal is not to be fast, but if I can make the script faster, I don't see why not (though the main priority is to make it Pythonic).
```
#!/usr/bin/env python
"""Scraping information from https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"""
import requests, time
from bs4 import BeautifulSoup
from collections import defaultdict

url = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
county = {}


def counties(countyArg):
    """Fills the `county` variable"""
    f = open("counties.txt")
    for line in f.readlines():
        countyArg[line.replace('\n', '').rstrip()] = 0


def scrape(urlArg, countyArg):
    """Scrape Site Based on HTML"""
    races = defaultdict(int)
    page = requests.get(urlArg)
    html = page.content
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    for row in table.findAll('tr')[1:]:
        if row.findAll('td')[8].text == 'Black':
            races['black'] += 1
        elif row.findAll('td')[8].text == 'White':
            races['white'] += 1
        elif row.findAll('td')[8].text == 'Hispanic':
            races['hispanic'] += 1
        countyArg[row.findAll('td')[9].text.encode('ascii').rstrip()] += 1
    return races, county


def find_largest(countyArg):
    """find largest county"""
    largestCounty = ''
    largestInmates = 0
    for key, value in countyArg.iteritems():
        if largestInmates < value:
            largestInmates = value
            largestCounty = key
    return largestCounty, largestInmates


if __name__ == '__main__':
```
Solution
In addition to PEP 20, the Zen of Python, which you already seem to be familiar with, another must-read (and follow) for Python coders is PEP 8, the style guide. Many of its recommendations apply to your code; please read it carefully, I won't repeat them here.
Working with resources
In `counties`, you open a file for reading but you never close it. You should always close the resources and file handles you open. But actually, in Python, you should use the `with` syntax, like this:

```
with open("counties.txt") as fh:
    for line in fh.readlines():
        countyArg[line.replace('\n', '').rstrip()] = 0
```

In this form, you don't actually need to close: Python will automatically close the file handle after leaving the `with` block.
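As an aside, the same loading step can be condensed into a dict comprehension while keeping the `with` block. A minimal sketch, assuming `counties.txt` holds one county name per line (the name `county_tally` is just for illustration):

```
with open("counties.txt") as fh:
    # one county name per line; rstrip() already removes the trailing newline
    county_tally = {line.rstrip(): 0 for line in fh}
```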
Why out-parameters?

It's not clear why you opted for initializing an empty `county` dictionary and then filling it in the `counties` method. It would be more natural to let `counties` create the dictionary and return it.
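For contrast, a tiny sketch of the two calling styles (illustrative only):

```
# out-parameter style (current): the caller must pre-create and pass in the dict
county = {}
counties(county)

# return-value style (suggested): the function owns the creation
county = counties()
```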
Avoid global variables

`county` is declared at the outermost scope, effectively making it a global variable. Avoid global variables as much as possible. As hinted earlier, it would be better to create the dictionary inside a function, and pass it around to the functions that need it. For example:

```
def load_counties(path):
    counties = {}
    with open(path) as fh:
        for line in fh.readlines():
            counties[line.replace('\n', '').rstrip()] = 0
    return counties

# ....

def main():
    counties = load_counties("counties.txt")
    scrape(url, counties)
    # ....
```
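Putting those pieces together, here is a sketch of how the whole `main` could be wired up. This is illustrative only, and assumes `scrape` is adjusted to return both tallies for the counties dictionary it receives:

```
def main():
    counties = load_counties("counties.txt")
    races, counties = scrape(url, counties)  # assumed to return both tallies
    largest_county, largest_inmates = find_largest(counties)
    print(dict(races))
    print(largest_county, largest_inmates)


if __name__ == '__main__':
    main()
```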
Avoid code duplication

Avoid repeated logic like this:

```
for row in table.findAll('tr')[1:]:
    if row.findAll('td')[8].text == 'Black':
        races['black'] += 1
    elif row.findAll('td')[8].text == 'White':
        races['white'] += 1
    elif row.findAll('td')[8].text == 'Hispanic':
        races['hispanic'] += 1
```

Not only is it error-prone to repeat `row.findAll('td')[8].text` many times, it's also inefficient: Beautiful Soup has to find all `td` elements multiple times. Refactor to do this only once:

```
for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race == 'Black':
        races['black'] += 1
    elif race == 'White':
        races['white'] += 1
    elif race == 'Hispanic':
        races['hispanic'] += 1
```

We can also go one step further:

```
for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race in ('Black', 'White', 'Hispanic'):
        races[race.lower()] += 1
```
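One more step in the same direction (a suggestion of my own): the loop still calls `row.findAll('td')` twice per row, once for the race column and once for the county column, so the cell list can be cached too. This sketch uses a plain `.strip()` on the county cell in place of the original `.encode('ascii').rstrip()`:

```
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')  # look up the cells only once per row
    race = cells[8].text
    if race in ('Black', 'White', 'Hispanic'):
        races[race.lower()] += 1
    countyArg[cells[9].text.strip()] += 1  # county column, as in the original loop
```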
Consider using collections.Counter

The `collections` package has a class that can help you here: `Counter`. Instead of doing the counting by yourself, this handy little utility can do it for you; it goes something like this:

```
from collections import Counter

# ...
races = []
for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race in ('Black', 'White', 'Hispanic'):
        races.append(race.lower())
return Counter(races)
```

That is, from a list like `['black', 'white', 'black', 'white', 'black', 'hispanic']` it will create `Counter({'black': 3, 'white': 2, 'hispanic': 1})`.
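`Counter` also makes the "largest county" question easy, since `most_common` returns `(item, count)` pairs sorted by count. A quick sketch with made-up numbers:

```
from collections import Counter

county_counts = Counter({'harris': 128, 'dallas': 55, 'bexar': 46})  # made-up numbers
largest_county, largest_inmates = county_counts.most_common(1)[0]
print(largest_county, largest_inmates)  # harris 128
```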
Use max to find the maximum entry of a dictionary

`find_largest` can be simplified using `max`:

```
def find_largest(counties):
    return max(counties.items(), key=lambda pair: pair[1])
```
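For illustration, calling the simplified version looks like this; it returns a `(county, count)` tuple, matching the two values the original function returned (numbers are made up):

```
counties = {'harris': 128, 'dallas': 55, 'bexar': 46}  # made-up numbers
largest_county, largest_inmates = find_largest(counties)
print(largest_county, largest_inmates)  # harris 128
```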
with open("counties.txt") as fh:
for line in fh.readlines():
countyArg[line.replace('\n', '').rstrip()] = 0load_counties(path):
counties = {}
with open(path) as fh:
for line in fh.readlines():
counties[line.replace('\n', '').rstrip()] = 0
return counties
# ....
def main():
counties = load_counties("counties.txt")
scrape(url, counties)
# ....for row in table.findAll('tr')[1:]:
if row.findAll('td')[8].text == 'Black':
races['black'] += 1
elif row.findAll('td')[8].text == 'White':
races['white'] += 1
elif row.findAll('td')[8].text == 'Hispanic':
races['hispanic'] += 1for row in table.findAll('tr')[1:]:
race = row.findAll('td')[8].text
if race == 'Black':
races['black'] += 1
elif race == 'White':
races['white'] += 1
elif race == 'Hispanic':
races['hispanic'] += 1for row in table.findAll('tr')[1:]:
race = row.findAll('td')[8].text
if race in ('Black', 'White', 'Hispanic'):
races[race.lower()] += 1Context
StackExchange Code Review Q#99300, answer score: 3