HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Scraping a table from Texas Dept. of Criminal Justice website

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
criminalwebsitejusticescrapingtexasfromtabledept

Problem

The script scrapes a table from the website mentioned, looks at the last 2 columns, takes that information, and then sorts it (and then returns the largest county, and the set of races and their numbers). But, basically, I just wanted to know in which ways I can make this script parallel more closely with the Zen of Python, and if there are any optimizations I could make to the script to make it a little faster. I also know that Python's goal is not to be fast, but if I can make the script faster, I don't see why not (though the main priority is to make it Pythonic).

``
#!/usr/bin/env python
"""Scraping information from https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"""

import requests, time
from bs4 import BeautifulSoup
from collections import defaultdict

url = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
county = {}

def counties(countyArg):
"""Fills the
county` variable"""
f = open("counties.txt")
for line in f.readlines():
countyArg[line.replace('\n', '').rstrip()] = 0

def scrape(urlArg, countyArg):
"""Scrape Site Based on HTML"""
races = defaultdict(int)

page = requests.get(urlArg)
html = page.content

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
for row in table.findAll('tr')[1:]:
if row.findAll('td')[8].text == 'Black':
races['black'] += 1
elif row.findAll('td')[8].text == 'White':
races['white'] += 1
elif row.findAll('td')[8].text == 'Hispanic':
races['hispanic'] += 1

countyArg[row.findAll('td')[9].text.encode('ascii').rstrip()] += 1

return races, county

def find_largest(countyArg):
"""find largest county"""
largestCounty = ''
largestInmates = 0
for key, value in countyArg.iteritems():
if largestInmates < value:
largestInmates = value
largestCounty = key
return largestCounty, largestInmates

if __name__ == '__main__':

Solution

In addition to PEP 20, the Python Zen, which you seem to be already familiar with,
another must read (and follow) for Python coders is PEP 8, the style guide.
Many of those recommendations apply to your code,
please read it carefully, I won't repeat them here.

Working with resources

In counties,
you open a file for reading but you never close it.
You should always close the resources, file handles you open.
But actually, in Python, you should use the with syntax, like this:

with open("counties.txt") as fh:
    for line in fh.readlines():
        countyArg[line.replace('\n', '').rstrip()] = 0


In this form, you don't actually need to close:
Python will automatically close the file handle after leaving the with block.

Why out-parameters?

It's not clear why you opted for initializing an empty county dictionary and then filling it in the counties method.
It would be more natural to let counties create the dictionary and return it.

Avoid global variables

county is declared at the outermost scope,
effectively making it a global variable.
Avoid global variables as much as possible.

As hinted earlier,
it would be better to create the dictionary inside a function,
and pass it around to the functions that need it.

For example:

load_counties(path):
    counties = {}
    with open(path) as fh:
        for line in fh.readlines():
            counties[line.replace('\n', '').rstrip()] = 0

    return counties

# ....

def main():
    counties = load_counties("counties.txt")
    scrape(url, counties)
    # ....


Avoid code duplication

Avoid repeated logic like this:

for row in table.findAll('tr')[1:]:
    if row.findAll('td')[8].text == 'Black':
        races['black'] += 1
    elif row.findAll('td')[8].text == 'White':
        races['white'] += 1
    elif row.findAll('td')[8].text == 'Hispanic':
        races['hispanic'] += 1


Not only it is error-prone to repeat row.findAll('td')[8].text many times,
it's also inefficient: beautiful soup has to find all td elements multiple times.

Refactor to do this only once:

for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race == 'Black':
        races['black'] += 1
    elif race == 'White':
        races['white'] += 1
    elif race == 'Hispanic':
        races['hispanic'] += 1


We can also go one step further:

for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race in ('Black', 'White', 'Hispanic'):
        races[race.lower()] += 1


Consider using collections.Counter

The collections package has a class that can help you here: Counter.
Instead of doing the counting by yourself,
this handy little utility can do it for you,
it goes something like this:

from collections import Counter

# ...

    races = []
    for row in table.findAll('tr')[1:]:
        race = row.findAll('td')[8].text
        if race in ('Black', 'White', 'Hispanic'):
            races.append(race.lower())

    return Counter(races)


That is,
from a list like ['black', 'white', 'black', 'white', 'black', 'hispanic'] it will create Counter({'black': 3, 'white': 2, 'hispanic': 1})

Use max to find the maximum entry of a dictionary

find_largest can be simplified using max:

def find_largest(counties):
    return max(counties.items(), key=lambda pair: pair[1]))

Code Snippets

with open("counties.txt") as fh:
    for line in fh.readlines():
        countyArg[line.replace('\n', '').rstrip()] = 0
load_counties(path):
    counties = {}
    with open(path) as fh:
        for line in fh.readlines():
            counties[line.replace('\n', '').rstrip()] = 0

    return counties

# ....

def main():
    counties = load_counties("counties.txt")
    scrape(url, counties)
    # ....
for row in table.findAll('tr')[1:]:
    if row.findAll('td')[8].text == 'Black':
        races['black'] += 1
    elif row.findAll('td')[8].text == 'White':
        races['white'] += 1
    elif row.findAll('td')[8].text == 'Hispanic':
        races['hispanic'] += 1
for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race == 'Black':
        races['black'] += 1
    elif race == 'White':
        races['white'] += 1
    elif race == 'Hispanic':
        races['hispanic'] += 1
for row in table.findAll('tr')[1:]:
    race = row.findAll('td')[8].text
    if race in ('Black', 'White', 'Hispanic'):
        races[race.lower()] += 1

Context

StackExchange Code Review Q#99300, answer score: 3

Revisions (0)

No revisions yet.