HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Web crawler that charts stock ticker data using matplotlib

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
tickermatplotlibstockthatusingwebchartscrawlerdata

Problem

I've built a web crawler using the BeautifulSoup library that pulls stock ticker data from CSV files on Yahoo finance, and charts the data using matplotlib. I'm wondering if there are any ways to improve the code I've written, because there are some parts that I think could be a lot better.

```
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):

# Run loop for each ticker passed in as an argument
for ticker in tickers:

# Convert URL into text for parsing
url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
sourceCode = requests.get(url)
plainText = sourceCode.text
soup = BeautifulSoup(plainText, "html.parser")

# Find all links on the page
for link in soup.findAll('a'):
href = link.get('href')
link = []
for c in href[:48]:
link.append(c)
link = ''.join(link)

# Find the URL for the stock ticker CSV file and convert the data to text
if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
csv_url = href
res = urllib.request.urlopen(csv_url)
csv = res.read()
csv_str = str(csv)

# Parse the CSV to create a list of data points
point = []
points = []
curDay = 0
day = []
commas = 0
lines = csv_str.split("\\n")
lineOne = True
for line in lines:
commas = 0
if lineOne == True:
lineOne = False
else:
for c in line:
if c == ",":
commas += 1
if commas == 4:
point.append

Solution

Composing small functions

chartStocks would be more readable if you split it into a handful of smaller functions, roughly like this:

def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)

        # Or, if you're allergic to temporary variables:
        day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))


This approach lets you clearly see the "pipeline" your data is passing through, and lets you test and reuse smaller pieces individually.

Arguably it would be even cleaner to define def chartStock(ticker) to handle the case of one ticker, so that chartStocks is just

def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)


The only thing to be careful of here is to make sure to design your functions to handle errors properly -- either check that each return value is not None before calling the next function, or allow them to take None as a parameter and just return nothing in that case.

str.startswith

This:

# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)

    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...


could be simplified with str.startswith:

def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href


I've also provided a not-found value of '' so that if link has no href, startswith won't get called on None.

Skipping first row

Instead of using a lineOne flag when looping over lines:

lineOne = True
for line in lines:
    if lineOne == True:
        lineOne = False
    else:
         # continue parsing line...


you can just start from after the first row with a slice:

for line in lines[1:]:
    # ... continue parsing line


CSV parsing

Python has a built-in CSV parsing module that could simplify a lot of your parsing. It would do the splitting-by-commas for you and return either a list or dict of fields for each row, depending on what you ask for. You would end up with roughly this:

def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n'))
    days = []
    points = []

    for day, row in enumerate(csv_rows):
        close = float(row[4])
        days.append(day)
        points.append(close)

    return days, points


where the enumerate function would give you the same zero-based days list as you currently have.

Actually, since it seems like days will just be the list [0 .. len(points)], you could skip the enumerate and simply define days after you've parsed all the points, and toss in a list comprehension if you'd like for good measure:

def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n'))

    points = [float(row[4]) for row in csv_rows]
    days = list(range(len(points)))

    return days, points

Code Snippets

def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)

        # Or, if you're allergic to temporary variables:
        day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))
def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)
# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)

    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...
def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href
lineOne = True
for line in lines:
    if lineOne == True:
        lineOne = False
    else:
         # continue parsing line...

Context

StackExchange Code Review Q#114612, answer score: 4

Revisions (0)

No revisions yet.