HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Matching values from html table for updating values in pandas dataframe

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pandastableupdatingforvaluesdataframefromhtmlmatching

Problem

This is more of an exercise for me to get use to Pandas and its dataframes. For those who didn't hear of it:


Pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with structured (tabular,
multidimensional, potentially heterogeneous) and time series data both
easy and intuitive

I'll make this sound like an exercise:

Given some link http://ABCD.abc/some_date.html, take the necessary information from the table on the page.

Say the information looks like this:

Team | Another Team | Col2 | Current | Col4 | Halftime | Scores

Team1 | TeamX | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ | info | Current4 | Col4 | Halftime4 | Scores4


From fileA (data from the file is pickled - yeah, I know pickling isn't the best option, but let's stick with it for the sake of the exercise), add the info at the end of the dataframe in another 3 new columns: Current, Halftime and Scores.

Let's suppose the data in the dataframe looks like this:

| Team | Opponent | Col2 | Col3 Col4 | Col5 | Col6 | Date

0 | Team1 | TeamX | info | info | info | info | info | some_date1

...
and so on


Now, the task:

  • Parse each row from the dataframe (access the link using the date from the Date column of that row), and check if the team from this row can be found in the HTML table.



  • If you find it, take Current, Halftime and Scores from the table and add the info into the newly created dataframe columns.



  • Do this for each row from the dataframe.



Now, I did solve this pretty easy, but it takes up to 1 minute to resolve 137 rows in the dataframe.

I'd like some ideas on how can I optimise it, make better use of
pandas modules and if there's something wrong with the logic.

import pickle
import requests
import pandas as pd

fro

Solution

Consider avoiding row iteration and simply use pandas.DataFrame.merge() on Team and Date columns. Usually, in Python pandas or numpy, vectorized processes are always the recommended course where you pass in a serialized object (vector, list, array, dataframe) to run a bulk operation in one call instead of on individual elements.

To follow this approach, first you will need to compile the html data for all unique dates found in your file dataframe (pulled from pickle). Also, no need to create empty columns --Currents , Halftimes, Scores-- as the merge will bring them over.

Below first two defined methods should return a dataframe object of which the final function simply merges together. Possibly, the html dataframe build may take some time as you have to parse all unique dated web pages. For this, try implementing pandas.read_html.

def get_df_from_file():
    with open(FILE_TO_PROCESS, "rb") as openfile:
        return pickle.load(openfile)

def get_html_data_from_url(df):
    # LIST OF DATAFRAMES
    dfList = []

    # ITERATE ON UNIQUE DATES 
    for dt in set(df['Date'].tolist()):
        url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'lxml')

        rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
        dates, teams, currents, halftimes, scores = [], [], [], [], []

        for row in rows:
            cells = row.find_all("td")

            dates.append(dt)  
            teams.append(cells[0].get_text().encode('utf-8'))
            currents.append(cells[3].get_text().encode('utf-8'))
            halftimes.append(cells[5].get_text().encode('utf-8'))
            scores.append(cells[6].get_text().encode('utf-8'))

        data = {
            'Date': dates, 
            'Team': teams,
            'Currents': currents,
            'Halftimes': halftimes,
            'Scores': scores
        }
        # APPEND DATAFRAME CREATED FROM EACH DICTIONARY 
        dfList.append(pd.DataFrame(data))

    # CONCATENATE DATAFRAME LIST
    finaldf = pd.concat(dfList)

    return finaldf

def process_data():
    filedf = get_df_from_file('CFB_15_living-2.p')
    filedf['Team'] = filedf['Team'].str.lower()

    htmldf = get_html_data_from_url(filedf)
    htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()

    # LEFT JOIN MERGE
    mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
    mergedf.to_csv('results.csv', sep='\t')

Code Snippets

def get_df_from_file():
    with open(FILE_TO_PROCESS, "rb") as openfile:
        return pickle.load(openfile)

def get_html_data_from_url(df):
    # LIST OF DATAFRAMES
    dfList = []

    # ITERATE ON UNIQUE DATES 
    for dt in set(df['Date'].tolist()):
        url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'lxml')

        rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
        dates, teams, currents, halftimes, scores = [], [], [], [], []

        for row in rows:
            cells = row.find_all("td")

            dates.append(dt)  
            teams.append(cells[0].get_text().encode('utf-8'))
            currents.append(cells[3].get_text().encode('utf-8'))
            halftimes.append(cells[5].get_text().encode('utf-8'))
            scores.append(cells[6].get_text().encode('utf-8'))

        data = {
            'Date': dates, 
            'Team': teams,
            'Currents': currents,
            'Halftimes': halftimes,
            'Scores': scores
        }
        # APPEND DATAFRAME CREATED FROM EACH DICTIONARY 
        dfList.append(pd.DataFrame(data))

    # CONCATENATE DATAFRAME LIST
    finaldf = pd.concat(dfList)

    return finaldf

def process_data():
    filedf = get_df_from_file('CFB_15_living-2.p')
    filedf['Team'] = filedf['Team'].str.lower()

    htmldf = get_html_data_from_url(filedf)
    htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()

    # LEFT JOIN MERGE
    mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
    mergedf.to_csv('results.csv', sep='\t')

Context

StackExchange Code Review Q#143646, answer score: 2

Revisions (0)

No revisions yet.