patternpythonMinor
Matching values from html table for updating values in pandas dataframe
Viewed 0 times
pandastableupdatingforvaluesdataframefromhtmlmatching
Problem
This is more of an exercise for me to get use to Pandas and its dataframes. For those who didn't hear of it:
Pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with structured (tabular,
multidimensional, potentially heterogeneous) and time series data both
easy and intuitive
I'll make this sound like an exercise:
Given some link
Say the information looks like this:
From fileA (data from the file is
Let's suppose the data in the dataframe looks like this:
and so on
import requests
import pandas as pd
fro
Pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with structured (tabular,
multidimensional, potentially heterogeneous) and time series data both
easy and intuitive
I'll make this sound like an exercise:
Given some link
http://ABCD.abc/some_date.html, take the necessary information from the table on the page.Say the information looks like this:
Team | Another Team | Col2 | Current | Col4 | Halftime | Scores
Team1 | TeamX | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ | info | Current4 | Col4 | Halftime4 | Scores4
From fileA (data from the file is
pickled - yeah, I know pickling isn't the best option, but let's stick with it for the sake of the exercise), add the info at the end of the dataframe in another 3 new columns: Current, Halftime and Scores.Let's suppose the data in the dataframe looks like this:
| Team | Opponent | Col2 | Col3 Col4 | Col5 | Col6 | Date
0 | Team1 | TeamX | info | info | info | info | info | some_date1
...and so on
Now, the task:
- Parse each row from the
dataframe (access the link using the date from the Date column of that row), and check if the team from this row can be found in the HTML table.
- If you find it, take
Current, Halftime and Scores from the table and add the info into the newly created dataframe columns.
- Do this for each row from the dataframe.
Now, I did solve this pretty easy, but it takes up to 1 minute to resolve 137 rows in the dataframe.
I'd like some ideas on how can I optimise it, make better use of pandas modules and if there's something wrong with the logic.
import pickleimport requests
import pandas as pd
fro
Solution
Consider avoiding row iteration and simply use
To follow this approach, first you will need to compile the html data for all unique dates found in your file dataframe (pulled from pickle). Also, no need to create empty columns --Currents , Halftimes, Scores-- as the merge will bring them over.
Below first two defined methods should return a dataframe object of which the final function simply merges together. Possibly, the html dataframe build may take some time as you have to parse all unique dated web pages. For this, try implementing pandas.read_html.
pandas.DataFrame.merge() on Team and Date columns. Usually, in Python pandas or numpy, vectorized processes are always the recommended course where you pass in a serialized object (vector, list, array, dataframe) to run a bulk operation in one call instead of on individual elements. To follow this approach, first you will need to compile the html data for all unique dates found in your file dataframe (pulled from pickle). Also, no need to create empty columns --Currents , Halftimes, Scores-- as the merge will bring them over.
Below first two defined methods should return a dataframe object of which the final function simply merges together. Possibly, the html dataframe build may take some time as you have to parse all unique dated web pages. For this, try implementing pandas.read_html.
def get_df_from_file():
with open(FILE_TO_PROCESS, "rb") as openfile:
return pickle.load(openfile)
def get_html_data_from_url(df):
# LIST OF DATAFRAMES
dfList = []
# ITERATE ON UNIQUE DATES
for dt in set(df['Date'].tolist()):
url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
dates, teams, currents, halftimes, scores = [], [], [], [], []
for row in rows:
cells = row.find_all("td")
dates.append(dt)
teams.append(cells[0].get_text().encode('utf-8'))
currents.append(cells[3].get_text().encode('utf-8'))
halftimes.append(cells[5].get_text().encode('utf-8'))
scores.append(cells[6].get_text().encode('utf-8'))
data = {
'Date': dates,
'Team': teams,
'Currents': currents,
'Halftimes': halftimes,
'Scores': scores
}
# APPEND DATAFRAME CREATED FROM EACH DICTIONARY
dfList.append(pd.DataFrame(data))
# CONCATENATE DATAFRAME LIST
finaldf = pd.concat(dfList)
return finaldf
def process_data():
filedf = get_df_from_file('CFB_15_living-2.p')
filedf['Team'] = filedf['Team'].str.lower()
htmldf = get_html_data_from_url(filedf)
htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()
# LEFT JOIN MERGE
mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
mergedf.to_csv('results.csv', sep='\t')Code Snippets
def get_df_from_file():
with open(FILE_TO_PROCESS, "rb") as openfile:
return pickle.load(openfile)
def get_html_data_from_url(df):
# LIST OF DATAFRAMES
dfList = []
# ITERATE ON UNIQUE DATES
for dt in set(df['Date'].tolist()):
url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
dates, teams, currents, halftimes, scores = [], [], [], [], []
for row in rows:
cells = row.find_all("td")
dates.append(dt)
teams.append(cells[0].get_text().encode('utf-8'))
currents.append(cells[3].get_text().encode('utf-8'))
halftimes.append(cells[5].get_text().encode('utf-8'))
scores.append(cells[6].get_text().encode('utf-8'))
data = {
'Date': dates,
'Team': teams,
'Currents': currents,
'Halftimes': halftimes,
'Scores': scores
}
# APPEND DATAFRAME CREATED FROM EACH DICTIONARY
dfList.append(pd.DataFrame(data))
# CONCATENATE DATAFRAME LIST
finaldf = pd.concat(dfList)
return finaldf
def process_data():
filedf = get_df_from_file('CFB_15_living-2.p')
filedf['Team'] = filedf['Team'].str.lower()
htmldf = get_html_data_from_url(filedf)
htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()
# LEFT JOIN MERGE
mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
mergedf.to_csv('results.csv', sep='\t')Context
StackExchange Code Review Q#143646, answer score: 2
Revisions (0)
No revisions yet.