Webscraping calendar events using Python 3, with or without BeautifulSoup
Problem
I'm trying to find out why my web-scraping code with BeautifulSoup (BS) is slower than my code without BS. I would think that BS code would be faster than the other code - so, maybe I'm doing something wrong?
With BS

```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

# start timer
start = time.time()

# control parameters:
# dates
dateFrom = '2016-01-01'
dateTo = '2016-07-31'
# url
url = 'http://utilitytool.casc.eu/CascUtilityWebService.asmx/GetNetPositionDataForAPeriod?dateFrom=' + dateFrom + '&dateTo=' + dateTo
# /control parameters:

page = requests.get(url)
soup = BeautifulSoup(page.content)

# Extract data from soup
calendardate = [i.text for i in soup.findAll('calendardate')]
calendarhour = [i.text for i in soup.findAll('calendarhour')]
be = [i.text for i in soup.findAll('be')]
nl = [i.text for i in soup.findAll('nl')]
deat = [i.text for i in soup.findAll('deat')]
fr = [i.text for i in soup.findAll('fr')]

# lose the useless string in date list
calendardate = [w.replace('T00:00:00', '') for w in calendardate]

# convert hour column to int
calendarhour = [int(i) for i in calendarhour]

# Python operates with hours: 0-23 and not with 1-24
datetime = [x-1 if x - 1 > 9 else '0' + str(x-1) for x in calendarhour]

# create DateTime list
datetime = ["%s %s:00:00" % t for t in zip(calendardate, datetime)]

# Create Pandas Df
df = pd.DataFrame({
    'datetime': datetime,
    'be': be,
    'nl': nl,
    'deat': deat,
    'fr': fr
}, columns=['datetime', 'be', 'nl', 'deat', 'fr'])

# end time
end = time.time()
print('\nTime elapsed', round(end - start, 3), 's')
```

Without BS (truncated in the source after the URL line)

```
import pandas as pd
from datetime import datetime
import time
import urllib.request
import re

# start timer
start = time.time()

# control parameters:
# dates
dateFrom = '2016-01-01'
dateTo = '2016-07-31'
# url
url = 'http://utilitytool.casc.eu/CascUtilityWebService.asmx/GetNetPositionDataForAPeriod?dateFrom=' + dateFrom + '&dateTo=' + dateTo
```
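The regex version is cut off above after building the URL; judging from the refactored parse_page_using_re in the answer, it extracted each field with re.findall. A minimal, self-contained sketch of that extraction step (the sample XML string here is invented for illustration; the tag names are taken from the answer's regexes):

```python
import re

# Invented sample of the service's XML payload
data = (
    '<CalendarDate>2016-01-01T00:00:00</CalendarDate>'
    '<CalendarHour>1</CalendarHour>'
    '<BE>120.5</BE><NL>-30.0</NL>'
)

# Non-greedy groups capture the text between each pair of tags
calendardate = re.findall(r'<CalendarDate>(.*?)</CalendarDate>', data)
calendarhour = re.findall(r'<CalendarHour>(.*?)</CalendarHour>', data)
be = [float(i) for i in re.findall(r'<BE>(.*?)</BE>', data)]

print(calendardate)  # ['2016-01-01T00:00:00']
print(calendarhour)  # ['1']
print(be)            # [120.5]
```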
Solution
Regardless of the parsing method, and considering that you are using pandas dataframes in the end, I would simplify some of its usage. pandas has its own objects dealing with dates and times, and they are pretty smart: pd.Timestamp and pd.Timedelta. You can get your whole datetime manipulation done using:

```
calendardate = (<method using either bs or re>)
calendarhour = (<method using either bs or re>)
calendartimes = [
    pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
    for date, time in zip(calendardate, calendarhour)
]
```

Also, these calendartimes may end up being more than a column of data: you can convert them to an index using pd.to_datetime:

```
df = pd.DataFrame({
    'BE': be,
    'NL': nl,
    'DEAT': deat,
    'FR': fr
}, index=pd.to_datetime(calendartimes))
```

(Oh and, when using a dictionary to feed into pd.DataFrame, you don't need to specify the columns; the keys of the dictionary will be used.)

Now there are a few other things to consider when looking at the code.

First off, get consistent. Between the two versions, the naming conventions are not the same, so it is harder to talk about the variables at play. The first code is way better at this, as it follows PEP 8 more closely.

Consistency would also mean using the same library to retrieve the required page in both versions. And, as far as timing is concerned, retrieving the page should not be timed, as it depends heavily on your bandwidth at the moment of download. Including that time in your measurements of the parsing bits is not fair.

Talking about timing, you should put your code into functions so it is easier to reuse, test and time. Wrapping your tests under if __name__ == '__main__': would also be good practice:

```
import re

from bs4 import BeautifulSoup
import requests
import pandas as pd

DATA_URL = 'http://utilitytool.casc.eu/' \
    'CascUtilityWebService.asmx/GetNetPositionDataForAPeriod'


def download_data(from_date, to_date, url=DATA_URL):
    date_format = '%Y-%m-%d'
    parameters = {
        'dateFrom': from_date.strftime(date_format),
        'dateTo': to_date.strftime(date_format),
    }
    page = requests.get(url, params=parameters)
    page.raise_for_status()
    return page.content


def parse_page_using_bs(content):
    soup = BeautifulSoup(content, 'xml')
    # Extract data from soup
    calendardate = (i.text for i in soup.findAll('calendardate'))
    calendarhour = (i.text for i in soup.findAll('calendarhour'))
    be = (i.text for i in soup.findAll('be'))
    nl = (i.text for i in soup.findAll('nl'))
    deat = (i.text for i in soup.findAll('deat'))
    fr = (i.text for i in soup.findAll('fr'))
    # Merge date and hours in a single datetime
    calendartimes = [
        pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
        for date, time in zip(calendardate, calendarhour)
    ]
    # Create Pandas Df
    return pd.DataFrame({
        'BE': be,
        'NL': nl,
        'DEAT': deat,
        'FR': fr
    }, index=pd.to_datetime(calendartimes))


def parse_page_using_re(content):
    data = str(content)
    calendardate = re.findall(r'<CalendarDate>(.*?)</CalendarDate>', data)
    calendarhour = re.findall(r'<CalendarHour>(.*?)</CalendarHour>', data)
    be = re.findall(r'<BE>(.*?)</BE>', data)
    nl = re.findall(r'<NL>(.*?)</NL>', data)
    deat = re.findall(r'<DEAT>(.*?)</DEAT>', data)
    fr = re.findall(r'<FR>(.*?)</FR>', data)
    # Merge date and hours in a single datetime
    calendartimes = [
        pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
        for date, time in zip(calendardate, calendarhour)
    ]
    # convert strings to floats
    be = [float(i) for i in be]
    nl = [float(i) for i in nl]
    deat = [float(i) for i in deat]
    fr = [float(i) for i in fr]
    # create pandas df from lists
    df = pd.DataFrame()
    df['BE'] = be
    df['NL'] = nl
    df['DEAT'] = deat
    df['FR'] = fr
    df.index = pd.to_datetime(calendartimes)
    return df


if __name__ == '__main__':
    import time
    import datetime
    page = download_data(datetime.date(2016, 1, 1), datetime.date(2016, 7, 31))
    start = time.time()
    parse_page_using_bs(page)
    end = time.time()
    print('\nTime elapsed for Beautifulsoup', round(end - start, 3), 's')
    start = time.time()
    parse_page_using_re(page)
    end = time.time()
    print('\nTime elapsed for re', round(end - start, 3), 's')
```

This also enables you to use better timing tools, like timeit:

```
if __name__ == '__main__':
    from timeit import timeit
    from datetime import date
    page = download_data(date(2016, 1, 1), date(2016, 7, 31))
    for function in ['parse_page_using_bs', 'parse_page_using_re']:
        setup = 'from __main__ import {} as parse, page'.format(function)
        print(function, ':', timeit('parse(page)', setup=setup, number=10))
```

Now, as you're asking about performance, building intermediate lists with BeautifulSoup.find_all or re.findall only for transformation purposes is not the best thing you can do. Better use generators. It's easier with re, where re.finditer is almost a drop-in replacement; with BeautifulSoup you can write your own generator:
```
def find_iter(soup, tag):
    content = soup.find(tag)
    while content is not None:
        yield content
        content = content.find_next(tag)
```
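To see the generator idea in action without hitting the network, here is a small comparison (the sample string is invented) of re.findall, which materializes the full list of matches, against re.finditer, which yields match objects lazily, in the same spirit as the find_iter helper above:

```python
import re

data = '<BE>1.5</BE><BE>2.5</BE><BE>3.5</BE>'

# findall materializes every match up front
as_list = [float(v) for v in re.findall(r'<BE>(.*?)</BE>', data)]

# finditer yields match objects one at a time; nothing is built until consumed
lazy = (float(m.group(1)) for m in re.finditer(r'<BE>(.*?)</BE>', data))

print(as_list)     # [1.5, 2.5, 3.5]
print(list(lazy))  # [1.5, 2.5, 3.5]
```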
Context
StackExchange Code Review Q#150354, answer score: 3