HiveBrain v1.2.0
pattern · python · Minor

Webscraping calendar events using Python 3, with or without BeautifulSoup

Submitted by: @import:stackexchange-codereview
python · webscraping · beautifulsoup · calendar · events

Problem

I'm trying to find out why my web-scraping code with BeautifulSoup (BS) is slower than my code without BS. I would have thought the BS code would be faster than the other one - so, maybe I'm doing something wrong?

With BS

```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

# start timer
start = time.time()

# control parameters:
# dates
dateFrom = '2016-01-01'
dateTo = '2016-07-31'

# url
url = 'http://utilitytool.casc.eu/CascUtilityWebService.asmx/GetNetPositionDataForAPeriod?dateFrom=' + dateFrom + '&dateTo=' + dateTo
# /control parameters:
page = requests.get(url)

soup = BeautifulSoup(page.content)

# Extract data from soup
calendardate = [i.text for i in soup.findAll('calendardate')]
calendarhour = [i.text for i in soup.findAll('calendarhour')]
be = [i.text for i in soup.findAll('be')]
nl = [i.text for i in soup.findAll('nl')]
deat = [i.text for i in soup.findAll('deat')]
fr = [i.text for i in soup.findAll('fr')]

# lose the useless string in date list
calendardate = [w.replace('T00:00:00', '') for w in calendardate]

# convert hour column to int
calendarhour = [int(i) for i in calendarhour]

# Python operates with hours: 0-23 and not with 1-24
datetime = [x-1 if x - 1 > 9 else '0' + str(x-1) for x in calendarhour]

# create DateTime list
datetime = ["%s %s:00:00" % t for t in zip(calendardate, datetime)]

# Create Pandas Df
df = pd.DataFrame({
        'datetime': datetime,
        'be': be,
        'nl': nl,
        'deat': deat,
        'fr': fr
    },
    columns = ['datetime', 'be', 'nl', 'deat', 'fr'])

# end time
end = time.time()
print('\nTime elapsed', round(end - start, 3), 's')
```


Without BS

```
import pandas as pd
from datetime import datetime
import time
import urllib.request
import re

# start timer
start = time.time()

# control parameters:
# dates
dateFrom = '2016-01-01'
dateTo = '2016-07-31'

# url
url = 'http://utilitytool.casc.eu/CascUtilityWebService.asmx/GetNetPositionDataForAPeriod?dateFrom=' + dateFrom + '&dateTo=' + dateTo
# /control parameters:
data = urllib.request.urlopen(url).read().decode('utf-8')

# Extract data with regular expressions
calendardate = re.findall(r'<CalendarDate>(.*?)</CalendarDate>', data)
calendarhour = re.findall(r'<CalendarHour>(.*?)</CalendarHour>', data)
be = re.findall(r'<BE>(.*?)</BE>', data)
nl = re.findall(r'<NL>(.*?)</NL>', data)
deat = re.findall(r'<DEAT>(.*?)</DEAT>', data)
fr = re.findall(r'<FR>(.*?)</FR>', data)

# lose the useless string in dates and merge date with hour (1-24 -> 0-23)
datetimes = ['%s %02d:00:00' % (d.replace('T00:00:00', ''), int(h) - 1)
             for d, h in zip(calendardate, calendarhour)]

# Create Pandas Df
df = pd.DataFrame({'datetime': datetimes, 'be': be, 'nl': nl,
                   'deat': deat, 'fr': fr},
                  columns=['datetime', 'be', 'nl', 'deat', 'fr'])

# end time
end = time.time()
print('\nTime elapsed', round(end - start, 3), 's')
```

Solution

Regardless of the parsing method, and since you end up with a pandas DataFrame anyway, I would simplify some of the pandas usage.

pandas has its own objects dealing with dates and times, and they are pretty smart: pd.Timestamp and pd.Timedelta. You can get your whole datetime manipulation done using:

```
calendardate = (<method using either bs or re>)
calendarhour = (<method using either bs or re>)

calendartimes = [
    pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
    for date, time in zip(calendardate, calendarhour)
]
```
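As a concrete sketch of that comprehension, here it is run on two made-up sample values in the feed's format (dates carrying a useless `T00:00:00` suffix, hours running 1-24; the values are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical sample values shaped like the feed's fields
calendardate = ['2016-01-01T00:00:00', '2016-01-01T00:00:00']
calendarhour = ['1', '2']

# pd.Timestamp parses the date (suffix included), pd.Timedelta shifts
# the 1-24 hour count down to Python's 0-23 convention
calendartimes = [
    pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time) - 1))
    for date, time in zip(calendardate, calendarhour)
]
print(calendartimes)  # [Timestamp('2016-01-01 00:00:00'), Timestamp('2016-01-01 01:00:00')]
```

Note that the string cleanup (`replace('T00:00:00', '')`) and zero-padding of the original code both disappear: `pd.Timestamp` handles the parsing.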


Also, since these calendartimes are more of an index than a column of data, you can turn them into a proper DatetimeIndex using pd.to_datetime:

```
df = pd.DataFrame({
    'BE': be,
    'NL': nl,
    'DEAT': deat,
    'FR': fr
}, index=pd.to_datetime(calendartimes))
```
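A quick illustration of why the index form pays off (toy values, not real feed data): a DatetimeIndex supports label-based time slicing, e.g. selecting a whole day at once.

```python
import pandas as pd

# Toy frame with hypothetical values, indexed by pd.to_datetime
calendartimes = ['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                 '2016-01-02 00:00:00']
df = pd.DataFrame({'BE': [1.0, 2.0, 3.0]},
                  index=pd.to_datetime(calendartimes))

# Partial-string indexing on a DatetimeIndex: all rows of Jan 1st
day_one = df.loc['2016-01-01']
print(len(day_one))  # 2
```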


(Oh, and when feeding a dictionary into pd.DataFrame, you don't need to specify the columns: the keys of the dictionary are used.)
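A minimal illustration with made-up values:

```python
import pandas as pd

# The dict keys become the column names, so a separate
# columns= argument is unnecessary
df = pd.DataFrame({'BE': [1.0, 2.0], 'NL': [3.0, 4.0]})
print(list(df.columns))  # ['BE', 'NL']
```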

Now there are a few other things to consider when looking at the code.

First off, be consistent. Between the two versions, the naming conventions differ, which makes it harder to talk about the variables at play. The first version is better on this front, as it follows PEP 8 more closely.

Consistency also means using the same library to retrieve the page in both versions. And as far as timing is concerned, retrieving the page should not be timed at all: it depends mostly on your bandwidth at the moment of the download, so including it in your measurements of the parsing bits is not a fair comparison.

Talking about timing, you should put your code into functions so it is easier to reuse, test and time. Wrapping your tests under if __name__ == '__main__': would also be good practice:

```
import re
from bs4 import BeautifulSoup
import requests
import pandas as pd

DATA_URL = 'http://utilitytool.casc.eu/' \
           'CascUtilityWebService.asmx/GetNetPositionDataForAPeriod'

def download_data(from_date, to_date, url=DATA_URL):
    date_format = '%Y-%m-%d'
    parameters = {
        'dateFrom': from_date.strftime(date_format),
        'dateTo': to_date.strftime(date_format),
    }

    page = requests.get(url, params=parameters)
    page.raise_for_status()
    return page.content

def parse_page_using_bs(content):
    soup = BeautifulSoup(content, 'xml')

    # Extract data from soup
    calendardate = (i.text for i in soup.findAll('calendardate'))
    calendarhour = (i.text for i in soup.findAll('calendarhour'))
    be = (i.text for i in soup.findAll('be'))
    nl = (i.text for i in soup.findAll('nl'))
    deat = (i.text for i in soup.findAll('deat'))
    fr = (i.text for i in soup.findAll('fr'))

    # Merge date and hours in a single datetime
    calendartimes = [
        pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
        for date, time in zip(calendardate, calendarhour)
    ]

    # Create Pandas Df
    return pd.DataFrame({
            'BE': be,
            'NL': nl,
            'DEAT': deat,
            'FR': fr
    }, index=pd.to_datetime(calendartimes))

def parse_page_using_re(content):
    data = str(content)
    calendardate = re.findall(r'<CalendarDate>(.*?)</CalendarDate>', data)
    calendarhour = re.findall(r'<CalendarHour>(.*?)</CalendarHour>', data)
    be = re.findall(r'<BE>(.*?)</BE>', data)
    nl = re.findall(r'<NL>(.*?)</NL>', data)
    deat = re.findall(r'<DEAT>(.*?)</DEAT>', data)
    fr = re.findall(r'<FR>(.*?)</FR>', data)

    # Merge date and hours in a single datetime
    calendartimes = [
        pd.Timestamp(date) + pd.Timedelta('{}h'.format(int(time)-1))
        for date, time in zip(calendardate, calendarhour)
    ]

    # convert strings to floats
    be = [float(i) for i in be]
    nl = [float(i) for i in nl]
    deat = [float(i) for i in deat]
    fr = [float(i) for i in fr]

    # create pandas df from lists
    df = pd.DataFrame()
    df['BE'] = be
    df['NL'] = nl
    df['DEAT'] = deat
    df['FR'] = fr
    df.index = pd.to_datetime(calendartimes)
    return df

if __name__ == '__main__':
    import time
    import datetime

    page = download_data(datetime.date(2016, 1, 1), datetime.date(2016, 7, 31))

    start = time.time()
    parse_page_using_bs(page)
    end = time.time()
    print('\nTime elapsed for Beautifulsoup', round(end - start, 3), 's')

    start = time.time()
    parse_page_using_re(page)
    end = time.time()
    print('\nTime elapsed for re', round(end - start, 3), 's')
```


This also enables you to use better timing tools, like timeit:

```
if __name__ == '__main__':
    from timeit import timeit
    from datetime import date

    page = download_data(date(2016, 1, 1), date(2016, 7, 31))
    for function in ['parse_page_using_bs', 'parse_page_using_re']:
        setup = 'from __main__ import {} as parse, page'.format(function)
        print(function, ':', timeit('parse(page)', setup=setup, number=10))
```


Now, as you’re asking about performance, building intermediate lists with BeautifulSoup.find_all or re.findall purely for transformation purposes is not the best thing you can do. Better to use generators. With re it is easy, since re.finditer is almost a drop-in replacement for re.findall; BeautifulSoup has no direct equivalent, but a small generator built on find and find_next does the job (see find_iter in the snippets below).
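To show the difference on a made-up payload shaped like the service's XML (the values are hypothetical):

```python
import re

# Hypothetical XML-ish payload like the one the service returns
data = '<BE>1.5</BE><BE>2.5</BE><BE>3.5</BE>'

# re.findall materialises the whole list up front...
values_list = re.findall(r'<BE>(.*?)</BE>', data)
print(values_list)  # ['1.5', '2.5', '3.5']

# ...while re.finditer yields match objects lazily, one at a time,
# so values can be transformed as they stream by
values = [float(m.group(1)) for m in re.finditer(r'<BE>(.*?)</BE>', data)]
print(values)  # [1.5, 2.5, 3.5]
```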

Code Snippets

```
def find_iter(soup, tag):
    content = soup.find(tag)
    while content is not None:
        yield content
        content = content.find_next(tag)
```

Context

StackExchange Code Review Q#150354, answer score: 3
