Beautifulsoup scraper for sport events

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I've written a simple scraper that parses HTML using BeautifulSoup and collects the data (schedule of sports events), then clubs them together in a list of dicts.

The code works just fine, but the way I process the data is pretty horrible IMO. I use an if...else to parse the data selectively, because the rows alternate: one row gives the venue and the result (if available), and the other gives the time of the match and the team information.

So basically, is there a better, more reliable way to parse the data?

```
import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime
class cricket(object):

    def getMatches(self, url):
        """ Scrape the given url for match schedule """

        headers = {'Accept':'text/css,*/*;q=0.1',
        'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'en-US,en;q=0.8',
        'User-Agent':'Mozilla/5 (Solaris 10) Gecko'}

        page = requests.get(url, headers = headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)

        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')

        match_type_list = ['TEST', 'ODI', 'T20']
        match_info = []

        for elem in range(1,len(tags)):
            dict_ = {}
            x = tags[elem].getText()
            x = x.replace(r' ', '')

            if 'Venue' in x:
                for a in match_type_list:
                    if a in x:
                        match_type = a

                x = x.replace('Venue', '')
                if 'Result' in x:
                    x = x.replace('Result', '')
                    x = x.split(': ')
                    # print x
                    venue = x[1]
                    result = x[2]
                    dict_.update({'venue':venue,'result':result,
                        'match_type':match_type})
                else:
                    x = x.split(': ')
```

Solution

import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime
class cricket(object):

Python convention is that class names use CamelCase, so this should be Cricket.

    def getMatches(self, url):

Python convention states that methods should be lowercase_with_underscores. Also, there isn't really a reason to have this method in a class anyway. Seems to me that it should be a function.

        """ Scrape the given url for match schedule """

        headers = {'Accept':'text/css,*/*;q=0.1',
        'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'en-US,en;q=0.8',
        'User-Agent':'Mozilla/5 (Solaris 10) Gecko'}

I'd move this out of the function into a global constant.
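A minimal sketch of the two suggestions so far: a plain function instead of a method, and the headers as a module-level constant. To stay runnable without third-party packages it swaps requests for the standard library's urllib.request and only builds the request without sending it. The names REQUEST_HEADERS and build_request are made up for illustration.

```python
import urllib.request

# Module-level constant: defined once and shared by every call.
REQUEST_HEADERS = {
    'Accept': 'text/css,*/*;q=0.1',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5 (Solaris 10) Gecko',
}


def build_request(url):
    """Return an unsent request for url that carries the shared headers."""
    # Stand-in for requests: with that library the call would be
    # requests.get(url, headers=REQUEST_HEADERS).
    return urllib.request.Request(url, headers=REQUEST_HEADERS)
```

Nothing here depends on instance state, which is the reason a function is enough.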

        page = requests.get(url, headers = headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)

I'd combine these three lines:

soup = BeautifulSoup(requests.get(url, headers = headers).content)

        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')

I'd avoid non-descriptive names like result. Tags is a bit better, but not by a whole lot. I'd also combine these two lines:

tags = soup.find('div', attrs={'class':'bElementBox'}).findChildren('tr')

        match_type_list = ['TEST', 'ODI', 'T20'] 
        match_info = []

        for elem in range(1,len(tags)):

It'd make more sense to process things two rows at a time.
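One way to walk the rows two at a time, as a rough sketch. The strings below are hypothetical stand-ins for the tr tags, and row_pairs is an illustrative name, not something from the original code.

```python
def row_pairs(rows):
    """Return an iterator over consecutive, non-overlapping pairs of rows.

    If the number of rows is odd, zip() drops the trailing unpaired row.
    """
    return zip(rows[0::2], rows[1::2])


# Hypothetical stand-ins for the <tr> tags. Index 0 plays the header row,
# which the original loop skips by starting its range at 1.
rows = ['header', 'row A1', 'row B1', 'row A2', 'row B2']

for first, second in row_pairs(rows[1:]):
    # Each iteration sees both halves of one match at once, so a single
    # dict can be built here without a merge step afterwards.
    print(first, '|', second)
```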

            dict_ = {}

Useless name alert.

            x = tags[elem].getText()
            x = x.replace(r' ', '')

At this point, you extract the text and throw out the HTML. But the data is divided into table cells, so why in the world wouldn't you want to take advantage of that? Instead you drop down to extracting data straight from the text, which is much harder than it would be if you used the tags as hints.
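To illustrate keeping the cell boundaries, here is a small sketch that groups cell text by row. It uses the standard library's html.parser as a stand-in so it runs without BeautifulSoup installed; with BeautifulSoup the same idea would be to ask each row tag for its td children instead of calling getText() on the whole row. The sample markup and the names RowCellCollector and table_rows are assumptions, since the real page's structure isn't shown.

```python
from html.parser import HTMLParser


class RowCellCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each row being a list of cell texts."""
    parser = RowCellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup; the real fixture table may be laid out differently.
sample = ("<table><tr><td>Venue: Lord's</td>"
          "<td> Result: England won </td></tr>"
          "<tr><td>ODI</td></tr></table>")
cells_by_row = table_rows(sample)
```

Once each value arrives in its own cell, most of the string splitting in the original loop is no longer needed.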

            if 'Venue' in x:
                for a in match_type_list:
                    if a in x:
                        match_type = a

                x = x.replace('Venue', '')
                if 'Result' in x:
                    x = x.replace('Result', '')
                    x = x.split(': ')
                    # print x
                    venue = x[1]
                    result = x[2]

All this work to extract data from the string is something you really should use a regular expression for. This is exactly the kind of situation it excels at.
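As a hedged sketch of the regular-expression approach: the pattern below assumes the flattened row text looks roughly like "1st TEST Venue: Lord's, London Result: England won by 5 wickets", with the Result part optional. That shape is inferred from the splitting code above, not taken from the real page, and VENUE_LINE and parse_venue_line are illustrative names.

```python
import re

# Assumed shape of the flattened row text; the real page may differ.
VENUE_LINE = re.compile(
    r'(?P<match_type>TEST|ODI|T20)'        # one of the known match types
    r'.*?Venue:\s*(?P<venue>.*?)'          # venue text, as short as possible
    r'\s*(?:Result:\s*(?P<result>.*))?$'   # optional result, to end of text
)


def parse_venue_line(text):
    """Return a dict with match_type, venue and result, or None.

    result is None when the row has no 'Result:' part.
    """
    found = VENUE_LINE.search(text)
    return found.groupdict() if found else None


played = parse_venue_line(
    "1st TEST Venue: Lord's, London Result: England won by 5 wickets")
upcoming = parse_venue_line('2nd ODI Venue: The Oval')
```

A single pattern replaces the chain of replace() and split() calls, and a row that doesn't fit yields None instead of an IndexError.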

                    dict_.update({'venue':venue,'result':result,
                        'match_type':match_type})

It's not clear why you would choose to update rather than simply assign. There is no way that any other line of code in this loop can assign to it.

                else:
                    x = x.split(': ')
                    venue = x[1]
                    dict_.update({'venue':venue})

            else:
                match = re.search(r'\b[AP]M', x)
                date_time = x[0:match.end()]
                date_time = date_time.replace(',','')[4:]
                teams = x[match.end():].split('vs')
                home_team = teams[0].strip()
                away_team = teams[1].strip()

                # print date_time, home_team, away_team

                time_obj = datetime.strptime(date_time, '%b %d %Y %I:%M %p')

Organization seems a little suspect. You jump from working on the date, over to the teams, and then back to the date. I'd stick with the date until it was finished. You also spend a bunch of lines massaging the date. However, strptime lets you specify any format you want, so you should be able to have it parse the raw text directly.

                timings = time_obj.strftime('%Y-%m-%dT%H:%MZ')

If I'm parsing, I wouldn't convert the time back into a string. I'd keep it as a datetime object.
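Both date points can be sketched together. The input shape 'Thu, May 17, 2012 11:00 AM' is an assumption reconstructed from the slicing in the original code, and MATCH_TIME_FORMAT and parse_match_time are illustrative names.

```python
from datetime import datetime

# Assumed input shape, e.g. 'Thu, May 17, 2012 11:00 AM'. Adjust the
# format string to whatever the page really emits.
MATCH_TIME_FORMAT = '%a, %b %d, %Y %I:%M %p'


def parse_match_time(text):
    """Parse the raw date text in one step and return a datetime."""
    return datetime.strptime(text.strip(), MATCH_TIME_FORMAT)


start = parse_match_time('Thu, May 17, 2012 11:00 AM')

# Keep start as a datetime while working with it, and format it only
# at the point where a string is actually required.
print(start.strftime('%Y-%m-%dT%H:%MZ'))  # prints 2012-05-17T11:00Z
```

Letting the format string describe the commas and the weekday removes the need for the replace() and [4:] steps.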

                dict_.update({'home_team':home_team,
                    'away_team':away_team,
                    'timings':timings
                })

            match_info.append(dict_)

        # print match_info

        final_list = []     # final list of dicts that we need

        for i in range(0, len(match_info), 2):
            final_list.append(dict(match_info[i].items() +
                match_info[i+1].items()))

I'd do this:

for i in range(0, len(match_info), 2):
    left, right = match_info[i:i+2]
    final_list.append(dict(left.items() + right.items()))

I think it makes things a bit clearer.
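Concatenating items() with + only works on Python 2, because on Python 3 dict.items() returns a view that doesn't support +. A rough Python 3 equivalent of the same pairing, wrapped in a function that returns the merged list, might look like this (merge_row_pairs is an illustrative name and the sample data is made up):

```python
def merge_row_pairs(match_info):
    """Merge each consecutive pair of dicts and return the merged list.

    A trailing unpaired dict, if any, is ignored.
    """
    merged = []
    for i in range(0, len(match_info) - 1, 2):
        left, right = match_info[i:i + 2]
        # On duplicate keys the value from right wins.
        merged.append({**left, **right})
    return merged


# Made-up sample data in the shape the original loop produces.
sample_info = [
    {'venue': "Lord's", 'match_type': 'TEST'},
    {'home_team': 'England', 'away_team': 'West Indies'},
]
matches = merge_row_pairs(sample_info)
```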

        # for i in final_list:
        #     print i
        # print final_list

This function probably needs to return that list or something.

if __name__ == '__main__':
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012' # change seriesCode in URL for different series.
    #url = 'http://localhost:6543/lhost/static/icc_cricket.html'
    c = cricket()
    c.getMatches(url)

Good

Here is my reworking of your code:

```
def get_cricket_matches(content):
    """ Scrape the given content for match schedule """
```

Context

StackExchange Code Review Q#13593, answer score: 4
