Beautifulsoup scraper for sport events
Problem
I've written a simple scraper that parses HTML using BeautifulSoup and collects the data (schedule of sports events), then clubs them together in a list of dicts.
The code works just fine, but the way I process the data is pretty horrible IMO. I use an if...else to parse the data selectively, because the dicts come out in alternating form: one holds {venue, result (if available)}, and the next holds the time of the match and the team information.
So basically, is there a better, more reliable way to parse the data?
```
import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime

class cricket(object):

    def getMatches(self, url):
        """ Scrape the given url for match schedule """
        headers = {'Accept':'text/css,*/*;q=0.1',
                   'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                   'Accept-Encoding':'gzip,deflate,sdch',
                   'Accept-Language':'en-US,en;q=0.8',
                   'User-Agent':'Mozilla/5 (Solaris 10) Gecko'}
        page = requests.get(url, headers = headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)
        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')

        match_type_list = ['TEST', 'ODI', 'T20']
        match_info = []
        for elem in range(1,len(tags)):
            dict_ = {}
            x = tags[elem].getText()
            x = x.replace(r' ', '')
            if 'Venue' in x:
                for a in match_type_list:
                    if a in x:
                        match_type = a
                x = x.replace('Venue', '')
                if 'Result' in x:
                    x = x.replace('Result', '')
                    x = x.split(': ')
                    # print x
                    venue = x[1]
                    result = x[2]
                    dict_.update({'venue':venue,'result':result,
                                  'match_type':match_type})
                else:
                    x = x.split(': ')
```
Solution
```
import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime

class cricket(object):
```
Python conventions state that class names should be CamelCase.
```
    def getMatches(self, url):
```
Python convention states that methods should be lowercase_with_underscores. Also, there isn't really a reason to have this method in a class anyway. It seems to me that it should be a function.
```
        """ Scrape the given url for match schedule """
        headers = {'Accept':'text/css,*/*;q=0.1',
                   'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                   'Accept-Encoding':'gzip,deflate,sdch',
                   'Accept-Language':'en-US,en;q=0.8',
                   'User-Agent':'Mozilla/5 (Solaris 10) Gecko'}
```
I'd move this out of the function into a global constant.
```
        page = requests.get(url, headers = headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)
```
I'd combine these three lines:
```
        soup = BeautifulSoup(requests.get(url, headers = headers).content)
```
```
        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')
```
I'd avoid non-descriptive names like result. tags is a bit better, but not by a whole lot. I'd also combine these two lines:
```
        tags = soup.find('div', attrs={'class':'bElementBox'}).findChildren('tr')
```
```
        match_type_list = ['TEST', 'ODI', 'T20']
        match_info = []
        for elem in range(1,len(tags)):
```
It'd make more sense to process things two rows at a time.
```
            dict_ = {}
```
Useless name alert.
```
            x = tags[elem].getText()
            x = x.replace(r' ', '')
```
At this point, you extract the text and throw out the HTML. But the data is divided into table cells, so why in the world wouldn't you want to take advantage of that? From here on you are trying to extract data straight from the text, which is much harder than it would be if you used the tags as hints.
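To illustrate working from the cells rather than from the flattened text, here is a minimal sketch in Python 3. It uses the standard library's html.parser purely as a stand-in so that it runs without third-party packages; RowCellParser and SAMPLE_HTML are hypothetical names, and the two-column row layout is an assumption about the page, not something taken from the real site.
```python
from html.parser import HTMLParser


class RowCellParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup; the real page layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td>Thu, May 17, 2012 10:00 AM</td><td>England vs West Indies</td></tr>
  <tr><td>1st TEST</td><td>Venue: Lord's, London</td></tr>
</table>
"""

parser = RowCellParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
```
With the cells kept apart like this, the date, the teams, and the venue never get glued into one string, so most of the replace and split calls become unnecessary. With BeautifulSoup the same idea would be to loop over the cells of each row instead of calling getText() on the whole row.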
```
            if 'Venue' in x:
                for a in match_type_list:
                    if a in x:
                        match_type = a
                x = x.replace('Venue', '')
                if 'Result' in x:
                    x = x.replace('Result', '')
                    x = x.split(': ')
                    # print x
                    venue = x[1]
                    result = x[2]
```
All this work to extract data from the string is something you really should use a regular expression for. This is exactly the kind of situation it excels at.
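A minimal sketch of the regular-expression approach, in Python 3. The pattern, the names ROW_PATTERN and parse_venue_row, and the sample row layout ("... TEST Venue: ... Result: ...") are all assumptions made for illustration; the real text would need to be checked against the page.
```python
import re

# Hypothetical pattern: match type, then the venue, then an optional result.
ROW_PATTERN = re.compile(
    r'(?P<match_type>TEST|ODI|T20)\s*'
    r'Venue:\s*(?P<venue>.*?)\s*'
    r'(?:Result:\s*(?P<result>.*))?$'
)


def parse_venue_row(text):
    """Return a dict of the fields found in a venue row, or None."""
    match = ROW_PATTERN.search(text)
    if match is None:
        return None
    # Leave out 'result' when the match has not been played yet.
    return {key: value for key, value in match.groupdict().items()
            if value is not None}


# Made-up sample rows, one with a result and one without.
played = parse_venue_row(
    "1st TEST Venue: Lord's, London Result: England won by 5 wickets")
upcoming = parse_venue_row("2nd ODI Venue: The Oval, London")
```
One pattern covers both the "with result" and "without result" cases, so the nested if...else and the index lookups into the split list go away.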
```
                    dict_.update({'venue':venue,'result':result,
                                  'match_type':match_type})
```
It's not clear why you would choose to update rather than simply assign. There is no way that any other line of code in this loop can assign to it.
```
                else:
                    x = x.split(': ')
                    venue = x[1]
                    dict_.update({'venue':venue})
            else:
                match = re.search(r'\b[AP]M', x)
                date_time = x[0:match.end()]
                date_time = date_time.replace(',','')[4:]
                teams = x[match.end():].split('vs')
                home_team = teams[0].strip()
                away_team = teams[1].strip()
                # print date_time, home_team, away_team
                time_obj = datetime.strptime(date_time, '%b %d %Y %I:%M %p')
```
Organization seems a little suspect. You jump from working on the date, over to the teams, and then back to the date. I'd stick with the date until it was finished. You also spend a bunch of lines massaging the date. However, strptime lets you specify any format you want, so you should be able to have it parse the data directly.
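A minimal sketch of letting strptime handle the raw text, in Python 3. The layout 'Thu, May 17, 2012 10:00 AM' is an assumption inferred from the comma stripping and the [4:] slice in the original code, and it presumes the date text has already been separated from the team names. DATE_FORMAT and parse_match_time are hypothetical names.
```python
from datetime import datetime

# Assumed layout of the scraped text, e.g. 'Thu, May 17, 2012 10:00 AM'.
# %a and %b depend on the locale; this assumes English day and month names.
DATE_FORMAT = '%a, %b %d, %Y %I:%M %p'


def parse_match_time(text):
    """Parse the date text as scraped, commas and weekday included."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


start = parse_match_time('Thu, May 17, 2012 10:00 AM')
```
Putting the commas and the weekday into the format string replaces the replace(',', '') call and the [4:] slice.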
```
                timings = time_obj.strftime('%Y-%m-%dT%H:%MZ')
```
If I'm parsing, I wouldn't convert the time back into a string in another format. I'd keep it as a datetime object.
```
                dict_.update({'home_team':home_team,
                              'away_team':away_team,
                              'timings':timings
                              })
            match_info.append(dict_)
        # print match_info
        final_list = [] # final list of dicts that we need
        for i in range(0, len(match_info), 2):
            final_list.append(dict(match_info[i].items() +
                                   match_info[i+1].items()))
```
I'd do this:
```
        for i in range(0, len(match_info), 2):
            left, right = match_info[i:i+2]
            final_list.append(dict(left.items() + right.items()))
```
I think it makes things a bit clearer.
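The code under review is Python 2, where items() returns lists that can be added together. In Python 3 that addition raises a TypeError, so a port needs a different merge. A minimal Python 3 sketch of the same pairing idea follows; merge_row_pairs and the sample data are hypothetical.
```python
def merge_row_pairs(rows):
    """Merge rows 0 and 1, rows 2 and 3, and so on, into single dicts."""
    merged = []
    # zip pairs each even-indexed row with the row that follows it.
    # A trailing unpaired row is silently dropped.
    for left, right in zip(rows[0::2], rows[1::2]):
        merged.append({**left, **right})
    return merged


# Made-up sample data in the alternating shape described in the question.
sample_rows = [
    {'home_team': 'England', 'away_team': 'West Indies'},
    {'venue': "Lord's, London", 'match_type': 'TEST'},
    {'home_team': 'England', 'away_team': 'West Indies'},
    {'venue': 'Trent Bridge, Nottingham', 'match_type': 'TEST'},
]
matches = merge_row_pairs(sample_rows)
```
Unpacking left and right in the loop header keeps the index arithmetic out of the loop body.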
```
        # for i in final_list:
        #     print i
        # print final_list
```
This function probably needs to return that list or something.
```
if __name__ == '__main__':
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012' # change seriesCode in URL for different series.
    #url = 'http://localhost:6543/lhost/static/icc_cricket.html'
    c = cricket()
    c.getMatches(url)
```
Good
Here is my reworking of your code:
```
def get_cricket_matches(content):
    """ Scrape the given content for match schedule """
```
Code Snippets
```
import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime

class cricket(object):
```
```
    def getMatches(self, url):
```
```
        """ Scrape the given url for match schedule """
        headers = {'Accept':'text/css,*/*;q=0.1',
                   'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                   'Accept-Encoding':'gzip,deflate,sdch',
                   'Accept-Language':'en-US,en;q=0.8',
                   'User-Agent':'Mozilla/5 (Solaris 10) Gecko'}
```
```
        page = requests.get(url, headers = headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)
```
```
        soup = BeautifulSoup(requests.get(url, headers = headers).content)
```
```
        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')
```
Context
StackExchange Code Review Q#13593, answer score: 4