Scan a webpage to find the start time and date for an event
Problem
I am working on a simple web crawler that returns the start time and date for an event listed on a webpage. The webpage can be in two different formats and there are multiple other dates listed on the page. The part of the webpage I am scanning looks like this:
...
Time
Starts: Monday March 13, 2017 - 05:30 PM
Ends: Monday March 13, 2017 - 07:00 PM
Additional Dates/Times
Starts: Monday January 30, 2017 - 05:30 PM
Ends: Monday January 30, 2017 - 07:00 PM
Location
...
or like this:
...
Time
Friday March 17, 2017 - 01:30 PM
Location
...
After using BeautifulSoup to find the links I am interested in, I pass each link to these methods, which find the text between 'Time' and 'Location'. They then search that block for the character sequence '">', which I know only appears immediately before the date text, and scan up to the next tag, returning the date string.

import urllib2

MAX_CHARS = 140

def get_date(link):
    date_text = ""
    event_html = urllib2.urlopen(link['href']).read()
    start = find_date_location(event_html)
    # after the datetime closing tag the date begins until another tag opens
    for x in range(start, start + MAX_CHARS):
        if event_html[x] == '<':
            return date_text
        date_text += event_html[x]

def find_date_location(html):
    date_starts = html.find('<h4>Time</h4>')
    date_ends = html.find('<h4>Location</h4>')
    for x in range(date_starts, date_ends):
        if html[x] + html[x+1] == '">':
            return x+2
    raise ValueError('Date not found in HTML within time range')

This would return "Monday March 13, 2017 - 05:30 PM" in the first case and "Friday March 17, 2017 - 01:30 PM" in the second case. My method seems really hacky. Any tips on how I could do this better?
Solution
Your current approach is very fragile and could easily break if the markup changes even slightly. Imagine, for example, the opening and closing h4 tags being on separate lines while still forming a valid HTML element.
I would use a proper HTML parser like BeautifulSoup instead (you mentioned you've already tried it). To locate the start and end dates, we can use the itemprop attribute:

from bs4 import BeautifulSoup
def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')
    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")
    return (start_date.get_text() if start_date else None,
            end_date.get_text() if end_date else None)

Here, the get_even_date_range() function returns a tuple with the start and end dates as items; an item is None if the corresponding date is not found. For the first sample input HTML, it would return:
('Monday March 13, 2017 - 05:30 PM', 'Monday March 13, 2017 - 07:00 PM')
And, for the second:
('Friday March 17, 2017 - 01:30 PM', None)
You can then go further and convert the date strings to datetime objects using datetime.strptime() and the %A %B %d, %Y - %I:%M %p format (%I rather than %H, because %p only adjusts the parsed hour when the hour is read with the 12-hour %I directive):

from datetime import datetime
from bs4 import BeautifulSoup

DATE_FORMAT = "%A %B %d, %Y - %I:%M %p"

def get_date(date_element):
    return datetime.strptime(date_element.get_text(), DATE_FORMAT) if date_element else None

def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')
    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")
    return get_date(start_date), get_date(end_date)

Note that I've also moved the repetitive date retrieval logic to a separate reusable get_date() function.
Code Snippets
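One detail worth checking when parsing 12-hour timestamps like the ones in the question: in datetime.strptime(), the %p directive only shifts the parsed hour when the hour itself is read with %I. With %H, the AM/PM marker is matched but has no effect on the result. The following is a minimal sketch of that difference, using the first sample string from the question; the variable names are illustrative only.

```python
from datetime import datetime

# Sample string copied from the question's first example.
sample = "Monday March 13, 2017 - 05:30 PM"

# %I reads a 12-hour clock value, so %p can move it into the afternoon.
twelve_hour = datetime.strptime(sample, "%A %B %d, %Y - %I:%M %p")

# %H reads a 24-hour clock value; %p is matched but does not change the hour.
twenty_four_hour = datetime.strptime(sample, "%A %B %d, %Y - %H:%M %p")

print(twelve_hour.hour)       # 17
print(twenty_four_hour.hour)  # 5
```

Weekday, month, and AM/PM names are matched against the current locale, so this sketch assumes an English (or default C) locale.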
from bs4 import BeautifulSoup

def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')
    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")
    return (start_date.get_text() if start_date else None,
            end_date.get_text() if end_date else None)

('Monday March 13, 2017 - 05:30 PM', 'Monday March 13, 2017 - 07:00 PM')

('Friday March 17, 2017 - 01:30 PM', None)

from datetime import datetime
from bs4 import BeautifulSoup

DATE_FORMAT = "%A %B %d, %Y - %I:%M %p"

def get_date(date_element):
    return datetime.strptime(date_element.get_text(), DATE_FORMAT) if date_element else None

def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')
    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")
    return get_date(start_date), get_date(end_date)

Context
StackExchange Code Review Q#157765, answer score: 5