Scan a webpage to find the start time and date for an event

Problem

I am working on a simple web crawler that returns the start time and date for an event listed on a webpage. The webpage can be in two different formats and there are multiple other dates listed on the page. The part of the webpage I am scanning looks like this:

...
Time
     
        Starts: Monday March 13, 2017 - 05:30 PM
        
        Ends: Monday March 13, 2017 - 07:00 PM
    

Additional Dates/Times
    
        Starts: Monday January 30, 2017 - 05:30 PM
        
        Ends: Monday January 30, 2017 - 07:00 PM
    

Location
...


or like this:

...
Time

    Friday March 17, 2017 - 01:30 PM

Location
....


After using BeautifulSoup to find the links I am interested in, I pass each link to these methods, which look at the text between 'Time' and 'Location'. They search that block for the character sequence '">', which I know only appears immediately before the date text, and then scan up to the next tag, returning a date string.

import urllib2

MAX_CHARS = 140

def get_date(link):
    date_text = ""
    event_html = urllib2.urlopen(link['href']).read()
    start = find_date_location(event_html)
    # after the datetime closing tag the date begins until another tag opens
    for x in range(start, start + MAX_CHARS):
        if event_html[x] == '<':
            break
        date_text += event_html[x]
    return date_text

def find_date_location(html):
    # the date sits between the 'Time' and 'Location' headings
    date_starts = html.find('Time')
    date_ends = html.find('Location')
    for x in range(date_starts, date_ends):
        if html[x] + html[x+1] == '">':
            return x+2

    raise ValueError('Date not found in HTML within time range')


This would return "Monday March 13, 2017 - 05:30 PM" in the first case and "Friday March 17, 2017 - 01:30 PM" in the second case. My method seems really hacky. Any tips on how I could do this better?

Solution

Your current approach is very fragile and might easily break if the markup changes even slightly - imagine, for example, opening and closing h4 being on separate lines while still being a valid HTML element.

I would use a proper HTML parser like BeautifulSoup instead (you mentioned you've already tried it). In order to locate the start and end dates, we may use the itemprop attribute:

from bs4 import BeautifulSoup

def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')

    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")

    return (start_date.get_text() if start_date else None,
            end_date.get_text() if end_date else None)


Here, get_even_date_range() returns a tuple of the start and end date strings; an item is None if the corresponding date is not found. For the first sample input HTML, it would return:

('Monday March 13, 2017 - 05:30 PM', 'Monday March 13, 2017 - 07:00 PM')


And, for the second:

('Friday March 17, 2017 - 01:30 PM', None)
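To try the function without fetching a page, it can be run against small inline samples. This is only a sketch: the page's real tags are not shown above, so the `<h4>` headings and `<time itemprop="...">` elements below are assumed markup, and `SAMPLE_WITH_END` / `SAMPLE_START_ONLY` are hypothetical names. `get_text(strip=True)` is used here to trim any whitespace around the date text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the structure of the real page is an assumption.
SAMPLE_WITH_END = """
<h4>Time</h4>
<div>
  Starts: <time itemprop="startDate">Monday March 13, 2017 - 05:30 PM</time>
  Ends: <time itemprop="endDate">Monday March 13, 2017 - 07:00 PM</time>
</div>
<h4>Location</h4>
"""

SAMPLE_START_ONLY = """
<h4>Time</h4>
<time itemprop="startDate">Friday March 17, 2017 - 01:30 PM</time>
<h4>Location</h4>
"""


def get_even_date_range(html):
    soup = BeautifulSoup(html, "html.parser")

    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")

    # strip=True trims leading/trailing whitespace from the element text
    return (start_date.get_text(strip=True) if start_date else None,
            end_date.get_text(strip=True) if end_date else None)


with_end = get_even_date_range(SAMPLE_WITH_END)
start_only = get_even_date_range(SAMPLE_START_ONLY)
print(with_end)
print(start_only)
```

Because the lookup keys on the element name and attribute rather than on character offsets, reflowing the markup across lines should not affect the result.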


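You can then go further and convert the date strings to datetimes using datetime.strptime() and the %A %B %d, %Y - %I:%M %p format. Note %I (12-hour clock) rather than %H: strptime() only applies %p when the hour is parsed with %I, so with %H "05:30 PM" would come back as 05:30:

from datetime import datetime
from bs4 import BeautifulSoup

DATE_FORMAT = "%A %B %d, %Y - %I:%M %p"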

def get_date(date_element):
    return datetime.strptime(date_element.get_text(), DATE_FORMAT) if date_element else None

def get_even_date_range(html):
    soup = BeautifulSoup(html, 'html.parser')

    start_date = soup.find("time", itemprop="startDate")
    end_date = soup.find("time", itemprop="endDate")

    return get_date(start_date), get_date(end_date)


Note that I've also moved the repetitive date retrieval logic to a separate reusable get_date() function.
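One detail worth checking when parsing these strings is how strptime() handles the AM/PM marker. The short comparison below, using only the standard library and one of the sample date strings, shows that %p changes the parsed hour only when the hour itself is read with %I; the variable names are illustrative.

```python
from datetime import datetime

text = "Monday March 13, 2017 - 05:30 PM"

# %H reads the hour as 24-hour time, so the trailing "PM" is matched but ignored.
with_24h = datetime.strptime(text, "%A %B %d, %Y - %H:%M %p")

# %I reads the hour as 12-hour time, so "PM" shifts it to the afternoon.
with_12h = datetime.strptime(text, "%A %B %d, %Y - %I:%M %p")

print(with_24h.hour)  # 5
print(with_12h.hour)  # 17
```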

Context

StackExchange Code Review Q#157765, answer score: 5
