Improving the performance of a webscraper

Submitted by: @import:stackexchange-codereview

Problem

Here is a modified version of a web scraper I wrote a few weeks ago. With some help from this forum, it is now faster than the earlier version (about 4 seconds per iteration). However, I need to run over 1 million iterations, which takes far too long. Is there any way to improve its performance further? Thank you.

sample data (data.csv)

Code    Origin
1       Eisenstadt
2       Tirana
3       St Pölten Hbf
6       Wien Westbahnhof
7       Wien Hauptbahnhof
8       Klagenfurt Hbf
9       Villach Hbf
11      Graz Hbf
12      Liezen


Code:

```
import csv
from functools import wraps
from datetime import datetime, time
import urllib2
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer

# function to group elements of a list
def group(lst, n):
    return zip(*[lst[i::n] for i in range(n)])

# function to convert time string to minutes
def get_min(time_str):
    h, m = time_str.split(':')
    return int(h) * 60 + int(m)

# Delay function in case of network disconnection
def retry(ExceptionToCheck, tries=1000, delay=3, backoff=2, logger=None):

    def deco_retry(f):

        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck, e:
                    msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print msg
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)

        return f_retry # true decorator

    return deco_retry

def datareader(datafile):
    """ This function reads the cities data from csv file and processes
    them into an O-D for input into the web scraper """

    # Read the csv
    with open(datafile, 'r'
```

Solution

There is one major limitation: your code is blocking. It processes the timetable searches sequentially, one at a time, so at 4 seconds per search a million searches would take roughly 46 days.

I really think you should switch to the Scrapy web-scraping framework: it is fast, pluggable and entirely asynchronous. As a bonus, you will be able to scale your spider to multiple instances or multiple machines. For example, you could divide your input data evenly into N parts and run one spider instance per part (check out scrapyd).
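The even split mentioned above could look roughly like the following. This is a minimal sketch rather than part of the original answer: `split_evenly` is a hypothetical helper, and how each part is handed to a spider instance (for example as a spider argument when scheduling with scrapyd) is left out.

```python
def split_evenly(items, n):
    """Split items into n parts whose sizes differ by at most one.

    Hypothetical helper: part i takes every n-th item starting at index i.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i::n] for i in range(n)]


# A few origins from the sample data.csv in the question.
origins = ["Eisenstadt", "Tirana", "Wien Westbahnhof", "Wien Hauptbahnhof",
           "Klagenfurt Hbf", "Villach Hbf", "Graz Hbf"]

# Each part would feed one spider instance.
for index, part in enumerate(split_evenly(origins, 3)):
    print(index, part)
```

Striding through the list keeps the parts balanced without computing chunk boundaries, at the cost of not keeping neighbouring rows together.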

Here is a sample spider that works for a single timetable search:

```
import scrapy


TIMES = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']
DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",

    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",

    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}


def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z


class FahrplanSpider(scrapy.Spider):
    name = "fahrplan"
    allowed_domains = ["fahrplan.sbb.ch"]

    def start_requests(self):
        params = {
            "REQ0JourneyStopsS0G": "Eisenstadt",
            "REQ0JourneyStopsZ0G": "Tirano, Stazione",
            "date": "27.02.17",
            "REQ0JourneyTime": "17:00"
        }
        formdata = merge_two_dicts(DEFAULT_PARAMS, params)
        yield scrapy.FormRequest("http://fahrplan.sbb.ch/bin/query.exe/en", method="POST", formdata=formdata)

    def parse(self, response):
        for trip_time in response.css("table.hfs_overview tr td.time::text").extract():
            print(trip_time.strip())
```


If you want to take it further, you should do the following:

  • use the datareader() results in the start_requests() method and start a form request for every input item
  • define an Item class and yield/return it in the parse() callback
  • use an "Item Pipeline" to "pipe" your items into the output file
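
The first and last of these steps can be sketched without much Scrapy-specific code. The block below is an illustration under assumptions, not the answer's implementation: `build_formdata` and `CsvExportPipeline` are hypothetical names, the list of origins stands in for whatever `datareader()` returns, every ordered pair of distinct cities is assumed to be searched, and plain dicts stand in for a dedicated Item class (Scrapy accepts dicts as items). A Scrapy item pipeline is an ordinary class with a `process_item(item, spider)` method plus optional `open_spider` and `close_spider` hooks.

```python
import csv
from itertools import permutations

TIMES = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']


def build_formdata(origins, date, times=TIMES, defaults=None):
    """Yield one form-data dict per (origin, destination, time) combination.

    Assumption: every ordered pair of distinct cities is a search.
    """
    for origin, destination in permutations(origins, 2):
        for journey_time in times:
            formdata = dict(defaults or {})  # copy, so defaults stay untouched
            formdata.update({
                "REQ0JourneyStopsS0G": origin,
                "REQ0JourneyStopsZ0G": destination,
                "date": date,
                "REQ0JourneyTime": journey_time,
            })
            yield formdata


class CsvExportPipeline(object):
    """Hypothetical pipeline: write every scraped item (a dict) as a CSV row."""

    # Assumed item fields; adjust to whatever parse() actually yields.
    fields = ["origin", "destination", "departure", "trip_time"]

    def __init__(self, path="trips.csv"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=self.fields)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Called for every item yielded by a spider callback.
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Inside `start_requests()`, each dict from `build_formdata` would be passed as `formdata=` to `scrapy.FormRequest`, and `parse()` would yield dicts with the fields listed above instead of printing. The pipeline is switched on through the `ITEM_PIPELINES` setting.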



I understand that this is a lot of new information, but having done web scraping for a long time, I can say it is really worth it, especially performance-wise.

Context

StackExchange Code Review Q#157640, answer score: 3
