Optimizing the speed of a web scraper

Submitted by: @import:stackexchange-codereview

Problem

I have just written this code to scrape some data from a website. In its current state it works fine. However, going by my tests, with the amount of data I am processing it will take a few days to finish the task. Is there a way to improve its performance? I have included only a sample of the data below, since the full set is large.

Input data in CSV format:

Code Origin
1 Eisenstadt
2 Tirana
3 St Pölten Hbf
6 Wien Westbahnhof
7 Wien Hauptbahnhof
8 Klagenfurt Hbf
9 Villach Hbf
11 Graz Hbf
12 Liezen


Code:

```
# import needed libraries
import csv
from datetime import datetime
from mechanize import Browser
from bs4 import BeautifulSoup

def datareader(datafile):

    """ This function reads the cities from csv file and processes

    them into an O-D for input into the web scrapper """

    # Read the csv
    with open(datafile, 'r') as f:

        reader = csv.reader(f)
        header = reader.next()
        ListOfCities = [lines for lines in reader]
        temp = ListOfCities[:]

    city_num = []
    city_orig_dest = []
    for i in ListOfCities:
        for j in temp:
            ans1 = i[0], j[0]

            if ans1[0] != ans1[1]:
                city_num.append(ans1)

            ans = (unicode(i[1], 'iso-8859-1'), unicode(j[1], 'iso-8859-1'),i[0], j[0])
            if ans[0] != ans[1] and ans[2] != ans[3]:
                city_orig_dest.append(ans)

    yield city_orig_dest

input_data = datareader('BAK.csv') # Input data here

def webscrapper(x):

    """ This function scraped the required website and extracts the

    quickest connection time within given time durations """

    #Create a browser object
    br = Browser()

    # Ignore robots.txt
    br.set_handle_robots(False)

    # Google demands a user-agent that isn't a robot
    br.addheaders = [('User-agent', 'Chrome')]

    # Retrieve the Google home page, saving the response
    br.o
```

Solution

Performance Issues

The main bottleneck here is the blocking nature of the program. You are processing URLs sequentially: you don't start on the next URL until you are done with the current one. This can be solved by switching to an asynchronous approach, using either Scrapy (which is the best thing that has happened in the Python web-scraping world) or something like grequests.
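To illustrate why overlapping the requests helps, here is a minimal sketch that uses the standard library's thread pool rather than Scrapy or grequests. `fetch_connection` is a hypothetical stand-in that only sleeps to simulate network latency; in a real scraper it would perform the HTTP request and the parsing. The function names and the worker count are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_connection(pair):
    """Hypothetical stand-in for one blocking request (not the real scraper)."""
    origin, destination = pair
    time.sleep(0.1)  # simulate waiting on the network
    return origin, destination, "ok"


def scrape_all(pairs, max_workers=8):
    """Run fetch_connection for every pair, overlapping the waiting time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input pairs
        return list(pool.map(fetch_connection, pairs))


if __name__ == "__main__":
    pairs = [("Eisenstadt", "Tirana"), ("Liezen", "Graz Hbf"),
             ("Villach Hbf", "Klagenfurt Hbf"), ("Tirana", "Liezen")]
    start = time.time()
    results = scrape_all(pairs)
    print(len(results), "results in %.2f s" % (time.time() - start))
```

With four simulated requests the waits overlap, so the total time should be close to that of a single request rather than four. Keeping `max_workers` modest avoids flooding the target site.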

Also, the HTML parsing speed can be improved by parsing only the relevant part of the document with the `SoupStrainer` class:

```
from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer("table", class_="hfs_overview")
soup = BeautifulSoup(br.response(), 'lxml', from_encoding="utf-8", parse_only=parse_only)

trs = soup.select('tr')
```


The other thing you can try is to switch from mechanize to requests, using a single `requests.Session()` instance for all the requests. This way, the underlying TCP connection will be reused, which may result in a performance improvement.

There are also some things you are redoing over and over again in the loops. Things like the control variable should be computed once, before the loop.

Also, avoid redefining the `get_sec()` function inside the loop; define it once beforehand.

Finally, use the `min()` function instead of calling `sorted()` and taking the first element.
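The last two points can be sketched together. The body of `get_sec()` is not visible in the code above, so the version here is an assumption: it treats each duration as an "H:MM" string. The sample durations are made up.

```python
def get_sec(duration):
    """Convert an 'H:MM' duration string to seconds (assumed format)."""
    hours, minutes = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60


# made-up durations, as they might be scraped from a result table
durations = ["7:45", "6:10", "10:02"]

# sorted(...)[0] sorts the whole list just to read one element
fastest_sorted = sorted(durations, key=get_sec)[0]

# min() finds the same element in a single pass
fastest = min(durations, key=get_sec)

print(fastest)
```

Note that `min()` raises `ValueError` on an empty sequence; on Python 3.4+ a fallback can be supplied with `min(durations, key=get_sec, default=None)`.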

Code Style Issues

  • `if len(locations) > 0:` can be improved as `if locations:`
  • `if len(durations) == 0:` can be improved as `if not durations:`
  • `if len(fastest_connect) == 0:` can be improved as `if not fastest_connect:`
  • `.select(..)[0]` can be replaced with `.select_one(...)`
  • BeautifulSoup understands file-like objects as well, so replace `br.response().read()` with `br.response()`
  • organize imports as per PEP8 recommendations:

    ```
    import csv
    from datetime import datetime

    from bs4 import BeautifulSoup
    from mechanize import Browser
    ```

  • the `# import needed libraries` comment does not make much sense
  • no need for the extra newline before the function docstrings
  • put the main program logic into `if __name__ == '__main__':` to avoid it being executed on import
  • by introducing the `time` variable, you are shadowing the imported `time` module
  • properly define constants (for example, the time format, or the magical `999999` number)
  • use the `with` context manager when dealing with files
  • remove the unused `header` variable
  • skip the CSV header via the `next()` built-in function: `next(reader, None)`
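A few of the points above (truthiness checks, named constants, and the main guard) could look like this in practice. What `999999` and the time format stand for is not visible in the question's code, so `NO_CONNECTION`, `TIME_FORMAT`, and both helper functions are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"    # assumed format of the scraped clock times
NO_CONNECTION = 999999   # named stand-in for the bare 999999 literal (assumed meaning)


def fastest_connection(durations):
    """Return the smallest duration, or NO_CONNECTION when the list is empty."""
    if not durations:          # instead of: if len(durations) == 0
        return NO_CONNECTION
    return min(durations)


def parse_departure(text):
    """Parse a clock time such as '17:00' using the shared format constant."""
    return datetime.strptime(text, TIME_FORMAT).time()


def main():
    print(fastest_connection([445, 370, 602]))
    print(fastest_connection([]))
    print(parse_departure("17:00"))


if __name__ == "__main__":   # nothing below runs when the module is imported
    main()
```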



A note about Python 3 compatibility:

  • use the `next()` function instead of the `.next()` method
  • `xrange()` does not exist in Python 3; use `range()`, or handle both in a cross-Python way
  • use the `print()` function instead of the print statement
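The CSV-related points and the Python 3 notes can be sketched together as a rewrite of the reader. This is only an illustration: the name `read_city_pairs`, the comma delimiter, the two-column layout, and the encoding are assumptions, and `itertools.permutations` is swapped in for the nested loops. It matches the original filtering only if codes and names are unique, and it yields one tuple at a time instead of a single list.

```python
import csv
import io
from itertools import permutations

CSV_ENCODING = "iso-8859-1"  # assumed, mirroring the unicode() calls in the question


def read_city_pairs(csv_file):
    """Yield (origin_name, dest_name, origin_code, dest_code) for every
    ordered pair of distinct rows in an already opened CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row without keeping it
    rows = [row for row in reader if row]  # ignore blank lines
    for (code_a, name_a), (code_b, name_b) in permutations(rows, 2):
        yield name_a, name_b, code_a, code_b


if __name__ == "__main__":
    # in-memory sample; a real run would wrap
    # open("BAK.csv", newline="", encoding=CSV_ENCODING) in a with block
    sample = io.StringIO("Code,Origin\n1,Eisenstadt\n2,Tirana\n12,Liezen\n")
    for pair in read_city_pairs(sample):
        print(pair)
```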



Here is a sample code that uses requests to make a search (note that we handle the default parameters "manually" - if you want to automatically handle the default parameter values as in case of mechanize, look into MechanicalSoup or RoboBrowser):

```
import requests
from bs4 import BeautifulSoup, SoupStrainer

def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z

url = "http://fahrplan.sbb.ch/bin/query.exe/en"
DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",

    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",

    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}

    session.get(url)  # visit the main page (might not be actually needed)

    # sample parameters
    params = {
        "REQ0JourneyStopsS0G": "Eisenstadt",
        "REQ0JourneyStopsZ0G": "Tirano, Stazione",
        "date": "27.02.17",
        "REQ0JourneyTime": "17:00"
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))

    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)

    # print out times for demonstration purposes
    trs = soup.select('tr')
    for tr in trs:
        time = tr.select_one('td.time')
        if time:
            print(time.get_text(strip=True))
```
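If the script only needs to run on Python 3.5 or later, the `merge_two_dicts()` helper can be replaced by dict unpacking, which likewise builds a new shallow copy in which later values win. The two dicts below are shortened copies of the ones in the sample, used purely for illustration.

```python
DEFAULT_PARAMS = {"REQ0JourneyStopsS0A": "255", "REQ0JourneyStopsZ0A": "255"}
params = {"REQ0JourneyStopsS0G": "Eisenstadt", "REQ0JourneyTime": "17:00"}

# later mappings override earlier ones; neither input dict is modified
payload = {**DEFAULT_PARAMS, **params}

print(payload)
```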

Context

StackExchange Code Review Q#155681, answer score: 5
