Optimizing the speed of a web scraper
Problem
I have just written this code to scrape some data from a website. In its current state it works fine; however, my tests of the script show that, with the amount of data I am processing, it will take a few days to finish the task. Is there a way to improve its performance? I will insert only a sample of the data rather than the bulk of it.
Input data in CSV format:
Code Origin
1 Eisenstadt
2 Tirana
3 St Pölten Hbf
6 Wien Westbahnhof
7 Wien Hauptbahnhof
8 Klagenfurt Hbf
9 Villach Hbf
11 Graz Hbf
12 Liezen
Code:
```
# import needed libraries
import csv
from datetime import datetime
from mechanize import Browser
from bs4 import BeautifulSoup

def datareader(datafile):
    """ This function reads the cities from csv file and processes
    them into an O-D for input into the web scrapper """
    # Read the csv
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        header = reader.next()
        ListOfCities = [lines for lines in reader]
        temp = ListOfCities[:]
        city_num = []
        city_orig_dest = []
        for i in ListOfCities:
            for j in temp:
                ans1 = i[0], j[0]
                if ans1[0] != ans1[1]:
                    city_num.append(ans1)
                ans = (unicode(i[1], 'iso-8859-1'), unicode(j[1], 'iso-8859-1'),i[0], j[0])
                if ans[0] != ans[1] and ans[2] != ans[3]:
                    city_orig_dest.append(ans)
    yield city_orig_dest

input_data = datareader('BAK.csv') # Input data here

def webscrapper(x):
    """ This function scraped the required website and extracts the
    quickest connection time within given time durations """
    #Create a browser object
    br = Browser()
    # Ignore robots.txt
    br.set_handle_robots(False)
    # Google demands a user-agent that isn't a robot
    br.addheaders = [('User-agent', 'Chrome')]
    # Retrieve the Google home page, saving the response
    br.o
```
Solution
Performance Issues
The main bottleneck here is the blocking nature of the program. You are processing URLs one by one, sequentially: you don't process the next URL until you are done with the current one. This can be solved by switching to an asynchronous approach, either using Scrapy (which is the best thing that has happened in Python's web-scraping world), or something like grequests.

Also, the HTML parsing speed can be improved by parsing only the relevant part of the document with a `SoupStrainer` class:

```
from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer("table", class_="hfs_overview")
soup = BeautifulSoup(br.response(), 'lxml', from_encoding="utf-8", parse_only=parse_only)
trs = soup.select('tr')
```

The other thing you can try is to switch from `mechanize` to `requests`, using a single `requests.Session()` instance for all the requests. This way, the underlying TCP connection will be reused, which may result in a performance improvement.

There are also some things you are re-doing over and over again in the loops. Things like the `control` variable should be computed once, before the loop. And avoid redefining the `get_sec()` function inside the loop; define it beforehand.

Also, use the `min()` function instead of calling `sorted()` and getting the first element.

Code Style Issues
- `if len(locations) > 0:` can be improved as `if locations:`
- `if len(durations) == 0:` can be improved as `if not durations:`
- `if len(fastest_connect) == 0:` can be improved as `if not fastest_connect:`
- `.select(..)[0]` can be replaced with `.select_one(...)`
- `BeautifulSoup` understands file-like objects as well; replace `br.response().read()` with `br.response()`
- organize imports as per PEP8 recommendations:

```
import csv
from datetime import datetime

from bs4 import BeautifulSoup
from mechanize import Browser
```

- the `# import needed libraries` comment does not make much sense
- no need for the extra newline before the function docstrings
- put the main program logic into `if __name__ == '__main__':` to avoid it being executed on import
- by introducing the `time` variable, you are shadowing the imported `time` module
- properly define constants (for example, the time format, or the magical `999999` number)
- use the `with` context manager when dealing with files
- remove the unused `header` variable
- skip the CSV header via the `next()` built-in function: `next(reader, None)`
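Several of the style points above could fit together roughly as follows. This is only a sketch: the names `read_cities`, `fastest_or_sentinel`, and `NO_CONNECTION` are hypothetical and do not come from the original script, and the comma-separated sample input is an assumption about the real file's layout.

```python
import csv
import io

# Hypothetical named constant replacing the "magical" 999999 literal.
NO_CONNECTION = 999999


def read_cities(csv_file):
    """Return the data rows of an open CSV file, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header; the default avoids StopIteration on empty input
    return [row for row in reader if row]  # drop blank lines


def fastest_or_sentinel(durations):
    """Return the smallest duration, or NO_CONNECTION when there are none."""
    if not durations:  # instead of: if len(durations) == 0
        return NO_CONNECTION
    return min(durations)


if __name__ == '__main__':
    # In the real script this would be: with open('BAK.csv') as f: read_cities(f)
    sample = io.StringIO(u"Code,Origin\n1,Eisenstadt\n2,Tirana\n")
    print(read_cities(sample))
    print(fastest_or_sentinel([]))
```

Because nothing runs at import time, the helper functions can be imported and tested without triggering any scraping.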
A note about Python 3 compatibility:
- use the `next()` function instead of the `.next()` method
- `range()` vs `xrange()` (cross-Python way to handle both)
- use the `print()` function instead of a statement
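A small sketch of what these three points might look like in code that has to run on both Python 2 and Python 3. The `try`/`except NameError` shim and the name `range_` are assumptions made for illustration, not necessarily the approach the reference above recommends; the in-memory `rows` iterator stands in for a `csv.reader` object.

```python
# On Python 2, `from __future__ import print_function` at the very top of the
# module makes print() behave as a function; on Python 3 it already does.

try:
    range_ = xrange  # Python 2: lazy integer range
except NameError:
    range_ = range   # Python 3: range() is already lazy

# Stand-in for csv.reader(f): any iterator works with the next() built-in.
rows = iter([["Code", "Origin"], ["1", "Eisenstadt"]])
header = next(rows)      # instead of rows.next(), which exists only on Python 2
first_row = next(rows)

squares = [i * i for i in range_(4)]
print("header=%s first_row=%s squares=%s" % (header, first_row, squares))
```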
Here is some sample code that uses `requests` to make a search (note that we handle the default parameters "manually"; if you want to automatically handle the default parameter values as in the case of `mechanize`, look into MechanicalSoup or RoboBrowser):

```
import requests
from bs4 import BeautifulSoup, SoupStrainer


def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z


url = "http://fahrplan.sbb.ch/bin/query.exe/en"
DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",
    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",
    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}
    session.get(url)  # visit the main page (might not be actually needed)

    # sample parameters
    params = {
        "REQ0JourneyStopsS0G": "Eisenstadt",
        "REQ0JourneyStopsZ0G": "Tirano, Stazione",
        "date": "27.02.17",
        "REQ0JourneyTime": "17:00"
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))

    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)

    # print out times for demonstration purposes
    trs = soup.select('tr')
    for tr in trs:
        time = tr.select_one('td.time')
        if time:
            print(time.get_text(strip=True))
```
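The performance advice from the first part of this answer can also be sketched with a thread pool from the standard library. That is a different technique from Scrapy or grequests, but it targets the same problem: requests no longer wait for one another. In this sketch, `fetch_durations` is a hypothetical caller-supplied function (for example, a wrapper around a `session.post()` call) that returns a list of duration strings for one origin-destination pair. The "HH:MM" format handled by `get_sec()` is likewise an assumption, since the original helper is not shown. `get_sec()` is defined once at module level, and `min()` replaces `sorted(...)[0]`.

```python
from concurrent.futures import ThreadPoolExecutor  # Python 3 standard library


def get_sec(duration):
    """Convert an 'HH:MM' string to seconds (assumed input format)."""
    hours, minutes = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def fastest(durations):
    """Return the shortest duration string, or None if there are none."""
    if not durations:
        return None
    return min(durations, key=get_sec)  # single pass, no full sort


def scrape_all(pairs, fetch_durations, max_workers=8):
    """Fetch all pairs concurrently; return {pair: fastest duration or None}."""
    pairs = list(pairs)  # allow generators; the list is iterated twice below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order while the calls run in threads
        results = pool.map(fetch_durations, pairs)
        return {pair: fastest(durs) for pair, durs in zip(pairs, results)}


if __name__ == '__main__':
    # Stand-in for the real network call, so the sketch runs offline.
    def fake_fetch(pair):
        return ["10:05", "9:40"] if pair[0] == "Eisenstadt" else []

    od_pairs = [("Eisenstadt", "Tirana"), ("Tirana", "Eisenstadt")]
    print(scrape_all(od_pairs, fake_fetch))
```

Passing the fetch function in as a parameter keeps the concurrency logic independent of the HTTP library, so it can be exercised with a stub before being pointed at the real site. A modest `max_workers` value is advisable to avoid overloading the server.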
Context
StackExchange Code Review Q#155681, answer score: 5