Improving the performance of a webscraper
Problem
I have here a modified version of some web-scraping code I wrote a few weeks ago. With some help from this forum, this version is faster (about 4 seconds per iteration) than the earlier one. However, I need to run over 1 million iterations, which would take far too long. Is there any way to improve its performance further? Thank you.
Sample data (data.csv):

| Code | Origin |
|------|--------|
| 1 | Eisenstadt |
| 2 | Tirana |
| 3 | St Pölten Hbf |
| 6 | Wien Westbahnhof |
| 7 | Wien Hauptbahnhof |
| 8 | Klagenfurt Hbf |
| 9 | Villach Hbf |
| 11 | Graz Hbf |
| 12 | Liezen |

Code:
```
import csv
from functools import wraps
from datetime import datetime, time
import urllib2
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer


# function to group elements of a list
def group(lst, n):
    return zip(*[lst[i::n] for i in range(n)])


# function to convert time string to minutes
def get_min(time_str):
    h, m = time_str.split(':')
    return int(h) * 60 + int(m)


# Delay function in case of network disconnection
def retry(ExceptionToCheck, tries=1000, delay=3, backoff=2, logger=None):
    def deco_retry(f):
        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck, e:
                    msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print msg
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)
        return f_retry  # true decorator
    return deco_retry


def datareader(datafile):
    """ This function reads the cities data from csv file and processes
    them into an O-D for input into the web scraper """
    # Read the csv
    with open(datafile, 'r'
```
Solution
There is one major limitation: your code is blocking. It processes timetable searches sequentially, one at a time. At about 4 seconds per iteration, a million iterations run back to back would take roughly 46 days.
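To make the cost of blocking concrete, here is a small stand-in experiment in Python 3. It does not touch the real site: `fake_search` is a hypothetical placeholder that only sleeps to simulate network latency, and a standard-library thread pool stands in for the concurrency an asynchronous framework would give you. Treat it as an illustration of the idea, not as the recommended implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed per-request wait in seconds, kept tiny for the demo
# (the question reports about 4 s per iteration).
LATENCY = 0.05


def fake_search(pair):
    """Hypothetical stand-in for one timetable search: waits, then echoes."""
    time.sleep(LATENCY)
    return pair


def run_sequential(pairs):
    """Blocking style: each search starts only after the previous one ends."""
    return [fake_search(pair) for pair in pairs]


def run_concurrent(pairs, workers=8):
    """Overlap the waiting by keeping several searches in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_search, pairs))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    pairs = [("Eisenstadt", "Tirana")] * 8
    _, t_seq = timed(run_sequential, pairs)
    _, t_con = timed(run_concurrent, pairs)
    print("sequential: %.2fs, concurrent: %.2fs" % (t_seq, t_con))
```

Because almost all of the time per search is spent waiting on the network, overlapping the waits is where the speed-up comes from.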
I really think you should switch to the Scrapy web-scraping framework: it is fast, pluggable and entirely asynchronous. As a bonus, you will be able to scale your spider to multiple instances or multiple machines. For example, you could divide your input data evenly into N parts and run a spider instance for each part (check out scrapyd). Here is a sample spider that works for a single timetable search:
```
import scrapy


TIMES = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']

DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",
    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",
    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}


def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z


class FahrplanSpider(scrapy.Spider):
    name = "fahrplan"
    allowed_domains = ["fahrplan.sbb.ch"]

    def start_requests(self):
        params = {
            "REQ0JourneyStopsS0G": "Eisenstadt",
            "REQ0JourneyStopsZ0G": "Tirano, Stazione",
            "date": "27.02.17",
            "REQ0JourneyTime": "17:00"
        }

        formdata = merge_two_dicts(DEFAULT_PARAMS, params)
        yield scrapy.FormRequest("http://fahrplan.sbb.ch/bin/query.exe/en", method="POST", formdata=formdata)

    def parse(self, response):
        for trip_time in response.css("table.hfs_overview tr td.time::text").extract():
            print(trip_time.strip())
```

If you want to take it further, you should do the following:
- use the `datareader()` results in the `start_requests()` method and start a form request for every input item
- define an `Item` class and `yield`/`return` it in the `parse()` callback
- use an "Item Pipeline" to "pipe" your items into the output file
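The three steps above could be sketched roughly as follows (Python 3). Several things here are assumptions for illustration: that the O-D pairs are all ordered pairs of the input cities (the original `datareader()` is not shown in full), the `TripItem` fields, the `CsvExportPipeline` name, and the `trips.csv` output file. The classes are deliberately plain Python: a Scrapy item pipeline is an ordinary class with these hook methods, and recent Scrapy versions accept dataclass objects as items.

```python
import csv
import io
from dataclasses import dataclass, asdict
from itertools import permutations

TIMES = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']
FIELDS = ["origin", "destination", "departure", "trip_time"]


def iter_formdata(cities, date, times=TIMES):
    """Yield the search-specific form fields for every O-D pair and time.

    Inside start_requests() each dict would be merged with DEFAULT_PARAMS
    and passed as `formdata` to scrapy.FormRequest, one request per dict.
    """
    for origin, destination in permutations(cities, 2):
        for journey_time in times:
            yield {
                "REQ0JourneyStopsS0G": origin,
                "REQ0JourneyStopsZ0G": destination,
                "date": date,
                "REQ0JourneyTime": journey_time,
            }


@dataclass
class TripItem:
    """Hypothetical item that parse() would yield instead of printing."""
    origin: str
    destination: str
    departure: str
    trip_time: str


class CsvExportPipeline:
    """Item pipeline sketch; Scrapy calls the three hooks below when the
    class is listed in the ITEM_PIPELINES setting."""

    def __init__(self, stream=None):
        self.stream = stream  # injectable for testing; a file by default
        self.writer = None

    def open_spider(self, spider):
        if self.stream is None:
            self.stream = open("trips.csv", "w", newline="")
        self.writer = csv.writer(self.stream)
        self.writer.writerow(FIELDS)

    def process_item(self, item, spider):
        row = asdict(item)
        self.writer.writerow([row[name] for name in FIELDS])
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.stream.close()
```

To fill in `origin` and `destination` inside `parse()`, one option is to attach them to each request (for example through the request's `meta` dict) so they travel with the response.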
I understand that this is a lot of new information, but having done web scraping for a long time, I can say it is really worth it, especially performance-wise.
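For the earlier suggestion of dividing the input evenly into N parts, a minimal helper might look like this. The function name `split_evenly` is made up for this sketch; it only does the splitting and knows nothing about Scrapy or scrapyd.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for index in range(n_parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    for part in split_evenly(list(range(10)), 3):
        print(part)
```

Each chunk can then be written to its own input file and handed to a separate spider instance, for example through a spider argument passed with `scrapy crawl fahrplan -a ...` (the argument name would be whatever your spider expects).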
Context
StackExchange Code Review Q#157640, answer score: 3