Check the stock of references list
Problem
I wrote this spider to check the stock for a list of references (around 3,500 references).
Currently the spider takes around 37 seconds to scrape 400 references. CPU usage is around 5% and the network card (1 Gbps) is at around 18%. My internet connection is 300 Mbps symmetric, and only this computer is connected to it.
Any ideas for improving performance? Is this good performance? Could the ISP router be a bottleneck?
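As a rough back-of-the-envelope check of the figures above, the request rate and the projected time for the full list can be worked out as follows. This is only a sketch: the function name is made up for illustration, and it assumes the rate stays constant over all 3,500 references, which may not hold in practice.

```python
def requests_per_second(n_requests, elapsed_seconds):
    """Average request rate over a measured run."""
    return n_requests / elapsed_seconds


# Figures quoted in the question: 400 references in about 37 seconds.
rate = requests_per_second(400, 37)

# Projected time for the full list, assuming the same rate throughout.
estimated_seconds = 3500 / rate

print(f"{rate:.1f} requests/s")
print(f"about {estimated_seconds / 60:.1f} minutes for 3500 references")
```

That is roughly 11 requests per second, or a little over five minutes for the whole list at the current pace.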
import scrapy
from scrapy.crawler import CrawlerProcess


class Spider(scrapy.Spider):
    name = "Spider"
    start_urls = ['URLS']

    def __init__(self, references=None, *args, **kwargs):
        super(ktmSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'UserName': 'username', 'Password': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        ref = references.pop()
        yield scrapy.Request(url="url" + ref, callback=self.parse_stock)

    def parse_stock(self, response):
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[2]/text()').extract_first() + ',')
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[2]/text()').extract_first() + ',')
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[1]/i/@style').extract_first() + '\n')
        while len(references) > 0:
            ref = references.pop()
            yield scrapy.Request(url="url" + ref, callback=self.parse_stock)


f = open("references.txt")
references = f.read().splitlines()
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': 'INFO',
    'AUTOTHROTTLE_ENABLED': 'True',
    'AUTOTHROTTLE_START_DELAY': '0.1',
    'AUTOTHROTTLE_TARGET_CONCURRENCY': '100'
})
process.crawl(ktmSpider, references=references, stockFile="file.txt")
process.start()
Solution
The AutoThrottle extension may cause high download delays. Either turn the extension off to see what the resulting time would be, or limit the maximum delay via AUTOTHROTTLE_MAX_DELAY.
Also, you may issue multiple requests from the after_login() method instead of keeping a queue of references:
def after_login(self, response):
    for ref in references:
        yield scrapy.Request(url="url" + ref, callback=self.parse_stock)
Also, instead of writing to the file from the spider directly, you can use a CSV output pipeline:
import csv


class CSVWriterPipeline(object):
    def __init__(self):
        # On Python 3 the csv module needs a text-mode file opened with newline=''.
        self.writer = csv.writer(open('file.txt', 'w', newline=''))

    def process_item(self, item, spider):
        # Write one CSV row per scraped item.
        self.writer.writerow([item["field1"], item["field2"], item["field3"]])
        return item
Where field1, field2 and field3 are your item fields:
from scrapy import Field, Item


class MyItem(Item):
    field1 = Field()
    field2 = Field()
    field3 = Field()
Which should be set in the parse_stock() callback:
def parse_stock(self, response):
    item = MyItem()
    item["field1"] = response.xpath('//*[@id="priceDetails"]/form/div[2]/text()').extract_first()
    item["field2"] = response.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[2]/text()').extract_first()
    item["field3"] = response.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[1]/i/@style').extract_first()
    return item
Then, you would need to enable the pipeline:
process = CrawlerProcess({
    'ITEM_PIPELINES': {
        '__main__.CSVWriterPipeline': 100
    },
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': 'INFO',
    'AUTOTHROTTLE_ENABLED': 'True',
    'AUTOTHROTTLE_START_DELAY': '0.1',
    'AUTOTHROTTLE_TARGET_CONCURRENCY': '100'
})
Of course, improve the "field" and "item" class names to be more meaningful.
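To try the AutoThrottle suggestion, the settings passed to CrawlerProcess could be varied along these lines. This is only a sketch: the helper names and the concrete values (32 concurrent requests, a 1-second delay cap) are assumptions picked for illustration, not measured recommendations. The setting names themselves are standard Scrapy settings.

```python
# Settings shared by both variants (taken from the script in the question).
BASE_SETTINGS = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': 'INFO',
}


def settings_without_autothrottle(concurrency=32):
    """Variant 1: AutoThrottle off, fixed concurrency (the value is an assumption)."""
    settings = dict(BASE_SETTINGS)
    settings.update({
        'AUTOTHROTTLE_ENABLED': False,
        'CONCURRENT_REQUESTS': concurrency,
        'CONCURRENT_REQUESTS_PER_DOMAIN': concurrency,
    })
    return settings


def settings_with_capped_delay(max_delay=1.0, concurrency=32):
    """Variant 2: AutoThrottle on, but with the maximum delay capped."""
    settings = dict(BASE_SETTINGS)
    settings.update({
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.1,
        'AUTOTHROTTLE_MAX_DELAY': max_delay,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': float(concurrency),
        # AutoThrottle does not go above the per-domain limit, so raise it too.
        'CONCURRENT_REQUESTS_PER_DOMAIN': concurrency,
    })
    return settings


# Usage (requires Scrapy):
#   process = CrawlerProcess(settings_without_autothrottle())
```

As far as I know, AutoThrottle never raises concurrency above CONCURRENT_REQUESTS_PER_DOMAIN (8 by default), so a target concurrency of 100 has little effect unless that limit is raised as well. Running the spider once with each variant and comparing the elapsed times should show whether the throttling is what holds it back.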
There are other things you may try to speed up your spider, such as using a local DNS cache:
- Speed up web scraper
And, as a "running Scrapy from script" reference topic, please see this post.
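On the DNS point, Scrapy also has its own in-memory DNS cache and a thread pool that is used for DNS resolution, both controlled through settings. The sketch below shows how those settings might be merged into the existing ones. The merge helper is hypothetical, and the thread pool size of 20 is an assumed value rather than a tested one. Since all the requests here go to a single host, the gain from DNS tuning is probably small.

```python
# DNS-related Scrapy settings. The first two are shown at their defaults.
DNS_SETTINGS = {
    'DNSCACHE_ENABLED': True,          # in-memory DNS cache, on by default
    'DNSCACHE_SIZE': 10000,            # default cache size
    'REACTOR_THREADPOOL_MAXSIZE': 20,  # assumption; the default is 10
}


def merge_settings(base, extra):
    """Return a new dict with `extra` applied on top of `base`."""
    merged = dict(base)
    merged.update(extra)
    return merged


# Example: combine with the settings already used in the script.
settings = merge_settings(
    {'LOG_LEVEL': 'INFO', 'AUTOTHROTTLE_ENABLED': False},
    DNS_SETTINGS,
)
# Usage (requires Scrapy):
#   process = CrawlerProcess(settings)
```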
Context
StackExchange Code Review Q#156188, answer score: 3