Check the stock of references list

Submitted by: @import:stackexchange-codereview

Problem

I wrote this spider to check the stock for a list of references (around 3,500 references).

Currently the spider takes around 37 seconds to scrape 400 references. CPU usage is around 5% and the network card (1 Gbps) is at around 18%. My internet connection is 300 Mbps symmetric, and this computer is the only device connected to it.
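
For scale, the figures quoted above can be turned into a request rate and a projected time for the full list. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it assumes the rate measured on 400 references stays constant over all 3,500.

```python
def requests_per_second(count, seconds):
    """Average throughput over a measured run."""
    return count / seconds


def projected_seconds(total, count, seconds):
    """Time to process `total` items at the measured rate."""
    return total / requests_per_second(count, seconds)


rate = requests_per_second(400, 37)          # roughly 10.8 requests per second
full_run = projected_seconds(3500, 400, 37)  # roughly 324 seconds, about 5.4 minutes
```

With CPU and network both far from saturated, a rate of this size suggests the crawl is waiting on delays or on per-request latency rather than on local resources.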

Any ideas on how to improve performance? Is this good performance? Could the ISP router be a bottleneck?

import scrapy
from scrapy.crawler import CrawlerProcess

class ktmSpider(scrapy.Spider):
    name = "Spider"

    start_urls = ['URLS']

    def __init__(self, references=None, *args, **kwargs):
        super(ktmSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'UserName': 'username', 'Password': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        ref = references.pop()
        yield scrapy.Request(url="url" + ref, callback=self.parse_stock)

    def parse_stock(self, response):
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[2]/text()').extract_first() + ',')
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[2]/text()').extract_first() + ',')
        self.f.write(response.selector.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[1]/i/@style').extract_first() + '\n')
        
        while len(references) > 0:
            ref = references.pop()
            yield scrapy.Request(url="url" + ref, callback=self.parse_stock)

f = open("references.txt")
references = f.read().splitlines()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': 'INFO',
    'AUTOTHROTTLE_ENABLED': 'True',
    'AUTOTHROTTLE_START_DELAY': '0.1',
    'AUTOTHROTTLE_TARGET_CONCURRENCY': '100'
})

process.crawl(ktmSpider, references=references, stockFile="file.txt")
process.start()

Solution

The AutoThrottle extension may introduce high download delays. Either turn the extension off to see what the resulting time would be, or limit the maximum delay via AUTOTHROTTLE_MAX_DELAY.
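
The two options can be sketched as settings dictionaries to pass to CrawlerProcess. The setting names are real Scrapy settings; the helper names and the 1.0 second cap are assumptions chosen for illustration, not recommended values.

```python
# Settings carried over unchanged from the question.
BASE_SETTINGS = {
    "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "LOG_LEVEL": "INFO",
}


def throttle_off():
    """Option 1: disable AutoThrottle to measure the unthrottled time."""
    settings = dict(BASE_SETTINGS)
    settings["AUTOTHROTTLE_ENABLED"] = False
    return settings


def throttle_capped(max_delay=1.0):
    """Option 2: keep AutoThrottle but cap the delay it may apply."""
    settings = dict(BASE_SETTINGS)
    settings.update({
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 0.1,
        "AUTOTHROTTLE_MAX_DELAY": max_delay,  # assumed cap, in seconds
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 100,
    })
    return settings
```

Either result can be passed directly, for example CrawlerProcess(throttle_off()). Timing the same 400 references under each variant should show whether throttling is the limiting factor. Note that, according to the Scrapy documentation, AutoThrottle also respects CONCURRENT_REQUESTS_PER_DOMAIN, so a target concurrency of 100 may not be reachable unless that limit is raised as well.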

Also, you can issue all the requests from the after_login() method instead of maintaining a queue of references:

def after_login(self, response):
    for ref in references:
        yield scrapy.Request(url="url" + ref, callback=self.parse_stock)


Also, instead of writing to the file from a spider directly, you can use a CSV output pipeline:

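import csv

class CSVWriterPipeline(object):
    def __init__(self):
        # Text mode with newline='' is what the csv module expects on Python 3
        self.writer = csv.writer(open('file.txt', 'w', newline=''))

    def process_item(self, item, spider):
        # Write one CSV row per scraped item
        self.writer.writerow([item["field1"], item["field2"], item["field3"]])
        return item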


Where field1, field2, field3 are your item fields:

from scrapy import Item, Field

class MyItem(Item):
    field1 = Field()
    field2 = Field()
    field3 = Field()


Which should be set in the parse_stock() callback:

def parse_stock(self, response):
    item = MyItem()
    item["field1"] = response.xpath('//*[@id="priceDetails"]/form/div[2]/text()').extract_first()
    item["field2"] = response.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[2]/text()').extract_first()
    item["field3"] = response.xpath('//*[@id="priceDetails"]/form/div[8]/div[1]/span/span[1]/i/@style').extract_first()
    return item


Then, you would need to enable the pipeline:

process = CrawlerProcess({
    'ITEM_PIPELINES': {
         '__main__.CSVWriterPipeline': 100
    },
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': 'INFO',
    'AUTOTHROTTLE_ENABLED': 'True',
    'AUTOTHROTTLE_START_DELAY': '0.1',
    'AUTOTHROTTLE_TARGET_CONCURRENCY': '100'
})
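
The CSVWriterPipeline shown earlier never closes its output file. One possible variation, sketched here under assumptions, uses Scrapy's open_spider() and close_spider() pipeline hooks to tie the file's lifetime to the crawl. The class name, the path parameter, and the demo() helper are hypothetical; demo() calls the hooks by hand with plain dicts so the sketch runs without Scrapy installed.

```python
import csv
import os
import tempfile


class ClosingCSVWriterPipeline:
    """CSV pipeline that opens and closes its file together with the spider."""

    def __init__(self, path="file.txt"):
        # Hypothetical parameter: Scrapy instantiates the class without
        # arguments, in which case the default path is used.
        self.path = path

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings on Python 3.
        self.file = open(self.path, "w", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item["field1"], item["field2"], item["field3"]])
        return item

    def close_spider(self, spider):
        # Flushes buffered rows and releases the file handle.
        self.file.close()


def demo():
    """Drive the hooks manually, in the order Scrapy calls them during a crawl."""
    path = os.path.join(tempfile.mkdtemp(), "stock.csv")
    pipeline = ClosingCSVWriterPipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item(
        {"field1": "REF-1", "field2": "12", "field3": "color: green"},
        spider=None,
    )
    pipeline.close_spider(spider=None)
    with open(path, newline="") as handle:
        return list(csv.reader(handle))
```

In a real crawl the class would be registered in ITEM_PIPELINES in the same way as CSVWriterPipeline, and Scrapy would call the hooks itself.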


Of course, give the "field" and "item" classes more meaningful names.

There are other things you can try to speed up your spider, such as using a local DNS cache:

  • Speed up web scraper



And for a reference on running Scrapy from a script, please see this post.

Context

StackExchange Code Review Q#156188, answer score: 3
