HiveBrain v1.2.0
pattern · python · Minor

Multithreaded web scraper with proxy and user agent switching

Submitted by: @import:stackexchange-codereview

Problem

I am trying to improve the performance of my scraper and plug up any possible security leaks (identifying information being revealed).

Ideally, I would like to achieve a performance of 10 pages per second. What would I need to do to achieve the biggest performance boost, besides getting a faster connection / dedicated server? What could be improved?

PS: I am only using eBay.com as an example here. The production version of the scraper will obey robots.txt, avoid peak traffic hours, and be throttled so that I am not effectively DDoSing the site.

```
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.getLogger("requests").setLevel(logging.WARNING)

# NOTE: The next two sections are for demo purposes only, they will be imported from modules
# this will be stored in proxies.py module
from random import choice

proxies = [
    {'host': '1.2.3.4', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
    {'host': '2.3.4.5', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
]

def check_proxy(session, proxy_host):
    response = session.get('http://canihazip.com/s')
    returned_ip = response.text
    if returned_ip != proxy_host:
        # StandardError existed only in Python 2; RuntimeError works on both.
        raise RuntimeError('Proxy check failed: {} not used while requesting'.format(proxy_host))

def random_proxy():
    return choice(proxies)
# / end of proxies.py

# this will be stored in user_agents.py module
from random import choice

user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
]

def random_user_agent():
    return choice(user_agents)
# / end of user_agents.py

def scrape_results_page(url):
    proxy = random_proxy()  # will be proxies.random_proxy()
    sess  # <- the rest of the snippet was cut off when this entry was imported
```

Solution

Performance

Each call to scrape_results_page also calls check_proxy, so the same proxies get re-checked over and over, and I'm wondering if there's a reason for re-checking them. If not, you can save time by checking the list of proxies once at the beginning of the program and removing the check from scrape_results_page.
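A minimal sketch of that one-time check, reusing the proxy-dict shape and the canihazip.com endpoint from the question (the `verify_proxies` name and the per-proxy `requests.Session` wiring are illustrative, not from the original code):

```python
import requests

proxies = [
    {'host': '1.2.3.4', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
    {'host': '2.3.4.5', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
]

def verify_proxies(proxy_list):
    """Return only the proxies that actually route our traffic."""
    good = []
    for proxy in proxy_list:
        url = 'http://{username}:{password}@{host}:{port}'.format(**proxy)
        session = requests.Session()
        session.proxies = {'http': url, 'https': url}
        try:
            returned_ip = session.get('http://canihazip.com/s', timeout=5).text.strip()
        except requests.RequestException:
            continue  # unreachable or misconfigured proxy: drop it
        if returned_ip == proxy['host']:
            good.append(proxy)
    return good

# Run this once at startup, e.g.:
#   verified_proxies = verify_proxies(proxies)
# scrape_results_page can then draw from the verified list and skip check_proxy.
```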

Once you have a set of verified proxies, it would be better to use them round-robin rather than selecting them at random, to balance the load across them. Admittedly, this tip may not make any visible difference whatsoever.
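One way to sketch round-robin selection is `itertools.cycle` guarded by a lock, so that the pool's worker threads can share the iterator safely (the `next_proxy` helper and the string stand-ins for proxy dicts are made up for illustration):

```python
import itertools
import threading

# Stand-ins for the verified proxy dicts from the question.
proxies = ['proxy-a', 'proxy-b', 'proxy-c']

# itertools.cycle yields the proxies in order, forever; the lock makes
# advancing the shared iterator safe from multiple scraper threads.
_proxy_cycle = itertools.cycle(proxies)
_proxy_lock = threading.Lock()

def next_proxy():
    with _proxy_lock:
        return next(_proxy_cycle)
```

Each worker then calls `next_proxy()` instead of `random_proxy()`, so consecutive requests spread evenly over the proxy list.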

Other improvements

There are a lot of parameters buried in the implementation, for example the eBay URL, canihazip.com, the number of threads, and the page numbers, and possibly others. It would be better to define such values at the top of the file, in variables with descriptive names, all upper-cased to follow the convention for "constants".
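For example, the buried values could be hoisted like this (the specific names and values are illustrative; the question's full URL and thread count are not shown in the snippet above):

```python
# Module-level "constants" gathered in one place, per the advice above.
EBAY_SEARCH_URL = 'http://www.ebay.com/sch/i.html?_nkw=cats&_pgn={page}'  # hypothetical
IP_CHECK_URL = 'http://canihazip.com/s'
THREAD_COUNT = 8    # illustrative value
PAGE_COUNT = 10     # illustrative value

def results_page_urls():
    """Build the list of result-page URLs to feed to the thread pool."""
    return [EBAY_SEARCH_URL.format(page=n) for n in range(1, PAGE_COUNT + 1)]
```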

Context

StackExchange Code Review Q#107087, answer score: 5
