Multithreaded web scraper with proxy and user agent switching
Problem
I am trying to improve the performance of my scraper and plug any possible security leaks (identifying information being revealed).
Ideally, I would like to achieve a performance of 10 pages per second. What would I need to do to achieve the biggest performance boost, besides getting a faster connection / dedicated server? What could be improved?
PS: I am only using eBay.com as an example here. The production version of the scraper will obey robots.txt, avoid peak traffic hours, and be throttled so I am not effectively DDoSing the site.
```
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import logging

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.getLogger("requests").setLevel(logging.WARNING)

# NOTE: The next two sections are for demo purposes only, they will be imported from modules

# this will be stored in proxies.py module
from random import choice

proxies = [
    {'host': '1.2.3.4', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
    {'host': '2.3.4.5', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
]

def check_proxy(session, proxy_host):
    response = session.get('http://canihazip.com/s')
    returned_ip = response.text
    if returned_ip != proxy_host:
        # RuntimeError replaces the Python 2-only StandardError
        raise RuntimeError('Proxy check failed: {} not used while requesting'.format(proxy_host))

def random_proxy():
    return choice(proxies)
# / end of proxies.py

# this will be stored in user_agents.py module
from random import choice

user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
]

def random_user_agent():
    return choice(user_agents)
# / end of user_agents.py

def scrape_results_page(url):
    proxy = random_proxy()  # will be proxies.random_proxy()
    # (the snippet was truncated here at "sess"; the rest of this function
    # and the driver below are a plausible completion based on the helpers
    # defined above and the eBay example mentioned in the question)
    session = requests.Session()
    session.headers.update({'User-Agent': random_user_agent()})  # will be user_agents.random_user_agent()
    session.proxies = {'http': 'http://{username}:{password}@{host}:{port}'.format(**proxy)}
    check_proxy(session, proxy['host'])  # will be proxies.check_proxy(...)
    response = session.get(url)
    return BeautifulSoup(response.text, 'html.parser')

urls = ['http://www.ebay.com/sch/i.html?_nkw=shoes&_pgn={}'.format(page)
        for page in range(1, 11)]  # example eBay search-results pages
pool = ThreadPool(8)
results = pool.map(scrape_results_page, urls)
pool.close()
pool.join()
logging.info('Scraped %d pages', len(results))
```
Solution
Performance
Each call to `scrape_results_page` will also call `check_proxy`.
As such, `check_proxy` will get called for the same proxies multiple times,
and I'm wondering if there's a reason for re-checking the proxies.
If not, then you can save time by checking the list of proxies once at the beginning of the program,
and removing the check from `scrape_results_page`, as in the sketch below.
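A minimal sketch of that reordering, reusing `check_proxy`, `proxies`, and `logging` from the question; `build_session` is a hypothetical helper that mirrors the session setup in `scrape_results_page`:
```
import requests

def build_session(proxy):
    # hypothetical helper: same proxy wiring as scrape_results_page
    session = requests.Session()
    session.proxies = {'http': 'http://{username}:{password}@{host}:{port}'.format(**proxy)}
    return session

def verified_proxies(candidates):
    # keep only the proxies that pass check_proxy; run this once at startup
    good = []
    for proxy in candidates:
        try:
            check_proxy(build_session(proxy), proxy['host'])
            good.append(proxy)
        except (RuntimeError, requests.RequestException):
            logging.warning('Dropping proxy %s: verification failed', proxy['host'])
    return good

proxies = verified_proxies(proxies)  # before the thread pool starts
```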
Once you have a set of verified proxies, instead of selecting them randomly it would be better to use them round-robin style, to balance the load.
Admittedly this tip may not make any visible difference whatsoever.
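One way to get round-robin selection is `itertools.cycle`; a sketch, assuming the verified `proxies` list from above (the lock matters because the worker threads in the pool would otherwise advance the iterator concurrently):
```
import itertools
import threading

proxy_cycle = itertools.cycle(proxies)
proxy_lock = threading.Lock()

def next_proxy():
    # drop-in replacement for random_proxy(): hands proxies out in order,
    # so each proxy carries an equal share of the requests
    with proxy_lock:
        return next(proxy_cycle)
```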
Other improvements
There are a lot of parameters buried in the implementation,
for example the eBay URL, canihazip.com, the number of threads and page numbers, and possibly others.
It would be better to define such values at the top,
in variables with descriptive names, all upper-cased to follow the convention for "constants".
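For instance (the names and values here are illustrative, not taken from the original code):
```
# illustrative module-level constants; values would come from the real scraper
SEARCH_URL_TEMPLATE = 'http://www.ebay.com/sch/i.html?_nkw={query}&_pgn={page}'
IP_CHECK_URL = 'http://canihazip.com/s'
THREAD_COUNT = 8
PAGE_NUMBERS = range(1, 11)
```
Functions such as `check_proxy` would then reference `IP_CHECK_URL` instead of a hard-coded string.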
Context
StackExchange Code Review Q#107087, answer score: 5