Web scraper running extremely slow
Problem
I am making my first web scraper in Python. It works great, but it runs extremely slowly. The website loads in about 10 ms, but the scraper only processes about one record every couple of seconds. There are about 4-6 million records I need to scrape through. Any ideas?
```
from bs4 import BeautifulSoup
import requests
import json
import re
import urllib
import threading
prox = {"http" : "127.0.0.1:8888", "https" : "127.0.0.1:8888"}
def GetVS(Soup):
    return Soup.find('input', {'name' : '__VIEWSTATE'})['value']
def GetEV(Soup):
    return Soup.find('input', {'name' : '__EVENTVALIDATION'})['value']
def GetSearch(Viewstate, Eventvalidation):
    return requests.post('website',
        data="__EVENTTARGET=ctl00%24cpMain%24ctl01%24rblSearchType%241&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE="+urllib.quote(Viewstate, '')+"&__EVENTVALIDATION="+urllib.quote(Eventvalidation, '')+"&ctl00%24txtsearch=&ctl00%24rdoSearch=rdoSite&ctl00%24cpMain%24ctl01%24rblSearchType=PropertyID&ctl00%24cpMain%24ctl01%24txtOwner=",
        verify=False,
        headers={"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36", "Referer" : "https://nevadatreasurer.gov/UPSearch/", "Content-Type" : "application/x-www-form-urlencoded"})
def PropertySearch(PropertyID, Viewstate, Eventvalidation):
    return requests.post('website',
        data="__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE="+urllib.quote(Viewstate, '')+"&__EVENTVALIDATION="+urllib.quote(Eventvalidation, '')+"&ctl00%24txtsearch=&ctl00%24rdoSearch=rdoSite&ctl00%24cpMain%24ctl01%24rblSearchType=PropertyID&ctl00%24cpMain%24ctl01%24txtPropertyID="+urllib.quote(PropertyID, '')+"&ctl00%24cpMain%24ctl01%24btnSearch=Click+Here+to+Search+for+Property",
        verify=False,
        headers={"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36", "Referer" : "https://nevadatreasurer.gov/UPSearch/", "Content-Ty
```
Solution
There are a couple of strange things about this piece of code:
```
def dowork(start):
    while start < 4000000:
        start = start + 1
        # ...
        regex = re.compile("Over \$100.*__doPostBack.*Select\$(.*)\&")
        r = regex.findall(Request.text)
        for i in r:
            # ...

threads = []
for i in range(50):
    t = threading.Thread(target=dowork, args=(i*100000+1000000,))
```
First of all, you don't need to compile the regex in every iteration, and not even in every thread. It seems this can be a global constant, compiled only once.
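As a rough sketch of what "compiled only once" could look like, the pattern can live in a module-level constant. The names `SELECT_PATTERN` and `extract_ids` and the sample text are made up for illustration; the pattern itself is the one from the question, written as a raw string.

```python
import re

# Compiled once at import time and reused by every call.
# Same pattern as in the question, written as a raw string.
SELECT_PATTERN = re.compile(r"Over \$100.*__doPostBack.*Select\$(.*)&")


def extract_ids(page_text):
    """Return every captured group found in page_text (hypothetical helper)."""
    return SELECT_PATTERN.findall(page_text)


# Made-up sample line, only to show the call; real pages will differ.
sample = "Over $100 <a href=\"javascript:__doPostBack('grid','Select$7&x')\">"
print(extract_ids(sample))
```

Because the `re` module caches recently compiled patterns, the time saved is probably small; the main gain is that the pattern is defined in one obvious place.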
The threads run `dowork` with a different `start` parameter: 1m, 1.1m, 1.2m, ..., 5.8m, 5.9m. The smaller problem is that `dowork` only runs until 4m, so threads 30~49 will do nothing. The big problem is that they all run until 4m. I think you really meant this instead:
```
def dowork(start0, maxcnt):
    counter = 0
    while counter < maxcnt:
        counter += 1
        start = str(start0 + counter)
        # ...
```
This has some other improvements as well:
- `counter += 1` is simpler than `counter = counter + 1`
- Convert `start` to string once, reuse multiple times within the function
- `maxcnt` is a parameter instead of hardcoded 10**5, because the caller controls the `start0` parameter, and the two are closely related
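For illustration, here is one way the caller might pair `start0` with `maxcnt` so that each thread gets its own disjoint block of IDs. This is only a sketch under assumptions: `process_id` and `run_all` are hypothetical names, `process_id` stands in for the real request and parsing, and the numbers are scaled down so the example runs quickly.

```python
import threading

seen = []                      # collected IDs, only for demonstration
seen_lock = threading.Lock()   # guards `seen` across threads


def process_id(property_id):
    """Hypothetical stand-in for the real request and parsing."""
    with seen_lock:
        seen.append(property_id)


def dowork(start0, maxcnt):
    # Handle IDs start0+1 .. start0+maxcnt, as in the loop above.
    counter = 0
    while counter < maxcnt:
        counter += 1
        start = str(start0 + counter)
        process_id(start)


def run_all(first_id, block_size, thread_count):
    threads = []
    for i in range(thread_count):
        # Each thread starts where the previous block ends.
        t = threading.Thread(target=dowork,
                             args=(first_id + i * block_size, block_size))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every block to finish


# Scaled-down numbers; the question uses 50 threads and blocks of 100000.
run_all(first_id=1000000, block_size=10, thread_count=5)
print(len(seen), len(set(seen)))
```

Passing the same `block_size` as both the stride and `maxcnt` is what keeps the blocks from overlapping, which is why the two values belong together in the caller.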
Coding style
Please follow PEP8, the official Python coding style guide. Especially,
snake_case is preferred for method names, instead of CamelCase.
Context
StackExchange Code Review Q#62092, answer score: 2