Improved minimal webcrawler - why is it so slow?
Problem
I recently made a webcrawler that I submitted here for a review:
Minimal webcrawler - bad structure and error handling?
With that help, I've made a much cleaner and better(?) webcrawler.
The only problem I have is that this crawler is extremely slow compared to the other one (13.2s for a depth of 5, compared to 1.3s on the old solution).
Why is it so slow?
With cProfile I have gotten this output over the 10 most time-consuming functions. With depth 5, 57 links are crawled or scheduled for crawling, while is_valid_link is called 243 times. Could there be some errors in that method?
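(For reference, a profile like this can be produced with the standard-library profiler; here `main` is a hypothetical name for the crawler's entry point, not a function from the post:)
```
import cProfile

# Run the crawler under the profiler, sorting by cumulative time so the
# most expensive call chains appear first.
cProfile.run('main()', sort='cumulative')
```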
```
from bs4 import BeautifulSoup
import httplib
import urlparse
import urllib2
import time

def sanitize(url, parent):
    if is_internal_link(url):
        if parent.endswith("/") or url.startswith("/"):
            return parent + url
        else:
            return parent + "/" + url
    return url

def is_internal_link(url):
    if url.startswith("http"):
        return False
    return True

def hostname(link):
    hostname = urlparse.urlparse(link).hostname
    return hostname

def get_server_status_code(link):
    # http://stackoverflow.com/questions/1140661
    host, path = urlparse.urlparse(link)[1:3]
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('HEAD', path)
        return conn.getresponse().status
    except StandardError:
        return None

def is_valid_link(link):
    link = sanitize(link, hostname(urlsToCrawl_Parent[0]))
    # http://stackoverflow.com/questions/2924422
    good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
    return get_server_status_code(link) in good_codes

def fetch_webpage(url):
    webpage = urllib2.urlopen(url)
    return webpage

def fetch_links(webpage):
    html = webpage.read()
    soup = BeautifulSoup(html)
    links = soup.find_all("a", href=True)
    return links

def schedule_link(link):
    if is_valid_link(link):
        if hostname(link) in urlsToCrawl_Parent[0]:
            urlsTo  # (the snippet is cut off here in the original post)
```
Solution
Edit:
Analysis of posted profile results
Of all the functions listed in the result, your code accounts for 0.041 seconds, while the cumulative time for the entire execution is 14.169 seconds. That has your code accounting for about 0.289% of the entire execution.
The function that takes the most time is the recv() call on the underlying sockets. This one function dominates the execution time compared to any of the other functions. From this we see that transferring files over the internet is very slow, while evaluating a boolean expression or manipulating a list is very fast.
Based on the analysis, we have learned that your program's bottleneck is the IO over the network. Every time your code reads a new HTML file, nothing else is happening in your application. What you can do to speed this up is to execute code in parallel so that you are reading multiple files at one time. You can look into threads to do this type of parallelization. However, by doing this you will also have to synchronize access to the urlsToCrawl and urls_Crawled lists.
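As a sketch of that idea (not code from the original answer): a small pool of worker threads can fetch pages concurrently while a lock guards the shared state. The names fetch_all, worker, and NUM_WORKERS are assumptions for illustration, written against the Python 2 libraries the crawler already uses.
```
import threading
import urllib2
from Queue import Queue

NUM_WORKERS = 4  # assumed pool size; tune to your workload

def worker(url_queue, results, lock):
    while True:
        url = url_queue.get()
        try:
            html = urllib2.urlopen(url).read()
            with lock:               # synchronize access to shared state
                results[url] = html
        except IOError:
            pass                     # skip pages that fail to load
        finally:
            url_queue.task_done()

def fetch_all(urls):
    url_queue = Queue()
    results = {}
    lock = threading.Lock()
    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker, args=(url_queue, results, lock))
        t.daemon = True              # workers die with the main thread
        t.start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()                 # wait until every URL is processed
    return results
```
The same locking discipline would apply to the urlsToCrawl and urls_Crawled lists once scheduling moves into the workers.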
Original:
A review of the code, independent of any performance considerations.
```
def hostname(link):
    hostname = urlparse.urlparse(link).hostname
    return hostname
```
Your variable is hiding the function name. Doing this in a function that is meant to be recursive would lead to an error. Instead, you can return the value directly rather than creating a local variable.
There are a few other places where you create a local variable and reference it only once. Sometimes giving a value a name can increase readability; however, most of the code in this post does not fit into that category.
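For example, hostname collapses to a single return (a straightforward rewrite, not shown in the original answer):
```
def hostname(link):
    # Return the parsed hostname directly; no local variable needed.
    return urlparse.urlparse(link).hostname
```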
```
def is_internal_link(url):
    if url.startswith("http"):
        return False
    return True
```
startswith() already returns a boolean, so you don't need the if statement to return the result explicitly:
```
def is_internal_link(url):
    return not url.startswith("http")
```
Another example of using an if statement to explicitly return a boolean:
```
def havent_visited(link):
    if link not in urls_Crawled and link not in urlsToCrawl_Parent and link not in urlsToCrawl_Child:
        return True
    else:
        return False
```
I point this out because it is done in a different style. This one explicitly uses an else, whereas the first case used the implicit else, since the true clause always exits the function. Using a standard convention throughout the code base will improve readability. In this case one style is not better than the other, but you should be consistent.
```
crawl(urlsToCrawl_Parent[0])
urls_Crawled.append(urlsToCrawl_Parent[0])
urlsToCrawl_Parent.pop(0)
```
is equivalent to
```
target = urlsToCrawl_Parent.pop(0)
crawl(target)
urls_Crawled.append(target)
```
The second version is cleaner because the variable name gives you context. You also don't have to worry about indexing the list multiple times.
The elif block also executes this same operation. You can move this repeated code into a function (a combined sketch follows the next point).
```
while (len(urls_Crawled) < 5):
```
5 is a magic number here. There is no context for why a number less than 5 is important. If there is another 5 somewhere else in the code, does it mean the same thing? Giving the constant a name provides this context. Also, if you need to change it later, you will only have to change the code in one place, and you won't have to worry about that other 5 that had a different context.
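A combined sketch of those two suggestions, written against the names in the question's code; crawl_next and MAX_PAGES are hypothetical names, not from the original answer:
```
MAX_PAGES = 5  # a named constant instead of the bare 5

def crawl_next(queue):
    # The pop/crawl/append sequence that currently appears in both
    # the if and the elif branch, pulled into one helper.
    target = queue.pop(0)
    crawl(target)
    urls_Crawled.append(target)

while len(urls_Crawled) < MAX_PAGES:
    crawl_next(urlsToCrawl_Parent)
```
Now the loop reads as a sentence, and the stopping condition has a name you can search for.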
Context
StackExchange Code Review Q#54771, answer score: 2