HiveBrain v1.2.0

Improved minimal webcrawler - why is it so slow?

Submitted by: @import:stackexchange-codereview

Problem

I recently made a webcrawler that I submitted here for a review:

Minimal webcrawler - bad structure and error handling?

With that help, I've made a much cleaner and better(?) webcrawler.

The only problem I have is that this crawler is extremely slow compared to the other one (13.2s for a depth of 5, compared to 1.3s on the old solution).

Why is it so slow?
With cProfile I got this output for the 10 most time-consuming functions. With depth 5, 57 links are crawled or scheduled for crawling, while is_valid_link is called 243 times. Could there be an error in that method?

```
from bs4 import BeautifulSoup
import httplib
import urlparse
import urllib2
import time

def sanitize(url, parent):
    if is_internal_link(url):
        if parent.endswith("/") or url.startswith("/"):
            return parent + url
        else:
            return parent + "/" + url
    return url

def is_internal_link(url):
    if url.startswith("http"):
        return False
    return True

def hostname(link):
    hostname = urlparse.urlparse(link).hostname
    return hostname

def get_server_status_code(link):
    # http://stackoverflow.com/questions/1140661
    host, path = urlparse.urlparse(link)[1:3]
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('HEAD', path)
        return conn.getresponse().status
    except StandardError:
        return None

def is_valid_link(link):
    link = sanitize(link, hostname(urlsToCrawl_Parent[0]))
    # http://stackoverflow.com/questions/2924422
    good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
    return get_server_status_code(link) in good_codes

def fetch_webpage(url):
    webpage = urllib2.urlopen(url)
    return webpage

def fetch_links(webpage):
    html = webpage.read()
    soup = BeautifulSoup(html)
    links = soup.find_all("a", href=True)
    return links

def schedule_link(link):
    if is_valid_link(link):
        if hostname(link) in urlsToCrawl_Parent[0]:
            urlsTo
```

Solution

Edit:

Analysis of posted profile results

Of all the functions listed in the result, your code accounts for 0.041 seconds. However, the cumulative time for the entire execution is 14.169 seconds. That puts your own code at 0.289% of the entire execution.
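For reference, a top-10 listing like the one being analyzed can be reproduced with the standard cProfile and pstats modules; the profiled call below is just a stand-in for the crawler's entry point:

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(100000))  # stand-in for the crawl() call being profiled
profiler.disable()

# Sort by cumulative time and print the 10 most time-consuming functions.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The "cumulative" column includes time spent in callees, which is how the network reads inside recv() show up at the top even though the crawler's own functions are cheap.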

The function that takes the most time is the recv() calls on the underlying sockets. This one function dominates the execution time compared to all the others. From this we see that transferring files over the internet is very slow; evaluating a boolean expression or manipulating a list is very fast.

Based on the analysis, we have learned that your program's bottleneck is the IO over the network. Every time your code reads a new HTML file, nothing else is happening in your application. What you can do to speed this up is to execute code in parallel so that you are reading multiple files at one time. You can look into threads for this kind of parallelization. However, by doing this you will also have to synchronize access to the ToCrawl and Crawled lists.
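A minimal sketch of that pattern (Python 3 here; the URL list, worker count, and fetch() stand-in are placeholders, and in the real crawler the shared lists would be urlsToCrawl_Parent and urls_Crawled):

```python
import threading

# Placeholder work queue and results list shared between worker threads.
urls_to_crawl = ["/a", "/b", "/c", "/d"]
urls_crawled = []
lock = threading.Lock()  # synchronizes access to both shared lists

def fetch(url):
    return "<html>%s</html>" % url  # stand-in for the slow network read

def worker():
    while True:
        with lock:               # only the list manipulation is locked
            if not urls_to_crawl:
                return
            url = urls_to_crawl.pop(0)
        fetch(url)               # the slow IO runs concurrently across threads
        with lock:
            urls_crawled.append(url)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that the fetch happens outside the lock: the point is to overlap the network waits, while the lock only protects the brief list updates.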

Original:

A review of the code independent of any performance considerations.

def hostname(link):
    hostname = urlparse.urlparse(link).hostname
    return hostname


Your local variable is hiding the function name. In a function that is meant to be recursive, this shadowing would lead to an error. Instead, return the value directly rather than creating a local variable.

There are a few other places where you create a local variable and only reference it once. Sometimes giving a value a name increases readability; however, most of the code in this post does not fit into that category.
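The direct-return version looks like this (written here against Python 3's urllib.parse; the Python 2 urlparse module behaves the same way):

```python
from urllib.parse import urlparse  # Python 3 home of the urlparse function

def hostname(link):
    # Return the result directly; no local variable shadowing the function name.
    return urlparse(link).hostname
```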

def is_internal_link(url):
    if url.startswith("http"):
        return False
    return True


startswith() already returns a boolean. You don't need an if statement to explicitly return the result.

def is_internal_link(url):
    return not url.startswith("http")


Another example of using an if statement to explicitly return a boolean.

def havent_visited(link):
    if link not in urls_Crawled and link not in urlsToCrawl_Parent and link not in urlsToCrawl_Child:
        return True
    else:
        return False


I point this out because it is done in a different style. This version explicitly uses an else, whereas the first case used the implicit else, since the true clause always exits the function. Using a standard convention throughout the code base will improve readability. In this case, neither style is better, but you should be consistent.
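Applying the same direct-return simplification to havent_visited (the three lists are stubbed here only to make the snippet self-contained; in the post they are module-level globals):

```python
# Stub lists standing in for the crawler's global state.
urls_Crawled = ["http://a.example/1"]
urlsToCrawl_Parent = ["http://a.example/2"]
urlsToCrawl_Child = []

def havent_visited(link):
    # The comparison already yields the boolean; no if/else needed.
    return (link not in urls_Crawled
            and link not in urlsToCrawl_Parent
            and link not in urlsToCrawl_Child)
```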

crawl(urlsToCrawl_Parent[0])
urls_Crawled.append(urlsToCrawl_Parent[0])
urlsToCrawl_Parent.pop(0)


is equivalent to

target = urlsToCrawl_Parent.pop(0)
crawl(target)
urls_Crawled.append(target)


The second version is cleaner because the variable name gives you context. You also don't have to worry about indexing the list multiple times.

The elif block also executes this same operation. You can move this repeated code into a function.
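One way to extract it might look like this; the helper name crawl_next is my own suggestion, and crawl() is stubbed so the snippet runs on its own:

```python
urls_Crawled = []
visited = []

def crawl(url):
    visited.append(url)  # stand-in for the real crawl()

def crawl_next(queue):
    # The pop/crawl/append sequence shared by both branches.
    target = queue.pop(0)
    crawl(target)
    urls_Crawled.append(target)

queue = ["http://a.example", "http://b.example"]
crawl_next(queue)
```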

while (len(urls_Crawled) < 5):


5 is a magic number here. There is no context for why a count below 5 matters. If there is another 5 somewhere else in the code, does it mean the same thing? Giving the constant a name provides that context. Also, if you need to change it later, you only have to change the code in one place, and you won't have to worry about some other 5 that has a different meaning.
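For example (the constant name is only a suggestion, and the loop body is a stand-in for the real crawl step):

```python
MAX_URLS_TO_CRAWL = 5  # the crawl budget; the name documents what 5 means

urls_Crawled = []
while len(urls_Crawled) < MAX_URLS_TO_CRAWL:
    urls_Crawled.append("url-%d" % len(urls_Crawled))  # stand-in for one crawl
```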


Context

StackExchange Code Review Q#54771, answer score: 2
