A simple little Python web crawler
Problem
The crawler needs a mechanism that dispatches threads based on network latency and system load. How does one keep track of network latency in Python without using system tools like ping?
import sys
import re
import urllib2
import urlparse
import requests
import socket
import threading
import gevent
from gevent import monkey
import time

monkey.patch_all(
    socket=True,
    dns=True,
    time=True,
    select=True,
    thread=True,
    os=True,
    ssl=True,
    httplib=False,
    subprocess=False,
    sys=False,
    aggressive=True,
    Event=False)

# The stack
tocrawl = set([sys.argv[1]])
crawled = set([])
linkregex = re.compile('')

def Update(links):
    if links != None:
        for link in (links.pop(0) for _ in xrange(len(links))):
            link = ( "http://%s" %(urlparse.urlparse(link).netloc) )
            if link not in crawled:
                tocrawl.add(link)

def getLinks(crawling):
    crawled.add(crawling)
    try:
        Update(linkregex.findall(requests.get(crawling).content))
    except:
        return None

def crawl():
    try:
        print "%d Threads running" % (threading.activeCount())
        crawling = tocrawl.pop()
        print crawling
        print len(crawled)
        walk = gevent.spawn(getLinks,crawling)
        walk.run()
    except:
        quit()

def dispatcher():
    while True:
        T = threading.Thread(target=crawl)
        T.start()
        time.sleep(1)

dispatcher()
Solution
I see a flurry of downloading activity, but I don't see that you do anything with the pages that you download except parse some URLs for more downloading. There's no rate limiting or any attempt to check robots.txt, making your web crawler a poor Internet citizen.
PEP 8 recommends four spaces per level of indentation. Since whitespace is significant in Python, you should stick to the convention. Furthermore, function names should be lower_case(), so Update() and getLinks() should be renamed.
A simple call to gevent.monkey.patch_all() will do (provided the gevent.monkey submodule has been imported). There is no need to list all of the keyword parameters, since you're essentially accepting the defaults.
Your linkregex fails if the <a> tag contains any intervening attributes before href, so any link written that way will be skipped.
I don't believe that your code is a well-behaved multithreaded program. For one thing, you indiscriminately spawn and start one thread per second. If the average processing time per request exceeds one second, you'll end up with an uncontrolled proliferation of threads.
Another issue is that you add() and pop() tocrawl elements without any kind of locking. Also, if one thread fails to pop() anything (probably when tocrawl becomes empty), you rudely call quit() without giving other threads a chance to finish what they are doing.
Finally, you process the URLs using what you call a stack (it is actually a set, whose pop() returns an arbitrary element). Web crawling is usually done using a queue, to avoid processing clusters of closely related URLs together and concentrating the load on one unfortunate webserver at a time.
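To make the robots.txt and rate-limiting point concrete, here is a minimal sketch using the standard library's urllib.robotparser. It is written for Python 3, whereas your code is Python 2. The USER_AGENT string, the build_robot_rules() helper and the HostRateLimiter class are hypothetical names of mine, and the one-second default delay is an arbitrary assumption rather than a recommended value.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user-agent name


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class HostRateLimiter:
    """Allow at most one request per host every `min_delay` seconds."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_request = {}  # host -> time of the most recent request

    def seconds_to_wait(self, url):
        """Return how long the caller should sleep before fetching `url`."""
        last = self.last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Remember that a request to this URL's host was just made."""
        self.last_request[urlparse(url).netloc] = self.clock()
```

Before each download you would check rules.can_fetch(USER_AGENT, url), sleep for limiter.seconds_to_wait(url), then call limiter.record(url). The limiter is not thread-safe as written; with several workers it would need a lock around the dictionary.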
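One way to sidestep the attribute-order problem with linkregex is to let an HTML parser find the href values instead of a regular expression. The following is a sketch in Python 3 using the standard library's html.parser; LinkCollector and extract_links() are names I made up for illustration, and this is not a drop-in replacement for your Update() logic.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, regardless of attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return every href found in the <a> tags of an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

The returned values may be relative URLs, so they would still need to be resolved against the page URL (for example with urllib.parse.urljoin) before being queued.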
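The threading concerns above (unbounded thread creation, unlocked shared state, abrupt quit(), stack versus queue) can all be addressed with a fixed pool of workers fed by a thread-safe FIFO queue. Here is a rough Python 3 sketch. The crawl_all() function, its parameters and the fetch_links callback are hypothetical; fetch_links stands in for whatever downloads a page and returns its links.

```python
import queue
import threading


def crawl_all(seed_urls, fetch_links, num_workers=4, max_pages=100):
    """Crawl breadth-first with a fixed number of worker threads.

    `fetch_links(url)` is supplied by the caller and returns an iterable
    of URLs found at `url`. Returns the set of URLs that were queued.
    """
    frontier = queue.Queue()       # thread-safe FIFO queue of URLs to visit
    seen = set()
    seen_lock = threading.Lock()   # guards `seen`

    def enqueue(url):
        with seen_lock:
            if url in seen or len(seen) >= max_pages:
                return
            seen.add(url)
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:        # sentinel: time to shut down
                frontier.task_done()
                return
            try:
                for link in fetch_links(url):
                    enqueue(link)
            except Exception:
                pass               # one failed page should not kill the worker
            finally:
                frontier.task_done()

    for url in seed_urls:
        enqueue(url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    frontier.join()                # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)         # one sentinel per worker
    for thread in threads:
        thread.join()
    return seen
```

The number of threads never exceeds num_workers, and the workers exit cleanly once the queue drains instead of one of them calling quit(). In real use you would probably log the exception rather than discard it.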
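As for the original question of tracking latency without ping, one possibility is to time the requests the crawler already makes and keep a running average. This is only a sketch in Python 3: LatencyTracker is a hypothetical helper, the smoothing factor of 0.2 is an arbitrary assumption, and what it measures is total call time (connection, server processing and download), not pure network round-trip time.

```python
import time


class LatencyTracker:
    """Keep an exponentially weighted moving average of call durations."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha     # weight given to the newest sample
        self.average = None    # seconds; None until the first sample

    def add_sample(self, seconds):
        """Fold one measurement into the average and return the new average."""
        if self.average is None:
            self.average = seconds
        else:
            self.average = self.alpha * seconds + (1 - self.alpha) * self.average
        return self.average

    def timed(self, func, *args, **kwargs):
        """Call `func`, record how long it took, and return its result."""
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            self.add_sample(time.monotonic() - start)
```

With requests, this could be used as tracker.timed(requests.get, url, timeout=10), and a dispatcher could consult tracker.average when deciding whether to hand out more work. The Response object returned by requests also exposes an elapsed attribute, which may be enough on its own.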
Context
StackExchange Code Review Q#46993, answer score: 8