A simple little Python web crawler

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

The crawler needs a mechanism that dispatches threads based on network latency and system load. How does one keep track of network latency in Python without using system tools like ping?

import sys
import re
import urllib2
import urlparse
import requests
import socket
import threading
import gevent
from gevent import monkey
import time

monkey.patch_all(
  socket=True,
  dns=True,
  time=True,
  select=True,
  thread=True,
  os=True,
  ssl=True,
  httplib=False,
  subprocess=False,
  sys=False,
  aggressive=True,
  Event=False)

# The  stack
tocrawl = set([sys.argv[1]])
crawled = set([])
linkregex = re.compile('')

def Update(links):
  if links != None:
    for link in (links.pop(0) for _ in xrange(len(links))):
      link = ( "http://%s" %(urlparse.urlparse(link).netloc) )
      if link not in crawled:
        tocrawl.add(link)

def getLinks(crawling):
  crawled.add(crawling)
  try:
    Update(linkregex.findall(requests.get(crawling).content))
  except:
    return None

def crawl():
  try:
    print"%d Threads running" % (threading.activeCount())
    crawling = tocrawl.pop()
    print crawling
    print len(crawled)
    walk = gevent.spawn(getLinks,crawling)
    walk.run()
  except:quit()

def dispatcher():
  while True:
    T = threading.Thread(target=crawl)
    T.start()
    time.sleep(1)

dispatcher()

Solution

I see a flurry of downloading activity, but I don't see that you do anything with the pages that you download except parse some URLs for more downloading. There's no rate limiting or any attempt to check robots.txt, making your web crawler a poor Internet citizen.
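As a rough illustration of the two missing politeness checks, here is a minimal sketch using only the standard library. It is written for Python 3 rather than the Python 2 of your code, and build_robot_rules, HostThrottle, the user agent string, and the one-second delay are all hypothetical names and values chosen for the example, not something your crawler already has.

```python
import time
from urllib import robotparser


def build_robot_rules(robots_txt_lines):
    """Parse the lines of a robots.txt file that has already been downloaded.

    In a live crawler you would more likely call set_url() and read() on the
    parser; parse() is used here so the sketch runs without network access.
    """
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


class HostThrottle:
    """Enforce a minimum delay between two requests to the same host.

    Not thread-safe as written; guard wait() with a lock if several
    threads share one instance.
    """

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        last = self._last_request.get(host)
        if last is not None:
            remaining = self._min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()


# Hypothetical usage: check permission, then pace requests per host.
rules = build_robot_rules(["User-agent: *", "Disallow: /private/"])
throttle = HostThrottle(min_delay=1.0)
if rules.can_fetch("little-crawler", "http://example.com/index.html"):
    throttle.wait("example.com")  # the first call for a host returns immediately
```

The clock and sleep parameters exist only so the pacing logic can be exercised without real waiting.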

PEP 8 mandates four spaces per level of indentation. Since whitespace is significant in Python, you should stick to the convention. Furthermore, function names should be lower_case(), so Update() and getLinks() should be renamed.

Just a simple call to gevent.monkey.patch_all() will do. There is no need to from gevent import monkey, nor is there any need to list all of the keyword parameters, since you're accepting all of the defaults.

Your linkregex fails if the <a> tag contains any intervening attributes before href. For example, a tag that puts a class or target attribute ahead of href will cause a link to be skipped.
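One way to sidestep the attribute-order problem is to let an HTML parser find the href for you instead of a regular expression. The following is only a sketch, assuming Python 3's html.parser module; LinkCollector and extract_links are hypothetical names, and the sample markup is made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs in document order.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical input: the first tag has an attribute ahead of href.
sample = '<a class="nav" href="http://example.com/a">A</a> <a href="/b">B</a>'
found = extract_links(sample)
```

Relative links such as /b would still need to be resolved against the page URL before being queued.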

I don't believe that your code is a well-behaved multithreaded program. For one thing, you indiscriminately spawn and start one thread per second. If the average processing time per request exceeds one second, you'll end up with an uncontrolled proliferation of threads.
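A fixed-size pool is one common way to cap the number of threads. The sketch below assumes Python 3's concurrent.futures; crawl_batch is a hypothetical helper, and fetch stands in for whatever function downloads and processes a single URL.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_batch(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL using at most max_workers threads.

    Returns a dict mapping each URL to the value fetch returned for it.
    """
    urls = list(urls)  # iterated twice below, so materialise it once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

However slow the individual requests are, no more than max_workers of them are in flight at once, and leaving the with block waits for the outstanding work to finish.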

Another issue is that you add() and pop() tocrawl elements without any kind of locking. Also, if one thread fails to pop() anything (probably when the tocrawl set becomes empty), you rudely call quit() without giving other threads a chance to finish what they are doing.
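Both problems could be handled roughly as follows. This is a sketch in Python 3, not a drop-in replacement: Frontier, worker, and handle are hypothetical names, and handle is assumed to process one URL and return the links it found.

```python
import threading


class Frontier:
    """Pending URLs plus the set of URLs already seen, guarded by one lock."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._pending = set(seeds)
        self._seen = set(seeds)

    def add(self, url):
        with self._lock:
            if url not in self._seen:
                self._seen.add(url)
                self._pending.add(url)

    def take(self):
        """Return one pending URL, or None when nothing is left."""
        with self._lock:
            return self._pending.pop() if self._pending else None


def worker(frontier, handle):
    while True:
        url = frontier.take()
        if url is None:
            return  # only this worker stops; the others keep running
        for link in handle(url):
            frontier.add(link)
```

A worker that finds the frontier empty simply returns instead of calling quit(), so threads that are still busy can finish. One simplification to be aware of: a worker may exit while another is about to add more links, so a production crawler would want a more careful termination condition.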

Finally, you process the URLs using a stack. Web crawling is usually done using a queue, to avoid processing clusters of closely related URLs together and concentrating the load on one unfortunate webserver at a time.
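To make the difference concrete, here is a small sketch of first-in, first-out processing with collections.deque, in Python 3. crawl_order and links_of are hypothetical names, and the tiny link graph is invented for the example.

```python
from collections import deque


def crawl_order(seed, links_of):
    """Return the order in which URLs are visited with a FIFO queue."""
    queue = deque([seed])
    seen = {seed}
    order = []
    while queue:
        url = queue.popleft()  # oldest entry first; pop() would act as a stack
        order.append(url)
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link graph: "a" links to a deeper page on its own site and to "b".
graph = {"a": ["a/1", "b"], "a/1": ["a/2"], "a/2": [], "b": []}
order = crawl_order("a", lambda url: graph[url])
```

With the queue, b is visited before a/2, so the crawl moves on to other pages before following one site's links further down.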

Context

StackExchange Code Review Q#46993, answer score: 8
