HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Simple recursive web crawler

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
crawlerrecursivesimpleweb

Problem

I made a simple web crawler, I know there's many better ones out there, but I thought rolling my own would be a valuable learning experience.

The problem is that I think there's some things I could improve here. I commented the code best I could to explain what it's doing:

```
import re, random, requests, threading, collections, queue

class Crawler():

def __init__(self):
self.data = set() # this will store our crawled urls, avoiding duplicates
self.terminate = False # flag to end the program
self.lock = threading.Lock()
self.print_queue = queue.Queue() # this is for our prints
self.work = collections.defaultdict(int) # store some data about the work done (number of urls stored) by each worker

def run(self, threads=15, limit=1000, urls=()):
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line
self.urls_max = limit # each thread is killed when we have 1000 urls stored
for i, url in enumerate(urls): # placing the threads, 15 for each url by default
for j in range(threads):
tName = '{}.{}'.format(i+1, j+1)
t = threading.Thread(target=self.producer, args=(url,), name=tName)
t.start()
t = threading.Thread(target=self.print_manager)
t.daemon = True
t.start()
del t

def wait_kill_threads(self): # waits for all the threads are killed
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too
continue

def print_manager(self): # our deamon to print
while True:
msg = self.print_queue.get()
print(msg)
self.print_queue.task_done()

def renew_url(self, list_urls): # choose another url to work with
return random.choice(list_urls)

def worker(self): # get the thread details
return threading.currentThread()

def get

Solution

1.- Yes, you can use semicolon to have more than one sentence in the same line, but..


Compound statements (multiple statements on the same line) are
generally discouraged.

if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line


Source: PEP-8

2.- Wait some milliseconds all those while True: (ie: wait_kill_threads, print_manager, maybe in producer too) your processor will thank you.

3.- Avoid Magic Numbers or Hardcoded numbers. On the last line in producer:

.....threading.active_count()-3))   # -3 because mainthread and deamon thread


a better practice is use a CONSTANT_VALUE at the beginning of the file;

QTY_THREAD_OFFSET = 3

.....threading.active_count() - QTY_THREAD_OFFSET))


(maybe you can think in a better constant name, but you get the idea right?)

The same for:

while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too


using the previous const it will be:

while(threading.active_count() >= QTY_THREAD_OFFSET):


4.- Regex: If you are going to use a regex multiple times a good practice is compile the regex before. Instead of:

re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls


you can do this:

URL_RE_PATTERN = '(?<=href=["\'])https?://.+?(?=["\'])'
...
class Crawler():
    regex_urls = re.compile(URL_RE_PATTERN)

    def __init__(self):
    ...    

...
...
def get_links(self, url):
....
    urls = urls_re.match(regex_urls)


Also you can use BeautifulSoup to get urls.

5.- In verify_and_add you can try to avoid nested if's:

if(not self.terminate):
    if(url not in self.data):
        self.data.add(url) #


It could be replaced by:

if self.terminate or url in self.data:
    return False

self.data.add(url)
...


6.- A little detail:

if(req is not None):


It could be:

if req:


-- Edit:
As Guimo pointed out, the previous sentences are not equal.

So I would suggest to change

def get_links(self, url):
    req = self.get_req(url)
    if(req is not None):
        html = req.text # the html of the page crawled
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
        return urls
    return []


by

def get_links(self, url):
    req = self.get_req(url)
    if req is None:
        return []

    html = req.text # the html of the page crawled
    urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
    return urls


Ie, remove parenthesis and an early return.

7.- Are you removing the visited urls from list_urls after is visited? In renew_url you are getting a random url but I can't see if you are removing it from that list.

8.- Wait some random secs between requesting pages on the same server, you are crawling a website most of them aren't happy with that.

That is what I see in a first approach! Hope it helps !

Code Snippets

if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line
.....threading.active_count()-3))   # -3 because mainthread and deamon thread
QTY_THREAD_OFFSET = 3

.....threading.active_count() - QTY_THREAD_OFFSET))
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too
while(threading.active_count() >= QTY_THREAD_OFFSET):

Context

StackExchange Code Review Q#154687, answer score: 7

Revisions (0)

No revisions yet.