patternpythonMinor
Simple recursive web crawler
Viewed 0 times
crawlerrecursivesimpleweb
Problem
I made a simple web crawler, I know there's many better ones out there, but I thought rolling my own would be a valuable learning experience.
The problem is that I think there's some things I could improve here. I commented the code best I could to explain what it's doing:
```
import re, random, requests, threading, collections, queue
class Crawler():
def __init__(self):
self.data = set() # this will store our crawled urls, avoiding duplicates
self.terminate = False # flag to end the program
self.lock = threading.Lock()
self.print_queue = queue.Queue() # this is for our prints
self.work = collections.defaultdict(int) # store some data about the work done (number of urls stored) by each worker
def run(self, threads=15, limit=1000, urls=()):
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line
self.urls_max = limit # each thread is killed when we have 1000 urls stored
for i, url in enumerate(urls): # placing the threads, 15 for each url by default
for j in range(threads):
tName = '{}.{}'.format(i+1, j+1)
t = threading.Thread(target=self.producer, args=(url,), name=tName)
t.start()
t = threading.Thread(target=self.print_manager)
t.daemon = True
t.start()
del t
def wait_kill_threads(self): # waits for all the threads are killed
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too
continue
def print_manager(self): # our deamon to print
while True:
msg = self.print_queue.get()
print(msg)
self.print_queue.task_done()
def renew_url(self, list_urls): # choose another url to work with
return random.choice(list_urls)
def worker(self): # get the thread details
return threading.currentThread()
def get
The problem is that I think there's some things I could improve here. I commented the code best I could to explain what it's doing:
```
import re, random, requests, threading, collections, queue
class Crawler():
def __init__(self):
self.data = set() # this will store our crawled urls, avoiding duplicates
self.terminate = False # flag to end the program
self.lock = threading.Lock()
self.print_queue = queue.Queue() # this is for our prints
self.work = collections.defaultdict(int) # store some data about the work done (number of urls stored) by each worker
def run(self, threads=15, limit=1000, urls=()):
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line
self.urls_max = limit # each thread is killed when we have 1000 urls stored
for i, url in enumerate(urls): # placing the threads, 15 for each url by default
for j in range(threads):
tName = '{}.{}'.format(i+1, j+1)
t = threading.Thread(target=self.producer, args=(url,), name=tName)
t.start()
t = threading.Thread(target=self.print_manager)
t.daemon = True
t.start()
del t
def wait_kill_threads(self): # waits for all the threads are killed
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon too
continue
def print_manager(self): # our deamon to print
while True:
msg = self.print_queue.get()
print(msg)
self.print_queue.task_done()
def renew_url(self, list_urls): # choose another url to work with
return random.choice(list_urls)
def worker(self): # get the thread details
return threading.currentThread()
def get
Solution
1.- Yes, you can use semicolon to have more than one sentence in the same line, but..
Compound statements (multiple statements on the same line) are
generally discouraged.
Source: PEP-8
2.- Wait some milliseconds all those
3.- Avoid Magic Numbers or Hardcoded numbers. On the last line in
a better practice is use a CONSTANT_VALUE at the beginning of the file;
(maybe you can think in a better constant name, but you get the idea right?)
The same for:
using the previous const it will be:
4.- Regex: If you are going to use a regex multiple times a good practice is compile the regex before. Instead of:
you can do this:
Also you can use BeautifulSoup to get urls.
5.- In
It could be replaced by:
6.- A little detail:
It could be:
-- Edit:
As Guimo pointed out, the previous sentences are not equal.
So I would suggest to change
by
Ie, remove parenthesis and an early return.
7.- Are you removing the visited urls from list_urls after is visited? In
8.- Wait some random secs between requesting pages on the same server, you are crawling a website most of them aren't happy with that.
That is what I see in a first approach! Hope it helps !
Compound statements (multiple statements on the same line) are
generally discouraged.
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same lineSource: PEP-8
2.- Wait some milliseconds all those
while True: (ie: wait_kill_threads, print_manager, maybe in producer too) your processor will thank you.3.- Avoid Magic Numbers or Hardcoded numbers. On the last line in
producer:.....threading.active_count()-3)) # -3 because mainthread and deamon threada better practice is use a CONSTANT_VALUE at the beginning of the file;
QTY_THREAD_OFFSET = 3
.....threading.active_count() - QTY_THREAD_OFFSET))(maybe you can think in a better constant name, but you get the idea right?)
The same for:
while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon toousing the previous const it will be:
while(threading.active_count() >= QTY_THREAD_OFFSET):4.- Regex: If you are going to use a regex multiple times a good practice is compile the regex before. Instead of:
re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urlsyou can do this:
URL_RE_PATTERN = '(?<=href=["\'])https?://.+?(?=["\'])'
...
class Crawler():
regex_urls = re.compile(URL_RE_PATTERN)
def __init__(self):
...
...
...
def get_links(self, url):
....
urls = urls_re.match(regex_urls)Also you can use BeautifulSoup to get urls.
5.- In
verify_and_add you can try to avoid nested if's:if(not self.terminate):
if(url not in self.data):
self.data.add(url) #It could be replaced by:
if self.terminate or url in self.data:
return False
self.data.add(url)
...6.- A little detail:
if(req is not None):It could be:
if req:-- Edit:
As Guimo pointed out, the previous sentences are not equal.
So I would suggest to change
def get_links(self, url):
req = self.get_req(url)
if(req is not None):
html = req.text # the html of the page crawled
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
return urls
return []by
def get_links(self, url):
req = self.get_req(url)
if req is None:
return []
html = req.text # the html of the page crawled
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html) # extracting a list with the urls
return urlsIe, remove parenthesis and an early return.
7.- Are you removing the visited urls from list_urls after is visited? In
renew_url you are getting a random url but I can't see if you are removing it from that list.8.- Wait some random secs between requesting pages on the same server, you are crawling a website most of them aren't happy with that.
That is what I see in a first approach! Hope it helps !
Code Snippets
if(not urls): print('[-] Provide start urls'); return # one of the ways ternary operator in python, ';' if continue in same line.....threading.active_count()-3)) # -3 because mainthread and deamon threadQTY_THREAD_OFFSET = 3
.....threading.active_count() - QTY_THREAD_OFFSET))while(threading.active_count() > 2): # main thread (this one) count as 1 and deamon toowhile(threading.active_count() >= QTY_THREAD_OFFSET):Context
StackExchange Code Review Q#154687, answer score: 7
Revisions (0)
No revisions yet.