BFS/DFS Web Crawler
Tags: crawler, bfs, dfs, web
Problem
I've built a web crawler that starts at an origin URL and crawls the web using a BFS or DFS method. Everything is working fine, but the performance is horrendous. I think the major cause of this is my use of synchronous requests. I've used BeautifulSoup and the Requests library to implement this, so nothing is happening asynchronously.
I've tried asyncio and a couple of other approaches to make this asynchronous, but they gave me a lot of trouble. Any advice on how to do that, or other recommendations for improving performance, would be much appreciated.
BFS Usage:
```
python3 Webcrawler.py [origin_url] BFS [#_nodes_to_crawl] 0 [keyword_to_find]
```
DFS Usage:
```
python3 Webcrawler.py [origin_url] DFS [#_nodes_to_crawl] [depth_limit] [keyword_to_find]
```
Webcrawler.py
```
import urllib
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests
import collections
from Graph import Graph
from Node import Node
import sys
from time import gmtime, strftime
from timeout import timeout
from multiprocessing import Pool
from multiprocessing import Process
import json
import pdb

class WebCrawler:
    def __init__(self, originUrl, method, totalNodes, depthLimit=None, keyword=None):
        self.originUrl = originUrl
        self.method = method
        self.totalNodes = int(totalNodes)
        self.nodeCount = 0
        self.depthLimit = int(depthLimit)
        self.currentDepth = 0
        self.keyword = keyword
        self.keywordUrls = []
        self.nodeUrlMap = {}
        self.nodesToVisit = []
        self.visitedUrls = set()
        self.graph = Graph()
        self.nodeIndex = 0
        self.storeCookie()
        originTitle = self.getTitle(originUrl)
        startNode = Node(originUrl, None, originTitle)
        self.crawl(startNode)

    def crawl(self, node):
        print("crawl(): " + strftime("%H:%M:%S", gmtime()))
        visited = node.url in self.visitedUrls
        if not visited:
            self.
```
Solution
It is important to understand your bottlenecks by profiling and measuring your program, but here are some performance notes after a "static" look at your code:
- try out the `Scrapy` web-scraping framework: it is asynchronous by nature (based on the `twisted` library) and has very rich functionality for close to everything you may need for web scraping
- initialize a `requests.Session()` and reuse it: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase"
- switch from `html.parser` to the faster `lxml`: `soup = BeautifulSoup(plainText, "lxml")`
- use `__slots__` for the `Graph`, `Node` and `Edge` classes
- switch from `json` to the faster `ujson`
- the speed of the "test request" may be improved by switching from `GET` to `HEAD` (using the `head()` method instead of `get()`)
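The session-reuse and `HEAD` suggestions above can be sketched together. This is a minimal sketch, not the original crawler's code: the helper names `is_reachable` and `reachable_urls`, the 5-second timeout and the "status below 400" rule are assumptions made for illustration. The helpers accept any object that behaves like `requests.Session`, so they can be exercised without network access.

```python
def is_reachable(session, url, timeout=5.0):
    """Send a HEAD request through a shared session and report success.

    `session` is expected to behave like `requests.Session`. The name
    `is_reachable` and the "status below 400" rule are assumptions of
    this sketch, not part of the original crawler.
    """
    try:
        # HEAD asks for headers only, so no response body is downloaded.
        response = session.head(url, allow_redirects=True, timeout=timeout)
    except Exception:
        # Broad on purpose for brevity; with requests, catching
        # requests.RequestException would be the narrower choice.
        return False
    return response.status_code < 400


def reachable_urls(session, urls):
    """Filter `urls`, reusing one session (and its connections) for all of them."""
    return [url for url in urls if is_reachable(session, url)]


# With the real library (needs network access):
#   import requests
#   with requests.Session() as session:
#       print(reachable_urls(session, ["https://example.com/"]))
```

Creating the session once and passing it around is what lets the underlying connection pool be reused across requests to the same host. Some servers reject or mishandle `HEAD`, so a fallback to `GET` may be worth adding if valid pages start being skipped.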
Code style notes:
- unused and not properly organized imports (PEP8 reference)
- Python naming convention (PEP8 reference)
Other notes:
- it looks like you are not properly handling relative and absolute URLs while extracting the links; both need to be handled, for example by resolving each `href` against the page URL with `urllib.parse.urljoin()`
- I don't believe your "extension black list" check is good enough: currently, you check whether a prohibited extension is present anywhere in the URL, which will mark a lot of "valid" URLs as "invalid"
- `getTitle` can be improved by using the `find()` method and checking that the result is not `None`; then, if a tag is found, use the `get_text()` method to get the text of the tag:
  ```
  def getTitle(self, url):
      print("getTitle(): " + strftime("%H:%M:%S", gmtime()))
      soup = self.generateSoup(url)
      title = soup.title  # same as soup.find("title")
      if title is not None:
          return title.get_text()
  ```
- `checkForKeyword()` would only return `True` if there is a text node that matches the keyword exactly. E.g., if the keyword is `test`, `checkForKeyword()` would return `False` if `test` is present in the HTML but is only part of a text node, say: `<b>It is important to test your code</b>`
- switching to `argparse` may improve the overall usability, robustness and readability of command-line argument parsing

There are other things to improve, but I hope you will get more reviews. Alternatively, you can approach it step by step, making the code better (a very broad term, I understand) on every "iteration".
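The URL-resolution, extension-check and keyword notes above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the helper names, the contents of `BLOCKED_EXTENSIONS` and the choice of case-insensitive matching are all hypothetical and not taken from the original crawler.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical blacklist; the original crawler's list is not shown.
BLOCKED_EXTENSIONS = {".pdf", ".jpg", ".png", ".zip"}


def resolve_link(page_url, href):
    """Turn a relative or absolute href into an absolute URL without a fragment."""
    absolute = urljoin(page_url, href)  # absolute hrefs pass through unchanged
    return urldefrag(absolute).url      # drop "#section" so duplicates collapse


def has_blocked_extension(url, blocked=BLOCKED_EXTENSIONS):
    """Look only at the extension of the URL path, not at the whole URL string."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in blocked


def contains_keyword(text, keyword):
    """Substring match (case-insensitive by assumption) instead of exact equality."""
    return keyword.lower() in text.lower()
```

With these helpers, `https://example.com/pdf-guides/index.html` is no longer rejected just because `pdf` appears somewhere in the URL, and a keyword such as `test` is found inside a longer text node. A plain substring match also hits words like "contest", so a word-boundary regular expression may be a better fit depending on what the keyword search is meant to find.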
Code Snippets
```
soup = BeautifulSoup(plainText, "lxml")
```
```
def getTitle(self, url):
    print("getTitle(): " + strftime("%H:%M:%S", gmtime()))
    soup = self.generateSoup(url)
    title = soup.title  # same as soup.find("title")
    if title is not None:
        return title.get_text()
```
```
<b>It is important to test your code</b>
```
Context
StackExchange Code Review Q#156863, answer score: 2