
BFS/DFS Web Crawler

Submitted by: @import:stackexchange-codereview
crawler, bfs, dfs, web

Problem

I've built a web crawler that starts at an origin URL and crawls the web using a BFS or DFS method. Everything is working fine, but the performance is horrendous. I think the major cause of this is my use of synchronous requests. I've used BeautifulSoup and the Requests library to implement this, so nothing is happening asynchronously.

I've tried using AsyncIO and a couple of other ways of making this async, but they have given me a lot of trouble. Any advice on how to do so, or other recommendations for improving performance, would be much appreciated.

BFS Usage:

python3 Webcrawler.py [origin_url] BFS [#_nodes_to_crawl] 0 [keyword_to_find]


DFS Usage:

python3 Webcrawler.py [origin_url] DFS [#_nodes_to_crawl] [depth_limit] [keyword_to_find]


Webcrawler.py

```
import urllib
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests
import collections
from Graph import Graph
from Node import Node
import sys
from time import gmtime, strftime
from timeout import timeout
from multiprocessing import Pool
from multiprocessing import Process
import json
import pdb

class WebCrawler:
    def __init__(self, originUrl, method, totalNodes, depthLimit=None, keyword=None):
        self.originUrl = originUrl
        self.method = method
        self.totalNodes = int(totalNodes)
        self.nodeCount = 0
        self.depthLimit = int(depthLimit)
        self.currentDepth = 0
        self.keyword = keyword
        self.keywordUrls = []
        self.nodeUrlMap = {}
        self.nodesToVisit = []
        self.visitedUrls = set()
        self.graph = Graph()
        self.nodeIndex = 0
        self.storeCookie()
        originTitle = self.getTitle(originUrl)
        startNode = Node(originUrl, None, originTitle)
        self.crawl(startNode)

    def crawl(self, node):
        print("crawl(): " + strftime("%H:%M:%S", gmtime()))
        visited = node.url in self.visitedUrls
        if not visited:
            self.
```

Solution

It is important to understand your bottlenecks by profiling and measuring your program, but here are some performance notes after a "static" look at your code:

  • Try the Scrapy web-scraping framework. It is asynchronous by design (built on the Twisted library) and covers close to everything you may need for web scraping.

  • Initialize a requests.Session() once and reuse it for every request. As the Requests documentation puts it:

    if you're making several requests to the same host, the underlying TCP
    connection will be reused, which can result in a significant
    performance increase

  • Switch from html.parser to the faster lxml parser:

    soup = BeautifulSoup(plainText, "lxml")

  • Use __slots__ for the Graph, Node and Edge classes.

  • Switch from json to the faster ujson.

  • The "test request" may get faster if you switch from GET to HEAD (using the head() method instead of get()), since a HEAD response carries no body.
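Three of the notes above (session reuse, HEAD for the reachability check, and __slots__) could be sketched roughly as follows. This is only an illustration, not the asker's actual classes: the names Fetcher, is_reachable and fetch_html are hypothetical, the Node fields (url, parent, title) are guessed from the Node(originUrl, None, originTitle) call in the question, and the session argument is assumed to be a requests.Session() that the caller creates once.

```python
class Node:
    """Graph node with a fixed attribute set; __slots__ avoids a per-instance __dict__."""

    # Field names are an assumption based on Node(originUrl, None, originTitle).
    __slots__ = ("url", "parent", "title")

    def __init__(self, url, parent, title):
        self.url = url
        self.parent = parent
        self.title = title


class Fetcher:
    """Hypothetical helper that routes every request through one shared session.

    `session` is assumed to behave like requests.Session(), so connections
    to the same host are pooled instead of being reopened for each request.
    """

    __slots__ = ("session", "timeout")

    def __init__(self, session, timeout=5.0):
        self.session = session
        self.timeout = timeout

    def is_reachable(self, url):
        """Cheap "test request": HEAD returns headers only, no body."""
        try:
            response = self.session.head(
                url, allow_redirects=True, timeout=self.timeout
            )
        except OSError:
            # requests.RequestException derives from IOError/OSError,
            # so connection errors and timeouts land here.
            return False
        return response.status_code < 400

    def fetch_html(self, url):
        """Full GET, used only for pages that will actually be parsed."""
        response = self.session.get(url, timeout=self.timeout)
        response.raise_for_status()
        return response.text
```

With Requests installed, this would be wired up as Fetcher(requests.Session()), and a single Fetcher would be shared by the whole crawl so that connections to the same host are reused. Some servers answer HEAD differently from GET, so the reachability check is a heuristic rather than a guarantee.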



Code style notes:

  • Some imports are unused, and the rest are not organized properly (see the PEP 8 section on imports).



  • The names do not follow the Python naming conventions (see the PEP 8 section on naming).



Other notes:

  • It looks like relative and absolute URLs are not handled properly while extracting the links; both forms need to be resolved against the URL of the page they were found on.

  • I don't believe your "extension black list" check is good enough. Currently, you check whether a prohibited extension appears anywhere in the URL, which will mark a lot of "valid" URLs as "invalid".

  • getTitle can be improved by using the find() method and checking that the result is not None; then, if a tag is found, use the get_text() method to get its text:

    def getTitle(self, url):
        print("getTitle(): " + strftime("%H:%M:%S", gmtime()))
        soup = self.generateSoup(url)
        title = soup.title  # same as soup.find("title")
        if title is not None:
            return title.get_text()

  • checkForKeyword() only returns True when a text node matches the keyword exactly. For example, if the keyword is test, checkForKeyword() returns False for the following element, because test is only a part of the text node:

    <b>It is important to test your code</b>

  • Switching to argparse may improve the overall usability, robustness and readability of the command-line argument parsing.
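The link-handling, blacklist, keyword and argparse notes above could look roughly like this. It is a hedged sketch using only the standard library: the helper names (resolve_link, has_blocked_extension, contains_keyword, build_parser) and the contents of BLOCKED_EXTENSIONS are assumptions, since the corresponding parts of the original code are not shown, and the --depth-limit and --keyword options are a suggested replacement for the positional arguments in the usage lines, not the existing interface.

```python
import argparse
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed blacklist; the original list is not shown in the question.
BLOCKED_EXTENSIONS = {".pdf", ".jpg", ".png", ".zip"}


def resolve_link(page_url, href):
    """Turn a relative or absolute href into an absolute URL without a fragment."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute


def has_blocked_extension(url):
    """Look only at the extension of the URL path, not at the whole URL string."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in BLOCKED_EXTENSIONS


def contains_keyword(page_text, keyword):
    """Case-insensitive substring match, e.g. on the result of soup.get_text()."""
    return keyword.lower() in page_text.lower()


def build_parser():
    """Command-line interface with typed, validated arguments."""
    parser = argparse.ArgumentParser(description="BFS/DFS web crawler")
    parser.add_argument("origin_url", help="URL to start crawling from")
    parser.add_argument("method", choices=["BFS", "DFS"])
    parser.add_argument("total_nodes", type=int, help="number of nodes to crawl")
    parser.add_argument("--depth-limit", type=int, default=0)
    parser.add_argument("--keyword", default=None)
    return parser
```

Here resolve_link("http://example.com/a/b.html", "../c.html#top") yields "http://example.com/c.html", and a URL such as "http://example.com/pdf-guide/index.html" is no longer rejected just because "pdf" appears somewhere in it. Substring matching is only one possible policy for the keyword check; word-boundary matching with a regular expression would be stricter.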

There are other things to improve, but I hope you will get more reviews. Alternatively, or in addition, you can approach it step by step, making the code better (a very broad term, I understand) on every "iteration".

Code Snippets

soup = BeautifulSoup(plainText, "lxml")

def getTitle(self, url):
    print("getTitle(): " + strftime("%H:%M:%S", gmtime()))
    soup = self.generateSoup(url)
    title = soup.title  # same as soup.find("title")
    if title is not None:
        return title.get_text()

<b>It is important to test your code</b>

Context

StackExchange Code Review Q#156863, answer score: 2
