HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Reaching the philosophy wiki page

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thereachingpagewikiphilosophy

Problem

I've written a class that will start from a random Wikipedia page, then choose the first link in the main body, and then navigate following the links until it finds the Philosophy page. When I run the testCrawler() method, it crawls starting from 20 pages and then plots the lengths of all of the paths.
This generally works but I just want to confirm that the code looks clean/intuitive.

Points of concern:
As there are a vast amount of edge cases, I have multiple try/except blocks. Does this look unwieldy?

Also, the point of graphing the path lengths is to try to see what kind of distribution the path lengths form. If the graph looks like it's 'normal' can I assume normality? Or is there a better way to do this (an automated way)?

```
import requests
from lxml.html import fromstring
import json
from bs4 import BeautifulSoup,NavigableString, Tag
import sys
import matplotlib.pyplot as plt
import numpy as np

reload(sys)
sys.setdefaultencoding('utf-8')

class Crawler():
''' Class used to crawl wikipedia pages starting from a random article.
'''
def __init__(self):
self.baseUrl = "https://en.wikipedia.org"

def reformatString(self,char,word):
'''Remove passed in char from a string and convert its characters to lowercase
'''
word = word.lower()
charIdx = word.find(char)
if charIdx != -1:
return word[:charIdx]
return word
def checkNameMatch(self,heading,string):
'''Determine whether or not any part of the article heading is in the string and vice versa
'''
for i in range(len(string)):
for j in range(len(heading)):
if heading[j] in string[i] or string[i] in heading[j]:
return True
return False

def tokenize(self, word):
'''Split the passed in 'word' on space characters and return a list of tokens
'''
tokens = []
currWord = ""
for i in range(len(word)):

Solution

Code Style Issues:

  • naming - this is the first thing that is easily noticeable - your variable and method names should not be named in the camel-case - follow the Python naming convention



  • testCrawl, even if renamed to a proper test_crawl is a terrible method name - especially given what it is actually doing - how about find_paths or find_paths_to_philosophy?..



  • there is an unused variable: idxs in the plotLengths() method



  • organize imports per PEP8:





  • standard library imports



  • related third party imports



  • local application/library specific imports




All separated with a new line. And, remove unused imports:

import sys

from bs4 import BeautifulSoup, NavigableString, Tag
import matplotlib.pyplot as plt
import requests


  • when comparing to None, it is better to use is not instead of !=. For instance, if path != None: would become if path is not None:. Also, check if you can simply use the if path: thruthiness check here and in similar cases throughout the code



  • you can remove extra () in class definitions - e.g. class Crawler: instead of class Crawler():



  • docstrings should be enclosed in triple double-quotes, and, if they fit on a single line, start with an upper-case letter and end with a dot.



  • you should have a single newline between the class methods, two blank lines after the imports (PEP8 reference)



  • use print() as a function for Python 3 compatibility



  • inline comments should begin with a single space (reference)



-
you can improve the way you define bins, replacing:

bins = [0,2,4,6,8,10,12,14,16]


with:

bins = range(0, BIN_COUNT + 1, 2)


where BIN_COUNT = 16 (defining it as a "constant")

-
when getting element's attributes in BeautifulSoup, there is no need to use .attrs. You can simply treat an element as a dictionary. E.g. replacing:

return ulist[0].li.a.attrs['href']


with:

return ulist[0].li.a['href']


-
when joining URL parts, it would be cleaner and more reliable to use urljoin() instead of string concatenation

  • it would probably be a good idea to define the "magic" number 20 as a constant, or allow to configure/change it when starting the crawling



Code Organization Issues:

  • there are some "separation of concerns" and Single Responsibility Principle issues - your Crawler class is responsible for too much unrelated things - for instance, the reformatString or plotLengths methods don't sound related to "crawling" - extract these methods



Performance Improvements:

  • since you are issuing multiple requests to the same domain, you can make use of requests.Session which will reuse the underlying TCP connection which will result into performance gains



  • I think you can also do better when parsing HTML - you always need a single part of the page. This made me think, if you can use the SoupStrainer to parse only part of the document



Note that even if you apply all the suggested changes, I think we would definitely need at least one more round of reviews (it is perfectly okay to post a new question with improved code asking for further improvements).

Code Snippets

import sys

from bs4 import BeautifulSoup, NavigableString, Tag
import matplotlib.pyplot as plt
import requests
bins = [0,2,4,6,8,10,12,14,16]
bins = range(0, BIN_COUNT + 1, 2)
return ulist[0].li.a.attrs['href']
return ulist[0].li.a['href']

Context

StackExchange Code Review Q#159648, answer score: 2

Revisions (0)

No revisions yet.