patternpythonMinor
Reaching the philosophy wiki page
Viewed 0 times
thereachingpagewikiphilosophy
Problem
I've written a class that will start from a random Wikipedia page, then choose the first link in the main body, and then navigate following the links until it finds the Philosophy page. When I run the
This generally works but I just want to confirm that the code looks clean/intuitive.
Points of concern:
As there are a vast amount of edge cases, I have multiple try/except blocks. Does this look unwieldy?
Also, the point of graphing the path lengths is to try to see what kind of distribution the path lengths form. If the graph looks like it's 'normal' can I assume normality? Or is there a better way to do this (an automated way)?
```
import requests
from lxml.html import fromstring
import json
from bs4 import BeautifulSoup,NavigableString, Tag
import sys
import matplotlib.pyplot as plt
import numpy as np
reload(sys)
sys.setdefaultencoding('utf-8')
class Crawler():
''' Class used to crawl wikipedia pages starting from a random article.
'''
def __init__(self):
self.baseUrl = "https://en.wikipedia.org"
def reformatString(self,char,word):
'''Remove passed in char from a string and convert its characters to lowercase
'''
word = word.lower()
charIdx = word.find(char)
if charIdx != -1:
return word[:charIdx]
return word
def checkNameMatch(self,heading,string):
'''Determine whether or not any part of the article heading is in the string and vice versa
'''
for i in range(len(string)):
for j in range(len(heading)):
if heading[j] in string[i] or string[i] in heading[j]:
return True
return False
def tokenize(self, word):
'''Split the passed in 'word' on space characters and return a list of tokens
'''
tokens = []
currWord = ""
for i in range(len(word)):
testCrawler() method, it crawls starting from 20 pages and then plots the lengths of all of the paths.This generally works but I just want to confirm that the code looks clean/intuitive.
Points of concern:
As there are a vast amount of edge cases, I have multiple try/except blocks. Does this look unwieldy?
Also, the point of graphing the path lengths is to try to see what kind of distribution the path lengths form. If the graph looks like it's 'normal' can I assume normality? Or is there a better way to do this (an automated way)?
```
import requests
from lxml.html import fromstring
import json
from bs4 import BeautifulSoup,NavigableString, Tag
import sys
import matplotlib.pyplot as plt
import numpy as np
reload(sys)
sys.setdefaultencoding('utf-8')
class Crawler():
''' Class used to crawl wikipedia pages starting from a random article.
'''
def __init__(self):
self.baseUrl = "https://en.wikipedia.org"
def reformatString(self,char,word):
'''Remove passed in char from a string and convert its characters to lowercase
'''
word = word.lower()
charIdx = word.find(char)
if charIdx != -1:
return word[:charIdx]
return word
def checkNameMatch(self,heading,string):
'''Determine whether or not any part of the article heading is in the string and vice versa
'''
for i in range(len(string)):
for j in range(len(heading)):
if heading[j] in string[i] or string[i] in heading[j]:
return True
return False
def tokenize(self, word):
'''Split the passed in 'word' on space characters and return a list of tokens
'''
tokens = []
currWord = ""
for i in range(len(word)):
Solution
Code Style Issues:
All separated with a new line. And, remove unused imports:
-
you can improve the way you define
with:
where
-
when getting element's attributes in
with:
-
when joining URL parts, it would be cleaner and more reliable to use
Code Organization Issues:
Performance Improvements:
Note that even if you apply all the suggested changes, I think we would definitely need at least one more round of reviews (it is perfectly okay to post a new question with improved code asking for further improvements).
- naming - this is the first thing that is easily noticeable - your variable and method names should not be named in the camel-case - follow the Python naming convention
testCrawl, even if renamed to a propertest_crawlis a terrible method name - especially given what it is actually doing - how aboutfind_pathsorfind_paths_to_philosophy?..
- there is an unused variable:
idxsin theplotLengths()method
- organize imports per PEP8:
- standard library imports
- related third party imports
- local application/library specific imports
All separated with a new line. And, remove unused imports:
import sys
from bs4 import BeautifulSoup, NavigableString, Tag
import matplotlib.pyplot as plt
import requests- when comparing to
None, it is better to useis notinstead of!=. For instance,if path != None:would becomeif path is not None:. Also, check if you can simply use theif path:thruthiness check here and in similar cases throughout the code
- you can remove extra
()in class definitions - e.g.class Crawler:instead ofclass Crawler():
- docstrings should be enclosed in triple double-quotes, and, if they fit on a single line, start with an upper-case letter and end with a dot.
- you should have a single newline between the class methods, two blank lines after the imports (PEP8 reference)
- use
print()as a function for Python 3 compatibility
- inline comments should begin with a single space (reference)
-
you can improve the way you define
bins, replacing:bins = [0,2,4,6,8,10,12,14,16]with:
bins = range(0, BIN_COUNT + 1, 2)where
BIN_COUNT = 16 (defining it as a "constant")-
when getting element's attributes in
BeautifulSoup, there is no need to use .attrs. You can simply treat an element as a dictionary. E.g. replacing:return ulist[0].li.a.attrs['href']with:
return ulist[0].li.a['href']-
when joining URL parts, it would be cleaner and more reliable to use
urljoin() instead of string concatenation- it would probably be a good idea to define the "magic" number
20as a constant, or allow to configure/change it when starting the crawling
Code Organization Issues:
- there are some "separation of concerns" and Single Responsibility Principle issues - your
Crawlerclass is responsible for too much unrelated things - for instance, thereformatStringorplotLengthsmethods don't sound related to "crawling" - extract these methods
Performance Improvements:
- since you are issuing multiple requests to the same domain, you can make use of
requests.Sessionwhich will reuse the underlying TCP connection which will result into performance gains
- I think you can also do better when parsing HTML - you always need a single part of the page. This made me think, if you can use the
SoupStrainerto parse only part of the document
Note that even if you apply all the suggested changes, I think we would definitely need at least one more round of reviews (it is perfectly okay to post a new question with improved code asking for further improvements).
Code Snippets
import sys
from bs4 import BeautifulSoup, NavigableString, Tag
import matplotlib.pyplot as plt
import requestsbins = [0,2,4,6,8,10,12,14,16]bins = range(0, BIN_COUNT + 1, 2)return ulist[0].li.a.attrs['href']return ulist[0].li.a['href']Context
StackExchange Code Review Q#159648, answer score: 2
Revisions (0)
No revisions yet.