HiveBrain v1.2.0

Web crawler that filters out non diseases

Submitted by: @import:stackexchange-codereview

Problem

It is very messy and I lack the experience to make it elegant, so I wanted some help with it. The processing time is also very slow.

Currently it goes to the first page and goes through the links in the index looking for the word "symptoms." If it's not found, it looks for a link inside that page named "faqs", follows it, and looks for "symptoms" there.

```
import requests
import string
from bs4 import BeautifulSoup, Tag

alpha = 'a b c d e f g h i j k l m n o p q r s t u v w x y z'
alpha = alpha.split()
w = []

def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            if l.text != "USA.gov" and l.text != 'Inspector General' and l.text != '':
                href = l.get('href')
                diseasechecker(href)
        i += 1

def diseasechecker(url2):
    source_code = requests.get(url2)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for y in soup.findAll('p'):
        x = str(y)
        txt = y.text
        if 'symptoms' in x and url2 not in w:
            w.append(url2)
        elif 'symptoms' in txt and url2 not in w:
            w.append(url2)
    for y in soup.findAll('li'):
        x = str(y)
        txt = y.text
        if 'symptoms' in x and url2 not in w:
            w.append(url2)
        elif 'symptoms' in txt and url2 not in w:
            w.append(url2)
    for y in soup.findAll('a'):
        href2 = y.get('href')
        try:
            s = str(href2)
            u = "http://www.cdc.gov" + s
        except UnicodeEncodeError:
            pass
        l3 = 0
        if 'faqs' in s:
            l3 = level3(u)
        if l3 == 1 and url2 not in w:
            w.append(url2)

def level3(url3):
    source_code = requests.get(url3)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for y in soup.findAll('p'):
        if 'symptoms' in str(y) or 'symptoms' in y.text:
            return 1
    return 0

disease_spider(25)
for i in range(len(w)):
    print w[i]
```

Solution

Here are some suggestions:

-
More comments and docstrings would be nice, so the reader/reviewer can see what you intend the code to do (why was the code written this way?). It makes the code easier to follow, because your intention is clear, and it also lets me spot bugs where the implementation doesn't match the intention.

-
Add more whitespace! Particularly vertically; breaking up the functions with blank lines will make the code substantially easier to read.

-
At the top of the file, you construct a list of lowercase letters, alpha. You’ve also imported, but not used, the string module. It would be better to use string.ascii_lowercase, which the Python docs describe as follows:


string.ascii_lowercase


The lowercase letters 'abcdefghijklmnopqrstuvwxyz'. This value is not locale-dependent and will not change.

Then you can get rid of the alpha variable, and just use this.
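For instance (a Python 3 sketch; the original code targets Python 2, where the equivalent constant was string.lowercase):

```python
import string

# The built-in constant replaces the hand-built alpha list.
for letter in string.ascii_lowercase[:3]:
    url = 'http://www.cdc.gov/DiseasesConditions/az/{}.html'.format(letter)
    print(url)
```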

  • disease_spider()



You should use a for loop rather than a while loop:

for i in range(maxpages):
    # do some stuff


Rather than constructing the request URL using string concatenation, use new-style format strings:

url = 'http://www.cdc.gov/DiseasesConditions/az/{}.html'.format(
    string.ascii_lowercase[i]
)


I’m guessing that this function goes through the pages from A to Z, yes? But there’s no guard against the user calling this function with maxpages ≥ 26, which would throw an IndexError here. You should add some sort of check.
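One possible shape for that check (a sketch only; the restructure to return the URLs rather than fetch them, and the ValueError message, are mine, not part of the original answer):

```python
import string

def disease_spider(maxpages):
    """Build the A-Z index page URLs; there are only 26 of them."""
    # Guard against maxpages running past the end of the alphabet.
    if not 0 <= maxpages <= len(string.ascii_lowercase):
        raise ValueError('maxpages must be between 0 and 26')
    urls = []
    for letter in string.ascii_lowercase[:maxpages]:
        urls.append('http://www.cdc.gov/DiseasesConditions/az/{}.html'.format(letter))
    return urls  # in the real crawler, each URL would be fetched here
```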

Use a better variable name than l in your soup.findAll loop. How about link?

Then I’d have a list of forbidden names, and just check that link.text isn’t in that list. For example:

forbidden_text = ['USA.gov', 'Inspector General', '']
if link.text not in forbidden_text:
    # do some more stuff


  • diseasechecker()



I noticed when reading this function that you use this block of code in all three functions:

source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)


You should consider pulling this out into its own function, that takes a URL and returns the corresponding BeautifulSoup object. It would slightly tidy up the code.
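A sketch of that helper (the 'html.parser' argument is my addition; newer BeautifulSoup versions emit a warning if you don't name a parser explicitly):

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a URL and return the parsed BeautifulSoup object."""
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')
```

Each of the three functions could then start with a single `soup = get_soup(url)` line.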

The variable names in this function are all short and undescriptive; pick something that tells me what the value of this variable represents.

This function mutates a global list w, although it’s the only function to do so. I think it would be better if you moved the definition of w inside this function, and returned it at the end. That makes the function more self-contained.
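To illustrate the pattern (find_matches is a hypothetical stand-in for diseasechecker, stripped of the network calls so only the list-building is shown):

```python
def find_matches(pages):
    """Return the URLs whose text mentions 'symptoms'.

    Builds and returns a local list instead of mutating a global.
    """
    matches = []
    for url, text in pages:
        if 'symptoms' in text and url not in matches:
            matches.append(url)
    return matches
```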

You loop over all p and li elements separately, but do the same thing with each. Why not combine the loops, so you only have to write out the body of the loop once:

for element in soup.findAll('p') + soup.findAll('li'):
    # do some stuff with element


I believe you can also tidy up the if statement:

if (url2 not in w) and ('symptoms' in str(element) or 'symptoms' in element.text):
    w.append(url2)


The parens aren’t strictly necessary, but I think they’re useful here for readability. I also got rid of the x and txt variables, which I don’t think aid readability. This is more direct.

  • level3()



Again, pick some better variable names than x and y.

And you can tidy up the if statement:

if 'symptoms' in str(y) or 'symptoms' in y.text:
    return 1


-
In the final loop, you’re looping over the index:

for i in range(len(w)):
    print w[i]


It is far cleaner and more Pythonic to loop over the list directly; i.e.,

for word in w:
    print word


Context

StackExchange Code Review Q#95943, answer score: 6
