Web crawler that filters out non diseases
Problem
It is very messy and I lack the experience to make it eloquent, so I wanted some help with it. The processing time is also very slow.
Currently it goes to the first index page and follows the links in the index looking for the word "symptoms." If that is not found, it looks for a link named "faqs" inside that page, follows it, and looks for 'symptoms' there.
```
import requests
import string
from bs4 import BeautifulSoup, Tag

alpha = 'a b c d e f g h i j k l m n o p q r s t u v w x y z'
alpha = alpha.split()
w = []

def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            if l.text != "USA.gov" and l.text != 'Inspector General' and l.text != '':
                href = l.get('href')
                diseasechecker(href)
        i += 1

def diseasechecker(url2):
    source_code = requests.get(url2)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for y in soup.findAll('p'):
        x = str(y)
        txt = y.text
        if 'symptoms' in x and url2 not in w:
            w.append(url2)
        elif 'symptoms' in txt and url2 not in w:
            w.append(url2)
    for y in soup.findAll('li'):
        x = str(y)
        txt = y.text
        if 'symptoms' in x and url2 not in w:
            w.append(url2)
        elif 'symptoms' in txt and url2 not in w:
            w.append(url2)
    for y in soup.findAll('a'):
        href2 = y.get('href')
        try:
            s = str(href2)
            u = "http://www.cdc.gov" + s
        except UnicodeEncodeError:
            pass
        l3 = 0
        if 'faqs' in s:
            l3 = level3(u)
        if l3 == 1 and url2 not in w:
            w.append(url2)

def level3(url3):
    source_code = requests.get(url3)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for y in soup.findAll('p'):
        x = str(y)
        txt = y.text
        if 'symptoms' in x or 'symptoms' in txt:
            return 1
    return 0

for i in range(len(w)):
    print w[i]
```
Solution
Here are some suggestions:
- More comments and docstrings would be nice, so the reader/reviewer can see what you intend the code to do (why was the code written this way?). It makes it easier to follow, because your intention is clear, and it also lets me spot bugs where the implementation doesn't match the intention.
- Add more whitespace! Particularly vertically; breaking up the functions with blank lines will make the code substantially easier to read.
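As an illustration of the first point, disease_spider could open with a docstring along these lines (the wording here is my own, not from the original code):

```python
def disease_spider(maxpages):
    """Crawl the CDC A-Z disease index and collect links mentioning symptoms.

    maxpages -- how many of the per-letter index pages to visit.
    """
```

Even a one-line docstring tells the reader what the function is for before they read its body.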
- At the top of the file, you construct a list of lowercase letters, alpha. You've also imported, but not used, the string module. It would be better to use string.ascii_lowercase, whose documentation reads: "The lowercase letters 'abcdefghijklmnopqrstuvwxyz'. This value is not locale-dependent and will not change." Then you can get rid of the alpha variable and just use this.
- disease_spider()
You should use a for loop rather than a while loop:

```
for i in range(maxpages):
    # do some stuff
```

Rather than constructing the request URL using string concatenation, use new-style format strings:

```
url = 'http://www.cdc.gov/DiseasesConditions/az/{}.html'.format(
    string.ascii_lowercase[i]
)
```

I'm guessing that this function goes through the pages from A to Z, yes? But there's no guard against the user calling this function with maxpages ≥ 26, which would throw an IndexError here. You should add some sort of check.

Use a better variable name than l in your soup.findAll loop. How about link? Then I'd have a list of forbidden names, and just check that link.text isn't in that list. For example:

```
forbidden_text = ['USA.gov', 'Inspector General', '']
if link.text not in forbidden_text:
    # do some more stuff
```
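Putting these disease_spider() points together, one possible sketch (the helper name build_index_url and the clamping guard are illustrative choices of mine, not from the original code):

```python
import string

import requests
from bs4 import BeautifulSoup

forbidden_text = ['USA.gov', 'Inspector General', '']

def build_index_url(letter):
    # New-style format string instead of '+' concatenation.
    return 'http://www.cdc.gov/DiseasesConditions/az/{}.html'.format(letter)

def disease_spider(maxpages):
    """Visit up to maxpages of the A-Z index pages and check each link."""
    # Clamp to 26: there is one index page per letter, so larger values
    # would otherwise raise an IndexError.
    for letter in string.ascii_lowercase[:min(maxpages, 26)]:
        soup = BeautifulSoup(requests.get(build_index_url(letter)).text)
        for link in soup.findAll('a', {'class': 'noLinking'}):
            if link.text not in forbidden_text:
                diseasechecker(link.get('href'))  # defined elsewhere in the crawler
```

The slice over string.ascii_lowercase removes both the hand-built alpha list and the manual index bookkeeping.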
- diseasechecker()

I noticed when reading this function that you use this block of code in all three functions:

```
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
```

You should consider pulling this out into its own function that takes a URL and returns the corresponding BeautifulSoup object. It would slightly tidy up the code.

The variable names in this function are all short and undescriptive; pick something that tells me what the value of each variable represents.

This function mutates a global list, w, although it's the only function to do so. I think it would be better if you moved the definition of w inside this function, and returned it at the end. That makes the function more self-contained.

You loop over all p and li elements separately, but do the same thing with each. Why not combine the loops, so you only have to write out the body of the loop once:

```
for element in soup.findAll('p') + soup.findAll('li'):
    # do some stuff with element
```

I believe you can also tidy up the if statement:

```
if (url2 not in w) and ('symptoms' in str(y) or 'symptoms' in y.text):
    w.append(url2)
```

The parens aren't strictly necessary, but I think they're useful here for readability. I also got rid of the x and txt variables, which I don't think aid readability. This is more direct.
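The shared fetch-and-parse block can be sketched as its own helper, and the combined p/li loop as a small predicate; the names get_soup and mentions_symptoms here are my own suggestions, not from the original:

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a page and parse it; replaces the three-line block repeated above."""
    return BeautifulSoup(requests.get(url).text)

def mentions_symptoms(soup):
    # One loop over both <p> and <li> elements instead of two copies of the body.
    for element in soup.findAll('p') + soup.findAll('li'):
        if 'symptoms' in str(element) or 'symptoms' in element.text:
            return True
    return False
```

get_soup needs the network, but mentions_symptoms can be checked against a static snippet such as BeautifulSoup('<p>Signs and symptoms</p>').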
- level3()

Again, pick some better variable names than x and y. And you can tidy up the if statement:

```
if 'symptoms' in str(y) or 'symptoms' in y.text:
    return 1
```
- In the final loop, you're looping over the index:

```
for i in range(len(w)):
    print w[i]
```

It is far cleaner and more Pythonic to loop over the list directly; i.e.,

```
for word in w:
    print word
```
StackExchange Code Review Q#95943, answer score: 6