Crawler with BeautifulSoup
Problem
I am trying to create a web crawler for student research. I have already finished it, but I would like you to tell me whether the approach I use is the best one (it probably isn't :p).
The crawler is for the CNN site, and the only thing I want to get is the text of the news articles.
Here is an example link: link
Here is my code:
import urllib2
from bs4 import BeautifulSoup

def cnn_crawler(link):
    req = urllib2.Request(link, headers={'User-Agent' : "Magic Browser"})
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    # The article body lives in this div; collect its paragraphs.
    div = soup.find('div', attrs={'class': 'cnn_strycntntlft'})
    text = div.find_all('p')
    # Drop the footer paragraph so it is not part of the result.
    text.remove(soup.find('p', attrs={'class': 'cnn_strycbftrtxt'}))
    final = ""
    for entry in text:
        final = final + entry.get_text() + " "
    return final
Solution
The code looks good! A few comments:
- I'd use the requests library instead of urllib2.
- Make sure to follow PEP8 (the second line has trailing whitespace and a space after the dictionary key).
- Use more semantic variable names than div, text, or final.
- If your code gets bigger, think about separating the HTTP request from the parsing code.
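The suggestions above could be combined roughly as follows. This is only a sketch, not a drop-in replacement: it assumes Python 3 with the requests and bs4 packages installed, the names fetch_page and extract_article_text are made up for illustration, and the CSS class names are copied from the question, so they may no longer match CNN's markup. Unlike the original, it skips every paragraph carrying the footer class and returns an empty string when the article div is missing.

```python
import requests
from bs4 import BeautifulSoup

# Class names taken from the question's code; the site's markup may have changed.
ARTICLE_BODY_CLASS = "cnn_strycntntlft"
FOOTER_CLASS = "cnn_strycbftrtxt"


def fetch_page(url):
    """Download a page and return its HTML as text (HTTP only, no parsing)."""
    response = requests.get(url, headers={"User-Agent": "Magic Browser"}, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text  # requests decodes the body to a str for us


def extract_article_text(html):
    """Return the article paragraphs joined by single spaces (parsing only)."""
    soup = BeautifulSoup(html, "html.parser")
    article_body = soup.find("div", class_=ARTICLE_BODY_CLASS)
    if article_body is None:
        return ""  # layout not recognised; the caller decides what to do
    paragraphs = [
        paragraph.get_text()
        for paragraph in article_body.find_all("p")
        if FOOTER_CLASS not in paragraph.get("class", [])
    ]
    return " ".join(paragraphs)


def cnn_crawler(url):
    """Same job as the original function, composed from the two steps."""
    return extract_article_text(fetch_page(url))
```

Because the parsing step only takes a string, it can be tried on a saved or hand-written HTML snippet without touching the network, which is the main payoff of keeping the HTTP request separate.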
Context
StackExchange Code Review Q#30069, answer score: 2