Crawler with BeautifulSoup

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I am building a web crawler for student research. I have already finished it, but I would like to know whether my approach is the best one. (It probably isn't :p)

The crawler is for the CNN site, and the only thing I want to extract is the text of the news articles.

Here is an example link: link

Here is my code:

def cnn_crawler(link):
    req = urllib2.Request(link, headers={'User-Agent' : "Magic Browser"}) 
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()

    soup = BeautifulSoup(page)
    div = soup.find('div', attrs={'class': 'cnn_strycntntlft'})
    text = div.find_all('p')
    text.remove(soup.find('p', attrs={'class': 'cnn_strycbftrtxt'}))
    final = ""
    for entry in text:
        final = final + entry.get_text() + " "
    return final

Solution

The code looks good! A few comments:

I'd use the requests library instead of urllib2.

Make sure to follow PEP 8 (the urllib2.Request line has trailing whitespace and a space before the colon in the headers dictionary).

Use more semantic variable names than div, text, or final.

If your code gets bigger, think about separating the HTTP request from the parsing code.
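
The comments above could be combined roughly as follows. This is only a sketch, assuming Python 3 with the requests and bs4 packages installed; the names fetch_html and extract_article_text, the constants, and the timeout value are illustrative choices, not part of the original code. The CSS class names and the User-Agent string are taken from the question.

```python
import requests
from bs4 import BeautifulSoup

# Class names copied from the question; they may no longer match the live site.
ARTICLE_BODY_CLASS = "cnn_strycntntlft"
FOOTER_CLASS = "cnn_strycbftrtxt"


def fetch_html(url, timeout=10):
    """Download a page and return its decoded HTML. Only this function touches the network."""
    response = requests.get(
        url, headers={"User-Agent": "Magic Browser"}, timeout=timeout
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text  # requests decodes the body using the detected charset


def extract_article_text(html):
    """Return the article paragraphs joined by single spaces. No network access."""
    soup = BeautifulSoup(html, "html.parser")
    article_body = soup.find("div", class_=ARTICLE_BODY_CLASS)
    if article_body is None:
        return ""  # page layout did not match; assumption: empty text is acceptable
    paragraphs = [
        paragraph.get_text()
        for paragraph in article_body.find_all("p")
        # skip the footer paragraph instead of removing it from the list afterwards
        if FOOTER_CLASS not in paragraph.get("class", [])
    ]
    return " ".join(paragraphs)


def cnn_crawler(url):
    """Fetch a page and extract its article text."""
    return extract_article_text(fetch_html(url))
```

Because the parsing step takes a plain string, it can be tried on saved or hand-written HTML without making a request. Note that, unlike the original loop, joining with " " leaves no trailing space at the end of the result.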

Context

StackExchange Code Review Q#30069, answer score: 2
