
Google News scraper to fetch links with similar stories

Submitted by: @import:stackexchange-codereview

Problem

The following code takes either a URL or the title of an existing news article, then:

- Searches Google News using the title.
- Collects all links from the search results.

import urllib2
from lxml import html
import requests

def get_page_tree(url=None):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def get_title(url=None):
    tree = get_page_tree(url=url)
    return tree.xpath('//title//text()')[0].strip().split(' -')[0]

def find_other_news_sources(url=None, title=None):
    # Google forwards the url using /url?q=    . This might change over time
    forwarding_identifier = '/url?q='
    if not title:
        title = get_title(url=url)
    google_news_search_url = 'http://www.google.com/search?q=' + urllib2.quote(title) + '&tbm=nws'
    google_news_search_tree = get_page_tree(url=google_news_search_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0] for a_link in
                            google_news_search_tree.xpath('//a//@href') if forwarding_identifier in a_link]
    return other_news_sources_links

Solution


  • Instead of constructing the google_news_search_url with two string concatenations, use string formatting.
  • The other_news_sources_links line is very dense. Split it into several steps.
  • Every argument defaults to None, and you also call all of your functions with keyword arguments. Both seem unnecessary.
  • Before you fix that, though, consider why you need these two-line functions in the first place. They don't do anything complicated enough to justify jumping around in the code.
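
The first three points could look roughly like the sketch below. It is only an illustration under some assumptions: it targets Python 3, where `quote` lives in `urllib.parse` rather than `urllib2`, and the helper names `build_search_url` and `extract_forwarded_links` are hypothetical, not taken from the original post. The network fetch is left out; the list of hrefs is assumed to come from something like `tree.xpath('//a//@href')`, as in the question.

```python
from urllib.parse import quote  # Python 3 home of urllib2.quote (assumption: Python 3)

# Google forwards result links through /url?q=... (this may change over time).
FORWARDING_IDENTIFIER = '/url?q='


def build_search_url(title):
    """Build the Google News search URL with string formatting."""
    return 'http://www.google.com/search?q={}&tbm=nws'.format(quote(title))


def extract_forwarded_links(hrefs):
    """Keep only forwarded links and strip Google's wrapper from each one."""
    links = []
    for href in hrefs:
        if FORWARDING_IDENTIFIER not in href:
            continue
        # Drop the forwarding prefix, then cut off Google's tracking parameters.
        target = href.replace(FORWARDING_IDENTIFIER, '')
        links.append(target.split('&')[0])
    return links
```

Both helpers take a required positional argument, so a call reads `build_search_url(title)` instead of `build_search_url(title=title)`. Whether they should stay separate functions at all is the question the last point raises; they are split out here mainly so each step can be read and checked on its own.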

Context

StackExchange Code Review Q#97237, answer score: 3
