
Google News scraper to fetch links with similar stories

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

The following code takes either a URL or the title of an existing news article. If only a URL is given, it first reads the title from the page. It then:

- Searches Google News using the title.
- Collects all links from the search results.

import urllib2
from lxml import html
import requests

def get_page_tree(url=None):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def get_title(url=None):
    tree = get_page_tree(url=url)
    return tree.xpath('//title//text()')[0].strip().split(' -')[0]

def find_other_news_sources(url=None, title=None):
    # Google forwards the url using /url?q=    . This might change over time
    forwarding_identifier = '/url?q='
    if not title:
        title = get_title(url=url)
    google_news_search_url = 'http://www.google.com/search?q=' + urllib2.quote(title) + '&tbm=nws'
    google_news_search_tree = get_page_tree(url=google_news_search_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0] for a_link in
                            google_news_search_tree.xpath('//a//@href') if forwarding_identifier in a_link]
    return other_news_sources_links

Solution

Instead of constructing the google_news_search_url with two string concatenations, use string formatting.
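As a rough sketch of what that could look like, the URL can be built from a single template. The helper name `build_search_url` is hypothetical, and the example assumes Python 3, where `quote` lives in `urllib.parse` rather than `urllib2`.

```python
from urllib.parse import quote  # Python 3 location of urllib2.quote


def build_search_url(title):
    """Return the Google News search URL for a title (hypothetical helper)."""
    # One template instead of two concatenations; quote() escapes the title.
    return 'http://www.google.com/search?q={}&tbm=nws'.format(quote(title))


search_url = build_search_url('hello world')
```

The template keeps the fixed parts of the URL together, so the query parameter order is visible at a glance.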

The other_news_sources_links line is very dense. Please split it up.
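One possible way to split it, shown here as a sketch: filter first, then clean each link in a small named step. The names `extract_target` and `extract_forwarded_links` are hypothetical, and the function takes a plain list of `href` strings (such as the result of the `xpath('//a//@href')` call) so that it can be tried without a network request.

```python
FORWARDING_IDENTIFIER = '/url?q='  # same prefix the original code looks for


def extract_target(href):
    """Strip the forwarding prefix and anything after the first '&'."""
    return href.replace(FORWARDING_IDENTIFIER, '').split('&')[0]


def extract_forwarded_links(hrefs):
    """Keep only forwarded links and return their cleaned targets."""
    forwarded = [href for href in hrefs if FORWARDING_IDENTIFIER in href]
    return [extract_target(href) for href in forwarded]


links = extract_forwarded_links([
    '/url?q=http://example.com/story&sa=U',
    '/search?q=unrelated',
])
```

Each step now has a name, which makes the filtering rule and the cleaning rule easy to change separately.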

Every argument defaults to None, and on top of that you call every function with keyword arguments, which seems unnecessary.
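A small sketch of the difference, using a hypothetical `clean_title` helper that stands in for the title-trimming step of `get_title` so that it runs without network access. The parameter has no default, so forgetting it fails at the call site with a `TypeError` instead of passing `None` further down.

```python
def clean_title(raw_title):
    """Trim whitespace and drop the ' - Site name' suffix from a page title."""
    return raw_title.strip().split(' -')[0]


# Positional call: with a single argument there is nothing to disambiguate.
headline = clean_title('  Some headline - Example News  ')

# Without a None default, a missing argument is reported immediately.
try:
    clean_title()
except TypeError as error:
    missing_argument_message = str(error)
```

Keyword arguments still make sense for `find_other_news_sources`, where the caller genuinely chooses between `url` and `title`.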

But before you fix that, consider why you need these two-line functions in the first place. They don't do anything complicated enough to justify making the reader jump around in the code.

Context

StackExchange Code Review Q#97237, answer score: 3
