Google News scraper to fetch links with similar stories
Problem
The following code takes either a URL or the title of an existing news article and then:
- Searches Google News using the title.
- Collects all links from search results.
-
Searches Google News using the title.
-
Collects all links from search results.
    import urllib2
    from lxml import html
    import requests


    def get_page_tree(url=None):
        page = requests.get(url=url, verify=False)
        return html.fromstring(page.text)


    def get_title(url=None):
        tree = get_page_tree(url=url)
        return tree.xpath('//title//text()')[0].strip().split(' -')[0]


    def find_other_news_sources(url=None, title=None):
        # Google forwards the url using /url?q= . This might change over time
        forwarding_identifier = '/url?q='
        if not title:
            title = get_title(url=url)
        google_news_search_url = 'http://www.google.com/search?q=' + urllib2.quote(title) + '&tbm=nws'
        google_news_search_tree = get_page_tree(url=google_news_search_url)
        other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0] for a_link in
                                    google_news_search_tree.xpath('//a//@href') if forwarding_identifier in a_link]
        return other_news_sources_links

Solution
- Instead of constructing the google_news_search_url with two string concatenations, use string formatting.
- The other_news_sources_links line is very dense. Please split it up.
- In addition to the defaults for all your args being None, you're calling all of your functions with keyword arguments, which seems unnecessary.
- But before you fix that, consider why you need these two-line functions in the first place. They don't seem to do anything complicated enough to warrant having to jump around in the code.
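As a rough illustration of the first two points, here is a minimal sketch, assuming Python 3 (where quote lives in urllib.parse rather than urllib2). The names build_search_url, extract_source_links, SEARCH_URL_TEMPLATE and FORWARDING_IDENTIFIER are hypothetical and do not appear in the original post. The network fetch is left out so that the pieces can be run on their own, and the parameters are plain positional arguments without None defaults.

```python
from urllib.parse import quote  # Python 3 location of quote(); the original uses urllib2.quote

# Google forwards result links through /url?q= ; this might change over time.
FORWARDING_IDENTIFIER = '/url?q='
SEARCH_URL_TEMPLATE = 'http://www.google.com/search?q={query}&tbm=nws'


def build_search_url(title):
    """Build the Google News search URL with a single format call."""
    return SEARCH_URL_TEMPLATE.format(query=quote(title))


def extract_source_links(hrefs):
    """Keep forwarded hrefs, dropping the forwarding prefix and trailing parameters."""
    links = []
    for href in hrefs:
        if FORWARDING_IDENTIFIER not in href:
            continue  # not a forwarded search result
        target = href.replace(FORWARDING_IDENTIFIER, '')
        links.append(target.split('&')[0])
    return links


if __name__ == '__main__':
    sample_hrefs = ['/url?q=http://example.com/story&sa=U', '/preferences?hl=en']
    print(build_search_url('Some headline'))
    print(extract_source_links(sample_hrefs))
```

In the original function, extract_source_links would be fed the result of google_news_search_tree.xpath('//a//@href'); the loop performs the same filter, replace and split steps as the dense comprehension, one per line.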
Context
StackExchange Code Review Q#97237, answer score: 3