Parsing a Wikipedia page for a country
Problem
The program should accept the name of a country as input. It should then parse the Wikipedia page for that country, find all links on that page to the Wikipedia pages of other countries, and make a list of those countries.
For example, if the argument is India and http://en.wikipedia.org/wiki/India has links to the wiki pages of Nepal and Pakistan, the list associated with India is [Nepal, Pakistan]. The program should recursively perform the same operation for any country it has not encountered so far.
After parsing the wiki page for India, it should parse the wiki pages of Nepal, Pakistan and so on.
The final output is a JSON object in which each key is a country and each value is the list of countries linked from that country's wiki page.
Sample output can be:

```
{"India": ["Nepal", "Pakistan"], "Nepal": ["India", "China"], "Pakistan": ["India"], "China": ["Japan"], "Japan": ["India", "China"]}
```

This is what I have tried, but I want to get the best approach for it:
```
import pycountry
import wikipedia

country_dict = {}
final_dict = {}
refr_list = []

def GetCountryDict():
    country_list = list(pycountry.countries)
    for country in country_list:
        country_dict[country.name] = 1
    return country_dict

def internalParse(internal_list,final_dict,countryDict,refr_list):
    for i in refr_list:
        if not final_dict.has_key(i):
            WikiParser(i,countryDict)
        else:
            refr_list.remove(i)

def WikiParser(country_name,country_dict):
    wiki_resp =[]
    internal_list = []
    try:
        wiki_resp = wikipedia.page(country_name)
    except:
        pass
    for i in wiki_resp.links:
        if country_dict.has_key(i):
            if i not in internal_list:
                refr_list.append(i)
                internal_list.append(i)
    final_dict[country_name] = internal_list
    try:
        refr_list.remove(country_name)
    except:
        pass
    return internal_list, final_dict,refr_list

if
```
Solution
You have quite a few violations of the style guide. This isn't necessarily crucial to follow (although highly recommended!), but more important is that your code lacks consistency.
- You have three functions, two named with `CapitalizedWords` and one with `mixedCase`; and
- `internalParse`'s parameters are a mix of `lower_case_with_underscores` and `mixedCase`.
Your whitespace is at least consistent, but spaces after commas make it easier to read; compare:

```
def internalParse(internal_list,final_dict,countryDict,refr_list):
```

with:

```
def internalParse(internal_list, final_dict, countryDict, refr_list):
```

`country_dict`, `final_dict` and `refr_list` shouldn't be defined at the top level of the script; global variables are a bad idea. You could create `country_dict` in `GetCountryDict` and `final_dict` and `refr_list` in `WikiParser`.

The current `country_dict` seems a bit pointless - you fill it with `1`s for some reason, but then only ever use the keys. I would suggest a set instead, and using `foo in bar` rather than `bar.has_key(foo)` (`in` would still work with a dictionary, by the way). This really simplifies `GetCountryDict`:

```
def get_countries():
    """Get a set of all valid country names."""
    return set(country.name for country in pycountry.countries)
```

Note that I have renamed the function in line with the style guide and provided a docstring to explain what it does.
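To see why a set is enough here, the following is a small sketch of the membership test. `SAMPLE_COUNTRIES` is a hypothetical stand-in for `pycountry.countries` (so the example runs without the package installed), the optional `countries` parameter is added only for that purpose, and the link titles are made up:

```python
from collections import namedtuple

# Hypothetical stand-in for pycountry.countries: objects with a .name attribute.
Country = namedtuple("Country", "name")
SAMPLE_COUNTRIES = [Country("India"), Country("Nepal"), Country("Pakistan")]


def get_countries(countries=SAMPLE_COUNTRIES):
    """Get a set of all valid country names."""
    return {country.name for country in countries}


valid_countries = get_countries()

# Made-up link titles, as a page's link list might look.
links = ["Nepal", "Himalayas", "Pakistan", "Ganges"]

# `in` on a set is a hash lookup; no dummy values are needed.
linked_countries = [link for link in links if link in valid_countries]
```

The same `link in valid_countries` test would work unchanged if `valid_countries` were a dictionary, which is why dropping `has_key` costs nothing.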
`internalParse` has four parameters but only uses three of them; you should either remove `internal_list` or, better, explicitly pass it through to `WikiParser` (where it's currently accessed by scope).

`WikiParser` is currently called in two places - directly from the `if __name__ == '__main__':` block, and indirectly via `internalParse`. This makes it more difficult than it needs to be to figure out what the code is doing.

Given the output you want, a better structure might be:

```
if __name__ == '__main__':
    # 1. Get set of valid countries first
    valid_countries = get_countries()
    # 2. Add input validation
    country_name = input_valid_country(valid_countries)
    # 3. Call parser
    print(wiki_parser(country_name, valid_countries))
```

Notes:
1. Already covered;
2. See Asking the user for input until they give a valid response;
3. Refactor `wiki_parser` to return the dictionary of lists that you want. You can still have something like `internalParse`, but as a private function that's only called via `wiki_parser` (conventionally, it would therefore be named with a leading underscore: `_internal_parse`).
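`input_valid_country` is named in the structure above but not defined in the answer. A minimal sketch of what it might look like, assuming Python 3's `input`; the `read` parameter is a hypothetical addition so the loop can be exercised without a terminal:

```python
def input_valid_country(valid_countries, read=input):
    """Keep prompting until the user enters a known country name."""
    while True:
        name = read("Enter a country: ").strip()
        if name in valid_countries:
            return name
        # Not a known country: report it and ask again.
        print("Unknown country: {!r}; please try again.".format(name))


# Simulated user who mistypes once before giving a valid name.
replies = iter(["Narnia", " India "])
chosen = input_valid_country({"India", "Nepal"}, read=lambda prompt: next(replies))
```

In a real script you would call it with just the set of valid countries and let it read from standard input.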
I don't have `pycountry` or `wikipedia` installed, so this is untested. The ideas should be useful, though!

Given the nature of the task, one option for `wiki_parser` would be to do it recursively. This has advantages in terms of clear and comprehensible code, although one issue can be hitting the system recursion limit if you delve too deeply:

```
def wiki_parser(country_name, valid_countries, out=None):
    """Recursively parse Wikipedia for linked countries."""
    if out is None:
        out = {}
    out[country_name] = []
    for country in _parse_page(country_name, valid_countries):
        out[country_name].append(country)
        if country not in out:
            # Recurse into the newly found country, sharing the same dict
            wiki_parser(country, valid_countries, out)
    return out
```

This uses `out` both to provide the output and to determine if we've already seen a given country - membership testing (`foo in bar`) is as efficient with a dict as a set, as they're both hash-based (O(1), vs. O(n) for a list/tuple).

If you're wondering why I've used `out=None` rather than `out={}`, see “Least Astonishment” in Python: The Mutable Default Argument.

```
def _parse_page(country_name, valid_countries):
    """Parse a single page and return list of linked countries."""
    try:
        wiki_resp = wikipedia.page(country_name)
    except Exception:
        return []
    return [link for link in wiki_resp.links if link in valid_countries]
```

Note that `except Exception:` is the bare minimum - bare `except` is a very bad idea. Ideally, you should figure out what errors `wikipedia.page` can raise and handle them explicitly. I've also used a list comprehension to build the list of linked countries in one step.

You currently create a dictionary of lists, but not JSON; if you actually want to, look into `json`.
Code Snippets
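One way to sidestep the recursion limit mentioned above is to replace the recursive calls with an explicit queue. The following is a minimal sketch under stated assumptions, not the answer's method: `crawl_countries`, `get_links` and the `FAKE_PAGES` data are hypothetical names introduced here so that it runs offline, and the standard-library `json.dumps` handles the final conversion to JSON:

```python
import json
from collections import deque


def crawl_countries(start, valid_countries, get_links):
    """Breadth-first variant of the recursive parser.

    get_links is a hypothetical callable returning the link titles found
    on a country's page (for example, a wrapper around the page's links).
    """
    out = {}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name in out:  # already parsed via an earlier link
            continue
        out[name] = [link for link in get_links(name) if link in valid_countries]
        # Schedule only the countries that have not been parsed yet.
        queue.extend(c for c in out[name] if c not in out)
    return out


# Canned link data standing in for real Wikipedia pages (illustrative only).
FAKE_PAGES = {
    "India": ["Nepal", "Himalayas", "Pakistan"],
    "Nepal": ["India", "China"],
    "Pakistan": ["India"],
    "China": ["Japan"],
    "Japan": ["India", "China"],
}
valid = set(FAKE_PAGES)
result = crawl_countries("India", valid, lambda name: FAKE_PAGES.get(name, []))
as_json = json.dumps(result, sort_keys=True)
```

Passing the link lookup in as a function also keeps the traversal testable without network access; the real lookup can be swapped in later.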
```
def internalParse(internal_list,final_dict,countryDict,refr_list):
```

```
def internalParse(internal_list, final_dict, countryDict, refr_list):
```

```
def get_countries():
    """Get a set of all valid country names."""
    return set(country.name for country in pycountry.countries)
```

```
if __name__ == '__main__':
    # 1. Get set of valid countries first
    valid_countries = get_countries()
    # 2. Add input validation
    country_name = input_valid_country(valid_countries)
    # 3. Call parser
    print(wiki_parser(country_name, valid_countries))
```

```
def wiki_parser(country_name, valid_countries, out=None):
    """Recursively parse Wikipedia for linked countries."""
    if out is None:
        out = {}
    out[country_name] = []
    for country in _parse_page(country_name, valid_countries):
        out[country_name].append(country)
        if country not in out:
            # Recurse into the newly found country, sharing the same dict
            wiki_parser(country, valid_countries, out)
    return out
```

Context
StackExchange Code Review Q#90517, answer score: 6