Parsing Wikipedia data in Python

Submitted by: @import:stackexchange-codereview··
dataparsingpythonwikipedia

Problem

I'm new to Python and would like some advice or guidance moving forward. I'm trying to parse Wikipedia data into something uniform that I can put into a database. I've looked at wiki parsers, but from what I can see they are large and complex and don't gain me much, as I don't need 99% of their functionality (I'm not editing or anything of the sort). What I am doing is reading some information from template variables and trying to clean it up into usable information. I've already created a simple function that reads a wiki template and returns a dictionary of key/value pairs. These key/value pairs are what I'm reading and trying to parse into useful data.

For example, I'm trying to parse the Infobox settlement template to create the following information:

{'CITY':             u'Portland',
 'COUNTRY':          u'United States of America',
 'ESTABLISHED_DATE': u'',
 'LATITUDE':         40.43388888888889,
 'LONGITUDE':        -84.98,
 'REGION':           u'Indiana',
 'WIKI':             u'Portland, Indiana'}


The only item not taken directly from the template is the WIKI entry, which is the title of the wiki page the template is from. The raw template that the above is produced from is:

```
{{Infobox settlement
|official_name = Portland, Indiana
|native_name =
|settlement_type = [[City]]
|nickname =
|motto =
|image_skyline = BlueBridge.jpg
|imagesize = 250px
|image_caption = Meridian (arch) Bridge in the fog
|image_flag =
|image_seal =
|image_map = Jay_County_Indiana_Incorporated_and_Unincorporated_areas_Portland_Highlighted.svg
|mapsize = 250px
|map_caption = Location in the state of [[Indiana]]
|image_map1 =
|mapsize1 =
|map_caption1 =
|coordinates_display = inline,title
|coordinates_region = US-IN
|subdivision_type = [[List of countries|Country]]
|
```

Solution

Wikipedia pages have a handy comment line that tells you where infoboxes start and end. Use that to find the information in the infobox.

You end up with a string; let's call it infobox.
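Not every page carries such a marker comment, so as an alternative (not the method described above) here is a minimal sketch that finds the template by its opening `{{Infobox settlement` and counts braces until the matching `}}`. The function name `extract_template` and the sample page text are hypothetical.

```python
def extract_template(wikitext, name="Infobox settlement"):
    """Return the body of the first {{name ...}} template, or None if absent."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:
                # Drop the outer "{{" and "}}" and return what is between them.
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # the template was never closed


# Hypothetical page text, only for illustration.
page = "Intro {{Infobox settlement\n|name = X {{convert|1|km}}\n|b = 2\n}} tail"
infobox = extract_template(page)
```

Counting braces keeps nested templates such as `{{convert|...}}` inside the returned string instead of ending the infobox early.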

# Split the infobox into "key = value" fields; [1:] drops the template name.
# maxsplit=1 keeps any "=" inside a value intact.
# Caveat: splitting on "|" also cuts piped links such as [[x|y]] in two.
info = [field.split("=", 1) for field in infobox.split("|")][1:]

# And here's your dict:
wikidict = {}
for pair in info:
    try:
        # stripping here is best
        wikidict[pair[0].strip()] = pair[1].strip()
    except IndexError:
        pass  # a field with no "=" has no pair[1], so an IndexError is raised
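One weakness of splitting on every `|` is that a value such as `[[List of countries|Country]]` from the template above gets cut in half. A possible refinement, sketched here under the assumption that links and nested templates are balanced, is to split only on pipes that sit outside `[[...]]` and `{{...}}`. The helper names `split_fields` and `parse_infobox` are hypothetical.

```python
def split_fields(infobox):
    """Split on '|' only when it is not inside [[...]] or {{...}}."""
    fields, current, depth = [], [], 0
    i = 0
    while i < len(infobox):
        pair = infobox[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth = max(depth - 1, 0)  # tolerate a stray closing pair
            current.append(pair)
            i += 2
        elif infobox[i] == "|" and depth == 0:
            fields.append("".join(current))  # top-level pipe ends a field
            current = []
            i += 1
        else:
            current.append(infobox[i])
            i += 1
    fields.append("".join(current))
    return fields


def parse_infobox(infobox):
    """Build a dict of stripped key/value pairs from an infobox body."""
    result = {}
    for field in split_fields(infobox)[1:]:  # first field is the template name
        key, sep, value = field.partition("=")
        if sep:  # skip fields that have no "="
            result[key.strip()] = value.strip()
    return result


sample = (
    "Infobox settlement\n"
    "|official_name = Portland, Indiana\n"
    "|native_name =\n"
    "|subdivision_type = [[List of countries|Country]]\n"
)
wikidict = parse_infobox(sample)
```

Using `partition("=")` instead of a try/except also keeps any later `=` characters inside the value.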


That said, the template values are just that: template values. If you want the latitude, you don't need to code anything complex, because the dictionary keys are already there.

latkeys = "latd latm lats latNS".split()
lat_info = [wikidict[key] for key in latkeys]


You could easily do a quick transformation over the lat_info to get things into the format you want.
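As a rough sketch of such a transformation, the degree, minute, second, and hemisphere strings can be folded into one signed decimal value. The function name `dms_to_decimal` is hypothetical, and the sample values are assumptions chosen to match the LATITUDE shown in the question, since the template there is cut off before these fields.

```python
def dms_to_decimal(degrees, minutes="", seconds="", direction="N"):
    """Convert degree/minute/second strings to signed decimal degrees."""
    def number(text):
        # Empty template values count as zero.
        text = text.strip()
        return float(text) if text else 0.0

    value = number(degrees) + number(minutes) / 60 + number(seconds) / 3600
    # South and west are negative by convention.
    if direction.strip().upper() in ("S", "W"):
        value = -value
    return value


# Hypothetical values, as they might appear in wikidict after parsing.
wikidict = {"latd": "40", "latm": "26", "lats": "2", "latNS": "N"}
latkeys = "latd latm lats latNS".split()
lat_info = [wikidict.get(key, "") for key in latkeys]
latitude = dms_to_decimal(*lat_info)  # roughly 40.4339
```

Using `wikidict.get(key, "")` rather than `wikidict[key]` means a missing minute or second field falls back to zero instead of raising a KeyError.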

You should also probably write a separate function that strips the [[x|y]] link markup from certain elements and returns a tuple if you're interested in manipulating those. As it stands, your code is nearly impossible to read. You don't need to push strings through complex logic gates like the ones you have; keep the logic to a bare minimum. You know, the Keep It Simple, Stupid rule.
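A minimal sketch of that separate function, assuming simple, non-nested links of the form `[[target]]` or `[[target|label]]`. The names `WIKI_LINK`, `split_link`, and `strip_links` are hypothetical.

```python
import re

# Matches [[target]] or [[target|label]]; nested links are not handled.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def split_link(text):
    """Return (target, label) for the first wiki link in text, or None."""
    match = WIKI_LINK.search(text)
    if match is None:
        return None
    target = match.group(1).strip()
    label = match.group(2)
    # A link without a label displays its target.
    label = label.strip() if label else target
    return target, label


def strip_links(text):
    """Replace every wiki link in text with its display label."""
    return WIKI_LINK.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)


country = split_link("[[List of countries|Country]]")
caption = strip_links("Location in the state of [[Indiana]]")
```

Here `country` holds both halves of the link as a tuple, while `caption` is the plain text a reader would see on the page.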

Context

StackExchange Code Review Q#16335, answer score: 2
