HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Parsing Geographic Data From XML Files in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
xmlfilesparsingpythongeographicfromdata

Problem

I have recently been working on a program which takes as input, log files generated from a running/biking app. These files are in an xml format which is incredibly regular. In my program, I first strip all of the geographic and time information, and output as a text file. Then, reading back this text file, instances of Entry class are compiled into a list. Using these entries, calculations (and eventually graphing and long term trends) can be achieved.

This python script requires the library geopy to use, and has been tested on Python3.

My main concern is how little of the code is pythonic in nature, with multiple if statements begging the question of optimization. I have uploaded a copy of an example input file for the script on my github, which can be found here. Typing '4-19-17a.bin' without the quotations at the prompt will begin parsing, and will print to standard output the string representation of each Entry instance that was created, along with a couple calculations.

dataparse.py

```
import xml.etree.ElementTree as ET

def processtimestring(time):
year = int(time[:4])
month = int(time[5:7])
day = int(time[8:10])
h = int(time[11:13])
m = int(time[14:16])
s = float(time[17:-6])
return [year, month, day, h, m, s]

def parsethedata():
time = []
lat = []
lng = []
alt = []
dist = []
garbage = '{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}'
filestring = input('The file to parse: ')
outputstring = filestring[:-4] + '.txt'

f = open(outputstring, 'w')

tree = ET.parse(filestring)
root = tree.getroot()

for child in root[0][0][1][4]:
for info in child:
if (not info.text):
for position in info:
f.write(position.tag.replace(garbage, '') + '\n')
f.write(position.text + '\n')
else:
f.write(info.tag.replace(garbage, '') + '\n')
f.write(info.text + '\

Solution

I have recast your code, and the complete listing is below. I will discuss various elements of the changes here. If I missed explaining a change, or some concept, please leave a comment and I will see what I can do to better cover that.

Organization

First thing I did was move the parsing code to be a method of your Entry class. This parsing code seems to be very specific to this class, so let's make the relationship explicit. I also broke the parsing into parsing a single Entry Node, and the part which manages parsing the entire XML file.

Parsing

File Parse Code:

@classmethod
def from_xml(cls, xml_source):
    tree = etree.parse(xml_source)
    root = tree.getroot()
    return [cls.from_xml_node(child) for child in root[0][0][1][4]]


Some notes:

  • Don't prompt for user input in the parsing code.



  • Convert the data directly to the desired representation (a list of Entries).



Node Parse Code:

converters = dict(
    AltitudeMeters=(float, 'alt'),
    DistanceMeters=(float, 'dist'),
    LatitudeDegrees=(float, 'lat'),
    LongitudeDegrees=(float, 'lng'),
    Time=(dateutil.parser.parse, 'time'),
)

@classmethod
def from_xml_node(cls, node):
    data_point = {}
    for info in node.getiterator():
        tag = info.tag.split('}')[-1]
        if tag in cls.converters:
            converter, tag_text = cls.converters[tag]
            data_point[tag_text] = converter(info.text.strip())
    return cls(**data_point)


A few highlights:

-
Flattened nodes using getiterator

getiterator finds all of the nodes under a node, so allows the removal of the special if for the nested position data.

-
Removed need for stacked if/else by using converters dict

By using the node tag as a key into a dict, you can store the pieces needed to parse that tag's value. In this case I stored a conversion function, and the desired name for the value.

-
Removed garbage string by more simply splitting on } and using text after the }

-
Convert the data directly to the desired representation.

The writing of the data into a secondary file during the XML tree traversal, added quite a bit of complexity, for no apparent gain. Instead this code stores each element of the current node into a dict. The dict is then used to init an Entry object.

-
Dict keys are mapped to same names used in Entry.__init__

The parameter names for __init__ match the keys used in the data_point dict. This mapping was done via the second element in the tuple in the converters dict. This makes it easy to map things cleanly. This allows return cls(data_point) to directly create a class instance. Note the fact that you can specify non-keyword parameters with keywords, which is how the data_point can expand into the 5 parameters to init the Entry class

Class Structure

static methods probably should not be passed a class instance

If you are passing a class instance to a static method of that class, you are probably doing it wrong. Static methods are intended for code which is closely aligned with a class but which it not directly associated with an instance.

So I converted:

@staticmethod
def distancecalc(x1, x2):


To:

def distance_from(self, other):


properties allow calculated fields to be easily treated

In distance_from I converted:

coord_1 = (x1.pos[0], x1.pos[1])
coord_2 = (x2.pos[0], x2.pos[1])
return geopy.distance.vincenty(coord_1, coord_2).m


To:

return geopy.distance.vincenty(self.pos, other.pos).m


By providing:

@property
def pos(self):
    return self.lat, self.lng


Since lat, lng as a pair is a position, making that relationship explicit by using a property, better documents the relationship.

Python features

Using python libraries

By storing the timestamp in the class as a datetime value, I was able to convert:

@staticmethod
def timecalc(x1, x2):
    a, b = x1.time, x2.time
    t1 = dt.datetime(a[0], a[1], a[2], a[3], a[4], int(a[5]))
    t2 = dt.datetime(b[0], b[1], b[2], b[3], b[4], int(b[5]))
    t1_decimal = float(a[5]-int(a[5]))
    t2_decimal = float(b[5]-int(b[5]))
    return (t2 - t1).total_seconds() + (t2_decimal - t1_decimal)


To:

def time_since(self, other):
    return (self.time - other.time).total_seconds()


This removed all of the date conversion and other math.

Use string formatting

By using python string formatting functionality I was able to reduce the __str__ method from 11 lines to 2

def __str__(self):
    return 'time: %s, lat: %.6f, lng: %.6f, alt: %.2f, dst:%.3f' % (
        self.time, self.lat, self.lng, self.alt, self.dist)


zip allows processing multiple lists at once

Since I have completed removed the lists of 'lat', 'lng' etc, this is less relevant, but I thought I would show zip this way:

for prev, entry in zip(entries[:-1], entries[1:]):


this line recasts your print loop from:

```
prev = entries[0]
for entry in entries:
....
prev = en

Code Snippets

@classmethod
def from_xml(cls, xml_source):
    tree = etree.parse(xml_source)
    root = tree.getroot()
    return [cls.from_xml_node(child) for child in root[0][0][1][4]]
converters = dict(
    AltitudeMeters=(float, 'alt'),
    DistanceMeters=(float, 'dist'),
    LatitudeDegrees=(float, 'lat'),
    LongitudeDegrees=(float, 'lng'),
    Time=(dateutil.parser.parse, 'time'),
)

@classmethod
def from_xml_node(cls, node):
    data_point = {}
    for info in node.getiterator():
        tag = info.tag.split('}')[-1]
        if tag in cls.converters:
            converter, tag_text = cls.converters[tag]
            data_point[tag_text] = converter(info.text.strip())
    return cls(**data_point)
@staticmethod
def distancecalc(x1, x2):
def distance_from(self, other):
coord_1 = (x1.pos[0], x1.pos[1])
coord_2 = (x2.pos[0], x2.pos[1])
return geopy.distance.vincenty(coord_1, coord_2).m

Context

StackExchange Code Review Q#161529, answer score: 2

Revisions (0)

No revisions yet.