HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

PANDAS code for calculating distance between waypoints

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pandasdistancecalculatingbetweenwaypointsforcode

Problem

I've written some python code designed to take a csv of waypoints for a series of trips, and calculate the distance of each trip by the sum of the distance between the waypoints.

An example csv might be:

9e77d54918dd25c3f9d2e5354ec86666,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,
9e77d54918dd25c3f9d2e5354ec86666,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,
9e77d54918dd25c3f9d2e5354ec86666,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,
9e77d54918dd25c3f9d2e5354ec86666,3,2015-10-01T14:23:15.000Z,45.0883,7.6479,,
9e77d54918dd25c3f9d2e5354ec86666,4,2015-10-01T14:26:15.000Z,45.0774,7.6444,,
ect...


I've got code working, using pandas and numpy, however I'm entirely self-taught and I want to know if there's any serious or obvious mistakes I'm using that might make my code inefficient. It currently takes quite a while to run, I'm guessing because of my for loop. The code I'm using is:

```
import pandas as pd
import numpy as np
from math import radians, cos, sqrt

def dist(lat1, lon1, lat2, lon2): #short distances using Equirectangular approximation
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
x = (lon2 - lon1) cos( 0.5(lat2+lat1) )
y = lat2 - lat1
D = 6371 * sqrt(x2 + y2)
return D

waypoint = pd.read_csv('TripRecordsReportWaypoints.csv',sep=',',header=None, usecols=[0,3,4], names=['TripID','Lat','Lon'])
output = pd.DataFrame(columns = ['TripID','Distance','No. of Waypoints'])

tripList = waypoint['TripID'].tolist() #creates list of tripids
tripList = list(set(tripList)) #makes list unique

for ID in tripList:
temp = waypoint.loc[waypoint['TripID'] == ID] #creates a temporary dataframe with all waypoint for each trip

temp['endLat'] = temp['Lat'].shift(periods=-1) #adds two columns with next waypoints lat and lon
temp['endLon'] = temp['Lon'].shift(periods=-1)

temp['Distance']=np.vectorize(dist)(temp['Lat'],temp['Lon'],temp['endLat'],temp['endLon']) #calculates distance, can change function 'dist' for mo

Solution

Your code can be greatly simplified when using pandas.DataFrame.groupby. This function groups a dataframe by some key(s) and then allows performing functions that act on the whole sub-dataframe (henceforth called group) using apply or apply some aggregating function to single columns of that group using aggregate.

For this to work, we need to define a dist function that can take a DataFrame and calculate the total distance of that DataFrame. Note that I will assume that each trip is sorted within the csv, otherwise you will have to add this there as well.

To make the dist function, we need to make sure all functions it uses are vectorized, so I will be using np.cos and np.sqrt instead of the math ones. I also defined a constant, EARTH_RADIUS, because you might want to change the precision on that at some point. In any case, it is currently a magic number and giving it a name helps a lot. Or maybe you move to Mars at some point and need to use a different radius :)

import pandas as pd
import numpy as np

EARTH_RADIUS = 6371  # km

def total_dist(group):
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    return D.sum()


Note that the conversion to radians is not as nice anymore, but we gain the ability to process a whole trip at a time!

Now we define a helper function to output not only the dist, but also the number of waypoints (by using len):

def trip_statistics(trip):
    return pd.Series({"Distance": total_dist(trip),
                      "No. of Waypoints": len(trip)})


Now the only thing left to do is apply this function to all groups and reset the index to get TripID back as a column and not just as an index:

waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
                       header=None, usecols=[0, 3, 4],
                       names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)


Note that I added index=False to avoid writing the row index to the output file.

Code Snippets

import pandas as pd
import numpy as np


EARTH_RADIUS = 6371  # km


def total_dist(group):
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    return D.sum()
def trip_statistics(trip):
    return pd.Series({"Distance": total_dist(trip),
                      "No. of Waypoints": len(trip)})
waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
                       header=None, usecols=[0, 3, 4],
                       names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)

Context

StackExchange Code Review Q#154601, answer score: 2

Revisions (0)

No revisions yet.