patternpythonMinor
PANDAS code for calculating distance between waypoints
Viewed 0 times
pandasdistancecalculatingbetweenwaypointsforcode
Problem
I've written some python code designed to take a csv of waypoints for a series of trips, and calculate the distance of each trip by the sum of the distance between the waypoints.
An example csv might be:
I've got code working, using pandas and numpy, however I'm entirely self-taught and I want to know if there's any serious or obvious mistakes I'm using that might make my code inefficient. It currently takes quite a while to run, I'm guessing because of my for loop. The code I'm using is:
```
import pandas as pd
import numpy as np
from math import radians, cos, sqrt
def dist(lat1, lon1, lat2, lon2): #short distances using Equirectangular approximation
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
x = (lon2 - lon1) cos( 0.5(lat2+lat1) )
y = lat2 - lat1
D = 6371 * sqrt(x2 + y2)
return D
waypoint = pd.read_csv('TripRecordsReportWaypoints.csv',sep=',',header=None, usecols=[0,3,4], names=['TripID','Lat','Lon'])
output = pd.DataFrame(columns = ['TripID','Distance','No. of Waypoints'])
tripList = waypoint['TripID'].tolist() #creates list of tripids
tripList = list(set(tripList)) #makes list unique
for ID in tripList:
temp = waypoint.loc[waypoint['TripID'] == ID] #creates a temporary dataframe with all waypoint for each trip
temp['endLat'] = temp['Lat'].shift(periods=-1) #adds two columns with next waypoints lat and lon
temp['endLon'] = temp['Lon'].shift(periods=-1)
temp['Distance']=np.vectorize(dist)(temp['Lat'],temp['Lon'],temp['endLat'],temp['endLon']) #calculates distance, can change function 'dist' for mo
An example csv might be:
9e77d54918dd25c3f9d2e5354ec86666,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,
9e77d54918dd25c3f9d2e5354ec86666,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,
9e77d54918dd25c3f9d2e5354ec86666,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,
9e77d54918dd25c3f9d2e5354ec86666,3,2015-10-01T14:23:15.000Z,45.0883,7.6479,,
9e77d54918dd25c3f9d2e5354ec86666,4,2015-10-01T14:26:15.000Z,45.0774,7.6444,,
ect...I've got code working, using pandas and numpy, however I'm entirely self-taught and I want to know if there's any serious or obvious mistakes I'm using that might make my code inefficient. It currently takes quite a while to run, I'm guessing because of my for loop. The code I'm using is:
```
import pandas as pd
import numpy as np
from math import radians, cos, sqrt
def dist(lat1, lon1, lat2, lon2): #short distances using Equirectangular approximation
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
x = (lon2 - lon1) cos( 0.5(lat2+lat1) )
y = lat2 - lat1
D = 6371 * sqrt(x2 + y2)
return D
waypoint = pd.read_csv('TripRecordsReportWaypoints.csv',sep=',',header=None, usecols=[0,3,4], names=['TripID','Lat','Lon'])
output = pd.DataFrame(columns = ['TripID','Distance','No. of Waypoints'])
tripList = waypoint['TripID'].tolist() #creates list of tripids
tripList = list(set(tripList)) #makes list unique
for ID in tripList:
temp = waypoint.loc[waypoint['TripID'] == ID] #creates a temporary dataframe with all waypoint for each trip
temp['endLat'] = temp['Lat'].shift(periods=-1) #adds two columns with next waypoints lat and lon
temp['endLon'] = temp['Lon'].shift(periods=-1)
temp['Distance']=np.vectorize(dist)(temp['Lat'],temp['Lon'],temp['endLat'],temp['endLon']) #calculates distance, can change function 'dist' for mo
Solution
Your code can be greatly simplified when using
For this to work, we need to define a
To make the
Note that the conversion to radians is not as nice anymore, but we gain the ability to process a whole trip at a time!
Now we define a helper function to output not only the dist, but also the number of waypoints (by using
Now the only thing left to do is
Note that I added
pandas.DataFrame.groupby. This function groups a dataframe by some key(s) and then allows performing functions that act on the whole sub-dataframe (henceforth called group) using apply or apply some aggregating function to single columns of that group using aggregate.For this to work, we need to define a
dist function that can take a DataFrame and calculate the total distance of that DataFrame. Note that I will assume that each trip is sorted within the csv, otherwise you will have to add this there as well.To make the
dist function, we need to make sure all functions it uses are vectorized, so I will be using np.cos and np.sqrt instead of the math ones. I also defined a constant, EARTH_RADIUS, because you might want to change the precision on that at some point. In any case, it is currently a magic number and giving it a name helps a lot. Or maybe you move to Mars at some point and need to use a different radius :)import pandas as pd
import numpy as np
EARTH_RADIUS = 6371 # km
def total_dist(group):
lat = np.radians(group.Lat)
lon = np.radians(group.Lon)
endLon = lon.shift(-1)
endLat = lat.shift(-1)
x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
y = endLat - lat
D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
return D.sum()Note that the conversion to radians is not as nice anymore, but we gain the ability to process a whole trip at a time!
Now we define a helper function to output not only the dist, but also the number of waypoints (by using
len):def trip_statistics(trip):
return pd.Series({"Distance": total_dist(trip),
"No. of Waypoints": len(trip)})Now the only thing left to do is
apply this function to all groups and reset the index to get TripID back as a column and not just as an index:waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
header=None, usecols=[0, 3, 4],
names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)Note that I added
index=False to avoid writing the row index to the output file.Code Snippets
import pandas as pd
import numpy as np
EARTH_RADIUS = 6371 # km
def total_dist(group):
lat = np.radians(group.Lat)
lon = np.radians(group.Lon)
endLon = lon.shift(-1)
endLat = lat.shift(-1)
x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
y = endLat - lat
D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
return D.sum()def trip_statistics(trip):
return pd.Series({"Distance": total_dist(trip),
"No. of Waypoints": len(trip)})waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
header=None, usecols=[0, 3, 4],
names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)Context
StackExchange Code Review Q#154601, answer score: 2
Revisions (0)
No revisions yet.