Optimization for calculating statistics
Problem
I have collected a large amount of data about the train network in my country, and I am now writing code that calculates the average amount of delay on the entire train network for each day in a given period of time. Important to note: every train has a unique train number, and each train number occurs only once each day.
The average amount of delay is calculated in two ways:
- Worst case: for every train, the maximum amount of delay that train suffers on the given day is used for the computation.
- Average case: for every train, the average amount of delay is computed over all the stops of the train and used for the computation.
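To make the two definitions concrete, here is a minimal sketch for a single train, using made-up delay values:

```python
import numpy as np

# Made-up arrival/departure delays (in minutes) at the four stops of one train.
arrivals   = [2, 5, 0, 7]
departures = [3, 4, 1, 6]

# Worst case: the largest delay the train suffered at any stop.
worst = max(max(arrivals), max(departures))   # -> 7

# Average case: mean of the average arrival delay and the average
# departure delay over all the train's stops.
avg = np.mean([np.mean(arrivals), np.mean(departures)])   # -> 3.5
```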
To further understand my code, I have to explain a portion of my database design. Each train has train_id, origin, destination. Each route has route_id, train_id, date. As mentioned before, each train_id occurs only once each day, so the combination of train_id and date is unique. Each stop has stop_id, arrival_delay, departure_delay, route_id. Each stop is associated with only one route.
```
start = date(2014, 10, 27)
stop = date(2014, 12, 14)
diff = stop - start
period = pd.date_range(start, stop)
zeros = np.zeros((diff.days + 1, 2))
# DataFrame with worst-case and avg delay on entire network, per day.
# Delay is computed in a worst-case scenario, i.e. max delay of a train,
# and in an avg-case scenario, i.e. avg of avg arrival and avg departure delay.
delays = pd.DataFrame(zeros, index=period, columns=['Worst case', 'Avg case'])
while start <= stop:
    worst_train_delays = []
    avg_train_delays = []
    # The database-access code was lost from this extract; for each route
    # (one train run) on day `start`, its stops are fetched and iterated.
    for route in routes_on(start):            # hypothetical query helper
        arrivals = []
        departures = []
        max_delay = 0
        for stopRow in stops_of(route):       # hypothetical query helper
            arrival_delay = stopRow.arrival_delay
            departure_delay = stopRow.departure_delay
            arrivals.append(arrival_delay)
            departures.append(departure_delay)
            if max(arrival_delay, departure_delay) > max_delay:
                max_delay = max(arrival_delay, departure_delay)
        worst_train_delays.append(max_delay)
        avg_arrival = np.mean(arrivals)
        avg_departure = np.mean(departures)
        avg_train_delays.append(np.mean([avg_arrival, avg_departure]))
    key = start.isoformat()
    delays['Worst case'][key] = np.mean(worst_train_delays)
    delays['Avg case'][key] = np.mean(avg_train_delays)
    delta = timedelta(days=1)
    start = start + delta
```
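The table design described in the question could be sketched in SQLite like this; the table and column names come from the question, while the concrete types and constraints are assumptions:

```python
import sqlite3

# Sketch of the schema: trains, their daily routes, and stops per route.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE train (
        train_id    INTEGER PRIMARY KEY,
        origin      TEXT,
        destination TEXT);
    CREATE TABLE route (
        route_id INTEGER PRIMARY KEY,
        train_id INTEGER REFERENCES train(train_id),
        date     TEXT,
        UNIQUE (train_id, date));  -- each train_id occurs only once per day
    CREATE TABLE stop (
        stop_id         INTEGER PRIMARY KEY,
        arrival_delay   REAL,
        departure_delay REAL,
        route_id        INTEGER REFERENCES route(route_id));
""")

# One train, one run of it on a given day, and one of its stops.
conn.execute("INSERT INTO train VALUES (1, 'A', 'B')")
conn.execute("INSERT INTO route VALUES (10, 1, '2014-10-27')")
conn.execute("INSERT INTO stop  VALUES (100, 2.0, 3.0, 10)")
```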
Solution
One problem I see with your code is that you first collect all the data points you want to average, and only then take the average. This needs O(n) extra storage, as you have to append every data point to a list before computing the mean.
An easy way to fix this is to keep track of a running sum and the number of data points, and then divide to get the average. This uses O(1) extra memory instead of O(n): for each data point you add its value to the accumulated sum, increment the count, and continue on to the next one.

For example, instead of:
```
for stopRow in stops:
    # some code
    arrivals.append(arrival_delay)
    # more code
avg_arrival = np.mean(arrivals)
```
you should use:
```
total_arrival_delay = 0
number_arrival_delays = 0
for stopRow in stops:
    # some code
    total_arrival_delay += arrival_delay
    number_arrival_delays += 1
    # more code
avg_arrival = total_arrival_delay / number_arrival_delays
```
While this approach makes your code a bit longer, it will allow it to work on larger datasets, as you no longer need to keep lots and lots of data points in memory while processing; you only need to store a few running totals.
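As an aside beyond the original answer: if the stop records can be pulled into one flat pandas DataFrame, both per-day statistics can be computed without any explicit Python loops. The column names mirror the question's schema, but the frame layout and values below are assumed for illustration:

```python
import pandas as pd

# Hypothetical flat extract of the stop table joined to its route:
# one row per stop, with the route's date and train_id attached.
stops = pd.DataFrame({
    'date':            ['2014-10-27'] * 4 + ['2014-10-28'] * 2,
    'train_id':        [1, 1, 2, 2, 1, 1],
    'arrival_delay':   [2, 5, 0, 7, 1, 3],
    'departure_delay': [3, 4, 1, 6, 2, 2],
})

d = stops[['arrival_delay', 'departure_delay']]
per_stop = stops.assign(stop_max=d.max(axis=1),   # worst delay at this stop
                        stop_avg=d.mean(axis=1))  # mean of arrival/departure

# Per train (train_id is unique within a day): worst delay over its stops,
# and the mean per-stop delay, which equals the question's avg-of-avgs
# since every stop has exactly one arrival and one departure delay.
per_train = per_stop.groupby(['date', 'train_id']).agg(
    worst=('stop_max', 'max'), avg=('stop_avg', 'mean'))

# Per day: average over all trains running that day.
delays = (per_train.groupby('date').mean()
          .rename(columns={'worst': 'Worst case', 'avg': 'Avg case'}))
```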
Context
StackExchange Code Review Q#85179, answer score: 2