HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Optimization for calculating statistics

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
optimizationcalculatingstatisticsfor

Problem

I have collected a large amount of data about the train network in my country. Now, I am writing some code that calculates the average amount of delay on the entire train network, for each day, in a given period of time. Important to note: every train has a unique train-number. Each train-number only occurs once each day.

The average amount of delay is calculated in two ways:

-
Worst case: For every train, the maximum amount of delay that train suffers on the given day is used for the computation.

-
Average case: For every train, the average amount of delay is computed over all the stops of the train and used for the computation.

To further understand my code, I have to explain a portion of my database design. Each train has train_id, origin, destination. Each route has route_id, train_id, date. As mentioned before, each train_id occurs only once each day. So the combination of train_id and date is unique. Each stop has stop_id, arrival_delay, departure_delay, route_id. Each stop is associated with only one route.

```
start = date(2014, 10, 27)
stop = date(2014, 12, 14)
diff = stop - start
period = pd.date_range(start, stop)
zeros = np.zeros((diff.days+1, 2))

#DataFrame with worst-case and avg delay on entire network, per day
#Delay is computed in a worst-case scenario, i.e. max delay of a train
#and in avg-case scenario, i.e. avg of avg of arrival and avg of departure delay
delays = pd.DataFrame(zeros, index = period, columns = ['Worst case', 'Avg case'])

while(start max_delay:
max_delay = max(arrival_delay, departure_delay)

worst_train_delays.append(max_delay)
avg_arrival = np.mean(arrivals)
avg_departure = np.mean(departures)
avg_train_delays.append(np.mean([avg_arrival, avg_departure]))

key = start.isoformat()
delays['Worst case'][key] = np.mean(worst_train_delays)
delays['Avg case'][key] = np.mean(avg_train_delays)

delta = timedelta(days=1)
start = start + delta

Solution

One problem I see with your code is that you first collect all the data points you want to average, and then you take the average. This has O(n) complexity and extra storage requirements, as you have to append all the data points into a list.

An easy way to fix this is to keep track of the sum and the number of data points, and then divide to get the average. This uses O(1) extra memory instead of O(n), as for each data point, you increment the number of data points and add the value to the accumulated sum, and then continue on to the next one.

For example, instead of:

for stopRow in stops:
    # some code
    arrivals.append(arrival_delay)
    # more code

avg_arrival = np.mean(arrivals)


you should use:

total_arrival_delay = 0
number_arrival_delays = 0

for stopRow in stops:
    # some code
    total_arrival_delay += arrival_delay
    number_arrival_delays += 1
    # more code

avg_arrival = total_arrival_delay / number_arrival_delays


While this approach makes your code a bit longer, it will allow it to work on larger datasets, as you no longer need to store lots and lots of data points while processing, and instead need only to store a few integers

Code Snippets

for stopRow in stops:
    # some code
    arrivals.append(arrival_delay)
    # more code

avg_arrival = np.mean(arrivals)
total_arrival_delay = 0
number_arrival_delays = 0

for stopRow in stops:
    # some code
    total_arrival_delay += arrival_delay
    number_arrival_delays += 1
    # more code

avg_arrival = total_arrival_delay / number_arrival_delays

Context

StackExchange Code Review Q#85179, answer score: 2

Revisions (0)

No revisions yet.