patternpythonMinor
Optimize calculation for the amount of time between when a value changes in Pandas
Viewed 0 times
calculationthepandasamounttimevaluebetweenchangesforoptimize
Problem
I have some sensor data that contains timestamps from when a machine is turned on and an indicator variable showing whether or not the machine is actively running. The data is mostly recorded every two minutes, but they there could be more or less time between two sensor readings due to issues with the sensor.
I want to create a new feature that captures at a specific datetime, how long has the machine been in that Run status for.
I currently have the data loaded in a
This approach seems to work fine on this small example, but I feel like it's not the most efficient use of resources, especially when trying to scale up to my 400,000 record dataset.
```
# load example data
df = pd.DataFrame(data = [['2015-01-01 00:00', 1],
['2015-01-01 00:02', 1],
['2015-01-01 00:04', 1],
['2015-01-01 00:06', 0],
['2015-01-01 00:08', 0],
['2015-01-01 00:10', 1],
['2015-01-01 00:12', 0],
['2015-01-01 00:15', 1],
['2015-01-01 00:17', 1],
['2015-01-01 00:19', 1],
['2015-01-01 00:23', 0],
['2015-01-01 00:25', 0],
['2015-01-01 00:30', 0],
['2015-01-01 00:32', 0],
I want to create a new feature that captures at a specific datetime, how long has the machine been in that Run status for.
I currently have the data loaded in a
DataFrame, and I have been able to create a new variable that shows whether or not the Run status has changed from the previous reading. Next, I need to figure out how capture the amount of time since the last time the Run status changed. If the readings where consistently every two minutes, I could maybe do some type of counter that resets when Run changes. Instead, I took an approach where if the time changed, I hold onto that value and keep subtracting the next Datetime stamp from it, as long as Run didn't change. When Run changes, I start the processes over.This approach seems to work fine on this small example, but I feel like it's not the most efficient use of resources, especially when trying to scale up to my 400,000 record dataset.
```
# load example data
df = pd.DataFrame(data = [['2015-01-01 00:00', 1],
['2015-01-01 00:02', 1],
['2015-01-01 00:04', 1],
['2015-01-01 00:06', 0],
['2015-01-01 00:08', 0],
['2015-01-01 00:10', 1],
['2015-01-01 00:12', 0],
['2015-01-01 00:15', 1],
['2015-01-01 00:17', 1],
['2015-01-01 00:19', 1],
['2015-01-01 00:23', 0],
['2015-01-01 00:25', 0],
['2015-01-01 00:30', 0],
['2015-01-01 00:32', 0],
Solution
Cool problem, thanks for sharing. Here is my take on the implementation. I will discuss each section in turn.
Since we need to keep state from previous rows I could not formulate a vector way to do this, and had to resort to a loop. But in the loop we do not use any indices, instead we iterate directly on the data structure.
This is just for illustration. Likely more efficient to use concat, than three separate dataframe constructions.
Whole Code Listing:
- Use diff to directly get the state changes
diff() returns a vector with the difference between subsequent elements. Thus when the state does not change it is equal to zero.# find when the state changes
run_change = df['Run'].diff()- Use diff to directly get time delta between steps
# get the step lengths
step_length = df['Datetime'].diff()- Loop and build delta since previous change
Since we need to keep state from previous rows I could not formulate a vector way to do this, and had to resort to a loop. But in the loop we do not use any indices, instead we iterate directly on the data structure.
zip() is used to take matching index elements of the two series at the same time.# loop and get the change since last state change
since_change = []
current_delta = 0
for is_change, delta in zip(run_change, step_length):
current_delta = 0 if is_change != 0 else \
current_delta + delta.total_seconds() / 60.0
since_change.append(current_delta)- Insert the intermediate data into the data frame
This is just for illustration. Likely more efficient to use concat, than three separate dataframe constructions.
# add this data to the data frame
df['Run_Change'] = run_change
df['Step_Length'] = step_length
df['Time_Since_Change'] = pd.Series(since_change).valuesWhole Code Listing:
import pandas as pd
# load example data
df = pd.DataFrame(data=[['2015-01-01 00:00', 1],
['2015-01-01 00:02', 1],
['2015-01-01 00:04', 1],
['2015-01-01 00:06', 0],
['2015-01-01 00:08', 0],
['2015-01-01 00:10', 1],
['2015-01-01 00:12', 0],
['2015-01-01 00:15', 1],
['2015-01-01 00:17', 1],
['2015-01-01 00:19', 1],
['2015-01-01 00:23', 0],
['2015-01-01 00:25', 0],
['2015-01-01 00:30', 0],
['2015-01-01 00:32', 0],
['2015-01-01 00:34', 0]],
columns=['Datetime', 'Run'])
# convert to datetime object
df.Datetime = pd.to_datetime(df['Datetime'])
# find when the state changes
run_change = df['Run'].diff()
# get the step lengths
step_length = df['Datetime'].diff()
# loop and get the change since last state change
since_change = []
current_delta = 0
for is_change, delta in zip(run_change, step_length):
current_delta = 0 if is_change != 0 else \
current_delta + delta.total_seconds() / 60.0
since_change.append(current_delta)
# add this data to the data frame
df['Run_Change'] = run_change
df['Step_Length'] = step_length
df['Time_Since_Change'] = pd.Series(since_change).valuesCode Snippets
# find when the state changes
run_change = df['Run'].diff()# get the step lengths
step_length = df['Datetime'].diff()# loop and get the change since last state change
since_change = []
current_delta = 0
for is_change, delta in zip(run_change, step_length):
current_delta = 0 if is_change != 0 else \
current_delta + delta.total_seconds() / 60.0
since_change.append(current_delta)# add this data to the data frame
df['Run_Change'] = run_change
df['Step_Length'] = step_length
df['Time_Since_Change'] = pd.Series(since_change).valuesimport pandas as pd
# load example data
df = pd.DataFrame(data=[['2015-01-01 00:00', 1],
['2015-01-01 00:02', 1],
['2015-01-01 00:04', 1],
['2015-01-01 00:06', 0],
['2015-01-01 00:08', 0],
['2015-01-01 00:10', 1],
['2015-01-01 00:12', 0],
['2015-01-01 00:15', 1],
['2015-01-01 00:17', 1],
['2015-01-01 00:19', 1],
['2015-01-01 00:23', 0],
['2015-01-01 00:25', 0],
['2015-01-01 00:30', 0],
['2015-01-01 00:32', 0],
['2015-01-01 00:34', 0]],
columns=['Datetime', 'Run'])
# convert to datetime object
df.Datetime = pd.to_datetime(df['Datetime'])
# find when the state changes
run_change = df['Run'].diff()
# get the step lengths
step_length = df['Datetime'].diff()
# loop and get the change since last state change
since_change = []
current_delta = 0
for is_change, delta in zip(run_change, step_length):
current_delta = 0 if is_change != 0 else \
current_delta + delta.total_seconds() / 60.0
since_change.append(current_delta)
# add this data to the data frame
df['Run_Change'] = run_change
df['Step_Length'] = step_length
df['Time_Since_Change'] = pd.Series(since_change).valuesContext
StackExchange Code Review Q#155111, answer score: 4
Revisions (0)
No revisions yet.