What is the proper relationship between rollback/rollforward and MTTR metrics?

Submitted by: @import:stackexchange-devops

Problem

I'm trying to understand the best way to capture data to start measuring Mean Time To Repair (MTTR) metrics, and I need to wrap my head around how "rollback" impacts MTTR positively or negatively.

Scenario 1

Assuming that solid monitoring is in place, code is deployed that causes an incident that is detected rather quickly (low MTTI). At the point of identification, there are two main possible paths forward (yes, I'm oversimplifying for purposes of discussion):

- Roll back the deployment, returning stability quickly, but without the intended features in production.

- Roll forward with additional changes that resolve the incident and keep the intended features live.

In this scenario, MTTR is pretty darn low, given that site stability can come back pretty quickly. That said, the intended outcome of the change isn't live, and thus the code/feature/change is still stuck in process. If a goal is low MTTR, it seems to incentivize rollback as a recovery mechanism.

Scenario 2

In this scenario, MTTR is strictly measured by how long it takes the expected code/feature/change to be working properly in production. Even if I roll back, the MTTR timer keeps running until my "fixed" code change goes into prod. In this case, MTTR seems tied to business outcome stability instead of just pure "hey, things are stable."
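To make the difference between the two scenarios concrete, here is a minimal sketch that computes MTTR both ways from the same incident records. The `Incident` fields and the sample timestamps are hypothetical, made up purely for illustration; a real incident tracker will likely name and capture these points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    # Hypothetical record layout, not taken from any particular tool.
    detected_at: datetime      # monitoring flagged the problem
    stable_at: datetime        # service healthy again (via rollback or fix)
    feature_live_at: datetime  # intended change working properly in production


def mean_duration(durations: Sequence[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty sequence of durations."""
    return sum(durations, timedelta()) / len(durations)


def mttr_stability(incidents: Sequence[Incident]) -> timedelta:
    """Scenario 1: the clock stops when the service is stable again."""
    return mean_duration([i.stable_at - i.detected_at for i in incidents])


def mttr_outcome(incidents: Sequence[Incident]) -> timedelta:
    """Scenario 2: the clock stops when the intended change works in prod."""
    return mean_duration([i.feature_live_at - i.detected_at for i in incidents])


incidents = [
    # Rolled back after 10 minutes; the fixed change shipped a day later.
    Incident(
        detected_at=datetime(2024, 1, 1, 10, 0),
        stable_at=datetime(2024, 1, 1, 10, 10),
        feature_live_at=datetime(2024, 1, 2, 10, 0),
    ),
    # Rolled forward; stability and the feature both arrived after an hour.
    Incident(
        detected_at=datetime(2024, 1, 3, 14, 0),
        stable_at=datetime(2024, 1, 3, 15, 0),
        feature_live_at=datetime(2024, 1, 3, 15, 0),
    ),
]

print("Scenario 1 MTTR:", mttr_stability(incidents))
print("Scenario 2 MTTR:", mttr_outcome(incidents))
```

With this sample data, the rollback incident dominates the scenario 2 number while barely moving the scenario 1 number, which is exactly the tension described above.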

Now, the answer may be as simple as MTTR not being used as a metric in a vacuum, but rather in conjunction with Change Failure Rate - a super-low MTTR caused by frequent rollbacks could point to a sky-high Change Failure Rate. That said, there's something that doesn't seem right to me in the idea of divorcing the MTTR measurement from business outcome.

I may be way overthinking this, but I'm curious how others are measuring MTTR and what the end point-in-time is for "recovery." Are you using it simply as stability, or do other factors come into determining what "recovered" means?

Solution

Yes, MTTR is, and always should be, tied to the business outcome: if things are not stable, the business itself is at risk.

The fact that the expected code/feature/change is still stuck in process in scenario 1 is irrelevant: the feature is not stable, so it doesn't bring in new business. Rolling back is the best you can do at that time from the business perspective.

The rollforward is a gamble: it keeps the business at risk while waiting for a potential fix that statistically has a lower chance of success. Because of the instability, the fix will always be more rushed than the change that caused the instability in the first place, and that change didn't even have such pressure on it. The rollforward is also yet another version of the code that hasn't been checked before.

If you want to keep MTTR low, roll back immediately, without debate. This removes the business risk and gives you a chance to check that the fix actually works before attempting to deploy it. I'd strongly suggest making it a policy because, yes, almost always someone will ask for a fix instead of the rollback and call a meeting to negotiate or decide on it, all while the business remains at risk.

Side note: if you're concerned about a high Change Failure Rate, I'd suggest checking the actual rollback rate instead of inferring it from a low MTTR. Maybe you'd like to add a gate check before deployment for the most frequent failures.
If you already have such a check automated, why not include it in the CI verification? If you don't have one, maybe it's time to start thinking about it? :)
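As a rough sketch of measuring the rollback rate directly rather than inferring it from MTTR, something like the following could work. The `Deployment` record and its fields are assumptions for illustration, not the schema of any specific deployment tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Deployment:
    # Hypothetical fields; adapt to whatever your deploy log records.
    deploy_id: str
    caused_incident: bool  # the change led to a production incident
    rolled_back: bool      # the change was reverted to recover


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that caused an incident."""
    if not deployments:
        return 0.0
    return sum(d.caused_incident for d in deployments) / len(deployments)


def rollback_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that were rolled back."""
    if not deployments:
        return 0.0
    return sum(d.rolled_back for d in deployments) / len(deployments)


deployments = [
    Deployment("d1", caused_incident=False, rolled_back=False),
    Deployment("d2", caused_incident=True, rolled_back=True),
    Deployment("d3", caused_incident=True, rolled_back=False),  # rolled forward
    Deployment("d4", caused_incident=False, rolled_back=False),
]

print("Change failure rate:", change_failure_rate(deployments))
print("Rollback rate:", rollback_rate(deployments))
```

Tracking both numbers side by side shows how often failures happen and how often rollback is the chosen recovery path, independently of how fast recovery was.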

Context

StackExchange DevOps Q#1230, answer score: 2
