
pre-DevOps deployment metrics challenge

Submitted by: @import:stackexchange-devops·Mar 10, 2026·



Problem

TL;DR: how do you prove that DevOps, specifically deployment automation, improves change failure rates?

We're all trying to capture metrics on 'deployment failures' using current (mostly manual) means. Unfortunately, a 'failure' rarely happens, right? Because when something goes wrong, the team comes together (typically with heroics) to fix the issue (typically permissions, missed configs, you know the drill). So... when we ask how the deployment went, the answer is "it worked."

But intuitively we all know that's not good. The 2017 State of DevOps Report says there's about a 31-45% "change failure rate." While that intuitively sounds about right, are they tracked as incidents? Nah. Because they get fixed pretty quickly, usually during validation. It's much rarer to actually roll back a deployment.

So, it takes discipline to report failure rates accurately. We're disincentivized to report like that because we want things to work and we do what it takes to make it happen.

So, how do you prove devops, specifically deployment automation, improves change failure rates?

(PS tried to tag this with "#devops-capability-model")

Solution

A technique we've used in the past in similar situations is to get "management commitment" that imposes these rules on every team member:

- Access to perform updates to the target deployment areas (i.e. production) is limited to selected automated systems, which have appropriate audit trails (= logging) of any kind of updates to the areas they manage.

- Manual updates to the target deployment areas, for whatever reason, are no longer allowed for the typical team members (user IDs) that used to be authorized to perform them. Instead, new (additional) user IDs are created that have all the permissions required to still perform such manual updates whenever needed. But to actually log on with one of those new user IDs, a team member has to perform an extra step to get access to its password. Ideally this extra step is also automated (use your own imagination for what it should look like), but if all else fails: just contact (email, call, etc.) the gate-keeper of the required password, stating which issue has to be fixed (similar to your question).

Getting such a gate-keeper in place is not an easy job, and the most resistance will come from ... the team members (for all sorts of reasons). Therefore, a variation on those new user IDs (as in the previous step) is that each team member gets an extra user ID (with a password they choose themselves), but with an extra string attached: they are only allowed to log on with that extra user ID if they actually have a good reason to do so. And each time they perform such a logon, they are required to file some type of report about "which issue they fixed" (similar to your question).
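As a rough illustration of the "no logon without a report" rule, here is a minimal Python sketch. Everything in it is hypothetical (the `BreakGlassLog` class, the field names, the sample IDs); it only shows the idea that each use of a special user ID leaves behind a record of the issue that was fixed, not how any particular access-management product works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BreakGlassReport:
    user_id: str        # the special (extra) user ID that was used
    deployment_id: str  # hypothetical identifier of the affected deployment
    reason: str         # "which issue they fixed"
    logged_at: datetime


@dataclass
class BreakGlassLog:
    reports: list = field(default_factory=list)

    def record_logon(self, user_id, deployment_id, reason):
        # Refuse the logon when no reason is given: the report is the whole point.
        if not reason or not reason.strip():
            raise ValueError("a reason is required to use a special user ID")
        report = BreakGlassReport(
            user_id, deployment_id, reason.strip(), datetime.now(timezone.utc)
        )
        self.reports.append(report)
        return report


log = BreakGlassLog()
log.record_logon("alice-special", "deploy-101", "missing config value")
```

In practice the record would be written somewhere the team member cannot edit afterwards; an in-memory list is used here only to keep the sketch short.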

With these procedures in place, all that's left to do is to periodically review each of those reports / reasons why it was required to use such special user ID, and ask the question "Is there anything that can be done to further automate this, to further reduce the need for such special login?".
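The periodic review could be sketched like this in Python. It assumes, as a simplification of my own, that a deployment which needed at least one special logon counts as a "failed" change, and that each report has already been given a reason category by hand. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def review(total_deployments, reports):
    """Summarise manual-intervention reports for a review period.

    reports: list of (deployment_id, reason_category) tuples.
    """
    # A deployment counts once, however many special logons it needed.
    touched = {deployment_id for deployment_id, _ in reports}
    rate = len(touched) / total_deployments if total_deployments else 0.0
    # Most frequent reasons first: these are the best automation candidates.
    by_reason = Counter(category for _, category in reports).most_common()
    return rate, by_reason


reports = [
    ("deploy-101", "permissions"),
    ("deploy-101", "missing config"),
    ("deploy-104", "permissions"),
]
rate, by_reason = review(total_deployments=10, reports=reports)
print(f"manual-intervention rate: {rate:.0%}")
print(by_reason)
```

Tracking this rate from one review period to the next, as more steps get automated, is one way to put a number on the improvement the question asks about.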

Update:

Quote from your extra comment below this answer:

"I think adding artificial barriers to fixing a deployment issue is counter-productive."

True, it adds an extra barrier, but I'm not convinced it is "artificial", because this is, to my knowledge, the only way to become aware of things those team members otherwise won't ever tell you, for reasons such as:

- job security.

- bad things/practices they prefer to keep hidden.

- power they do not want to lose.

Context

StackExchange DevOps Q#2918, answer score: 7
