How to implement the four-eyes principle for emergency fixes?

Submitted by: @import:stackexchange-devops

Problem

Consider this scenario (any resemblance to real-world situations is purely coincidental):

  • 3:07 am: incoming support call "Something in production went down, I need your help!".
  • 3:12 am: connected to the system (logon accepted) ... and no time for coffee.
  • 3:15 am: lucky you, right away you could spot the issue via some error message somewhere.
  • 3:17 am: use your SCM toolbox to grab the code, fix the issue, test it, great ... my fix works!
  • 3:20 am: get in touch with the DevOps team to ship the fix and to get production running again.
  • 3:21 am: red flag ... "To respect four-eyes, we need 2 more eyes to get approval for this fix".
  • 3:22 am: ggggrrrreat, now what, who else can we call (= wake up some manager)?



If you implemented some approval procedure similar to my answer to "What are possible implementations (or examples) of the four-eyes principle?", then you're out of luck ... here are your choices:

  • Your fix will be stuck (read: production will be down) until 2 more eyes get involved.
  • You figure out a way to get around the missing eyes.



So how do you implement the four-eyes principle for emergency fixes, so that you get production up and running asap (i.e. around 3:25 am) and can also close the call (and go back to where you came from)?

Solution

In the SCM world, which I'm most familiar with, the above scenario is typically addressed by what's called the "abbreviated approval list" procedure.

Here is a blueprint of it:

  • Define your business hours, say from 8 am to 6 pm.
  • Define a complete approval list of (say) 3 levels of approval (for roles X, Y and Z).
  • Define an abbreviated approval list of (say) only 1 level of approval (only for role X).
  • Planned changes always require all approvals from the complete approval list.
  • Unplanned changes also require all approvals from the complete approval list, provided the approvals are to be issued during the defined business hours.
  • For any approvals of unplanned changes that are to be issued outside the defined business hours:
      • Only the approvals from the abbreviated approval list (such as role X above) are required to authorize the change. Once that authorization is given, the change is actually deployed to the target environment.
      • Additional post-approvals are still needed afterwards (within a reasonable number of hours/days), i.e. from all roles in the complete approval list that are not also in the abbreviated approval list (such as roles Y and Z above). If not all post-approvals have been issued within the number of hours/days agreed upfront (e.g. because the fix worked "this" time but was only a temporary fix), then the change might be subject to a rollback. While there is at least 1 outstanding post-approval, the change is marked as "waiting post approvals".
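The blueprint above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular SCM tool implements it: the names (`Change`, `required_before_deploy`, the status strings other than "waiting post approvals") are hypothetical, the 48-hour post-approval window is an invented stand-in for the "reasonable amount of hours/days", the time of the request is used as a proxy for when approvals are to be issued, and weekends or holidays are ignored.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta

# Policy values taken from the "say ..." examples in the blueprint.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
COMPLETE_LIST = ("X", "Y", "Z")
ABBREVIATED_LIST = ("X",)
# Assumption: the agreed window for post-approvals is 48 hours.
POST_APPROVAL_WINDOW = timedelta(hours=48)


def in_business_hours(moment: datetime) -> bool:
    """True if the time of day falls inside the defined business hours."""
    return BUSINESS_START <= moment.time() < BUSINESS_END


def required_before_deploy(planned: bool, moment: datetime) -> tuple[str, ...]:
    """Roles whose approval is needed before the change may be deployed."""
    if planned or in_business_hours(moment):
        return COMPLETE_LIST
    # Unplanned change outside business hours: abbreviated list only.
    return ABBREVIATED_LIST


@dataclass
class Change:
    planned: bool
    requested_at: datetime
    approvals: set[str] = field(default_factory=set)

    def may_deploy(self) -> bool:
        needed = required_before_deploy(self.planned, self.requested_at)
        return set(needed) <= self.approvals

    def outstanding_post_approvals(self) -> set[str]:
        """Roles from the complete list that have not approved yet."""
        return set(COMPLETE_LIST) - self.approvals

    def status(self, now: datetime) -> str:
        if not self.may_deploy():
            return "waiting approvals"
        if not self.outstanding_post_approvals():
            return "approved"
        if now - self.requested_at > POST_APPROVAL_WINDOW:
            # The blueprint only says the change "might be" rolled back.
            return "rollback candidate"
        return "waiting post approvals"


# The 3:20 am emergency fix from the scenario, approved by role X only.
fix = Change(planned=False, requested_at=datetime(2024, 1, 10, 3, 20))
fix.approvals.add("X")
print(fix.may_deploy())
print(fix.status(datetime(2024, 1, 10, 3, 25)))
```

With only role X's approval, the emergency fix is allowed to deploy and sits in "waiting post approvals" until roles Y and Z sign off; if they have not done so when the assumed window expires, the sketch flags it as a rollback candidate rather than rolling back automatically, which matches the "might be subject to a rollback" wording.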



With such a solution in place, the call can be closed around 3:23 am ... since there will be no more red flag at 3:21 am ... ggggrrreat, time for a beer to celebrate my fix to get production going again (instead of coffee) ... and fingers crossed the outstanding post-approvals will come in soon ...

Context

StackExchange DevOps Q#437, answer score: 8
