How can I convince someone that 100% reliability is not the right target for anything?
Problem
It's a fundamental principle of DevOps and SRE that failure is normal and that setting a goal of perfect reliability is misguided. But sometimes IT execs and business leaders push back on the idea. They believe the business needs 100% reliability. What are some good ways to persuade them to change their thinking?
Solution
I asked the original question to find out whether anyone in this forum had heard or tried new arguments, in person, for convincing executives to set service level objectives below 100%.
Here are some approaches I've seen or taken in the past to tackling this problem. The Site Reliability Workbook has a few suggestions here, which include:
- 100% is not a reasonable target since every component of any system has a non-zero probability of failure.
- External components (such as an ISP) sitting between the customer and the target system aren't 100% reliable.
- 100% reliability implies you can never change the system, since all change introduces risk.
- 100% reliability implies you spend all your time reacting to reliability problems and have no time for anything else.
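The first two points can be backed with simple arithmetic. Here is a minimal sketch, assuming every component on the request path must be up for the request to succeed and that components fail independently. The component names and availability figures are made up for illustration, not measurements.

```python
from math import prod

def serial_availability(component_availabilities):
    """Availability of a path on which every component must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(component_availabilities)

# Hypothetical figures, for illustration only.
path = {
    "customer ISP": 0.999,
    "DNS": 0.9999,
    "load balancer": 0.9999,
    "our service": 1.0,  # pretend our own service really is perfect
}

end_to_end = serial_availability(path.values())
print(f"End-to-end availability: {end_to_end:.4%}")
```

Even with the service itself pinned at 100%, the customer experiences less than the weakest link on the path, so the extra cost of chasing a perfect score is invisible to them.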
In addition to the above, I've also gathered the following talking points from various in-person discussions that might be useful to persuade organizational leaders:
- If an IT exec sets a target of 100% reliability, they are basically encouraging their engineers to lie to them and hide problems, which means they will discover those problems in the most expensive possible way.
- Nature has never evolved a 100% reliable system. Gene replication is unreliable. The human heart is not 100% reliable. Life continues and thrives in the face of imperfect reliability, and so can technology systems and their users.
- No group of humans has ever engineered a system that's 100% reliable in the long run.
We also have to acknowledge that for certain dimensions of reliability, such as data loss prevention, some regulated industries require levels of reliability so high and over long periods of time that they are virtually indistinguishable from 100%, like 99.999999%. For other dimensions like availability and latency, three or four nines of reliability are often more than adequate, and some of the arguments above may help persuade IT leadership that this is true.
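When talking to leadership, it can help to translate "nines" of availability into minutes of allowed downtime. A rough sketch, assuming a 30-day measurement window (the window length and function name are my own choices, not something from the Workbook):

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.2f} minutes per 30 days")
```

Three nines leaves roughly 43 minutes a month and four nines roughly 4 minutes, while 100% leaves zero, which means no room for deployments, maintenance, or any incident at all.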
Context
StackExchange DevOps Q#6691, answer score: 4