What is the difference between the traditional Development and Operations Model and Site Reliability Engineering?
Problem
"SRE is what happens when you ask a software engineer to design an operations team." – Site Reliability Engineering
Since Google's Site Reliability Engineering Book was released, on more than one occasion I have been told that SRE is an extension of the existing Operations or Application Support model.
We've had a couple of questions that defined differences between Sys. Admins, DevOps Engineers and Site Reliability Engineers:
- What is the difference between Sysadmin and DevOps Engineer?
- What is the difference between SRE and DevOps?
- What could be a valid definition of DevOps to introduce it to a novice?
However, none of these questions or their answers describe the differences between a Systems Administrator and a Site Reliability Engineer.
In broader terms: what are the key differences between Google's practice of Site Reliability Engineering and the traditional separated Development and Operations functions within a business?
Solution
Thankfully, since Site Reliability Engineering was developed internally at Google and has only recently started to make its way into the broader community, it is fairly well-defined. What isn't, though, is web operations (or "systems administration" - as an example of the lack of clarity, you use both in your question). It's difficult to discuss the differences between two things when you're not altogether sure what one of them is.
But I'm an adventurous fellow, so I'll give it a shot.
In very traditional shops, developers and sysadmins are very siloed from each other. The devs build an app, then consider their job complete as soon as their code has been committed. The sysadmins take the build artifacts (which may be just the code, if it's an interpreted language) and deploy it to production servers. It's the sysadmins' job to keep the application running smoothly, and in general manage the production environment. However, often performance problems come from architecture issues in the app; the sysadmins don't have the programming knowledge to know what the app is doing, and the developers don't know how the app acts in the production topology with production traffic, so no one is equipped by themselves to solve the problem.
Additionally, the developers are usually judged on how quickly they can produce new features, while the sysadmins are judged on how infrequently the app breaks in production. Since change is one of the leading causes of breakage, this puts the two departments at odds with each other - an old rivalry that hurts the business and the people involved.
At some point, some developer-centric companies got so annoyed at this that they began practicing "NoOps" - they eliminated their operations departments and the perceived roadblocks that came with them. In reality, this meant that developers took on operations roles, but maintained their old titles.
In a discussion surrounding NoOps, John Allspaw, then VP of Technical Operations at Etsy and an editor of the well-respected Web Operations book, defined roles at Etsy this way:
Etsy Operations is responsible for:
- Responding to outages, takes on-call
- Alerting systems thresholding, design
- Architecture design and review
- Building metrics collection
- Application configuration
- Infrastructure buildout/management
Etsy Development is responsible for:
- Responding to outages, takes on-call
- Alerting systems thresholding, design
- Architecture design and review
- Building metrics collection
- Application configuration
- Shipping public-facing code
Neither of those lists are comprehensive, I'm sure I'm missing
something there. While Etsy Ops has made production-facing application
changes, they're few but real (and sometimes quite deep). While Etsy
Dev makes Chef changes, they're few but real. If there's so much
overlap in responsibilities, why the difference, you might ask? Domain
expertise and background. Not many Devs have deep knowledge of how TCP
slow start works, but Ops does. Not many Ops have a comprehensive
knowledge of sorting or relevancy algorithms, but Dev does. Ops has
years of experience in forecasting resource usage quickly with
acceptable accuracy, Dev doesn't. Dev might not be aware of the pros
and cons of distributing workload options across all layers 1-7, maybe
only just at 7, Ops does. Entity-relationship modeling may come
natural to a developer, it may not to ops. In the end, they both
discover solutions to various forms of Byzantine failure scenarios and
resilience patterns, at all tiers and layers.
In his world, developers and ops engineers had very similar high-level skill sets and responsibilities; where they differed was in their expertise. Their differing specialties encouraged them to work together to solve problems, and their common base-level skills gave them a language in which to do that.
This is generally the definition of web operations that I land on for most cases. So it's the one we're going to continue along with.
So then, what is Site Reliability Engineering?
The Google SRE book opens with a definition of SRE... and then another one... and then spends a chapter continuing to define the role and an entire book covering the specifics. Even when developed in one organization, it seems that it's difficult to condense the job down to one single agreed-upon definition.
To start with, we need to walk back to 2003, when Ben Treynor joined Google and founded what came to be the first Site Reliability Engineering team. Recall that a few paragraphs ago we were in the early 2010s; but in 2003, the industry was still pretty set on the sysadmin/developer divide as the natural way of things. So when Ben says that SRE is what happens when you ask a software engineer to design an operations team, this was a much more radical melding of the two worlds than it appears now.
The definition given in the preface emphasizes
Context
StackExchange DevOps Q#911, answer score: 7