How to safeguard Ansible deployment to mitigate accidents?

Submitted by: @import:stackexchange-devops

Problem

Recently, Amazon S3 had a major outage in the us-east-1 region. It appears to have been caused by a typo made while running a maintenance playbook in Ansible or a similar tool. One safeguard is to put a shell script wrapper around ansible-playbook, like this:

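#!/bin/bash
# Preview the affected hosts and tasks; stop if the preview itself fails.
/usr/bin/ansible-playbook "$@" --list-hosts --list-tasks || exit 1
# Require an explicit "y" before running the playbook for real.
read -r -p "Are you sure? (y/n) " answer
test "$answer" = "y" || exit 0
exec /usr/bin/ansible-playbook "$@"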


What other safeguards do you use to improve safety and reduce the chance that an error causes a major outage for your company?

Solution

We use Jenkins jobs to trigger deployments. This ensures that the same ansible-playbook command runs no matter who performs the deployment. As a bonus, the build logs record when each deployment was triggered, who triggered it, and exactly what happened during it.

It's certainly not foolproof, but it has been a clear improvement over running Ansible playbooks by hand.
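As a rough illustration of the idea, the script below is one way a Jenkins job could pin the command so that the only thing a person chooses is an environment name. It is a sketch under assumptions, not the setup described in this answer: the playbook name site.yml, the inventories/<env> layout, the environment names, and the function names are all hypothetical.

```shell
#!/bin/bash
# Hypothetical fixed entry point for a deployment job. The caller supplies
# only an environment name; every other part of the command is pinned here.

# Assumed names and layout (not taken from the original answer).
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-/usr/bin/ansible-playbook}"
PLAYBOOK="site.yml"
ALLOWED_ENVS="staging production"

# Succeed only if $1 exactly matches one entry of the allowlist.
is_allowed_env() {
    for allowed in $ALLOWED_ENVS; do
        [ "$1" = "$allowed" ] && return 0
    done
    return 1
}

# Validate the environment, print the exact command, then run it.
run_deploy() {
    if ! is_allowed_env "$1"; then
        echo "unknown environment: '$1' (allowed: $ALLOWED_ENVS)" >&2
        return 1
    fi
    # Rebuild the positional parameters as the full, fixed command line.
    set -- "$ANSIBLE_PLAYBOOK" -i "inventories/$1" "$PLAYBOOK"
    echo "running: $*"
    "$@"
}

# Deploy only when an environment name was given on the command line.
[ "$#" -eq 0 ] || run_deploy "$1"
```

With something like this, the job's build step would be a single line such as `./deploy.sh staging` (the script name is also made up), and a mistyped environment fails before ansible-playbook is ever started. Printing the command first means the exact invocation ends up in the build log.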

For larger or riskier changes, this should ideally be combined with some form of change management, so that a change is made only after another person or team has reviewed both the change and the planned approach. This helps identify and resolve potential issues early.

Additionally, it never hurts to have a teammate who understands the change present while you make big changes, so they can watch for and help prevent mistakes in its execution.

Context

StackExchange DevOps Q#309, answer score: 6
