
How to improve DRP-testing of SE sites?

Submitted by: @import:stackexchange-devops·Mar 10, 2026·


Tags: drp, sites, improve, testing, how

Problem

Have a look at what's described in the question about "Brief outage planned for Wed, May 3, 2017 at 0:00 UTC, 8pm US/Eastern (like a fire drill for computers)", which is about testing of the existing "Disaster Recovery Planning" (= DRP) of the entire family of SE sites.

If you were in charge of this, what would you recommend to improve this kind of DRP testing in production?

Solution

NOTE: It's probably not worth reading too much into how good (or not) StackExchange appears, from the outside, to be at managing their disaster recovery scenarios. I suspect they are following much of the best practice below and simply testing scenarios to validate their configuration.

Depending on the environment you operate within:

- A disaster recovery plan might form part of a larger Business Continuity Plan. A Business Continuity Plan would also likely consider operational risks to your people, organisation, locations, information, partners and management systems.

- A disaster recovery plan might be broken down into many IT Service Continuity Plans for individual services. The disaster recovery plan would likely incorporate people and process along with the technical aspects of the service.

Given these definitions, you might consider ways of improving the capability of the organisation as a whole to be resilient against failure:

- Service Recovery:

Enable individual services to be Active-Active across two geographically dispersed data centres. This assumes that applications are capable of replicating state between data centres, for example by using BASE semantics for data.

Create self-healing services. This means expecting failure and building with Resilience Engineering in mind, for example by using a tool such as Chaos Monkey to simulate failures.
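To make the self-healing idea concrete, here is a minimal Python sketch of a failure-injection drill. It is not how Chaos Monkey itself works; `ServicePool`, `inject_failure` and `run_drill` are hypothetical names invented for illustration, and the "instances" are plain in-memory objects rather than real servers. The point is only the loop: break something at random, then check that the service returns to its desired state without human help.

```python
import random


class Instance:
    """Hypothetical stand-in for a running service instance."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class ServicePool:
    """Hypothetical pool that tries to keep `desired` healthy instances."""

    def __init__(self, desired):
        self.desired = desired
        self.instances = [Instance(f"web-{i}") for i in range(desired)]
        self._next_id = desired

    def healthy_count(self):
        return sum(1 for inst in self.instances if inst.healthy)

    def reconcile(self):
        """Self-healing step: drop failed instances and start replacements."""
        self.instances = [inst for inst in self.instances if inst.healthy]
        replaced = 0
        while len(self.instances) < self.desired:
            self.instances.append(Instance(f"web-{self._next_id}"))
            self._next_id += 1
            replaced += 1
        return replaced


def inject_failure(pool, rng):
    """Chaos step: mark one randomly chosen healthy instance as failed."""
    candidates = [inst for inst in pool.instances if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim.name


def run_drill(pool, rounds, seed=0):
    """Alternate failure injection and healing, recording each round."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    log = []
    for _ in range(rounds):
        killed = inject_failure(pool, rng)
        degraded = pool.healthy_count()
        replaced = pool.reconcile()
        log.append((killed, degraded, replaced, pool.healthy_count()))
    return log


if __name__ == "__main__":
    pool = ServicePool(desired=3)
    for killed, degraded, replaced, final in run_drill(pool, rounds=5):
        print(f"killed={killed} healthy_after_kill={degraded} "
              f"replaced={replaced} healthy_after_heal={final}")
```

In a real environment the reconcile step would be whatever your platform already provides (an autoscaling group, an orchestrator, a process supervisor); the drill's job is to prove that this mechanism actually fires.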

- Disaster Recovery Plan:

Again, enable Active-Active mechanisms across data centres. The difference from service recovery plans (SRPs) is that capacity needs to be carefully considered: if you have two DCs in an Active-Active pattern and one fails, the single remaining DC must be sufficiently scaled to support 100% of the traffic.

War games and rehearsals are really important for disaster recovery plans because they test the people and the process. In the most mature DevOps environments much of this can be automated, as evidenced by Chaos Gorilla.
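The capacity point for Active-Active data centres can be checked with simple arithmetic. The sketch below assumes identical DCs and evenly spread load, which real deployments rarely match exactly; the function names are hypothetical and chosen only for this example.

```python
def max_safe_utilisation(num_dcs, tolerated_failures=1):
    """Highest per-DC utilisation (0..1) at which the surviving DCs can
    still absorb all traffic after `tolerated_failures` DCs are lost.

    Assumes identical DCs with load spread evenly across them.
    """
    if num_dcs <= tolerated_failures:
        raise ValueError("need at least one surviving data centre")
    return (num_dcs - tolerated_failures) / num_dcs


def survives_failure(peak_load, capacity_per_dc, num_dcs, tolerated_failures=1):
    """True if the remaining DCs can carry `peak_load` on their own."""
    surviving = num_dcs - tolerated_failures
    return surviving > 0 and peak_load <= surviving * capacity_per_dc


if __name__ == "__main__":
    # Two DCs: each may run at no more than 50% if one must carry everything.
    print(max_safe_utilisation(2))
    # 1200 req/s peak with 1000 req/s per DC: two DCs cannot lose one...
    print(survives_failure(1200, 1000, num_dcs=2))
    # ...but three DCs can.
    print(survives_failure(1200, 1000, num_dcs=3))
```

With two DCs this reduces to the rule in the text: each DC must be sized for 100% of the traffic, so in normal operation neither should run above half of its capacity.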

- Business Continuity Plan:

Since this is a DevOps site I won't go into the long process of building Business Continuity Plans. However, the rule of not keeping all of your eggs in one basket applies. Have a plan for when your office is flooded:

Allow your staff to work remotely from home one day a week; this tests your BCP strategies.

If possible, have two geographically and politically separate locations for your workforce.

Define and test a clear process for communicating business continuity events, and practise it through fire drills.

Context

StackExchange DevOps Q#1055, answer score: 7
