principle · Major · pending

Principle: Design for failure

Submitted by: @anonymous · Mar 2, 2026

design for failure, circuit breaker, bulkhead, fault tolerance, resilience patterns

Problem

Systems designed on the assumption that everything works correctly fail catastrophically when a component breaks.

Solution

Design every system component assuming others will fail:

Network calls:

- Every network call WILL fail eventually
- Set timeouts on all external calls (without one, a call can wait forever)
- Use circuit breakers for failing dependencies
- Retry with exponential backoff + jitter
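
As a minimal sketch of the retry advice above, the helper below retries a call with exponential backoff and full jitter. The name `retry_with_backoff`, its parameters, and the default values are illustrative assumptions, not part of the original principle. It assumes the wrapped call already enforces its own timeout and signals transient failure with `ConnectionError` or `TimeoutError`.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run call() until it succeeds or max_attempts is reached (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail clearly instead of looping forever
            # Exponential backoff: the delay ceiling doubles per attempt, bounded by max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random delay in [0, ceiling] keeps clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, since retrying a non-transient failure only adds load. The `sleep` parameter is injectable so the behaviour can be tested without real waiting.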

Dependencies:

- What happens when the database is down?
- What happens when the cache is unavailable?
- What happens when a microservice is slow?
- Have an answer for each: degrade, fall back, or fail clearly
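
The three answers can be sketched as follows. Everything here is hypothetical: `load_profile`, `load_recommendations`, `DependencyUnavailable`, and the default list are illustrative names, and the dependencies are passed in as plain callables that are assumed to raise `ConnectionError` or `TimeoutError` when unavailable.

```python
class DependencyUnavailable(Exception):
    """Raised when a required dependency is down and no fallback exists."""


DEFAULT_RECOMMENDATIONS = ["bestsellers"]  # hypothetical static default


def load_profile(user_id, cache_get, db_get):
    # Degrade: the cache is an optimisation, so if it is down go straight to the database.
    try:
        cached = cache_get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass
    # Fail clearly: the database is required, so surface a specific error rather than hang.
    try:
        return db_get(user_id)
    except ConnectionError as exc:
        raise DependencyUnavailable("profile database is unavailable") from exc


def load_recommendations(user_id, recommend):
    # Fall back: recommendations are an extra, so a slow or failing service yields a default.
    try:
        return recommend(user_id)
    except (ConnectionError, TimeoutError):
        return list(DEFAULT_RECOMMENDATIONS)
```

The choice between the three depends on whether the dependency is essential to the request: optional ones degrade or fall back, required ones fail with an error the caller can act on.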

Data:

- Assume input data CAN be malformed
- Validate at system boundaries
- Handle partial failures in batch operations
- Make operations idempotent so retries are safe
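
One way to combine these points is sketched below, under stated assumptions: records are dicts carrying a hashable `"id"`, and `processed_ids` is a set standing in for whatever durable store would track applied records in a real system. The function name `process_batch` is hypothetical.

```python
def process_batch(records, apply, processed_ids):
    """Apply each record at most once; return (succeeded_ids, failures)."""
    succeeded, failures = [], []
    for record in records:
        # Validate at the boundary: reject malformed input before doing any work.
        if not isinstance(record, dict) or "id" not in record:
            failures.append((record, "malformed record: missing 'id'"))
            continue
        record_id = record["id"]
        # Idempotency: a record that was already applied is skipped, so a retry is harmless.
        if record_id in processed_ids:
            succeeded.append(record_id)
            continue
        try:
            apply(record)
        except Exception as exc:
            # Partial failure: one bad record must not abort the rest of the batch.
            failures.append((record, str(exc)))
        else:
            processed_ids.add(record_id)
            succeeded.append(record_id)
    return succeeded, failures
```

Because failures are returned rather than raised, the caller can resubmit the whole batch after fixing the bad records, and the ones that already succeeded are not applied twice.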

Infrastructure:

- Any single server can die at any moment
- Design for at least N+1 redundancy
- Test failover regularly (chaos engineering)
- Automate recovery where possible
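
Seen from the caller's side, N+1 redundancy might look like the sketch below: the request is tried against each replica in turn until one answers. The function `call_with_failover` is hypothetical, and replicas are modelled as plain callables assumed to raise `ConnectionError` or `TimeoutError` when they are down.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except (ConnectionError, TimeoutError) as exc:
            errors.append(exc)  # this replica is down; move on to the next one
    # Every replica failed: report that clearly instead of returning a partial result.
    raise ConnectionError(f"all {len(replicas)} replicas failed: {errors}")
```

A test that deliberately makes the first replica raise is a small-scale version of the failover drill described above.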

Patterns:

- Circuit Breaker: Stop calling a failing service
- Bulkhead: Isolate failures to one subsystem
- Timeout: Never wait forever
- Retry with backoff: Handle transient failures
- Fallback: Serve cached or default data when the primary fails
- Health checks: Detect failures quickly
- Graceful degradation: Core features work when extras fail
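
To make the first pattern concrete, here is a minimal circuit breaker sketch. The class name, thresholds, and `clock` parameter are illustrative assumptions. It is single-threaded and counts every exception as a failure, so a production version would need locking and a narrower definition of failure.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            # Open: reject immediately until reset_timeout has elapsed.
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked as failing; call skipped")
            # Half-open: the timeout has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the half-open trial failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately, which gives the failing service room to recover and gives the caller a clear signal to use a fallback.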

The test: Pull the plug on any single component. Does the system:

- Keep working (degraded)? EXCELLENT
- Show a clear error message? GOOD
- Hang/crash/corrupt data? BAD

Everything fails. The question is when, not if.

Why

Netflix principle: 'The best way to avoid failure is to fail constantly.' Systems that handle failure gracefully in normal operation handle it gracefully in crisis too.

Context

Distributed system architecture
