Principle: Design for failure
Tags: design for failure, circuit breaker, bulkhead, fault tolerance, resilience patterns
Problem
Systems designed assuming everything works correctly fail catastrophically when a component breaks.
Solution
Design every system component assuming others will fail:
Network calls:
- Every network call WILL fail eventually
- Implement timeouts on all external calls (no timeout = infinite wait)
- Use circuit breakers for failing dependencies
- Retry with exponential backoff + jitter
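A minimal sketch of the timeout and retry points above, assuming the caller passes in an `operation` that accepts a timeout in seconds and hands it to whatever client it uses. The names `call_with_retry` and `backoff_delays`, and the default values, are illustrative rather than taken from any particular library. The jitter variant shown is "full jitter": each delay is drawn uniformly between zero and the exponential ceiling.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: delay i is uniform in [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(attempts)]


def call_with_retry(operation: Callable[[float], T], *, timeout: float = 2.0,
                    max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(timeout), retrying transient errors with backoff + jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            # The timeout is always passed on, so no call can wait forever.
            return operation(timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail clearly instead of looping
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Only errors that are plausibly transient are retried here; anything else propagates immediately. Retrying is only safe when the operation is idempotent.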
Dependencies:
- What happens when the database is down?
- What happens when the cache is unavailable?
- What happens when a microservice is slow?
- Have an answer for each: degrade, fallback, or fail clearly
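One way to make the three answers concrete, as a sketch only: the functions `get_profile` and `get_recommendations`, the `DependencyUnavailable` exception, and the injected `read_cache`, `read_db`, and `fetch` callables are all hypothetical stand-ins for real clients. The cache outage is handled by degrading, the slow or failing microservice by a fallback, and the database outage by failing clearly.

```python
from typing import Callable, Optional


class DependencyUnavailable(Exception):
    """Raised when a required dependency is down and no fallback exists."""


def get_profile(user_id: str,
                read_cache: Callable[[str], Optional[dict]],
                read_db: Callable[[str], dict]) -> dict:
    # Cache unavailable: degrade by skipping it; the request is just slower.
    try:
        cached = read_cache(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass
    # Database down: there is no safe substitute here, so fail clearly.
    try:
        return read_db(user_id)
    except ConnectionError as exc:
        raise DependencyUnavailable("profile store unavailable") from exc


def get_recommendations(user_id: str,
                        fetch: Callable[[str], list]) -> list:
    # Optional microservice slow or failing: fall back to an empty default.
    try:
        return fetch(user_id)
    except (TimeoutError, ConnectionError):
        return []
```

Which answer fits which dependency is a product decision: an empty recommendation list is harmless, while a made-up profile would not be.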
Data:
- Assume input data CAN be malformed
- Validate at system boundaries
- Handle partial failures in batch operations
- Make operations idempotent so retries are safe
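The data points above can be sketched together in one small batch handler. Everything here is an assumption made for illustration: the record shape (`id`, `amount`), the `ledger` dictionary standing in for durable storage, and the names `validate`, `apply_batch`, and `BatchResult`. Each record is validated at the boundary, a bad record is rejected without aborting the batch, and a record whose `id` is already in the ledger is skipped, so replaying the batch changes nothing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class BatchResult:
    applied: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)              # already applied
    rejected: List[Tuple[Any, str]] = field(default_factory=list)  # (record, reason)


def validate(record: Any) -> None:
    """Reject malformed input before it reaches any business logic."""
    if not isinstance(record, dict):
        raise ValueError("record must be an object")
    if not isinstance(record.get("id"), str) or not record["id"]:
        raise ValueError("missing or non-string id")
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount must be a positive integer")


def apply_batch(records: List[Any], ledger: Dict[str, int]) -> BatchResult:
    result = BatchResult()
    for record in records:
        try:
            validate(record)
        except ValueError as exc:
            # Partial failure: note the bad record and keep going.
            result.rejected.append((record, str(exc)))
            continue
        if record["id"] in ledger:
            # Idempotency: the id is the key, so a retry is a no-op.
            result.skipped.append(record["id"])
            continue
        ledger[record["id"]] = record["amount"]
        result.applied.append(record["id"])
    return result
```

The caller gets a full account of what was applied, skipped, and rejected, and can retry the whole batch after a crash without double-applying anything.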
Infrastructure:
- Any single server can die at any moment
- Design for at least N+1 redundancy
- Test failover regularly (chaos engineering)
- Automate recovery where possible
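A rough sketch of what N+1 redundancy buys at the call site, assuming a list of interchangeable replicas. `call_with_failover`, `is_healthy`, and `send` are hypothetical names, and the health check and transport are injected so the failover path can be exercised in an ordinary test rather than discovered in an outage.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[str],
                       is_healthy: Callable[[str], bool],
                       send: Callable[[str], T]) -> T:
    """Try each healthy replica in order; raise only if every one fails."""
    errors: List[str] = []
    for replica in replicas:
        if not is_healthy(replica):
            errors.append(f"{replica}: unhealthy")
            continue
        try:
            return send(replica)
        except (TimeoutError, ConnectionError) as exc:
            # A replica can die between the health check and the call.
            errors.append(f"{replica}: {exc!r}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))
```

A small-scale version of a failover drill is then a test that marks one replica as dead and asserts the call still succeeds.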
Patterns:
- Circuit Breaker: Stop calling a failing service
- Bulkhead: Isolate failures to one subsystem
- Timeout: Never wait forever
- Retry with backoff: Handle transient failures
- Fallback: Serve a cached or default response when the primary fails
- Health checks: Detect failures quickly
- Graceful degradation: Core features work when extras fail
The test: Pull the plug on any single component. Does the system:
- Keep working (degraded)? EXCELLENT
- Show a clear error message? GOOD
- Hang/crash/corrupt data? BAD
Everything fails. The question is when, not if.
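As an illustration of the Circuit Breaker pattern listed above, here is a minimal, single-threaded sketch. The class name, thresholds, and injectable `clock` are assumptions for the example; a production breaker would also need locking and metrics, and established libraries exist for this. After `failure_threshold` consecutive failures the breaker opens and rejects calls immediately; once `reset_after` seconds have passed it lets one trial call through (the half-open state) and closes again only if that call succeeds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick service.
                raise CircuitOpenError("dependency marked as failing")
            # Cool-down elapsed: half-open, allow this one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers typically catch `CircuitOpenError` and serve a fallback, which turns a hanging dependency into a fast, clear, degraded response.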
Why
Netflix principle: 'The best way to avoid failure is to fail constantly.' Systems that handle failure gracefully in normal operation handle it gracefully in crisis too.
Context
Distributed system architecture