principle | Major | pending
Principle: Everything fails, plan for it
failure, resilience, timeout, circuit breaker, graceful degradation, retry
Problem
Systems are often designed only for the happy path and break catastrophically when any single component fails.
Solution
Design every system component assuming failures will happen:
Network failures:
- Timeouts on every external call (don't wait forever)
- Retries with exponential backoff and jitter
- Circuit breakers to stop cascading failures
- Fallback behavior (cache, default, graceful degradation)
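The circuit breaker bullet is the least self-explanatory of the four, so here is a minimal sketch of one, assuming a single-threaded caller. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are illustrative and do not come from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a pause."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a broken dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and go straight to the fallback behavior from the last bullet. A production version would likely also need locking and per-dependency instances.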
Process failures:
- Supervised processes that auto-restart
- Health checks that detect and recover from hangs
- Graceful shutdown handling (finish in-flight requests)
- Idempotent operations (safe to retry)
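One common way to make an operation safe to retry is an idempotency key supplied by the caller. The sketch below assumes an in-memory store and a hypothetical `PaymentProcessor`; a real service would likely need a durable store with an atomic check-and-insert.

```python
class PaymentProcessor:
    """Hypothetical handler that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.charges = []    # stand-in for the real side effect

    def charge(self, idempotency_key, account, amount_cents):
        # A retried request with the same key returns the original result
        # instead of performing the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "account": account, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

With this shape, a client that times out and retries `charge("req-1", ...)` cannot double-charge the account, which is what makes retries safe to combine with timeouts.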
Data failures:
- Validate all input at system boundaries
- Handle corrupt/missing data without crashing
- Backups tested by actually restoring them
- Schema migrations that can be rolled back
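The first two bullets can be sketched together: validate untrusted input once at the boundary, and set bad records aside rather than crashing on them. The `Order` shape and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a payload fails boundary validation."""


@dataclass(frozen=True)
class Order:
    order_id: str
    quantity: int


def parse_order(payload):
    """Validate an untrusted dict and return an Order, or raise ValidationError."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be an object")
    order_id = payload.get("order_id")
    if not isinstance(order_id, str) or not order_id.strip():
        raise ValidationError("order_id must be a non-empty string")
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        raise ValidationError("quantity must be a positive integer")
    return Order(order_id=order_id.strip(), quantity=quantity)


def load_orders(payloads):
    """Keep valid records and collect rejected ones instead of crashing."""
    valid, rejected = [], []
    for payload in payloads:
        try:
            valid.append(parse_order(payload))
        except ValidationError as exc:
            rejected.append((payload, str(exc)))
    return valid, rejected
```

Code past the boundary can then rely on `Order` being well-formed, and the `rejected` list gives operators something concrete to inspect.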
Human failures:
- Code review required for production changes
- Runbooks for common incidents
- Canary deployments (test with small traffic first)
- One-click rollback
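As a rough illustration of the canary bullet, traffic can be split deterministically by hashing a stable identifier, so the same user always lands on the same version. The function names and the default of 5 percent are assumptions; real deployments usually do this in the load balancer or deployment tooling.

```python
import hashlib


def in_canary(user_id, canary_percent):
    """Deterministically assign a user to the canary based on a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


def pick_version(user_id, canary_percent=5):
    """Return which release should serve this user."""
    return "canary" if in_canary(user_id, canary_percent) else "stable"
```

Setting `canary_percent` to 0 sends everyone back to the stable version, which is one simple way to approximate a quick rollback.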
Design patterns:
import requests
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Timeout everything (url is whatever endpoint you are calling)
result = requests.get(url, timeout=(3, 10))  # (connect, read) timeouts in seconds

# Retry with exponential backoff and jitter
@retry(stop=stop_after_attempt(3), wait=wait_random_exponential(multiplier=1, max=10))
def fetch_data(): ...

# Graceful degradation
def get_recommendations(user_id):
    try:
        return ml_service.recommend(user_id)
    except (Timeout, ServiceUnavailable):
        return get_popular_items()  # Fallback

Ask: What happens when X fails? For every external dependency X.
Why
In distributed systems, the question isn't IF something will fail but WHEN. Systems designed for failure stay available; systems designed for success have outages.
Context
Designing reliable distributed systems and services