principle · Major · pending

Principle: Everything fails, so plan for it

Submitted by: @anonymous · Mar 2, 2026

Tags: failure, resilience, timeout, circuit breaker, graceful degradation, retry

Problem

Systems are often designed only for the happy path, so the failure of a single component can bring down the whole system.

Solution

Design every system component assuming failures will happen:

Network failures:

Timeouts on every external call (don't wait forever)
Retries with exponential backoff and jitter
Circuit breakers to stop cascading failures
Fallback behavior (cache, default, graceful degradation)
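
Of the four items above, the circuit breaker is the least obvious to implement, so here is a minimal sketch. It assumes a single-threaded caller and a fixed failure threshold. The names CircuitBreaker, CircuitOpenError, failure_threshold and reset_timeout are illustrative and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a failing dependency.
                raise CircuitOpenError("dependency is failing, call skipped")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch CircuitOpenError and serve the fallback (cache, default value) instead of waiting on a dependency that is known to be down.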

Process failures:

Supervised processes that auto-restart
Health checks that detect and recover from hangs
Graceful shutdown handling (finish in-flight requests)
Idempotent operations (safe to retry)
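
One common way to make an operation safe to retry is an idempotency key supplied by the caller. The sketch below is illustrative only: PaymentService and its fields are hypothetical, and the in-memory dict stands in for a durable store in which the check and the write would need to be atomic.

```python
class PaymentService:
    """Hypothetical service used to illustrate idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A retried request carries the same key, so return the stored
        # result instead of charging the account a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "account": account, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

With this in place, a client that times out and retries with the same key gets the original response, and the charge happens once.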

Data failures:

Validate all input at system boundaries
Handle corrupt/missing data without crashing
Backups tested by actually restoring them
Schema migrations that can be rolled back
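
The first two items can be combined in a single boundary parser that rejects bad input instead of letting it crash code deeper in the system. This is a sketch under assumptions: the Order type, its fields, and the choice to return None for rejected payloads are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    quantity: int


def parse_order(raw):
    """Return an Order, or None if the payload is corrupt or invalid."""
    try:
        data = json.loads(raw)
    except (ValueError, TypeError):
        # ValueError covers malformed JSON and undecodable bytes;
        # TypeError covers inputs that are not str or bytes at all.
        return None
    if not isinstance(data, dict):
        return None
    order_id = data.get("order_id")
    quantity = data.get("quantity")
    if not isinstance(order_id, str) or not order_id:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        return None
    return Order(order_id=order_id, quantity=quantity)
```

Callers can then log and skip a None result, so one corrupt record does not stop the rest of a batch.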

Human failures:

Code review required for production changes
Runbooks for common incidents
Canary deployments (test with small traffic first)
One-click rollback
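
Canary routing is often done by hashing a stable identifier into a bucket, so each user consistently sees the same version while the canary share grows. The function below is a hypothetical sketch of that idea, not the API of any deployment tool.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically decide whether a user is served by the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent
```

Because the bucket depends only on the user ID, raising canary_percent from 5 to 50 keeps every user who was already on the canary there, and setting it to 0 acts as an immediate rollback for routing.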

Design patterns:

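# The retry names below match the tenacity library. url, ml_service, Timeout,
# ServiceUnavailable and get_popular_items are placeholders defined elsewhere.
import requests
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Timeout everything: (connect timeout, read timeout) in seconds
result = requests.get(url, timeout=(3, 10))

# Retry with exponential backoff and jitter, giving up after 3 attempts
@retry(stop=stop_after_attempt(3), wait=wait_random_exponential(multiplier=1, max=10))
def fetch_data(): ...

# Graceful degradation: serve a cheaper answer when the service is unavailable
def get_recommendations(user_id):
    try:
        return ml_service.recommend(user_id)
    except (Timeout, ServiceUnavailable):
        return get_popular_items()  # Fallback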

For every external dependency X, ask: what happens when X fails?

Why

In distributed systems, the question isn't if something will fail but when. Systems designed for failure degrade gracefully and recover quickly; systems designed only for the happy path turn a single component failure into an outage.

Context

Designing reliable distributed systems and services
