HiveBrain v1.2.0
principle | Major | pending

Principle: Design for failure

Submitted by: @anonymous
Tags: design for failure, circuit breaker, bulkhead, fault tolerance, resilience patterns

Problem

Systems designed on the assumption that every component works tend to fail catastrophically when one of them breaks.

Solution

Design every system component assuming others will fail:

Network calls:
  • Every network call WILL fail eventually
  • Set a timeout on every external call (without one, a hung dependency can block the caller indefinitely)
  • Use circuit breakers for failing dependencies
  • Retry with exponential backoff + jitter
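A minimal sketch of retry with exponential backoff and jitter, in Python. The name `call_with_retry` and all of its parameters are hypothetical and chosen for illustration. It assumes the wrapped callable enforces its own timeout (for example by passing a `timeout` argument to whatever client it uses) and that the operation is safe to repeat.

```python
import random
import time


def call_with_retry(fn, attempts=4, base_delay=0.1, max_delay=2.0,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep, rng=random.uniform):
    """Call fn(); on a transient error, wait and try again.

    The wait before each retry is drawn uniformly from
    [0, min(max_delay, base_delay * 2**attempt)], often called "full jitter".
    `sleep` and `rng` are injectable so the function can be tested
    without real delays.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: fail clearly rather than hide the error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng(0, cap))
```

The jitter matters because many clients that fail at the same moment would otherwise retry at the same moment too, hitting the recovering dependency in synchronized waves.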



Dependencies:
  • What happens when the database is down?
  • What happens when the cache is unavailable?
  • What happens when a microservice is slow?
  • Have an answer for each: degrade, fallback, or fail clearly
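One way to make "degrade, fallback, or fail clearly" concrete is sketched below in Python. The names `read_with_fallback` and `ServiceUnavailable` are hypothetical, and a plain dict stands in for a real cache; this is an illustration under those assumptions, not a prescribed implementation.

```python
class ServiceUnavailable(Exception):
    """Raised when neither the primary source nor a fallback can answer."""


def read_with_fallback(key, primary, cache):
    """Return (value, source), where source is "primary" or "stale-cache".

    `primary` is any callable that may raise; `cache` is a dict
    holding the last value successfully read for each key.
    """
    try:
        value = primary(key)
    except Exception as exc:
        if key in cache:
            # Degrade: serve the last known value and say so.
            return cache[key], "stale-cache"
        # Fail clearly: a specific error, with the original cause attached.
        raise ServiceUnavailable(f"no data available for {key!r}") from exc
    cache[key] = value  # refresh the fallback copy on every success
    return value, "primary"
```

Returning the source alongside the value lets the caller decide whether stale data is acceptable for that request or should be flagged to the user.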



Data:
  • Assume input data CAN be malformed
  • Validate at system boundaries
  • Handle partial failures in batch operations
  • Make operations idempotent so retries are safe
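The data rules above can be combined in one small batch loop, sketched here in Python. The function `process_batch`, the `"id"` field, and the in-memory `processed_ids` set are all assumptions made for illustration; a real system would persist the set of processed ids.

```python
def process_batch(items, handle, processed_ids):
    """Apply handle(item) to each item, collecting failures instead of aborting.

    Each valid item is a dict with an "id" key. `processed_ids` is a set of
    ids already completed, which makes re-running the same batch safe.
    Returns (succeeded_ids, failed) where failed is a list of (item, reason).
    """
    succeeded, failed = [], []
    for item in items:
        item_id = item.get("id") if isinstance(item, dict) else None
        if item_id is None:
            failed.append((item, "missing id"))  # validate at the boundary
            continue
        if item_id in processed_ids:
            succeeded.append(item_id)  # already done: the retry is a no-op
            continue
        try:
            handle(item)
        except Exception as exc:
            failed.append((item, str(exc)))  # one bad item does not stop the rest
            continue
        processed_ids.add(item_id)
        succeeded.append(item_id)
    return succeeded, failed
```

Because completed ids are skipped, the caller can simply resubmit the whole batch after a failure; only the items that did not succeed are attempted again.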



Infrastructure:
  • Any single server can die at any moment
  • Design for at least N+1 redundancy (one more instance than the load requires)
  • Test failover regularly (chaos engineering)
  • Automate recovery where possible
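The N+1 rule can be checked with simple arithmetic, sketched below in Python. The function name `survives_single_failure` is hypothetical, and capacities and load are assumed to be in the same unit (for example requests per second).

```python
def survives_single_failure(capacities, peak_load):
    """True if the remaining nodes can carry peak_load after any one node dies."""
    if len(capacities) < 2:
        return False  # a lone node is a single point of failure
    # Losing the largest node is the worst case, so checking it covers all others.
    return sum(capacities) - max(capacities) >= peak_load
```

For example, three nodes of capacity 100 pass the check at a peak load of 200, while two nodes of capacity 100 fail it at a peak load of 150.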



Patterns:
  • Circuit breaker: Stop calling a failing service
  • Bulkhead: Isolate failures to one subsystem
  • Timeout: Never wait forever
  • Retry with backoff: Handle transient failures
  • Fallback: Serve a cached or default response when the primary fails
  • Health checks: Detect failures quickly
  • Graceful degradation: Core features keep working when extras fail
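As an illustration of the first pattern in the list, here is a minimal circuit breaker sketch in Python. The class names, the thresholds, and the injectable `clock` are assumptions chosen for the example; the sketch is not thread-safe and omits features a production library would provide.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked as failing")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing, which also gives that dependency room to recover.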


The test: Pull the plug on any single component. Does the system:
  1. Keep working (degraded)? EXCELLENT
  2. Show a clear error message? GOOD
  3. Hang/crash/corrupt data? BAD



Everything fails. The question is when, not if.

Why

Netflix principle: 'The best way to avoid failure is to fail constantly.' A system that routinely handles failure during normal operation has already exercised the recovery paths it will need in a crisis.

Context

Distributed system architecture
