Enterprise resilience patterns
@seva_dolgopolov
Disaster Semantics
2 Approaches
“Defensive coding” vs. “Let it crash”
Disaster Math
Defensive coding
Let it crash
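The contrast between the two approaches can be sketched roughly as follows. This is a minimal illustration, not taken from the talk: `parse_defensive`, `Counter` and `Supervisor` are hypothetical names, and the restart policy (replace the crashed worker with a fresh one, escalate after `max_restarts`) is an assumption.

```python
def parse_defensive(raw, default=0):
    """Defensive coding: handle the bad input on the spot and carry on."""
    try:
        return int(raw)
    except ValueError:
        return default  # assumed default; the caller never sees the failure


class Counter:
    """'Let it crash' worker: no local error handling, bad input just raises."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        self.total += int(message)  # raises ValueError on garbage
        return self.total


class Supervisor:
    """Hypothetical supervisor: replaces a crashed worker with a fresh one."""

    def __init__(self, factory, max_restarts=3):
        self.factory = factory
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = factory()

    def send(self, message):
        try:
            return self.worker.handle(message)
        except Exception:
            self.restarts += 1
            if self.restarts > self.max_restarts:
                raise  # too many crashes: escalate to the caller
            self.worker = self.factory()  # restart with clean state
            return None
```

With defensive coding the error handling lives inside the business logic; with "let it crash" the worker stays simple and recovery is the supervisor's job, at the price of losing the worker's state on restart.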
Patterns
Exceptions, Timeout, Circuit Breaker, Handshaking, Bulkhead,
Health checks, Heartbeat, Retry, Rollback, Reset, Failover,
Fallback, Backpressure, Bounded Queue, Load Balancing,
Dead Letter, Supervision, Governor, ...
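As one example from the list, Retry could look roughly like this. It is a sketch under assumptions: the function name `retry`, the exponential backoff and the injectable `sleep` parameter are illustrative choices, not something the slides prescribe.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, backing off between failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error  # all attempts failed: give up and surface the last error
```

Bounding the number of attempts matters: an unbounded retry loop against a struggling service tends to make the outage worse.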
Infrastructure View
Hardware
|
Process
|
Network
Exceptions, Health checks, Heartbeat, Dead Letter,
Retry, Rollback, Reset, Governor, Supervision
Timeout, Handshaking, Backpressure,
Fallback, Circuit Breaker
Load Balancing, Bulkhead, Failover
Employer View
Ops
|
Dev
Failover, Load balancing, Supervisor, Health checks
Fallback, Bulkhead, Timeout, Circuit Breaker,
Handshaking, Backpressure, Retry, Rollback,
Reset, Supervisor, Bounded Queue, Dead Letter
Application View
Failover, Load balancing, Supervisor, Bulkhead
Health checks, Timeout, Handshaking, Supervisor, Dead Letter,
Heartbeat
Fallback, Circuit Breaker, Backpressure, Retry, Rollback,
Reset, Bounded Queue
Deployment
|
Detection
|
Repair
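To make one detection/repair pair concrete, here is a rough sketch of a Bounded Queue combined with a Dead Letter store. `BoundedMailbox` and its `offer` method are hypothetical names; using the rejected-message signal as a crude form of backpressure is an assumption of this sketch.

```python
import queue


class BoundedMailbox:
    """Bounded inbox: overflow goes to a dead-letter list instead of growing memory."""

    def __init__(self, capacity):
        self.inbox = queue.Queue(maxsize=capacity)
        self.dead_letters = []

    def offer(self, message):
        """Return True if accepted, False if the inbox is full."""
        try:
            self.inbox.put_nowait(message)
            return True
        except queue.Full:
            # Keep the rejected message for later inspection and tell the
            # producer (via False) that it should slow down.
            self.dead_letters.append(message)
            return False
```

The dead-letter list makes overload visible (detection), while the bounded inbox keeps the consumer from being buried (repair).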
What makes a perfect mix
Akka
Deployment
- Supervisor
- Bulkhead (as Actor)
-> Detection -> Repair
- Heartbeat
- Dead letters
- Timeouts
- Restart
- Fallback
- Backpressure (akka-stream)
- Failover (akka-cluster)
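The Heartbeat entry in the detection column can be illustrated with a small failure detector. This is plain Python, not Akka code: `HeartbeatMonitor`, `beat` and `suspects` are hypothetical names, and the fixed-timeout rule is a simplifying assumption.

```python
import time


class HeartbeatMonitor:
    """Marks a node as suspect when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_seen = {}

    def beat(self, node):
        """Record a heartbeat from `node`."""
        self.last_seen[node] = self.clock()

    def suspects(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A suspect node is only a hint that repair (restart, failover) may be needed; a slow network can produce the same symptom as a dead process.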
Netflix
Deployment
- Bulkhead (as Microservice)
-> Detection -> Repair
- Heartbeat
- Timeouts
- Circuit Breaker (Hystrix)
- Fallback
- Retry (Ribbon)
- Failover (Eureka)
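The Circuit Breaker plus Fallback combination can be sketched as below. This is not the Hystrix API: the class, its parameters (`failure_threshold`, `reset_timeout`) and the simple closed/open/half-open logic are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Fails fast to a fallback after repeated errors, then probes again later."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the failing dependency
            # Otherwise half-open: let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result
```

While the circuit is open the dependency gets time to recover and callers get the fallback immediately instead of waiting for a timeout.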
That’s it?
A few things
1. Resilience will shape the way you implement your business logic
2. And it adds another level of complexity
objectives
- A staging environment will never be a full-sized replica of production
- Safety is not a composable property
Chaos Engineering
http://principlesofchaos.org/
Netflix “Chaos monkey”
If something hurts, do it more often.
Enrolling Chaos
Opt Out vs. Opt In
Targeting chaos
Random vs. Prespecified
Playing Chaos
1. Define a “Steady State”
2. Make a hypothesis that this state will not change
“Clustered services should be unaffected by instance failures”
“The application is responsive even under high latency conditions”
3. Inject “Chaos”
4. Verify your hypothesis
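The four steps above can be sketched as a toy experiment. Everything here is a stand-in: `Cluster`, `success_rate` and `run_experiment` are hypothetical names, the "steady state" is simply the request success rate, and the injected chaos is the loss of one random instance.

```python
import random


class Cluster:
    """Toy cluster: a request succeeds as long as at least one instance is alive."""

    def __init__(self, size):
        self.alive = [True] * size

    def handle_request(self):
        return any(self.alive)

    def kill_random_instance(self, rng):
        candidates = [i for i, up in enumerate(self.alive) if up]
        if candidates:
            self.alive[rng.choice(candidates)] = False


def success_rate(cluster, requests=100):
    """Fraction of successful requests, used here as the steady-state metric."""
    return sum(cluster.handle_request() for _ in range(requests)) / requests


def run_experiment(cluster, rng, tolerance=0.01):
    steady_state = success_rate(cluster)   # 1. define the steady state
    # 2. hypothesis: the success rate stays within `tolerance` of steady state
    cluster.kill_random_instance(rng)      # 3. inject chaos
    observed = success_rate(cluster)       # 4. verify the hypothesis
    return abs(observed - steady_state) <= tolerance
```

In this toy model `run_experiment(Cluster(3), random.Random(0))` confirms the hypothesis, while a single-instance cluster refutes it, which mirrors the "clustered services should be unaffected by instance failures" example.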
Chaos Automation Platform
Chaos Experiment
Takeaways
- There is no silver bullet for achieving a safe system
- Implementing resilience brings complexity, and you need
to manage it.
- Only testing in production makes you confident
thx
