Application Resiliency
Yet Another Resilience Framework
Inspired By Netflix Hystrix
Resilience:
Systems easily “drift” from a state of resilience,
and failure can emerge from component
relationships. Thus, applications (as components
of a complex system) must be resilient to latency
and failure in all of their system relationships and
not rely on infrastructure alone to implement
this resilience.
Application Resiliency
Philosophy
• Embrace failure as a natural state in the life
cycle of the application
• Instead of trying to prevent it, manage it
• Make developers responsible for resiliency
• Process supervision
• Supervisor hierarchies
Resiliency Patterns
• Bulkheads
– Workload isolation with thread pools
• Dataflow Concurrency (Promises)
• Retry On Failure
• Timeouts
• Circuit Breaker
• Fallback
• Governor
– Overload Protection
– Throttling: concurrency and rate limits
Bulkheads
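As a minimal sketch of workload isolation with thread pools, the example below gives each dependency its own small pool, so a slow dependency can only tie up its own threads. The `Bulkhead` class, the pool names, and the pool sizes are illustrative assumptions, not the framework's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkhead:
    """Runs calls to one dependency on its own dedicated thread pool."""

    def __init__(self, name, max_workers):
        self.name = name
        # Threads are named after the bulkhead, e.g. "billing_0".
        self._pool = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix=name
        )

    def submit(self, fn, *args, **kwargs):
        # Returns a Future; the caller decides how long to wait for it.
        return self._pool.submit(fn, *args, **kwargs)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def current_thread_name():
    return threading.current_thread().name


# Hypothetical dependencies, each isolated in its own pool.
billing = Bulkhead("billing", max_workers=2)
shipping = Bulkhead("shipping", max_workers=2)

billing_thread = billing.submit(current_thread_name).result()
shipping_thread = shipping.submit(current_thread_name).result()

billing.shutdown()
shipping.shutdown()
```

Note that `ThreadPoolExecutor` queues excess work without bound, so this sketch isolates threads but does not reject work when a pool is saturated.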
Async task orchestration with Promises
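One way to picture this, assuming Python futures stand in for promises: three independent lookups start concurrently and are joined into one result. The function names mirror the Order Service calls named later in this deck; their bodies and return values are made-up stubs.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub dependencies (hypothetical return values for illustration only).
def get_order(order_id):
    return {"id": order_id, "items": ["book"]}


def get_billing_info(order_id):
    return {"order": order_id, "status": "paid"}


def get_shipping_info(order_id):
    return {"order": order_id, "status": "in transit"}


def load_order_view(order_id):
    """Start all three calls at once, then combine their results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        order = pool.submit(get_order, order_id)
        billing = pool.submit(get_billing_info, order_id)
        shipping = pool.submit(get_shipping_info, order_id)
        # result() blocks until each future has completed.
        return {
            "order": order.result(),
            "billing": billing.result(),
            "shipping": shipping.result(),
        }


view = load_order_view(42)
```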
Throttling
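A concurrency throttle can be sketched with a semaphore: calls beyond the limit are rejected immediately instead of queuing. `ConcurrencyThrottle` and `ThrottledError` are hypothetical names chosen for this example; rate limiting would need a separate time-based counter that is not shown here.

```python
import threading


class ThrottledError(Exception):
    """Raised when the concurrency limit is already reached."""


class ConcurrencyThrottle:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject rather than wait for a free slot.
        if not self._slots.acquire(blocking=False):
            raise ThrottledError("too many concurrent calls")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


throttle = ConcurrencyThrottle(max_concurrent=1)


def nested_call():
    # Runs while the outer call still holds the only slot.
    try:
        throttle.call(lambda: "inner")
    except ThrottledError:
        return "rejected"
    return "admitted"


outcome = throttle.call(nested_call)   # inner call is rejected
after = throttle.call(lambda: "ok")    # slot is free again
```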
Fallback
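A fallback can be sketched as a wrapper that tries a primary service and, if it fails, calls an alternate one. The `with_fallback` helper and the two stub services below are assumptions made for illustration.

```python
def with_fallback(primary, alternate):
    """Return a function that calls `primary`, falling back to `alternate`."""

    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Any failure of the primary is answered by the alternate.
            return alternate(*args, **kwargs)

    return call


# Hypothetical services: the primary is down, the alternate answers.
def primary_shipping(order_id):
    raise ConnectionError("primary service unavailable")


def alternate_shipping(order_id):
    return {"order": order_id, "source": "alternate"}


get_shipping_info = with_fallback(primary_shipping, alternate_shipping)
result = get_shipping_info(7)
```

In practice the fallback might also return a cached or default value rather than call a second service.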
Fail fast (Timeout)
• Avoid “slow responses”
• Separate:
– SystemError - resources not available
– ApplicationError - bad user input, etc.
• Verify resource availability before starting
an expensive task
• Validate input immediately
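The points above can be sketched as follows: input is validated before any work starts, and a call that exceeds its time budget is reported as a system-level error. All names here are hypothetical; the system error class is called `SystemUnavailableError` only because Python already has a built-in `SystemError`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class ApplicationError(Exception):
    """Bad user input and similar caller mistakes."""


class SystemUnavailableError(Exception):
    """A resource did not answer in time or is not available."""


_pool = ThreadPoolExecutor(max_workers=2)


def fail_fast(fn, order_id, timeout_s):
    # Validate input immediately, before spending any resources.
    if not isinstance(order_id, int) or order_id <= 0:
        raise ApplicationError("order id must be a positive integer")
    future = _pool.submit(fn, order_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a running task is not interrupted
        raise SystemUnavailableError(f"no response within {timeout_s}s")


def slow_lookup(order_id):
    time.sleep(0.3)  # simulated slow dependency
    return order_id


def classify(order_id, timeout_s):
    try:
        fail_fast(slow_lookup, order_id, timeout_s)
        return "ok"
    except ApplicationError:
        return "application error"
    except SystemUnavailableError:
        return "system error"


results = [classify(-1, 0.05), classify(1, 0.05), classify(1, 1.0)]
_pool.shutdown(wait=True)
```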
Circuit Breakers
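A minimal circuit breaker sketch, assuming a simple consecutive-failure threshold and a fixed reset delay (Hystrix itself uses rolling error statistics, which this example does not model). The class is not thread-safe and every name in it is illustrative.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit is open")
            # Reset delay elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10, clock=lambda: now[0])


def failing():
    raise ConnectionError("dependency down")


events = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        events.append("short-circuited")
    except ConnectionError:
        events.append("failed")

now[0] = 11.0  # past the reset delay
events.append(breaker.call(lambda: "recovered"))
```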
Retry on Failure
Improve user experience and application resiliency by
retrying a dependency call when the error is recoverable.
• Retry policy per service call, or global defaults
• Transient errors should be retried
[Diagram: Service → Dependent Service Call (1); Transient Exception failures → Retry Call (2). Layers shown: Service, Framework, Application Service.]
Policy: Service 1
– Exception List: Error 1, Error 2
– Retry = Y
– Retry Cnt = 2
– Delay = 50ms
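The example policy (retry count 2, delay 50ms, retry only on listed errors) could be sketched like this. `RetryPolicy`, `call_with_retry`, and `TransientError` are hypothetical names, and the `sleep` parameter exists only so the demo can record delays instead of waiting.

```python
import time
from dataclasses import dataclass


class TransientError(Exception):
    """Stand-in for the errors on the policy's exception list."""


@dataclass(frozen=True)
class RetryPolicy:
    retry_on: tuple = (TransientError,)  # exception list
    retry_count: int = 2                 # retries after the first attempt
    delay_s: float = 0.05                # 50ms between attempts


def call_with_retry(fn, policy, sleep=time.sleep):
    retries = 0
    while True:
        try:
            return fn()
        except policy.retry_on:
            # Errors not on the list propagate immediately; listed
            # errors are retried until the retry budget is used up.
            if retries >= policy.retry_count:
                raise
            retries += 1
            sleep(policy.delay_s)


# Demo: the dependency fails twice, then succeeds on the third attempt.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("temporary glitch")
    return "ok"


delays = []
result = call_with_retry(flaky, RetryPolicy(), sleep=delays.append)
```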
Based on Netflix Hystrix
Our Approach
[Diagram: Component Order, with components numbered 1 to 6. Labels shown: Order Service with getBillingInfo(), getOrder(), and getShippingInfo(); Governor; bulkhead (Thread Pool) ×2; Timeout ×2; Retry ×2; CircuitBreaker ×2; Fallback; Primary Service; Alternate Service.]
Future Patterns
• Request Caching
• Request Collapsing
– A mechanism that combines multiple requests
into a single backend dependency call, reducing
the number of threads and network connections
needed for concurrent command executions
– The batching is automated, so developers of a codebase do not have to coordinate
manual batching of requests
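Request collapsing can be sketched as a collector that hands out one future per request and later resolves all of them from a single batched backend call. This is an illustrative, single-threaded sketch: `RequestCollapser` and `fetch_orders` are hypothetical, and the explicit `flush()` stands in for the timer-driven batch window a real collapser would use.

```python
from concurrent.futures import Future


class RequestCollapser:
    """Collects individual requests and serves them with one batch call."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn  # takes a list of keys, returns {key: value}
        self._pending = []

    def submit(self, key):
        future = Future()
        self._pending.append((key, future))
        return future

    def flush(self):
        pending, self._pending = self._pending, []
        if not pending:
            return
        # De-duplicate keys while preserving order.
        keys = list(dict.fromkeys(key for key, _ in pending))
        try:
            results = self._batch_fn(keys)
        except Exception as exc:
            # A failed batch fails every request that was part of it.
            for _, future in pending:
                future.set_exception(exc)
            return
        for key, future in pending:
            future.set_result(results[key])


# Demo: three requests collapse into one backend call.
backend_calls = []


def fetch_orders(ids):
    backend_calls.append(list(ids))
    return {i: f"order-{i}" for i in ids}


collapser = RequestCollapser(fetch_orders)
futures = [collapser.submit(i) for i in (1, 2, 1)]
collapser.flush()
values = [f.result() for f in futures]
```

Callers still write one request per item, which is the point: the batching happens behind the `submit` call rather than in each caller's code.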
Getting Started
Demo on YouTube
– http://youtu.be/ZyeEdjufSHE
Code on GitHub
– https://github.com/xmlking/Resilience
Follow me on Twitter
– @xmlking
