Resilience Patterns with Ballerina
Isuru Udana
Senior Technical Lead, WSO2
Bob is a developer who
works at the IT
Department of a
popular bank.
Bob
Bob was asked to develop
a mobile selfcare banking
application.
This mobile application
should be capable of
showing account balances
as well as details of past
transactions.
Legacy services are
already available.
But when he started to implement the application,
Bob found issues with some of the legacy services.
Transient
Network Failures
Moderate Load
Intermittent
Failures
Bob found it very
difficult to build a
reliable application.
Resilient service
Bob got an idea!
Resilience
Ability to return to the
original form or position after
being affected by a particular
alteration
What is Resilience?
In a software system, resilience means ...
… the ability to recover to a working condition after being
affected by a serious incident
Resilience in Software Applications
“The probability of failure-free software operation for a
specified period of time in a specified environment.”
- The IEEE Reliability Society
• 100% operational all the time
Reliability and Resilience
Reliability
http://www.picserver.org/r/reliability.html
Is Focusing on Reliability Enough?
Distributed and complex systems
with many interactions are prone
to failures
Why Focusing on Reliability is Not
Enough
Systems are Complex and Prone to Failures
• Untested corner cases
• Minor mistakes can cause serious production
incidents
• Failures are unpredictable
Why Focusing on Reliability is Not
Enough
Avoiding Failures is Not Practical
• Handle unexpected situations
• When one feature is temporarily unavailable, the rest
of the application still runs
• Stop errors in the downstream parts of a complex
system from propagating upstream
Resilience in Production
It’s All About Achieving
Availability of a Production
System!
Best case:
• User gets 100% availability of the
service
Typical case:
• User sees a graceful degradation of the
service
What Does it Mean to a User?
• Never expect systems to be 100% reliable
• Design systems with connection issues,
downtime, etc. in mind
What Does it Mean to a Developer?
• Bulkhead
• Retry
• Circuit Breaker
• Timeout
• …
Resilience Patterns
Isolate components of an application into multiple pools.
If one component fails, the others continue to serve requests
Bulkhead
Isolation
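A minimal Python sketch of the bulkhead idea, not Ballerina code: each component gets its own bounded pool of slots, so exhausting one pool cannot starve the other. The names `Bulkhead`, `BulkheadFullError` and `max_concurrent` are hypothetical, and rejecting calls immediately when a pool is full is an assumption.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a component's pool has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps concurrent calls into one component so it cannot starve others."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Assumption: reject immediately instead of queueing when the pool is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} pool is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One pool per component: a saturated history pool leaves balance calls untouched.
balance_pool = Bulkhead("account-balance", max_concurrent=10)
history_pool = Bulkhead("account-history", max_concurrent=2)
```

A caller would wrap each backend call, for example `balance_pool.call(lambda: 1250.75)`.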
• Transient failures are not uncommon
• They recover by themselves
• Can be handled by
– Cancel
– Retry
– Retry with a delay
Retry
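A small Python sketch of "retry with a delay", not Ballerina code. The function name `call_with_retry` and the parameters `retry_count` and `retry_delay` are hypothetical; treating every exception as transient is a simplifying assumption.

```python
import time


def call_with_retry(operation, retry_count=3, retry_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure, retry up to retry_count more times."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            # Assumption: every exception is treated as a transient failure.
            if attempt >= retry_count:
                raise  # retries exhausted: surface the last error to the caller
            attempt += 1
            sleep(retry_delay)  # wait before the next attempt


# Hypothetical flaky backend: fails twice, then recovers by itself.
calls = {"n": 0}


def flaky_balance():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return 1250.75
```

Setting `retry_count=0` gives the "cancel" behaviour from the list above, and `retry_delay=0` gives an immediate retry.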
https://www.flickr.com/photos/markgregory/8184890333
https://creativecommons.org/licenses/by-nc-sa/2.0/legalcode
• Hide downstream latency and keep
the responsiveness to upstream
• Prevent waiting forever
Timeout
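A Python sketch of a timeout wrapper, not Ballerina code. The slow call runs on a worker thread and the caller stops waiting after `timeout_seconds`; `call_with_timeout` and the pool size are hypothetical choices.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation, timeout_seconds):
    """Return operation()'s result, or raise TimeoutError if it takes too long."""
    future = _pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The worker thread may still be running; we only stop waiting for it.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_seconds}s") from None
```

This keeps the upstream caller responsive, but it does not stop the downstream work itself, so the pool should be sized with that in mind.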
• Some transient failures take
much longer to recover
• Repeatedly retrying may
hinder recoverability
• Retry up to a certain degree
and cut off
Circuit Breaker
Circuit Breaker
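[Figure: States in circuit breaker. Closed: Success and Fail (threshold not reached) stay in Closed; Fail (threshold exceeded) moves to Open. Open: Fail/Keep Open until the Reset Timeout, then Half-Open. Half-Open: Success moves to Closed; Fail moves back to Open.]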
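The three states can be sketched as a small Python state machine. This is an illustration, not Ballerina's implementation: the names `CircuitBreaker`, `failure_threshold` and `reset_timeout` are hypothetical, and counting consecutive failures (reset on any success) is an assumption.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "HALF_OPEN"  # reset timeout elapsed: allow a trial call
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or reaching the threshold, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

The injectable `clock` is only there so the reset timeout can be exercised without real waiting.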
Resilience with Ballerina
• Designed to implement resilient programs/services
• Highly structured error and exception handling
Resilience with Ballerina
Back to the Story
Banking Service
[Figure: the Mobile Application calls the Banking Service, which calls the Account Balance Service and the Account History Service. Annotations on the legacy services: can only handle moderate load; transient failures; sometimes takes a long time to respond.]
[Figure: Banking Service with an Account Balance Resource and an Account History Resource, in front of the Account Balance Legacy Service and the Account History Legacy Service.]
[Figure: Banking Service with its Account Balance Resource in front of the Account Balance Legacy Service.]
Connectors, Connections and Endpoints
Endpoint
Connection
Params
Connector Options
Struct
Handling Transient Failures
Retry
Banking
Service
Account
Balance Service
Mobile
Application
Handling Transient Failures
Retry
Retry
Count
Retry
Delay
Options
Struct
Timeout
Timeout
Duration
Protect Services From Overload
Circuit Breaker
Retries Before
Suspension
Suspension
Duration
Applying Multiple Patterns
Circuit Breaker + Retry + Timeout
Circuit
Breaker
Retry
Timeout
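One plausible way to layer the three patterns, sketched in Python rather than Ballerina: the timeout bounds each attempt, retry repeats a failed attempt, and the circuit breaker wraps the whole retried call. The nesting order, the helper names and all parameter names are assumptions made for illustration.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def with_timeout(operation, timeout_seconds):
    def wrapped():
        future = _pool.submit(operation)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            raise TimeoutError("backend too slow") from None
    return wrapped


def with_retry(operation, retry_count, retry_delay):
    def wrapped():
        for attempt in range(retry_count + 1):
            try:
                return operation()
            except Exception:
                if attempt == retry_count:
                    raise  # retries exhausted
                time.sleep(retry_delay)
    return wrapped


def with_circuit_breaker(operation, failure_threshold, reset_timeout):
    state = {"failures": 0, "opened_at": None}

    def wrapped():
        if state["opened_at"] is not None:
            if time.monotonic() - state["opened_at"] < reset_timeout:
                raise RuntimeError("circuit open; failing fast")
            # Reset timeout elapsed: fall through to one trial call (half-open).
        try:
            result = operation()
        except Exception:
            state["failures"] += 1
            if state["opened_at"] is not None or state["failures"] >= failure_threshold:
                state["opened_at"] = time.monotonic()  # open, or re-open after a failed trial
            raise
        state["failures"] = 0
        state["opened_at"] = None  # success closes the circuit
        return result
    return wrapped


def fetch_balance():
    """Hypothetical stand-in for the legacy account balance call."""
    return 1250.75


guarded_fetch_balance = with_circuit_breaker(
    with_retry(with_timeout(fetch_balance, 0.5), retry_count=2, retry_delay=0.1),
    failure_threshold=3,
    reset_timeout=30.0,
)
```

With this nesting the breaker counts one failure per exhausted retry sequence, not one per attempt; swapping the order of the wrappers changes that behaviour.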
Balancing Load
Banking
Service
Account
Balance Service 1
Mobile Application
Account
Balance Service 2
Balancing Load
Connectors
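A Python sketch of balancing calls across the two Account Balance Service instances, not Ballerina's load-balancing connector. Round-robin selection and failing over to the next instance on error are assumptions; `RoundRobinBalancer` is a hypothetical name.

```python
import itertools


class RoundRobinBalancer:
    """Sends each call to the next backend in turn; fails over on error."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._next = itertools.cycle(range(len(self._backends)))

    def call(self, *args):
        last_error = None
        # Try each backend at most once per call, starting from the next in turn.
        for _ in range(len(self._backends)):
            backend = self._backends[next(self._next)]
            try:
                return backend(*args)
            except Exception as error:
                last_error = error
        raise last_error  # every backend failed


# Hypothetical stand-ins for Account Balance Service 1 and 2.
def balance_service_1(account_id):
    return ("service-1", account_id)


def balance_service_2(account_id):
    return ("service-2", account_id)


balancer = RoundRobinBalancer([balance_service_1, balance_service_2])
```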
Story Continued ...
For priority
customers, the
application should
provide zero
downtime.
Offering Different Quality of Services
Bulkhead
Banking
Service
Reliable
Service
Service with
Transient
Failures
Standard
Customer
Priority
Customer
Offering Different Quality of Services
Bulkhead
Reliable
Service
Service with
Transient Issues
Priority
Check
Function
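A Python sketch of the priority check, not Ballerina code: priority customers are routed to the reliable service and standard customers to the service with transient issues. The customer record shape (`tier`, `account_id`) and the function names are hypothetical.

```python
def is_priority(customer):
    """Priority check function; the 'tier' field is a hypothetical attribute."""
    return customer.get("tier") == "priority"


def route_balance_request(customer, reliable_service, standard_service):
    """Pick the backend by customer class, then call it with the account id."""
    backend = reliable_service if is_priority(customer) else standard_service
    return backend(customer["account_id"])


# Hypothetical stand-ins for the two backends in the diagram.
def reliable_balance_service(account_id):
    return f"reliable:{account_id}"


def flaky_balance_service(account_id):
    return f"standard:{account_id}"
```

Because the two customer classes never share a backend, failures in the standard path do not reach priority customers.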
Conclusion
● What is Resilience
● Reliability and resilience
● Resilience patterns
● Building resilience patterns with Ballerina
wso2.com

[WSO2Con EU 2017] Resilience Patterns with Ballerina