As your organization builds multi-tier architectures consisting of several applications and technologies, vulnerabilities and availability issues between tiers are bound to surface. A failure in a downstream system can start a domino effect that brings the entire application down, and unanticipated load can make recovery very challenging.
How do you ensure that a failure at one tier remains isolated and doesn’t cascade?
What does it take to build a fault-tolerant, self-healing system that fails fast or degrades gracefully?
In short, how do you make your system resilient, and when can you call it done?
1. PAGE 1 | GRACE HOPPER CELEBRATION INDIA 2016 | #GHCI16
PRESENTED BY THE ANITA BORG INSTITUTE AND THE ASSOCIATION FOR COMPUTING MACHINERY INDIA
Building Resiliency in Multi-Tier Systems
Shreya Mukhopadhyay
2. Agenda
• What is resiliency?
• Open source tools for resilience
— Netflix’s Hystrix
— Wiremock
• Learnings
• Q/A
3. What is Resiliency?
4. Why is it important?
5. Why is it important?
6. Open Source Framework for Building Resiliency
7. Hystrix Framework
• Java library
• Build fault tolerance in system interactions
• Stop cascading failures
• Fail fast and rapidly recover
• Fallback and gracefully degrade when possible
• Dashboards: monitoring, alerting, and operational control
8. Hystrix Implementation
• Wrap external calls in Hystrix: annotation (javanica) or Hystrix Command
• Set configs for: statistical rolling window; timeouts, rejection, sleep window; thresholds (volume, failure %)
• Circuit breaker, when thresholds are breached: fails proactively; frees up worker threads
• Fallback triggered when: request failed; rejection at thread pool; times out; short-circuits
• Dashboards: Hystrix; others (Splunk etc.)
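The circuit breaker and fallback behaviour outlined on this slide can be illustrated with a minimal plain-Java sketch. It is not Hystrix code and does not use the Hystrix API: the class name `SimpleCircuitBreaker` and its parameters are hypothetical, and it counts consecutive failures, whereas Hystrix evaluates request volume and failure percentage over a statistical rolling window.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Illustrative only: a count-based breaker, not Hystrix's rolling-window implementation.
class SimpleCircuitBreaker {
    private static final long CLOSED = -1L;

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long sleepWindowMillis; // time the circuit stays open before a trial call
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAtMillis = new AtomicLong(CLOSED);

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    // Open means: the threshold was breached and the sleep window has not elapsed yet.
    boolean isOpen() {
        long openedAt = openedAtMillis.get();
        return openedAt != CLOSED
                && System.currentTimeMillis() - openedAt < sleepWindowMillis;
    }

    // Runs the call, or returns the fallback if the circuit is open or the call throws.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // short-circuit: fail fast without touching the dependency
        }
        try {
            T result = call.get();
            consecutiveFailures.set(0);
            openedAtMillis.set(CLOSED); // a successful call closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                openedAtMillis.set(System.currentTimeMillis()); // threshold breached: open
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 5_000);
        Supplier<String> paymentCall = () -> {
            throw new IllegalStateException("payment system is down");
        };
        for (int i = 1; i <= 3; i++) {
            String result = breaker.execute(paymentCall, () -> "wallet top-up unavailable");
            System.out.println("call " + i + ": " + result + " (open: " + breaker.isOpen() + ")");
        }
    }
}
```

Once the threshold is reached, later calls return the fallback without invoking the dependency at all. That is the "fails proactively" step, and it is what frees worker threads while the downstream system is unhealthy.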
9. Hystrix at work
10. Open Source Framework for Testing Resiliency
11. Wiremock
• Stub and mock framework for testing HTTP(S) traffic
• Standalone or embedded
• Features include dependency stubs, fault simulation, delay injection, and request verification
[Diagram: a client calling standalone Wiremock; a client unit test using embedded Wiremock]
12. How does it work?
$ curl -H 'Accept: application/json' http://localhost:8089/sample
• Starts embedded Jetty server
• Acts as a proxy to the actual service
• Is used by the client transparently
[Diagram: Client and Wiremock; labels: Request, Mapping, Response, Data]
13. How does it work?
$ curl -H 'Accept: application/json' http://localhost:8089/sample
Request mapping:
{
  "request": {
    "method": "GET",
    "url": "/sample"
  },
  "response": {
    "status": 500,
    "bodyFileName": "body.json",
    "headers": {
      "Content-Type": "application/json"
      …
    }
  }
}
[Diagram: Client and Wiremock; labels: Request, Mapping, Response, Data]
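The idea behind this mapping, a canned 500 response for GET /sample, can be tried without Wiremock using the HTTP server bundled with the JDK. The sketch below is a stand-in and not Wiremock's API: the class `FaultStubServer`, its methods, and the response body are hypothetical, and it binds to an ephemeral port rather than 8089. It shows fault simulation (a chosen status code) and delay injection (a sleep before responding).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for a stubbed dependency; not Wiremock.
class FaultStubServer {

    // Starts a stub on a free port that answers /sample with the given status after a delay.
    static HttpServer start(int status, String jsonBody, long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/sample", exchange -> {
            try {
                Thread.sleep(delayMillis); // delay injection
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = jsonBody.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(status, body.length); // fault simulation
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Client side: returns the HTTP status, or throws if the stub is slower than the timeout.
    static int get(int port, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://localhost:" + port + "/sample").toURL().openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer failing = start(500, "{\"error\":\"simulated failure\"}", 0);
        HttpServer slow = start(200, "{\"status\":\"ok\"}", 500);
        try {
            System.out.println("failing stub status: " + get(failing.getAddress().getPort(), 2_000));
            try {
                get(slow.getAddress().getPort(), 100);
            } catch (SocketTimeoutException e) {
                System.out.println("slow stub: client timed out");
            }
        } finally {
            failing.stop(0);
            slow.stop(0);
        }
    }
}
```

Pointing the client at a stub like this instead of the real dependency is what makes it possible to check that timeouts and fallbacks actually trigger.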
14. System architecture
15. Wiremock simulating dependency and failures
16. Hystrix Dashboard
https://github.com/Netflix/Hystrix/wiki/Dashboard
17. Hystrix Dashboard Details
18. Splunk Dashboard
19. Learnings
• Make the implementation choice based on your system: annotation vs. command
• Hystrix configs should be driven by production metrics
• Hystrix default timeout: 1 second
• Don’t call another Hystrix command in a catch block
• Update catch blocks after the Hystrix implementation
• Don’t implement a fallback if you don’t have one
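To make the timeout learning concrete, here is a minimal plain-Java sketch of a call that is abandoned after a timeout and answered from a fallback. It is not Hystrix code: `TimedCall` and `callWithTimeout` are hypothetical names, and the 1000 ms value in `main` only mirrors the default mentioned above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative only: timeout plus fallback on a worker pool, not the Hystrix API.
class TimedCall {

    // Runs the call on the pool; returns the fallback if it fails or exceeds the timeout.
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> call,
                                 long timeoutMillis, Supplier<T> fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread is freed
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            String result = callWithTimeout(pool, () -> {
                Thread.sleep(1_500); // a dependency slower than the timeout
                return "live response";
            }, 1_000, () -> "cached response");
            System.out.println(result);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If the real dependency routinely takes longer than the configured timeout, every call ends up in the fallback. This is one reason the configs are best driven by production metrics.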
20. References
https://github.com/Netflix/Hystrix/wiki/How-it-Works
https://github.com/Netflix/Hystrix/wiki/Dashboard
http://wiremock.org/
21. Contact Details
SHREYA MUKHOPADHYAY
Email: shreya_mukhopadhyay3@intuit.com
LinkedIn: https://in.linkedin.com/in/shreya-mukhopadhyay-354b0062
22. Thank you
Editor's Notes
A sapling survives even in the harshest conditions; it has the ability to recover from adversity and still grow.
Act normal to your customer even if you’re hurting, and fix the problem quickly before your customers notice.
Similarly, resilience is the ability of a system to gracefully recover from a breakdown.
Application resilience is the ability of an application to react to problems in one of its components and still provide the best possible service.
Reliability: The target at which software designers have always aimed: perfect operation all the time. Reliability is the planned outcome.
Resiliency: The ability of an app to recover from certain types of failure and yet remain functional from the customer perspective. Resilience is the way you achieve the outcome.
What’s the issue? PayTM, LinkedIn, Netflix
Almost all systems are distributed, multi-tier systems, and that means that the failure of a system that you didn’t even know existed can render your own computer unusable.
e.g. Payment system is down
Fallacies of distributed systems: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, and many more.
Therefore, there is no point in trying to avoid failures, as there will be so many; instead, we need to embrace them.
What if the payment acceptance engine at your end is down, so customers can’t add money to their wallets?
But won’t you still allow your customers to log in and take a look at other offerings?
There may be many other systems that you might not be aware of, but they have the capacity to bring your system down.
So if you want to call your system resilient, you need to design on the assumption that failures will happen.
How do you isolate them, stop them from cascading, and degrade gracefully, keeping the end customer in mind?
Here are some resiliency approaches that can be built into your system.
Hystrix is a Java library designed to control the interactions between these distributed systems, providing latency tolerance and fault tolerance.
It does this by isolating points of access between the services.
Also known as “Circuit Breaker”
• Automatic fail fast
• Automatic fail over (fallback)
• Protocol agnostic
• Created by an engineer to make his life better during production support
• Defend against your dependencies (trust but verify)
• Battle tested with billions of requests per day
• Metrics generation
Preventing any single dependency from using up all container (such as Tomcat) user threads.
Shedding load and failing fast instead of queueing.
Providing fallbacks wherever feasible to protect users from failure.
Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.
Optimizing for time-to-discovery through near real-time metrics, monitoring, and alerting
Construct a HystrixCommand or HystrixObservableCommand Object
Execute the Command
Is the Response Cached?
Is the Circuit Open?
Is the Thread Pool/Queue/Semaphore Full?
HystrixObservableCommand.construct() or HystrixCommand.run()
Calculate Circuit Health
Get the Fallback
Return the Successful Response
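The "Is the Thread Pool/Queue/Semaphore Full?" step in the flow above is where load is shed instead of queued. A minimal plain-Java sketch of that idea, using a semaphore as the bulkhead, follows. The class `SemaphoreBulkhead` is hypothetical and is not part of Hystrix; it only illustrates rejecting immediately when a dependency's concurrency limit is reached.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative only: caps concurrent calls to one dependency and rejects the excess.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise returns the fallback without waiting.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // full: shed load and fail fast instead of queueing
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback.get(); // the call failed
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The inner call runs while the outer one still holds the only permit.
        String result = bulkhead.execute(
                () -> "outer ok, inner: " + bulkhead.execute(() -> "ok", () -> "rejected"),
                () -> "rejected");
        System.out.println(result);
    }
}
```

Because each dependency gets its own limit, a slow dependency can exhaust only its own permits rather than all container threads.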
Once you’ve implemented resiliency in your system, you need to prove that it works, right?
For that, you need to bring down the systems that interact with yours, which is only practical when you mock them.
You may also need specific failure responses, such as HTTP/HTTPS 4xx and 5xx errors.
This is where the open source framework Wiremock comes in: it is a stub and mock framework for testing.
It can be used in standalone mode (real-time interactions) or embedded mode (unit and integration tests).
Proxy ports for connecting to Wiremock and other tiers should be configurable.
Resiliency is a choice that you need to make during design and architecture: decide what is important among security, performance, reliability, and resiliency.
The annotation approach is easier to implement and cleaner, but the inheritance (command) approach makes it possible to externalize the command settings, which can be more convenient because they can be changed in production without creating a new release.