Operational Excellence with Hystrix
Lessons Learned from QuickBooks Online (QBO)
Billy Yuen
Principal Engineer, Intuit
Intuit Mission:
To improve our customers’ financial lives so profoundly…
they can’t imagine going back to the old way
CONSUMERS
SMALL BUSINESSES
ACCOUNTING PROFESSIONALS
A Premier Innovative Growth Company
Employees: 8,000+
Customers: 45M
Global Offices: US, UK, India, Canada, Australia
Revenue: 4.2B
Founded: 1983
Public: 1993 (INTU)
What is resiliency?
Resiliency: Act tough to your enemy even if you are hurting
NORMAL CUSTOMER
And fix the problem quickly before your customers notice!
Why is resiliency hard to achieve?
§ Micro-services allow teams to move and innovate faster.
§ But failures are now distributed too!
§ New Architecture == New Set of Problems
Even Google has faced this issue!
§ On May 3, 2013 “From 6:26 PM to 7:58 PM PT, requests to most Google APIs resulted in
500 error response messages.”
§ “The combination of the bug and configuration error quickly caused all of the
serving threads to be consumed. Traffic was permanently queued waiting
for a serving thread to become available.”
§ http://googledevelopers.blogspot.com/2013/05/google-api-infrastructure-outage_3.html
Challenges we face
§ Cascading failures from dependent systems.
§ Very little instrumentation and monitoring for the dependent services.
§ Decomposition from monolith to micro-services.
- Design considerations for automatic recovery from latency and dependency failures.
- Testing for latency and dependency failures.
§ Impossible to achieve 99.9% uptime when every dependency must be up
- If there are N dependencies and each has 99.9% uptime, the best possible uptime is 99.9%^N (99.9%^10 ≈ 99.0%).
- If a dependency is down for X minutes, our recovery time is X + Y minutes (Y covers detection and server restart).
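The uptime arithmetic above can be checked with a few lines of Java. This is a minimal sketch that assumes the dependencies fail independently and that every request needs all of them; the class and method names are made up for illustration.

```java
final class CompoundUptime {
    /** Best-case availability when all n dependencies must be up (assumes independent failures). */
    static double compound(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    /** Expected downtime in minutes over a 30-day month for a given availability. */
    static double downtimeMinutesPer30Days(double availability) {
        return (1.0 - availability) * 30 * 24 * 60;
    }

    public static void main(String[] args) {
        double single = 0.999;
        double ten = compound(single, 10);
        System.out.printf("1 dependency:    %.3f%% uptime, %.1f min down per 30 days%n",
                single * 100, downtimeMinutesPer30Days(single));
        System.out.printf("10 dependencies: %.3f%% uptime, %.1f min down per 30 days%n",
                ten * 100, downtimeMinutesPer30Days(ten));
    }
}
```

Under these assumptions, ten dependencies turn roughly 43 minutes of monthly downtime into about 430 minutes, which is why isolating each dependency matters.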
How does Hystrix help?
§ An implementation of the “Circuit Breaker” pattern
- Automatic fail fast
- Automatic fail over (fallback)
- Protocol agnostic
§ Created by an engineer to make his life better during production support!
§ Defend against your dependencies (trust but verify)!
§ Battle tested with billions of requests per day.
§ Metrics generation
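To make "fail fast" and "fail over" concrete, here is a toy circuit breaker in plain Java. It is not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window, while this sketch assumes a simpler rule (a fixed number of consecutive failures). The class name, parameters, and injectable clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker<T> {
    private final int failureThreshold;    // consecutive failures before the circuit opens
    private final long sleepWindowMillis;  // how long to fail fast before allowing a trial call
    private final LongSupplier clock;      // injectable time source (milliseconds, non-negative)
    private int consecutiveFailures = 0;
    private long openedAt = -1;            // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0;
    }

    T execute(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();         // fail fast: the dependency is not called at all
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();         // fail over to the fallback
        }
    }

    private synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        // Once the sleep window has passed, let a trial request through.
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;                     // close the circuit again
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();  // open, or re-open after a failed trial
        }
    }
}
```

While the circuit is open, callers get the fallback immediately and the struggling dependency receives no traffic, which gives it room to recover. This sketch may let several concurrent trial requests through after the sleep window; a production breaker would limit that to one.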
Hystrix at work (fail fast)
Health Check
Overcoming the Organizational Challenge – feature, feature, feature…
§ Growing pains and opportunity – QBO experienced big growth, and uptime became a bigger issue.
- A network issue with a dependency made QBO unresponsive.
- A disk failure in one component cascaded and made QBO clusters unresponsive.
§ Action over plan – implemented Hystrix and monitoring in the Payment API to demonstrate the vision.
Our Journey to implement Hystrix
§ Apply Hystrix to legacy code
§ Prevent drift (detection)
§ Failure Testing
§ Real time monitoring
§ Production troubleshooting with historical data
§ Production support process
Apply Hystrix to existing applications
§ Challenges
- We don’t know all the network calls. Many calls are buried in client libraries.
- New Relic can provide the URL of the remote service, but not the stack trace.
§ Solution - Hystrix Network Audit Agent
- Java agent that detects any “naked” (unprotected) network call.
- Works for both socket and NIO calls.
- https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent
Learnings from applying the Hystrix Agent in QBO
§ Remove network calls to systems that are no longer used.
- Analogous to “reduce the attack surface” in security.
- Remove unnecessary network calls to “reduce dependency vulnerability”.
§ Remove heartbeat calls to dependent systems.
§ Remove custom fallback logic.
Prevent drift (detection)
§ Integrate the Hystrix audit agent with the regression build.
§ Report all unprotected calls in a self-service dashboard.
Failure testing
§ Mindset changes
- Developer - Coding for resiliency
- Quality - Testing for dependency failure
§ Failure testing tools
- Hystrix’s circuitBreaker.forceOpen property (force the circuit open)
- Unit and integration tests with WireMock - HTTP only
- Custom proxy (man in the middle) - any protocol
Real time monitoring - architecture
§ Instance - publishes a Hystrix metrics stream (provided by Hystrix).
§ Turbine – aggregates the streams of all instances in a cluster.
Real time monitoring - dashboard
§ Hystrix Dashboard - Rolling window of last ten seconds of dependency health
§ https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
Real time monitoring - metrics
§ Circuit Breaker Status
§ Success, Failure and Timeout Count
§ Response time
§ Fallback status
§ Thread pool status
§ https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring
Production troubleshooting with historical data
§ Netflix Atlas - https://github.com/Netflix/atlas/wiki/Overview
§ Intuit – export the Hystrix Metrics to Splunk
- Pinpoint production issue root cause.
- Measurement for actual latency (Network + Service).
- Troubleshoot random “blip”. What is “blip”?
Splunk Hystrix Dashboard
§ Aggregate data by any timeframe
§ Drill down to specific Hystrix command for detail.
Production support process
§ Configure Alerts around critical Hystrix Commands in Splunk
§ Integrate Alerts with PagerDuty.
§ Use Hystrix Dashboard to review real time system health.
§ Use Splunk to debug production issues
Alert -> Page -> Review -> Debug
Other Considerations
§ Is the call necessary?
§ Can you create fallback for read?
§ Client does not need to know it is from fallback (except fragment caching)
§ Can you handle offline write?
§ Idempotence for write (or verify before calling again)
§ Limit blast radius with thread pool or semaphore.
§ Adjust the Hystrix default parameters to your use case.
§ Fallback needs to be fast and highly available.
§ Timeout vs Failure
§ Not applicable to streaming or batch workloads.
Thread pool vs Semaphore
§ Thread pool
- Guaranteed timeout for the caller by using thread interrupt.
- More memory required.
- Supports concurrent programming (Future is non-blocking).
§ Semaphore
- No timeout guarantee (your code assumes responsibility for the timeout).
- Set the Hystrix timeout greater than your own timeout to avoid false timeouts.
- Throw HystrixTimeoutException if the client code throws a timeout exception.
- Scales better for a large pool.
§ HystrixObservableCommand (NIO) can scale even better in some cases.
§ https://github.com/billyy/Hystrix-Tutorial
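The trade-off between the two isolation styles can be sketched with JDK classes alone. This is not how Hystrix is implemented internally; it only illustrates the idea, and `IsolationSketch` and its method names are hypothetical. With a thread pool the caller can give up after a timeout and interrupt the worker; with a semaphore the call runs on the caller's thread, so the only protection is a cap on concurrency.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class IsolationSketch {
    /** Thread-pool style: the caller waits at most timeoutMillis, then interrupts the worker. */
    static <T> T withThreadPool(ExecutorService pool, Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(call);
        } catch (RejectedExecutionException e) {
            return fallback;                    // pool saturated or shut down: reject immediately
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                // interrupt the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                    // the call itself failed
        }
    }

    /** Semaphore style: caps concurrency on the caller's thread; no timeout is enforced here. */
    static <T> T withSemaphore(Semaphore permits, Supplier<T> call, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback;                    // too many concurrent calls: reject immediately
        }
        try {
            return call.get();                  // the call must carry its own timeout
        } catch (RuntimeException e) {
            return fallback;
        } finally {
            permits.release();
        }
    }
}
```

The semaphore variant avoids the extra threads and the hand-off between them, which is where its lower overhead comes from, but a call that hangs will hold the caller's thread until the client library's own timeout fires.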
E2E experience including UI
§ Focus on critical workflows!
§ QBO
- Provide a degraded experience when a dependency is not available.
- Payroll is down.
- Invoicing without the attachment service.
- Allow login if the subscription service is temporarily unavailable.
§ Payments Service
- Limit dependency calls to only applicable use case.
- Cache read-only data with last known good value.
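A "last known good value" fallback for read-only data might look like the sketch below. It is an illustration in plain Java, not the Payments Service code; `LastKnownGood` and the loader function are hypothetical names. Every successful read refreshes an in-memory copy, and a failed read is served from that copy.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class LastKnownGood<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // the remote read, e.g. a call to a dependency

    LastKnownGood(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the fresh value, or the last successfully loaded value if the read fails. */
    Optional<V> get(K key) {
        try {
            V fresh = loader.apply(key);
            if (fresh != null) {
                cache.put(key, fresh);     // remember the latest good value
            }
            return Optional.ofNullable(fresh);
        } catch (RuntimeException e) {
            // Fallback path: an in-memory lookup, so it stays fast and available.
            return Optional.ofNullable(cache.get(key));
        }
    }
}
```

The empty result for a key that was never loaded is deliberate: the caller still has to decide what a degraded experience looks like when there is no last known good value.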
Best Practices
§ Don’t implement fallback if you don’t have a fallback.
§ Don’t call another Hystrix Command from catch.
§ Good and Bad use of fallback.
§ Throw HystrixTimeoutException if your code is doing timeout.
§ Update your catch block once you have implemented Hystrix.
§ Avoid the use of thread local.
§ Retry only if necessary.
Don’t implement fallback if you don’t have a fallback
Don’t call another Hystrix Command from catch
Good use of fallback
Bad use of fallback
Throw HystrixTimeoutException if your code is doing timeout
§ Hystrix otherwise reports all exceptions as failures.
§ A timeout points to a potential issue with the dependency rather than a coding issue.
§ Configure the Hystrix timeout to be greater than your own timeout.
Update your catch block once you have implemented Hystrix
Retry only if necessary
§ Hystrix does NOT support retry.
§ Retry code inside Hystrix Command will skew your metrics.
§ Bad things can happen
- Good chance that an immediate retry will also time out.
- Overload the dependency.
- Exceed your SLA.
§ If you have to retry, implement the retry outside of Hystrix.
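If a retry is unavoidable, keeping it outside the command could look like the following sketch. It uses only the JDK; `RetryOutside` is a hypothetical helper, and the `Supplier` stands in for one complete command execution so that every attempt is measured on its own.

```java
import java.util.function.Supplier;

final class RetryOutside {
    /**
     * Runs execution.get() up to maxAttempts times with exponential backoff.
     * Each attempt is meant to be one complete, separately measured command execution.
     */
    static <T> T retry(Supplier<T> execution, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return execution.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;                              // out of attempts
                }
                try {
                    Thread.sleep(backoff);              // back off instead of retrying immediately
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;                              // stop retrying when interrupted
                }
                backoff *= 2;
            }
        }
        throw last;                                     // surface the last failure to the caller
    }
}
```

With Hystrix, the supplier would typically create a new command per attempt, for example `() -> new GetInvoiceCommand(id).execute()` (a hypothetical command), because a command instance can only be executed once. The backoff addresses the first two risks listed above; the total time across attempts still has to fit within your SLA.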
Minimize the use of thread local
§ Not explicit to the caller.
§ Thread-local state (TLS) needs to be copied to the Hystrix worker thread and reset afterward (error-prone).
§ Implement callable wrapper if TLS is required.
§ https://github.com/Netflix/Hystrix/issues/92
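A callable wrapper of the kind mentioned above can be sketched with the JDK alone. `ContextCopyingCallable` and `REQUEST_ID` are hypothetical names; the thread local stands in for whatever request-scoped state the application keeps. Hystrix's plugin mechanism offers `HystrixConcurrencyStrategy.wrapCallable` as a place to apply such a wrapper, but wiring it in is left out here.

```java
import java.util.concurrent.Callable;

final class ContextCopyingCallable<T> implements Callable<T> {
    // Hypothetical request-scoped value that the application keeps in a thread local.
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    private final Callable<T> delegate;
    private final String callerValue;   // captured on the calling thread at construction time

    ContextCopyingCallable(Callable<T> delegate) {
        this.delegate = delegate;
        this.callerValue = REQUEST_ID.get();
    }

    @Override
    public T call() throws Exception {
        String previous = REQUEST_ID.get();   // whatever the pooled worker thread held before
        REQUEST_ID.set(callerValue);          // copy the caller's value onto the worker
        try {
            return delegate.call();
        } finally {
            // Reset so the value does not leak into the next task on this worker.
            if (previous == null) {
                REQUEST_ID.remove();
            } else {
                REQUEST_ID.set(previous);
            }
        }
    }
}
```

The wrapper must be constructed on the calling thread, since that is when the value is captured. Forgetting the reset in `finally` is the usual source of the errors mentioned above: pooled threads are reused, so a stale value would show up in an unrelated request.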
Important differences between Hystrix versions
§ 1.2.x
- Initial public release.
§ 1.3.x
- Added support for RxJava.
- Semaphore-based execution cannot be interrupted, but throws an exception if the SLA is exceeded (even if the call is successful!).
§ 1.4.x
- Rewritten based on RxJava.
- Semaphore-based execution can be interrupted by a separate background thread.
- HystrixTimeoutException is public (1.4.18+).
§ 1.5.x
- Re-architected metrics to support metric streaming (non-aggregate).
Different ways to implement Hystrix
§ Client library - Hystrix jar
- http://www.slideshare.net/MattJacobs11/using-hystrix-to-build-resilient-distributed-systems-58836753
§ Existing application – Javanica
- https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-javanica
§ New application - Spring Cloud
- http://cloud.spring.io/spring-cloud-netflix/spring-cloud-netflix.html
Thanks!
billy_yuen@intuit.com

Velocity 2016 - Operational Excellence with Hystrix
