Velocity 2016 - Operational Excellence with Hystrix

Operational Excellence with Hystrix
Lessons Learned from QuickBooks Online (QBO)
Billy Yuen
Principal Engineer, Intuit

Intuit Mission:
To improve our customers’ financial lives so profoundly…
they can’t imagine going back to the old way
§ Consumers
§ Small businesses
§ Accounting professionals

A Premier Innovative Growth Company
§ Employees: 8,000+
§ Customers: 45M
§ Global offices: US, UK, India, Canada, Australia
§ Revenue: $4.2B
§ Founded: 1983
§ Public: 1993 (INTU)

Resiliency: Act tough to your enemy even if you are hurting
NORMAL CUSTOMER

And fix the problem quickly before your customers notice!

Why is resiliency hard to achieve?
§ Micro-services allow teams to move and innovate faster.
§ But failures are now distributed too!
§ New Architecture == New Set of Problems

Even Google has faced this issue!
§ On May 3, 2013 “From 6:26 PM to 7:58 PM PT, requests to most Google APIs resulted in
500 error response messages.”
§ “The combination of the bug and configuration error quickly caused all of the
serving threads to be consumed. Traffic was permanently queued waiting
for a serving thread to become available.”
§ http://googledevelopers.blogspot.com/2013/05/google-api-infrastructure-outage_3.html

Challenges we face
§ Cascading failures from dependent systems.
§ Very little instrumentation and monitoring for the dependent services.
§ Decomposition from Monolith to micro-services.
- Design consideration for auto recovery from latency and dependency failure.
- Testing for latency and dependency failure.
§ Impossible to achieve 99.9% uptime
- If there are N dependencies and each has 99.9% uptime, the best-case uptime is 99.9% ^ N
(99.9% ^ 10 ≈ 99%).
- If a dependency is down for X minutes, our recovery time will be X + Y minutes (Y covers
detection and server restart).
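The uptime arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this example, and the 30-day month is an assumption used to turn availability into minutes of downtime.

```java
// Sketch: best-case availability when a request needs ALL of its dependencies.
// Names here are illustrative; nothing in this block comes from Hystrix.
final class CompoundUptime {

    /** Best-case availability when each of n dependencies must be up. */
    static double compound(double perDependency, int n) {
        return Math.pow(perDependency, n);
    }

    /** Expected downtime in minutes over an assumed 30-day month. */
    static double downtimeMinutesPerMonth(double availability) {
        double minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
        return (1.0 - availability) * minutesPerMonth;
    }

    public static void main(String[] args) {
        double single = 0.999;
        double ten = compound(single, 10); // about 0.990, the "99%" on the slide
        System.out.printf("1 dependency:    %.4f, %.0f min/month down%n",
                single, downtimeMinutesPerMonth(single));
        System.out.printf("10 dependencies: %.4f, %.0f min/month down%n",
                ten, downtimeMinutesPerMonth(ten));
    }
}
```

Ten dependencies at 99.9% each therefore cost roughly ten times the downtime of a single one, before any detection and restart time (the Y above) is added.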

How does Hystrix help?
§ Implements the “Circuit Breaker” pattern
- Automatic fail fast
- Automatic fail over (fallback)
- Protocol agnostic
§ Created by an engineer to make his life better during production support!
§ Defend against your dependencies (Trust but verify)!
§ Battle tested with billions of requests per day.
§ Metrics Generation
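To make "fail fast" and "fallback" concrete, here is a deliberately small circuit breaker in plain Java. It is a sketch, not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window, whereas this version counts consecutive failures to stay short. The class name, the threshold and the sleep window are all assumptions for illustration.

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch (NOT the Hystrix implementation).
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fail fast before a trial call
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen(long now) {
        return openedAt >= 0 && now - openedAt < openMillis;
    }

    /** Runs primary unless the circuit is open; uses fallback on failure or when open. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        long now = System.currentTimeMillis();
        if (isOpen(now)) {
            return fallback.get(); // fail fast: the dependency is not called at all
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(now);
            return fallback.get(); // fail over
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // a successful (trial) call closes the circuit
    }

    private synchronized void onFailure(long now) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = now; // open, or re-open after a failed trial call
        }
    }
}
```

Once the threshold is reached, callers get the fallback immediately and the struggling dependency receives no traffic until the sleep window expires, which is what stops one slow service from consuming every serving thread.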

Hystrix at work (fail fast)
Health Check

Overcoming the Organizational Challenge – feature, feature, feature…
§ Growing pain and opportunity – QBO experienced big growth and uptime became a
bigger issue.
- A dependency network issue made QBO unresponsive.
- A cascading failure from a disk failure in one component made QBO clusters unresponsive.
§ Action over plan – Implemented Hystrix and monitoring in the Payment API to demonstrate
the vision.

Our Journey to implement Hystrix
§ Apply Hystrix to legacy code
§ Drift detection
§ Failure Testing
§ Real time monitoring
§ Production troubleshooting with historical data
§ Production support process

Apply Hystrix to existing applications
§ Challenges
- We don’t know all the network calls. Many calls are also buried in client libraries.
- New Relic can provide the URL for the remote service, but not the stack trace.
§ Solution - Hystrix Network Audit Agent
- Java Agent to detect any “naked” network call.
- Works for both socket and NIO calls.
- https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent

Learning from applying Hystrix Agent in QBO
§ Remove network calls to systems that are no longer in use.
- Analogous to “Reduce Attack surface” in security.
- Remove unnecessary network calls to “Reduce Dependency Vulnerability”.
§ Remove Heart Beat to dependent systems.
§ Remove Custom fallback logic.

Drift detection
§ Integrated Hystrix audit agent with regression build.
§ Report all unprotected calls in self-service dashboard.

Failure testing
§ Mindset changes
- Developer - Coding for resiliency
- Quality - Testing for dependency failure
§ Failure testing tools
- Hystrix’s circuitBreaker.forceOpen property
- Unit and integration tests with WireMock - HTTP only
- Custom proxy (man in the middle) - Any protocol
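The idea behind the custom proxy can be approximated in-process for unit tests. The sketch below wraps a dependency call and lets a test switch it between passing through, failing, and adding latency. It is a stand-in for the network-level proxy described above, not a description of the tool used at Intuit, and `FaultInjector` is a hypothetical name.

```java
import java.util.function.Supplier;

// In-process fault injector for tests: a stand-in for a man-in-the-middle proxy.
final class FaultInjector<T> implements Supplier<T> {
    enum Mode { PASS_THROUGH, FAIL, DELAY }

    private final Supplier<T> target;
    private volatile Mode mode = Mode.PASS_THROUGH;
    private volatile long delayMillis = 0;

    FaultInjector(Supplier<T> target) {
        this.target = target;
    }

    void failAll() { mode = Mode.FAIL; }

    void delayAll(long millis) {
        delayMillis = millis;
        mode = Mode.DELAY;
    }

    void reset() { mode = Mode.PASS_THROUGH; }

    @Override
    public T get() {
        switch (mode) {
            case FAIL:
                // Simulates the dependency being down.
                throw new IllegalStateException("injected dependency failure");
            case DELAY:
                // Simulates latency before the real call goes through.
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during injected delay", e);
                }
                return target.get();
            default:
                return target.get();
        }
    }
}
```

A test can call `failAll()` or `delayAll(...)` and then assert that the caller returns its fallback within the expected time, which covers both failure and latency testing.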

Real time monitoring - architecture
§ Instance - publish Hystrix Metrics stream (provided by Hystrix).
§ Turbine – aggregate all streams in one cluster.

Real time monitoring - dashboard
§ Hystrix Dashboard - Rolling window of last ten seconds of dependency health
§ https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard

Real time monitoring - metrics
§ Circuit Breaker Status
§ Success, Failure and Timeout Count
§ Response time
§ Fallback status
§ Thread pool status
§ https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring

Production troubleshooting with historical data
§ Netflix Atlas - https://github.com/Netflix/atlas/wiki/Overview
§ Intuit – export the Hystrix Metrics to Splunk
- Pinpoint production issue root cause.
- Measurement for actual latency (Network + Service).
- Troubleshoot random “blip”. What is “blip”?

Splunk Hystrix Dashboard
§ Aggregate data by any timeframe
§ Drill down to specific Hystrix command for detail.

Production support process
§ Configure Alerts around critical Hystrix Commands in Splunk
§ Integrate Alerts with PagerDuty.
§ Use Hystrix Dashboard to review real time system health.
§ Use Splunk to debug production issues
Alert → Page → Review → Debug

Other Considerations
§ Is the call necessary?
§ Can you create fallback for read?
§ Client does not need to know it is from fallback (except fragment caching)
§ Can you handle offline write?
§ Idempotence for write (or verify before calling again)
§ Limit blast radius with thread pool or semaphore.
§ Adjust the Hystrix default parameter to your use case.
§ Fallback needs to be fast and highly available.
§ Timeout vs Failure
§ Not applicable to stream or batch.

Thread pool vs Semaphore
§ Thread pool
- Guaranteed timeout by using thread interrupt.
- More memory required.
- Support Concurrent programming (Future is non-blocking).
§ Semaphore
- No timeout guarantee (your code will assume the timeout responsibility).
- Set Hystrix Timeout to be greater than your timeout to avoid false timeout.
- Should throw HystrixTimeoutException if client code throws timeout exception.
- Scales better for large pools.
§ HystrixObservableCommand (NIO) can scale even better in some cases.
§ https://github.com/billyy/Hystrix-Tutorial
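The trade-off between the two isolation styles can be sketched with JDK concurrency primitives. This is an illustration of the mechanics only, not Hystrix code: `IsolationSketch` and its methods are hypothetical, and real Hystrix adds metrics, circuit state and configuration on top.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the two isolation styles using only the JDK.
final class IsolationSketch {

    /** Thread-pool style: the caller waits at most timeoutMillis; the worker is interrupted on timeout. */
    static <T> T withThreadPool(ExecutorService pool, Callable<T> work,
                                long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the work itself failed
        }
    }

    /** Semaphore style: runs on the caller's thread; caps concurrency but cannot enforce a timeout. */
    static <T> T withSemaphore(Semaphore permits, Callable<T> work, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // reject immediately when saturated
        }
        try {
            return work.call(); // any timeout must come from the work itself
        } catch (Exception e) {
            return fallback;
        } finally {
            permits.release();
        }
    }
}
```

The thread-pool variant pays for extra threads but bounds how long the caller waits. The semaphore variant is cheaper, but a call that hangs keeps the caller's thread, which is why the client's own timeout matters there.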

E2E experience including UI
§ Focus on critical workflows!
§ QBO
- Provide a degraded experience if a dependency is not available.
- Payroll is down.
- Invoice without attachment service.
- Allow login if the subscription service is temporarily unavailable.
§ Payments Service
- Limit dependency calls to only applicable use case.
- Cache read-only data with last known good value.
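A "last known good value" cache for read-only data could look like the sketch below. It is an assumption about the general shape of such a fallback, not the Payments Service code; `LastKnownGoodCache` and the loader function are hypothetical names.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch: serve the last successfully loaded value when the dependency fails.
final class LastKnownGoodCache<K, V> {
    private final ConcurrentMap<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // the remote read; may throw

    LastKnownGoodCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns a fresh value if possible, else the last good one, else empty. */
    Optional<V> get(K key) {
        try {
            V fresh = loader.apply(key);
            if (fresh != null) {
                lastGood.put(key, fresh); // remember every successful read
            }
            return Optional.ofNullable(fresh);
        } catch (RuntimeException e) {
            // Dependency failed: fall back to whatever was read last, if anything.
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```

This only suits data where a stale answer is acceptable. The sketch never expires entries, so a real version would also need a policy for how old a value may get.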

Best Practices
§ Don’t implement fallback if you don’t have a fallback.
§ Don’t call another Hystrix Command from catch.
§ Good and Bad use of fallback.
§ Throw HystrixTimeoutException if your code is doing timeout.
§ Update your catch block once you have implemented Hystrix.
§ Avoid the use of thread local.
§ Retry only if necessary.

Don’t implement fallback if you don’t have a fallback

Don’t call another Hystrix Command from catch

Throw HystrixTimeoutException if your code is doing timeout
§ Hystrix will report all exceptions as failure.
§ Timeout will indicate a potential issue with dependency instead of coding issue.
§ Need to configure Hystrix timeout > your timeout.

Update your catch block once you have implemented Hystrix

Retry only if necessary
§ Hystrix does NOT support retry.
§ Retry code inside Hystrix Command will skew your metrics.
§ Bad things can happen
- Good chance that the immediate retry could also time out.
- Overload the dependency.
- Exceed your SLA.
§ If you have to retry, implement the retry outside of Hystrix.
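A retry that lives outside the protected call might look like the following sketch. The `protectedCall` supplier stands for one full execution of a command (for example a lambda that runs a Hystrix command), so each attempt is measured separately. The helper name, the attempt limit and the linear backoff are assumptions for illustration.

```java
import java.util.function.Supplier;

// Sketch: bounded retry wrapped AROUND the protected call, never inside it.
final class RetryOutside {

    /** Tries protectedCall up to maxAttempts times, backing off between attempts. */
    static <T> T callWithRetry(Supplier<T> protectedCall, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return protectedCall.get(); // one attempt = one measured execution
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoffMillis * attempt); // back off instead of retrying immediately
                }
            }
        }
        throw last; // all attempts failed; surface the last error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

Keeping the attempt count small and the backoff non-zero addresses the risks listed above: it avoids hammering a dependency that is already struggling, and the worst-case latency stays easy to compute against the SLA.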

Minimize the use of thread local
§ Not explicit to the caller.
§ TLS needs to be copied to the Hystrix worker thread and reset afterward (error-prone).
§ Implement callable wrapper if TLS is required.
§ https://github.com/Netflix/Hystrix/issues/92
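The callable-wrapper idea can be shown with plain JDK types: capture the thread-local value on the calling thread, install it on the worker thread for the duration of the task, then restore what was there before. In Hystrix the hook for plugging in such a wrapper is `HystrixConcurrencyStrategy.wrapCallable`; the sketch below does not use Hystrix, and `ContextPropagation` and `REQUEST_ID` are hypothetical names.

```java
import java.util.concurrent.Callable;

// Sketch: copy a thread-local value onto the worker thread and clean up afterward.
final class ContextPropagation {
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    /** Must be called on the caller's thread so the current value can be captured. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = REQUEST_ID.get(); // read on the caller thread
        return () -> {
            String previous = REQUEST_ID.get();   // whatever the worker thread held
            REQUEST_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Reset so a pooled thread never leaks this request's value.
                if (previous == null) {
                    REQUEST_ID.remove();
                } else {
                    REQUEST_ID.set(previous);
                }
            }
        };
    }
}
```

The `finally` block is the part that is easy to forget. Without it, a pooled worker thread would carry one request's context into the next request it serves.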

Important differences in Hystrix version
§ 1.2.x
- Initial public release.
§ 1.3.x
- Add support for RxJava.
- Semaphore based execution cannot be interrupted but would throw exception if SLA is exceeded
(even if the call is successful!).
§ 1.4.x
- Rewritten based on RxJava.
- Semaphore based execution can be interrupted by a separate background thread.
- HystrixTimeoutException is public (1.4.18+).
§ 1.5.x
- Re-architecture of the metrics to support metric streaming (non-aggregate).

Different ways to implement Hystrix
§ Client library - Hystrix jar
- http://www.slideshare.net/MattJacobs11/using-hystrix-to-build-resilient-distributed-systems-58836753
§ Existing application – Javanica
- https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-javanica
§ New application - Spring Cloud
- http://cloud.spring.io/spring-cloud-netflix/spring-cloud-netflix.html
