Using Hystrix to Build Resilient Distributed Systems
1. Using Hystrix to Build
Resilient Distributed
Systems
Matt Jacobs
Netflix
2. Who am I?
• Edge Platform team at Netflix
• In a 24/7 Support Role for Netflix production
• (Not 365/24/7)
• Maintainer of Hystrix OSS
• github.com/Netflix/Hystrix
• I get a lot of internal questions too
3. • Video on Demand
• 1000s of device types
• 1000s of microservices
• Control Plane completely on AWS
• Competing against products which have high availability
(network / cable TV)
4. • 2013 – now
• from 40M to 75M+ paying customers
• from 40 to 190+ countries (#netflixeverywhere)
• In original programming, from House of Cards to hundreds of hours
a year, in documentaries / movies / comedy specials/ TV series
5. Edge Platform @ Netflix
• Goal: Maximize availability of Netflix API
• API used by all customer devices
6. Edge Platform @ Netflix
• Goal: Maximize availability of Netflix API
• API used by all customer devices
• How?
• Fault-tolerant by design
• Operational visibility
• Limit manual intervention
• Understand failure modes before they happen
7. Hystrix @ Netflix
• Fault-tolerance pattern as a library
• Provides operational insights in real-time
• Automatic load-shedding under pressure
• Initial design/implementation by Ben Christensen
8. Scope of this talk
• User-facing request-response system that uses
blocking I/O
• NOT batch system
• NOT nonblocking I/O
• (Hystrix can be used in those, though…)
10. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
11. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
13. Distributed Systems
• Write an application and deploy!
• Add a redundant system for resiliency (no SPOF)
• Break out the state
• Break out the function (microservices)
• Add more machines to scale horizontally
15. Fallacies of Distributed Computing
• Network is reliable
• Latency is zero
• Bandwidth is infinite
• Network is secure
• Topology doesn’t change
• There is 1 administrator
• Transport cost is 0
• Network is homogenous
16. Fallacies of Distributed Computing
• Network is reliable
• Latency is zero
• Bandwidth is infinite
• Network is secure
• Topology doesn’t change
• There is 1 administrator
• Transport cost is 0
• Network is homogenous
17. Fallacies of Distributed Computing
• In the cloud, other services are reliable
• In the cloud, latency is zero
• In the cloud, bandwidth is infinite
• Network is secure
• Topology doesn’t change
• In the cloud, there is 1 administrator
• Transport cost is 0
• Network is homogenous
28. There is 1 administrator
Source: http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
29. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
30. Operating a Distributed System
• How can I stop going to incident reviews?
• Which occur every time we cause customer pain
31. Operating a Distributed System
• How can I stop going to incident reviews?
• Which occur every time we cause customer pain
• We must assume every other system can fail
32. Operating a Distributed System
• How can I stop going to incident reviews?
• Which occur every time we cause customer pain
• We must assume every other system can fail
• We must understand those failure modes and prevent
them from cascading
40. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//TODO What to return here?
}
41. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//static value
return null;
}
42. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//value from memory
return anonymousCustomer;
}
43. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//value from cache
return cache.getCustomer(id);
}
44. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//value from service
return customerViaSecondSvc(id);
}
45. How to handle Errors
try {
return getCustomer(id);
} catch (Exception ex) {
//explicitly rethrow
throw ex;
}
46. Handle Errors with Fallbacks
• Some options for fallbacks
• Static value
• Value from in-memory
• Value from cache
• Value from network
• Throw
• Make error-handling explicit
• Applications have to work in the presence of either
fallbacks or rethrown exceptions
48. Exposure to failures
• As your app grows, your set of dependencies is much
more likely to get bigger, not smaller
• Overall uptime = (Dep uptime)^(num deps)

                Dependency uptime
  Dependencies  99.99%   99.9%   99%
  5             99.95%   99.5%   95%
  10            99.9%    99%     90.4%
  25            99.75%   97.5%   77.8%
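The table above is straightforward to reproduce; a tiny helper (the class and method names are made up for illustration) makes the compounding effect concrete, assuming dependencies fail independently:

```java
// Illustrative helper for the math on this slide: overall uptime is each
// dependency's uptime raised to the number of dependencies, assuming
// independent failures.
public class Uptime {
    // Probability that all 'deps' dependencies are up at the same time.
    public static double overall(double depUptime, int deps) {
        return Math.pow(depUptime, deps);
    }

    public static void main(String[] args) {
        // 10 dependencies at 99.9% each -> roughly 99% overall
        System.out.printf("%.4f%n", overall(0.999, 10));
    }
}
```

Note how quickly "three nines" per dependency erodes: at 25 dependencies you are already below 98% overall.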
49. If you don’t plan for Latency
• First-order effect: outbound latency increase
• Second-order effect: increase resource usage
50. If you don’t plan for Latency
• First-order effect: outbound latency increase
51. If you don’t plan for Latency
• First-order effect: outbound latency increase
52. Queuing theory
• Say that your application is making 100RPS of a network
call and mean latency is 10ms
53. Queuing theory
• Say that your application is making 100RPS of a network
call and mean latency is 10ms
• The utilization of I/O threads has a distribution over time –
according to Little’s Law, the mean is 100 RPS * (1/100 s)
= 1
54. Queuing theory
• Say that your application is making 100RPS of a network
call and mean latency is 10ms
• The utilization of I/O threads has a distribution over time –
according to Little’s Law, the mean is 100 RPS * (1/100 s)
= 1
• If mean latency increased to 100ms, mean utilization
increases to 10
55. Queuing theory
• Say that your application is making 100RPS of a network
call and mean latency is 10ms
• The utilization of I/O threads has a distribution over time –
according to Little’s Law, the mean is 100 RPS * (1/100 s)
= 1
• If mean latency increased to 100ms, mean utilization
increases to 10
• If mean latency increased to 1000ms, mean utilization
increases to 100
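The arithmetic on these slides is Little's Law (L = λW); a minimal sketch with illustrative names:

```java
// Little's Law: mean number of in-flight requests (i.e. mean I/O thread
// utilization) equals arrival rate times mean latency.
public class LittlesLaw {
    public static double meanInFlight(double requestsPerSecond,
                                      double meanLatencySeconds) {
        return requestsPerSecond * meanLatencySeconds;
    }

    public static void main(String[] args) {
        // 100 RPS at 10ms, 100ms, 1000ms mean latency:
        System.out.println(meanInFlight(100, 0.010)); // about 1
        System.out.println(meanInFlight(100, 0.100)); // about 10
        System.out.println(meanInFlight(100, 1.000)); // about 100
    }
}
```

The point: a 100x latency regression in a dependency means 100x more threads tied up in your service, with request rate unchanged.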
60. Bound resource utilization
• Don’t take on too much work globally
• Don’t let any single dependency take too many resources
61. Bound resource utilization
• Don’t take on too much work globally
• Don’t let any single dependency take too many resources
• Bound the concurrency of each dependency
62. Bound resource utilization
• Don’t take on too much work globally
• Don’t let any single dependency take too many resources
• Bound the concurrency of each dependency
• What should we do if we’re at that threshold?
• We’ve already designed a fallback.
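One simple way to bound per-dependency concurrency is a semaphore guard that goes straight to the fallback when the limit is reached. This is a plain-Java sketch of the idea, not Hystrix’s actual implementation (Hystrix bounds concurrency with thread pools or semaphores internally); all names here are illustrative:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class BoundedDependency<T> {
    private final Semaphore permits;
    private final Supplier<T> call;     // the real dependency call
    private final Supplier<T> fallback; // the fallback we already designed

    public BoundedDependency(int maxConcurrent, Supplier<T> call, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.call = call;
        this.fallback = fallback;
    }

    public T execute() {
        // At the threshold? Don't queue, don't block: use the fallback.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

With `tryAcquire()` instead of `acquire()`, excess load is shed immediately rather than queued, so one slow dependency cannot pile up waiting threads.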
63. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
64. Hystrix Goals
• Handle Errors with Fallbacks
• Handle Latency with Timeouts
• Bound Resource Utilization
65. Handle Errors with Fallback
• We need something to execute
• We need a fallback
66. Handle Errors with Fallback
class CustLookup extends HystrixCommand<Customer> {
    @Override
    public Customer run() {
        return svc.getOverHttp(customerId);
    }
}
67. Handle Errors with Fallback
class CustLookup extends HystrixCommand<Customer> {
    @Override
    public Customer run() {
        return svc.getOverHttp(customerId);
    }
    @Override
    public Customer getFallback() {
        return Customer.anonymousCustomer();
    }
}
68. Handle Errors with Fallback
class CustLookup extends HystrixCommand<Customer> {
    private final CustomersService svc;
    private final long customerId;
    public CustLookup(CustomersService svc, long id) {
        this.svc = svc;
        this.customerId = id;
    }
    //run() : has references to svc and customerId
    //getFallback()
}
69. Handle Latency with Timeouts
• We need a timeout value
• And a timeout mechanism
70. Handle Latency with Timeouts
class CustLookup extends HystrixCommand<Customer> {
    private final CustomersService svc;
    private final long customerId;
    public CustLookup(CustomersService svc, long id) {
        // Slide shorthand was "super(timeout = 100)"; in real Hystrix the
        // timeout is configured via the Setter (group key name illustrative):
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Customers"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(100)));
        this.svc = svc;
        this.customerId = id;
    }
    //run(): has references to svc and customerId
    //getFallback()
}
72. Handle Latency with Timeouts
class CustLookup extends HystrixCommand<Customer> {
    @Override
    public Customer run() {
        return svc.getOverHttp(customerId);
    }
    @Override
    public Customer getFallback() {
        return Customer.anonymousCustomer();
    }
}
73. Handle Latency with Timeouts
class CustLookup extends HystrixCommand<Customer> {
    @Override
    public Customer run() {
        return svc.nowThisThreadIsMine(customerId);
    }
    @Override
    public Customer getFallback() {
        return Customer.anonymousCustomer();
    }
}
74. Handle Latency with Timeouts
• What if run() does not respect your timeout?
• We can’t control the I/O library
• We can put it on a different thread to control its impact
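Slide 74’s point — when you can’t control the I/O library, run it on another thread and bound only your wait — can be sketched with a plain `ExecutorService`. This shows the idea, not Hystrix’s actual code; all names are illustrative:

```java
import java.util.concurrent.*;

public class TimeoutWrapper {
    // Daemon threads so the JVM can still exit even if a call is stuck.
    private static final ExecutorService pool = Executors.newFixedThreadPool(10,
            r -> { Thread t = new Thread(r); t.setDaemon(true); return t; });

    // Run the I/O call on a pool thread; if it doesn't answer within
    // timeoutMs, abandon it and return the fallback instead.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true); // interrupt the runaway call if possible
            return fallback;
        }
    }
}
```

Even if the library ignores interruption, the caller’s thread is freed: the runaway work is contained in the bounded pool instead of blocking request threads.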
84. Circuit-Breaker
• We can build a feedback loop
• If a command has a high error rate in the last 10 seconds, it’s more likely to fail right now
• Fail fast now – don’t spend my resources
• Give the downstream system some breathing room
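The feedback loop can be sketched in a few lines. This is a deliberately minimal breaker — Hystrix’s real one uses a rolling 10-second window, percentage thresholds, and half-open probe requests — and every name here is illustrative:

```java
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;        // breathing room for the downstream system
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means closed

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Fail fast: while open, don't spend any resources on the call.
    public synchronized boolean allowRequest() {
        if (openedAt < 0) return true;                    // closed: proceed
        if (System.currentTimeMillis() - openedAt > openMillis) {
            openedAt = -1;                                // window elapsed: try again
            consecutiveFailures = 0;
            return true;
        }
        return false;                                     // open: fail fast
    }

    public synchronized void recordSuccess() { consecutiveFailures = 0; }

    public synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();        // trip the breaker
        }
    }
}
```

A caller checks `allowRequest()` before each command; a `false` answer means "go straight to the fallback".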
87. Hystrix as a Decision Tree
• Compose all of the above behaviors
• Now you’ve got a library!
88. SUCCESS
class CustLookup extends HystrixCommand<Customer> {
    @Override
    public Customer run() {
        return svc.getOverHttp(customerId); //succeeds
    }
    @Override
    public Customer getFallback() {
        return Customer.anonymousCustomer();
    }
}
105. Fallback Handling
• What are the basic error modes?
• Failure
• Latency
• We need to be aware of failures in the fallback
• No automatic recovery provided – a single fallback is enough
106. Fallback Handling
• What are the basic error modes?
• Failure
• Latency
• We need to be aware of failures in the fallback
• We need to protect ourselves from latency in the fallback
• Add a semaphore to the fallback to bound concurrency
117. Hystrix as a building block
• Now we have bounded latency and concurrency, along
with a consistent fallback mechanism
• In practice within the Netflix API, we have:
• ~250 commands
• ~90 thread pools
• 10s of billions of command executions / day
118. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
120. CustomerCommand
• Takes id as arg
• Makes HTTP call to service
• Uses anonymous user as fallback
• Customer has no personalization
121. RatingsCommand
• Takes Customer as argument
• Makes HTTP call to service
• Uses empty list for fallback
• Customer has no ratings
122. RecommendationsCommand
• Takes Customer as argument
• Makes HTTP call to service
• Uses precomputed list for fallback
• Customer gets unpersonalized recommendations
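Slides 120–122 describe the three commands in prose; slide 121’s RatingsCommand, for example, follows this shape. The sketch below uses plain Java instead of extending `HystrixCommand`, and the `RatingsService` interface and its method name are assumptions, not the deck’s actual code:

```java
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of slide 121's RatingsCommand pattern.
// The real version extends HystrixCommand; RatingsService is hypothetical.
class RatingsCommand {
    interface RatingsService { List<String> getRatingsOverHttp(String customerId); }

    private final RatingsService svc;
    private final String customerId;

    RatingsCommand(RatingsService svc, String customerId) {
        this.svc = svc;
        this.customerId = customerId;
    }

    List<String> run() { return svc.getRatingsOverHttp(customerId); }

    // Fallback: empty list - the customer simply has no ratings.
    List<String> getFallback() { return Collections.emptyList(); }

    List<String> execute() {
        try { return run(); } catch (Exception e) { return getFallback(); }
    }
}
```

CustomerCommand and RecommendationsCommand differ only in their fallback values: an anonymous customer and a precomputed list, respectively.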
123. Fallbacks in this example
• Fallbacks vary in importance
• Customer has no personalization
• Customer gets unpersonalized recommendations
• Customer has no ratings
130. HystrixCommand.queue()
• Does not block application thread
• Application thread can launch multiple HystrixCommands
• Returns java.util.concurrent.Future to application thread
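The fan-out that `queue()` enables — launch several commands, block only when the results are needed — can be sketched with the stdlib’s `CompletableFuture` (the real `queue()` returns a `java.util.concurrent.Future`; the lookups here are placeholders):

```java
import java.util.concurrent.CompletableFuture;

public class FanOut {
    // Launch both lookups without blocking the application thread,
    // then join only when both results are actually needed.
    static String lookupBoth() {
        CompletableFuture<String> customer =
                CompletableFuture.supplyAsync(() -> "customer-42");
        CompletableFuture<String> ratings =
                CompletableFuture.supplyAsync(() -> "ratings-for-42");
        return customer.join() + " / " + ratings.join();
    }

    public static void main(String[] args) {
        System.out.println(lookupBoth()); // customer-42 / ratings-for-42
    }
}
```

Because both calls are in flight at once, total latency approaches the slower of the two instead of their sum.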
137. HystrixCommand.observe()
• Does not block application thread
• Returns rx.Observable to application thread
• RxJava – reactive stream
• reactivex.io
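The difference from `queue()` is push versus pull: a `Future` is asked for its result, while an `Observable` delivers results to its subscriber. The shape can be sketched with a plain callback — RxJava adds composition, operators, and multi-value streams on top — and all names here are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class PushStyle {
    // Pull style (queue()): the caller asks the Future for the result.
    // Push style (observe()): results are delivered to the subscriber.
    static CompletableFuture<Void> observe(Consumer<String> onNext, Runnable onCompleted) {
        return CompletableFuture.runAsync(() -> {
            onNext.accept("customer-42"); // value pushed to the subscriber
            onCompleted.run();            // stream completes
        });
    }

    public static void main(String[] args) {
        observe(v -> System.out.println("got " + v),
                () -> System.out.println("completed"))
            .join(); // join only so the demo finishes printing before exit
    }
}
```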
148. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
149. Hystrix Metrics
• Hystrix provides a nice semantic layer to model
• Abstracts details of I/O mechanism
• HTTP?
• UDP?
• Postgres?
• Cassandra?
• Memcached?
• …
150. Hystrix Metrics
• Hystrix provides a nice semantic layer to model
• Abstracts details of I/O mechanism
• Can capture event counts and event latencies
151. Hystrix Metrics
• Can be aggregated for historical time-series
• hystrix-servo-metrics-publisher
Source: http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
160. Taking Action on Hystrix Metrics
• Circuit Breaker opens/closes
• Alert/Page on Error %
• Alert/Page on Latency
• Use in canary analysis
161. How the system should behave
~100 RPS Success
~0 RPS Fallback
162. How the system should behave
~90 RPS Success
~10 RPS Fallback
163. How the system should behave
~5 RPS Success
~95 RPS Fallback
164. How the system should behave
~100 RPS Success
~0 RPS Fallback
167. How the system should behave
• What happens if mean latency goes from 10ms -> 30ms?
• Increased timeouts
[charts: request latency distribution, before vs. after]
171. How the system should behave
• What happens if mean latency goes from 10ms -> 30ms at 100 RPS?
• By Little’s Law, mean utilization: 1 -> 3
• Increased thread pool rejections
[charts: thread pool utilization, before vs. after]
172. How the system should behave
~84 RPS Success
~16 RPS Fallback
173. How the system should behave
~48 RPS Success
~52 RPS Fallback
178. Lessons learned
• Everything we measure has a distribution (once you add time)
• Queueing theory can teach us a lot
• Errors are more frequent, but latency is more consequential
• 100% success rates are not what you want!
• You don’t know if your fallbacks work
• You have outliers, and they’re not being failed
• Visualization (especially realtime) goes a long way
179. Goals of this talk
• Philosophical Motivation
• Why are distributed systems hard?
• Practical Motivation
• Why do I keep getting paged?
• Solving those problems with Hystrix
• How does it work?
• How do I use it in my system?
• How should a system behave if I use Hystrix?
• What? Netflix was down for me – what happened there?
180. Areas Hystrix doesn’t cover
• Traffic to the system
• Functional bugs
• Data issues
• AWS
• Service Discovery
• Etc…
184. Hystrix doesn’t help if...
• Your fallbacks fail
• Upstream systems do not tolerate fallbacks
• Resource bounds are too loose
• I/O calls don’t use Hystrix
190. Unusable fallbacks
• Even if fallback succeeds, upstream systems/UIs need to
handle them
• What happens when UI receives anonymous customer?
• Need integration tests that expect fallback data
198. Unwrapped I/O calls
• Any place where user traffic triggers I/O is dangerous
• hystrix-network-auditor-agent finds all stack traces that:
• Are on a Tomcat thread
• Do network work
• Don’t use Hystrix
199. How to Prove our Resilience
“Trust, but verify”
202. How to Prove our Resilience
• Get lucky and have a real outage
• Hopefully our system reacts the way we expect
• Cause the outage we want to protect against
207. Failure Injection Testing
• Inject failures at any layer
• Hystrix is a good one
• Scope (region / % traffic / user / device)
• Scenario (fail?, add latency?)
211. Failure Injection Testing
• Inject error and observe fallback successes
• Inject error and observe device handling fallbacks
• Inject latency and observe thread-pool rejections and
timeouts
• We have an off-switch in case this goes poorly!
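A FIT-style injection point can be sketched as a wrapper consulted before each call; the scenario selector and the off-switch are the essential pieces. This is an illustration of the idea, not Netflix’s FIT implementation, and all names are hypothetical:

```java
import java.util.concurrent.Callable;

public class FailureInjection {
    enum Scenario { NONE, FAIL, ADD_LATENCY }

    private volatile boolean enabled = true;          // the off-switch
    private volatile Scenario scenario = Scenario.NONE;

    void setScenario(Scenario s) { this.scenario = s; }
    void disable() { this.enabled = false; }          // in case this goes poorly

    // Wrap a real call; inject the configured failure mode first.
    <T> T call(Callable<T> real) throws Exception {
        if (enabled) {
            switch (scenario) {
                case FAIL:        throw new RuntimeException("FIT: injected failure");
                case ADD_LATENCY: Thread.sleep(200); break; // injected latency
                case NONE:        break;
            }
        }
        return real.call();
    }
}
```

Injecting at this layer lets you watch exactly the signals the slides list: fallback successes on `FAIL`, and timeouts plus thread-pool rejections on `ADD_LATENCY`.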
216. Operational Summary
• Downstream transient errors cause fallbacks and a
degraded but not broken customer experience
• Fallbacks go away once errors subside
• Downstream latency increases load/latency, but not to a
dangerous level
• We have near realtime visibility into our system and all of
our downstream systems
217. References
• Sandvine internet usage report
• Fallacies of Distributed Computing
• Little’s Law
• Release It!
• Netflix Techblog: Atlas
• Netflix Techblog: FIT
• Netflix Techblog: Hystrix
• Netflix Techblog: How API became reactive
• Spinnaker OSS
• Antifragile
Add a callback to Simon’s keynote: design for modularity within a monolith. I agree completely, but fundamentally remote is different from local.
There is fundamentally more work being done here: serialization / deserialization / system calls / whatever a syscall does to your CPU caches / I/O (bounded by the speed of light).