Designing Services for
Resilience Experiments:
Lessons from Netflix
Nora Jones, Senior Chaos Engineer
@nora_js
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-microservices-resiliency
Presented at QCon San Francisco
www.qconsf.com
So, how can teams design services
for resilience testing?
● Failure Injection Enabled
● RPC enabled
● Fallback Paths
○ And ways to discover them
● Proper monitoring
○ Key business metrics to look for
● Proper timeouts
○ And ways to discover them
Known Ways to Increase
Confidence in Resilience
● Unit Tests
● Integration Tests
New Ways to Increase Confidence
in Resilience
● Chaos Experiments
SPS (Stream Starts Per Second): Key Business Metric
Chaos Engineering: Netflix’s ChAP
[Diagram: without an experiment, 100% of API traffic reaches Personalization through the main API cluster. During an experiment, the API Gateway sends 98% to the main API cluster, 1% to an API Control cluster, and 1% to an API Exp cluster.]
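The 1% / 1% / 98% split in the diagram can be illustrated with a small routing sketch. This is not ChAP's actual routing logic, which the slides do not show; the hash-based assignment, the function name `assign_cluster`, and the cluster labels are assumptions made for illustration.

```python
import hashlib


def assign_cluster(customer_id: str,
                   experiment_pct: float = 1.0,
                   control_pct: float = 1.0) -> str:
    """Deterministically map a customer to 'experiment', 'control', or 'main'.

    Hypothetical helper: hashing the customer ID keeps a given customer in
    the same group for the whole experiment.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes as a bucket in [0, 10000), i.e. 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    experiment_cutoff = round(experiment_pct * 100)
    control_cutoff = experiment_cutoff + round(control_pct * 100)
    if bucket < experiment_cutoff:
        return "experiment"
    if bucket < control_cutoff:
        return "control"
    return "main"
```

Keeping the control group the same size as the experiment group is what makes a comparison of a business metric between the two meaningful.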
Monitoring
SHORTED
1. Have Failure Injection
Testing Enabled.
Sample Failure Injection
Library
https://github.com/norajones/FailureInjectionLibrary
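Independent of the library linked above (whose API is not reproduced here), the idea of a failure injection point can be sketched in a few lines. The names `FaultConfig`, `InjectedFailure`, and `with_fault_injection` are hypothetical; the sketch assumes two failure modes, added latency and an outright failure.

```python
import random
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised when the injection point deliberately fails a call."""


@dataclass
class FaultConfig:
    fail_rate: float = 0.0      # probability in [0, 1] of failing the call
    delay_seconds: float = 0.0  # latency added before every call


def with_fault_injection(call, config, rng=None, sleep=time.sleep):
    """Wrap `call` so that it can be delayed or failed according to `config`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if config.delay_seconds > 0:
            sleep(config.delay_seconds)          # injected latency
        if rng.random() < config.fail_rate:
            raise InjectedFailure("injected failure")
        return call(*args, **kwargs)             # normal path

    return wrapped
```

With `FaultConfig()` left at its defaults the wrapper is a pass-through, so the injection point can stay in the code path permanently and only be activated for an experiment.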
Types of Chaos Failures
Criteria&API
Automating Creation of Chaos
Experiments
2. Have Good Monitoring in
Place for Configuration
Changes.
Have Good Monitoring in Place
● RPC Enabled
○ Associated Hystrix Commands
■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!
RPC/Ribbon
● Java library managing REST clients to/from different services
● Fast failing/fallback capability
RPC/Ribbon Timeouts
RPC Timeouts
At what point does the service give up?
Retries
Immediately retrying an operation after a failure is usually not a great idea.
Understand the logic between your timeouts and
your retries.
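One common alternative to retrying immediately is exponential backoff with jitter. The sketch below is an assumption about how that might look, not something taken from the talk; `backoff_delays` and `call_with_retries` are hypothetical names, and real code would retry only on errors known to be transient.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=2.0, rng=None):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]


def call_with_retries(operation, max_retries=2, base=0.1, cap=2.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying up to `max_retries` times with backoff."""
    delays = backoff_delays(max_retries, base, cap, rng)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise                  # retries exhausted: surface the error
            sleep(delays[attempt])     # wait before the next attempt
```

Every retry adds both its own timeout and its backoff delay to the caller's total wait, which is why timeouts and retries have to be reasoned about together.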
Circuit Breakers/Fallback Paths
Hystrix Commands/Fallback Paths
If your service is non-critical, ensure that there
are fallback paths in place.
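A circuit breaker with a fallback path can be sketched without any library. This is a simplified illustration of the pattern, not Hystrix's implementation; the class name, thresholds, and half-open behaviour are assumptions.

```python
import time


class CircuitBreaker:
    """Serve a fallback while the primary call keeps failing."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the reset window, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_seconds

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()                     # fail fast
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()     # open (or re-open)
            return fallback()
        self.failures = 0                         # success closes the circuit
        self.opened_at = None
        return result
```

A chaos experiment that fails `primary` on purpose is one way to confirm that `fallback` actually works before a real outage exercises it.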
Fallback Strategies
● Static Content
● Cache
● Fallback Service
Know what your fallback strategy is and how to get that information.
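One way to make the fallback strategy discoverable is to have the call report which path served the response. The sketch below assumes a cache fallback followed by static content; `fetch_with_fallbacks` and the source labels are hypothetical.

```python
def fetch_with_fallbacks(key, primary, cache, static_default):
    """Return (value, source) so the serving path is observable.

    Order assumed here: primary call, then last cached value, then static
    content.
    """
    try:
        value = primary(key)
        cache[key] = value          # refresh the cache on success
        return value, "primary"
    except Exception:
        if key in cache:
            return cache[key], "cache"
        return static_default, "static"
```

Emitting the `source` label as a metric shows how often each fallback is actually used.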
3. Ensure Synergy
between Hystrix
Timeouts, RPC timeouts,
and retry logic.
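A check for this kind of mismatch can be sketched as a small configuration lint. The rule below is an assumption about what "synergy" means in practice: the wrapping (Hystrix-style) timeout should cover the worst-case time of the RPC attempts it wraps, here taken as the RPC timeout times the number of attempts, ignoring backoff. Function and parameter names are hypothetical.

```python
def worst_case_rpc_ms(rpc_timeout_ms, max_retries):
    """Worst-case time spent in RPC attempts, ignoring backoff delays."""
    return rpc_timeout_ms * (max_retries + 1)


def check_timeout_synergy(wrapper_timeout_ms, rpc_timeout_ms, max_retries):
    """Return a list of findings; an empty list means no mismatch was found."""
    findings = []
    worst = worst_case_rpc_ms(rpc_timeout_ms, max_retries)
    if wrapper_timeout_ms < rpc_timeout_ms:
        findings.append(
            "wrapper timeout is shorter than a single RPC attempt; "
            "the RPC timeout can never fire")
    elif wrapper_timeout_ms < worst:
        findings.append(
            f"wrapper timeout ({wrapper_timeout_ms} ms) is shorter than "
            f"worst-case RPC time ({worst} ms); later retries may be cut off")
    return findings
```

For example, a 1000 ms wrapper timeout around a 400 ms RPC timeout with 2 retries is flagged, because the three attempts can take up to 1200 ms.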
ChAP’s Monocle
There isn’t always money in
microservices
Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality
Score
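The formula can be worked through in code. The slides do not give the actual RPS ranges, so the bucket boundaries below are made up for illustration, as are the function names.

```python
import bisect

# Hypothetical upper bounds (requests per second) for buckets 1 to 4;
# anything above the last bound falls into bucket 5.
RPS_BUCKET_BOUNDS = [10, 100, 1_000, 10_000]


def rps_bucket(rps):
    """Return a 1-based bucket index for the given requests per second."""
    return bisect.bisect_left(RPS_BUCKET_BOUNDS, rps) + 1


def criticality_score(rps, retries, hystrix_commands):
    """RPS range bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

Under these assumed buckets, a dependency at 500 RPS with 2 retries and 3 Hystrix commands scores 3 * 2 * 3 = 18. As written, the formula gives 0 for any dependency with no retries or no Hystrix commands.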
Chaos Success Stories
“We ran a chaos experiment which
verifies that our fallback path works
and it successfully caught an issue in
the fallback path and the issue was
resolved before it resulted in any
availability incident!”
“While [failing calls] we discovered an increase in
license requests for the experiment cluster even
though fallbacks were all successful. ...This likely
means that whoever was consuming the fallback
was retrying the call, causing an increase in
license requests.”
Don’t lose sight of your
company’s customers.
Takeaways
● Designing for resiliency testability is a shared
responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place on
antipatterns in configuration changes.
Questions?
@nora_js