Monkeys in Lab Coats
Applied Failure Testing Research at Netflix
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/failure-test-research-netflix
Presented at QCon London
www.qconlondon.com
The whole is greater than the sum of its parts.
- Aristotle
[Metaphysics]
The Professor vs The Practitioner
Peter Alvaro
Ex-Berkeley, Ex-Industry
Assistant Prof @ Santa Cruz
Misses the calm of PhD life
Likes prototyping stuff
Kolton Andrus
Ex-Netflix, Ex-Amazon
‘Chaos’ Engineer
Misses his actual pager
Likes breaking stuff
Measures of Success
Academic
H-Index
Grant war chest
Department ranking
Industry
Availability (e.g. 99.99% uptime)
Number of Incidents
Reduce Operational Burden
An Unlikely Team?
but ... it’s manual
Works Great!
Surely there is a better way ...
Free lunch?
The End?
(Academia + Industry)
Let’s build it
“Can we, pretty please?”
Freedom and Responsibility
Core Value
Responsibility
Academic Industry
Prove that it works
Show that it scales
Find real bugs
The Big Idea
Lineage Driven Fault Injection
What could possibly go wrong?
Consider computation
involving 100 services
Search Space:
2^100 executions
“Depth” of bugs
Single faults. Search Space: 100 executions
Combination of 4 faults. Search Space: 3M executions
Combination of 7 faults. Search Space: 16B executions
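The search-space figures on these slides follow from counting combinations: choosing k services to fail out of 100 gives C(100, k) executions, and allowing any subset to fail gives 2^100. The sketch below reproduces that arithmetic. The function name is illustrative and not from the talk. The exact count for 4 faults is 3,921,225, so the "3M" on the slide is a rounded figure.

```python
from math import comb

SERVICES = 100  # number of services in the computation, per the slides


def fault_combinations(n_services: int, depth: int) -> int:
    """Number of distinct ways to fail exactly `depth` of `n_services` services."""
    return comb(n_services, depth)


if __name__ == "__main__":
    for depth in (1, 4, 7):
        count = fault_combinations(SERVICES, depth)
        print(f"{depth} fault(s): {count:,} executions")
    # Every subset of services may fail, so the full space is 2^100.
    print(f"any subset: {2 ** SERVICES:,} executions")
```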
Random Search
Search Space:
2^100 executions
Engineer-guided Search
Search Space:
???
Fault-tolerance “is just” redundancy
But how do we know redundancy when we see it?
Hard question: “Could a bad thing ever happen?”
Easier: “Exactly why did a good thing happen?”
“What could have gone wrong?”
Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
[Figure: lineage graph. Nodes: "The write is stable", "Stored on RepA", "Stored on RepB", "Bcast1", "Bcast2", "Client"]
What would have to go wrong?
(RepA OR Bcast1)
What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
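The formula above can be checked by brute force. This is a minimal sketch, not the solver used in Molly: it treats each component as a boolean ("has it failed?"), enumerates all 16 assignments, keeps those that satisfy every clause, and then filters down to the minimal fault sets. The names `COMPONENTS`, `CLAUSES`, `breaking_fault_sets` and `minimal` are made up for illustration.

```python
from itertools import product

# Components that could fail, taken from the lineage graph on the slides.
COMPONENTS = ["RepA", "RepB", "Bcast1", "Bcast2"]

# One clause per support path: to break that path,
# at least one of the components on it has to fail.
CLAUSES = [
    ("RepA", "Bcast1"),
    ("RepA", "Bcast2"),
    ("RepB", "Bcast2"),
    ("RepB", "Bcast1"),
]


def breaking_fault_sets(components, clauses):
    """Return every set of failed components that satisfies all clauses."""
    solutions = []
    for bits in product([False, True], repeat=len(components)):
        failed = {c for c, down in zip(components, bits) if down}
        if all(any(c in failed for c in clause) for clause in clauses):
            solutions.append(frozenset(failed))
    return solutions


def minimal(fault_sets):
    """Keep only the sets that contain no smaller breaking set."""
    return [s for s in fault_sets if not any(t < s for t in fault_sets)]


ALL_CUTS = breaking_fault_sets(COMPONENTS, CLAUSES)
MINIMAL_CUTS = minimal(ALL_CUTS)

if __name__ == "__main__":
    for cut in MINIMAL_CUTS:
        print(sorted(cut))
```

For this graph the minimal answers are "both replicas fail" and "both broadcasts fail"; those are the hypotheses worth injecting first.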
Search Space Reduction
Each experiment finds a bug, OR reduces the search space
The prototype system “Molly”
Recipe:
1. Start with a successful outcome. Work backwards.
2. Ask why it happened: Lineage
3. Convert lineage to a boolean formula and solve
4. Lather, rinse, repeat
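The recipe can be sketched as a small loop. Everything here is a toy under stated assumptions, not Molly itself: `toy_run` is a hypothetical system with three redundant paths, lineage is reported as the single path that made the run succeed, and the "solver" is a brute-force search for minimal fault sets that cut every support seen so far.

```python
from itertools import combinations

# Hypothetical toy system: three ways the write can become durable.
# A path works only if none of its components is faulted.
PATHS = [
    frozenset({"RepA", "Bcast1"}),
    frozenset({"RepB", "Bcast1"}),
    frozenset({"RepA", "Bcast2"}),  # retry path
]


def toy_run(faults):
    """Run the toy system under `faults`; return (success, lineage).

    Lineage is the first surviving path, i.e. why the good outcome happened.
    """
    for path in PATHS:
        if not (path & faults):
            return True, [path]
    return False, []


def minimal_cuts(supports, max_faults):
    """Step 3: minimal fault sets (size <= max_faults) cutting every known support."""
    components = sorted(set().union(*supports))
    cuts = []
    for size in range(1, max_faults + 1):
        for combo in combinations(components, size):
            candidate = frozenset(combo)
            cuts_all = all(candidate & support for support in supports)
            if cuts_all and not any(cut <= candidate for cut in cuts):
                cuts.append(candidate)
    return cuts


def ldfi(run, max_faults):
    """Return (breaking_fault_set_or_None, experiments_performed)."""
    ok, lineage = run(frozenset())      # step 1: a successful outcome
    if not ok:
        raise ValueError("need a successful fault-free run to start from")
    supports = set(lineage)             # step 2: why it happened
    tested = set()
    while True:                         # step 4: repeat
        untested = [c for c in minimal_cuts(supports, max_faults)
                    if c not in tested]
        if not untested:
            return None, len(tested)    # nothing left to try within the budget
        faults = untested[0]
        tested.add(faults)
        ok, lineage = run(faults)       # inject the chosen faults
        if not ok:
            return faults, len(tested)  # this combination breaks the outcome
        supports |= set(lineage)        # new redundancy seen; refine the formula


RESULT = ldfi(toy_run, max_faults=2)
```

On this toy system the loop reaches the breaking combination (both broadcasts down) on its third experiment, rather than trying all ten one- and two-fault combinations. Each experiment that succeeds reveals a new support and rules out hypotheses.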
[Diagram: 1. Success → (Why?) → 2. Lineage → (Encode) → 3. CNF → (Solve) → Fail → 4. REPEAT]
The Big Idea Meets Production
1. Start with a successful outcome
What is success?
“Start with the customer and work
backwards”
Leadership Principle
“Streaming” Data
Joining the Streams
Missing Data?
Lesson 1
Work backwards from what you know
2. Ask why it happened
Request Tracing
Alternate Execution
Evolution over time
Redundancy through History
Lesson 2
Meet in the middle
3. Solve
A “small” matter of code
4. Lather, Rinse, Repeat
Turn the crank, right?
Idempotence
Bins and Balls
[Figure: a Request is placed into one of the bins Class 1, Class 2, Class 3, [...], Class n; requests r’ and r are shown with Class n]
Predicting Request Graphs
Some function f: Requests → Classes
f(Request) = Class n
Solve the Machine Learning problem?
or the Failure Testing one?
Simplest thing that will work?
Falcor Path Mapping
["bookmarks", "recent"]
["playlist", 0, "name"]
["ratings"]
=> "bookmarks,playlist,ratings"
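One way to read the mapping above: take the first key of each Falcor path and join them into a single string that names the request class. The sketch below does that. The function name is hypothetical, and the de-duplication and sorting (so that path order does not change the class) are assumptions rather than something the slide states.

```python
def request_class(paths):
    """Map a list of Falcor-style paths to a request-class key.

    Assumption: only the first element of each path matters, and the
    class should not depend on path order or on repeated roots.
    """
    roots = {str(path[0]) for path in paths if path}
    return ",".join(sorted(roots))


EXAMPLE_PATHS = [
    ["bookmarks", "recent"],
    ["playlist", 0, "name"],
    ["ratings"],
]
EXAMPLE_CLASS = request_class(EXAMPLE_PATHS)
```

With the three paths from the slide this yields the same key shown there, and two requests that touch the same top-level resources fall into the same class.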
Lesson 3
Adapt the theory to the reality
Many moons passed...
Does it work? YES!
Case study: “Netflix AppBoot”
Services: ~100
Search space (executions): 2^100 (roughly 1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 6
Future Work
Richer device metrics
Request class creation
Better experiment selection
Search prioritization
Richer lineage collection
Exploring temporal interleavings
Lessons
Work backwards from what you know
Meet in the middle
Adapt the theory to the reality
Academia + Industry
Academia Industry
Thank You!
Peter Alvaro
@palvaro
palvaro@ucsc.edu
Kolton Andrus
@KoltonAndrus
kolton@gremlininc.com
References
● Netflix Blog on ‘Automated Failure Testing’ http://techblog.netflix.com/2016/01/automated-failure-testing.html
● Netflix Blog on ‘Failure Injection Testing’ http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
● ‘Lineage Driven Fault Injection’ http://people.ucsc.edu/~palvaro/molly.pdf
Photo Credits
● http://etc.usf.edu/clipart/4000/4048/children_7_lg.gif
● http://cdn.c.photoshelter.com/img-get2/I0000MIN8fL0q8AA/fit=1000x750/taiwan-hiking-river-tracing-walking.jpg
● http://i.imgur.com/iWKad22.jpg
● https://blogs.endjin.com/2014/05/event-stream-manipulation-using-rx-part-2/
● http://youpivot.com/category/features/
● https://www.cloudave.com/33427/boards-need-evolve-time/
● https://www.linkedin.com/pulse/amelia-packager-missing-data-imputation-ramprakash-veluchamy
