Madaari
Ordering For The Monkeys
Agenda
● Distributed Systems and Chaos Engineering : State Of The Union
● Lineage Driven Fault Injection : A Brief Primer
● LDFI : Ordering Of Faults
● Bringing LDFI to the Enterprise
● Results
● Future Work
Industry + Academia = Win !!
Joint work between eBay and Disorderly Labs
● Dr. Peter Alvaro ( UCSC )
● Kamala Ramasubramanian ( UCSC )
● eBay SRE Team
Madaari : a trainer who teaches a monkey to perform tricks
The Problem : Testing Distributed Systems
Combinatorial Space of Failures
Microservices Death Star
Consider 100 Services
Fault Search Space : 2^100

Fault Cardinality    Possible Faults
1                    100
4                    ~3.9 Million
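The numbers on this slide follow from counting subsets. A minimal sketch, assuming each fault is "service X is down" and a fault set of cardinality k is any choice of k of the 100 services (the function names are illustrative, not from the talk):

```python
from math import comb

SERVICES = 100  # number of services, as on the slide


def faults_of_cardinality(k: int, n: int = SERVICES) -> int:
    """Number of distinct fault sets that take down exactly k of n services."""
    return comb(n, k)


def total_fault_space(n: int = SERVICES) -> int:
    """Every subset of services is a candidate fault set: 2**n in total."""
    return 2 ** n


if __name__ == "__main__":
    for k in (1, 4):
        print(k, faults_of_cardinality(k))
    print(total_fault_space())
```

Cardinality 4 alone is already in the millions, which is why exhaustive injection is not an option.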
Chaos Engineering : A Possible Solution
● Failure is inevitable, let’s fail in a controlled environment
● Proactively inject failure in your system to reveal weaknesses
● Perturbation and observation of large-scale systems
Chaos Engineering : A Brief Primer
● Guided Fault Injection
○ A genius holds the mental model of the system
○ Doesn’t scale well !!
● Random Fault Injection
○ No model of the system
○ Can’t quantify progress
Lineage Driven Fault Injection aka LDFI
CLAIM : Fault Tolerance = Redundancy
● Use explanations of successful outcomes to search for faults that can drive the system into a bad state
● Observing successful executions enables LDFI to build a model of the redundancy of the system
Lineage Driven Fault Injection aka LDFI
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
Lineage Driven Fault Injection aka LDFI
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
Hypothesis: {Bcast1, Bcast2}
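The hypothesis can be checked mechanically: a fault set cuts the lineage when it picks at least one member of every clause. A minimal sketch that enumerates the minimal such sets by brute force (a stand-in for a SAT solver; the function name is illustrative):

```python
from itertools import combinations

# Each clause is one support of the good outcome; names follow the slide.
CLAUSES = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]


def minimal_hitting_sets(clauses):
    """Return all minimal fault sets that intersect every clause, smallest first."""
    universe = sorted(set().union(*clauses))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if any(prev <= chosen for prev in found):
                continue  # contains a smaller hitting set, so it is not minimal
            if all(chosen & clause for clause in clauses):
                found.append(chosen)
    return found


HYPOTHESES = minimal_hitting_sets(CLAUSES)
```

For this formula the only minimal candidates are {Bcast1, Bcast2} and {RepA, RepB}; no single fault breaks all four supports.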
LDFI : Building Blocks
● Witnessing a large number of successful executions allows LDFI to build a model of redundancy of the system
● How? Because it can reason about why faults were tolerated
Recipe:
1. Start with a successful outcome and work backwards.
2. Ask why it happened. Answer: its lineage (traces).
3. Convert the lineage to a CNF formula and solve the decision problem (using a SAT solver).
4. Lather, rinse, repeat.
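The recipe can be sketched as a loop. This is a toy under stated assumptions: the system under test is a hypothetical list of redundant paths, `run_with_faults` stands in for replaying a real request with faults injected, and a brute-force search stands in for the SAT solver.

```python
from itertools import combinations

# Hypothetical system: a request succeeds if at least one path has no failed component.
REDUNDANT_PATHS = [
    ("cache", "router"),
    ("db", "router"),
]


def run_with_faults(faults):
    """Replay the request; return the lineage (the path that served it) or None on failure."""
    for path in REDUNDANT_PATHS:
        if not faults & set(path):
            return frozenset(path)
    return None


def smallest_untried_hitting_set(clauses, tried):
    """Smallest fault set, not yet tried, that cuts every known lineage."""
    universe = sorted(set().union(*clauses))
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = frozenset(candidate)
            if chosen not in tried and all(chosen & c for c in clauses):
                return chosen
    return None


def ldfi_loop():
    clauses = [run_with_faults(frozenset())]  # step 1: a successful, fault-free run
    tried = set()
    while True:
        hypothesis = smallest_untried_hitting_set(clauses, tried)  # step 3
        if hypothesis is None:
            return None, len(tried)       # nothing left to try: no breaking fault found
        tried.add(hypothesis)
        lineage = run_with_faults(hypothesis)  # inject the faults and observe
        if lineage is None:
            return hypothesis, len(tried)  # the good outcome no longer happens
        clauses.append(lineage)            # step 4: new redundancy learned, repeat
```

In this toy the first experiment fails `cache`, the request survives through `db`, and the second experiment finds that failing `router` alone breaks the outcome.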
Encoding the Lineage
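[Figure: call graph of a trace through services A, B, C, D, E]
(A ∨ B ∨ C ∨ D ∨ E)
[Figure: call graph of a second trace, encoded without B]
(A ∨ C ∨ D ∨ E)
Combined: (A ∨ B ∨ C ∨ D ∨ E) ∧ (A ∨ C ∨ D ∨ E)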
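A minimal sketch of the encoding step, assuming each trace is simply the list of services it touched. Services are mapped to positive integers and each trace becomes one clause, which is the list-of-integer-lists shape that SAT solvers commonly accept. The helper names are illustrative.

```python
# The two traces from the slide, as lists of service names.
TRACES = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "D", "E"],
]


def encode(traces):
    """Map service names to positive integers and emit one clause per trace."""
    var_of = {}
    cnf = []
    for trace in traces:
        clause = []
        for service in trace:
            var_of.setdefault(service, len(var_of) + 1)
            clause.append(var_of[service])
        cnf.append(clause)
    return var_of, cnf


def cuts_all_traces(cnf, failed_vars):
    """True if failing these variables hits every clause, i.e. cuts every trace."""
    return all(any(v in failed_vars for v in clause) for clause in cnf)


VAR_OF, CNF = encode(TRACES)
```

Failing B alone leaves the second trace intact, so it does not satisfy the formula; failing A cuts both traces.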
Injecting Faults That Matter
● Drawbacks of existing approach
○ LDFI (using SAT) reduces the search space, but the search space might still be large
○ LDFI is a decision problem; solutions are returned in no particular order
● We want to order solutions (run experiments) to:
○ Find the most likely faults before users do!
○ Reduce the search space as much as possible
Ordering Faults : Injecting Faults That Matter
LDFI assumes all faults are equally likely, the reality differs !!
Intuition : Some faults are more likely than others; incident history usually backs this claim
We want to encode our intuition of failure in LDFI
[Figure: call graph with nodes A, B, C, D, E, F]
Ordering Of Faults : Minimal Hitting Set
(A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I)
(A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
e.g. (C, E, H)
Ordering Of Faults : Minimal Hitting Set
(A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F)
Maximise: X_A log(P_A) + X_B log(P_B) + X_C log(P_C) + X_D log(P_D) + X_E log(P_E) + X_F log(P_F)
Subject to:
X_A + X_B + X_C >= 1
X_C + X_D + X_E + X_F >= 1
X_D + X_E + X_F >= 1
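A minimal sketch of this optimisation, solved by brute force rather than an ILP solver. The failure probabilities are invented for illustration only. Since every log(P) is negative, the optimum is always a minimal hitting set, so the sketch ranks only those.

```python
from itertools import combinations
from math import log

# One set per constraint on the slide.
CLAUSES = [{"A", "B", "C"}, {"C", "D", "E", "F"}, {"D", "E", "F"}]

# Hypothetical failure probabilities, invented for illustration only.
P = {"A": 0.10, "B": 0.05, "C": 0.20, "D": 0.03, "E": 0.30, "F": 0.01}


def score(fault_set):
    """Objective from the slide with X_i = 1 for the chosen services, 0 otherwise."""
    return sum(log(P[s]) for s in fault_set)


def ranked_hitting_sets(clauses):
    """All minimal hitting sets, most likely (highest objective) first."""
    universe = sorted(set().union(*clauses))
    minimal = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = frozenset(candidate)
            if any(m <= chosen for m in minimal):
                continue  # contains a smaller hitting set
            if all(chosen & c for c in clauses):
                minimal.append(chosen)
    return sorted(minimal, key=score, reverse=True)


RANKED = ranked_hitting_sets(CLAUSES)
```

With these made-up probabilities the first experiment to run would be {C, E}, followed by {A, E}. Summing logs is the same as multiplying probabilities, which treats the faults as independent.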
Ordering Faults : Injecting Faults That Matter
Use the structure of the trace to prune the solution space :
1. Rank of the service ( distance from the root )
2. Size of the subgraph of the service
3. If we survive the failure of C, we will surely survive the failure of D, E and F
[Figure: call graph with nodes A, B, C, D, E, F]
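A minimal sketch of the three structural signals. The call tree below is an assumption (the figure did not survive extraction); it only needs C to sit above D, E and F. Ordering by rank first and larger subgraph second is likewise one possible choice, not something the slide prescribes.

```python
# Assumed call tree (parent -> children), chosen so that C calls D, E and F.
CALL_TREE = {
    "A": ["B", "C"],
    "B": [],
    "C": ["D", "E", "F"],
    "D": [],
    "E": [],
    "F": [],
}
ROOT = "A"


def ranks(tree, root):
    """Distance of every service from the root of the trace."""
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree[node]:
            depth[child] = depth[node] + 1
            stack.append(child)
    return depth


def subgraph_sizes(tree, root):
    """Number of services in the subgraph rooted at each service, itself included."""
    size = {}

    def visit(node):
        size[node] = 1 + sum(visit(child) for child in tree[node])
        return size[node]

    visit(root)
    return size


def descendants(tree, node):
    """All services reachable below `node`."""
    out, stack = set(), list(tree[node])
    while stack:
        current = stack.pop()
        out.add(current)
        stack.extend(tree[current])
    return out


def prune_after_survival(candidates, tree, survived):
    """Drop single-service experiments made redundant by surviving `survived`."""
    implied = descendants(tree, survived)
    return [c for c in candidates if c not in implied]


RANK = ranks(CALL_TREE, ROOT)
SIZE = subgraph_sizes(CALL_TREE, ROOT)
# One possible ordering: closest to the root first, then larger subgraph first.
ORDER = sorted(CALL_TREE, key=lambda s: (RANK[s], -SIZE[s]))
```

Once the system is seen to survive the failure of C, the experiments on D, E and F can be skipped.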
Ordering Faults : Injecting Faults That Matter
● All services are not created equal, some services fail more than others
● Likelihood and Containment :
○ P(Node failure) > P(Rack Failure) >> P(Data center failure)
● Historical measures :
○ Time since last release
○ History Of Failure and Bug Rate
LDFI in the Enterprise
● Explanations
● Models Of Redundancy
● Fault Injection
Traces = Explanations
● Distributed Tracing
○ Call graphs come for free
● Less Ideal (but OK) : Structured Logging
○ We did this too !!
● What are traces anyway ?
○ Ordered events with context stitched together
○ Create the call graphs using service names and endpoints
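A minimal sketch of turning trace events into a call graph keyed by service name and endpoint. The span records and their field names (`span_id`, `parent_id`, `start`) are hypothetical and not tied to any particular tracing system.

```python
# Hypothetical spans; field names are illustrative, not a specific tracer's schema.
SPANS = [
    {"span_id": "1", "parent_id": None, "service": "web", "endpoint": "/checkout", "start": 0},
    {"span_id": "2", "parent_id": "1", "service": "cart", "endpoint": "/items", "start": 5},
    {"span_id": "3", "parent_id": "1", "service": "payment", "endpoint": "/charge", "start": 9},
    {"span_id": "4", "parent_id": "3", "service": "ledger", "endpoint": "/entries", "start": 12},
]


def call_graph(spans):
    """Return caller -> callees in start order, with nodes keyed by (service, endpoint)."""
    by_id = {span["span_id"]: span for span in spans}
    graph = {}
    for span in sorted(spans, key=lambda s: s["start"]):
        node = (span["service"], span["endpoint"])
        graph.setdefault(node, [])
        parent = by_id.get(span["parent_id"])
        if parent is not None:
            parent_node = (parent["service"], parent["endpoint"])
            graph.setdefault(parent_node, []).append(node)
    return graph


GRAPH = call_graph(SPANS)
```

The same function works on structured logs as long as each record carries its own id, its caller's id and a timestamp.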
Fault Injection Tool
● We rolled our own ( Mowgli )
○ Inspired by Trogdor ( Kafka’s FIT tool )
○ Circuit breaker aware fault injection tool, deals with services and databases
○ Built-in safety mechanisms
○ Hooks for AZ level, node level fault injection
○ Audit and tracking capabilities
● Lots of open source options available
○ Start simple, a script to drop network traffic is also OK
○ https://github.com/dastergon/awesome-chaos-engineering
● Tip : Be safe by default
○ Always have a rollback strategy
Interaction Replay
● Ability to replay interactions ( Tip : E2E Tests )
● Measure of Success
○ An unambiguous binary (yes or no) verdict on whether the execution was successful
● Works for eventually consistent systems as well, as long as there is a finite upper bound on the eventuality
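A minimal sketch of a binary success measure for an eventually consistent system: poll a check until it holds or the agreed upper bound expires. `eventually` and `LaggingReplica` are hypothetical names used for illustration.

```python
import time


def eventually(check, timeout_s, interval_s=0.01):
    """Binary verdict: True if check() holds within timeout_s seconds, else False."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class LaggingReplica:
    """Hypothetical replica that only reflects a write after a few reads."""

    def __init__(self, lag_reads):
        self.remaining = lag_reads

    def has_order(self):
        self.remaining -= 1
        return self.remaining < 0


if __name__ == "__main__":
    print(eventually(LaggingReplica(3).has_order, timeout_s=1.0))
    print(eventually(lambda: False, timeout_s=0.05))
```

The timeout is the "finite upper bound on the eventuality": without it a failed run could never be told apart from a slow one.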
LDFI in the Enterprise
Traces/Structured Logs → To Call Graphs → Encode For The Solver ( LDFI ) → Fault Suggestion → FIT Tool
Solvers :
● PyCoSAT
● PuLP
● SAT4J
Results : Finding Bugs
Comparison With Chaos Monkey
Strategy          Fault Experiment Runs (avg.)    Standard Deviation
Ordered LDFI      17                              0
Uniform Random    210.35                          111.42
How long did it take to find those 5 bugs? A few hours
(An experiment takes ~2 minutes, and we did retries to get around our infrastructure)
Results : Finding No Bugs
Madaari : The Road Ahead
● Scalarizing Probabilities of Failure
● SLA verification using strategic Delay Injection
● Reason about Stateful systems
● Fine Grained Fault Injection
● Microservices Only ?
○ Databases, Containers, Service Mesh .. Let’s Go !!
LDFI : The Road Ahead
3 W’s For Fault Injection
1. What to inject ? ( type of fault we want to inject )
2. Where to inject ? (the target component )
3. When to inject ? ( inject when there are exactly 5 items in the cart !! )
LDFI : The Road Ahead
A Journey from Time to State and back
1. What’s time anyway ??
2. Applications have state and change of state gives you implicit order.
3. A rendezvous of state and time gives us precision for fault injection.
Madaari : Key Takeaways
● Industry and Academia can work together for fun(d) and profit
● Limitations of LDFI w.r.t. unordered solutions and why ordering matters for chaos engineering experiments
● Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure
● Preliminary results of prioritized LDFI and a future direction for the community
● Evangelising new techniques is hard; start small and stay simple
Discussion
Reach us at :
@ashutoshraina
@palvaro
@KamalRamas
https://disorderlylabs.github.io

Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 

Recently uploaded (20)

First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 

Madaari : Ordering For The Monkeys

  • 3. Agenda ● Distributed Systems and Chaos Engineering : State Of The Union ● Lineage Driven Fault Injection : A Brief Primer ● LDFI : Ordering Of Faults ● Bringing LDFI to the Enterprise ● Results ● Future Work
  • 4. Industry + Academia = Win !! Joint work between eBay and Disorderly Labs ● Dr. Peter Alvaro ( UCSC ) ● Kamala Ramasubramanian ( UCSC ) ● eBay SRE Team Madaari : a trainer who teaches a monkey to perform tricks
  • 5. The Problem : Testing Distributed Systems. Microservices Death Star : Combinatorial Space of Failures. Consider 100 Services. Fault Search Space : 2^100. Possible faults by fault cardinality : cardinality 1, 100 possible faults; cardinality 4, 3 Million possible faults
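
The numbers on this slide can be checked with a short calculation. The sketch below assumes a "fault" of cardinality k means k of the 100 services failing together, so the count is a binomial coefficient. Under that assumption the exact count for cardinality 4 is 3,921,225, which the slide lists as roughly 3 million.

```python
from math import comb

SERVICES = 100  # number of services, as on the slide

# Number of distinct fault sets of a given cardinality
# (assumption: a fault set is k services failing together).
for k in (1, 2, 3, 4):
    print(f"cardinality {k}: {comb(SERVICES, k):,} possible fault sets")

# Every subset of services is a candidate fault set, hence 2^100 overall.
total = 2 ** SERVICES
print(f"all subsets: {total:.3e}")
```
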
  • 6. Chaos Engineering : A Possible Solution ● Failure is inevitable, let’s fail in a controlled environment ● Proactively inject failure in your system to reveal weaknesses ● Perturbation and observation of large-scale systems
  • 7. Chaos Engineering : A Brief Primer ● Guided Fault Injection : A genius holds the mental model of the system. Doesn’t scale well !! ● Random Fault Injection : No Model Of The System. Can’t quantify progress
  • 8. Lineage Driven Fault Injection aka LDFI CLAIM : Fault Tolerance = Redundancy ● Use explanations of successful outcomes to search for faults that can drive the system into a bad state ● Observing successful executions enables LDFI to build a model of the redundancy of the system
  • 9. Lineage Driven Fault Injection aka LDFI Why did a good thing happen? Consider its lineage. What could have gone wrong? Faults are cuts in the lineage graph. Is there a cut that breaks all supports?
  • 10. Lineage Driven Fault Injection aka LDFI (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1)
  • 11. Lineage Driven Fault Injection aka LDFI (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1) Hypothesis: {Bcast1, Bcast2}
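
To make the hypothesis on slide 11 concrete, here is a minimal sketch that treats each clause as a set of components and enumerates the minimal sets that intersect every clause. It uses brute-force enumeration from the standard library as a stand-in for the SAT solver the talk describes; the function name `minimal_hitting_sets` is illustrative and not taken from the talk.

```python
from itertools import combinations

def minimal_hitting_sets(clauses):
    """Return all minimal sets of components that intersect every clause."""
    universe = sorted(set().union(*clauses))
    found = []
    # Enumerate candidates by increasing size so supersets of earlier hits are skipped.
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = set(candidate)
            if any(prev <= cand for prev in found):
                continue  # not minimal: contains a smaller hitting set
            if all(cand & clause for clause in clauses):
                found.append(cand)
    return found

# One clause per support in the lineage, as on slides 10 and 11.
clauses = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]
hypotheses = minimal_hitting_sets(clauses)
print(hypotheses)
```

For this formula the enumeration yields two candidates: the slide's hypothesis {Bcast1, Bcast2} and {RepA, RepB}. Brute force is exponential in the number of components, so it is only suitable for toy inputs like this one.
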
  • 12. LDFI : Building Blocks ● Witnessing a large number of successful executions allows LDFI to build a model of redundancy of the system ● How? Because it can reason about why faults were tolerated
  • 13. LDFI : Building Blocks Recipe: 1. Start with a successful outcome. Work backwards. 2. Ask why it happened. Answer : Lineage (Traces) 3. Convert lineage to a CNF formula and solve the decision problem ( using a SAT solver ) 4. Lather, rinse, repeat
  • 14. Encoding the Lineage : (A ∨ B ∨ C ∨ D ∨ E) and (A ∨ C ∨ D ∨ E), combined as (A ∨ B ∨ C ∨ D ∨ E) ∧ (A ∨ C ∨ D ∨ E)
  • 15. Injecting Faults That Matter ● Drawbacks of existing approach ○ LDFI (using SAT) reduces the search space, but the search space might still be large ○ LDFI is a decision problem; solutions are returned in no particular order ● We want to order solutions (run experiments) to: ○ Find the most likely faults before users do! ○ Reduce the search space as much as possible
  • 16. Ordering Faults : Injecting Faults That Matter. LDFI assumes all faults are equally likely; the reality differs !! Intuition : Some faults are more likely than others; incident history usually backs this claim. We want to encode our intuition of failure in LDFI
  • 17. Ordering Of Faults (A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I) (A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
  • 18. Ordering Of Faults : Minimal Hitting Set (A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I) (A, B, C), (C, D, E, F), (D, E, F, G), (H, I) e.g. (C, E, H)
  • 19. Ordering Of Faults : Minimal Hitting Set (A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F) Maximise: $X_A \log(P_A) + X_B \log(P_B) + X_C \log(P_C) + X_D \log(P_D) + X_E \log(P_E) + X_F \log(P_F)$ Subject to: $X_A + X_B + X_C \ge 1$; $X_C + X_D + X_E + X_F \ge 1$; $X_D + X_E + X_F \ge 1$
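
The optimisation on slide 19 can be illustrated without a solver. The sketch below assumes each X is a 0/1 choice of whether a service is in the fault set and each P is that service's failure probability. The probabilities are made-up values for illustration, since the slides give none. It enumerates every set that satisfies the constraints and ranks them by the objective, as a brute-force stand-in for an ILP library such as the PuLP listed later in the deck.

```python
from itertools import combinations
from math import log

# Hypothetical failure probabilities; the slides do not give numbers.
P = {"A": 0.05, "B": 0.10, "C": 0.30, "D": 0.20, "E": 0.02, "F": 0.01}
clauses = [{"A", "B", "C"}, {"C", "D", "E", "F"}, {"D", "E", "F"}]

def score(fault_set):
    # Objective from the slide: sum of X_i * log(P_i), with X_i = 1 for chosen faults.
    return sum(log(P[s]) for s in fault_set)

def ranked_hitting_sets(clauses, P):
    """All fault sets satisfying every '>= 1' constraint, most likely first."""
    names = sorted(P)
    feasible = [
        set(c)
        for size in range(1, len(names) + 1)
        for c in combinations(names, size)
        if all(set(c) & clause for clause in clauses)
    ]
    return sorted(feasible, key=score, reverse=True)

ranking = ranked_hitting_sets(clauses, P)
best = ranking[0]
print(best, score(best))
```

Because every log(P) is negative, maximising the sum favours fault sets that are both small and made of likely failures. With the probabilities assumed here the top-ranked set is {C, D}.
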
  • 20. Ordering Faults : Injecting Faults That Matter. Use the structure of the Trace to prune the Solution Space : 1. Rank Of the Service ( distance from the root ) 2. Size Of the sub graph of the Service 3. If we survive the failure of C, we will surely survive the failure of D, E and F
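
The three pruning ideas on slide 20 can be sketched on a small call graph. The slide's figure did not survive extraction, so the graph shape below (A calls B and C; C calls D, E and F) is an assumption chosen to match point 3, and the helper names are illustrative.

```python
# Hypothetical call graph: A calls B and C; C calls D, E and F.
children = {"A": ["B", "C"], "B": [], "C": ["D", "E", "F"], "D": [], "E": [], "F": []}

def descendants(node):
    """All services reachable from node, i.e. its sub graph minus node itself."""
    out, stack = set(), list(children[node])
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(children[n])
    return out

def rank(node, root="A"):
    """Distance from the root, found by breadth-first search."""
    frontier, depth = [root], 0
    while frontier:
        if node in frontier:
            return depth
        frontier = [c for n in frontier for c in children[n]]
        depth += 1
    raise KeyError(node)

def prune_after_survival(candidates, survived):
    """Drop candidates that sit below a service whose failure was already survived."""
    below = descendants(survived)
    return [c for c in candidates if c not in below]

remaining = prune_after_survival(["B", "C", "D", "E", "F"], survived="C")
print(remaining, rank("D"), len(descendants("C")) + 1)
```
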
  • 21. Ordering Faults : Injecting Faults That Matter ● Not all services are created equal; some services fail more than others ● Likelihood and Containment : ○ P(Node failure) > P(Rack Failure) >> P(Data center failure) ● Historical measures : ○ Time since last release ○ History Of Failure and Bug Rate
  • 22. LDFI in the Enterprise : Explanations, Models Of Redundancy, Fault Injection
  • 23. Traces = Explanations ● What are traces anyway ? ○ Ordered Events with context stitched together ○ Create the call graphs using service names and endpoints ● Distributed Tracing ○ Call graphs come for free ● Less Ideal (but OK) : Structured Logging ○ We did this too !!
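
As a rough illustration of building a call graph from service names and endpoints, the sketch below links each span to its parent. The span fields (`id`, `parent`, `service`, `endpoint`) and the service names are hypothetical and do not follow any particular tracer's schema.

```python
from collections import defaultdict

# Hypothetical span records; field names are illustrative only.
spans = [
    {"id": "1", "parent": None, "service": "frontend", "endpoint": "/checkout"},
    {"id": "2", "parent": "1", "service": "cart", "endpoint": "/items"},
    {"id": "3", "parent": "1", "service": "payment", "endpoint": "/charge"},
    {"id": "4", "parent": "3", "service": "ledger", "endpoint": "/record"},
]

def call_graph(spans):
    """Map each caller (service, endpoint) to the set of callees it invoked."""
    by_id = {s["id"]: s for s in spans}
    graph = defaultdict(set)
    for s in spans:
        parent = by_id.get(s["parent"])
        if parent is not None:  # root spans have no caller
            caller = (parent["service"], parent["endpoint"])
            graph[caller].add((s["service"], s["endpoint"]))
    return dict(graph)

graph = call_graph(spans)
for caller, callees in graph.items():
    print(caller, "->", sorted(callees))
```
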
  • 24. Fault Injection Tool ● We rolled our own ( Mowgli ) ○ Inspired by Trogdor ( Kafka’s FIT Tool ) ○ Circuit breaker aware fault injection tool, deals with services and databases ○ Built in safety mechanisms ○ Hooks for AZ level, node level fault injection ○ Audit and Tracking capabilities ● Lots of open source options available ○ Start simple, a script to drop network traffic is also OK ○ https://github.com/dastergon/awesome-chaos-engineering ● Tip : Be safe by default ○ Always have a rollback strategy
  • 25. Interaction Replay ● Ability to replay interactions ( Tip : E2E Tests ) ● Measure of Success ○ A unique binary ( yes or no ) way of saying whether the execution was successful or not ● Works for Eventually Consistent systems as well, as long as there is a finite upper bound on the eventuality
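
One way to get a binary success measure for an eventually consistent system is to poll the check until a fixed deadline. This is a minimal sketch of that idea; the helper `eventually` and the simulated check `order_visible` are hypothetical and not part of the tooling described in the talk.

```python
import time

def eventually(check, timeout_s, interval_s=0.1):
    """Return True if check() succeeds within timeout_s seconds, else False."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False  # the upper bound on "eventually" was exceeded
        time.sleep(interval_s)

# Simulated replayed interaction whose effect becomes visible on the third poll.
attempts = {"n": 0}

def order_visible():
    attempts["n"] += 1
    return attempts["n"] >= 3

ok = eventually(order_visible, timeout_s=1.0, interval_s=0.01)
print("experiment succeeded:", ok)
```

The timeout plays the role of the finite upper bound from the slide: without it the check could never report a failure.
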
  • 26. LDFI in the Enterprise : Traces/Structured Logs ( To Call Graphs ), LDFI ( Encode For The Solver ), FIT Tool ( Fault Suggestion ) ● Solvers : PyCoSAT, PuLP, SAT4J
  • 27. Results : Finding Bugs
  • 28. Comparison With Chaos Monkey. Fault Experiment Runs (avg.) and Standard Deviation by Strategy : Ordered LDFI, 17 runs, standard deviation 0; Uniform Random, 210.35 runs, standard deviation 111.42. How long did it take to find those 5 bugs? A few hours (an experiment takes ~2 minutes, and we did retries to get around our infrastructure)
  • 29. Results : Finding No Bugs
  • 30. Madaari : The Road Ahead ● Scalarizing Probabilities of Failure ● SLA verification using strategic Delay Injection ● Reason about Stateful systems ● Fine Grained Fault Injection ● Microservices Only ? ○ Databases, Containers, Service Mesh .. Let’s Go !!
  • 31. LDFI : The Road Ahead 3 W’s For Fault Injection 1. What to inject ? ( type of fault we want to inject ) 2. Where to inject ? (the target component ) 3. When to inject ? ( inject when there are exactly 5 items in the cart !! )
  • 32. LDFI : The Road Ahead A Journey from Time to State and back 1. What’s time anyway ?? 2. Applications have state and change of state gives you implicit order. 3. A rendezvous of state and time gives us precision for fault injection.
  • 33. Madaari : Key Takeaways ● Industry and Academia can work together for fun(d) and profit ● Limitations of LDFI w.r.t unordered solutions and why ordering matters for chaos engineering experiments ● Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure ● Preliminary results of prioritized LDFI and a future direction for the community ● Evangelising new techniques is hard; start small and stay simple
  • 34. Discussion. Reach us at : @ashutoshraina @palvaro @KamalRamas https://disorderlylabs.github.io