Lineage-driven
Fault Injection
Peter Alvaro Joshua Rosen Joseph M. Hellerstein
UC Berkeley
The future is disorder
•  Data-intensive systems are increasingly
distributed and heterogeneous
•  Distributed systems suffer partial failures
•  Fault-tolerant code is hard to get right
•  Composing FT components is hard too!
Motivation: Kafka replication bug
Three correct components:
1.  Primary/backup replication
2.  Timeout-based failure detectors
3.  Zookeeper
One nasty bug:
Acknowledged writes are lost
‘Molly’ witnesses the bug
[Figure: space-time diagram of the failing execution over Replica b, Replica c, Zookeeper, Replica a, and Client; a node is marked CRASHED.]
•  Brief network partition
•  a becomes primary and sole replica
•  a ACKs client write
•  Data loss
Fault-tolerance:
the state of the art
1.  Bottom-up approaches
(e.g. verification)
2.  Top-down approaches
(e.g. fault injection)
[Figure: Returns plotted against Investment for each approach]
Lineage-driven fault injection
Goal: whole-system testing that
•  finds all of the fault-tolerance bugs, or
•  certifies that none exist
Main idea: fault-tolerance is redundancy.
Lineage-driven fault injection
Approach: think backwards from outcomes
Use lineage to find evidence of redundancy
Original Question:
•  Could a bad thing ever happen?
Reframed question:
•  Why did a good thing happen?
•  What could have gone wrong?
A game
Protocol:
Reliable broadcast
Specification:
Pre: A correct process delivers a message m
Post: All correct processes deliver m
Failure Model:
(Permanent) crash failures
Message loss / partitions
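The specification above is an implication over correct (non-crashed) processes, so it can be checked on the final state of a run. A minimal Python sketch follows; the function and parameter names (`satisfies_reliable_broadcast`, `delivered`, `crashed`) are hypothetical and not taken from the talk.

```python
def satisfies_reliable_broadcast(processes, delivered, crashed):
    """Return True if the run meets the spec: Pre implies Post.

    processes: all process names
    delivered: processes that delivered (logged) the message m
    crashed:   processes that failed; these do not count as "correct"
    """
    correct = set(processes) - set(crashed)
    pre = any(p in delivered for p in correct)    # some correct process delivered m
    post = all(p in delivered for p in correct)   # every correct process delivered m
    return (not pre) or post
```

A run in which only a crashed process delivered m satisfies the specification vacuously, which is why the adversary must let at least one correct process deliver.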
[Figure labels: Program; Output constraints]
Round 1
“An effort” delivery protocol:
The broadcaster makes an attempt to
relay the message to the other nodes
Round 1 in space / time
[Figure: space-time diagram over processes b, a, and c. Process a sends log to b and c at time 1; the messages arrive at time 2.]
Outcomes are data
log(B, “data”)@5
•  What: log
•  Where: B
•  Some data: “data”
•  When: @5
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(B, data)@3
log(B, data)@2
bcast(A, data)@1
log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;
log(Node2, Pload)@async :- bcast(Node1, Pload),
node(Node1, Node2);
log(B, data)@2 :- bcast(A, data)@1,
node(A, B)@1;
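The persistence rule (`@next`) and the broadcast rule (`@async`) can be evaluated forward in time to reproduce this lineage. The sketch below is a hand-written Python stand-in, not a Dedalus evaluator: it assumes `@async` means a one-step delay unless the message is dropped, and the names `run_round1` and `dropped` are hypothetical.

```python
def run_round1(eot=5, dropped=frozenset()):
    """Evaluate the two rules up to time `eot`; return the set of log facts.

    A fact is (node, payload, time). `dropped` holds lost messages as
    (sender, receiver, send_time) triples.
    """
    nodes = {"A": ["B", "C"]}            # node(A, B), node(A, C)
    bcast = {("A", "data", 1)}           # bcast(A, data)@1
    log = set()
    for t in range(1, eot):
        # log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
        # Assumption: async delivery takes exactly one step unless dropped.
        for (src, pload, when) in bcast:
            if when == t:
                for dst in nodes.get(src, []):
                    if (src, dst, t) not in dropped:
                        log.add((dst, pload, t + 1))
        # log(Node, Pload)@next :- log(Node, Pload);
        for (node, pload, when) in list(log):
            if when == t:
                log.add((node, pload, t + 1))
    return log
```

With no failures the result contains `("B", "data", 5)`. Passing `dropped={("A", "B", 1)}` removes it, because the single message from A to B at time 1 is the only support for that outcome.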
	
  
An execution is a (fragile) “proof”
of an outcome
[Figure: proof tree for log(B, data)@5. Rule r2 derives log(B, data)@2 from log(A, data)@1 and node(A, B)@1 via message AB1; rule r1 then persists it through times 3, 4, and 5.]
(which required a message from A to B at time 1)
Round 1: counterexample
The adversary wins!
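[Figure: space-time diagram over processes b, a, and c. Process a sends log to b and c at time 1; the message to b is LOST, and the message to c arrives at time 2.]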
Round 2
“Sender retries” delivery protocol:
The broadcaster makes repeated attempts
to relay the message to the other nodes
Round 2 in spacetime
[Figure: space-time diagram over processes b, a, and c. Process a sends log to b and c at each of times 1 to 4; the messages arrive at times 2 to 5.]
Round 2: sender retries
log(B, data)@5
log(B, data)@4    log(A, data)@4
log(B, data)@3    log(A, data)@3
log(B, data)@2    log(A, data)@2
log(A, data)@1
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Retry provides redundancy in time
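Redundancy in time can be read off the lineage: each retry gives log(B, data)@5 an independent derivation that depends on a different message. The Python sketch below walks backwards from an outcome and collects one set of messages per derivation. It assumes A holds the payload from time 1, resends it at every step, and that messages take one step to arrive; `message_supports` is a hypothetical name.

```python
def message_supports(node, time, sender="A", start=1):
    """Return one frozenset of messages per derivation of log(node, data)@time.

    A message is written (sender, receiver, send_time), so ("A", "B", 1) is AB1.
    """
    if node == sender:
        # The sender's own log persists from `start` and needs no messages.
        return [frozenset()] if time >= start else []
    if time <= start:
        return []  # nothing can have arrived yet
    supports = []
    # Derivation via a message sent at time - 1 (the sender must hold the data then).
    for s in message_supports(sender, time - 1, sender, start):
        supports.append(s | {(sender, node, time - 1)})
    # Derivation via persistence from the receiver's own log at time - 1.
    supports.extend(message_supports(node, time - 1, sender, start))
    return supports
```

`message_supports("B", 5)` yields four singleton supports, one each for AB1 to AB4. An adversary restricted to message loss has to drop all four to falsify the outcome.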
Traces are forests of proof trees
[Figure: four proof trees for log(B, data)@5, one per message from A to B. r1 persists log, r3 persists node, r2 delivers a message.]
•  AB1: log(A, data)@1, node(A, B)@1 → log(B, data)@2 (r2), then r1 to @5
•  AB2: log(A, data)@2, node(A, B)@2 → log(B, data)@3 (r2), then r1 to @5
•  AB3: log(A, data)@3, node(A, B)@3 → log(B, data)@4 (r2), then r1 to @5
•  AB4: log(A, data)@4, node(A, B)@4 → log(B, data)@5 (r2)
AB1 ∧ AB2 ∧ AB3 ∧ AB4
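Round 2: counterexample
[Figure: space-time diagram over processes b, a, and c. Process a sends log to b and c at time 1; the message to b is LOST, and a is marked CRASHED at time 2.]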
The adversary wins!
Round 3
“Symmetric retry” delivery protocol:
All participants make repeated attempts to
relay the message to the other nodes
Round 3 in space / time
[Figure: space-time diagram over processes b, a, and c, times 1 to 5. Every process that holds the message sends log to the others at each step.]
Round 3: symmetric retry
log(B, data)@5
log(B, data)@4    log(A, data)@4    log(C, data)@4
log(B, data)@3    log(A, data)@3    log(C, data)@3
log(B, data)@2    log(A, data)@2    log(C, data)@2
log(A, data)@1
Redundancy in space and time
Round 3: symmetric retry
The programmer wins!
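The three rounds can be replayed as a bounded exhaustive search. The Python sketch below is a brute-force illustration, not how Molly works (Molly prunes the search using lineage). The bounds are assumptions made here: three nodes, time runs to `EOT = 4`, messages can only be lost up to step `EFF = 2`, and at most one node crashes. The protocol labels `"once"`, `"sender"`, and `"symmetric"` are hypothetical names for rounds 1, 2, and 3.

```python
from itertools import combinations, product

NODES = ("A", "B", "C")
EOT = 4   # assumed bound: last time step simulated
EFF = 2   # assumed bound: messages sent after this step are never lost

def run(protocol, dropped, crash):
    """Simulate one execution; return the set of nodes holding the data at EOT."""
    has = {"A"}                                   # A logs the payload at time 1
    for t in range(1, EOT):
        # A node crashed at time tc sends nothing at tc or later.
        alive = [n for n in NODES if crash is None or n != crash[0] or t < crash[1]]
        if protocol == "once":
            senders = ["A"] if t == 1 else []
        elif protocol == "sender":
            senders = ["A"]
        else:                                     # "symmetric"
            senders = list(has)
        arrived = set()
        for s in senders:
            if s in has and s in alive:
                for d in NODES:
                    if d != s and (s, d, t) not in dropped:
                        arrived.add(d)
        has |= arrived                            # messages sent at t land at t + 1
    return has

def find_counterexample(protocol):
    """Try every failure combination within the bounds; return the first violation."""
    msgs = [(s, d, t) for s, d in product(NODES, NODES) if s != d
            for t in range(1, EFF + 1)]
    crashes = [None] + [(n, t) for n in NODES for t in range(1, EOT + 1)]
    for k in range(len(msgs) + 1):
        for dropped in combinations(msgs, k):
            for crash in crashes:
                has = run(protocol, set(dropped), crash)
                correct = [n for n in NODES if crash is None or n != crash[0]]
                # Violation: some correct node delivered, but not all of them did.
                if any(n in has for n in correct) and not all(n in has for n in correct):
                    return dropped, crash
    return None
```

Under these bounds the search finds a lost message for `"once"`, a lost message plus a crash of A for `"sender"`, and nothing for `"symmetric"`. The last result only holds within the stated bounds; an adversary allowed to drop every message forever would still win.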
Let’s reflect
Intuition:
Fault-tolerance is redundancy in space and time.
Strategy:
Reason backwards from outcomes using lineage
Lineage exposes redundancy of outcome support.
Finding bugs: choose failures that “break” all derivations
Fixing bugs: add additional derivations
Automating the role of the adversary
1.  Break a proof by dropping any
contributing message.
Disjunction:
(AB1 ∨ BC2)
2.  Find a set of failures that breaks all proofs
of a good outcome.
Conjunction of disjunctions (AKA CNF):
(AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
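Each clause lists the messages one proof depends on, so a satisfying assignment is a set of dropped messages that hits every clause. The Python sketch below enumerates the smallest such sets by brute force as a stand-in for a SAT solver; `minimal_fault_sets` is a hypothetical name.

```python
from itertools import combinations

def minimal_fault_sets(cnf):
    """Return every smallest set of dropped messages that satisfies the CNF.

    `cnf` is a list of clauses; each clause is a set of message names, and a
    clause is satisfied (its proof is broken) when at least one is dropped.
    """
    messages = sorted(set().union(*cnf))
    for size in range(1, len(messages) + 1):
        hits = [set(c) for c in combinations(messages, size)
                if all(clause & set(c) for clause in cnf)]
        if hits:
            return hits
    return []

# The formula from the slide: (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
formula = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]
candidates = minimal_fault_sets(formula)
```

For this formula both AC1 and AC2 must be dropped, together with either AB1 or BC2, so `candidates` holds two fault sets of size three. Each candidate is a hypothesis to test by re-running the program with those faults injected.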
By injecting only “interesting” faults…
Molly finds bugs quickly
Molly provides guarantees that
outcomes are fault-tolerant
Program        Bound   Combinations     Executions
redun-deliv    11      8.07 × 10^18     11
ack-deliv      8       3.08 × 10^13     673
paxos-synod    7       4.81 × 10^11     173
bully-leader   10      1.26 × 10^17     2
flux           22      6.20 × 10^76     187
Molly, the LDFI prototype
Molly finds fault-tolerance violations
quickly or guarantees that none exist.
Molly uses data lineage to reason about
redundancy of support (or lack thereof)
for system outcomes.
Case study: commit protocols
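2-Phase commit
[Figure: space-time diagram over Agent a, Agent b, Coordinator, and Agent d. The coordinator sends prepare (p) to the three agents at time 1, the agents send votes (v) at time 2, and the coordinator is marked CRASHED.]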
Collaborative termination
[Figure: space-time diagram over Agent a, Agent b, Coordinator, and Agent d, times 1 to 6. The coordinator sends prepare, the agents send vote, and the coordinator is marked CRASHED; the agents then exchange decision_req messages.]
3-Phase commit
[Figure: space-time diagram over Process a, Process b, Process C (coordinator), and Process d, times 1 to 8. Messages shown: cancommit, vote_msg, precommit, ack, abort (two marked LOST), and commit; Process d is marked CRASHED.]
3PC in an asynchronous network
[Figure: space-time diagram of the 3-phase commit execution over Process a, Process b, Process C (coordinator), and Process d, with d marked CRASHED and two abort messages marked LOST. Annotations:]
•  Brief network partition
•  Agent crash
•  Agents learn commit decision
•  d is dead; coordinator decides to abort
•  Agents A & B decide to commit

 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 

Lineage-driven Fault Injection, SIGMOD'15

  • 1. Lineage-driven Fault Injection Peter Alvaro Joshua Rosen Joseph M. Hellerstein UC Berkeley
  • 2. The future is disorder •  Data-intensive systems are increasingly distributed and heterogeneous •  Distributed systems suffer partial failures •  Fault-tolerant code is hard to get right •  Composing FT components is hard too!
  • 3. Motivation: Kafka replication bug Three correct components: 1.  Primary/backup replication 2.  Timeout-based failure detectors 3.  Zookeeper One nasty bug: Acknowledged writes are lost
  • 4. ‘Molly’ witnesses the bug [Space-time diagram: Replica b, Replica c, Zookeeper, Replica a, Client; one process is marked CRASHED]
  • 5. ‘Molly’ witnesses the bug [Same diagram, annotated: Brief network partition]
  • 6. ‘Molly’ witnesses the bug [Same diagram, annotated: Brief network partition; a becomes primary and sole replica]
  • 7. ‘Molly’ witnesses the bug [Same diagram, annotated: Brief network partition; a becomes primary and sole replica; a ACKs client write]
  • 8. ‘Molly’ witnesses the bug [Same diagram, annotated: Brief network partition; a becomes primary and sole replica; a ACKs client write; Data loss]
  • 9. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Two plots, axes: Investment, Returns]
  • 10. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Two plots, axes: Investment, Returns]
  • 11. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Plot, axes: Investment, Returns]
  • 12. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Plot, axes: Investment, Returns]
  • 13. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Plot, axes: Investment, Returns]
  • 14. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection)
  • 15. Lineage-driven fault injection Goal: whole-system testing that •  finds all of the fault-tolerance bugs, or •  certifies that none exist Main idea: fault-tolerance is redundancy.
  • 16. Lineage-driven fault injection Approach: think backwards from outcomes Use lineage to find evidence of redundancy Original Question: •  Could a bad thing ever happen? Reframed question: •  Why did a good thing happen? •  What could have gone wrong?
  • 17. A game Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures; message loss / partitions [Diagram labels: Program, Output constraints]
  • 18. Round 1 The broadcaster makes an attempt to relay the message to the other nodes     “An effort” delivery protocol:
  • 19. Round 1 in space / time [Space-time diagram: Process b, Process a, Process c; a sends log messages at time 1, b and c receive them at time 2]
  • 20. Outcomes are data: log(B, “data”)@5 What: log Where: B Some data: “data” When: @5
  • 21. Round 1: Lineage log(B, data)@5
  • 22. Round 1: Lineage log(B, data)@5 log(B, data)@4 Rule: log(Node, Pload)@next :- log(Node, Pload); Instance: log(B, data)@5 :- log(B, data)@4;
  • 23. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  • 24. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B, data)@2
  • 25. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B, data)@2 bcast(A, data)@1 Rule: log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); Instance: log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
  • 26. An execution is a (fragile) “proof” of an outcome [Proof tree: log(A, data)@1 and node(A, B)@1 derive log(B, data)@2 via r2 with message AB1; r1 carries log(B, data) forward through @3, @4 and @5. Partial trees for messages AB2 and AB3 are also shown: r1 carries log(A, data) and r3 carries node(A, B) forward to @2 and @3] (which required a message from A to B at time 1)
  • 27. Round 1: counterexample [Space-time diagram: Process b, Process a, Process c; a's log message to b at time 1 is LOST, its log message to c arrives at time 2] The adversary wins!
  • 28. Round 2 The broadcaster makes repeated attempts to relay the message to the other nodes     “Sender retries” delivery protocol:
  • 29. Round 2 in spacetime [Space-time diagram: Process b, Process a, Process c; a sends log messages to b and c at times 1 through 4, arriving at times 2 through 5]
  • 30. Round 2: sender retries log(B,  data)@5    
  • 31. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4    
  • 32. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;  
  • 33. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(B,  data)@3     log(A,  data)@3    
  • 34. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(B,  data)@3     log(A,  data)@3     log(B,data)@2     log(A,  data)@2    
  • 35. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(B,  data)@3     log(A,  data)@3     log(B,data)@2     log(A,  data)@2     log(A,  data)@1    
  • 36. Round 2: sender retries   log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(B,  data)@3     log(A,  data)@3     log(B,data)@2     log(A,  data)@2     log(A,  data)@1     Retry provides redundancy in time
  • 37. Traces are forests of proof trees [Four proof trees for log(B, data)@5, one per message ABk, k = 1..4. In tree k, r1 carries log(A, data)@1 forward to @k, r3 carries node(A, B)@1 forward to @k, r2 with message ABk derives log(B, data)@(k+1), and r1 carries it forward to @5] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 38. Traces are forests of proof trees [Same four proof trees] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 39. Traces are forests of proof trees [Same four proof trees, three of them marked ✖] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 40. Traces are forests of proof trees [Same four proof trees, three of them marked ✖] AB1 ∧ AB2 ∧ AB3 ∧ AB4
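  • 41. Round 2: counterexample [Space-time diagram: Process b, Process a, Process c; a's log message to b at time 1 is LOST, its log message to c arrives at time 2, and a is then CRASHED] The adversary wins!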
  • 42. Round 3 All participants make repeated attempts to relay the message to the other nodes “Symmetric retry” delivery protocol:
  • 43. Round 3 in space / time [Space-time diagram: Process b, Process a, Process c; a sends log messages from time 1, and from time 2 onward every process relays log messages to the others, through time 5]
  • 44. Round 3: symmetric retry log(B,  data)@5    
  • 45. Round 3: symmetric retry log(B,  data)@5     log(B,  data)@4     log(A,  data)@4     log(C,  data)@4    
  • 46. Round 3: symmetric retry log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3
  • 47. Round 3: symmetric retry log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 48. Round 3: symmetric retry log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1 Redundancy in space and time
  • 49. Round 3: symmetric retry The programmer wins!
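The three rounds of the game can be replayed with a small simulation. This is only a sketch under stated assumptions, not Molly or Dedalus: the names `run` and `violates_spec`, the `(sender, receiver, send_time)` encoding of lost messages, and the choice of three nodes with a horizon of 5 are all hypothetical, picked to match the slides.

```python
NODES = ("A", "B", "C")
EOT = 5  # last simulated timestep, matching the @5 outcome on the slides


def run(protocol, dropped=frozenset(), crashes=None):
    """Return the set of nodes whose log holds the payload at time EOT.

    protocol: "once" (Round 1), "sender_retry" (Round 2),
              or "symmetric_retry" (Round 3).
    dropped:  set of (sender, receiver, send_time) messages that are lost.
    crashes:  dict mapping a node to the first timestep at which it is down.
    """
    crashes = crashes or {}

    def alive(node, t):
        return t < crashes.get(node, EOT + 1)

    has_log = {"A"}  # log(A, data)@1: only the broadcaster starts with the payload
    for t in range(1, EOT):
        if protocol == "once":
            senders = {"A"} if t == 1 else set()
        elif protocol == "sender_retry":
            senders = {"A"}
        else:  # symmetric_retry: every node that holds the payload relays it
            senders = set(has_log)
        delivered = set()
        for s in senders:
            if not alive(s, t):
                continue
            for r in NODES:
                if r != s and (s, r, t) not in dropped and alive(r, t + 1):
                    delivered.add(r)
        # Persistence: a log entry, once written, carries over to the next step.
        has_log = has_log | delivered
    return has_log


def violates_spec(logged, crashes=None):
    """True if some correct node delivered the message but another did not."""
    crashes = crashes or {}
    correct = [n for n in NODES if n not in crashes]
    delivered = [n for n in correct if n in logged]
    return bool(delivered) and len(delivered) < len(correct)


lost = {("A", "B", 1)}   # the message from a to b at time 1 is lost
crash_a = {"A": 2}       # a is down from time 2 onward

print(violates_spec(run("once", lost)))                               # True
print(violates_spec(run("sender_retry", lost, crash_a), crash_a))     # True
print(violates_spec(run("symmetric_retry", lost, crash_a), crash_a))  # False
```

The first two calls reproduce the Round 1 and Round 2 counterexamples. The third applies the same faults to symmetric retry, where c relays the payload to b after a has crashed. It checks a single fault scenario only and is not a proof that the protocol is correct.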
  • 50. Let’s reflect Intuition: Fault-tolerance is redundancy in space and time. Strategy: Reason backwards from outcomes using lineage Lineage exposes redundancy of outcome support. Finding bugs: choose failures that “break” all derivations Fixing bugs: add additional derivations
  • 51. Automating the role of the adversary 1.  Break a proof by dropping any contributing message. (AB1 ∨ BC2)
  • 52. Automating the role of the adversary 1. Break a proof by dropping any contributing message (a disjunction): (AB1 ∨ BC2) 2. Find a set of failures that breaks all proofs of a good outcome (a conjunction of disjunctions, AKA CNF): (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
  • 53. Automating the role of the adversary 1. Break a proof by dropping any contributing message (a disjunction): (AB1 ∨ BC2) 2. Find a set of failures that breaks all proofs of a good outcome (a conjunction of disjunctions, AKA CNF): (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
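The adversary's search can be sketched as a hitting-set problem: each proof contributes one clause listing the messages it depends on, and a fault set wins if it contains at least one message from every clause. The brute-force enumeration below is an illustration only. The function name `minimal_fault_sets` is hypothetical, and a real tool would more plausibly hand the CNF formula to a solver than enumerate subsets.

```python
from itertools import combinations


def minimal_fault_sets(proofs):
    """Return every minimal set of dropped messages that breaks all proofs.

    proofs: list of sets of message names, one set per derivation of the
            good outcome (one CNF clause per proof).
    """
    messages = sorted(set().union(*proofs))
    found = []
    for size in range(1, len(messages) + 1):
        for candidate in combinations(messages, size):
            faults = set(candidate)
            if any(prev <= faults for prev in found):
                continue  # a smaller solution is already contained in this one
            if all(faults & proof for proof in proofs):
                found.append(faults)
    return found


# The formula from the slide: (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]
for faults in minimal_fault_sets(proofs):
    print(sorted(faults))
# ['AB1', 'AC1', 'AC2']
# ['AC1', 'AC2', 'BC2']
```

Each result is a candidate fault-injection experiment: drop exactly those messages and check whether the good outcome still occurs. For the sender-retry trees, where the clauses are (AB1), (AB2), (AB3) and (AB4), the only message-loss solution is to drop all four.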
  • 54. By injecting only “interesting” faults… Molly finds bugs quickly
  • 55. By injecting only “interesting” faults… Molly finds bugs quickly
  • 56. By injecting only “interesting” faults… Molly provides guarantees that outcomes are fault-tolerant
        Program        Bound   Combinations    Executions
        redun-deliv    11      8.07 × 10^18    11
        ack-deliv      8       3.08 × 10^13    673
        paxos-synod    7       4.81 × 10^11    173
        bully-leader   10      1.26 × 10^17    2
        flux           22      6.20 × 10^76    187
  • 57. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly uses data lineage to reason about redundancy of support (or lack thereof) for system outcomes.
  • 58.
  • 59. Case study: commit protocols [Three space-time diagrams, each with one process marked CRASHED. 2-Phase commit: Agent a, Agent a, Coordinator, Agent d; messages p, v. Collaborative termination: Agent a, Agent b, Coordinator, Agent d; messages prepare, vote, decision_req. 3-Phase commit: Process a, Process b, Process C, Process d; messages cancommit, vote_msg, precommit, ack, commit, abort, abort (LOST)]
  • 60. 3PC in an asynchronous network [Space-time diagram: Process a, Process b, Process C, Process d; messages cancommit, vote_msg, precommit, ack, commit, abort, abort (LOST); one process marked CRASHED] Annotations: Brief network partition; Agent crash; Agents learn commit decision; d is dead; coordinator decides to abort; Agents A & B decide to commit