-
1.
Outwards
from the middle of the maze
Peter Alvaro
UC Berkeley
-
2.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
3.
The transaction concept
DEBIT_CREDIT:
BEGIN_TRANSACTION;
GET MESSAGE;
EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
PUT NEGATIVE RESPONSE;
ELSE DO;
ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
POST HISTORY RECORD ON ACCOUNT (DELTA);
CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
END;
COMMIT;
-
7.
The “top-down” ethos
-
13.
Transactions: a holistic contract
[Diagram: the application reads and writes an opaque store through the transaction interface]
-
14.
Transactions: a holistic contract
[Diagram: the application reads and writes an opaque store through the transaction interface]
Assert: balance > 0
-
18.
Incidental complexities
• The “Internet.” Searching it.
• Cross-datacenter replication schemes
• CAP Theorem
• Dynamo & MapReduce
• “Cloud”
-
19.
Fundamental complexity
“[…] distributed systems require that the
programmer be aware of latency, have a different
model of memory access, and take into account
issues of concurrency and partial failure.”
Jim Waldo et al.,
A Note on Distributed Computing (1994)
-
20.
A holistic contract
…stretched to the limit
[Diagram: the application reads and writes an opaque store through the transaction interface]
-
22.
Are you blithely asserting
that transactions aren’t webscale?
Some people just want to see the world burn.
Those same people want to see the world use inconsistent databases.
- Emin Gun Sirer
-
23.
Alternative to top-down design?
The “bottom-up,” systems tradition:
Simple, reusable components first.
Semantics later.
-
24.
Alternative:
the “bottom-up,” systems ethos
-
25.
The “bottom-up” ethos
-
31.
The “bottom-up” ethos
“‘Tis a fine barn, but sure ‘tis no castle, English”
-
32.
The “bottom-up” ethos
Simple, reusable components first.
Semantics later.
This is how we live now.
Question: Do we ever get those
application-level guarantees back?
-
33.
Low-level contracts
[Diagram: the application reads and writes a distributed key-value store]
-
35.
Low-level contracts
[Diagram: the application reads and writes a distributed key-value store]
History: R1(X=1), R2(X=1), W1(X=2), W2(X=0), W1(X=1), W1(Y=2), R2(Y=2), R2(X=0)
-
36.
Low-level contracts
[Diagram: the application reads and writes a distributed key-value store]
Assert: balance > 0
History: R1(X=1), R2(X=1), W1(X=2), W2(X=0), W1(X=1), W1(Y=2), R2(Y=2), R2(X=0)
-
37.
Low-level contracts
[Diagram: the application reads and writes a distributed key-value store]
Assert: balance > 0
Which contract? Causal? PRAM? Delta? Fork/join? Red/blue? Release?
History: R1(X=1), R2(X=1), W1(X=2), W2(X=0), W1(X=1), W1(Y=2), R2(Y=2), R2(X=0)
-
38.
When do contracts compose?
Application
Distributed service
Assert: balance > 0
-
39.
ew, did I get mongo in my riak?
Assert:
balance > 0
-
40.
Composition is the last hard
problem
Composing modules is hard enough
We must learn how to compose guarantees
-
41.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
42.
Why distributed systems are hard²
Asynchrony Partial Failure
Fundamental Uncertainty
-
43.
Asynchrony isn’t that hard
Amelioration:
Logical timestamps
Deterministic interleaving
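A minimal sketch (mine, not from the talk) of what "logical timestamps" buy you: a Lamport clock numbers events consistently with causality, so replicas can interleave buffered messages deterministically by (timestamp, sender) rather than by arrival order. The class and variable names are my own illustration.

class LamportClock:
    def __init__(self):
        self.time = 0
    def tick(self):                      # local event or send: bump the clock
        self.time += 1
        return self.time
    def recv(self, msg_time):            # merge the sender's timestamp on receipt
        self.time = max(self.time, msg_time) + 1
        return self.time

# Deterministic interleaving: apply buffered writes in (timestamp, sender) order,
# which every replica computes identically, whatever order messages arrived in.
inbox = [(3, "b", "w2"), (3, "a", "w1"), (2, "c", "w0")]
for _, _, write in sorted(inbox):
    print(write)                         # w0, w1, w2 on every replica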
-
44.
Partial failure isn’t that hard
Amelioration:
Replication
Replay
-
45.
(asynchrony * partial failure) = hard²
Logical timestamps
Deterministic interleaving
Replication
Replay
-
47.
(asynchrony * partial failure) = hard²
Tackling one clown at a time
Poor strategy for programming distributed systems
Winning strategy for analyzing distributed programs
-
48.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
49.
Distributed consistency
Today: A quick summary of some great work.
-
50.
Consider a (distributed) graph
[Diagram: a graph of nodes T1 through T14]
-
51.
Partitioned, for scalability
[Diagram: the graph of nodes T1 through T14, partitioned across machines]
-
52.
Replicated, for availability
[Diagram: two replicas of the partitioned graph of nodes T1 through T14]
-
53.
Deadlock detection
Task: Identify strongly-connected
components
Waits-for graph
[Diagram: the waits-for graph over T1 through T14]
-
54.
Garbage collection
Task: Identify nodes not reachable from Root.
Refers-to graph
[Diagram: the refers-to graph over T1 through T14, rooted at Root]
-
55.
[Diagram: the graph of nodes T1 through T14]
Correctness
Deadlock detection
• Safety: No false positives
• Liveness: Identify all deadlocks
Garbage collection
• Safety: Never GC live memory!
• Liveness: GC all orphaned memory
-
57.
Correctness
Deadlock detection
• Safety: No false positives
• Liveness: Identify all deadlocks
Garbage collection
• Safety: Never GC live memory!
• Liveness: GC all orphaned memory
[Diagram: the refers-to graph over T1 through T14, with a Root node]
-
58.
Consistency at the extremes
Application
Language
Custom s
olutions?
Flow
Object
Storage
Linearizable
key-value store?
-
59.
Consistency at the extremes
Application
Language
Custom s
olutions?
Flow
Object
Storage
Linearizable
key-value store?
-
60.
Consistency at the extremes
Application
Language
Custom s
olutions?
Flow
Efficient Object
Correct
Storage
Linearizable
key-value store?
-
61.
Object-level consistency
Capture semantics of data structures that
• allow greater concurrency
• maintain guarantees (e.g. convergence)
Application
Language
Flow
Object
Storage
-
62.
Object-level consistency
-
63.
Object-level consistency
Insert
Read
Convergent
data structure
(e.g., Set CRDT)
Insert
Read
Commutativity
Associativity
Idempotence
-
64.
Object-level consistency
Insert
Read
Convergent
data structure
(e.g., Set CRDT)
Insert
Read
Commutativity
Associativity
Idempotence
-
65.
Object-level consistency
Insert
Read
Convergent
data structure
(e.g., Set CRDT)
Insert
Read
Commutativity
Associativity
Idempotence
-
66.
Object-level consistency
[Diagram: two clients insert into and read from a convergent data structure (e.g., a Set CRDT)]
Commutativity, Associativity, Idempotence
Tolerant to: reordering, batching, retry/duplication
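A minimal sketch (my own illustration, not from the talk) of the convergent-object idea, using the simplest Set CRDT, a grow-only set: merge is set union, which is commutative, associative, and idempotent, so replicas converge despite reordering, batching, and retry/duplication. The GSet name is mine.

class GSet:
    def __init__(self):
        self.elems = set()
    def insert(self, x):                  # local update
        self.elems.add(x)
    def merge(self, other):               # join another replica's state (union)
        self.elems |= other.elems
    def read(self):
        return frozenset(self.elems)

# The same inserts reach two replicas reordered and with a duplicate:
r1, r2 = GSet(), GSet()
for x in ("a", "b", "c"):       r1.insert(x)
for x in ("c", "a", "a", "b"):  r2.insert(x)
r1.merge(r2); r2.merge(r1)
assert r1.read() == r2.read()             # replicas converge

Note that convergence of each object says nothing yet about application-level guarantees built on top, which is exactly the composition question the next slides raise.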
-
67.
Object-level composition?
Application built on convergent data structures
Assert: Graph replicas converge
-
68.
Object-level composition?
Application built on convergent data structures
GC Assert: No live nodes are reclaimed
Assert: Graph replicas converge
-
69.
Object-level composition?
Application built on convergent data structures
GC Assert: No live nodes are reclaimed (?)
Assert: Graph replicas converge (?)
-
70.
Flow-level consistency
[Layer diagram: Application, Language, Flow, Object, Storage, with Flow highlighted]
-
71.
Flow-level consistency
Capture semantics of data in motion
• Asynchronous dataflow model
• component properties → system-wide guarantees
[Dataflow diagram: Graph store, Transaction manager, Transitive closure, Deadlock detector]
-
72.
Flow-level consistency
Order-insensitivity (confluence)
output set = f(input set)
-
77.
Flow-level consistency
Order-insensitivity (confluence)
output set = f(input set)
[Diagram: two executions with differently ordered inputs produce equal output sets]
-
78.
Confluence is compositional
output set = f(g(input set))
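A sketch of the property (my own illustration): a confluent component behaves like a function from a set of inputs to a set of outputs, regardless of arrival order, and composing confluent components yields a confluent dataflow. Here the deadlock-detection pipeline from these slides is modeled as two confluent functions composed; the function names are mine.

def transitive_closure(edges):            # confluent: output only grows with input
    edges = set(edges)
    grew = True
    while grew:
        new = {(a, d) for (a, b) in edges for (c, d) in edges if b == c}
        grew = not new <= edges
        edges |= new
    return edges

def cycle_members(closed):                # also confluent over its input set
    return {a for (a, b) in closed if a == b}

def deadlock_detector(waits_for):         # composition: f(g(input set))
    return cycle_members(transitive_closure(waits_for))

one_order = [("t1", "t2"), ("t2", "t3"), ("t3", "t1")]
another = list(reversed(one_order))       # same input set, different arrival order
assert deadlock_detector(one_order) == deadlock_detector(another)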
-
81.
Graph queries as dataflow
Garbage collection dataflow: Graph store (confluent), Memory allocator (confluent), Transitive closure (confluent), Garbage collector (not confluent)
Deadlock detection dataflow: Graph store (confluent), Transaction manager (confluent), Transitive closure (confluent), Deadlock detector (confluent)
-
82.
Graph queries as dataflow
Garbage collection dataflow: Graph store (confluent), Memory allocator (confluent), Transitive closure (confluent), Garbage collector (not confluent). Coordinate here.
Deadlock detection dataflow: Graph store (confluent), Transaction manager (confluent), Transitive closure (confluent), Deadlock detector (confluent)
-
83.
Coordination: what is that?
Strategy 1: Establish a total order
[GC dataflow: Graph store (confluent), Memory allocator (confluent), Transitive closure (confluent), Garbage collector (not confluent); coordinate here]
-
84.
Coordination: what is that?
Strategy 2: Establish a producer-consumer barrier
[GC dataflow: Graph store (confluent), Memory allocator (confluent), Transitive closure (confluent), Garbage collector (not confluent); coordinate here]
-
85.
Fundamental costs: FT via replication
(mostly) free!
[Diagram: two replicas of the fully confluent deadlock-detection dataflow (Graph store, Transaction manager, Transitive closure, Deadlock detector)]
-
86.
Fundamental costs: FT via replication
global synchronization!
[Diagram: two replicas of the garbage-collection dataflow; the non-confluent Garbage Collector forces global synchronization (Paxos) between the replicas]
-
87.
Fundamental costs: FT via replication
The first principle of successful scalability is to batter the
consistency mechanisms down to a minimum.
– James Hamilton
[Diagram: two replicas of the garbage-collection dataflow; the non-confluent Garbage Collector is fed through a barrier rather than full synchronization]
-
88.
Language-level consistency
DSLs for distributed programming?
• Capture consistency concerns in the
type system
[Layer diagram: Application, Language, Flow, Object, Storage, with Language highlighted]
-
89.
Language-level consistency
CALM Theorem:
Monotonic → confluent
Conservative, syntactic test for confluence
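A sketch of the intuition (mine, not the formal statement): a monotonic query such as reachability only ever adds conclusions as new inputs arrive, so its answers never need retraction and the computation is confluent; a query that uses negation, such as "not reachable from Root" in the garbage collector, can retract a conclusion when a late edge shows up, which is why it needs coordination. Function names below are my own.

def reachable(root, edges):
    seen, frontier = {root}, [root]
    while frontier:
        n = frontier.pop()
        for (src, dst) in edges:
            if src == n and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen                            # monotone: more edges never mean fewer answers

def garbage(root, nodes, edges):           # nonmonotonic: "not reachable"
    return set(nodes) - reachable(root, edges)

nodes = {"Root", "x", "y"}
early = {("Root", "x")}                    # edge ("x", "y") is still in flight
print(garbage("Root", nodes, early))                   # {'y'}: premature GC verdict
print(garbage("Root", nodes, early | {("x", "y")}))    # set(): the verdict is retracted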
-
91.
Language-level consistency
Deadlock detector
Garbage collector (nonmonotonic)
-
92.
Let’s review
• Consistency is tolerance to asynchrony
• Tricks:
– focus on data in motion, not at rest
– avoid coordination when possible
– choose coordination carefully otherwise
(Tricks are great, but tools are better)
-
93.
Outline
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
94.
Grand challenge: composition
Hard problem:
Is a given component fault-tolerant?
Much harder:
Is this system (built up from components)
fault-tolerant?
-
95.
Example: Atomic
multi-partition update
[Diagram: the partitioned graph of nodes T1 through T14]
Two-phase commit
-
96.
Example: replication
[Diagram: two replicas of the graph of nodes T1 through T14]
Reliable broadcast
-
97.
Popular wisdom: don’t reinvent
-
98.
Example: Kafka replication bug
Three “correct” components:
1. Primary/backup replication
2. Timeout-based failure detectors
3. Zookeeper
One nasty bug:
Acknowledged writes are lost
-
99.
A guarantee would be nice
Bottom-up approach:
• Use formal methods to verify individual components (e.g., protocols)
• Build systems from verified components
Shortcomings:
• Hard to use
• Hard to compose
[Chart: investment vs. returns]
-
100.
Bottom-up assurances
[Diagram: formal verification relates a Program, its Environment, and a Correctness Spec]
-
101.
Composing bottom-up
assurances
-
102.
Composing bottom-up
assurances
Issue 1: incompatible failure models
e.g., crash failures vs. omission failures
Issue 2: Specs do not compose
(FT is an end-to-end property)
If you take 10 components off the shelf, you are putting 10 world views
together, and the result will be a mess. -- Butler Lampson
-
106.
Top-down “assurances”
-
107.
Top-down “assurances”
Testing
-
108.
Top-down “assurances”
Fault injection
Testing
-
110.
End-to-end testing
would be nice
Top-down approach:
• Build a large-scale system
• Test the system under faults
Shortcomings:
• Hard to identify complex bugs
• Fundamentally incomplete
[Chart: investment vs. returns]
-
111.
Lineage-driven fault injection
Goal: top-down testing that
• finds all of the fault-tolerance bugs, or
• certifies that none exist
-
112.
Lineage-driven fault injection
[Diagram: a correctness specification plus "malevolent sentience" go into Molly]
-
114.
Lineage-driven fault injection
(LDFI)
Approach: think backwards from outcomes
Question: could a bad thing ever happen?
Reframe:
• Why did a good thing happen?
• What could have gone wrong along the way?
-
115.
Thomasina: What a faint-heart! We must
work outward from the middle of the
maze. We will start with something simple.
-
116.
The game
• Both players agree on a failure model
• The programmer provides a protocol
• The adversary observes executions and
chooses failures for the next execution.
-
117.
Dedalus: it’s about data
log(B, "data")@5
(what: the log relation; where: node B; when: time 5; some data: "data")
-
118.
Dedalus: it’s like Datalog
consequence :- premise(s)
log(Node, Pload) :- bcast(Node, Pload);
-
119.
Dedalus: it’s like Datalog
consequence :- premise(s)
log(Node, Pload) :- bcast(Node, Pload);
(Which is like SQL)
create view log as
select Node, Pload from bcast;
-
120.
Dedalus: it’s about time
consequence@when :- premise(s)
node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
-
121.
Dedalus: it’s about time
consequence@when :- premise(s)
node(Node, Neighbor)@next :- node(Node, Neighbor);      (state change)
log(Node2, Pload)@async :- bcast(Node1, Pload),         (communication)
                           node(Node1, Node2);          (natural join: bcast.Node1 == node.Node1)
-
122.
The match
Protocol:
Reliable broadcast
Specification:
Pre: A correct process delivers a message m
Post: All correct processes deliver m
Failure Model:
(Permanent) crash failures
Message loss / partitions
-
123.
Round 1
node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next :- log(Node, Pload);
log(Node, Pload) :- bcast(Node, Pload);
log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
“An effort” delivery protocol
-
124.
Round 1 in space / time
[Space-time diagram: a broadcasts at time 1; b and c log the payload at time 2]
-
125.
Round 1: Lineage
log(B, data)@5
-
126.
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;
-
127.
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(B, data)@3
-
128.
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(B, data)@3
log(B, data)@2
-
129.
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(B, data)@3
log(B, data)@2
log(A, data)@1
log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
log(B, data)@2 :- bcast(A, data)@1,
                  node(A, B)@1;
-
130.
An execution is a (fragile) “proof”
of an outcome
[Proof tree: log(B, data)@5 follows via persistence (r1) from log(B, data)@4, @3, and @2, which follows via rule r2 from log(A, data)@1, node(A, B)@1, and message AB1]
(which required a message from A to B at time 1)
-
131.
Valentine: “The unpredictable and the
predetermined unfold together to make
everything the way it is.”
-
132.
Round 1: counterexample
[Space-time diagram: a broadcasts at time 1; the message to b is lost, while c logs the payload]
The adversary wins!
-
133.
Round 2
Same as Round 1, but A retries.
bcast(N, P)@next :- bcast(N, P);
-
134.
Round 2 in spacetime
[Space-time diagram: a rebroadcasts at every timestep; b and c log the payload repeatedly]
-
135.
Round 2
log(B, data)@5
-
136.
Round 2
log(B, data)@5
log(B, data)@4
log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;
-
137.
Round 2
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
-
138.
Round 2
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(B, data)@3
log(A, data)@3
-
139.
Round 2
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(B, data)@3
log(A, data)@3
log(B, data)@2
log(A, data)@2
-
140.
Round 2
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(B, data)@3
log(A, data)@3
log(B, data)@2
log(A, data)@2
log(A, data)@1
-
141.
Round 2
Retry provides redundancy in time
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(B, data)@3
log(A, data)@3
log(B, data)@2
log(A, data)@2
log(A, data)@1
-
142.
Traces are forests of proof trees
[Proof forest: four proof trees for log(B, data)@5, one per retry, requiring messages AB1, AB2, AB3, and AB4 respectively]
AB1 ^ AB2 ^ AB3 ^ AB4
-
144.
Round 2: counterexample
[Space-time diagram: a broadcasts at time 1; the message to b is lost and a crashes at time 2; only c logs the payload]
The adversary wins!
-
145.
Round 3
Same as in Round 2, but symmetrical.
bcast(N, P)@next :- log(N, P);
-
146.
Round 3 in space / time
[Space-time diagram: every process that has logged the payload rebroadcasts it to the others at every timestep]
Redundancy in space and time
-
147.
Round 3 -- lineage
log(B, data)@5
-
148.
Round 3 -- lineage
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(C, data)@4
-
149.
Round 3 -- lineage
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(C, data)@4
log(B, data)@3
log(A, data)@3
log(C, data)@3
-
150.
Round 3 -- lineage
log(B, data)@5
log(B, data)@4
log(A, data)@4
log(C, data)@4
log(B, data)@3
log(A, data)@3
log(C, data)@3
log(B, data)@2
log(A, data)@2
log(C, data)@2
log(A, data)@1
-
152.
Round 3
The programmer wins!
-
153.
Let’s reflect
Fault-tolerance is redundancy in space and
time.
Best strategy for both players: reason
backwards from outcomes using lineage
Finding bugs: find a set of failures that
“breaks” all derivations
Fixing bugs: add additional derivations
-
154.
The role of the adversary
can be automated
1. Break a proof by dropping any contributing
message.
(AB1 ∨ BC2)
Disjunction
-
155.
The role of the adversary
can be automated
1. Break a proof by dropping any contributing
message.
2. Find a set of failures that breaks all proofs
of a good outcome.
(AB1 ∨ BC2)
Disjunction
∧ (AC1) ∧ (AC2)
Conjunction of disjunctions (AKA CNF)
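Molly hands formulas like this to a SAT solver; the following is only my brute-force stand-in over the clause set shown on the slide, to make the adversary's job concrete. Each clause lists messages whose loss would break one proof of the good outcome, and a set of drops that intersects every clause breaks them all.

from itertools import combinations

clauses = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]    # (AB1 v BC2) ^ (AC1) ^ (AC2)
messages = sorted(set().union(*clauses))

def candidate_faults(max_drops):
    """Yield sets of dropped messages that break every known proof."""
    for k in range(1, max_drops + 1):
        for drops in combinations(messages, k):
            if all(clause & set(drops) for clause in clauses):
                yield set(drops)

print(next(candidate_faults(max_drops=3)))       # e.g. {'AB1', 'AC1', 'AC2'}

Each such set becomes the failure schedule for the next execution; if the good outcome still holds, its new lineage contributes more clauses and the game continues.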
-
157.
Molly, the LDFI prototype
Molly finds fault-tolerance violations quickly
or guarantees that none exist.
Molly finds bugs by explaining good
outcomes – then it explains the bugs.
Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka
Certified correct: paxos (synod), Flux, bully
leader election, reliable broadcast
-
158.
Commit protocols
Problem:
Atomically change things
Correctness properties:
1. Agreement (All or nothing)
2. Termination (Something)
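A minimal sketch (mine, not from the talk) of two-phase commit's control flow, with timeouts, logging, and recovery omitted; those omissions are exactly where the adversary attacks in the next slides. Class and function names are my own.

class Agent:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"
    def prepare(self):                     # phase 1: vote
        self.state = "prepared"
        return self.can_commit
    def finish(self, decision):            # phase 2: learn the decision
        self.state = decision

def two_phase_commit(agents):
    votes = [a.prepare() for a in agents]              # "Can I kick it?"
    decision = "commit" if all(votes) else "abort"     # "YES YOU CAN" (or not)
    for a in agents:                                   # if the coordinator crashes
        a.finish(decision)                             # here, prepared agents block
    return decision

agents = [Agent(), Agent(), Agent()]
assert two_phase_commit(agents) == "commit"            # agreement: all or nothing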
-
159.
Two-phase commit
[Space-time diagram: the coordinator sends prepare to agents a, b, and d; the agents reply with votes; the coordinator sends commit to all]
-
160.
Two-phase commit
[Space-time diagram: prepare ("Can I kick it?"), vote, commit]
-
161.
Two-phase commit
[Space-time diagram: prepare ("Can I kick it?"), vote ("YES YOU CAN"), commit]
-
162.
Two-phase commit
[Space-time diagram: prepare ("Can I kick it?"), vote ("YES YOU CAN"), commit ("Well I'm gone")]
-
163.
Two-phase commit
[Space-time diagram: the coordinator sends prepare and collects votes, then crashes before sending the decision; the prepared agents block]
Violation: Termination
-
164.
The collaborative termination protocol
Basic idea:
Agents talk amongst themselves when the
coordinator fails.
Protocol: On timeout, ask other agents
about decision.
-
165.
2PC - CTP
[Space-time diagram: the coordinator sends prepare and collects votes, then crashes; on timeout, agents a, b, and d exchange decision_req messages with one another]
-
166.
2PC - CTP
[Space-time diagram: prepare ("Can I kick it?"), votes ("YES YOU CAN"), coordinator crash, then decision_req messages among the agents ("……?")]
-
167.
3PC
Basic idea:
Add a round, a state, and simple failure
detectors (timeouts).
Protocol:
1. Phase 1: Just like in 2PC
– Agent timeout → abort
2. Phase 2: send precommit, collect acks
– Agent timeout → commit
3. Phase 3: Just like phase 2 of 2PC
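A tiny sketch (mine) of the agent-side timeout defaults this list describes, which is the extra state the failure detectors exploit: roughly, before precommit is seen a timed-out agent aborts; after acking precommit it commits. The state names are my own labels.

TIMEOUT_ACTION = {
    "voted_waiting_for_precommit": "abort",    # early phases: timeout => abort
    "acked_precommit_waiting_for_commit": "commit",   # after precommit: timeout => commit
}

def on_timeout(agent_state):
    return TIMEOUT_ACTION.get(agent_state, "abort")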
-
168.
3PC
[Space-time diagram: the coordinator sends cancommit; agents reply with vote_msg; the coordinator sends precommit; agents ack; the coordinator sends commit]
-
169.
3PC
[Space-time diagram: cancommit, vote_msg, precommit, ack, commit; timeout → abort before precommit, timeout → commit after precommit]
-
170.
Network partitions
make 3pc act crazy
[Space-time diagram: cancommit, vote_msg, precommit, ack; agent d crashes; the coordinator's abort messages to a and b are lost; a and b commit]
-
171.
Network partitions
make 3pc act crazy
[Space-time diagram, as above]
Annotations: agent crash; agents learn the commit decision.
-
172.
Network partitions
make 3pc act crazy
[Space-time diagram, as above]
Annotations: agent crash; agents learn the commit decision; d is dead, so the coordinator decides to abort.
-
173.
Network partitions
make 3pc act crazy
[Space-time diagram, as above]
Annotations: brief network partition; agent crash; agents learn the commit decision; d is dead, so the coordinator decides to abort.
-
174.
Network partitions
make 3pc act crazy
[Space-time diagram, as above]
Annotations: brief network partition; agent crash; agents learn the commit decision; d is dead, so the coordinator decides to abort; agents a and b decide to commit.
-
175.
Kafka durability bug
[Space-time diagram: Kafka replicas a, b, and c, Zookeeper, and a client exchanging messages]
176.
Kafka durability bug
Replica b Replica c Zookeeper Replica a Client
1 1
2
1
3
4
CRASHED
1
3
5
m m
m
m l
a
c
w
Brief network
partition
-
177.
Kafka durability bug
[Space-time diagram, as above]
Annotations: brief network partition; a becomes leader and sole replica.
-
178.
Kafka durability bug
[Space-time diagram, as above]
Annotations: brief network partition; a becomes leader and sole replica; a ACKs the client write.
-
179.
Kafka durability bug
[Space-time diagram, as above]
Annotations: brief network partition; a becomes leader and sole replica; a ACKs the client write; data loss.
-
180.
Molly summary
Lineage allows us to reason backwards
from good outcomes
Molly: surgically-targeted fault injection
Investment similar to testing
Returns similar to formal methods
-
181.
Where we’ve been; where we’re headed
1. Mourning the death of transactions
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
182.
Where we’ve been; where we’re headed
1. We need application-level guarantees
2. What is so hard about distributed systems?
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
184.
Where we’ve been; where we’re headed
1. We need application-level guarantees
2. (asynchrony X partial failure) = too hard to
hide! We need tools to manage it.
3. Distributed consistency: managing asynchrony
4. Fault-tolerance: progress despite failures
-
186.
Where we’ve been; where we’re headed
1. We need application-level guarantees
2. asynchrony X partial failure = too hard to hide!
We need tools to manage it.
3. Focus on flow: data in motion
4. Fault-tolerance: progress despite failures
-
187.
Outline
1. We need application-level guarantees
2. asynchrony X partial failure = too hard to hide!
We need tools to manage it.
3. Focus on flow: data in motion
4. Fault-tolerance: progress despite failures
-
188.
Outline
1. We need application-level guarantees
2. asynchrony X partial failure = too hard to hide!
We need tools to manage it.
3. Focus on flow: data in motion
4. Backwards from outcomes
-
189.
Remember
1. We need application-level guarantees
2. asynchrony X partial failure = too hard to hide! We
need tools to manage it.
3. Focus on flow: data in motion
4. Backwards from outcomes
Composition is the hardest problem
-
190.
A happy crisis
Valentine: “It makes me so happy. To be at
the beginning again, knowing almost
nothing.... It's the best possible time of
being alive, when almost everything you
thought you knew is wrong.”
Speaker notes
USER-CENTRIC
OMG, pause here. Remember Brewer 2012? Top-down vs. bottom-up designs? We had this top-down thing and it was beautiful.
It was so beautiful that it didn’t matter that it was somewhat ugly
The abstraction was so beautiful,
IT DOESN'T MATTER WHAT'S UNDERNEATH. Wait, or does it? When does it?
We’ve known for a long time that it is hard to hide the complexities of distribution
Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric
But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
FIX ME: joe’s idea: sketch of a castle being filled in, vs bricks
In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
Meaning: translation
DS are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can "leak" into the results.
It’s astoundingly difficult to face these demons at the same time – tempting to try to defeat them one at a time.
Async isn’t a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order.
Whoa, this is easy so far.
Failure isn’t a problem: just do redundant computation and store redundant data. Make more copies than there will be failures.
I win.
We can’t do deterministic interleaving if producers may fail.
Nondeterministic message order makes it hard to keep replicas in agreement.
To guard against failures, we replicate.
NB: asynchrony => replicas might not agree
Very similar-looking criteria (one safety, one liveness). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions.
FIX: make it about translation vs. prayer
I.e., reorderability, batchability, tolerance to duplication / retry.
Now programmer must map from application invariants to object API (with richer semantics than read/write).
Convergence is a property of component state. It rules out divergence, but it does not readily compose.
However, not sufficient to synchronize GC.
Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give?
To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app.
*** main point to make here: we'd like to reason backwards from the outcomes, at the level of abstraction of the application.
We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence.
A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
Confluence is compositional: Composing confluent components yields a confluent dataflow
All of these components are confluent! Composing confluent components yields a confluent dataflow
But annotations are burdensome
A separate question is choosing a coordination strategy that "fits" the problem without "overpaying." For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
M – a semantic property of code – implies confluence
An appropriately constrained language provides a conservative syntactic test for M.
Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.).
Try to not use it! Learn how to choose it. Tools help!
Start with a hard problem. Hard problem: does my FT protocol work?
Harder: is the composition of my components FT?
Point: we need to replicate data to both copies of a replica
We need to commit multiple partitions together
Examples! 2pc and replication. Properties, etc etc
Talk about speed too.
After all, FT is an end-to-end concern.
(synchronous)
TALK ABOUT SAT!!!