RICON keynote: outwards from the middle of the maze

  1. 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  2. 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  3. 3. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  4. 4. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  5. 5. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  6. 6. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  7. 7. The “top-down” ethos
  8. 8. The “top-down” ethos
  9. 9. The “top-down” ethos
  10. 10. The “top-down” ethos
  11. 11. The “top-down” ethos
  12. 12. The “top-down” ethos
  13. 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  14. 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  15. 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  16. 16. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  17. 17. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  18. 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  19. 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  20. 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  21. 21. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  22. 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  23. 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  24. 24. Alternative: the “bottom-up,” systems ethos
  25. 25. The “bottom-up” ethos
  26. 26. The “bottom-up” ethos
  27. 27. The “bottom-up” ethos
  28. 28. The “bottom-up” ethos
  29. 29. The “bottom-up” ethos
  30. 30. The “bottom-up” ethos
  31. 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  32. 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  33. 33. Low-level contracts Write Read Application Distributed store KVS
  34. 34. Low-level contracts Write Read Application Distributed store KVS
  35. 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  36. 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  37. 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  38. 38. When do contracts compose? Application Distributed service Assert: balance > 0
  39. 39. Ew, did I get mongo in my riak? Assert: balance > 0
  40. 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  41. 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  42. 42. Why distributed systems are hard² Asynchrony Partial Failure Fundamental Uncertainty
  43. 43. Asynchrony isn’t that hard Amelioration: Logical timestamps Deterministic interleaving
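The two ameliorations named on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class name `LamportClock`, the helper `deterministic_order`, and the (timestamp, sender, payload) message shape are all assumptions made for the example.

```python
class LamportClock:
    """Logical clock: orders events by counting, not by wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned value is stamped on the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, msg_time) + 1
        return self.time


def deterministic_order(messages):
    """Sort (timestamp, sender, payload) tuples so that every replica
    interleaves them identically, whatever order they arrived in."""
    return sorted(messages, key=lambda m: (m[0], m[1]))


a, b = LamportClock(), LamportClock()
t_send = a.send()            # a's clock becomes 1
b.tick()
b.tick()                     # b's clock becomes 2
t_recv = b.receive(t_send)   # max(2, 1) + 1
```

Ties between equal timestamps are broken by sender name here; any fixed total order on senders would serve the same purpose.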
  44. 44. Partial failure isn’t that hard Amelioration: Replication Replay
  45. 45. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  46. 46. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  47. 47. (asynchrony * partial failure) = hard² Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
  48. 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  49. 49. Distributed consistency Today: A quick summary of some great work.
  50. 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  51. 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  52. 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  53. 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  54. 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  55. 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  56. 56. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  57. 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
  58. 58. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  59. 59. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  60. 60. Consistency at the extremes Application Language Custom solutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  61. 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  62. 62. Object-level consistency
  63. 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  64. 64. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  65. 65. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  66. 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Reordering Batching Retry/duplication Tolerant to
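The three algebraic properties on this slide are what make a convergent set tolerant to reordering, batching, and retry. A minimal sketch of a grow-only set follows; the class name `GSet` and its methods are hypothetical names chosen for illustration, not an API from the talk or from any particular CRDT library.

```python
class GSet:
    """Grow-only set: the merge is set union, which is commutative,
    associative, and idempotent."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def insert(self, x):
        # Returns a new replica state; the old one is left untouched.
        return GSet(self.items | {x})

    def merge(self, other):
        # Union absorbs duplicates and ignores arrival order.
        return GSet(self.items | other.items)

    def read(self):
        return self.items


x, y, z = GSet({1}), GSet({2}), GSet({3})

reordered = x.merge(y).read() == y.merge(x).read()                     # commutativity
batched = x.merge(y).merge(z).read() == x.merge(y.merge(z)).read()     # associativity
retried = x.merge(x).read() == x.read()                                # idempotence
```

Each boolean corresponds to one line of the slide: commutativity gives tolerance to reordering, associativity to batching, and idempotence to retry or duplication.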
  67. 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  68. 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  69. 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  70. 70. Flow-level consistency Application Language Flow Object Storage
  71. 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties → system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  72. 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  73. 73. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  74. 74. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  75. 75. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  76. 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  77. 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
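One way to read "output set = f(input set)" is that a confluent component produces the same set of outputs no matter how its inputs are interleaved. The sketch below checks this for transitive closure by feeding the same edges in every possible arrival order. The function names and the three-edge graph are assumptions made for the example.

```python
from itertools import permutations


def transitive_closure(edges):
    """Naive fixpoint: keep adding (a, d) whenever (a, b) and (b, d) exist."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def run_incrementally(arrival_order):
    """Emit closure facts as edges trickle in, one at a time."""
    seen, out = set(), set()
    for edge in arrival_order:
        seen.add(edge)
        out |= transitive_closure(seen)
    return out


edges = [("T1", "T2"), ("T2", "T3"), ("T3", "T1")]

# One output set per arrival order; confluence means they all coincide.
outputs = {frozenset(run_incrementally(p)) for p in permutations(edges)}
```

Because the closure only ever grows as more edges arrive, early outputs are never retracted, which is why the arrival order does not matter.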
  78. 78. Confluence is compositional output set = f ∘ g(input set)
  79. 79. Confluence is compositional output set = f ∘ g(input set)
  80. 80. Confluence is compositional output set = f ∘ g(input set)
  81. 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  82. 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  83. 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  84. 84. Coordination: what is that? Strategy 2: Establish a producer-consumer Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent barrier
  85. 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  86. 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  87. 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  88. 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  89. 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  90. 90. Language-level consistency Deadlock detector Garbage collector
  91. 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
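The contrast between the two graph queries can be made concrete. In the sketch below (function names and the tiny graphs are assumptions for illustration), a deadlock verdict survives the arrival of more edges, whereas a garbage verdict can be invalidated by an edge that arrives late.

```python
def reachable(edges, start):
    """All nodes reachable from `start`, including `start` itself."""
    seen, frontier = {start}, [start]
    while frontier:
        n = frontier.pop()
        for (a, b) in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


def deadlocked(waits_for):
    """Monotonic: a node is deadlocked if it lies on a waits-for cycle."""
    return {a for (a, b) in waits_for if a in reachable(waits_for, b)}


def garbage(refers_to, nodes, root):
    """Nonmonotonic: a node is garbage if it is NOT reachable from the root."""
    return set(nodes) - reachable(refers_to, root)


# More input can only add deadlock verdicts, never remove them.
partial_waits = {("T1", "T2"), ("T2", "T1")}
full_waits = partial_waits | {("T2", "T3")}

# More input can retract a garbage verdict.
nodes = {"Root", "T1", "T2"}
early_refs = {("Root", "T1")}
late_refs = early_refs | {("T1", "T2")}
```

With only `early_refs` in hand, T2 looks like garbage; once the delayed edge arrives it turns out to be live. This is the sense in which the garbage collector needs coordination before it acts.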
  92. 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  93. 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  94. 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  95. 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  96. 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  97. 97. Popular wisdom: don’t reinvent
  98. 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  99. 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  100. 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  101. 101. Composing bottom-up assurances
  102. 102. Composing bottom-up assurances Issue 1: incompatible failure models e.g., crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  103. 103. Composing bottom-up assurances
  104. 104. Composing bottom-up assurances
  105. 105. Composing bottom-up assurances
  106. 106. Top-down “assurances”
  107. 107. Top-down “assurances” Testing
  108. 108. Top-down “assurances” Fault injection Testing
  109. 109. Top-down “assurances” Fault injection Testing
  110. 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  111. 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  112. 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  113. 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  114. 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
  115. 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  116. 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  117. 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  118. 118. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  119. 119. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  120. 120. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  121. 121. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  122. 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures Message loss / partitions
  123. 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); “An effort” delivery protocol
  124. 124. Round 1 in space / time Process b Process a Process c 2 1 2 log log
  125. 125. Round 1: Lineage log(B, data)@5
  126. 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  127. 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  128. 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  129. 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
  130. 130. An execution is a (fragile) “proof” of an outcome log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 (which required a message from A to B at time 1)
  131. 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  132. 132. Round 1: counterexample Process b Process a Process c 1 2 log (LOST) log The adversary wins!
  133. 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  134. 134. Round 2 in spacetime Process b Process a Process c 2 3 4 5 1 2 3 4 2 3 4 5 log log log log log log log log
  135. 135. Round 2 log(B, data)@5
  136. 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  137. 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  138. 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  139. 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  140. 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  141. 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  142. 142. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  143. 143. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  144. 144. Round 2: counterexample Process b Process a Process c 1 log (LOST) log CRASHED 2 The adversary wins!
  145. 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  146. 146. Round 3 in space / time Process b Process a Process c 2 3 4 5 1 log log 2 3 4 5 2 3 4 5 log log log log log log log log log log log log log log log log log log Redundancy in space and time
  147. 147. Round 3 -- lineage log(B, data)@5
  148. 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  149. 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3
  150. 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  151. 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  152. 152. Round 3 The programmer wins!
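The three rounds can be replayed with a toy simulator. This is a rough sketch under stated assumptions, not Dedalus and not Molly: every node is assumed to be a neighbor of every other, an `@async` message is assumed to take exactly one timestep, and the names `simulate`, `spec_holds`, and the `mode` strings are invented for the example. It only checks the fault sets it is handed; it does not prove that Round 3 is correct.

```python
def simulate(mode, nodes, sender, dropped=frozenset(), crashes=None, eot=5):
    """Return the set of nodes whose log holds the payload at time `eot`.

    mode:    "once"  -> sender broadcasts at time 1 only        (Round 1)
             "retry" -> sender rebroadcasts at every timestep   (Round 2)
             "relay" -> every node that has logged rebroadcasts (Round 3)
    dropped: set of (src, dst, send_time) messages lost by the adversary
    crashes: {node: time}; a node does nothing from that time onward
    """
    crashes = crashes or {}
    has = {sender}             # nodes that have logged the payload
    broadcasting = {sender}    # nodes with an active bcast fact
    for t in range(1, eot):
        def alive(n):
            return crashes.get(n, eot + 1) > t

        arrivals = set()
        for src in broadcasting:
            if not alive(src):
                continue
            for dst in nodes:
                if dst != src and alive(dst) and (src, dst, t) not in dropped:
                    arrivals.add(dst)
        has |= arrivals        # log persists once written
        if mode == "once":
            broadcasting = set()
        elif mode == "retry":
            broadcasting = {sender}
        else:
            broadcasting = set(has)
    return has


def spec_holds(has, nodes, crashes=None):
    """If any correct process delivered, all correct processes delivered."""
    correct = set(nodes) - set(crashes or {})
    delivered = has & correct
    return not delivered or delivered == correct


nodes = ["a", "b", "c"]
lost = {("a", "b", 1)}
crash = {"a": 2}

round1 = spec_holds(simulate("once", nodes, "a", dropped=lost), nodes)
round2 = spec_holds(simulate("retry", nodes, "a", dropped=lost, crashes=crash), nodes, crash)
round3 = spec_holds(simulate("relay", nodes, "a", dropped=lost, crashes=crash), nodes, crash)
```

The fault sets mirror the two counterexample slides: one lost message defeats Round 1, and the same loss plus a crash of the sender defeats Round 2, while Round 3 survives both because c relays the payload to b.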
  153. 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  154. 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  155. 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
  156. 156. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
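The adversary's job, as described here, amounts to finding a set of message drops that intersects every proof of the good outcome. The sketch below does this by brute force over small inputs. The function name `minimal_hitting_sets` is hypothetical, and the exhaustive search is a stand-in chosen for readability; the slides only say the problem is phrased as CNF, which suggests a solver would be used at scale.

```python
from itertools import combinations


def minimal_hitting_sets(proofs):
    """Each proof is the set of messages it depends on. Dropping any one of
    them breaks that proof (a disjunction); breaking every proof requires
    hitting all of them (a conjunction of disjunctions)."""
    universe = sorted(set().union(*proofs))
    found = []
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            s = set(candidate)
            breaks_all = all(s & proof for proof in proofs)
            is_minimal = not any(f <= s for f in found)
            if breaks_all and is_minimal:
                found.append(s)
    return found


# The formula shown on the slide: (AB1 v BC2) ^ (AC1) ^ (AC2)
slide_proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]
slide_faults = minimal_hitting_sets(slide_proofs)

# Round 2: each retry is an independent proof, so all four must be dropped.
retry_proofs = [{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]
retry_faults = minimal_hitting_sets(retry_proofs)
```

Each returned set is a candidate fault schedule to try in the next execution; if the outcome still holds, the new execution contributes fresh proofs and the search repeats.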
  157. 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  158. 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  159. 159. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit
  160. 160. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it?
  161. 161. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN
  162. 162. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN Well I’m gone
  163. 163. Two-phase commit Agent a Agent b Coordinator Agent d 2 2 1 p p p 3 CRASHED 2 v v v Violation: Termination
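The termination violation can be reproduced with a deliberately tiny model of two-phase commit. Everything here is an assumption for illustration: the function names, the state strings, and the simplification that a crash happens after all votes arrive but before any decision is sent.

```python
def two_phase_commit(votes, coordinator_crashes_after_votes=False):
    """votes: {agent: bool}. Returns each agent's final state."""
    # Phase 1: the coordinator sends prepare and the agents vote.
    if coordinator_crashes_after_votes:
        # Phase 2 never runs. A yes-voter has promised to commit if told to,
        # so it can neither commit nor abort on its own.
        return {a: ("blocked" if v else "abort") for a, v in votes.items()}
    # Phase 2: the coordinator decides and tells everyone.
    decision = "commit" if all(votes.values()) else "abort"
    return {a: decision for a in votes}


def agreement(states):
    """All or nothing: no two agents reach different decisions."""
    decided = {s for s in states.values() if s in ("commit", "abort")}
    return len(decided) <= 1


def termination(states):
    """Something: every agent eventually decides."""
    return all(s in ("commit", "abort") for s in states.values())


votes = {"a": True, "b": True, "d": True}
healthy = two_phase_commit(votes)
crashed = two_phase_commit(votes, coordinator_crashes_after_votes=True)
```

In the crashed run agreement still holds vacuously, since nobody decided, but termination fails, which matches the violation named on the slide.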
  164. 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  165. 165. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req
  166. 166. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req Can I kick it? YES YOU CAN ……?
  167. 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send precommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  168. 168. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit
  169. 169. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit Timeout → Abort Timeout → Commit
  170. 170. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg
  171. 171. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision
  172. 172. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  173. 173. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  174. 174. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  175. 175. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w
  176. 176. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition
  177. 177. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica
  178. 178. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write
  179. 179. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write Data loss
  180. 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  181. 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  182. 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  183. 183. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  184. 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  185. 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  186. 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  187. 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  188. 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  189. 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  190. 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”

Editor's Notes

  • USER-CENTRIC
  • OMG pause here. Remember brewer 2012? Top-down vs bottom-up designs? We had this top-down thing and it was beautiful.
  • It was so beautiful that it didn’t matter that it was somewhat ugly
  • The abstraction was so beautiful,
    IT DOESN’T MATTER WHAT’S UNDERNEATH. Wait, or does it? When does it?
  • We’ve known for a long time that it is hard to hide the complexities of distribution
  • Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric
    But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • FIX ME: joe’s idea: sketch of a castle being filled in, vs bricks
    But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  • In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  • Meaning: translation
  • Distributed systems are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can “leak” into the results.
    It’s astoundingly difficult to face these demons at the same time – tempting to try to defeat them one at a time.
  • Async isn’t a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order.
    Whoa, this is easy so far.
  • Failure isn’t a problem: just do redundant computation and store redundant data. Make more copies than there will be failures.
    I win.
  • We can’t do deterministic interleaving if producers may fail.
    Nondeterministic message order makes it hard to keep replicas in agreement.
  • To guard against failures, we replicate.
    NB: asynchrony => replicas might not agree
  • Very similar-looking criteria (one safety, one liveness). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions.
  • FIX: make it about translation vs. prayer

  • I.e., reorderability, batchability, tolerance to duplication / retry.
    Now programmer must map from application invariants to object API (with richer semantics than read/write).
  • Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  • However, not sufficient to synchronize GC.
    Perhaps more importantly, not *compositional* – what guarantees does my app – pieced together from many convergent objects – give?
    To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app.
    *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the application.
  • We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence.
    A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  • Confluence is compositional: Composing confluent components yields a confluent dataflow
  • All of these components are confluent! Composing confluent components yields a confluent dataflow
    But annotations are burdensome
  • A separate question is choosing a coordination strategy that “fits” the problem without “overpaying.” For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  • M – a semantic property of code – implies confluence
    An appropriately constrained language provides a conservative syntactic test for M.
  • Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.)
  • Try to not use it! Learn how to choose it. Tools help!
  • Start with a hard problem. Hard problem: does my FT protocol work?
    Harder: is the composition of my components FT?
  • Point: we need to replicate data to both copies of a replica
    We need to commit multiple partitions together
  • Examples! 2PC and replication. Properties, etc etc
  • Talk about speed too.
  • After all, FT is an end-to-end concern.
  • (synchronous)
  • TALK ABOUT SAT!!!
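
The notes above describe convergent objects in terms of reorderability, batchability, and tolerance to duplication / retry. A minimal sketch of one such object follows, assuming a grow-only counter as the example. The names `GCounter` and `merge_all` are hypothetical and chosen for illustration; they are not taken from the talk.

```python
from itertools import permutations


class GCounter:
    """Grow-only counter with one slot per replica (illustrative, not from the talk)."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id, amount=1):
        # Each replica only ever increases its own slot.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        # Pointwise max is commutative, associative, and idempotent,
        # so merges can be reordered, batched, and retried safely.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def value(self):
        return sum(self.counts.values())


def merge_all(states):
    """Fold a sequence of replica states into one, in the order given."""
    result = GCounter()
    for state in states:
        result = result.merge(state)
    return result


a, b, c = GCounter(), GCounter(), GCounter()
a.increment("a", 2)
b.increment("b", 3)
c.increment("c", 1)

# Try every delivery order, each followed by a duplicate delivery of b.
values = {merge_all(order + (b,)).value() for order in permutations((a, b, c))}
```

Every order yields the same value, which is what convergence promises. As the notes stress, this is a statement about replica state only: it says nothing about what a downstream component does with a value it read before the replicas agreed.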
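
The notes above say that a confluent module behaves like a function from sets of inputs to sets of outputs, and that composing confluent components yields a confluent dataflow. The sketch below illustrates that idea under simplifying assumptions: each module is modeled as a plain function over a sequence of delivered messages, and `is_confluent_on` is a hypothetical brute-force check over one input, not a proof and not a tool from the talk.

```python
from itertools import permutations


def evens(messages):
    """Confluent: emits every even input, whatever order it arrived in."""
    return frozenset(m for m in messages if m % 2 == 0)


def squares(messages):
    """Confluent: emits the square of every input."""
    return frozenset(m * m for m in messages)


def first_seen(messages):
    """Not confluent: the output depends on which message arrived first."""
    return frozenset(list(messages)[:1])


def compose(f, g):
    """Wire the output of module f into module g."""
    return lambda messages: g(f(messages))


def is_confluent_on(module, inputs):
    """True if the module emits the same output set for every delivery order."""
    return len({module(order) for order in permutations(inputs)}) == 1


pipeline = compose(evens, squares)
inputs = [1, 2, 3, 4]

evens_ok = is_confluent_on(evens, inputs)
pipeline_ok = is_confluent_on(pipeline, inputs)
first_seen_ok = is_confluent_on(first_seen, inputs)
```

Enumerating delivery orders does not scale past tiny inputs. That is why the notes point to a constrained language, which can offer a conservative syntactic test for M instead, with M implying confluence.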