Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

683 views

Published on

Josef Widder, Vienna University of Technology

Published in: Science
  • Be the first to comment

  • Be the first to like this

Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

  1. 1. Model Checking of Fault-Tolerant Distributed Algorithms Part II: Modeling Fault-tolerant Distributed Algorithms Annu Gmeiner Igor Konnov Ulrich Schmid Helmut Veith Josef Widder TMPA 2014, Kostroma, Russia Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 1 / 64
  2. 2. Why Modeling and Verification? Let’s have a look at the recent technical report by amazon.com 1: We have found that the standard verification techniques in industry are necessary but not sufficient. We use deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing [. . . ], but we still find that subtle bugs can hide in complex concurrent fault-tolerant systems. . . . . .We have found that testing the code is inadequate as a method to find subtle errors in design, as the number of reachable states of the code is astronomical. 1C. Newcombe et al. Use of Formal Methods at Amazon Web Services, 2014 Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 2 / 64
  3. 3. Why Modeling and Verification? (cont.) . . . the final executable code is unambiguous, but contains an overwhelming amount of detail. We needed to be able to capture the essence of a design in a few hundred lines of precise description . Engineers naturally focus on designing the “happy case” for a system, i.e. the processing path in which no errors occur. . . . the shortest error trace exhibiting the bug contained 35 high level steps. The improbability of such compound events is not a defense against such bugs; historically, AWS has observed many combinations of events at least as complicated as those that could trigger this bug. Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 3 / 64
  4. 4. What are we doing? The approach we have: A high-level description of a design, which is precise. Sound verification method, as complete as possible. Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 4 / 64
  5. 5. What are we doing? The approach we have: A high-level description of a design, which is precise. Sound verification method, as complete as possible. More specifically: modeling approach suitable for model checking automatic parameterized verification method targeting at fault-tolerant distributed algorithms (We are not interested in verifying mathematical toy examples) Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 4 / 64
  6. 6. Why Model Checking? an alternative proof approach useful counter-examples ability to define and vary assumptions about the system and see why it breaks closer to code level good degree of automation Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 5 / 64
  7. 7. Distributed Algorithms: Model Checking Challenges unbounded data types unbounded number of rounds (round numbers part of messages) parameterization in multiple parameters among n processes f t are faulty with n 3t contrast to concurrent programs diverse fault models (adverse environments) continuous time fault-tolerant clock synchronization degrees of concurrency: synchronous, asynchronous partially synchronous a process makes at most 5 steps between 2 steps of any other process Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 6 / 64
  8. 8. Fault-tolerant distributed algorithms n n processes communicate by messages all processes know that at most t of them might be faulty f are actually faulty Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 7 / 64
  9. 9. Fault-tolerant distributed algorithms n ? ? ? t n processes communicate by messages all processes know that at most t of them might be faulty f are actually faulty Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 7 / 64
  10. 10. Fault-tolerant distributed algorithms n ? ? ? t f n processes communicate by messages all processes know that at most t of them might be faulty f are actually faulty Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 7 / 64
  11. 11. Challenge #1: fault models clean crashes: least severe faulty processes prematurely halt after/before “send to all” crash faults: faulty processes prematurely halt (also) in the middle of “send to all” omission faults: faulty processes follow the algorithm, but some messages sent by them might be lost symmetric faults: faulty processes send arbitrarily to all or nobody Byzantine faults: most severe faulty processes can do anything encompass all behaviors of above models Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 8 / 64
  12. 12. Challenges #2 #3: Pseudo-code and Communication Translate pseudo-code to a formal description that allows us to verify the algorithm and does not oversimplify the original algorithm. Assumptions about the communication medium are usually written in plain English, spread across research papers, constitute folklore knowledge. Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 9 / 64
  13. 13. Asynchronous Reliable Broadcast (Srikanth Toueg, 87) The core of the classic broadcast algorithm from the DA literature. It solves an agreement problem depending on the inputs vi . Variables of process i vi : f0 , 1g i n i t i a l l y 0 or 1 accepti : f0 , 1g i n i t i a l l y 0 An atomic step: i f vi = 1 then send ( echo ) to all ; i f received (echo) from at l e a s t t + 1 distinct proces ses and not sent ( echo ) before then send ( echo ) to all ; i f received ( echo ) from at l e a s t n - t distinct proces ses then accepti := 1 ; Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 10 / 64
  14. 14. Asynchronous Reliable Broadcast (Srikanth Toueg, 87) The core of the classic broadcast algorithm from the DA literature. It solves an agreement problem depending on the inputs vi . Variables of process i vi : f0 , 1g i n i t i a l l y 0 or 1 accepti : f0 , 1g i n i t i a l l y 0 An atomic step: i f vi = 1 then send ( echo ) to all ; i f received (echo) from at l e a s t t + 1 distinct proces ses and not sent ( echo ) before then send ( echo ) to all ; i f received ( echo ) from at l e a s t n - t distinct proces ses then accepti := 1 ; asynchronous t Byzantine faults correct if n 3t the code is parameterized in n and t ) process template P(n; t; f ) Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 10 / 64
  15. 15. Typical Structure of a Computation Step receive messages compute using messages and local variables (description in English with basic control flow if-then-else) send messages atomic Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 11 / 64
  16. 16. Typical Structure of a Computation Step receive messages compute using messages and local variables (description in English with basic control flow if-then-else) send messages atomic implicit pseudo-code Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA’14, Nov. 2014 11 / 64
  17. 17. Challenge #4: Parameterized Model Checking Parameterized model checking problem: given a process template P(n; t; f ), resilience condition RC : n 3t ^ t f 0, fairness constraints , e.g., “all messages will be delivered” and an LTL-X formula ' show for all n, t, and f satisfying RC (P(n; t; f ))n

×