What if only two properties have to be satisfied?
Every process has some initial value $v \in \{0, 1\}$ and has to decide irrevocably on some value in concordance with the ...
Wrap-up: Intro to FTDAs 
distributed systems 
replication and consistency 
synchronous vs. asynchronous 
fault models 
exa...
Our case study. . . 
Asynchronous FTDAs
In this lecture we consider methods for asynchronous FTDAs that either
solve problems that are less h... specific initial state
(Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
condition-based consensus: properties required only in runs from specific initial states (Mostefaoui et al., 2003) [Verified in Part II]
The Paxos idea: fault-tolerant distributed algorithms that are safe and make progress only if you are luck...
Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)
The core of the classic broadcast algorithm from the DA literature....
Assumptions from (Srikanth & Toueg, 87)
asynchronous interleaving
reliable message passing (no bounds on message delays) ...
The spec of our case study
Unforgeability. If $v_i$ = false for all correct processes $i$, then for all correct processes $j$ ...
Reliable broadcast vs. Consensus
Reliable broadcast: Completeness. If $v_i$ = true for all correct processes $i$, then there... specific initial state
Termination requires doing something in all runs (from all initial states)
weakening of spec makes reli...
Threshold-Guarded Distributed Algorithms 
Standard construct: quantified guards (t=f=0)
Existential Guard 
if received m from some process then ... 
Universal Guard 
if received m from all pr...
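The two guard forms above can be read as counting conditions on the set of processes from which message m has been received. The following is only a minimal sketch under that reading: the function names, the `senders` set, and the assumption that the truncated universal guard ranges over all n processes are illustrative and not taken from the slides.

```python
def threshold_guard(senders, threshold):
    """True once m has been received from at least `threshold` distinct processes."""
    return len(senders) >= threshold

def existential_guard(senders):
    """'if received m from some process': a threshold of 1."""
    return threshold_guard(senders, 1)

def universal_guard(senders, n):
    """'if received m from all processes' (assumed reading): a threshold of n."""
    return threshold_guard(senders, n)

# Hypothetical system with n = 4 processes and no faults (t = f = 0).
n = 4
senders_of_m = {"p1", "p3"}  # distinct processes that have sent m so far
some = existential_guard(senders_of_m)
everyone = universal_guard(senders_of_m, n)
```

With two of four senders, the existential guard is enabled while the universal guard is not; once faults are allowed, a universal guard may never become enabled because faulty processes need not send m.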
Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part1)

Josef Widder, Vienna University of Technology


  1. 1. Model Checking of Fault-Tolerant Distributed Algorithms
     Part I: Fault-Tolerant Distributed Algorithms
     Annu Gmeiner, Igor Konnov, Ulrich Schmid, Helmut Veith, Josef Widder
     TMPA 2014, Kostroma, Russia
     Josef Widder (www.forsyte.at) Checking Fault-Tolerant Distributed Algos TMPA'14, Nov. 2014 1 / 52
  2. 2. Distributed Systems
     - [Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips]
     - Are they always working?
  4. 4. No... some failing systems
     - Therac-25 (1985)
       - radiation therapy machine
       - gave massive overdoses, e.g., due to race conditions of concurrent tasks
     - Qantas Airbus in-flight upset, Learmonth (2008)
       - 1 out of 3 replicated components failed
       - computer initiated dangerous altitude drop
     - Ariane 501 maiden flight (1996)
       - primary/backup, i.e., 2 replicated computers
       - both ran into the same integer overflow
     - Netflix outages due to Amazon's cloud (ongoing)
       - one is not sure what is going on there
       - hundreds of computers involved
  5. 5. Why do they fail?
     - faults at design/implementation phase
       - approach: find and fix faults before operation ⇒ model checking
     - faults at runtime
       - outside of control of designer/developer
       - e.g., crack in a diode in the data link interface of the Space Shuttle ⇒ led to erroneous messages being sent (photo: Driscoll, Honeywell)
       - approach: keep system operational despite faults ⇒ fault-tolerant distributed algorithms
  13. 13. Bringing both together
     - Goal: automatically verified fault-tolerant distributed algorithms
     - model checking FTDAs is a research challenge:
       - computers run independently at different speeds
       - exchange messages with uncertain delays
       - faults
       - parameterization
       - ...
     - fault-tolerance makes model checking harder
  17. 17. Lecture overview
     - Part I: Fault-tolerant distributed algorithms
       - introduction to distributed algorithms
       - details of our case study algorithm
       - motivation why model checking is cool
     - Part II: Modeling fault-tolerant distributed algorithms
       - model checking challenges in distributed algorithms
       - Promela, control flow automata, etc.
       - model checking of small instances with Spin
  18. 18. Lecture overview cont.
     - We will have to skip due to time limitations:
       - Part III: Introduction into parameterized model checking
         - the general problem statement
         - well-known undecidability results
     - Igor Konnov:
       - Part IV: Parameterized model checking of FTDAs by abstraction
         - parametric interval abstraction (PIA)
         - PIA data and counter abstraction
         - counterexample-guided abstraction refinement (CEGAR)
       - Part V: Parameterized model checking of FTDAs by BMC
         - bounding the diameter
         - bounded model checking as a complete method
  20. 20. Part I: Fault-Tolerant Distributed Algorithms
  21. 21. Distributed Systems are everywhere
     - What they allow us to do
       - share resources
       - communicate
       - increase performance
         - speed
         - fault tolerance
     - Difference to centralized systems
       - independent activities (concurrency)
       - components do not have access to the global state (only local view)
  22. 22. Application areas
     - buzzwords from the 60s
       - operating systems
       - (distributed) database systems
       - communication networks
       - multiprocessor architectures
       - control systems
     - new buzzwords
       - cloud computing
       - social networks
       - multi-core
       - cyber-physical systems
  23. 23. Major challenge
     - Uncertainty
       - computers run independently at different speeds
       - exchange messages with (unknown) delays
       - faults
     - challenge in design of distributed algorithms
       - a process has access only to its local state
       - but one wants to achieve some global property
     - challenge in proving them correct
       - large degree of non-determinism ⇒ large execution and state space
  26. 26. From dependability to a distributed system
     - [Diagram: process P is replicated as P1, P2, P3; consistency makes the replicas behave like a single P]
     - Process P provides a service. We want to access it reliably, but P may fail.
     - canonical approach: replication, i.e., several copies of P
     - Due to non-determinism, the behavior of the copies might deviate (e.g., in a replicated database, transactions are committed in different orders at different sites)
     - ⇒ we have to enforce that the copies behave as one.
     - ⇒ Consistency in a distributed system: what does it mean to behave as one?
  29. 29. Replication: distributed systems
     - [Diagram: process P is replicated as P1, P2, P3]
  30. 30. Distributed message passing system
     - multiple distributed processes $p_i$
     - [Diagram: timeline of $p_i$ with send, receive, and internal steps; dots represent states]
     - a step of a process can be
       - a send step (a process sends messages to other processes)
       - a receive step (a process receives a subset of messages sent to it)
       - an internal step (a local computation)
     - steps are the atomic (indivisible) units of computation
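The three step types can be illustrated with a small sketch. Everything here is an assumption made for illustration and not part of the lecture: the `Process` class, the `network` list of pending messages, and an internal step that adds up received integers.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    pid: int
    state: int = 0                      # local state only; no global view
    inbox: list = field(default_factory=list)

def send_step(sender, receivers, payload, network):
    """Send step: put one message per receiver into the network."""
    for r in receivers:
        network.append((sender.pid, r.pid, payload))

def receive_step(proc, network, k):
    """Receive step: deliver a subset (here: up to k) of the messages sent to proc."""
    mine = [m for m in network if m[1] == proc.pid][:k]
    for m in mine:
        network.remove(m)
        proc.inbox.append(m)
    return len(mine)

def internal_step(proc):
    """Internal step: a local computation on the process's own state."""
    proc.state += sum(payload for (_, _, payload) in proc.inbox)
    proc.inbox.clear()

p1, p2, p3 = Process(1), Process(2), Process(3)
network = []
send_step(p1, [p2, p3], 5, network)   # p1 sends 5 to p2 and p3
receive_step(p2, network, k=1)        # p2 receives its message
internal_step(p2)                     # p2 updates its local state
```

After these three steps p2 has processed its message, while the message to p3 is still in transit, which is exactly the kind of partial view a process has to cope with.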
  31. 31. Types of Distributed Algorithms: Synchronous vs. Asynchronous
     - Synchronous
       - all processes move in lock-step
       - rounds
       - a message sent in a round is received in the same round
       - idealized view
       - impossible or expensive to implement
     - Asynchronous
       - only one process moves at a time
       - arbitrary interleavings of steps
       - a message sent is received eventually
       - important problems not solvable (Fischer et al., 1985)!
     - We focus on asynchronous algorithms here...
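To get a feeling for "arbitrary interleavings of steps", the sketch below enumerates the schedules in which processes can take their steps one at a time, and counts them with the multinomial formula $(k_1 + \dots + k_m)! / (k_1! \cdots k_m!)$. The function names are illustrative, and message delivery is ignored; only the order of steps is modeled.

```python
from itertools import permutations
from math import factorial

def interleavings(steps_per_process):
    """All schedules in which the given numbers of steps are taken one at a time."""
    labels = [pid for pid, k in enumerate(steps_per_process) for _ in range(k)]
    return sorted(set(permutations(labels)))

def count_interleavings(steps_per_process):
    """Multinomial coefficient: number of distinct schedules, without enumerating them."""
    total = factorial(sum(steps_per_process))
    for k in steps_per_process:
        total //= factorial(k)
    return total

runs = interleavings([2, 2])             # two processes, two steps each
small = count_interleavings([2, 2])      # 4! / (2! * 2!)
larger = count_interleavings([3, 3, 3])  # 9! / (3! * 3! * 3!)
```

Two processes with two steps each already admit 6 schedules, and three processes with three steps each admit 1680, whereas a lock-step synchronous execution has a single schedule of rounds.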
  34. 34. Asynchronous system
     - has very mild restrictions on the environment
       - interleaving semantics
       - unbounded message delays
     - very little can be done...
       - there is no distributed algorithm that solves consensus in the presence of one faulty process (as we will see, consensus is the paradigm of consistency)
       - folklore explanation: you cannot distinguish a slow process from a crashed one
       - real explanation: see the intricate proof by Fischer, Lynch, and Paterson (JACM 1985)
  35. 35. Where we stand
     - [Diagram: process P is replicated as P1, P2, P3]
  36. 36. What we still need...
     - [Diagram: replicas P1, P2, P3 are kept consistent so that they behave like a single P]
     - consistency requirements have been formalized under several names, e.g.,
       - consensus
       - atomic broadcast
       - Byzantine Generals problem
       - Byzantine agreement
       - atomic commitment
     - definitions are similar but may have subtle differences
     - We use the famous Byzantine Generals to introduce this problem domain...
  41. 41. Fault tolerance: The Byzantine generals problem
     - Wiktionary: Byzantine: adj. of a devious, usually stealthy manner, of practice.
  42. 42. Fault tolerance: The Byzantine generals problem
     - Lamport (this year's Turing laureate), Shostak, and Pease wrote in their Dijkstra Prize in Distributed Computing winning paper (Lamport et al., 1982): "[...] several divisions of the Byzantine army are camped outside an enemy city, each division commanded by its own general. [...] However, some of the generals may be traitors [...]"
       - if the divisions of loyal generals attack together, the city falls
       - if only some loyal generals attack, their armies fall
       - generals communicate by obedient messengers
     - The Byzantine generals problem:
       - the loyal generals have to agree on whether to attack
       - if all want to attack they must attack; if no-one wants to attack they must not attack
     - metaphor for a distributed system where correct processes (loyal generals) act as one in the presence of faulty processes (traitors)
  45. 45. Byzantine generals problem cont.
     - In the absence of faults it is trivial to solve:
       - send proposed plan (attack or not attack) to all
       - wait until received messages from everyone
       - if a process proposed attack, decide to attack
       - otherwise, decide to not attack
     - In the presence of faults it becomes tricky
       - if a process may crash, some processes may not receive messages from everyone (but some may)
       - if a process may send faulty messages, contradictory information may be received, e.g., A tells B that C told A that C wants to attack, while C tells B that C does not want to attack
     - Who is lying to whom?
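The fault-free rule above fits in a few lines. This is a minimal sketch, assuming each general's view is a dictionary from general name to proposal (`True` for attack); the names `decide`, `view_A`, and `view_B` are illustrative. The second half shows how a single two-faced sender breaks the rule.

```python
def decide(view):
    """Fault-free rule: attack if any received proposal says attack."""
    return any(view.values())

# No faults: every general receives the same proposals from everyone.
proposals = {"A": False, "B": False, "C": True}
decisions = {general: decide(proposals) for general in proposals}

# Faulty C is two-faced: it tells A "attack" but tells B "do not attack".
view_A = {"A": False, "B": False, "C": True}
view_B = {"A": False, "B": False, "C": False}
a_attacks = decide(view_A)
b_attacks = decide(view_B)
```

Without faults all generals reach the same decision because they evaluate the same view; with the two-faced sender, A attacks and B does not, so the simple rule no longer guarantees agreement.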
  48. 48. Fault-tolerant distributed algorithms
     - [Diagram: n processes, at most t of which may be faulty and f of which are actually faulty]
     - n processes communicate by messages (reliable communication)
     - all processes know that at most t of them might be faulty
     - f are actually faulty
     - resilience conditions, e.g., $n > 3t \land t \ge f \ge 0$
     - no masquerading: the processes know the origin of incoming messages
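A resilience condition is simply a predicate over the parameters n, t, and f. As a small sketch, the example condition $n > 3t \land t \ge f \ge 0$ can be checked as follows; the function name is illustrative.

```python
def resilience_holds(n, t, f):
    """Example resilience condition: n > 3t and t >= f >= 0."""
    return n > 3 * t and t >= f >= 0

ok = resilience_holds(4, 1, 1)             # 4 > 3 and 1 >= 1 >= 0
too_few = resilience_holds(3, 1, 1)        # 3 > 3 fails
too_many_faults = resilience_holds(4, 1, 2)  # f exceeds the assumed bound t
```

Note that processes know n and t but not f: the number of actual faults appears only in the condition on admissible runs, not in the algorithm's code.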
  51. 51. Fault models: abstractions of reality
     - clean crashes (least severe): faulty processes prematurely halt after/before "send to all"
     - crash faults: faulty processes prematurely halt (also) in the middle of "send to all"
     - omission faults: faulty processes follow the algorithm, but some messages sent by them might be lost
     - symmetric faults: faulty processes send arbitrarily to all or nobody
     - Byzantine faults (most severe): faulty processes can do anything
     - Byzantine faults encompass all behaviors of the above models
  52. 52. Fault models: the ugly truth
     - A photo of a Byzantine fault [photo by Driscoll (Honeywell)]
     - he reports Byzantine behavior on the Space Shuttle computer network
     - other sources of faults: bit-flips in memory, power outage, disconnection from the network, etc.
  53. 53. Model vs. reality: impossibilities
     - Hence, we would like the weakest assumptions possible. But there are theoretical limits on how weak assumptions can be made:
       - consensus is impossible in asynchronous systems if there may be a crash fault, i.e., t = 1 (Fischer et al., 1985)
       - consensus is possible in synchronous systems in the presence of Byzantine faults iff $n > 3t$ (Lamport et al., 1982)
       - consensus is impossible in (synchronous) round-based systems if $\lfloor n/2 \rfloor$ messages can be lost per round (Santoro & Widmayer, 1989)
       - fast Byzantine consensus is solvable iff $n > 5t$ (Martin & Alvisi, 2006)
       - 32 different degrees of synchrony, and whether consensus can be solved in the presence of how many faults, investigated in (Dolev et al., 1987)
     - arithmetic resilience conditions play a crucial role!
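The bounds $n > 3t$ and $n > 5t$ translate directly into arithmetic on system sizes. The helpers below are a sketch with illustrative names: for a bound of the form $n > c \cdot t$, they compute the largest tolerable t for a given n and the smallest n for a given t.

```python
def max_faults(n, c):
    """Largest t such that n > c * t."""
    return (n - 1) // c

def min_processes(t, c):
    """Smallest n such that n > c * t."""
    return c * t + 1

byzantine_n = min_processes(1, 3)       # synchronous Byzantine consensus, one fault
fast_byzantine_n = min_processes(1, 5)  # fast Byzantine consensus, one fault
t_of_10 = max_faults(10, 3)             # faults tolerated by 10 processes under n > 3t
t_fast_of_10 = max_faults(10, 5)        # the same system under n > 5t
```

Tolerating one Byzantine fault needs at least 4 processes under $n > 3t$ and at least 6 under $n > 5t$; a system of 10 processes tolerates 3 faults under the first bound but only 1 under the second.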
  55. 55. After this excursion to faults, let's go back to the problem of defining consistency (asynchronous systems).
  65. 65. Defining consistency: e.g., binary consensus. Every process has some initial value v ∈ {0, 1} and has to decide irrevocably on some value in concordance with the following properties:
    agreement. No two correct processes decide on different values. (Either all attack or no one does.)
    validity. If all correct processes have the same initial value v, then v is the only possible decision value. (The decision on whether to attack must be consistent with the will of at least one loyal general.)
    termination. Every correct process eventually decides. (At some point negotiations must be over.)
    The interplay of safety and liveness makes the problem hard...
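The three properties can be sketched as predicates over the outcome of one finished run. This is an illustration under stated assumptions, not part of the lecture: the dictionaries contain correct processes only, a decision of None means "has not decided", and termination (a liveness property) is only approximated by inspecting the final state of a finite run.

```python
from typing import Dict, Optional

Decisions = Dict[int, Optional[int]]  # process id -> decided value, or None


def agreement(decisions: Decisions) -> bool:
    """No two correct processes decide on different values."""
    decided = {d for d in decisions.values() if d is not None}
    return len(decided) <= 1


def validity(initial: Dict[int, int], decisions: Decisions) -> bool:
    """If all correct processes start with v, then only v may be decided."""
    starts = set(initial.values())
    if len(starts) != 1:
        return True  # premise does not hold, so nothing is required
    (v,) = starts
    return all(d is None or d == v for d in decisions.values())


def termination(decisions: Decisions) -> bool:
    """Every correct process has decided by the end of the (finite) run."""
    return all(d is not None for d in decisions.values())
```

Agreement and validity are safety properties, so a violation shows up in a finite prefix; termination can only be refuted on an infinite run, which is why the last predicate is an approximation.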
  67. 67. What if only two properties have to be satisfied? Every process has some initial value v ∈ {0, 1} and has to decide irrevocably on some value in concordance with the following properties:
    validity. If all correct processes have the same initial value v, then v is the only possible decision value.
    termination. Every correct process eventually decides.
    Give an algorithm that solves validity and termination!
    Solution: decide my own proposed value. (No need to agree.)
  71. 71. What if only two properties have to be satisfied? Every process has some initial value v ∈ {0, 1} and has to decide irrevocably on some value in concordance with the following properties:
    agreement. No two correct processes decide on different values.
    termination. Every correct process eventually decides.
    Give an algorithm that solves agreement and termination!
    Solution: decide 0. (No relation to initial values required.)
  75. 75. What if only two properties have to be satisfied? Every process has some initial value v ∈ {0, 1} and has to decide irrevocably on some value in concordance with the following properties:
    agreement. No two correct processes decide on different values.
    validity. If all correct processes have the same initial value v, then v is the only possible decision value.
    Give an algorithm that solves agreement and validity!
    Solution: do nothing. (Doing nothing is always safe.)
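The three trivial solutions can be written down directly. The sketch below assumes, purely for illustration, that all processes are correct and that a run is summarized by the list of decisions (None for "never decides"); the helper violated is hypothetical and reports which of the three consensus properties a given outcome breaks.

```python
from typing import List, Optional


def decide_own(initial: List[int]) -> List[Optional[int]]:
    """Validity + termination: every process decides its own initial value."""
    return list(initial)


def decide_zero(initial: List[int]) -> List[Optional[int]]:
    """Agreement + termination: every process decides 0."""
    return [0 for _ in initial]


def do_nothing(initial: List[int]) -> List[Optional[int]]:
    """Agreement + validity: no process ever decides."""
    return [None for _ in initial]


def violated(initial: List[int], decisions: List[Optional[int]]) -> List[str]:
    """Names of the consensus properties broken by this outcome."""
    bad = []
    decided = {d for d in decisions if d is not None}
    if len(decided) > 1:
        bad.append("agreement")
    if len(set(initial)) == 1 and decided - set(initial):
        bad.append("validity")
    if any(d is None for d in decisions):
        bad.append("termination")
    return bad
```

Each strategy gives up exactly the property it was allowed to ignore: violated([0, 1, 1], decide_own([0, 1, 1])) reports agreement, violated([1, 1, 1], decide_zero([1, 1, 1])) reports validity, and do_nothing always fails termination.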
  79. 79. Wrap-up: Intro to FTDAs
    distributed systems
    replication and consistency
    synchronous vs. asynchronous
    fault models
    example for an agreement problem: Byzantine Generals
  80. 80. Our case study...
  86. 86. Asynchronous FTDAs. In this lecture we consider methods for asynchronous FTDAs that either
    solve problems that are less hard than consensus:
      reliable broadcast. Termination required only for a specific initial state (Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
      condition-based consensus. Properties required only in runs from specific initial states (Mostefaoui et al., 2003). [Verified in Part II]
      the Paxos idea. Fault-tolerant distributed algorithms that are safe and make progress only if you are lucky (Lamport, 1998). [Serious challenge]
    or are asynchronous but use information on faults as a black box:
      failure-detector-based atomic commitment. Distributed databases (Raynal, 1997). [Challenge]
    We use the algorithm from (Srikanth & Toueg, 1987) as the running example.
  91. 91. Asynchronous Reliable Broadcast (Srikanth & Toueg, 87). The core of the classic broadcast algorithm from the DA literature.

    Variables of process i
      v_i: {0, 1} initially 0 or 1
      accept_i: {0, 1} initially 0

    An atomic step:
      if v_i = 1
      then send (echo) to all;

      if received (echo) from at least
         t + 1 distinct processes
         and not sent (echo) before
      then send (echo) to all;

      if received (echo) from at least
         n - t distinct processes
      then accept_i := 1;
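The pseudocode can be exercised with a small simulation. This is a sketch under stated assumptions, not the lecture's model: the function simulate and its parameters are hypothetical, only correct processes are simulated, the last f process ids are taken to be Byzantine, and the adversary is simplified to sending (echo) to a random subset of processes. The first rule is additionally guarded by the sent flag so that the run is finite.

```python
import random
from typing import List, Set, Tuple


def simulate(n: int, t: int, f: int, initial: List[int], seed: int = 0) -> List[int]:
    """Apply the echo rules until no message is pending; return accept_i per correct process."""
    assert n > 3 * t and t >= f >= 0, "resilience condition violated"
    assert len(initial) == n - f, "one initial value per correct process"
    rng = random.Random(seed)
    correct = range(n - f)        # assumption: ids 0 .. n-f-1 are correct
    byzantine = range(n - f, n)   # assumption: the last f ids are Byzantine
    sent = [False] * (n - f)
    received: List[Set[int]] = [set() for _ in correct]  # distinct echo senders
    accept = [0] * (n - f)
    in_flight: List[Tuple[int, int]] = []  # pending (sender, receiver) pairs

    def send_echo(p: int) -> None:
        # Messages addressed to Byzantine processes are not modelled.
        sent[p] = True
        in_flight.extend((p, q) for q in correct)

    def step(p: int) -> None:
        # One atomic step of a correct process, following the three rules.
        if initial[p] == 1 and not sent[p]:
            send_echo(p)
        if len(received[p]) >= t + 1 and not sent[p]:
            send_echo(p)
        if len(received[p]) >= n - t:
            accept[p] = 1

    # Simplified adversary: each Byzantine process echoes to a random subset.
    for b in byzantine:
        in_flight.extend((b, p) for p in correct if rng.random() < 0.5)
    for p in correct:
        step(p)
    while in_flight:
        # Asynchrony: pending messages are delivered in an arbitrary order.
        sender, receiver = in_flight.pop(rng.randrange(len(in_flight)))
        received[receiver].add(sender)
        step(receiver)
    return accept
```

With n = 4, t = 1, f = 1, calling simulate(4, 1, 1, [0, 0, 0]) leaves every accept flag at 0, because a single faulty echo never reaches the t + 1 threshold, while simulate(4, 1, 1, [1, 1, 1]) sets all flags to 1. Random testing of this kind only samples a few schedules; it does not replace the exhaustive analysis that model checking provides.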
  92. 92. Assumptions from (Srikanth & Toueg, 87)
    asynchronous interleaving
    reliable message passing (no bounds on message delays)
    at most t Byzantine faults
    resilience condition: n > 3t ∧ t ≥ f
  95. 95. The spec of our case study
    Unforgeability. If v_i = false for all correct processes i, then for all correct processes j, accept_j remains false forever. (If no loyal general wants to attack, then traitors should not be able to force one.)
    Completeness. If v_i = true for all correct processes i, then there is a correct process j that eventually sets accept_j to true. (If all loyal generals want to attack, there shall be an attack.)
    Relay. If a correct process i sets accept_i to true, then eventually all correct processes j set accept_j to true. (If one loyal general attacks, then all loyal generals should attack.)
    These are the specs as given in the literature; they can be formalized in LTL.
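One possible LTL reading of the three properties is sketched below. It assumes that the quantifiers range over correct processes only, that G means "globally" and F means "eventually", and that v_i keeps its initial value; the exact encoding used in the later parts of the lecture may differ.

```latex
\begin{align*}
\text{Unforgeability:}\quad
  & \big(\forall i.\; v_i = \mathit{false}\big) \rightarrow
    \mathbf{G}\,\big(\forall j.\; \mathit{accept}_j = \mathit{false}\big)\\
\text{Completeness:}\quad
  & \big(\forall i.\; v_i = \mathit{true}\big) \rightarrow
    \mathbf{F}\,\big(\exists j.\; \mathit{accept}_j = \mathit{true}\big)\\
\text{Relay:}\quad
  & \mathbf{G}\,\Big(\big(\exists i.\; \mathit{accept}_i = \mathit{true}\big) \rightarrow
    \mathbf{F}\,\big(\forall j.\; \mathit{accept}_j = \mathit{true}\big)\Big)
\end{align*}
```

Unforgeability is a safety property, whereas Completeness and Relay are liveness properties, so a verification method has to handle both kinds.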
  96. 96. Reliable broadcast vs. consensus
    Reliable broadcast: Completeness. If v_i = true for all correct processes i, then there is a correct process j that eventually sets accept_j to true.
    Consensus: Termination. Every correct process eventually decides.
    Difference:
      Completeness requires processes to do something only if ∀i : v_i = true, i.e., only for one specific initial state.
      Termination requires processes to do something in all runs (from all initial states).
    This weakening of the spec makes reliable broadcast solvable in asynchronous systems, while consensus is not solvable.
  103. 103. Threshold-Guarded Distributed Algorithms. Standard construct: quantified guards (t = f = 0)
    Existential guard: if received m from some process then ...
    Universal guard: if received m from all processes then ...
    What if faults might occur?
    Fault-tolerant algorithms: n processes, at most t are Byzantine
    Threshold guard: if received m from n
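The three kinds of guards can be sketched as predicates over the set of distinct senders from which message m has been received. The function names are hypothetical, and the thresholds t + 1 and n - t in the usage note are the ones that appear in the reliable broadcast pseudocode, not a completion of the truncated guard above.

```python
from typing import Set


def existential_guard(senders: Set[int]) -> bool:
    """Received m from some process."""
    return len(senders) >= 1


def universal_guard(senders: Set[int], n: int) -> bool:
    """Received m from all n processes."""
    return len(senders) >= n


def threshold_guard(senders: Set[int], threshold: int) -> bool:
    """Received m from at least `threshold` distinct processes."""
    return len(senders) >= threshold
```

With n = 4 and t = 1, a single Byzantine sender already satisfies existential_guard, and universal_guard may never become true if a faulty process stays silent. threshold_guard(senders, t + 1) ensures that at least one sender is correct, and threshold_guard(senders, n - t) can be satisfied by the correct processes alone.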
