Distributed Systems
(3rd Edition)
Chapter 08: Fault Tolerance
Version: March 20, 2022
Fault tolerance: Introduction to fault tolerance Basic concepts
Dependability
Basics
A component provides services to clients. To provide services, the
component may require the services from other components ⇒ a
component may depend on some other component.
Specifically
A component C depends on C∗ if the correctness of C’s behavior
depends on the correctness of C∗’s behavior. (Components are
processes or channels.)
Requirements related to dependability

Requirement      Description
Availability     Readiness for usage
Reliability      Continuity of service delivery
Safety           Very low probability of catastrophes
Maintainability  How easily a failed system can be repaired
Reliability versus availability
Reliability R(t) of component C
Conditional probability that C has been functioning correctly during
[0,t) given that C was functioning correctly at time t = 0.
Traditional metrics
► Mean Time To Failure (MTTF): The average time until a
component fails.
► Mean Time To Repair (MTTR): The average time needed to
repair a component.
► Mean Time Between Failures (MTBF): Simply MTTF + MTTR.
Availability A(t) of component C
Average fraction of time that C has been up-and-running in interval
[0,t).
► Long-term availability A: A(∞)
► Note: A = MTTF/MTBF = MTTF/(MTTF + MTTR)
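As a quick numeric illustration of this relation, here is a minimal sketch; the MTTF and MTTR values are made-up assumptions, not from the slides.

    def availability(mttf_hours: float, mttr_hours: float) -> float:
        """Long-term availability A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # Example: a component that fails on average every 1000 hours and
    # takes 2 hours to repair is up about 99.8% of the time.
    print(f"A = {availability(1000, 2):.4f}")   # A = 0.9980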
Observation
Reliability and availability make sense only if we have an accurate
notion of what a failure actually is.
Terminology
Failure, error, fault
Term     Description                                          Example
Failure  A component is not living up to its specifications   Crashed program
Error    Part of a component that can lead to a failure       Programming bug
Fault    Cause of an error                                    Sloppy programmer
Handling faults

Fault prevention: prevent the occurrence of a fault. Example: don't hire sloppy programmers.
Fault tolerance: build a component such that it can mask the occurrence of a fault. Example: build each component by two independent programmers.
Fault removal: reduce the presence, number, or seriousness of a fault. Example: get rid of sloppy programmers.
Fault forecasting: estimate current presence, future incidence, and consequences of faults. Example: estimate how a recruiter is doing when it comes to hiring sloppy programmers.
Fault tolerance: Introduction to fault tolerance Failure models
Failure models
Types of failures
Type                        Description of server's behavior
Crash failure               Halts, but is working correctly until it halts
Omission failure            Fails to respond to incoming requests
  Receive omission          Fails to receive incoming messages
  Send omission             Fails to send messages
Timing failure              Response lies outside a specified time interval
Response failure            Response is incorrect
  Value failure             The value of the response is wrong
  State-transition failure  Deviates from the correct flow of control
Arbitrary failure           May produce arbitrary responses at arbitrary times
Dependability versus security
Omission versus commission
Arbitrary failures are sometimes qualified as malicious. It is better to
make the following distinction:
► Omission failures: a component fails to take an action that it
should have taken
► Commission failure: a component takes an action that it should
not have taken
Observation
Note that deliberate failures, be they omission or commission failures, are typically security problems. Distinguishing between deliberate failures and unintentional ones is, in general, impossible.
Halting failures
Scenario
C no longer perceives any activity from C∗: a halting failure? Distinguishing between a crash and an omission/timing failure may be impossible.
Asynchronous versus synchronous systems
► Asynchronous system: no assumptions about process execution
speeds or message delivery times → cannot reliably detect crash
failures.
► Synchronous system: process execution speeds and message
delivery times are bounded → we can reliably detect omission
and timing failures.
► In practice we have partially synchronous systems: most of the
time, we can assume the system to be synchronous, yet there is
no bound on the time that a system is asynchronous → can
normally reliably detect crash failures.
Assumptions we can make
Halting type    Description
Fail-stop       Crash failures, but reliably detectable
Fail-noisy      Crash failures, eventually reliably detectable
Fail-silent     Omission or crash failures: clients cannot tell what went wrong
Fail-safe       Arbitrary, yet benign failures (i.e., they cannot do any harm)
Fail-arbitrary  Arbitrary, with malicious failures
Fault tolerance: Introduction to fault tolerance Failure masking by redundancy
Redundancy for failure masking
Types of redundancy
► Information redundancy: Add extra bits to data units so that errors
can be recovered when bits are garbled.
► Time redundancy: Design a system such that an action can be
performed again if anything went wrong. Typically used when
faults are transient or intermittent.
► Physical redundancy: add equipment or processes in order to
allow one or more components to fail. This type is extensively
used in distributed systems.
Fault tolerance: Process resilience Resilience by process groups
Process resilience
Basic idea
Protect against malfunctioning processes through process replication,
organizing multiple processes into a process group. Distinguish
between flat groups and hierarchical groups.
[Figure: group organization: a flat group versus a hierarchical group with a coordinator and workers.]
Fault tolerance: Process resilience Failure masking and replication
Groups and failure masking
k-fault tolerant group
When a group can mask any k concurrent member failures (k is called
the degree of fault tolerance).
How large does a k-fault tolerant group need to be?
► With halting failures (crash/omission/timing failures): we need a
total of k + 1 members as no member will produce an incorrect
result, so the result of one member is good enough.
► With arbitrary failures: we need 2k + 1 members so that the
correct result can be obtained through a majority vote.
Important assumptions
► All members are identical
► All members process commands in the same order
Result: We can now be sure that all processes do exactly the same
thing.
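A minimal sketch of these two sizing rules; the function and model names are mine, only the formulas come from the slides.

    def required_group_size(k: int, failure_model: str) -> int:
        """Members needed to mask k concurrent member failures."""
        if failure_model == "halting":      # crash/omission/timing: one correct answer suffices
            return k + 1
        if failure_model == "arbitrary":    # need a majority of correct answers
            return 2 * k + 1
        raise ValueError(f"unknown failure model: {failure_model}")

    print(required_group_size(2, "halting"))    # 3
    print(required_group_size(2, "arbitrary"))  # 5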
Fault tolerance: Process resilience Consensus in faulty systems with crash failures
Consensus
Prerequisite
In a fault-tolerant process group, each nonfaulty process executes the
same commands, and in the same order, as every other nonfaulty
process.
Reformulation
Nonfaulty group members need to reach consensus on which
command to execute next.
Flooding-based consensus
System model
► A process group P = {P1,...,Pn}
► Fail-stop failure semantics, i.e., with reliable failure detection
► A client contacts a Pi requesting it to execute a command
► Every Pi maintains a list of proposed commands
Basic algorithm (based on rounds)
1. In round r, Pi multicasts its known set of commands C_r^i to all others.
2. At the end of round r, each Pi merges all received commands into a new C_{r+1}^i.
3. The next command cmd_i is selected through a globally shared, deterministic function: cmd_i ← select(C_{r+1}^i).
Flooding-based consensus: Example
[Figure: flooding-based consensus with processes P1–P4; P1 crashes partway through its multicast, so only P2 receives commands from all processes and can decide.]
Observations
► P2 received all proposed commands from all other processes ⇒
makes decision.
► P3 may have detected that P1 crashed, but does not know if P2
received anything, i.e., P3 cannot know if it has the same
information as P2 ⇒ cannot make decision (same for P4).
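A minimal sketch of one flooding round under the fail-stop model above; the channel abstraction (multicast, and receive_from returning None for a crashed sender) is an assumption of the sketch, not part of the slides.

    def flooding_round(my_commands, channel, group):
        """One round: flood known commands, merge everything received."""
        channel.multicast(group, my_commands)            # step 1: send C_r^i to all others
        merged = set(my_commands)
        for sender in group:
            received = channel.receive_from(sender)      # None if sender crashed
            if received is not None:
                merged |= set(received)                  # step 2: build C_{r+1}^i
        return merged

    def select(commands):
        """Step 3: globally shared, deterministic choice, e.g., the smallest command."""
        return min(commands)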
Fault tolerance: Process resilience Example: Paxos
Realistic consensus: Paxos
Assumptions (rather weak ones, and realistic)
► A partially synchronous system (in fact, it may even be
asynchronous).
► Communication between processes may be unreliable: messages
may be lost, duplicated, or reordered.
► Corrupted messages can be detected (and thus subsequently
ignored).
► All operations are deterministic: once an execution is started, it is
known exactly what it will do.
► Processes may exhibit crash failures, but not arbitrary failures.
► Processes do not collude.
Understanding Paxos
We will build up Paxos from scratch to understand where many
consensus algorithms actually come from.
Paxos essentials
Starting point
► We assume a client-server configuration, with initially one primary
server.
► To make the server more robust, we start with adding a backup
server.
► To ensure that all commands are executed in the same order at
both servers, the primary assigns unique sequence numbers to
all commands. In Paxos, the primary is called the leader.
► Assume that actual commands can always be restored (either
from clients or servers) ⇒ we consider only control messages.
Two-server situation
[Figure: a two-server situation. Clients C1 and C2 send operations o1 and o2 to leader S1, which assigns sequence numbers (Seq, o2, 1) and (Seq, o1, 2); both servers execute o2 before o1 and end in the same state.]
Handling lost messages
Some Paxos terminology
► The leader sends an accept message ACCEPT(o,t) to backups
when assigning a timestamp t to command o.
► A backup responds by sending a learn message: LEARN(o,t)
► When the leader notices that operation o has not yet been
learned, it retransmits ACCEPT(o,t) with the original timestamp.
Two servers and one crash: problem
[Figure: leader S1 executes o1 but crashes before backup S2 receives ACCEPT(o1,1); S2 then takes over and accepts o2 with timestamp 1, so the servers diverge.]
Problem
Primary crashes after executing an operation, but the backup never
received the accept message.
Two servers and one crash: solution
[Figure: leader S1 sends ACCEPT(o1,1) and executes o1 only after receiving LEARN(o1) from S2; when S1 later crashes, S2 takes over and continues with ACCEPT(o2,2).]
Solution
Never execute an operation before it is clear that it has been learned.
Three servers and two crashes: still a problem?
[Figure: three servers; leader S1 sends ACCEPT(o1,1), S2 replies with LEARN(o1) and o1 is executed, but S3 misses the accept; after the crashes, S2 as new leader sends ACCEPT(o2,1).]
Scenario
What happens when LEARN(o1) as sent by S2 to S1 is lost?
Solution
S2 will also have to wait until it knows that S3 has learned o1.
Paxos: fundamental rule
General rule
In Paxos, a server S cannot execute an operation o until it has
received a LEARN(o) from all other nonfaulty servers.
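A minimal sketch of the bookkeeping this rule requires; the class layout and method names are illustrative assumptions, not Paxos as specified.

    class PaxosServer:
        def __init__(self, my_id, peers):
            self.my_id = my_id
            self.peers = set(peers)          # the other servers
            self.learned = {}                # operation -> set of peers that sent LEARN

        def on_learn(self, op, sender):
            self.learned.setdefault(op, set()).add(sender)

        def may_execute(self, op, nonfaulty_peers):
            # Fundamental rule: execute o only after LEARN(o) has arrived
            # from all other nonfaulty servers.
            return nonfaulty_peers <= self.learned.get(op, set())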
Failure detection
Practice
Reliable failure detection is practically impossible. A solution is to set
timeouts, but take into account that a detected failure may be false.
[Figure: timeout-based failure detection; S2 falsely suspects leader S1 and also starts acting as leader, so accept messages for o1 and o2 are issued while S1 is still active.]
Required number of servers
Observation
Paxos needs at least three servers
Adapted fundamental rule
In Paxos with three servers, a server S cannot execute an operation o
until it has received at least one (other) LEARN(o) message, so that it
knows that a majority of servers will execute o.
Assumptions before taking the next steps
► Initially, S1 is the leader.
► A server can reliably detect it has missed a message, and recover
from that miss.
► When a new leader needs to be elected, the remaining servers
follow a strictly deterministic algorithm, such as S1 → S2 → S3.
► A client cannot be asked to help the servers to resolve a situation.
Observation
If either one of the backups (S2 or S3 ) crashes, Paxos will behave
correctly: operations at nonfaulty servers are executed in the same
order.
Leader crashes after executing o1
S3 is completely ignorant of any activity by S1
► S2 received ACCEPT(o1,1), detects the crash, and becomes leader.
► S3 never even received ACCEPT(o1,1).
► If S2 sends ACCEPT(o2,2) ⇒ S3 sees an unexpected timestamp and tells S2 that it missed o1.
► S2 retransmits ACCEPT(o1,1), allowing S3 to catch up.
S2 missed ACCEPT(o1,1)
► S2 did detect the crash and became the new leader.
► If S2 sends ACCEPT(o1,1) ⇒ S3 retransmits LEARN(o1).
► If S2 sends ACCEPT(o2,1) ⇒ S3 tells S2 that it apparently missed ACCEPT(o1,1) from S1, so that S2 can catch up.
Leader crashes after sending ACCEPT(o1,1)
S3 is completely ignorant of any activity by S1
As soon as S2 announces that o2 is to be accepted, S3 will notice that
it missed an operation and can ask S2 to help recover.
S2 had missed ACCEPT(o1,1)
As soon as S2 proposes an operation, it will be using a stale
timestamp, allowing S3 to tell S2 that it missed operation o1.
Observation
Paxos (with three servers) behaves correctly when a single server crashes, regardless of when that crash took place.
False crash detections
[Figure: a false crash detection; while leader S1 sends ACCEPT(o1,1), S2 wrongly suspects S1 has crashed, becomes leader, and sends ACCEPT(o2,1); S1 drops leadership, and S3, having learned o2, is confused by the delayed accept for o1.]
Problem and solution
S3 receives ACCEPT(o1,1), but much later than ACCEPT(o2,1). If it knew who the current leader was, it could safely reject the delayed accept message ⇒ leaders should include their ID in messages.
But what about progress?
[Figure: an explicit leadership change; as leader, S1 gets o1 accepted and learned (ACCEPT(S1,o1,1), LEARN(o1)); S2 then acts as leader and gets o2 accepted (ACCEPT(S2,o2,1), LEARN(o2)).]
Essence of solution
When S2 takes over, it needs to make sure that any outstanding operations initiated by S1 have been properly flushed, i.e., executed by enough servers. This requires an explicit leadership takeover by which the other servers are informed before sending out new accept messages.
Fault tolerance: Process resilience Consensus in faulty systems with arbitrary failures
Consensus under arbitrary failure semantics
Essence
We consider process groups in which communication between processes is inconsistent: (a) improper forwarding of messages, or (b) telling different things to different processes.
[Figure: inconsistent communication among P1, P2, and P3: (a) a process improperly forwards a received value (passing on b instead of a); (b) a process tells different things (a and b) to different processes.]
System model
► We consider a primary P and n−1 backups B1,...,Bn−1.
► A client sends v ∈ {T,F} to P.
► Messages may be lost, but this can be detected.
► Messages cannot be corrupted beyond detection.
► A receiver of a message can reliably detect its sender.
Byzantine agreement: requirements
BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty, then every nonfaulty backup process stores exactly what the primary sent.
Observation
► Primary faulty ⇒ BA1 says that the backups may store the same, but different (and thus wrong) value than the one originally sent by the client.
► Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.
Why having 3k processes is not enough
[Figure: with n = 3 and k = 1, a faulty primary P sends T to B1 and F to B2; after the backups exchange what they received, each is left with the set {T,F} and cannot decide which value to store.]
Why having 3k+ 1 processes is enough
[Figure: with n = 4 and k = 1, a faulty process can no longer prevent agreement; after two message rounds each nonfaulty backup holds a vote set such as {T,{T,F}}, and majority voting yields the same value at every nonfaulty backup.]
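To make the majority-vote argument concrete, here is a minimal sketch for k = 1 and n = 4; the function name and message plumbing are my own, only the two rounds and the majority rule come from the slides.

    from collections import Counter

    def decide(received_from_primary, echoed_by_others):
        """A backup's view after two rounds: the value it got from the
        primary directly, plus what the other backups claim they got.
        Majority voting over this set masks one faulty process."""
        votes = [received_from_primary] + list(echoed_by_others)
        value, _ = Counter(votes).most_common(1)[0]
        return value

    # Faulty primary sends T, T, F to backups B1, B2, B3; honest backups
    # echo what they received, so every backup votes over {T, T, F}:
    print(decide("T", ["T", "F"]))  # B1's view -> 'T'
    print(decide("T", ["T", "F"]))  # B2's view -> 'T'
    print(decide("F", ["T", "T"]))  # B3's view -> 'T' (BA1 holds)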
Fault tolerance: Process resilience Some limitations on realizing fault tolerance
Realizing fault tolerance
Observation
Considering that the members of a fault-tolerant process group are so tightly coupled, we may run into considerable performance problems, and perhaps even into situations in which realizing fault tolerance is impossible.
Question
Are there limitations to what can be readily achieved?
► What is needed to enable reaching consensus?
► What happens when groups are partitioned?
Distributed consensus: when can it be reached
[Figure: the circumstances under which distributed consensus can be reached, as a function of process behavior (synchronous vs. asynchronous), communication delay (bounded vs. unbounded), message ordering (ordered vs. unordered), and message transmission (unicast vs. multicast).]
Formal requirements for consensus
► Processes produce the same output value
► Every output value must be valid
► Every process must eventually provide output
Consistency, availability, and partitioning
CAP theorem
Any networked system providing shared data can provide only two of
the following three properties:
C: consistency, by which a shared and replicated data item appears
as a single, up-to-date copy
A: availability, by which updates will always be eventually executed
P: partition tolerance, by which the system tolerates the partitioning of the process group.
Conclusion
In a network subject to communication failures, it is impossible to
realize an atomic read/write shared memory that guarantees a
response to every request.
CAP theorem intuition
Simple situation: two interacting processes
► P and Q can no longer communicate:
  ► Allow P and Q to go ahead ⇒ no consistency
  ► Allow only one of P, Q to go ahead ⇒ no availability
► P and Q have to be assumed to continue communication ⇒ no partitioning allowed.
Fundamental question
What are the practical ramifications of the CAP theorem?
Fault tolerance: Process resilience Failure detection
Failure detection
Issue
How can we reliably detect that a process has actually crashed?
General model
► Each process is equipped with a failure detection module
► A process P probes another process Q for a reaction
► If Q reacts: Q is considered to be alive (by P)
► If Q does not react within t time units: Q is suspected to have
crashed
Observation for a synchronous system
a suspected crash ≡ a known crash
Practical failure detection
Implementation
► If P did not receive heartbeat from Q within time t: P suspects Q.
► If Q later sends a message (which is received by P):
► P stops suspecting Q
► P increases the timeout value t
► Note: if Q did crash, P will keep suspecting Q.
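A minimal sketch of this heartbeat scheme; the class layout, the use of a monotonic clock, and doubling as the concrete way of "increasing the timeout value" are my assumptions.

    import time

    class FailureDetector:
        def __init__(self, initial_timeout: float):
            self.timeout = initial_timeout
            self.last_heard = {}             # process id -> time of last message
            self.suspected = set()

        def on_message(self, q: str) -> None:
            """Any message from Q counts as a sign of life."""
            if q in self.suspected:
                self.suspected.discard(q)    # stop suspecting Q ...
                self.timeout *= 2            # ... and be more patient next time
            self.last_heard[q] = time.monotonic()

        def check(self, q: str) -> bool:
            """True if Q is currently suspected to have crashed."""
            if time.monotonic() - self.last_heard.get(q, 0.0) > self.timeout:
                self.suspected.add(q)        # if Q really crashed, this sticks
            return q in self.suspected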
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures
Reliable remote procedure calls
What can go wrong?
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Two “easy” solutions
1: (cannot locate server): just report back to client
2: (request was lost): just resend message
Reliable RPC: server crash
[Figure: a server in client-server communication: (a) the normal case (receive, execute, reply), (b) crash after execution (receive, execute, crash: no reply), (c) crash before execution (receive, crash: no reply).]
Problem
Where (a) is the normal case, situations (b) and (c) require different
solutions. However, we don’t know what happened. Two approaches:
► At-least-once semantics: The server guarantees it will carry out
an operation at least once, no matter what.
► At-most-once semantics: The server guarantees it will carry out
an operation at most once.
Why fully transparent server recovery is impossible
Three types of events at the server
(Assume the server is requested to update a document.)
M: send the completion message
P: complete the processing of the document
C: crash
Six possible orderings
(Actions between brackets never take place)
1. M → P → C: Crash after reporting completion.
2. M → C(→ P): Crash after reporting completion, but before the update.
3. P → M → C: Crash after reporting completion, and after the update.
4. P → C(→ M): Update took place, and then a crash.
5. C(→ P → M): Crash before doing anything.
6. C(→ M → P): Crash before doing anything.
Reissue strategy

Client reissue strategy     Strategy M → P           Strategy P → M
                            MPC    MC(P)   C(MP)     PMC    PC(M)   C(PM)
Always                      DUP    OK      OK        DUP    DUP     OK
Never                       OK     ZERO    ZERO      OK     OK      ZERO
Only when ACKed             DUP    OK      ZERO      DUP    OK      ZERO
Only when not ACKed         OK     ZERO    OK        OK     DUP     OK

OK = document updated once; DUP = document updated twice; ZERO = document not updated at all
Reliable RPC: lost reply messages
The real issue
What the client notices is that it is not getting an answer. However, it cannot decide whether this is caused by a lost request, a crashed server, or a lost response.
Partial solution
Design the server such that its operations are idempotent: repeating
the same operation is the same as carrying it out exactly once:
► pure read operations
► strict overwrite operations
Many operations are inherently nonidempotent, such as many banking
transactions.
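A tiny sketch of the distinction, using an assumed bank-account record: a strict overwrite can be retried safely, while a deposit cannot.

    account = {"balance": 100}

    def set_balance(amount):
        """Idempotent: repeating the call leaves the same result."""
        account["balance"] = amount

    def deposit(amount):
        """Nonidempotent: a retried duplicate request deposits twice."""
        account["balance"] += amount

    set_balance(120); set_balance(120)   # balance is 120 either way
    deposit(20); deposit(20)             # a duplicated retry adds 40, not 20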
Reliable RPC: client crash
Problem
The server is doing work and holding resources for nothing (called
doing an orphan computation).
Solution
► Orphan is killed (or rolled back) by the client when it recovers
► Client broadcasts new epoch number when recovering ⇒ server
kills client’s orphans
► Require computations to complete within T time units. Old ones are
simply removed.
Fault tolerance: Reliable group communication
Simple reliable group communication
Intuition
A message sent to a process group G should be delivered to each
member of G. Important: make distinction between receiving and
delivering messages.
[Figure: the distinction between message reception (by a message-handling component) and message delivery (to the application), with group-membership functionality in between, at the sender and each recipient, on top of the local OS and network.]
Less simple reliable group communication
Reliable communication in the presence of faulty
processes
Group communication is reliable when it can be guaranteed that a
message is received and subsequently delivered by all nonfaulty group
members.
Tricky part
Agreement is needed on what the group actually looks like before a
received message can be delivered.
Simple reliable group communication
Reliable communication, but assume nonfaulty processes
Reliable group communication now boils down to reliable multicasting: is a message received and delivered to each recipient, as intended by the sender?
[Figure: an ACK-based scheme: the sender keeps message #25 in a history buffer; receivers that are up to date acknowledge #25, while a receiver that missed #24 reports "Missed 24" so the sender can retransmit it.]
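A minimal sketch of the receiver side of this scheme; the channel API and the application upcall are assumptions, only the sequence-number bookkeeping, the ACK, and the "missed" report come from the figure.

    class ReliableReceiver:
        def __init__(self, channel, sender, deliver):
            self.channel = channel
            self.sender = sender
            self.deliver = deliver           # upcall into the application
            self.last_delivered = 0          # highest sequence number delivered

        def on_receive(self, seq, msg):
            if seq == self.last_delivered + 1:
                self.deliver(msg)            # in order: deliver and acknowledge
                self.last_delivered = seq
                self.channel.send(self.sender, ("ACK", seq))
            elif seq > self.last_delivered + 1:
                # Gap detected: ask the sender to retransmit the first missing one.
                self.channel.send(self.sender, ("MISSED", self.last_delivered + 1))
            # seq <= last_delivered: duplicate of an old message, ignore it.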
Fault tolerance: Distributed commit
Distributed commit protocols
Problem
Have an operation being performed by each member of a process
group, or none at all.
► Reliable multicasting: a message is to be delivered to all
recipients.
► Distributed transaction: each local transaction must succeed.
Two-phase commit protocol (2PC)
Essence
The client who initiated the computation acts as coordinator; processes
required to commit are the participants.
► Phase 1a: Coordinator sends VOTE-REQUEST to participants
(also called a pre-write)
► Phase 1b: When participant receives VOTE-REQUEST it returns
either VOTE-COMMIT or VOTE-ABORT to coordinator. If it sends
VOTE-ABORT, it aborts its local computation
► Phase 2a: Coordinator collects all votes; if all are VOTE-COMMIT,
it sends GLOBAL-COMMIT to all participants, otherwise it sends
GLOBAL-ABORT
► Phase 2b: Each participant waits for GLOBAL-COMMIT or
GLOBAL-ABORT and handles accordingly.
2PC - Finite state machines
[Figure: the finite state machines for 2PC.]
Coordinator: from INIT, on Commit it sends Vote-request and moves to WAIT; in WAIT, on any Vote-abort it sends Global-abort and moves to ABORT, and once all Vote-commit messages are in it sends Global-commit and moves to COMMIT.
Participant: from INIT, on Vote-request it answers Vote-commit and moves to READY, or answers Vote-abort and moves to ABORT; in READY, on Global-abort it sends ACK and moves to ABORT, and on Global-commit it sends ACK and moves to COMMIT.
2PC – Failing participant
Analysis: participant crashes in state S, and recovers to S
► INIT: No problem: participant was unaware of protocol
► READY: Participant is waiting to either commit or abort. After
recovery, participant needs to know which state transition it
should make ⇒ log the coordinator’s decision
► ABORT: Merely make entry into the abort state idempotent, e.g., removing the workspace of results.
► COMMIT: Also make entry into the commit state idempotent, e.g., copying the workspace to storage.
Observation
When distributed commit is required, having participants use
temporary workspaces to keep their results allows for simple recovery
in the presence of failures.
Alternative
When a recovery is needed to READY state, check state of other
participants ⇒ no need to log coordinator’s decision.
Recovering participant P contacts another participant Q

State of Q   Action by P
COMMIT       Make transition to COMMIT
ABORT        Make transition to ABORT
INIT         Make transition to ABORT
READY        Contact another participant
Result
If all participants are in the READY state, the protocol blocks.
Apparently, the coordinator is failing. Note: The protocol prescribes
that we need the decision from the coordinator.
2PC – Failing coordinator
Observation
The real problem lies in the fact that the coordinator’s final decision
may not be available for some time (or actually lost).
Alternative
Let a participant P in the READY state timeout when it hasn’t received
the coordinator’s decision; P tries to find out what other participants
know (as discussed).
Observation
Essence of the problem is that a recovering participant cannot make a
local decision: it is dependent on other (possibly failed) processes
Coordinator in Python
class Coordinator:

    def run(self):
        yetToReceive = list(participants)
        self.log.info('WAIT')
        self.chan.sendTo(participants, VOTE_REQUEST)
        while len(yetToReceive) > 0:
            msg = self.chan.recvFrom(participants, TIMEOUT)
            if (not msg) or (msg[1] == VOTE_ABORT):
                self.log.info('ABORT')
                self.chan.sendTo(participants, GLOBAL_ABORT)
                return
            else:  # msg[1] == VOTE_COMMIT
                yetToReceive.remove(msg[0])
        self.log.info('COMMIT')
        self.chan.sendTo(participants, GLOBAL_COMMIT)
Participant in Python
class Participant:
    def run(self):
        msg = self.chan.recvFrom(coordinator, TIMEOUT)
        if (not msg):  # Crashed coordinator - give up entirely
            decision = LOCAL_ABORT
        else:  # Coordinator will have sent VOTE_REQUEST
            decision = self.do_work()
            if decision == LOCAL_ABORT:
                self.chan.sendTo(coordinator, VOTE_ABORT)
            else:  # Ready to commit, enter READY state
                self.chan.sendTo(coordinator, VOTE_COMMIT)
                msg = self.chan.recvFrom(coordinator, TIMEOUT)
                if (not msg):  # Crashed coordinator - check the others
                    self.chan.sendTo(all_participants, NEED_DECISION)
                    while True:
                        msg = self.chan.recvFromAny()
                        if msg[1] in [GLOBAL_COMMIT, GLOBAL_ABORT, LOCAL_ABORT]:
                            decision = msg[1]
                            break
                else:  # Coordinator came to a decision
                    decision = msg[1]

        while True:  # Help any other participant when coordinator crashed
            msg = self.chan.recvFrom(all_participants)
            if msg[1] == NEED_DECISION:
                self.chan.sendTo([msg[0]], decision)
Fault tolerance: Recovery Introduction
Recovery: Background
Essence
When a failure occurs, we need to bring the system into an error-free
state:
► Forward error recovery: Find a new state from which the system
can continue operation
► Backward error recovery: Bring the system back into a previous
error-free state
Practice
Use backward error recovery, requiring that we establish recovery
points
Observation
Recovery in distributed systems is complicated by the fact that
processes need to cooperate in identifying a consistent state from
where to recover
Fault tolerance: Recovery Checkpointing
Consistent recovery state
Requirement
Every message that has been received is also shown to have been
sent in the state of the sender.
Recovery line
Assuming processes regularly checkpoint their state, the most recent
consistent global checkpoint.
[Figure: checkpoints of P1 and P2 over time, from the initial state to a failure; the recovery line is the most recent consistent collection of checkpoints, whereas a collection in which a message sent from P2 to P1 is recorded as received but not as sent is inconsistent.]
Coordinated checkpointing
Essence
Each process takes a checkpoint after a globally coordinated action.
Simple solution
Use a two-phase blocking protocol:
► A coordinator multicasts a checkpoint request message
► When a participant receives such a message, it takes a
checkpoint, stops sending (application) messages, and reports
back that it has taken a checkpoint
► When all checkpoints have been confirmed at the coordinator, the
latter broadcasts a checkpoint done message to allow all
processes to continue
Observation
It is possible to consider only those processes that depend on the
recovery of the coordinator, and ignore the rest
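A minimal sketch of the two-phase blocking protocol just described; the channel primitives and message names are illustrative assumptions.

    def coordinated_checkpoint(channel, participants):
        # Phase 1: ask every participant to take a checkpoint.
        channel.multicast(participants, "CHECKPOINT_REQUEST")
        confirmed = set()
        while confirmed != set(participants):
            sender, msg = channel.receive()
            if msg == "CHECKPOINT_TAKEN":   # participant checkpointed and has
                confirmed.add(sender)       # stopped sending application messages
        # Phase 2: all checkpoints confirmed; let processes continue.
        channel.multicast(participants, "CHECKPOINT_DONE")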
Cascaded rollback
Observation
If checkpointing is done at the “wrong” instants, the recovery line may
lie at system startup time. We have a so-called cascaded rollback.
[Figure: a cascaded rollback: because of messages m and m*, the checkpoints of P1 and P2 never form a consistent collection, so recovery rolls both processes back toward the initial state.]
Independent checkpointing
Essence
Each process independently takes checkpoints, with the risk of a cascaded rollback to system startup.
► Let CPi(m) denote the mth checkpoint of process Pi and INTi(m) the interval between CPi(m−1) and CPi(m).
► When process Pi sends a message in interval INTi(m), it piggybacks (i, m).
► When process Pj receives a message in interval INTj(n), it records the dependency INTi(m) → INTj(n).
► The dependency INTi(m) → INTj(n) is saved to storage when taking checkpoint CPj(n).
Observation
If process Pi rolls back to CPi(m−1), Pj must roll back to CPj(n−1).
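A minimal sketch of the piggybacking and dependency recording just described; the channel and stable-storage interfaces are assumptions of the sketch.

    class Process:
        def __init__(self, pid):
            self.pid = pid
            self.interval = 1            # current checkpoint interval index m
            self.deps = []               # recorded dependencies INT_i(m) -> INT_j(n)

        def send(self, channel, dest, payload):
            # Piggyback (i, m) on every message sent in interval INT_i(m).
            channel.send(dest, (payload, (self.pid, self.interval)))

        def on_receive(self, message):
            payload, (i, m) = message
            # Record the dependency INT_i(m) -> INT_j(n) for the current interval n.
            self.deps.append(((i, m), (self.pid, self.interval)))
            return payload

        def take_checkpoint(self, stable_storage):
            # Save the dependencies together with checkpoint CP_j(n).
            stable_storage.save(self.pid, self.interval, self.deps)
            self.interval += 1
            self.deps = []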
Fault tolerance: Recovery Message logging
Message logging
Alternative
Instead of taking an (expensive) checkpoint, try to replay your
(communication) behavior from the most recent checkpoint ⇒ store
messages in a log.
Assumption
We assume a piecewise deterministic execution model:
► The execution of each process can be considered as a sequence
of state intervals
► Each state interval starts with a nondeterministic event (e.g.,
message receipt)
► Execution in a state interval is deterministic
Conclusion
If we record nondeterministic events (to replay them later), we obtain a
deterministic execution model that will allow us to do a complete replay.
Message logging and consistency
When should we actually log messages?
Avoid orphan processes:
► Process Q has just received and delivered messages m1 and m2.
► Assume that m2 is never logged.
► After delivering m1 and m2, Q sends message m3 to process R.
► Process R receives and subsequently delivers m3: it is an orphan.
[Figure: P sends m1 (logged) and m2 (unlogged) to Q; Q sends m3 to R and then crashes and recovers; since m2 is never replayed, neither is m3, leaving R an orphan.]
Message-logging schemes
Notations
► DEP(m): processes to which m has been delivered. If message m∗ is causally dependent on the delivery of m, and m∗ has been delivered to Q, then Q ∈ DEP(m).
► COPY(m): processes that have a copy of m, but have not (yet) reliably stored it.
► FAIL: the collection of crashed processes.
Characterization
Q is orphaned ⇔ ∃m: Q ∈ DEP(m) and COPY(m) ⊆ FAIL
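This characterization translates directly into a set computation; the sketch below, with DEP, COPY, and FAIL as plain sets and a made-up message m3, is purely illustrative.

    def orphaned(q, messages, DEP, COPY, FAIL):
        """Q is orphaned iff some message m was delivered to Q while every
        process still holding an unstable copy of m has crashed."""
        return any(q in DEP[m] and COPY[m] <= FAIL for m in messages)

    # Illustrative assumption: m3 was delivered to R, and only the crashed
    # process Q ever held an unstable copy of it.
    print(orphaned("R", ["m3"],
                   DEP={"m3": {"R"}}, COPY={"m3": {"Q"}}, FAIL={"Q"}))  # True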
Pessimistic protocol
For each nonstable message m, there is at most one process dependent on m, that is, |DEP(m)| ≤ 1.
Consequence
An unstable message in a pessimistic protocol must be made stable before sending the next message.
Optimistic protocol
For each unstable message m, we ensure that if COPY(m) ⊆ FAIL, then eventually also DEP(m) ⊆ FAIL.
Consequence
To guarantee that DEP(m) ⊆ FAIL, we generally roll back each orphan process Q until Q ∉ DEP(m).
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

slides.08.pptx

  • 8. Failure models: types of failures
Type of failure: description of the server's behavior
► Crash failure: halts, but is working correctly until it halts
► Omission failure: fails to respond to incoming requests
  - Receive omission: fails to receive incoming messages
  - Send omission: fails to send messages
► Timing failure: response lies outside a specified time interval
► Response failure: response is incorrect
  - Value failure: the value of the response is wrong
  - State-transition failure: deviates from the correct flow of control
► Arbitrary failure: may produce arbitrary responses at arbitrary times
  • 10. Dependability versus security
Omission versus commission: arbitrary failures are sometimes qualified as malicious. It is better to make the following distinction:
► Omission failure: a component fails to take an action that it should have taken.
► Commission failure: a component takes an action that it should not have taken.
Observation: deliberate failures, be they omission or commission failures, are typically security problems. Distinguishing between deliberate failures and unintentional ones is, in general, impossible.
  • 11. Halting failures
Scenario: C no longer perceives any activity from C∗. Is this a halting failure? Distinguishing between a crash and an omission/timing failure may be impossible.
Asynchronous versus synchronous systems:
► Asynchronous system: no assumptions about process execution speeds or message delivery times → crash failures cannot be reliably detected.
► Synchronous system: process execution speeds and message delivery times are bounded → omission and timing failures can be reliably detected.
► In practice we have partially synchronous systems: most of the time the system can be assumed to be synchronous, yet there is no bound on the time during which it is asynchronous → crash failures can normally be detected reliably.
  • 12. Halting failures: assumptions we can make
► Fail-stop: crash failures, but reliably detectable
► Fail-noisy: crash failures, eventually reliably detectable
► Fail-silent: omission or crash failures; clients cannot tell what went wrong
► Fail-safe: arbitrary, yet benign failures (i.e., they cannot do any harm)
► Fail-arbitrary: arbitrary, with malicious failures
  • 13. Redundancy for failure masking: types of redundancy
► Information redundancy: add extra bits to data units so that errors can be recovered when bits are garbled.
► Time redundancy: design a system such that an action can be performed again if anything goes wrong. Typically used when faults are transient or intermittent.
► Physical redundancy: add equipment or processes so that the failure of one or more components can be tolerated. This type is extensively used in distributed systems (see the voting sketch below).
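Physical redundancy in its simplest form is triple modular redundancy: run three replicas and let a voter mask one faulty result by majority. A minimal sketch; the replica outputs here are hypothetical:

from collections import Counter

def vote(results):
    """Return the majority value among replica results, or None if no majority."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

# Three hypothetical replicas compute the same function; one returns a wrong value.
replica_outputs = [42, 42, 17]
print(vote(replica_outputs))  # 42: the faulty replica is masked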
  • 14. Process resilience
Basic idea: protect against malfunctioning processes through process replication, organizing multiple processes into a process group. Distinguish between flat groups and hierarchical groups.
[Figure: group organization, showing a flat group versus a hierarchical group with a coordinator and workers]
  • 17. Groups and failure masking
k-fault-tolerant group: a group that can mask any k concurrent member failures (k is called the degree of fault tolerance).
How large does a k-fault-tolerant group need to be? (See the sizing helper below.)
► With halting failures (crash/omission/timing failures): a total of k + 1 members suffices, as no member will produce an incorrect result, so the result of one member is good enough.
► With arbitrary failures: 2k + 1 members are needed so that the correct result can be obtained through a majority vote.
Important assumptions:
► All members are identical
► All members process commands in the same order
Result: we can now be sure that all processes do exactly the same thing.
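The two sizing rules encode directly; a small, purely illustrative helper:

def required_group_size(k, failure_model):
    """Minimum group size that masks k concurrent failures.

    'halting' covers crash/omission/timing failures; 'arbitrary'
    requires a majority vote over the replies."""
    if failure_model == "halting":
        return k + 1
    if failure_model == "arbitrary":
        return 2 * k + 1
    raise ValueError("unknown failure model")

print(required_group_size(2, "halting"))    # 3
print(required_group_size(2, "arbitrary"))  # 5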
  • 18. Consensus
Prerequisite: in a fault-tolerant process group, each nonfaulty process executes the same commands, and in the same order, as every other nonfaulty process.
Reformulation: nonfaulty group members need to reach consensus on which command to execute next.
  • 20. Flooding-based consensus
System model:
► A process group P = {P1, ..., Pn}
► Fail-stop failure semantics, i.e., with reliable failure detection
► A client contacts a Pi, requesting it to execute a command
► Every Pi maintains a list of proposed commands
Basic algorithm, based on rounds (see the sketch below):
1. In round r, Pi multicasts its known set of commands C_i^r to all others.
2. At the end of round r, each Pi merges all received commands into a new set C_i^{r+1}.
3. The next command cmd_i is selected through a globally shared, deterministic function: cmd_i ← select(C_i^{r+1}).
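A toy sketch of these rounds, assuming no failures so that every process receives every multicast; select() is taken to be min, one possible globally shared deterministic choice:

def flooding_consensus(proposals, rounds):
    """Sketch of flooding-based consensus among nonfaulty processes.

    proposals: dict mapping process id -> set of initially proposed commands.
    Each round, every process multicasts its command set and merges what it
    receives; once all sets are equal, a shared deterministic select() decides."""
    known = {p: set(cmds) for p, cmds in proposals.items()}
    for _ in range(rounds):
        union = set().union(*known.values())  # every process receives every multicast
        known = {p: set(union) for p in known}
    select = min                              # globally shared deterministic choice
    return {p: select(cmds) for p, cmds in known.items()}

print(flooding_consensus({1: {"a"}, 2: {"b"}, 3: {"c"}}, rounds=1))
# every process decides 'a'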
  • 21. Flooding-based consensus: example
[Figure: P1 crashes while multicasting its proposal in round 1; P2, P3, and P4 exchange proposals, but only P2 decides]
Observations:
► P2 received all proposed commands from all other processes ⇒ it makes a decision.
► P3 may have detected that P1 crashed, but does not know whether P2 received anything, i.e., P3 cannot know whether it has the same information as P2 ⇒ it cannot make a decision (the same holds for P4).
  • 22. Realistic consensus: Paxos
Assumptions (rather weak ones, and realistic):
► A partially synchronous system (in fact, it may even be asynchronous).
► Communication between processes may be unreliable: messages may be lost, duplicated, or reordered.
► Corrupted messages can be detected (and thus subsequently ignored).
► All operations are deterministic: once an execution is started, it is known exactly what it will do.
► Processes may exhibit crash failures, but not arbitrary failures.
► Processes do not collude.
Understanding Paxos: we will build up Paxos from scratch, to see where many consensus algorithms actually come from.
  • 23. Paxos essentials
Starting point:
► We assume a client-server configuration, with initially one primary server.
► To make the server more robust, we start by adding a backup server.
► To ensure that all commands are executed in the same order at both servers, the primary assigns unique sequence numbers to all commands. In Paxos, the primary is called the leader.
► Assume that actual commands can always be restored (either from clients or from servers) ⇒ we consider only control messages.
  • 24. Two-server situation
[Figure: leader S1 receives operations o1 and o2 from clients C1 and C2, assigns them sequence numbers, and forwards them to S2; both servers execute the operations in the same order]
  • 25. Handling lost messages: some Paxos terminology
► The leader sends an accept message ACCEPT(o, t) to the backups when assigning a timestamp t to command o.
► A backup responds by sending a learn message: LEARN(o, t).
► When the leader notices that operation o has not yet been learned, it retransmits ACCEPT(o, t) with the original timestamp.
  • 26. Two servers and one crash: problem
[Figure: leader S1 executes o1 and crashes before its accept message reaches S2; S2 only executes o2]
Problem: the primary crashes after executing an operation, but the backup never received the accept message.
  • 27. Two servers and one crash: solution
[Figure: S1 waits for LEARN(o1) from S2 before executing o1; after S1 crashes, S2 can safely continue with o2]
Solution: never execute an operation before it is clear that it has been learned.
  • 30. Three servers and two crashes: still a problem?
[Figure: S1 sends ACCEPT(o1, 1); S2 sends LEARN(o1) and executes o1; the state of S3 is unclear]
Scenario: what happens when LEARN(o1), as sent by S2 to S1, is lost?
Solution: S2 will also have to wait until it knows that S3 has learned o1.
  • 31. Paxos: fundamental rule
General rule: in Paxos, a server S cannot execute an operation o until it has received a LEARN(o) from all other nonfaulty servers.
  • 33. Failure detection
Practice: reliable failure detection is practically impossible. A solution is to set timeouts, but take into account that a detected failure may be false.
[Figure: leader S1 sends ACCEPT(o1, 1); S2's timeout expires and it falsely concludes that S1 crashed, so S2 takes over as leader and sends ACCEPT(o2, 1), with two servers now acting as leader]
  • 35. Required number of servers
Observation: Paxos needs at least three servers.
Adapted fundamental rule: in Paxos with three servers, a server S cannot execute an operation o until it has received at least one (other) LEARN(o) message, so that it knows that a majority of servers will execute o. (A leader-side sketch of this rule follows below.)
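A minimal, illustrative sketch of the adapted rule from a server's side: an operation is executed once learns from a majority (counting the server itself) have arrived. Message transport, retransmission, and leader election are omitted, and the names are ours, not prescribed by Paxos:

class PaxosServerSketch:
    """Illustrative only: execute an operation once a majority has learned it."""

    def __init__(self, server_id, n_servers=3):
        self.server_id = server_id
        self.majority = n_servers // 2 + 1
        self.learned = {}      # timestamp -> set of servers known to have learned
        self.executed = set()

    def on_learn(self, op, timestamp, sender):
        self.learned.setdefault(timestamp, {self.server_id}).add(sender)
        if len(self.learned[timestamp]) >= self.majority and timestamp not in self.executed:
            self.executed.add(timestamp)
            print(f"S{self.server_id} executes {op} at timestamp {timestamp}")

leader = PaxosServerSketch(server_id=1)
leader.on_learn("o1", 1, sender=2)  # 2 of 3 servers know o1 -> safe to execute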
  • 37. Required number of servers
Assumptions before taking the next steps:
► Initially, S1 is the leader.
► A server can reliably detect that it has missed a message, and recover from that miss.
► When a new leader needs to be elected, the remaining servers follow a strictly deterministic algorithm, such as S1 → S2 → S3.
► A client cannot be asked to help the servers resolve a situation.
Observation: if either one of the backups (S2 or S3) crashes, Paxos behaves correctly: operations at nonfaulty servers are executed in the same order.
  • 40. Leader crashes after executing o1
S3 is completely ignorant of any activity by S1:
► S2 received ACCEPT(o1, 1), detects the crash, and becomes leader.
► S3 never even received ACCEPT(o1, 1).
► If S2 sends ACCEPT(o2, 2) ⇒ S3 sees an unexpected timestamp and tells S2 that it missed o1.
► S2 retransmits ACCEPT(o1, 1), allowing S3 to catch up.
S2 missed ACCEPT(o1, 1):
► S2 did detect the crash and became the new leader.
► If S2 sends ACCEPT(o1, 1) ⇒ S3 retransmits LEARN(o1).
► If S2 sends ACCEPT(o2, 1) ⇒ S3 tells S2 that it apparently missed ACCEPT(o1, 1) from S1, so that S2 can catch up.
  • 42. Leader crashes after sending ACCEPT(o1, 1)
S3 is completely ignorant of any activity by S1: as soon as S2 announces that o2 is to be accepted, S3 will notice that it missed an operation and can ask S2 to help it recover.
S2 had missed ACCEPT(o1, 1): as soon as S2 proposes an operation, it will be using a stale timestamp, allowing S3 to tell S2 that it missed operation o1.
Observation: Paxos (with three servers) behaves correctly when a single server crashes, regardless of when that crash took place.
  • 43. False crash detections
[Figure: S2 falsely detects a crash of leader S1 and takes over, sending ACCEPT(o2, 1); S1 drops leadership; S3 receives ACCEPT(o1, 1) long after ACCEPT(o2, 1) and LEARN(o2), causing confusion]
Problem and solution: S3 receives ACCEPT(o1, 1), but much later than ACCEPT(o2, 1). If it knew who the current leader was, it could safely reject the delayed accept message ⇒ leaders should include their ID in messages.
  • 45. But what about progress?
[Figure: S1 and S2 both act as leader, sending ACCEPT(S1, o1, 1) and ACCEPT(S2, o2, 1); S3 learns both operations, but progress is not guaranteed]
Essence of the solution: when S2 takes over, it needs to make sure that any outstanding operations initiated by S1 have been properly flushed, i.e., executed by enough servers. This requires an explicit leadership takeover by which the other servers are informed before new accept messages are sent out.
  • 46. Consensus under arbitrary failure semantics
Essence: we consider process groups in which communication between processes is inconsistent: (a) improper forwarding of messages, or (b) telling different things to different processes.
[Figure: two three-process scenarios: in (a), P1 forwards a value other than the one it received; in (b), P1 sends value a to P2 but value b to P3]
  • 47. Consensus under arbitrary failure semantics
System model:
► We consider a primary P and n − 1 backups B1, ..., B(n−1).
► A client sends v ∈ {T, F} to P.
► Messages may be lost, but this can be detected.
► Messages cannot be corrupted beyond detection.
► The receiver of a message can reliably detect its sender.
Byzantine agreement: requirements:
► BA1: every nonfaulty backup process stores the same value.
► BA2: if the primary is nonfaulty, then every nonfaulty backup process stores exactly what the primary has sent.
Observation:
► Primary faulty ⇒ BA1 says that the backups may store the same, but different (and thus wrong) value than the one originally sent by the client.
► Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.
  • 48. Why having 3k processes is not enough
[Figure: two scenarios with three processes and k = 1 (a faulty backup versus a faulty primary); after two message rounds, a nonfaulty backup ends up with the value set {T, F} in both scenarios and cannot distinguish them, so it cannot decide]
  • 49. Why having 3k + 1 processes is enough
[Figure: two scenarios with four processes and k = 1 (a faulty backup versus a faulty primary); after two message rounds, each nonfaulty backup takes a majority vote over the values it collected, e.g., {T, {T, T}, {T, F}}, and all nonfaulty backups store the same value]
  • 50. Realizing fault tolerance
Observation: considering that the members in a fault-tolerant process group are so tightly coupled, we may bump into considerable performance problems, and perhaps even situations in which realizing fault tolerance is impossible.
Question: are there limitations to what can readily be achieved?
► What is needed to enable reaching consensus?
► What happens when groups are partitioned?
  • 51. Distributed consensus: when can it be reached?
[Figure: the circumstances under which distributed consensus can be reached, as a function of process behavior (synchronous/asynchronous), communication delay (bounded/unbounded), message ordering (ordered/unordered), and message transmission (unicast/multicast)]
Formal requirements for consensus:
► Processes produce the same output value
► Every output value must be valid
► Every process must eventually provide output
  • 52. Consistency, availability, and partitioning
CAP theorem: any networked system providing shared data can provide only two of the following three properties:
► C: consistency, by which a shared and replicated data item appears as a single, up-to-date copy
► A: availability, by which updates will always be eventually executed
► P: tolerance to the partitioning of the process group
Conclusion: in a network subject to communication failures, it is impossible to realize an atomic read/write shared memory that guarantees a response to every request.
  • 54. CAP theorem intuition
Simple situation: two interacting processes P and Q that can no longer communicate:
► Allow both P and Q to go ahead ⇒ no consistency
► Allow only one of P, Q to go ahead ⇒ no availability
► P and Q have to be assumed to continue communication ⇒ no partitioning allowed
Fundamental question: what are the practical ramifications of the CAP theorem?
  • 55. Failure detection
Issue: how can we reliably detect that a process has actually crashed?
General model:
► Each process is equipped with a failure-detection module
► A process P probes another process Q for a reaction
► If Q reacts: Q is considered to be alive (by P)
► If Q does not react within t time units: Q is suspected to have crashed
Observation: in a synchronous system, a suspected crash ≡ a known crash
  • 56. Practical failure detection
Implementation (see the sketch below):
► If P did not receive a heartbeat from Q within time t: P suspects Q.
► If Q later sends a message (which is received by P):
  - P stops suspecting Q
  - P increases the timeout value t
► Note: if Q did crash, P will keep suspecting Q.
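A sketch of this suspicion scheme; the transport, the threading, and the concrete timeout and back-off values are our own simplifications:

import time

class HeartbeatDetector:
    """Suspect a peer after a missed heartbeat; back off after false suspicions."""

    def __init__(self, timeout=1.0):
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()
        self.suspected = False

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()
        if self.suspected:          # a false suspicion: be more lenient next time
            self.suspected = False
            self.timeout *= 2

    def check(self):
        if time.monotonic() - self.last_heartbeat > self.timeout:
            self.suspected = True   # if the peer really crashed, this stays True
        return self.suspected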
  • 58. Reliable remote procedure calls
What can go wrong?
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Two "easy" solutions:
1 (cannot locate server): just report back to the client.
2 (request was lost): just resend the message.
  • 59. Reliable RPC: server crash
[Figure: (a) the server receives a request, executes it, and replies; (b) the server crashes after executing but before replying; (c) the server crashes right after receiving the request]
Problem: where (a) is the normal case, situations (b) and (c) require different solutions. However, the client does not know what happened. Two approaches:
► At-least-once semantics: the server guarantees it will carry out an operation at least once, no matter what.
► At-most-once semantics: the server guarantees it will carry out an operation at most once.
  • 60. Why fully transparent server recovery is impossible
Three types of events at the server (assume the server is requested to update a document):
► M: send the completion message
► P: complete the processing of the document
► C: crash
Six possible orderings (actions between brackets never take place):
1. M → P → C: crash after reporting completion
2. M → C(→ P): crash after reporting completion, but before the update
3. P → M → C: crash after reporting completion, and after the update
4. P → C(→ M): update took place, and then a crash
5. C(→ P → M): crash before doing anything
6. C(→ M → P): crash before doing anything
  • 61. Why fully transparent server recovery is impossible
Reissue strategy of the client versus event ordering at the server. OK = document updated once; DUP = document updated twice; ZERO = document not updated at all.

Strategy M → P:
  Client reissue strategy    M→P→C   M→C(→P)   C(→M→P)
  Always                     DUP     OK        OK
  Never                      OK      ZERO      ZERO
  Only when ACKed            DUP     OK        ZERO
  Only when not ACKed        OK      ZERO      OK

Strategy P → M:
  Client reissue strategy    P→M→C   P→C(→M)   C(→P→M)
  Always                     DUP     DUP       OK
  Never                      OK      OK        ZERO
  Only when ACKed            DUP     OK        ZERO
  Only when not ACKed        OK      DUP       OK
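The table can be checked mechanically. The sketch below models each ordering by whether M and P happened before the crash, assumes a reissued request is executed exactly once after recovery, and takes "ACKed" to mean that M arrived before the crash; these modeling choices are ours:

# Which of M (completion message) and P (update) happen before the crash C.
orderings = {
    "M->P->C":   {"M": True,  "P": True},
    "M->C(->P)": {"M": True,  "P": False},
    "C(->M->P)": {"M": False, "P": False},
    "P->M->C":   {"M": True,  "P": True},
    "P->C(->M)": {"M": False, "P": True},
    "C(->P->M)": {"M": False, "P": False},
}

strategies = {
    "always":         lambda acked: True,
    "never":          lambda acked: False,
    "when ACKed":     lambda acked: acked,
    "when not ACKed": lambda acked: not acked,
}

def outcome(events, reissue):
    # updates before the crash, plus one more if the client reissues
    updates = int(events["P"]) + int(reissue(events["M"]))
    return {0: "ZERO", 1: "OK", 2: "DUP"}[updates]

for name, events in orderings.items():
    print(name, {s: outcome(events, f) for s, f in strategies.items()})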
  • 62. Reliable RPC: lost reply messages
The real issue: what the client notices is that it is not getting an answer. However, it cannot decide whether this is caused by a lost request, a crashed server, or a lost reply.
Partial solution: design the server such that its operations are idempotent: repeating the same operation has the same effect as carrying it out exactly once:
► pure read operations
► strict overwrite operations
Many operations are inherently nonidempotent, such as many banking transactions. (A small illustration follows below.)
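A short illustration with a hypothetical account store: a strict overwrite tolerates a retried request, an increment does not:

account = {"balance": 100}

def set_balance(value):   # strict overwrite: idempotent
    account["balance"] = value

def deposit(amount):      # nonidempotent: replaying it changes the result
    account["balance"] += amount

set_balance(150); set_balance(150)  # retried request: balance is still 150
deposit(50); deposit(50)            # retried request: deposited twice -> 250
print(account["balance"])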
  • 63. Reliable RPC: client crash
Problem: the server is doing work and holding resources for nothing (this is called an orphan computation).
Solutions:
► The orphan is killed (or rolled back) by the client when it recovers.
► The client broadcasts a new epoch number when recovering ⇒ the server kills the client's orphans.
► Require computations to complete within T time units; old ones are simply removed.
  • 64. Simple reliable group communication
Intuition: a message sent to a process group G should be delivered to each member of G. Important: make a distinction between receiving and delivering messages.
[Figure: in each group member, a message-handling component with group-membership functionality sits between the local OS and the application; message reception happens at the component, message delivery at the application]
  • 65. Reliable communication in the presence of faulty processes
Group communication is reliable when it can be guaranteed that a message is received and subsequently delivered by all nonfaulty group members.
Tricky part: agreement is needed on what the group actually looks like before a received message can be delivered.
  • 66. Simple reliable group communication
Reliable communication, assuming nonfaulty processes: reliable group communication now boils down to reliable multicasting: is a message received and delivered to each recipient, as intended by the sender?
[Figure: the sender keeps message #25 in a history buffer and multicasts it; receivers with Last = 24 respond with ACK 25, while a receiver that missed message #24 (Last = 23) responds "Missed 24" so the sender can retransmit]
(A receiver-side sketch follows below.)
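A receiver-side sketch of this scheme, with sequence numbers, acknowledgements, and a negative acknowledgement for gaps; buffering and transport are simplified and the message format is ours:

class MulticastReceiver:
    """Receiver side of sequence-numbered reliable multicast (sketch)."""

    def __init__(self):
        self.last = 0      # highest consecutively delivered sequence number
        self.pending = {}  # out-of-order messages awaiting the gap to be filled

    def receive(self, seq, payload):
        if seq <= self.last:
            return ("ACK", self.last)         # duplicate: re-acknowledge
        self.pending[seq] = payload
        if seq > self.last + 1:
            return ("MISSED", self.last + 1)  # ask the sender to retransmit the gap
        while self.last + 1 in self.pending:  # deliver everything now in order
            self.last += 1
            print("deliver", self.pending.pop(self.last))
        return ("ACK", self.last)

r = MulticastReceiver()
r.last = 23                   # pretend messages 1..23 were already delivered
print(r.receive(25, "M25"))   # ('MISSED', 24): the receiver reports the gap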
  • 67. Distributed commit protocols
Problem: have an operation performed by each member of a process group, or by none at all.
► Reliable multicasting: a message is to be delivered to all recipients.
► Distributed transaction: each local transaction must succeed.
  • 68. Two-phase commit protocol (2PC)
Essence: the client that initiated the computation acts as coordinator; the processes required to commit are the participants.
► Phase 1a: the coordinator sends VOTE-REQUEST to the participants (also called a pre-write).
► Phase 1b: when a participant receives VOTE-REQUEST, it returns either VOTE-COMMIT or VOTE-ABORT to the coordinator. If it sends VOTE-ABORT, it aborts its local computation.
► Phase 2a: the coordinator collects all votes; if all are VOTE-COMMIT, it sends GLOBAL-COMMIT to all participants, otherwise it sends GLOBAL-ABORT.
► Phase 2b: each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT and handles it accordingly.
  • 69. 2PC: finite state machines
[Figure: coordinator FSM: INIT goes to WAIT on Commit / send Vote-request; WAIT goes to ABORT on Vote-abort / send Global-abort, or to COMMIT on Vote-commit / send Global-commit. Participant FSM: INIT goes to READY on Vote-request / send Vote-commit, or to ABORT on Vote-request / send Vote-abort; READY goes to ABORT on Global-abort / send ACK, or to COMMIT on Global-commit / send ACK]
  • 74. 2PC: failing participant
Analysis: a participant crashes in state S, and recovers to S:
► INIT: no problem: the participant was unaware of the protocol.
► READY: the participant is waiting to either commit or abort. After recovery, it needs to know which state transition to make ⇒ log the coordinator's decision.
► ABORT: merely make the entry into the abort state idempotent, e.g., by removing the workspace of results.
► COMMIT: also make the entry into the commit state idempotent, e.g., by copying the workspace to storage.
Observation: when distributed commit is required, having participants use temporary workspaces to keep their results allows for simple recovery in the presence of failures.
  • 75. 2PC: failing participant
Alternative: when recovery to the READY state is needed, check the state of the other participants ⇒ no need to log the coordinator's decision.
Recovering participant P contacts another participant Q (see the decision sketch below):
► Q in COMMIT ⇒ P makes the transition to COMMIT
► Q in ABORT ⇒ P makes the transition to ABORT
► Q in INIT ⇒ P makes the transition to ABORT
► Q in READY ⇒ contact another participant
Result: if all participants are in the READY state, the protocol blocks. Apparently, the coordinator is failing. Note: the protocol prescribes that we need the decision from the coordinator.
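The decision rule for a recovering participant fits in a few lines; this is illustrative only, with communication replaced by a list of reported peer states:

def recover_ready(peer_states):
    """What a participant recovering to READY does, given its peers' states."""
    for state in peer_states:
        if state in ("COMMIT", "ABORT"):
            return state    # adopt the decision some peer already knows
        if state == "INIT":
            return "ABORT"  # the coordinator cannot have decided COMMIT yet
    return "BLOCKED"        # everyone READY: must wait for the coordinator

print(recover_ready(["READY", "INIT"]))   # ABORT
print(recover_ready(["READY", "READY"]))  # BLOCKED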
  • 76. 2PC: failing coordinator
Observation: the real problem lies in the fact that the coordinator's final decision may not be available for some time (or may actually be lost).
Alternative: let a participant P in the READY state time out when it has not received the coordinator's decision; P then tries to find out what the other participants know (as discussed).
Observation: the essence of the problem is that a recovering participant cannot make a local decision: it depends on other (possibly failed) processes.
  • 77. Coordinator in Python

class Coordinator:

    def run(self):
        yetToReceive = list(participants)
        self.log.info('WAIT')
        self.chan.sendTo(participants, VOTE_REQUEST)
        while len(yetToReceive) > 0:
            msg = self.chan.recvFrom(participants, TIMEOUT)
            if (not msg) or (msg[1] == VOTE_ABORT):
                self.log.info('ABORT')
                self.chan.sendTo(participants, GLOBAL_ABORT)
                return
            else:  # msg[1] == VOTE_COMMIT
                yetToReceive.remove(msg[0])
        self.log.info('COMMIT')
        self.chan.sendTo(participants, GLOBAL_COMMIT)
  • 78. Participant in Python

class Participant:

    def run(self):
        msg = self.chan.recvFrom(coordinator, TIMEOUT)
        if (not msg):  # Crashed coordinator - give up entirely
            decision = LOCAL_ABORT
        else:  # Coordinator will have sent VOTE_REQUEST
            decision = self.do_work()
            if decision == LOCAL_ABORT:
                self.chan.sendTo(coordinator, VOTE_ABORT)
            else:  # Ready to commit, enter READY state
                self.chan.sendTo(coordinator, VOTE_COMMIT)
                msg = self.chan.recvFrom(coordinator, TIMEOUT)
                if (not msg):  # Crashed coordinator - check the others
                    self.chan.sendTo(all_participants, NEED_DECISION)
                    while True:
                        msg = self.chan.recvFromAny()
                        if msg[1] in [GLOBAL_COMMIT, GLOBAL_ABORT, LOCAL_ABORT]:
                            decision = msg[1]
                            break
                else:  # Coordinator came to a decision
                    decision = msg[1]

        while True:  # Help any other participant when coordinator crashed
            msg = self.chan.recvFrom(all_participants)
            if msg[1] == NEED_DECISION:
                self.chan.sendTo([msg[0]], decision)
  • 79. Recovery: background
Essence: when a failure occurs, we need to bring the system into an error-free state:
► Forward error recovery: find a new state from which the system can continue operation.
► Backward error recovery: bring the system back into a previous error-free state.
Practice: use backward error recovery, which requires that we establish recovery points.
Observation: recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from which to recover.
  • 80. Consistent recovery state
Requirement: every message that has been received must also be shown to have been sent in the state of the sender.
Recovery line: assuming processes regularly checkpoint their state, the most recent consistent global checkpoint.
[Figure: checkpoints of P1 and P2 over time; the recovery line is the most recent pair of checkpoints crossed by no message the wrong way, while a later pair is an inconsistent collection of checkpoints because a message from P2 to P1 was received before the checkpoint but sent after it]
(A consistency predicate follows below.)
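The requirement can be stated as a predicate over a global checkpoint (a "cut") and the message history; a sketch with invented timestamps:

def consistent(cut, messages):
    """Check the recovery-line requirement for a global checkpoint.

    cut: dict process -> checkpoint time.
    messages: iterable of (sender, send_t, receiver, recv_t).
    A cut is inconsistent if some message was received before the
    receiver's checkpoint but sent after the sender's checkpoint."""
    return all(not (recv_t <= cut[dst] and send_t > cut[src])
               for src, send_t, dst, recv_t in messages)

cut = {"P1": 8, "P2": 5}
# Message sent by P2 at t=6 (after its checkpoint) and received by P1
# at t=7 (before its checkpoint): the cut is inconsistent.
print(consistent(cut, [("P2", 6, "P1", 7)]))  # False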
  • 85. Coordinated checkpointing
Essence: each process takes a checkpoint after a globally coordinated action.
Simple solution: use a two-phase blocking protocol:
► A coordinator multicasts a checkpoint-request message.
► When a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint.
► When all checkpoints have been confirmed at the coordinator, the latter broadcasts a checkpoint-done message to allow all processes to continue.
Observation: it is possible to consider only those processes that depend on the recovery of the coordinator, and ignore the rest.
  • 86. Cascaded rollback
Observation: if checkpointing is done at the "wrong" instants, the recovery line may lie at system startup time. We then have a so-called cascaded rollback.
[Figure: P1 and P2 checkpoint independently; message m and its successor m* cross every pair of checkpoints inconsistently, so rollbacks cascade back to the initial state]
  • 92. Independent checkpointing
Essence: each process independently takes checkpoints, with the risk of a cascaded rollback to system startup. (A bookkeeping sketch follows below.)
► Let CP_i(m) denote the m-th checkpoint of process P_i, and INT_i(m) the interval between CP_i(m − 1) and CP_i(m).
► When process P_i sends a message in interval INT_i(m), it piggybacks (i, m).
► When process P_j receives a message in interval INT_j(n), it records the dependency INT_i(m) → INT_j(n).
► The dependency INT_i(m) → INT_j(n) is saved to storage when taking checkpoint CP_j(n).
Observation: if process P_i rolls back to CP_i(m − 1), P_j must roll back to CP_j(n − 1).
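A bookkeeping sketch of the piggybacking scheme above; channels and the rollback itself are omitted, and the class and method names are ours:

class Process:
    """Dependency bookkeeping for independent checkpointing (sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.interval = 1      # current interval index: INT_pid(interval)
        self.deps = set()      # dependencies recorded since the last checkpoint
        self.stable_deps = []  # dependencies saved with each checkpoint

    def send(self, payload):
        return (self.pid, self.interval, payload)  # piggyback (i, m)

    def receive(self, msg):
        sender, sender_interval, payload = msg
        # record INT_sender(sender_interval) -> INT_self(self.interval)
        self.deps.add(((sender, sender_interval), (self.pid, self.interval)))

    def checkpoint(self):
        self.stable_deps.append(frozenset(self.deps))  # saved to storage with CP
        self.deps = set()
        self.interval += 1

p1, p2 = Process(1), Process(2)
p2.receive(p1.send("hello"))
p2.checkpoint()
print(p2.stable_deps)  # [frozenset({((1, 1), (2, 1))})]: if P1 rolls back past
                       # CP_1(1), P2 must roll back past CP_2(1) as well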
  • 93. Message logging
Alternative: instead of taking an (expensive) checkpoint, try to replay your (communication) behavior from the most recent checkpoint ⇒ store messages in a log.
Assumption: we assume a piecewise deterministic execution model:
► The execution of each process can be considered as a sequence of state intervals.
► Each state interval starts with a nondeterministic event (e.g., a message receipt).
► Execution within a state interval is deterministic.
Conclusion: if we record nondeterministic events (to replay them later), we obtain a deterministic execution model that allows us to do a complete replay.
  • 94. Message logging and consistency
When should we actually log messages? Avoid orphan processes:
► Process Q has just received and delivered messages m1 and m2.
► Assume that m2 is never logged.
► After delivering m1 and m2, Q sends message m3 to process R.
► Process R receives and subsequently delivers m3: it becomes an orphan.
[Figure: Q crashes and recovers; m1 is replayed from the log, but the unlogged m2 is never replayed, so neither is m3; R has delivered a message that, after recovery, was never sent]
  • 95. Message-logging schemes
Notations:
► DEP(m): the processes to which m has been delivered. If message m* is causally dependent on the delivery of m, and m* has been delivered to Q, then Q ∈ DEP(m).
► COPY(m): the processes that have a copy of m, but have not (yet) reliably stored it.
► FAIL: the collection of crashed processes.
Characterization: Q is orphaned ⇔ ∃m: Q ∈ DEP(m) and COPY(m) ⊆ FAIL. (A small predicate follows below.)
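The characterization translates directly into a predicate; the message and process sets below are invented for illustration:

def orphaned(q, dep, copy, failed):
    """Q is orphaned iff some message delivered (directly or causally) to Q
    survives only at crashed processes.

    dep, copy: dicts mapping message -> set of processes (DEP(m), COPY(m))."""
    return any(q in dep[m] and copy[m] <= failed for m in dep)

dep  = {"m2": {"Q", "R"}, "m3": {"R"}}
copy = {"m2": {"Q"}, "m3": {"Q", "R"}}
print(orphaned("R", dep, copy, failed={"Q"}))  # True: the only copy of m2 crashed with Q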
  • 96. Message-logging schemes
Pessimistic protocol: for each nonstable message m, there is at most one process dependent on m, that is, |DEP(m)| ≤ 1.
Consequence: in a pessimistic protocol, an unstable message must be made stable before the next message is sent.
  • 97. Message-logging schemes
Optimistic protocol: for each unstable message m, we ensure that if COPY(m) ⊆ FAIL, then eventually also DEP(m) ⊆ FAIL.
Consequence: to guarantee that DEP(m) ⊆ FAIL, we generally roll back each orphan process Q until Q ∉ DEP(m).