9 fault-tolerance

Basic Concepts
• Availability
The system is ready to work immediately
• Reliability
The system can run continuously
• Safety
When the system fails, nothing catastrophic happens
• Maintainability
A failed system can be easily repaired.
Fault types: transient, intermittent, permanent

Failure Models
Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to requests
  Receive omission           A server fails to receive incoming messages
  Send omission              A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure              The value of the response is wrong
  State transition failure   The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times
Different types of failures.

Failure Masking by Redundancy
• Information redundancy (extra bits)
• Time redundancy (extra operations)
• Physical redundancy (extra equipment or processes)

Failure Masking by Redundancy
Triple modular redundancy (TMR).
An electronic circuit example
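
The voting idea behind TMR can be sketched in software. This is only an illustration, not the circuit from the slide: the names majority_vote and tmr, and the two example replicas, are hypothetical. Three replicas compute the same result and a voter passes on the value that at least two of them agree on, so a single faulty replica is masked.

```python
from collections import Counter

def majority_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # More than one replica is faulty: TMR cannot mask this case.
        raise ValueError("no majority: all three replicas disagree")
    return value

def tmr(replicas, x):
    """Run three replicas of the same computation and vote on the outputs."""
    outputs = [replica(x) for replica in replicas]
    return majority_vote(*outputs)

# Hypothetical replicas: two correct ones and one that always fails.
def good(x):
    return x + 1

def faulty(x):
    return -1

print(tmr([good, faulty, good], 41))
```

The voter itself is a single point of failure in this sketch; replicating the voter at each stage would be needed to remove it.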

Process failures
To tolerate a faulty process, identical processes are organized into a
group
When one process of the group fails, some other process in the group
takes over the work
Process groups may be dynamic
Mechanisms are needed for managing group membership
• A group server maintains information on membership (centralized)
• Distributed management (more complex and more time-consuming)

Flat Groups versus Hierarchical Groups
a) Communication in a flat group (voting mechanism, slow decision)
Replicated write protocols
b) Communication in a simple hierarchical group (single point of failure)
Primary based protocols

Client-server communication failures
Using a reliable transport protocol (TCP) masks omission failures,
but many other failures are not masked.
Classes of failure
• The client is unable to locate the server – raising an exception is a solution, but we lose
transparency
• The request message from the client to the server is lost – retransmission
• The server crashes after receiving a request
• The reply message from the server to the client is lost – retransmission, but…
• The client crashes after sending a request – an orphan is generated.
(solutions: extermination, reincarnation with epoch #, gentle reincarnation, expiration…)

Server Crashes (1)
A server in client-server communication
a) Normal case
b) Crash after execution
c) Crash before execution
At-least-once semantics: after the server reboots, keep retrying until a reply is obtained
At-most-once semantics: give up immediately and report the failure
Exactly-once semantics: in general, there is no way to guarantee it

Server Crashes (2)
Example: a client sends a message to a server asking it to print the message (P); the server
sends a completion message back (M). The server can crash (C)
Client                 Server
                       Strategy M -> P            Strategy P -> M
Reissue strategy       MPC     MC(P)   C(MP)      PMC     PC(M)   C(PM)
Always                 DUP     OK      OK         DUP     DUP     OK
Never                  OK      ZERO    ZERO       OK      OK      ZERO
Only when ACKed        DUP     OK      ZERO       DUP     OK      ZERO
Only when not ACKed    OK      ZERO    OK         OK      DUP     OK
OK = text is printed once, DUP = text is printed twice, ZERO = text is not printed at all.
Events in parentheses never happen because the server has already crashed.
Different combinations of client and server strategies in the presence of server crashes.
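
The table can be derived mechanically, which makes it easier to see why no combination of strategies works in every case. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a reissued request is always printed exactly once by the recovered server (no second crash).

```python
SERVER_ORDERS = ["MPC", "MC(P)", "C(MP)", "PMC", "PC(M)", "C(PM)"]

# Client reissue strategies: does the client resend, given whether it got M?
REISSUE = {
    "Always": lambda acked: True,
    "Never": lambda acked: False,
    "Only when ACKed": lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}

def happened(order):
    """Events that occur before the crash; parenthesized events never happen."""
    return set(order.split("(")[0]) - {"C"}

def outcome(strategy, order):
    done = happened(order)
    prints = int("P" in done)
    if REISSUE[strategy]("M" in done):
        # Assumption: the recovered server prints the reissued request once.
        prints += 1
    return ["ZERO", "OK", "DUP"][prints]

def table():
    return {s: [outcome(s, o) for o in SERVER_ORDERS] for s in REISSUE}

for strategy, row in table().items():
    print(f"{strategy:<20}", *[f"{cell:<5}" for cell in row])
```

Every row contains at least one DUP or ZERO, so each client strategy fails for some crash point whichever order the server chooses.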

Group Communication
Basic Reliable-Multicasting Schemes
Important for messaging in process group
A simple solution to reliable multicasting when all receivers are known and are
assumed not to fail
a) Message transmission b) Reporting feedback
Efficient only for a small number of receivers (possible improvements: send only NACKs, use timers, etc.)

Nonhierarchical Feedback Control
To scale, we need to reduce the number of messages,
with feedback suppression
Several receivers have scheduled a request for retransmission, but the
first retransmission request leads to the suppression of others (Scalable
Reliable Multicasting protocol).
It leads to timing problems, useless retransmissions or a complicated
organization of the group membership
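
A small model of feedback suppression, written as a sketch under simplifying assumptions: each receiver that misses a message schedules its NACK after some delay (random in practice, fixed here so the example is reproducible), a multicast NACK reaches every other receiver after a fixed latency, and a receiver cancels its own NACK as soon as it hears one. The function name and parameters are hypothetical.

```python
def feedback_suppression(nack_delays, latency=0.0):
    """nack_delays maps receiver name -> delay before its NACK timer fires.

    Returns (receivers that send a NACK, receivers whose NACK is suppressed).
    """
    if not nack_delays:
        return [], []
    first = min(nack_delays.values())
    # A receiver still sends if its timer fires before the first NACK arrives.
    senders = sorted(r for r, d in nack_delays.items()
                     if d == first or d < first + latency)
    suppressed = sorted(r for r in nack_delays if r not in senders)
    return senders, suppressed

delays = {"R1": 0.30, "R2": 0.10, "R3": 0.12}
print(feedback_suppression(delays))                # ideal: only R2 sends
print(feedback_suppression(delays, latency=0.05))  # R3 fires before hearing R2
```

The second call shows the timing problem mentioned above: with nonzero latency, timers that fire close together still produce redundant retransmission requests.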

Hierarchical Feedback Control
The essence of hierarchical reliable multicasting. The receivers are partitioned into
subgroups, which are organized as a tree
• Each local coordinator forwards the message to its children.
• A local coordinator handles retransmission requests.
Acknowledgements are exchanged between coordinators

Atomic Multicast
In the presence of process failures, we need the guarantee that a message is delivered to all
or none of the receivers. This leads to the atomic multicast problem
Atomic multicasting ensures that group members maintain consistency
The logical organization of a distributed system to distinguish between message receipt
and message delivery
In atomic multicasting a multicast message is uniquely associated to a list of receiving
processes ( Group view )
A view change takes place when a process joins or leaves the group

Virtual Synchrony
We need an ordered reliable multicasting.
Virtual Synchrony guarantees that a message sent to a group view is delivered to each
non-faulty member of the group.
If the sender crashes, the message may be either delivered to all the other processes or
ignored by each of them.
The principle of virtual synchronous multicast (view change similar to
synchronization variable)

Message Ordering
Four different types of ordering of multicasts:
• Reliable, unordered multicast
no guarantee is given on the order in which messages are delivered
• FIFO-ordered multicast
messages from the same process are delivered in the order in which they were sent
• Causally ordered multicast
causality between messages is preserved
• Totally ordered multicast
messages are delivered in the same order to all members of the group
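
FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue. The sketch below assumes that scheme; the class FifoReceiver and its fields are hypothetical names, not something taken from the slides. A message that arrives early is received but held, and it is delivered only once all earlier messages from the same sender have been delivered.

```python
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)  # sender -> next expected sequence number
        self.held = defaultdict(dict)     # sender -> {seq: payload} received, not delivered
        self.delivered = []               # payloads in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return  # duplicate of an already delivered message
        self.held[sender][seq] = payload
        # Deliver every message from this sender that is now in order.
        while self.next_seq[sender] in self.held[sender]:
            n = self.next_seq[sender]
            self.delivered.append(self.held[sender].pop(n))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("P1", 1, "m2")  # arrives early: held back
r.receive("P4", 0, "m3")
r.receive("P1", 0, "m1")  # releases m1, then m2
r.receive("P4", 1, "m4")
print(r.delivered)
```

Messages from different senders are not ordered relative to each other here, so two receivers may still deliver m1 and m3 in different orders.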

Message Ordering
Unordered multicast:
Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1
Three communicating processes in the same group. The ordering of events per process is shown
along the vertical axis.
FIFO-ordered multicast:
Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4
Four processes in the same group with two different senders, and a possible delivery order of
messages under FIFO-ordered multicasting

Message Ordering
Virtually synchronous reliable multicasting offering totally ordered delivery
is called atomic multicasting
Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
Six different versions of virtually synchronous reliable multicasting.

Distributed Commit
Distributed commit means that an operation has to be performed by
each member of a group or none at all
One-phase distributed commit is performed by a coordinator (if a participant
cannot perform the operation, it has no way to tell the coordinator)
a) The finite state machine for the coordinator in two-phase commit.
b) The finite state machine for a participant.
The first phase is the voting phase, the second is the decision phase
Timeout mechanisms are necessary because the coordinator can crash

Two Phase Commit
• The coordinator sends a vote_request to all participants
• A participant returns a vote_commit (it is ready to commit its
part of the transaction) or a vote_abort
• The coordinator collects the votes and sends a global_commit or a
global_abort (if at least one participant has sent a vote_abort)
• A participant receives a global_commit and locally commits the
transaction, or receives a global_abort and locally aborts the
transaction
1 – voting phase
2 – decision phase
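
The two phases can be sketched from the coordinator's point of view. This is a minimal single-process illustration, not a networked implementation: participants are modelled as hypothetical vote functions, and treating a timed-out vote as a vote_abort is an assumption of this sketch.

```python
def two_phase_commit(participants):
    """participants maps name -> function returning True (commit) or False (abort).

    Returns (global decision, final local state of each participant).
    """
    # Phase 1 (voting): send vote_request and collect the votes.
    votes = {}
    for name, vote in participants.items():
        try:
            votes[name] = "vote_commit" if vote() else "vote_abort"
        except TimeoutError:
            # Assumption: a participant that does not answer in time counts as an abort vote.
            votes[name] = "vote_abort"

    # Phase 2 (decision): commit only if every participant voted to commit.
    unanimous = all(v == "vote_commit" for v in votes.values())
    decision = "global_commit" if unanimous else "global_abort"
    local = {name: ("COMMIT" if unanimous else "ABORT") for name in participants}
    return decision, local

def ready():
    return True

def refuses():
    return False

def timed_out():
    raise TimeoutError("no vote received")

print(two_phase_commit({"A": ready, "B": ready}))
print(two_phase_commit({"A": ready, "B": refuses}))
```

The sketch does not cover a coordinator crash between the two phases, which is the case in which participants can block.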

Three-Phase Commit
It avoids blocking processes when the coordinator crashes
The states of the coordinator and of each participant satisfy two conditions:
• There is no single state from which it is possible to make a transition directly to
either COMMIT or ABORT
• There is no state in which it is not possible to make a final decision and
from which a transition to COMMIT can be made

Recovery
• Backward recovery brings the system back to a previous correct
state. It is necessary to record the state (checkpointing)
• Forward recovery attempts to bring the system to a correct new
state from which execution can continue.
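
Backward recovery with checkpointing can be illustrated in a few lines. This is a sketch for a single process with in-memory state; the class name Checkpointed is hypothetical, and a real system would write checkpoints to stable storage and coordinate them across processes.

```python
import copy

class Checkpointed:
    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)  # initial checkpoint

    def checkpoint(self):
        """Record the current state as the latest known-correct state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: return to the last recorded state."""
        self.state = copy.deepcopy(self._saved)

p = Checkpointed({"balance": 100})
p.state["balance"] -= 30
p.checkpoint()               # balance 70 is recorded
p.state["balance"] = -999    # an error corrupts the state
p.rollback()
print(p.state)
```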
