CS9222 Advanced Operating
System
Unit – IV
Dr.A.Kathirvel
Professor & Head/IT - VCEW
Unit - IV
Basic Concepts-Classification of Failures – Basic
Approaches to Recovery; Recovery in Concurrent
System; Synchronous and Asynchronous
checkpointing and Recovery; Check pointing in
Distributed Database Systems; Fault Tolerance;
Issues - Two-phase and Nonblocking commit
Protocols; Voting Protocols; Dynamic Voting
Protocols;
Recovery
Recovery in computer systems refers to restoring a
system to its normal operational state.
Recovery may be as simple as restarting a failed
computer or restarting failed processes.
Recovery is generally a very complicated process.
For example, a process has memory allocated to it and
a process may have locked shared resources, such as
files and memory. Under such circumstances, if a
process fails, it is imperative that the resources
allocated to the failed process are undone.
4
Recovery
 Computer system recovery:
 Restore the system to a normal operational state
 Process recovery:
 Reclaim resources allocated to process,
 Undo modification made to databases, and
 Restart the process
 Or restart process from point of failure and resume execution
 Distributed process recovery (cooperating processes):
 Undo effect of interactions of failed process with other cooperating
processes.
 Replication (hardware components, processes, data):
 Main method for increasing system availability
 System:
 Set of hardware and software components
 Designed to provide a specified service (I.e. meet a set of requirements)
5
System failure:
– System does not meet requirements, i.e.does not perform its services as specified
Erroneous System State:
– State which could lead to a system failure by a sequence of valid state transitions
– Error: the part of the system state which differs from its intended value
Fault:
– Anomalous physical condition, e.g. design errors, manufacturing problems, damage,
external disturbances.
Recovery (cont.)
Error could lead to system failure
Error is a manifestation of a fault
6
Process failure:
 Behavior: process causes system state to deviate from specification (e.g. incorrect computation,
process stop execution)
 Errors causing process failure: protection violation, deadlocks, timeout, wrong user input, etc…
 Recovery: Abort process or
 Restart process from prior state
System failure:
 Behavior: processor fails to execute
 Caused by software errors or hardware faults (CPU/memory/bus/…/ failure)
 Recovery: system stopped and restarted in correct state
 Assumption: fail-stop processors, i.e. system stops execution, internal state is lost
Secondary Storage Failure:
 Behavior: stored data cannot be accessed
 Errors causing failure: parity error, head crash, etc.
 Recovery/Design strategies:
 Reconstruct content from archive + log of activities
 Design mirrored disk system
Communication Medium Failure:
 Behavior: a site cannot communicate with another operational site
 Errors/Faults: failure of switching nodes or communication links
 Recovery/Design Strategies: reroute, error-resistant communication protocols
Classification of failures
7
Failure recovery: restore an erroneous state to an error-free state
Approaches to failure recovery:
 Forward-error recovery:
 Remove errors in process/system state (if errors can be completely assessed)
 Continue process/system forward execution
 Backward-error recovery:
 Restore process/system to previous error-free state and restart from there
Comparison: Forward vs. Backward error recovery
 Backward-error recovery
 (+) Simple to implement
 (+) Can be used as general recovery mechanism
 (-) Performance penalty
 (-) No guarantee that fault does not occur again
 (-) Some components cannot be recovered
 Forward-error Recovery
 (+) Less overhead
 (-) Limited use, i.e. only when impact of faults understood
 (-) Cannot be used as general mechanism for error recovery
Backward and Forward Error Recovery
8
Principle: restore process/system to a known, error-free “recovery point”/ “checkpoint”.
System model:
Approaches:
 (1) Operation-based approach
 (2) State-based approach
Backward-Error Recovery: Basic approach
CPU
Main memory
secondar
y storage
stable
storage
Storage that
maintains
information in
the event of
system failure
Bring object to MM
to be accessed
Store logs and
recovery points
Write object back
if modified
9
Principle:
 Record all changes made to state of process (‘audit trail’ or ‘log’) such that process
can be returned to a previous state
 Example: A transaction based environment where transactions update a database
 It is possible to commit or undo updates on a per-transaction basis
 A commit indicates that the transaction on the object was successful and changes are
permanent
(1.a) Updating-in-place
 Principle: every update (write) operation to an object creates a log in stable storage
that can be used to ‘undo’ and ‘redo’ the operation
 Log content: object name, old object state, new object state
 Implementation of a recoverable update operation:
Do operation: update object and write log record
Undo operation: log(old) -> object (undoes the action performed by a do)
Redo operation: log(new) -> object (redoes the action performed by a do)
Display operation: display log record (optional)
 Problem: a ‘do’ cannot be recovered if system crashes after write object but before
log record write
(1.b) The write-ahead log protocol
 Principle: write log record before updating object
(1) The Operation-based Approach
10
 Principle: establish frequent ‘recovery points’ or ‘checkpoints’ saving the
entire state of process
 Actions:
 ‘Checkpointing’ or ‘taking a checkpoint’: saving process state
 ‘Rolling back’ a process: restoring a process to a prior state
 Note: A process should be rolled back to the most recent ‘recovery point’ to
minimize the overhead and delays in the completion of the process
 Shadow Pages: Special case of state-based approach
 Only a part of the system state is saved to minimize recovery
 When an object is modified, page containing object is first copied on stable
storage (shadow page)
 If process successfully commits: shadow page discarded and modified page is
made part of the database
 If process fails: shadow page used and the modified page discarded
(2) State-based Approach
11
Recovery in concurrent systems
 Issue: if one of a set of cooperating processes fails and has to be rolled back to a
recovery point, all processes it communicated with since the recovery point have to be
rolled back.
 Conclusion: In concurrent and/or distributed systems all cooperating processes have to
establish recovery points
 Orphan messages and the domino effect
 Case 1: failure of X after x3 : no impact on Y or Z
 Case 2: failure of Y after sending msg. ‘m’
 Y rolled back to y2
 ‘m’ ≡ orphan massage
 X rolled back to x2
 Case 3: failure of Z after z2
 Y has to roll back to y1
 X has to roll back to x1 Domino Effect
 Z has to roll back to z1
X
Y
Z
y1
x1
z1 z2
x2
y2
x3
m
Time
12
Lost messages
• Assume that x1 and y1 are the only recovery points for processes X and Y, respectively
• Assume Y fails after receiving message ‘m’
• Y rolled back to y1,X rolled back to x1
• Message ‘m’ is lost
Note: there is no distinction between this case and the case where message ‘m’ is lost in
communication channel and processes X and Y are in states x1 and y1, respectively
X
Y y1
x1
m
Time
Failure
13
Problem of livelock
• Livelock: case where a single failure can cause an infinite number of rollbacks
• Process Y fails before receiving message ‘n1’ sent by X
• Y rolled back to y1, no record of sending message ‘m1’, causing X to roll back to x1
• When Y restarts, sends out ‘m2’ and receives ‘n1’ (delayed)
• When X restarts from x1, sends out ‘n2’ and receives ‘m2’
• Y has to roll back again, since there is no record of ‘n1’ being sent
• This cause X to be rolled back again, since it has received ‘m2’ and there is no record of sending
‘m2’ in Y
• The above sequence can repeat indefinitely
X
Y y1
x1
m1
Time
Failure
n1
(a)
X
Y y1
x1
m2
Time
2nd roll back
n2
n1
(b)
(a)
(b)
CS-550 (M.Soneru): Recovery [SaS] 14
Consistent set of checkpoints
• Checkpointing in distributed systems requires that all processes (sites) that
interact with one another establish periodic checkpoints
• All the sites save their local states: local checkpoints
• All the local checkpoints, one from each site, collectively form a global
checkpoint
• The domino effect is caused by orphan messages, which in turn are caused
by rollbacks
1. Strongly consistent set of checkpoints
– Establish a set of local checkpoints (one for each process in the set)
such that no information flow takes place (i.e., no orphan messages)
during the interval spanned by the checkpoints
2. Consistent set of checkpoints
– Similar to the consistent global state
– Each message that is received in a checkpoint (state) should also be
recorded as sent in another checkpoint (state)
Consistency of Checkpoint
• Strongly consistent set of checkpoints
no messages penetrating the set
• Consistent set of checkpoints
no messages penetrating the set backward
[
[
[
x1
y1
z1
[
[
[
y2
x2
z2
Strongly consistent consistent
need to deal with
lost messages
Checkpoint/Recovery Algorithm
• Synchronous
– with global synchronization at checkpointing
• Asynchronous
– without global synchronization at checkpointing
Preliminary (Assumption)
Goal
To make a consistent global checkpoint
Assumptions
– Communication channels are FIFO
– No partition of the network
– End-to-end protocols cope with message loss due to
rollback recovery and communication failure
– No failure during the execution of the algorithm
~Synchronous Checkpoint~
Preliminary (Two types of checkpoint)
tentative checkpoint :
– a temporary checkpoint
– a candidate for permanent checkpoint
permanent checkpoint :
– a local checkpoint at a process
– a part of a consistent global checkpoint
~Synchronous Checkpoint~
Checkpoint Algorithm
Algorithm
1. an initiating process (a single process that invokes this algorithm) takes a
tentative checkpoint
2. it requests all the processes to take tentative checkpoints
3. it waits for receiving from all the processes whether taking a tentative
checkpoint has been succeeded
4. if it learns all the processes has succeeded, it decides all tentative
checkpoints should be made permanent; otherwise, should be discarded.
5. it informs all the processes of the decision
6. The processes that receive the decision act accordingly
Supplement
Once a process has taken a tentative checkpoint, it shouldn’t send messages
until it is informed of initiator’s decision.
~Synchronous Checkpoint~
Diagram of Checkpoint Algorithm
[
[
[
|
|
Tentative
checkpoint
|
request to
take a
tentative
checkpoint
OK
decide to commit
[
permanent checkpoint
[
[
consistent global checkpoint
consistent global checkpoint Unnecessary checkpoint
Initiator
~Synchronous Checkpoint~
Optimized Algorithm
Each message is labeled by order of sending
Labeling Scheme
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] :
the last message that X received from Y after X has taken its last permanent or
tentative checkpoint. if not exists, ⊥is in it.
first_label_sentX[Y] :
the first message that X sent to Y after X took its last permanent or tentative
checkpoint . if not exists, ⊥is in it.
ckpt_cohortX :
the set of all processes that may have to take checkpoints when X decides to
take a checkpoint.
~Synchronous Checkpoint~
[
[
X
Y
x2
x3
y1 y2
y2
x2
Checkpoint request need to be sent to only the processes
included in ckpt_cohort
Optimized Algorithm
ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥ }
Y takes a tentative checkpoint only if
last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
~Synchronous Checkpoint~
X
Y
[
[
last_label_rcvdX[Y]
first_label_sentY[X]
Optimized Algorithm
Algorithm
1. an initiating process takes a tentative checkpoint
2. it requests p ∈ ckpt_cohort to take tentative checkpoints ( this
message includes last_label_rcvd[reciever] of sender )
3. if the processes that receive the request need to take a checkpoint,
they do the same as 1.2.; otherwise, return OK messages.
4. they wait for receiving OK from all of p ∈ ckpt_cohort
5. if the initiator learns all the processes have succeeded, it decides all
tentative checkpoints should be made permanent; otherwise, should
be discarded.
6. it informs p ∈ ckpt_cohort of the decision
7. The processes that receive the decision act accordingly
~Synchronous Checkpoint~
Diagram of Optimized Algorithm
[
[
[
[
A
C
B
D
ab1 ac1
bd1
dc1 dc2
cb1
ba1 ba2
ac2cb2
cd1
|
Tentative
checkpoint
ca2
last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
2 >= 1 > 0|
2 >= 2 > 0|
2 >= 0 > 0
OK
decide to commit
[
Permanent
checkpoint
[
[
ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥ }
~Synchronous Checkpoint~
Correctness
• A set of permanent checkpoints taken by this
algorithm is consistent
– No process sends messages after taking a
tentative checkpoint until the receipt of the
decision
– New checkpoints include no message from the
processes that don’t take a checkpoint
– The set of tentative checkpoints is fully either
made to permanent checkpoints or discarded.
~Synchronous Checkpoint~
Recovery Algorithm
Labeling Scheme
⊥ : smallest label
т : largest label
last_label_rcvdX[Y] :
the last message that X received from Y after X has taken its last
permanent or tentative checkpoint. If not exists, ⊥is in it.
first_label_sentX[Y] :
the first message that X sent to Y after X took its last permanent or
tentative checkpoint . If not exists, ⊥is in it.
roll_cohortX :
the set of all processes that may have to roll back to the latest
checkpoint when process X rolls back.
last_label_sentX[Y] :
the last message that X sent to Y before X takes its latest permanent
checkpoint. If not exist, т is in it.
~Synchronous Recovery~
Recovery Algorithm
roll_cohortX = { Y | X can send messages to Y }
Y will restart from the permanent checkpoint only if
last_label_rcvdY[X] > last_label_sentX[Y]
~Synchronous Recovery~
Recovery Algorithm
Algorithm
1. an initiator requests p ∈ roll_cohort to prepare to rollback ( this
message includes last_label_sent[reciever] of sender )
2. if the processes that receive the request need to rollback, they
do the same as 1.; otherwise, return OK message.
3. they wait for receiving OK from all of p ∈ ckpt_cohort.
4. if the initiator learns p ∈ roll_cohort have succeeded, it decides
to rollback; otherwise, not to rollback.
5. it informs p ∈ roll_cohort of the decision
6. the processes that receive the decision act accordingly
~Synchronous Recovery~
Diagram of Synchronous Recovery
[
[
[
[
A
C
B
D
ab1 ac1
bd1
dc1 dc2
cb1
ba1 ba2
ac2cb2
dc1
request to
roll back
0 > 1
last_label_rcvdY[X] > last_label_sentX[Y]
2 > 1
0 >т
OK
[
[
2 > 1
0 >т
[
decide to
roll back
roll_cohortX = { Y | X can send messages to Y }
Drawbacks of Synchronous Approach
• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if
failure rarely occurs
Asynchronous Checkpoint
Characteristic
– Each process takes checkpoints independently
– No guarantee that a set of local checkpoints is
consistent
– A recovery algorithm has to search consistent set
of checkpoints
– No additional message
– No synchronization delay
– Lighter load during normal excution
Preliminary (Assumptions)
Goal
To find the latest consistent set of checkpoints
Assumptions
– Communication channels are FIFO
– Communication channels are reliable
– The underlying computation is event-driven
~Asynchronous Checkpoint / Recovery~
Preliminary (Two types of log)
• save an event on the memory at receipt of messages
(volatile log)
• volatile log periodically flushed to the disk (stable
log) ⇔ checkpoint
volatile log :
quick access
lost if the corresponding processor fails
stable log :
slow access
not lost even if processors fail
~Asynchronous Checkpoint / Recovery~
Preliminary (Definition)
Definition
CkPti : the checkpoint (stable log) that i rolled back to when failure
occurs
RCVDi←j (CkPti / e ) :
the number of messages received by processor i from processor j, per
the information stored in the checkpoint CkPti or event e.
SENTi→j(CkPti / e ) :
the number of messages sent by processor i to processor j, per the
information stored in the checkpoint CkPti or event e
~Asynchronous Checkpoint / Recovery~
Recovery Algorithm
Algorithm
1. When one process crashes, it recovers to the latest checkpoint
CkPt.
2. It broadcasts the message that it had failed. Others receive this
message, and rollback to the latest event.
3. Each process sends SENT(CkPt) to neighboring processes
4. Each process waits for SENT(CkPt) messages from every
neighbor
5. On receiving SENTj→i(CkPtj) from j, if i notices RCVDi←j (CkPti) >
SENTj→i(CkPtj), it rolls back to the event e such that RCVDi←j (e)
= SENTj→i(e),
6. repeat 3,4,and 5 N times (N is the number of processes)
~Asynchronous Checkpoint / Recovery~
Asynchronous Recovery
X
Y
Z
Ex0 Ex1 Ex2 Ex3
Ey0 Ey1 Ey2 Ey3
Ez0 Ez1 Ez2
[
[
[
x1
y1
z1
(Y,2)
(Y,1)
(X,2)
(X,0)
(Z,0)
(Z,1)
3 <= 2
RCVDi←j (CkPti) <= SENTj→i(CkPtj)
2 <= 2
X:Y X:Z
0 <= 0
1 <= 2
Y:X
1 <= 1
Y:Z
0 <= 0
Z:X
2 <= 1
Z:Y
1 <= 1
37
System reliability: Fault-Intolerance vs. Fault-Tolerance
 The fault intolerance (or fault-avoidance) approach improves
system reliability by removing the source of failures (i.e.,
hardware and software faults) before normal operation begins
 The approach of fault-tolerance expect faults to be present
during system operation, but employs design techniques which
insure the continued correct execution of the computing
process
38
Approaches to fault-tolerance
 Approaches:
 (a) Mask failures
 (b) Well defined failure behavior
 Mask failures:
 System continues to provide its specified function(s) in the presence of failures
 Example: voting protocols
 (b) Well defined failure behaviour:
 System exhibits a well define behaviour in the presence of failures
 It may or it may not perform its specified function(s), but facilitates actions
suitable for fault recovery
 Example: commit protocols
 A transaction made to a database is made visible only if successful and it commits
 If it fails, transaction is undone
 Redundancy:
 Method for achieving fault tolerance (multiple copies of hardware, processes,
data, etc...)
39
Issues
 Process Deaths:
 All resources allocated to a process must be recovered when a process
dies
 Kernel and remaining processes can notify other cooperating processes
 Client-server systems: client (server) process needs to be informed that
the corresponding server (client) process died
 Machine failure:
 All processes running on that machine will die
 Client-server systems: difficult to distinguish between a process and
machine failure
 Issue: detection by processes of other machines
 Network Failure:
 Network may be partitioned into subnets
 Machines from different subnets cannot communicate
 Difficult for a process to distinguish between a machine and a
communication link failure
40
Atomic actions
 System activity: sequence of primitive or atomic actions
 Atomic Action:
 Machine Level: uninterruptible instruction
 Process Level: Group of instructions that accomplish a task
 Example: Two processes, P1 and P2, share a memory location ‘x’ and both
modify ‘x’
Process P1 Process P2
… …
Lock(x); Lock(x);
x := x + z; x := x + y; Atomic action
Unlock(x); Unlock(x);
… …
successful exit
 System level: group of cooperating process performing a task (global
atomicity)
41
Committing
 Transaction: Sequence of actions treated as an atomic action to preserve
consistency (e.g. access to a database)
 Commit a transaction: Unconditional guarantee that the transaction will
complete successfully (even in the presence of failures)
 Abort a transaction: Unconditional guarantee to back out of a transaction,
i.e., that all the effects of the transaction have been removed (transaction
was backed out)
 Events that may cause aborting a transaction: deadlocks, timeouts, protection
violation
 Mechanisms that facilitate backing out of an aborting transaction
Write-ahead-log protocol
Shadow pages
 Commit protocols:
 Enforce global atomicity (involving several cooperating distributed processes)
 Ensure that all the sites either commit or abort transaction unanimously, even in
the presence of multiple and repetitive failures
42
The two-phase commit protocol
 Assumption:
 One process is coordinator, the others are “cohorts” (different sites)
 Stable store available at each site
 Write-ahead log protocol
Coordinator
Initialization
Send start transaction message to all cohorts
Phase 1
Send commit-request message, requesting all
cohort to commit
Wait for reply from cohorts
Phase 2
If all cohorts sent agreed and coordinator
agrees
then write commit record into log
and send commit message to cohorts
else send abort message to cohorts
Wait for acknowledgment from cohorts
If acknowledgment from a cohort not received
within specified period
resent commit/abort to that cohort
If all acknowledgments received,
write complete record to log
Cohorts
If transaction at cohort is successful
then write undo and redo log on stable
storage and return agreed
message
else return abort message
If commit received,
release all resources and locks held for
transaction and
send acknowledgment
if abort received,
undo the transaction using undo log record,
release resources and locks and
send acknowledgment
NonBlocking Commit Protocols
Our Blocking Theorem from last week states that if
network partitioning is possible, then any distributed
commit protocol may block.
Let’s assume now that the network can not partition.
Then we can consult other processes to make
progress.
However, if all processes fail, then we are, again,
blocked.
Let’s further assume that total failure is not possible
ie. not all processes are crashed at the same time.
Automata representation
We model the participants with finite state automata
(FSA).
The participants move from one state to another as a
result of receiving one or several messages or as a
result of a timeout event.
Having received these messages, a participant may
send some messages before executing the state
transition.
Commit Protocol Automata
 Final states are divided into Abort states and Commit states
(finally, either Abort or Commit takes place).
 Once an Abort state is reached, it is not possible to do a
transition to a non-Abort state. (Abort is irreversible). Similarly
for Commit states (Commit is also irreversible).
 The state diagram is acyclic.
 We denote the initial state by q, the terminal states are a (an
abort/rollback state) and c (a commit state). Often there is a
wait-state, which we denote by w.
 Assume the participants are P1,…,Pn. Possible coordinator is
P0, when the protocol starts.
2PC Coordinator
q
w
a c
A commit-request from application
VoteReq to P1,…,Pn
Timeout or No from one of P1,.., Pn
Abort to P1,…Pn
Yes from all P1,..,Pn
Commit to P1,…,Pn
2PC Participant
q
w a
c
VoteReq from P0
No to P0
Commit from P0
-
Abort from P0
-
VoteReq from P0
Yes to P0
Commit Protocol State Transitions
 In a commit protocol, the idea is to inform other participants
on local progress.
 In fact, a state transition without message change is
uninteresting, unless the participant moves into a terminal
state.
 Therefore, unless a participant moves into a terminal state,
we may assume that it sends messages to other participants
about its change of state.
 To simplify our analysis, we may assume that the messages
are sent to all other participants. This is not necessary, but
creates unnecessary complication.
Concurrency set
 A concurrency set of a state s is the set of
possible states among all participants, if
some participant is in state s.
 In other words, the concurrency set of state s
is the set of all states that can co-exist with
state s.
2PC Concurrency Sets
q
w
a c
Commit-req.
VoteReq to All
Timeout or a No
Abort to all
Yes from all
Commit to all
q
w a
c
VoteReq from P0
No to P0
Abort from P0
-
VoteReq from P0
Yes to P0
Concurrency_set(q) = {q,w,a}, Concurrency_set(a) = {q,w,a}
Concurrency_set(w) = {q,w,a,c}, Concurrency_set(c) = (w,c)
Committable states
We say that a state is committable, if the
existence of a participant in this state means
that everyone has voted Yes.
If a state is not committable, we say that it is
non-committable.
In 2PC, w and c are committable states.
How can a site terminate when there
is a timeout?
Either (1) one of the operational sites knows the fate
of the transaction, or (2) the operational sites can
decide the fate of the transaction.
Knowing the fate of the transaction means, in
practice, that there is a participant in a terminal
state.
Start by considering a single participant s. The site
must infer the possible states of other participants
from its own state. This can be done using
concurrency sets.
When can’t a single participant
unilaterally abort?
Suppose a participant is in a state, which has a
commit state in its concurrency set. Then, it is
possible that some other participant is in a
commit state.
A participant in a state, which has a commit
state in its concurrency set, should not
unilaterally abort.
When can’t a single participant
unilaterally commit?
Suppose a participant is in a state, which has an
abort state in its concurrency set. Then, some
participant may be in an abort state.
A participant in a state, which has an abort state in
its concurrency set, should not unilaterally commit.
Also, a participant that is not in a committable state
should not commit.
The Fundamental Non-Blocking
Theorem
A protocol is non-blocking, if and only if it
satisfies the following conditions:
(1) There exists no local state such that its
concurrency set contains both an abort and a
commit state, and
(2) there exists no noncommittable state,
whose concurrency set contains a commit
state.
Showing the Fundamental Non-
Blocking Theorem
From our discussion above it follows that
Conditions (1) and (2) are necessary.
We discuss their sufficiency later by showing
how to terminate a commit protocol fulfilling
conditions (1) and (2).
Observations on 2PC
As the participants exchange messages as they
progress, they progress in a synchronised fashion.
In fact, there is always at most one step difference
between the states of any two live participants.
We say that the participants keep a one-step
synchronisation.
It is easy to see by Fundamental Nonblocking
Theorem that 2PC is blocking.
One-step synchronisation and non-
blocking property
If a commit protocol keeps one-step
synchronisation, then the concurrency set of
state s consists of s and the states adjacent to
s.
By applying this observation and the
Fundamental Non-blocking Theorem, we get a
useful Lemma:
Lemma
A protocol that is synchronous within one
state transition is non-blocking, if and only if
(1) it contains no state adjacet to both a
Commit and an Abort state, and
(2) it contains non non-committable state that
is adjacet to a commit state.
How to improve 2PC to get a non-
blocking protocol
It is easy to see that the state w is the problematic
state – and in two ways:
- it has both Abort and Commit in its concurrency
set, and
- it is a non-committable state, but it has Commit
in its concurrency set.
Solution: add an extra state between w and c
(adding between w and a would not do – why?)
We are primarily interested in the centralised
protocol, but similar decentralised improvement
is possible.
3PC Coordinator
q
w
a
c
A commit-request from application
VoteReq to P1,…,Pn
Timeout or No from one of P1,.., Pn
Abort to P1,…Pn
Yes from all P1,..,Pn
Prepare to P1,…,Pn
p
Ack from all P1,..,Pn
Commit to P1,…,Pn
3PC Participant
q
w a
c
VoteReq from P0
No to P0
Prepare from P0
Ack to P0
Abort from P0
-
VoteReq from P0
Yes to P0
p
Commit from P0
-
3PC Concurrency sets (cs)
q
w
a
c
A commit-request
VoteReq to all
Timeout or one No
Abort to all
Yes from all
Prepare to all
p
Ack from all
Commit to all
q
w a
c
VoteReq from P0
No to P0
Abort from P0
-
VoteReq from P0
Yes to P0
p
Commit from P0
-
Prepare from P0
Ack to P0
cs(p) = {w,p,c},
cs(w) = {q,a,w,p},
etc.
3PC and failures
If there are no failures, then clearly 3PC is correct.
In the presence of failures, the operational
participants should be able to terminate their
execution.
In the centralised case, a need for termination
protocol implies that the coordinator is no longer
operational.
We discuss a general termination protocol. It makes
the assumption that at least one participant remains
operational and that the participants obey the
Fundamental Non-Blocking Theorem.
Termination
Basic idea: Choose a backup coordinator B – vote or
use some preassigned ids.
Backup Coordinator Decision Rule:
If the B’s state contains commit in its concurrency
set, commit the transaction. Else abort the
transaction.
Reasoning behind the rule: If B’s state contains
commit in the concurrency set, then it is possible
that some site has performed commit – otherwise
not.
Re-executing termination
It is, of course, possible the backup
coordinator fails.
For this reason, the termination protocol
should be executed in such a way that it can
be re-executed.
In particular, the termination protocol must
not break the one-step synchronisation.
Implementing termination
To keep one-step synchronisation, the termination
protocol should be executed in two steps:
1. The backup coordinator B tells the others to make
a transition to B’s state. Others answer Ok. (This is
not necessary if B is in Commit or Abort state.)
2. B tells the others to commit or abort by the
decision rule.
Fundamental Non-Blocking Theorem
Proof - Sufficiency
The basic termination procedure and decision
rule is valid for any protocol that fulfills the
conditions given in the Fundamental Non-
Blocking Theorem.
The existence of a termination protocol
completes the proof.
69
Voting protocols
 Principles:
 Data replicated at several sites to increase reliability
 Each replica assigned a number of votes
 To access a replica, a process must collect a majority of votes
 Vote mechanism:
 (1) Static voting:
Each replica has number of votes (in stable storage)
A process can access a replica for a read or write operation if it can
collect a certain number of votes (read or write quorum)
 (2) Dynamic voting
Number of votes or the set of sites that form a quorum change with
the state of system (due to site and communication failures)
(2.1) Majority based approach:
Set of sites that can form a majority to allow access to replicated data of
changes with the changing state of the system
(2.2) Dynamic vote reassignment:
Number of votes assigned to a site changes dynamically
70
Failure resilient processes
 Resilient process: continues execution in the presence of failures
with minimum disruption to the service provided (masks failures)
 Approaches for implementing resilient processes:
 Backup processes and
 Replicated execution
 (1) Backup processes
 Each process made of a primary process and one or more backup
processes
 Primary process execute, while the backup processes are inactive
 If primary process fails, a backup process takes over
 Primary process establishes checkpoints, such that backup process can
restart
 (2) Replicated execution
 Several processes execute same program concurrently
 Majority consensus (voting) of their results
 Increases both the reliability and availability of the process
71
Recovery (fault tolerant) block concept
 Provide fault-tolerance within an individual sequential process in
which assignments to stored variables are the only means of
making recognizable progress
 The recovery block is made of:
 A primary block (the conventional program),
 Zero or more alternates (providing the same function as the primary block,
but using different algorithm), and
 An acceptance test (performed on exit from a primary or alternate block to
validate its actions).
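The recovery-block control flow can be sketched as below. The driver function, the sample blocks, and the acceptance test are all invented for illustration; a real recovery block would also restore the process state (e.g. from a recovery point) before trying an alternate.

```python
# Recovery block sketch: run the primary block, validate its result with
# the acceptance test on exit, and fall back to alternates on failure.

def run_recovery_block(primary, alternates, acceptance_test, x):
    for block in [primary] + alternates:
        try:
            result = block(x)
        except Exception:
            continue              # the block crashed: try the next one
        if acceptance_test(result):
            return result         # acceptance test passed on exit
        # test failed: discard the result and try the next alternate
    raise RuntimeError("primary and all alternates failed")

square_buggy = lambda x: x * x - 1   # primary block: faulty algorithm
square_ok = lambda x: x * x          # alternate: different algorithm
accept = lambda r: r >= 0 and int(r ** 0.5) ** 2 == r  # perfect square?

assert run_recovery_block(square_buggy, [square_ok], accept, 3) == 9
```

The acceptance test here is deliberately weaker than full correctness checking, which matches the intent of the scheme: it only needs to be strong enough to reject implausible results.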
Recovery (fault tolerant) Block concept
[Figure] Recovery block A: an acceptance test AT, a primary block AP
(<Program text>), and an alternate block AQ (<Program text>). Control
enters the primary block; its result is checked by the acceptance test,
and on failure an alternate block is executed and tested in turn.
N-version programming
[Figure] Modules ‘0’, ‘1’, …, ‘n-1’ execute the same task independently
and deliver their results to a voter, which selects the majority output.
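A minimal sketch of N-version programming follows; the version functions are invented, and the voter shown requires a strict majority among the n results.

```python
# N-version programming sketch: n independently written modules compute
# the same result, and a voter masks a faulty version by majority.

from collections import Counter

def voter(results):
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value              # strict majority masks the faulty version
    raise RuntimeError("no majority among the n versions")

version_0 = lambda x: x + x       # correct
version_1 = lambda x: 2 * x       # correct, independently written
version_2 = lambda x: 2 * x + 1   # faulty version

assert voter([v(5) for v in (version_0, version_1, version_2)]) == 10
```

With n = 3 versions this masks any single faulty version, which is why replicated execution increases both reliability and availability, at the cost of n-fold execution.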
Thank U

  • 74.