Coordination
and
Agreement
1
Topics
• How do processes coordinate their actions
and agree on shared values?
• Mutual Exclusion Agreements
• Distributed Elections
• Multicast Communication
• Consensus
• Byzantine Agreement
• Interactive Consistency
2
The First Space Shuttle Flight -
1981
• The US Space Shuttle program used redundant systems
to manage the probability of failures in space with no
repairmen, spare parts, or down-time available. The
computer flight control system had four identical
computers, so that if one failed, there were still enough
to determine a correct action by voting if another failed.
• In addition, there was a backup system, developed by a
different contractor on different hardware with a
different operating system to take over if the first system
failed completely.
3
The Coordination Challenge
• Developing these five computer systems had many
challenges, including detecting faults, isolating problems,
and switching from one configuration to another. The
first space shuttle mission was delayed due to a failure in
the coordination and agreement protocols between the
redundant main flight system and its backup.
4
A Famous Software Failure
• On April 10, 1981, about 20 minutes prior to the
scheduled launching of the first flight of America's Space
Transportation System, astronauts and technicians
attempted to initialize the software system which "backs-
up" the quad-redundant primary software system ...and
could not. In fact, there was no possible way, it turns out,
that the Backup Flight Control System in the fifth onboard
computer could have been initialized properly with the
Primary Avionics Software System already executing in the
other four computers.
5
Detecting Failures
• Detecting, locating, and isolating a failure in a distributed
computer system is a challenge. The design of any
distributed algorithm should allow for fault detection and
consider failure mitigation procedures.
• Figure 12.1 gives an example of a failure in a network—a
crashed router that divides a network into two non-
communicating partitions. One common protocol for
detecting this type of network failure is a timeout
mechanism.
6
Figure 12.1 A network partition
7
(diagram: a crashed router partitions the network into two non-communicating subnetworks)
8
Detecting Failures
• Assumptions
• Each pair of processes is connected by a reliable
channel
• Network components may fail, but failures are masked
by reliable communication protocols
• Processes fail only by crashing, unless stated
otherwise
Failure Detector
• A Failure Detector is a service that processes queries
about whether a particular process has crashed. It is often
implemented by a local object known as a Local Failure
Detector.
• Failure detectors are not necessarily accurate. For
example, a process that timed out after 255 seconds
might have succeeded if allowed to proceed for 256
seconds. Most failure detectors fall into the category of
Unreliable Failure Detectors.
• Although a failure detector acts for a collection of
processes, it may sometimes give different responses to
different processes.
9
Reliable Failure Detector
• A Reliable Failure Detector is a service that is always
accurate in detecting a process failure. A failure detector
is only as good as the information that is provided to it by
a process or about a process. Some categories of faults
lend themselves to easy detection, while others do not.
• As a human example, consider the two questions: “Should
I study tonight?” and “Is that light turned on?” It is usually
easier to get a definitive answer for the second question
than the first.
10
Algorithm for implementing unreliable
failure detector
• Every T seconds, each process p sends a “p is here” message to
every other process q.
• If q does not receive a “p is here” message within T+D seconds (D
is an estimate of the transmission delay) of the last one, then its
local failure detector reports that p is Suspected.
• However, if it subsequently receives a “p is here” message, then
the detector reports that p is OK.
If we choose small values of T and D, then the failure detector is likely to suspect
non-crashed processes many times. If T and D are large, then crashed processes
will often be reported as Unsuspected.
A reliable failure detector requires that the system be synchronous.
11
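The timeout scheme above can be sketched as a local failure-detector object. This is a minimal sketch, not a library API; the class and method names are illustrative, and the clock is injectable so the behavior can be checked deterministically:

```python
import time

class UnreliableFailureDetector:
    """Suspects a process if no "p is here" message arrived within T + D seconds."""
    def __init__(self, T, D, clock=time.monotonic):
        self.timeout = T + D      # T: heartbeat period, D: estimated transmission delay
        self.clock = clock
        self.last_heard = {}      # process id -> time of the last "p is here" message

    def heard_from(self, p):
        """Record receipt of a "p is here" message from p."""
        self.last_heard[p] = self.clock()

    def query(self, p):
        """Answer "Suspected" or "Unsuspected" for process p."""
        last = self.last_heard.get(p)
        if last is None or self.clock() - last > self.timeout:
            return "Suspected"
        return "Unsuspected"      # a later heartbeat may revise an earlier suspicion
```

Note that a "Suspected" verdict is not final: a late heartbeat flips the answer back to "Unsuspected", which is exactly what makes this detector unreliable.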
Distributed Mutual Exclusion
• There is a need for distributed processes to
coordinate shared activities. For example, it is not
usually acceptable for two applications to update the
same record in a database file at the same time.
• One possible approach to preventing this is the Unix
daemon lockd, which places a lock on a file while it is
being written by a process, so that no other process
can write to that file until the lock is released.
12
Resource Managers
• In the case of lockd, the operating system functions
as a server or resource manager to provide the
service. Similar functions are routine on networks
such as Ethernets. A resource manager can keep
track of locks, simplifying the process.
• It is desirable to have a generic mechanism for
distributed mutual exclusion that does not rely on a
resource manager, i.e., peer processes must
coordinate their actions themselves.
13
Mutual Exclusion Algorithms
• Mutual Exclusion Algorithms define critical sections
and allow only one process to access a resource in a
critical region at one time. There are three basic
operations for the algorithms:
– enter( ) to access the critical region
– resourceAccesses( ) to use the resources
– exit( ) to leave the critical section
14
Mutual Exclusion Requirements
• Mutual Exclusion Algorithms have two basic
requirements:
– (1) Safety—at most one process may execute in
the critical section at one time
– (2) Liveness—all requests to enter and exit the
critical section must eventually succeed. This
implies freedom from deadlock and starvation.
15
Fairness Conditions
• Starvation is the indefinite postponement of entry
for a process that has requested it. It is a fairness
condition. (Absence of starvation is fairness)
• Ordering is another fairness condition.
– (3) If one request to enter a critical section happened-
before another request, then entry is granted in that
order. (Ordering by actual physical time is not possible.)
Figure 12.2 shows a server managing happened-before
ordering.
16
Figure 12.2 Managing a mutual
exclusion token for processes
17
(diagram: processes p1–p4 around a server that keeps a queue of requests;
a client sends 1. request token, the server sends 3. grant token to the head
of the queue, and the holder sends 2. release token when it leaves)
Performance Criteria
• Mutual exclusion algorithms are evaluated by the
following criteria:
– Bandwidth consumed—proportional to the number of
messages sent for each entry and exit.
– Client delay incurred by a process at each entry and
exit.
– Throughput—the rate at which processes can use the
critical section. Synchronization delay is the time
between one process exiting the critical section and
the next entering it. Shorter delays imply greater
throughput.
18
Central Server Algorithm
• The simplest way to achieve mutual exclusion is to
establish a server that grants permission to enter the
critical section. A process requests entry and waits for a
reply. Conceptually, the reply is a token that grants
permission. If no other process has the token, it can be
granted. Otherwise the process must wait until the token
is available. Other processes may have made prior
requests for the token, in which case the new request
joins the queue and is granted only after the earlier
requests have been satisfied.
19
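The central server's token bookkeeping can be sketched in a few lines. This is an illustrative sketch (the names are not from any library); messaging is abstracted into method calls:

```python
from collections import deque

class MutexServer:
    """Central server granting a single mutual-exclusion token."""
    def __init__(self):
        self.holder = None
        self.queue = deque()      # waiting requests, oldest first

    def request(self, pid):
        """Process pid asks to enter; returns True if the token is granted now."""
        if self.holder is None:
            self.holder = pid     # token free: grant immediately
            return True
        self.queue.append(pid)    # otherwise wait in FIFO order
        return False

    def release(self):
        """Holder leaves the critical section; grant the token to the oldest waiter."""
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```

FIFO queueing gives liveness (every queued request is eventually granted) but, like the original protocol, orders by arrival at the server rather than by happened-before.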
Ring-based Algorithm
• A simple way to arrange mutual exclusion is to arrange requests in
a logical ring.
• Each process has a link to the next process.
• A token is passed around the ring.
• If the process receiving the token needs access to the critical
section, it enters the section, otherwise it passes the token to the
next process.
• Figure 12.3 shows a ring-based algorithm graphically.
• Safety and liveness are evidently satisfied by this algorithm,
but the ordering condition is not: the token does not
necessarily arrive in happened-before order of the requests.
• Network bandwidth is consumed continuously, even when there
are no requests.
20
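The token-passing loop can be simulated deterministically; the sketch below (function and parameter names are illustrative) also exposes the bandwidth drawback, counting hops even when no process wants the critical section:

```python
def ring_mutex(n, wants, rounds):
    """Simulate token passing around a ring of n processes.
    wants: set of process indices that each need the critical section once."""
    token, entries, hops = 0, [], 0
    pending = set(wants)
    for _ in range(rounds):
        if token in pending:          # holder enters the CS, then releases
            entries.append(token)
            pending.discard(token)
        token = (token + 1) % n       # pass the token to the next process
        hops += 1                     # bandwidth is consumed even when idle
    return entries, hops
```

Running it with two requesters shows the token keeps circulating (and consuming messages) long after all requests have been served.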
Figure 12.3 A ring of processes
transferring a mutual exclusion token
21
(diagram: processes p1, p2, p3, p4, ..., pn arranged in a ring,
with the token passing from each process to the next)
Using Multicast and Logical Clocks
• Ricart and Agrawala developed an algorithm to
implement mutual exclusion between N peer processes
based on multicast.
• A process that wants to enter a critical section multicasts
a request message, and can enter only when all other
processes have replied.
• Figure 12.4 shows the algorithm protocol, and figure
12.5 illustrates how multicast messages can be
synchronized. The numbers 34 and 41 indicate logical
timestamps. Since 34 is earlier, it gets first access.
22
Figure 12.4 Ricart and Agrawala’s
algorithm
23
On initialization
state := RELEASED;
To enter the section
state := WANTED;
Multicast request to all processes; request processing deferred here
T := request’s timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
then
queue request from pi without replying;
else
reply immediately to pi;
end if
To exit the critical section
state := RELEASED;
reply to any queued requests;
Figure 12.5
Multicast synchronization
24
(diagram: p1 and p2 multicast requests concurrently with timestamps 41 and 34;
p3, which does not want entry, replies to both immediately; since 34 < 41, the
request timestamped 34 is granted access first)
• Safety
– If pi and pj could both enter the CS, then each must
have replied to the other. Since the pairs <Ti, pi> are
totally ordered, this is impossible.
• Liveness?
• Ordering?
• Gaining access to the CS takes 2(N-1) messages.
• With hardware support for multicast, N messages
suffice.
• The client delay is one round-trip time.
25
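The protocol of figure 12.4 can be exercised in a small simulation. This is a sketch under simplifying assumptions: message delivery is modelled by direct method calls, Lamport clocks are supplied explicitly as timestamps, and the class names are illustrative:

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

class RAProcess:
    """One Ricart-Agrawala participant (simulation sketch)."""
    def __init__(self, pid):
        self.pid, self.state = pid, RELEASED
        self.T = None                 # timestamp of this process's pending request
        self.deferred = []            # requests queued without replying
        self.replies = self.needed = 0

    def request_cs(self, timestamp, peers):
        self.state, self.T = WANTED, timestamp
        self.replies, self.needed = 0, len(peers)
        for q in peers:               # multicast the request to all other processes
            q.on_request(self.T, self.pid, self)

    def on_request(self, Ti, i, sender):
        # defer iff HELD, or WANTED with the smaller (timestamp, pid) pair
        if self.state == HELD or (self.state == WANTED and (self.T, self.pid) < (Ti, i)):
            self.deferred.append(sender)
        else:
            sender.on_reply()

    def on_reply(self):
        self.replies += 1
        if self.state == WANTED and self.replies == self.needed:
            self.state = HELD         # all N-1 replies received: enter CS

    def release_cs(self, peers):
        self.state = RELEASED
        for q in self.deferred:       # reply to all queued requests
            q.on_reply()
        self.deferred = []
```

With timestamps 34 and 41 as in figure 12.5, the earlier request wins and the later one is deferred until release, illustrating the safety argument above.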
Efficiency
• The multicast algorithm improves on the ring
algorithm by avoiding token-passing messages to
inactive processes. It also requires only a single
message transmission time for entry instead of a round
trip around the ring. But it still has a number of
inefficiencies.
• Maekawa developed a voting algorithm (figure 12.6)
that allows a subset of the processes to grant access,
reducing entry time. Unfortunately, this algorithm is
deadlock prone. Sanders has adapted it to avoid
deadlocks.
26
Figure 12.6 Maekawa’s algorithm
– part 1
27
On initialization
state := RELEASED;
voted := FALSE;
For pi to enter the critical section
state := WANTED;
Multicast request to all processes in Vi;
Wait until (number of replies received = K);
state := HELD;
On receipt of a request from pi at pj
if (state = HELD or voted = TRUE)
then
queue request from pi without replying;
else
send reply to pi;
voted := TRUE;
end if
For pi to exit the critical section
state := RELEASED;
Multicast release to all processes in Vi;
On receipt of a release from pi at pj
if (queue of requests is non-empty)
then
remove head of queue – from pk, say;
send reply to pk;
voted := TRUE;
else
voted := FALSE;
end if
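Maekawa's algorithm depends on choosing voting sets Vi that pairwise intersect while staying small (around sqrt(N) members). A common illustrative construction, sketched below under the assumption that N is a perfect square, arranges the processes in a grid and takes each process's row plus its column; this is one way to build such sets, not the only one:

```python
import math

def grid_voting_sets(n):
    """Build Maekawa-style voting sets for n = k*k processes in a k-by-k grid:
    V_i is the union of process i's row and column, so any two sets intersect."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes a perfect-square number of processes"
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}   # processes sharing i's row
        col = {j * k + c for j in range(k)}   # processes sharing i's column
        sets.append(row | col)
    return sets
```

Each set has 2*sqrt(N) - 1 members, and any two sets share at least one process, which is what makes it impossible for two requesters to collect full quorums simultaneously.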
Fault Tolerance
• What happens when messages are lost?
• What happens when a process crashes?
• None of the preceding algorithms tolerate the
loss of messages with unreliable channels. If
there is a reliable failure detector available, a
protocol would have to be developed that would
allow for failures at any point, including during a
recovery protocol.
28
Elections
• An algorithm to choose a process to play a role is
called an election algorithm. For example, a group of
peers may select one of themselves to act as server for
a mutual exclusion algorithm.
• A process that initiates a run of an election algorithm
calls the election. Multiple elections could be called at
the same time.
• A process that is engaged in an election algorithm at a
particular time is a participant.
• At other times or when it is not engaged it is a non-
participant.
29
Assumptions
• Processes arranged in logical ring
– Each process pi has a communication channel to the next
process in the ring, i.e., process p(i+1) mod N
• The system is asynchronous and there are no
failures
• Goal
– To elect a process with largest identifier
30
Election Requirements
• A criterion is established for deciding an election. The
text refers to the criterion as having the largest
identifier, where “largest” and “identifier” are defined
by the application.
• Safety—a participant process pi has electedi = ⊥ or
electedi = P, where P is the non-crashed process with
the largest identifier at the end of the run.
• Liveness—all processes pi participate and eventually
set electedi ≠ ⊥, or crash.
31
Performance of an Election Algorithm
• Network Bandwidth Utilization
– Proportional to the total number of messages sent
• Turnaround Time
– Number of serialized message transmission times
between the initiation and termination of a run
32
The Algorithm
• Initially every process is marked as a non-participant.
• Any process can begin the algorithm; it marks itself as a participant,
– places its identifier in an election message and sends it to its clockwise
neighbor.
• A receiving process compares the identifier in the message with its own.
– If the received identifier is greater, it simply forwards the election message to its
clockwise neighbor.
– If the received identifier is smaller and the receiver is not a participant, then it
substitutes its own identifier in the message and forwards it to its clockwise
neighbor.
– If the receiver is already a participant, it does not forward the message.
(On forwarding an election message in either case, a process marks itself as a participant.)
• If the received identifier is that of the receiver itself, then this process's identifier
must be the greatest, and it becomes the coordinator.
– It marks itself as a non-participant and sends an elected message to its neighbor.
• When a process pi receives an elected message, it marks itself as a non-participant,
– sets its variable electedi to the identifier in the message and, unless it is the new
coordinator, forwards the message to its neighbor. 33
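The single-initiator case of the steps above can be simulated as follows. This sketch assumes one election in progress (so participant marking never has to suppress a second message), which is sufficient to reproduce the message counts analysed on the next slide:

```python
def ring_election(ids, starter):
    """Single-initiator ring election (sketch): ids[i] is the identifier of the
    process at ring position i; returns (winner, total messages used)."""
    n, msgs = len(ids), 0
    pos = ids.index(starter)
    token = starter                   # the election message carries an identifier
    while True:
        pos = (pos + 1) % n           # forward to the clockwise neighbour
        msgs += 1
        if ids[pos] == token:         # own identifier returned: coordinator found
            break
        token = max(token, ids[pos])  # a larger identifier replaces the smaller
    return token, msgs + n            # plus n elected messages to announce
```

In the worst case the starter's anti-clockwise neighbour has the highest identifier, and the total is 3N - 1 messages, as the analysis below derives.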
Figure 12.7 A ring-based election in
progress
34
(diagram: a ring of processes with identifiers 3, 4, 9, 15, 24, 28, 17 and 1;
an election message carrying identifier 24 is in transit)
Note: The election was started by process 17.
The highest process identifier encountered so far is 24.
Participant processes are shown darkened
Analysis
• When a single process starts the election, the worst case is
when its anti-clockwise neighbor is the process with the
highest identifier.
• N-1 messages are needed to reach that neighbor; it will not
announce its election until its identifier has traveled around
the whole ring (N messages); finally N elected messages
announce the result.
• So 3N-1 messages in all.
• The turnaround time is also 3N-1 message transmission times.
• The algorithm does not tolerate any failure and hence is of
little practical use, but it is useful for understanding the
properties of election algorithms. 35
The Bully Algorithm
• Synchronous system, message delivery between processes is
reliable.
• Processes can crash during operation, and faults are detected by
timeouts.
• Unlike the ring algorithm, each process knows which
processes have higher identifiers, and it can
communicate with all such processes.
• Election messages announce an election
• Answer messages replies to an election message
• Coordinator messages announce the elected process
• An election message contains the identifier of the sender.
An answer message is sent in response by a process with a
higher identifier. If the timeout expires with no answer, the
sender concludes that it has the highest identifier and
becomes the coordinator.
36
The Bully Algorithm
• A process starts an election when it detects, through a timeout, that the
coordinator has failed (many processes may detect this concurrently). The local
failure detector uses T = 2Ttrans + Tprocess.
• The process that knows it has the highest identifier can elect itself as the
coordinator by sending a coordinator message to all processes with lower
identifiers.
• A process with a lower identifier begins an election by sending an election
message to the processes with higher identifiers and awaits answer messages in
response.
• If none arrives within time T, the process considers itself the coordinator and
sends a coordinator message to all processes with lower identifiers.
• Otherwise, the process waits for a further period T' for a coordinator message
to arrive from the new coordinator. If none arrives, it begins a new election.
• If pi receives a coordinator message, it sets its variable electedi to the identifier
of the coordinator contained within it and treats that process as the coordinator.
• If pi receives an election message, it sends back an answer message and begins
another election, unless it has begun one already.
37
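The message pattern above can be condensed into a sketch. It simplifies the real algorithm in one respect, noted in the comments: instead of every answering process running its own concurrent election, the simulation continues only with the highest answerer, which reaches the same coordinator with the same kinds of messages:

```python
def bully_election(alive_ids, initiator):
    """One bully-algorithm round (sketch): alive_ids are identifiers of processes
    that have not crashed; delivery is assumed reliable and synchronous."""
    msgs = []
    def run(p):
        higher = [q for q in alive_ids if q > p]
        for q in higher:
            msgs.append(("election", p, q))   # challenge all higher processes
        if not higher:                        # no answer can ever arrive:
            for q in alive_ids:
                if q < p:                     # p bullies its way to coordinator
                    msgs.append(("coordinator", p, q))
            return p
        for q in higher:
            msgs.append(("answer", q, p))     # each live higher process answers
        # simplification: continue with the highest answerer only; in the real
        # algorithm every answerer starts its own election concurrently
        return run(max(higher))
    return run(initiator), msgs
```

With p4 crashed and p1 initiating, p3 ends up the coordinator and announces itself to p1 and p2.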
Figure 12.8 The bully algorithm
38
(diagram in four stages, showing processes p1–p4:
Stage 1: p1 detects the coordinator's failure by timeout and sends election
messages; p2 and p3 send answer messages.
Stage 2: p2 and p3 send election messages to the processes above them.
Stage 3: timeout waiting for an answer from the crashed p4.
Stage 4: eventually p2 answers a final election and sends the coordinator message.
The figure shows the election of coordinator p2, after the failure of p4 and then p3.)
Why Bully?
When a crashed process is replaced (restarted), it begins
an election. If it has the highest identifier, it decides that
it is the coordinator and announces this to the other
processes, even though a coordinator may already exist.
That is why it is called the bully algorithm.
39
Multicast Communication
• Group or multicast communication requires
coordination and agreement.
• Agreement on the set of messages that every
process in the group should receive
• Ordering on the delivery of messages
40
Multicast Communication
• Group Communication is challenging even when all
members of the group are static and aware of each
other.
• Dynamic groups, with processes joining and leaving
the group, are even more challenging. Most of the
challenges are concerned with efficiency and delivery
guarantees.
• Efficiency concerns include minimizing overhead
activities and increasing throughput and bandwidth
utilization, using hardware support wherever
available.
41
Multicast Communication
• Delivery guarantees ensure that operations are
completed.
• With multiple one-to-one sends by a process to the
other processes, there is no way to provide delivery
guarantees: if the sender fails halfway through, some
processes receive the message while others do not.
 The relative ordering of any two messages is also
undefined.
 IP multicast offers no reliability or delivery guarantees,
but stronger multicast guarantees can be built on top
of it…
42
Closed and Open Groups
• For multicast communications, a group is said to
be closed if only members of the group can
multicast to it. A process in a closed group sends
to itself any messages to the group. (See figure
12.9)
• A group is open if processes outside the group
can send to it. Some algorithms assume closed
groups while others assume open groups.
43
Figure 12.9
Open and closed groups
44
Closed group Open group
Basic Multicast
• Unlike IP multicast, it guarantees that correct processes
will eventually deliver the message, as long as the
multicaster does not crash.
To B-multicast(g, m): for each process p ∈ g, send(p, m);
On receive(m): B-deliver(m) at p.
A naive implementation that waits for individual acknowledgements
suffers from ack-implosion: acks from many receivers overflow the
sender's buffers, some acks are dropped, the sender retransmits,
and this generates still more acks.
45
Reliable Multicast
• Simple multicasting is sending a message to every
process that is a member of a defined group. Reliable
multicasting requires these properties:
• Integrity—a correct process p delivers a message m at
most once.
• Validity—if a correct process multicasts a message m,
then it will eventually deliver m.
• Agreement—if a correct process delivers a message m,
then all other correct processes in group(m) will
eventually deliver m.
46
Reliable multicast algorithm
47
Agreement follows from the fact that every correct process B-multicasts the
message to the other processes after it has B-delivered it. If a correct
process does not R-deliver the message, then this can only be because it
never B-delivered it. That in turn can only be because no other correct
process B-delivered it either; therefore none will R-deliver it.
Reliable Multicast Over IP multicast
• Ingredients: IP multicast, piggybacked acknowledgements, negative
acknowledgements, closed groups.
• Each process p maintains a sequence number S_g^p for each group g to which it
belongs. Initially it is zero.
• Each process also records R_g^q, the sequence number of the latest message it has
delivered from process q that was sent to group g.
• For p to R-multicast a message to g, it piggybacks onto the message the value S_g^p
and acknowledgements of the form <q, R_g^q>. An acknowledgement conveys, for
some sender q, the sequence number of the latest message from q destined for g that
p has delivered since it last multicast a message.
• The multicaster p then IP-multicasts the message with its piggybacked values to g,
and increments S_g^p by one.
• A process r R-delivers a message destined for g bearing the sequence number S from
p if and only if S = R_g^p + 1, and it increments R_g^p by one immediately after
delivery.
• If an arriving message has S <= R_g^p, then r has delivered the message before, and
it discards it.
• If S > R_g^p + 1, or if R > R_g^q for an enclosed acknowledgement <q, R>, then there
are one or more messages that r has not yet received. It keeps any message with
S > R_g^p + 1 in a hold-back queue.
• r requests missing messages by sending negative acknowledgements, either to the
original sender or to a process from which it has received an acknowledgement
<q, R_g^q> with R_g^q no less than the required sequence number.
48
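The receiver side of this protocol, i.e. the sequence-number test and the hold-back queue, can be sketched as follows. This is a simplified sketch: piggybacked acknowledgements and the actual sending of negative acknowledgements are omitted, and the names are illustrative:

```python
class IPReliableReceiver:
    """Receiver side of reliable multicast over IP multicast (sketch):
    per-sender sequence numbers with a hold-back queue."""
    def __init__(self):
        self.R = {}          # R[q]: seq no of the latest message delivered from q
        self.holdback = {}   # (q, S) -> message, for messages that arrived early
        self.delivered = []

    def receive(self, q, S, msg):
        expected = self.R.get(q, 0) + 1
        if S < expected:
            return                            # duplicate: already delivered
        if S > expected:
            self.holdback[(q, S)] = msg       # gap detected: hold back
            return                            # (a NACK would be sent here)
        self._deliver(q, S, msg)
        nxt = S + 1                           # the gap may have been filled:
        while (q, nxt) in self.holdback:      # flush successors in order
            self._deliver(q, nxt, self.holdback.pop((q, nxt)))
            nxt += 1

    def _deliver(self, q, S, msg):
        self.R[q] = S
        self.delivered.append(msg)
```

The test below replays the three cases in the bullets: an early arrival is held back, a gap-filler releases it, and a duplicate is discarded.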
Figure 12.11 The hold-back queue for
arriving multicast messages
49
(diagram: incoming messages pass through message processing into a
hold-back queue; when the delivery guarantees are met, a message moves
to the delivery queue and is delivered)
Reliable Multicast
• Integrity—a correct process p delivers a message m at most once.
• Follows from detection of duplicates and underlying properties of IP-
multicast (uses checksums to expunge corrupted messages)
• Validity—if a correct process multicasts a message m, then it will
eventually deliver m.
• IP-multicast has that property
• Agreement—if a correct process delivers a message m, then all
other correct processes in group(m) will eventually deliver m.
• We require that a process can always detect missing messages, which
in turn means that it will always receive a further message enabling it
to detect the omission.
• In the present protocol, we assume that a correct process multicasts messages indefinitely.
• Second, it is required that a process can always obtain a missing message,
• i.e., processes are assumed to retain a copy of every message they have delivered, indefinitely.
50
Uniform Property
The above definition of agreement refers only to the behavior of correct
processes, i.e., processes that never fail. But what if a process crashes?
• In the above algorithm, a process may crash after it has R-delivered a
message but before re-multicasting it.
• Uniform agreement: if a process, whether it is correct or fails, delivers
a message m, then all correct processes in group(m) will eventually
deliver m.
• Uniform agreement is useful in many applications.
• For example, consider a group of banking servers. If an update is sent to the
group and a server process crashes immediately after delivering the update
message, then without uniform agreement a client that accessed that server
just before it crashed may observe an update that no other server performs.
51
What if we reverse the lines
‘R-deliver m’
and
‘if (q ≠ p) then B-multicast(g, m); end if’?
52
Ordered messages
It is important that messages be delivered in order; there are three basic types
of ordering:
• FIFO—(first-in, first-out) if a correct process issues multicast(g, m) and then
multicast(g, m’), then every correct process that delivers m’ will deliver m
before m’.
• Causal—if multicast(g, m) → multicast(g, m’), where → is the happened-
before relation induced only by messages sent between the members of g,
then any correct process that delivers m’ will deliver m before m’.
• Total—if a correct process delivers message m before it delivers m’, then
any other correct process that delivers m’ will deliver m before m’.
• Hybrids: total-causal, total-FIFO
Assumption: any process belongs to at most one group
53
Comments on Ordering
• Note that FIFO ordering and causal ordering are only
partial orders: not all messages are sent by the same
process, and some multicasts are concurrent, not
ordered by happened-before.
• In figure 12.12, T1 and T2 show total ordering, F1 and F2
show FIFO ordering, and C1 and C3 show causal ordering.
Note that T1 and T2 are delivered in the opposite order to
the physical time of message creation: total ordering
demands consistency, but not any particular order.
54
Figure 12.12 Total, FIFO and causal
ordering of multicast messages
55
Notice the consistent ordering of totally ordered messages T1 and T2,
the FIFO-related messages F1 and F2 and the causally related messages
C1 and C3 – and the otherwise arbitrary delivery ordering of messages.
(diagram: timelines for P1, P2 and P3 against time, showing messages
F1, F2, F3, T1, T2, C1, C2 and C3)
Reliability?
• Definitions of ordered multicast do not imply reliability.
– Example: under total ordering, if a correct process p delivers
message m and then delivers message m’, a correct process
q may deliver m without ever delivering m’ or any other
message ordered after m.
• We can also form hybrids of ordered and reliable
protocols.
– In the literature, reliable totally ordered multicast is often
referred to as atomic multicast.
– Similarly, reliable causal multicast and reliable versions of
the hybrid ordered multicasts can be formed.
56
Bulletin Board example
• A bulletin board illustrates the desirability of consistency and, at
minimum, FIFO ordering.
– Users can best follow references to preceding postings from a user if
messages from that user are delivered in FIFO order. Message 25 in
figure 12.13 refers to message 24, and message 27 refers to message 23.
• Reliable multicast is required if every user is to receive every
posting eventually.
• Note the further flexibility a system such as WebBoard gains by
allowing messages to start threads as replies to particular
messages: messages then do not have to be displayed in the
order they are delivered.
• If total ordering is used, the numbering on the left-hand side
appears the same to all users, who can then refer unambiguously
to, say, “message number 24”.
57
Figure 12.13 Display from bulletin
board program
58
Bulletin board:os.interesting
Item From Subject
23 A.Hanlon Mach
24 G.Joseph Microkernels
25 A.Hanlon Re: Microkernels
26 T.L’Heureux RPC performance
27 M.Walker Re: Mach
end
Implementing FIFO Ordering
• FO-multicast and FO-deliver are achieved with
sequence numbers.
• The variables S_g^p and R_g^q held at process p are used
just as in the previous protocol (over IP multicast), and
FIFO ordering of the messages from each process is
thereby maintained.
• If R-multicast is used instead of B-multicast, then we
obtain reliable FIFO multicast.
59
Implementing Total Ordering
• The normal approach to total ordering is to assign
totally ordered identifiers to multicast messages, using
the identifiers to make ordering decisions.
• One possible implementation is to use a sequencer
process to assign identifiers. See figure 12.14. A
drawback of this is that the sequencer can become a
bottleneck.
• An alternative is to have the processes collectively
agree on identifiers. A simple algorithm is shown in
figure 12.15.
60
Using Sequencer
• A process wishing to TO-multicast a message m to group g attaches
a unique identifier id(m) to it.
• Messages for g are sent to sequencer(g), as well as to the
members of g. (The sequencer may itself be a member of g.)
• The process sequencer(g) maintains a group-specific sequence
number s_g, which it uses to assign increasing and consecutive
sequence numbers to the messages that it B-delivers.
• It announces the sequence numbers by B-multicasting order
messages to g.
• A message remains in the hold-back queue until it can be
TO-delivered according to its assigned sequence number.
If the processes use a FIFO-ordered variant of B-multicast, then the
totally ordered multicast is also causally ordered. 61
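The member-side logic is the interesting part: a message can be TO-delivered only when both its payload and its order message have arrived, and only in sequence. A sketch (class names illustrative; transport abstracted away):

```python
class Sequencer:
    """Assigns consecutive sequence numbers to message identifiers (sketch)."""
    def __init__(self):
        self.s = 0
    def order(self, msg_id):
        self.s += 1
        return (msg_id, self.s)       # would be B-multicast as an 'order' message

class Member:
    """A group member holding back messages until they can be TO-delivered."""
    def __init__(self):
        self.pending = {}             # msg_id -> payload, awaiting its order
        self.orders = {}              # seq no -> msg_id
        self.next_seq = 1
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.pending[msg_id] = payload
        self._flush()

    def on_order(self, msg_id, seq):
        self.orders[seq] = msg_id
        self._flush()

    def _flush(self):
        # TO-deliver only when both the payload and its sequence number are known
        while self.orders.get(self.next_seq) in self.pending:
            mid = self.orders[self.next_seq]
            self.delivered.append(self.pending.pop(mid))
            self.next_seq += 1
```

The test delivers order messages out of order to show that members still deliver payloads in the sequencer's order.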
Figure 12.14 Total ordering using a sequencer
62
Variants
• An obvious problem with the sequencer-based approach is
that the sequencer may become a bottleneck.
• Variants
– Chang and Maxemchuk [1984]
– Kaashoek et al. [1989]
• Kaashoek et al. [1989] use hardware-based multicast where
available, for example on an Ethernet.
• In their simplest variant, processes send the message to be
multicast to the sequencer, one-to-one. The sequencer
multicasts the message itself, as well as its identifier and
sequence number.
63
ISIS algorithm for total ordering
• Processes collectively agree on the assignment of sequence
numbers to messages in a distributed manner.
• Each process q in g keeps A_g^q, the largest sequence number it has observed so far
for g, and P_g^q, its own largest proposed sequence number.
• Algorithm:
– p B-multicasts <m, i> to g, where i is a unique identifier for m.
– Each process q replies to the sender p with a proposal for the message's agreed
sequence number: P_g^q := Max(A_g^q, P_g^q) + 1.
– Each process provisionally assigns its proposed sequence number to the message
and places the message in its hold-back queue, which is ordered with the smallest
sequence number at the front.
– p collects all the proposed sequence numbers and selects the largest one, a, as the
next agreed sequence number. It then B-multicasts <i, a> to g. Each process q in g
sets A_g^q := Max(A_g^q, a) and attaches a to the message (which is identified by i).
It reorders the message in the hold-back queue if the agreed sequence number
differs from the proposed one.
– When the message at the front of the hold-back queue has been assigned its
agreed sequence number, it is transferred to the tail of the delivery queue.
Messages that have been assigned their agreed sequence numbers but are not at
the front of the hold-back queue are not yet transferred.
64
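One participant's bookkeeping (observed maximum, own proposal, hold-back queue) can be sketched as below. Simplifications, flagged here because the full protocol needs them: tie-breaking of equal agreed numbers by process identifier is omitted, and the sender's collection of proposals is done by the test code rather than by messages:

```python
class ISISProcess:
    """One participant in the ISIS agreed-sequence-number protocol (sketch)."""
    def __init__(self):
        self.A = 0            # largest sequence number observed so far for g
        self.P = 0            # largest sequence number this process has proposed
        self.holdback = []    # entries [seq, agreed?, msg_id], smallest seq first

    def propose(self, msg_id):
        """Reply to <m, i> with a proposed sequence number; provisionally queue m."""
        self.P = max(self.A, self.P) + 1
        self.holdback.append([self.P, False, msg_id])
        return self.P

    def agree(self, msg_id, a):
        """Handle <i, a>: attach the agreed number, reorder, deliver from the front."""
        self.A = max(self.A, a)
        for e in self.holdback:
            if e[2] == msg_id:
                e[0], e[1] = a, True
        self.holdback.sort(key=lambda e: e[0])
        deliverable = []
        while self.holdback and self.holdback[0][1]:   # front has its agreed number
            deliverable.append(self.holdback.pop(0)[2])
        return deliverable
```

The test shows the key behaviour: a message whose agreed number has arrived still waits if an earlier-numbered message at the front of the queue is not yet agreed.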
Second Approach
Figure 12.15 The ISIS algorithm for total ordering
65
(diagram: p1 B-multicasts a message (1) to p2, p3 and p4; each process replies
with a proposed sequence number (2); p1 then multicasts the agreed sequence
number (3))
66
 If every process agrees on the same set of sequence numbers and
delivers them in the corresponding order, then total ordering is satisfied.
 Correct processes ultimately agree on the same set of sequence numbers,
the numbers are monotonically increasing, and no correct process can
deliver a message prematurely.
Assume that a message m1 has been assigned its agreed sequence number
and has reached the front of the hold-back queue. Then for any message m2
still behind it in the queue:
agreedSequence(m2) >= proposedSequence(m2)
(by the algorithm above)
proposedSequence(m2) > agreedSequence(m1)
(since m1 is at the front of the queue)
Therefore,
agreedSequence(m2) > agreedSequence(m1)
67
 This algorithm has higher latency than the sequencer-based
algorithm, since three messages pass between the sender
and the group before a message can be delivered.
 The total ordering chosen by this algorithm is also not
guaranteed to be causally or FIFO ordered: any two
messages are delivered in an essentially arbitrary total
order, influenced by communication delays.
Implementing Causal Ordering
(Birman et al. [1991])
• Non-overlapping closed groups can have causally
ordered multicasts using vector timestamps
• The algorithm orders only the happened-before
relationships established by multicasts, ignoring
one-to-one messages between processes.
• Each process updates its vector timestamp before
delivering a message, to maintain the count of
causally preceding messages.
• Operations: CO-multicast and CO-deliver
68
Implementing Causal Ordering
(Birman et al. [1991])
• Logic
• When a process pi B-delivers a message from pj, it must
place the message in the hold-back queue before it can
CO-deliver it, until it is assured that it has delivered any
messages that causally preceded it. To establish this, pi
waits until:
– (a) it has delivered any earlier message sent by pj, and
– (b) it has delivered any message that pj had delivered at the
time it multicast the message.
• Both of these conditions can be detected by examining
the vector timestamps.
69
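Conditions (a) and (b) translate directly into two vector-timestamp tests, sketched below: Vj[j] == V[j] + 1 checks (a), and Vj[k] <= V[k] for all other k checks (b). This is a sketch for a single non-overlapping closed group; for brevity, a sender's delivery of its own message is omitted and names are illustrative:

```python
class COProcess:
    """Causally ordered multicast with vector timestamps (sketch, after
    Birman et al.): one closed group of n processes."""
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.V = [0] * n          # V[j]: count of messages delivered from pj
        self.holdback = []
        self.delivered = []

    def co_multicast(self):
        """Increment own entry; the timestamp is piggybacked on the message."""
        self.V[self.i] += 1
        return (self.i, list(self.V))

    def co_deliver(self, j, Vj, payload):
        self.holdback.append((j, Vj, payload))
        changed = True
        while changed:            # re-scan: one delivery may release others
            changed = False
            for m in list(self.holdback):
                j, Vj, payload = m
                # (a) next message from pj, and (b) everything pj had
                # delivered when it multicast has been delivered here too
                if Vj[j] == self.V[j] + 1 and all(
                        Vj[k] <= self.V[k] for k in range(self.n) if k != j):
                    self.holdback.remove(m)
                    self.delivered.append(payload)
                    self.V[j] += 1
                    changed = True
```

The test recreates the classic scenario: p2 receives a causally later message first and must hold it back until its predecessor arrives.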
Figure 12.16 Causal ordering using
vector timestamps
70
• Check: if we substitute the R-multicast
primitive in place of B-multicast, then we
obtain a multicast that is both reliable and
causally ordered.
• Furthermore, if we combine the protocol for
causal multicast with the sequencer-based
protocol for totally ordered delivery, then we
obtain message delivery that is both causal
and total.
71
Overlapping groups
• The assumption of non-overlapping groups is not
satisfactory in real scenarios.
• We have to consider global orders in which, if a
message m is multicast to group g and a message
m’ is multicast to group g’, then both messages
are addressed to the members of g ∩ g’.
72
Overlapping groups
• Global FIFO ordering
– If a correct process issues multicast(g, m) and then multicast(g’, m’),
then every correct process in g ∩ g’ that delivers m’ will deliver m
before m’.
• Global Causal Ordering
– If multicast(g, m) → multicast(g’, m’), where → is the happened-before
relation induced by any chain of multicast messages, then any correct
process in g ∩ g’ that delivers m’ will deliver m before m’.
• Pairwise Total Ordering
– If a correct process delivers message m sent to g before it delivers m’
sent to g’, then any correct process in g ∩ g’ that delivers m’ will deliver
m before m’.
• Global Total Ordering
– Let ‘<‘ be the relation of ordering between delivery events. We require
that ‘<‘ obeys pairwise total ordering and that it is acyclic (under
pairwise total ordering alone, ‘<‘ is not acyclic by default). 73
• Consensus is the problem of getting a group of processes
to agree on a value that is proposed by one of the
processes.
• The classic formulation of this process is the Byzantine
Generals problem: a decision whether multiple armies
should attack or retreat, assuming that united action will
be more successful than some attacking and some
retreating.
• Another example might be space ship controllers
deciding whether to proceed or abort. Failure handling
during consensus is a key concern.
74
Consensus and Related Problems
Agreement Problems
• Consensus
• Byzantine Generals
• Interactive Consistency
75
Consensus and Related Problems
Consensus
System Model
1. System has collection of processes Pi (i=1,2,…..,n)
2. Processes communicate through message passing
3. Consensus is reached even in the presence of
failures (f processes may fail).
4. Communication is reliable but processes may fail.
76
Consensus Process
1. Each process begins in an undecided state
2. A value is proposed from a set of values
3. Processes communicate with each other
4. Each process sets the state of a decision variable di
• Figure 12.17 shows three processes engaged in a
consensus algorithm. Two processes propose
“proceed.” One proposes “abort,” but then crashes.
The two remaining processes decide proceed.
77
Figure 12.17 Consensus for three
processes
78
[Figure: P1, P2, and P3 run the consensus algorithm with v1 = proceed,
v2 = proceed, v3 = abort; P3 crashes, and the survivors decide
d1 := proceed, d2 := proceed.]
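The Figure 12.17 scenario can be mimicked in a few lines. This is a hypothetical sketch, not code from the text: `decide` is a simple majority function, which is one admissible decision rule for the survivors to apply to the proposals they actually received.

```python
from collections import Counter

def decide(received):
    """Decide on the most common value among the received proposals."""
    return Counter(received).most_common(1)[0][0]

proposals = {"p1": "proceed", "p2": "proceed", "p3": "abort"}
crashed = {"p3"}  # p3 crashes before its proposal reaches the others

for p in ("p1", "p2"):
    received = [v for q, v in proposals.items() if q not in crashed]
    print(p, "decides", decide(received))  # both decide "proceed"
```

Because both survivors apply the same function to the same set of received values, agreement follows.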
Requirements for Consensus
• Termination—eventually each correct process
sets its decision variable
• Agreement—the decision value of all correct
processes is the same
• Integrity—if the correct processes all
proposed the same value, then any correct
process in the decided state has that value.
79
Byzantine Generals
• The Byzantine Empire was fraught by frequent infighting
among rival military leaders for control of the empire.
Where several generals had to cooperate to achieve an
objective, a treacherous general could weaken or even
eliminate a rival by retreating and encouraging another
general to retreat while encouraging the rival to attack.
Without expected support, the rival was likely to be
defeated. The Byzantine Generals problem concerns
decision making in anticipation of an attack.
80
Formal Statement of Problem
• Here is the Byzantine Generals problem:
– Three or more generals must agree to attack or
retreat
– One general, the commander, issues the order
– Other generals, the lieutenants, must decide to attack
or retreat
– One or more generals may be treacherous
– A treacherous general tells one general to attack and
another to retreat
• The difference from consensus is that a single process
supplies the value to be agreed upon
81
Byzantine General Requirements
• Termination—eventually each correct process
sets its decision variable
• Agreement—the decision variable of all
correct processes is the same
• Integrity—if the commander is correct, then
all correct processes agree on the value that
the commander has proposed
82
Interactive Consistency
• A problem related to the Byzantine Generals problem is
interactive consistency. In this, all correct processes agree
on a vector of values, one for each process. This is called
the decision vector. Required:
– Termination—eventually each correct process sets its
decision variable
– Agreement—the decision vector of all correct processes
is the same
– Integrity—if Pi is correct, then all correct processes
decide on vi as the ith component of their vector.
83
Linking the problems
• Consensus (C), Byzantine Generals (BG), and Interactive
Consistency (IC) are all problems concerned with making
decisions in the context of arbitrary or crash failures.
• We can sometimes derive a solution to one problem from
a solution to another. For example:
• We can derive IC from BG by running BG N times, once
for each process with that process acting as commander.
84
Derived Solutions
• We can derive C from IC by running IC to produce a
vector of values at each process, then applying a
function to the vector’s values to derive a single
value.
• We can derive BG from C by:
– the commander sending its proposed value to itself and
each remaining process
– all processes running C with the values they received
– deriving the BG decision from the consensus outcome
85
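Deriving C from IC amounts to applying one deterministic function to the shared decision vector. A sketch (the particular function is my choice; any deterministic function applied identically by all processes works):

```python
from collections import Counter

def consensus_from_vector(vector):
    """Every correct process holds the same decision vector, so every
    process computes the same value; None marks a crashed process's slot."""
    counts = Counter(v for v in vector if v is not None)
    # Break ties on the value itself so the choice is deterministic
    # at every process.
    return max(counts, key=lambda v: (counts[v], v))

consensus_from_vector(["proceed", "abort", "proceed", None])  # "proceed"
```

Determinism is the whole point: since the input vector is identical everywhere, agreement on the single value is immediate.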
Consensus in a Synchronous
System
• Figure 12.18 shows an algorithm that solves consensus in
a synchronous system in which up to f processes may
suffer crash failures. The algorithm proceeds in f+1
rounds; during each round, each of the correct processes
multicasts its set of known values to the others.
• The algorithm guarantees that all surviving correct
processes are in a position to agree.
• Note: any algorithm that tolerates up to f crash failures
requires at least f+1 rounds of message exchange.
86
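The round structure above can be sketched as a small simulation. This is a hypothetical illustration, not the book's Figure 12.18 code: each process repeatedly multicasts the set of values it knows, and a crashing process may reach only some peers in its final round.

```python
def run(proposals, f, crash_schedule):
    """Simulate f+1 synchronous rounds of value exchange.

    proposals: {pid: value}; crash_schedule: {pid: (round, recipients)}
    means pid sends only to `recipients` in that round, then stays silent.
    """
    pids = sorted(proposals)
    known = {p: {proposals[p]} for p in pids}   # values each process holds
    alive = set(pids)
    for r in range(1, f + 2):                   # rounds 1 .. f+1
        inbox = {p: set() for p in pids}
        for p in list(alive):
            if p in crash_schedule and crash_schedule[p][0] == r:
                for q in crash_schedule[p][1]:  # partial multicast, then crash
                    inbox[q] |= known[p]
                alive.discard(p)
            else:
                for q in pids:                  # full multicast
                    inbox[q] |= known[p]
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in sorted(alive)}  # common decision rule

# Worst case with f = 1: p3 crashes in round 1 after reaching only p1,
# but round 2 lets p1 relay the value on, so the survivors still agree.
print(run({1: "proceed", 2: "proceed", 3: "abort"}, f=1,
          crash_schedule={3: (1, [1])}))  # {1: 'abort', 2: 'abort'}
```

The extra round is exactly what lets p1 forward the value it alone received, which is why fewer than f+1 rounds cannot suffice.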
Figure 12.18 Consensus in a
synchronous system
87
Limits for solutions to Byzantine
Generals
• Some cases of the Byzantine Generals problem have no
solution.
• Lamport et al. found that with only three processes, one
of which is faulty, there is no solution.
• Pease et al. generalized this: if the total number of
processes is no more than three times the number of
failures, there is no solution. Thus there is a solution with 4
processes and 1 failure, using two rounds. In the
first, the commander sends the values, while in the
second, each lieutenant sends the values it received.
88
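The Pease et al. bound can be stated as a one-line predicate, n ≥ 3f + 1, which covers both of the scenarios in the figures that follow:

```python
def byzantine_solvable(n, f):
    """Byzantine agreement among n processes tolerating f arbitrary
    faults is possible only if n >= 3f + 1 (Pease et al.)."""
    return n >= 3 * f + 1

byzantine_solvable(3, 1)  # False: the three-generals case (Figure 12.19)
byzantine_solvable(4, 1)  # True: the four-generals case (Figure 12.20)
```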
Figure 12.19
Three Byzantine generals
89
[Figure: two scenarios with commander p1 and lieutenants p2 and p3.
Left: p1 sends “1:v” to both lieutenants; p3 is faulty and relays
“3:1:u” to p2, while p2 correctly relays “2:1:v”. Right: the
commander p1 is faulty, sending “1:w” to one lieutenant and “1:x”
to the other, which then relay “2:1:w” and “3:1:x”. Faulty processes
are shown coloured.]
Figure 12.20
Four Byzantine generals
90
[Figure: two scenarios with commander p1 and lieutenants p2, p3 and
p4. Left: p3 is faulty; p1 sends “1:v” to all, and in the second
round p2 and p4 correctly relay “2:1:v” and “4:1:v” while p3 relays
“3:1:w”. Right: the commander p1 is faulty, sending different values
(“1:u”, “1:v”, “1:w”) to the lieutenants, which then relay what they
received. Faulty processes are shown coloured.]
Asynchronous Systems
• Guaranteed solutions to the consensus and Byzantine
generals problems exist only in synchronous systems.
• Fischer et al. proved that no algorithm can guarantee
consensus in an asynchronous system if even one process
may crash.
• In practice, this impossibility is circumvented by masking
faults or by using failure detectors.
• There are also partial solutions based on randomization:
introducing random values into the process prevents an
adversary from constructing an effective thwarting
strategy. Such algorithms are not guaranteed to reach
consensus within any bounded number of steps.
91
Discussion Question
• Why can’t consensus be guaranteed in an
asynchronous environment?
92
Bibliography
• George Coulouris, Jean Dollimore and Tim Kindberg,
Distributed Systems: Concepts and Design, Fourth Edition,
Addison-Wesley, 2005.
• Figures from the Coulouris text are from the
instructor’s guide and are copyrighted by Pearson
Education, 2005.
• Fischer et al., Pease et al., and Lamport et al.: see
references in the Coulouris text, pp. 859 ff.
93

More Related Content

What's hot

Distributed System-Multicast & Indirect communication
Distributed System-Multicast & Indirect communicationDistributed System-Multicast & Indirect communication
Distributed System-Multicast & Indirect communicationMNM Jain Engineering College
 
Locks In Disributed Systems
Locks In Disributed SystemsLocks In Disributed Systems
Locks In Disributed Systemsmridul mishra
 
Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlbalamurugan.k Kalibalamurugan
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemorySHIKHA GAUTAM
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating SystemsDr Sandeep Kumar Poonia
 
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systemsmridul mishra
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computingSVijaylakshmi
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactionsNilu Desai
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communicationAbDul ThaYyal
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating systemudaya khanal
 
Remote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemRemote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemPoojaBele1
 
Practical Byzantine Fault Tolernace
Practical Byzantine Fault TolernacePractical Byzantine Fault Tolernace
Practical Byzantine Fault TolernaceYongraeJo
 
Tapestry
TapestryTapestry
TapestrySutha31
 
fault-tolerance-slide.ppt
fault-tolerance-slide.pptfault-tolerance-slide.ppt
fault-tolerance-slide.pptShailendra61
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency ControlDilum Bandara
 
Distributed shred memory architecture
Distributed shred memory architectureDistributed shred memory architecture
Distributed shred memory architectureMaulik Togadiya
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software conceptsPrajakta Rane
 

What's hot (20)

Distributed System-Multicast & Indirect communication
Distributed System-Multicast & Indirect communicationDistributed System-Multicast & Indirect communication
Distributed System-Multicast & Indirect communication
 
Locks In Disributed Systems
Locks In Disributed SystemsLocks In Disributed Systems
Locks In Disributed Systems
 
Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency control
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared Memory
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
 
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systems
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computing
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactions
 
Chapter 4 a interprocess communication
Chapter 4 a interprocess communicationChapter 4 a interprocess communication
Chapter 4 a interprocess communication
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating system
 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 
Remote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemRemote Procedure Call in Distributed System
Remote Procedure Call in Distributed System
 
Distributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithmsDistributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithms
 
Practical Byzantine Fault Tolernace
Practical Byzantine Fault TolernacePractical Byzantine Fault Tolernace
Practical Byzantine Fault Tolernace
 
Tapestry
TapestryTapestry
Tapestry
 
fault-tolerance-slide.ppt
fault-tolerance-slide.pptfault-tolerance-slide.ppt
fault-tolerance-slide.ppt
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency Control
 
Distributed Coordination-Based Systems
Distributed Coordination-Based SystemsDistributed Coordination-Based Systems
Distributed Coordination-Based Systems
 
Distributed shred memory architecture
Distributed shred memory architectureDistributed shred memory architecture
Distributed shred memory architecture
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts
 

Similar to Coordination and Agreement .ppt

Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Govt. P.G. College Dharamshala
 
Mutual Exclusion using Peterson's Algorithm
Mutual Exclusion using Peterson's AlgorithmMutual Exclusion using Peterson's Algorithm
Mutual Exclusion using Peterson's AlgorithmSouvik Roy
 
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfDC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfLegesseSamuel
 
Lecture 5- Process Synchronization (1).pptx
Lecture 5- Process Synchronization (1).pptxLecture 5- Process Synchronization (1).pptx
Lecture 5- Process Synchronization (1).pptxAmanuelmergia
 
Process Synchronization And Deadlocks
Process Synchronization And DeadlocksProcess Synchronization And Deadlocks
Process Synchronization And Deadlockstech2click
 
7308346-Deadlock.pptx
7308346-Deadlock.pptx7308346-Deadlock.pptx
7308346-Deadlock.pptxsheraz7288
 
The implementation of Banker's algorithm, data structure and its parser
The implementation of Banker's algorithm, data structure and its parserThe implementation of Banker's algorithm, data structure and its parser
The implementation of Banker's algorithm, data structure and its parserMatthew Chang
 
Concurrency Control, Recovery, Case Studies
Concurrency Control, Recovery, Case StudiesConcurrency Control, Recovery, Case Studies
Concurrency Control, Recovery, Case StudiesPrabu U
 
Ch17 OS
Ch17 OSCh17 OS
Ch17 OSC.U
 
Mutual Exclusion in Distributed Memory Systems
Mutual Exclusion in Distributed Memory SystemsMutual Exclusion in Distributed Memory Systems
Mutual Exclusion in Distributed Memory SystemsDilum Bandara
 
Concurrency Control in Distributed Systems.pptx
Concurrency Control in Distributed Systems.pptxConcurrency Control in Distributed Systems.pptx
Concurrency Control in Distributed Systems.pptxMArshad35
 
Lecture 2 data link layer 1 v1
Lecture 2 data link layer 1 v1Lecture 2 data link layer 1 v1
Lecture 2 data link layer 1 v1Ronoh Kennedy
 
Distributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere MortalsDistributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere MortalsEnsar Basri Kahveci
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systemsguest61205606
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systemsguest61205606
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systemsguest0f5a7d
 

Similar to Coordination and Agreement .ppt (20)

Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...
 
Mutual Exclusion using Peterson's Algorithm
Mutual Exclusion using Peterson's AlgorithmMutual Exclusion using Peterson's Algorithm
Mutual Exclusion using Peterson's Algorithm
 
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfDC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
 
Lecture 5- Process Synchronization (1).pptx
Lecture 5- Process Synchronization (1).pptxLecture 5- Process Synchronization (1).pptx
Lecture 5- Process Synchronization (1).pptx
 
Process Synchronization And Deadlocks
Process Synchronization And DeadlocksProcess Synchronization And Deadlocks
Process Synchronization And Deadlocks
 
7308346-Deadlock.pptx
7308346-Deadlock.pptx7308346-Deadlock.pptx
7308346-Deadlock.pptx
 
The implementation of Banker's algorithm, data structure and its parser
The implementation of Banker's algorithm, data structure and its parserThe implementation of Banker's algorithm, data structure and its parser
The implementation of Banker's algorithm, data structure and its parser
 
Concurrency Control, Recovery, Case Studies
Concurrency Control, Recovery, Case StudiesConcurrency Control, Recovery, Case Studies
Concurrency Control, Recovery, Case Studies
 
Ch17 OS
Ch17 OSCh17 OS
Ch17 OS
 
OSCh17
OSCh17OSCh17
OSCh17
 
OS_Ch17
OS_Ch17OS_Ch17
OS_Ch17
 
Mutual Exclusion in Distributed Memory Systems
Mutual Exclusion in Distributed Memory SystemsMutual Exclusion in Distributed Memory Systems
Mutual Exclusion in Distributed Memory Systems
 
Concurrency Control in Distributed Systems.pptx
Concurrency Control in Distributed Systems.pptxConcurrency Control in Distributed Systems.pptx
Concurrency Control in Distributed Systems.pptx
 
Lecture 2 data link layer 1 v1
Lecture 2 data link layer 1 v1Lecture 2 data link layer 1 v1
Lecture 2 data link layer 1 v1
 
Distributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere MortalsDistributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere Mortals
 
WEEK-01.pdf
WEEK-01.pdfWEEK-01.pdf
WEEK-01.pdf
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Deadlock
DeadlockDeadlock
Deadlock
 

Recently uploaded

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 

Recently uploaded (20)

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 

Coordination and Agreement .ppt

  • 2. Topics • How do processes coordinate their actions and agree on shared values? • Mutual Exclusion Agreements • Distributed Elections • Multicast Communication • Consensus • Byzantine Agreement • Interactive Consistency 2
  • 3. The First Space Shuttle Flight - 1981 • The US Space Shuttle program used redundant systems to manage the probability of failures in space with no repairmen, spare parts, or down-time available. The computer flight control system had four identical computers, so that if one failed, there were still enough to determine a correct action by voting if another failed. • In addition, there was a backup system, developed by a different contractor on different hardware with a different operating system to take over if the first system failed completely. 3
  • 4. The Coordination Challenge • Developing these five computer systems had many challenges, including detecting faults, isolating problems, and switching from one configuration to another. The first space shuttle mission was delayed due to a failure in the coordination and agreement protocols between the redundant main flight system and its backup. 4
  • 5. A Famous Software Failure • On April 10, 1981, about 20 minutes prior to the scheduled launching of the first flight of America's Space Transportation System, astronauts and technicians attempted to initialize the software system which "backs- up" the quad-redundant primary software system ...and could not. In fact, there was no possible way, it turns out, that the Backup Flight Control System in the fifth onboard computer could have been initialized properly with the Primary Avionics Software System already executing in the other four computers. 5
  • 6. Detecting Failures • Detecting, locating, and isolating a failure in a distributed computer system is a challenge. The design of any distributed algorithm should allow for fault detection and consider failure mitigation procedures. • Figure 12.1 gives an example of a failure in a network—a crashed router that divides a network into two non- communicating partitions. One common protocol for detecting this type of network failure is a timeout mechanism. 6
  • 7. Figure 12.1 A network partition 7 Crashed router
  • 8. 8 Detecting Failures • Assumptions • Each pair is of processes connected by reliable channel • Network components may fail, but handled by reliable Communication Protocols • Processes may fail only by crashing, unless stated otherwise
  • 9. Failure Detector • A Failure Detector is a service that processes queries about whether a particular process has crashed. It is often implemented by a local object known as a Local Failure Detector. • Failure detectors are not necessarily accurate. For example, a process that timed-out after 255 seconds might have succeeded if allowed to proceed for 256 seconds. Most failure detectors fall into the category of Unreliable Failure Detector. • Although, a failure detector acts for collection of processes, but may sometimes give different responses to different processes 9
  • 10. Reliable Failure Detector • A Reliable Failure Detector is a service that is always accurate in detecting a process failure. A failure detector is only as good as the information that is provided to it by a process or about a process. Some categories of faults lend themselves to easy detection, while others do not. • As a human example, consider the two questions: “Should I study tonight?” and “Is that light turned on?” It is usually easier to get a definitive answer for the second question than the first. 10
  • 11. Algorithm for implementing unreliable failure detector • After each T seconds, each p sends “p is here” message to every other process q. • If q does not receive “p is here” message after T+D seconds (D is transmission delay) of the last one, then it reports to q that p is suspected. • However, if subsequently it receives “p is here” message, then it reports to q that p is OK. If we choose small values of T and D then failure detector is likely to suspect non- crashed processes many times. If T and D are large then crashed processes will often be reported as Unsuspected. Reliable failure detector require that the system is synchronous. 11
  • 12. Distributed Mutual Exclusion • There is a need for distributed processes to coordinate shared activities. For example, it is not usually acceptable for two applications to update the same record in a database file at the same time. • One possible approach to preventing this is the Unix daemon lockd, which places a file lock on a text file while it is being written by a process so that no other process can write to that file until the lock is released. 12
  • 13. Resource Managers • In the case of lockd, the operating system functions as a server or resource manager to provide the service. Similar functions are routine on networks such as Ethernets. A resource manager can keep track of locks, simplifying the process. • It is desirable to have a generic mechanism for distributed mutual exclusion that needs no resource manager, i.e. peer processes must coordinate their actions among themselves. 13
  • 14. Mutual Exclusion Algorithms • Mutual Exclusion Algorithms define critical sections and allow only one process to access a resource in a critical region at one time. There are three basic operations for the algorithms: – enter( ) to access the critical region – resourceAccesses( ) to use the resources – exit( ) to leave the critical section 14
  • 15. Mutual Exclusion Requirements • Mutual Exclusion Algorithms have two basic requirements: – (1) Safety—at most one process may execute in the critical section at one time – (2) Liveness—all requests to enter and exit the critical section must eventually succeed. This implies freedom from deadlock and starvation. 15
  • 16. Fairness Conditions • Starvation is the indefinite postponement of entry for a process that has requested it. Absence of starvation is a fairness condition. • Ordering is another fairness condition: – (3) If one request to enter the critical section happened-before another request (ordering by actual physical time is not possible), then the earlier request is granted first. Figure 12.2 shows a server managing happened-before ordering. 16
  • 17. Figure 12.2 A server managing a mutual exclusion token for processes p1–p4 (1: request token, 2: release token, 3: grant token; waiting requests are held in a queue) 17
  • 18. Performance Criteria • Mutual exclusion algorithms are evaluated by the following criteria: – Bandwidth consumed—proportional to the number of messages sent for each entry and exit. – Client delay incurred by a process at each entry and exit. – Throughput—the rate at which processes can use the critical section. Synchronization delay is the time between one process exiting the critical section and the next entering it. Shorter delays imply greater throughput. 18
  • 19. Central Server Algorithm • The simplest way to achieve mutual exclusion is to establish a server that grants permission to enter the critical section. A process requests entry and waits for a reply. Conceptually, the reply is a token that grants permission. If no other process has the token, it is granted immediately. Otherwise the request is queued behind any earlier requests, and the process must wait until the token becomes available in its turn. 19
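A minimal sketch of the central-server scheme, assuming a FIFO request queue (class and method names are inventions of the sketch, not from the text):

```python
from collections import deque

class TokenServer:
    """Central-server mutual exclusion sketch.

    Grants a conceptual token to one process at a time; later
    requesters wait in FIFO order until the token is released.
    """

    def __init__(self):
        self.holder = None     # process currently holding the token
        self.queue = deque()   # waiting requesters, oldest first

    def request(self, pid):
        # Grant immediately if the token is free, otherwise enqueue.
        if self.holder is None:
            self.holder = pid
            return "granted"
        self.queue.append(pid)
        return "queued"

    def release(self, pid):
        # Only the holder may release; the token passes to the next waiter.
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder     # the new holder, or None if the token is free
```

Safety holds because at most one process is ever the holder; liveness holds because the FIFO queue serves every request eventually.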
  • 20. Ring-based Algorithm • A simple way to achieve mutual exclusion is to arrange the processes in a logical ring. • Each process has a link to the next process. • A token is passed around the ring. • If the process receiving the token needs access to the critical section, it enters the section; otherwise it passes the token to the next process. • Figure 12.3 shows a ring-based algorithm graphically. • Safety and liveness are met by this algorithm, but the ordering requirement is not: the token's circulation order need not match the happened-before order of requests. • Network bandwidth is consumed continuously, even when there are no requests. 20
  • 21. Figure 12.3 A ring of processes p1 … pn transferring a mutual exclusion token 21
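One way to see the token-passing behaviour is a small simulation. This sketch follows a single circulation of the token starting at process 0 and is illustrative only (the function name and interface are invented):

```python
def ring_token_simulation(n, wants_cs):
    """Simulate one full circulation of the token around a ring of n
    processes, starting at process 0.

    wants_cs: the set of process indices that want the critical section.
    Returns the order in which processes enter the CS.
    """
    entered = []
    token_at = 0
    for _ in range(n):
        if token_at in wants_cs:
            entered.append(token_at)   # enter the CS, use it, then release
        token_at = (token_at + 1) % n  # pass the token to the next process
    return entered
```

Note that the token circulates (consuming bandwidth) even when wants_cs is empty, which is the inefficiency the slide points out.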
  • 22. Using Multicast and Logical Clocks • Ricart and Agrawala developed an algorithm to implement mutual exclusion between N peer processes based on multicast. • A process that wants to enter a critical section multicasts a request message, and can enter only when all other processes have replied. • Figure 12.4 shows the algorithm, and figure 12.5 illustrates how multicast messages are synchronized. The numbers 34 and 41 are logical timestamps; since 34 is earlier, that request gets access first. 22
  • 23. Figure 12.4 Ricart and Agrawala’s algorithm 23 On initialization state := RELEASED; To enter the section state := WANTED; Multicast request to all processes; request processing deferred here T := request’s timestamp; Wait until (number of replies received = (N – 1)); state := HELD; On receipt of a request <Ti, pi> at pj (i ≠ j) if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi))) then queue request from pi without replying; else reply immediately to pi; end if To exit the critical section state := RELEASED; reply to any queued requests;
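The reply/defer decision at the heart of the algorithm in figure 12.4 can be sketched as a pure function, using Python tuple comparison for the totally ordered Lamport pairs <T, p> (the function name is an invention of this sketch):

```python
def should_defer(state, my_request, incoming_request):
    """Ricart and Agrawala's reply rule (sketch of the test in Fig. 12.4).

    state is "RELEASED", "WANTED" or "HELD". Requests are Lamport pairs
    (timestamp, pid); Python compares tuples lexicographically, which
    gives exactly the required total order on requests.
    Returns True if the incoming request must be queued without a reply.
    """
    return state == "HELD" or (state == "WANTED" and my_request < incoming_request)
```

With the slide's example timestamps, a WANTED process holding request (34, 1) defers the later request (41, 2), so the earlier requester enters first.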
  • 25. • Safety – If Pi and Pj were both to enter the CS, then each one must have replied to the other. Since the pairs <Ti, Pi> are totally ordered, this is not possible. • Liveness? • Ordering? • Gaining access to the CS takes 2(N–1) messages. • With hardware support for multicast, N messages. • Client delay is one round-trip time. 25
  • 26. Efficiency • The multicast algorithm improves on the ring algorithm by avoiding token-passing messages to inactive processes. It also requires only a single message transmission time instead of a round trip for synchronization. But it still has a number of inefficiencies. • Maekawa developed a voting algorithm (figure 12.6) in which only a subset of the processes needs to grant access, reducing the cost of entry. Unfortunately, this algorithm is deadlock-prone. Sanders has adapted it to avoid deadlocks. 26
  • 27. Figure 12.6 Maekawa’s algorithm – part 1 27 On initialization state := RELEASED; voted := FALSE; For pi to enter the critical section state := WANTED; Multicast request to all processes in Vi; Wait until (number of replies received = K); state := HELD; On receipt of a request from pi at pj if (state = HELD or voted = TRUE) then queue request from pi without replying; else send reply to pi; voted := TRUE; end if For pi to exit the critical section state := RELEASED; Multicast release to all processes in Vi; On receipt of a release from pi at pj if (queue of requests is non-empty) then remove head of queue – from pk, say; send reply to pk; voted := TRUE; else voted := FALSE; end if
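Maekawa's correctness rests on every pair of voting sets Vi, Vj having a non-empty intersection: the common member cannot vote for two entries at once. A standard way to build such sets of size O(√N), shown here as an illustrative sketch rather than Maekawa's own optimal construction, is to take the row plus the column of each process in a k×k grid:

```python
def grid_voting_sets(k):
    """Grid-quorum voting sets for N = k*k processes (sketch).

    Vi is the row plus the column of process i in a k-by-k grid.
    Any two such sets intersect (row of i meets column of j), which
    is the property mutual exclusion relies on, and each pi is in
    its own Vi, as Maekawa's algorithm requires.
    """
    n = k * k
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}   # all processes in i's row
        col = {j * k + c for j in range(k)}   # all processes in i's column
        sets.append(row | col)
    return sets
```

Each set has 2k − 1 members, i.e. about 2√N, compared with N − 1 replies needed by Ricart and Agrawala.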
  • 28. Fault Tolerance • What happens when messages are lost? • What happens when a process crashes? • None of the preceding algorithms tolerate the loss of messages with unreliable channels. If there is a reliable failure detector available, a protocol would have to be developed that would allow for failures at any point, including during a recovery protocol. 28
  • 29. Elections • An algorithm to choose a process to play a role is called an election algorithm. For example, a group of peers may select one of themselves to act as server for a mutual exclusion algorithm. • A process that initiates a run of an election algorithm calls the election. Multiple elections could be called at the same time. • A process that is engaged in an election algorithm at a particular time is a participant. • At other times or when it is not engaged it is a non- participant. 29
  • 30. Assumptions • Processes are arranged in a logical ring – Each process Pi has a communication channel to the next process in the ring, i.e. process P(i+1) mod N • The system is asynchronous and there are no failures • Goal – To elect the process with the largest identifier 30
  • 31. Election Requirements • A criterion is established for deciding an election. The text refers to the criterion as having the largest identifier, where “largest” and the identifier itself are defined by the criterion. • Safety—a participant process Pi has electedi = ⊥ (undefined) or electedi = P, where P is the non-crashed process with the largest identifier at the end of the run. • Liveness—all processes Pi participate and eventually set electedi ≠ ⊥, or crash. 31
  • 32. Performance of an Election Algorithm • Network Bandwidth Utilization – Proportional to the total number of messages sent • Turnaround Time – Number of serialized message transmission times between the initiation and termination of a run 32
  • 33. The Algorithm • Initially every process is marked as a non-participant. • Any process can begin the algorithm by marking itself a participant, placing its identifier in an election message, and sending it to its clockwise neighbour. • The receiving process compares the identifier with its own: – If the received identifier is greater, it simply forwards the election message to its clockwise neighbour. – If the received identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the message and forwards it to its clockwise neighbour. – If the receiver is already a participant, it does not forward the message. (On forwarding an election message in any case, a process marks itself as a participant.) • If the received identifier is that of the receiver itself, then this process’s identifier must be the greatest, and it becomes the coordinator. – It marks itself a non-participant and sends an elected message to its neighbour. • When a process Pi receives an elected message, it marks itself a non-participant, sets its variable electedi to the identifier in the message, and, unless it is the new coordinator, forwards the message to its neighbour. 33
  • 34. Figure 12.7 A ring-based election in progress. Note: the election was started by process 17. The highest identifier encountered so far is 24. Participant processes are shown darkened. 34
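For a single uncontested election, the substitution rule above reduces to forwarding the larger of the message's identifier and the receiver's own, so a run can be sketched as follows (a simplification: concurrent elections and the participant flag are not modelled):

```python
def ring_election(ids, starter):
    """Single ring-based election run (sketch).

    ids: process identifiers listed clockwise around the ring.
    The process at index `starter` sends the first election message.
    Returns the elected identifier, which is always the largest one.
    """
    n = len(ids)
    msg = ids[starter]           # the starter puts its own id in the message
    pos = (starter + 1) % n
    while True:
        if msg == ids[pos]:      # its own id came back: it is the coordinator
            return msg
        msg = max(msg, ids[pos]) # forward the larger identifier
        pos = (pos + 1) % n
```

The elected value is the maximum identifier regardless of who starts, which is the safety property the slide states.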
  • 35. Analysis • When a single process starts the election, the worst case is when its anti-clockwise neighbour is the process with the highest identifier. • N–1 messages are needed to reach it; it does not announce its election until its identifier has travelled around the circuit (N more messages); finally N messages announce the election. • So 3N–1 messages in all. • The turnaround time is also 3N–1 message transmission times. • The algorithm does not tolerate any failure and hence is of little practical use, but it is useful for understanding the properties of election algorithms. 35
  • 36. The Bully Algorithm • Assumes a synchronous system in which message delivery between processes is reliable. • Processes can crash during operation, and crashes are detected by timeouts. • Unlike the ring algorithm, each process knows which processes have higher identifiers and can communicate with all such processes. • Election messages announce an election. • Answer messages are sent in reply to an election message. • Coordinator messages announce the identity of the elected process. • An election message contains the identifier of the sender; answer messages come from processes with higher identifiers. A process that receives no answer within the timeout concludes that it has the highest identifier and becomes coordinator. 36
  • 37. The Bully Algorithm • A process starts an election when it detects failure of the coordinator through a timeout (several processes may detect this concurrently); the local failure detector uses T = 2Ttrans + Tprocess. • The process that knows it has the highest identifier can elect itself coordinator by sending a coordinator message to all processes with lower identifiers. • A process with a lower identifier begins an election by sending election messages to the processes with higher identifiers and awaits answer messages in response. • If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers. • Otherwise, the process waits a further period T′ for a coordinator message to arrive from the new coordinator. If none arrives, it begins a new election. • If Pi receives a coordinator message, it sets its variable electedi to the identifier of the coordinator contained within it and treats that process as the coordinator. • If Pi receives an election message, it sends back an answer message and begins another election—unless it has already begun one. 37
  • 38. Figure 12.8 The bully algorithm: the election of coordinator p2, after the failure of p4 and then p3 (stages 1–4 show election and answer messages among p1–p3, a timeout after p3 also fails, and eventually p2's coordinator message) 38
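Because answer messages always come from higher identifiers and delivery is reliable and synchronous, the outcome of a run with no crashes during the election itself is simply the highest surviving identifier. The sketch below abstracts away the whole message exchange (the function name is invented):

```python
def bully_coordinator(alive_ids):
    """Outcome of a bully election among the surviving processes (sketch).

    Every election message sent upward is answered by a higher process,
    which takes over the election, so the highest surviving identifier
    always ends up sending the coordinator message.
    """
    return max(alive_ids)
```

In figure 12.8, after p4 and then p3 fail, the surviving set {1, 2} yields p2 as coordinator, matching the caption.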
  • 39. Why “Bully”? • When a crashed process is replaced or restarted, it begins an election. If it has the highest identifier, it decides that it is the coordinator and announces this to the other processes—even though a coordinator may already exist. 39
  • 40. Multicast Communication • Group or multicast communication requires coordination and agreement: • Agreement on the set of messages that every process in the group should receive • Ordering on the delivery of messages 40
  • 41. Multicast Communication • Group communication is challenging even when all members of the group are static and aware of each other. • Dynamic groups, with processes joining and leaving, are even more challenging. Most of the challenges concern efficiency and delivery guarantees. • Efficiency concerns include minimizing overhead and increasing throughput and bandwidth utilization, using hardware support wherever available. 41
  • 42. Multicast Communication • Delivery guarantees ensure that operations are completed. • With multiple one-to-one sends by a process to the other processes, there is no way to provide delivery guarantees: if the sender fails halfway, some processes get the message while others do not. • The relative ordering of two such messages is also undefined. • IP multicast offers no reliability or delivery guarantees, but stronger multicast guarantees can be built on top of it. 42
  • 43. Closed and Open Groups • For multicast communications, a group is said to be closed if only members of the group can multicast to it. A process in a closed group sends to itself any messages to the group. (See figure 12.9) • A group is open if processes outside the group can send to it. Some algorithms assume closed groups while others assume open groups. 43
  • 44. Figure 12.9 Open and closed groups 44 Closed group Open group
  • 45. Basic Multicast • Unlike IP multicast, basic multicast guarantees that correct processes will eventually deliver the message, as long as the multicaster does not crash. To B-multicast(g, m): for each process p ∈ g, send(p, m); On receive(m): B-deliver(m) at p. • A naive acknowledgement scheme suffers ack-implosion: if every receiver acknowledges, the sender's buffers overflow and some acks are dropped, prompting retransmissions and yet more acks. 45
  • 46. Reliable Multicast • Simple multicasting is sending a message to every process that is a member of a defined group. Reliable multicasting requires these properties: • Integrity—a correct process p delivers a message m at most once. • Validity—if a correct process multicasts a message m, then it will eventually deliver m. • Agreement—if a correct process delivers a message m, then all other correct processes in group(m) will eventually deliver m. 46
  • 47. Reliable multicast algorithm 47 Agreement follows from the fact that every correct process B-multicasts the message to the other processes after it has B-delivered it. If a correct process does not R-deliver the message, then this can only be because it never B-delivered it. That in turn can only be because no other correct process B-delivered it either; therefore none will R-deliver it.
  • 48. Reliable Multicast over IP Multicast • Uses IP multicast, piggybacked acknowledgements and negative acknowledgements, for a closed group. • Each process p maintains a sequence number S_p^g for each group g to which it belongs; initially it is zero. • Each process also records R_q^g, the sequence number of the latest message it has delivered from process q that was sent to group g. • For p to R-multicast to g, it piggybacks onto the message the value S_p^g and acknowledgements of the form <q, R_q^g>. An acknowledgement conveys, for some sender q, the sequence number of the latest message from q destined for g that p has delivered since it last multicast a message. • The multicaster p then IP-multicasts the message with its piggybacked values to g, and increments S_p^g by one. • A process r R-delivers a message destined for g bearing sequence number S from p if and only if S = R_p^g + 1, and it increments R_p^g by one immediately after delivery. • If an arriving message has S ≤ R_p^g, then r has delivered the message before and discards it. • If S > R_p^g + 1, or if R > R_q^g for an enclosed acknowledgement <q, R>, then there are one or more messages that r has not yet received. It keeps all messages for which S > R_p^g + 1 in a hold-back queue. • It requests missing messages by sending negative acknowledgements—to the original sender, or to a process from which it has received an acknowledgement <q, R_q^g> with R_q^g no less than the required sequence number. 48
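The per-sender delivery rule (deliver iff S = R + 1, hold back if S > R + 1, discard if S ≤ R) can be sketched per receiver as follows. Sequence numbers start at 1 here for simplicity, and negative acknowledgements are not modelled; all names are inventions of the sketch:

```python
class Receiver:
    """Hold-back delivery sketch for the sequence-number protocol.

    R tracks the latest sequence number delivered from each sender;
    out-of-order messages wait in a hold-back queue until the gap
    before them is filled.
    """

    def __init__(self):
        self.R = {}          # sender -> last delivered sequence number
        self.holdback = {}   # sender -> {sequence number: message}
        self.delivered = []  # messages in delivery order

    def receive(self, sender, seq, msg):
        r = self.R.get(sender, 0)
        if seq <= r:
            return           # duplicate: already delivered, discard
        self.holdback.setdefault(sender, {})[seq] = msg
        # Deliver any consecutive run that is now available.
        q = self.holdback[sender]
        while r + 1 in q:
            r += 1
            self.delivered.append(q.pop(r))
        self.R[sender] = r
```

A message arriving with a gap before it stays in the hold-back queue; delivering the missing message releases the whole run, as figure 12.11 illustrates.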
  • 49. Figure 12.11 The hold-back queue for arriving multicast messages 49 Message processing Delivery queue Hold-back queue deliver Incoming messages When delivery guarantees are met
  • 50. Reliable Multicast • Integrity—a correct process p delivers a message m at most once. • Follows from detection of duplicates and the underlying properties of IP multicast (which uses checksums to expunge corrupted messages). • Validity—if a correct process multicasts a message m, then it will eventually deliver m. • IP multicast has that property. • Agreement—if a correct process delivers a message m, then all other correct processes in group(m) will eventually deliver m. • Agreement requires that a process can always detect missing messages; that in turn means it will always receive a further message that enables it to detect the omission. In the present protocol we assume that a correct process multicasts messages indefinitely. • Agreement also requires that a process can always obtain a missing message, i.e. it is assumed that processes retain a copy of every delivered message indefinitely. 50
  • 51. Uniform Property • The above definition of agreement refers only to the behaviour of correct processes—processes that never fail. But what if a process crashes? In the above algorithm, a process may crash after it has R-delivered a message. • Uniform agreement: if a process, whether it is correct or fails, delivers message m, then all correct processes in group(m) will eventually deliver m. • Uniform agreement is useful in many applications, for example banking servers. If an update is sent to a group of servers and a server process crashes immediately after it delivers the update message, then without uniform agreement a client that accessed that server just before it crashed may observe an update that no other server ever processes. 51
  • 52. What if we reverse the lines “if (q ≠ p) then B-multicast(g, m); end if” and “R-deliver m”? 52
  • 53. Ordered Messages • It is often important that messages be delivered in order; there are three basic types of ordering: • FIFO—(first-in, first-out) if a correct process issues multicast(g, m) and then multicast(g, m′), then every correct process that delivers m′ will deliver m before m′. • Causal—if multicast(g, m) → multicast(g, m′), where → is the happened-before relation induced only by messages sent between the members of g, then any correct process that delivers m′ will deliver m before m′. • Total—if a correct process delivers message m before it delivers m′, then any other correct process that delivers m′ will deliver m before m′. • Hybrids: total-causal, total-FIFO. • Assumption: any process belongs to at most one group. 53
  • 54. Comments on Ordering • Note that FIFO ordering and causal ordering are only partial orders: not all messages are sent by the same sending process, and some multicasts are concurrent, not orderable by happened-before. • In figure 12.12, T1 and T2 show total ordering, F1 and F2 show FIFO ordering, and C1 and C3 show causal ordering. Note that T1 and T2 are delivered in the opposite order to the physical time of message creation: total ordering demands consistency, but not any particular order. 54
  • 55. Figure 12.12 Total, FIFO and causal ordering of multicast messages 55 Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3 – and the otherwise arbitrary delivery ordering of messages. F3 F1 F2 T2 T1 P1 P2 P3 Time C3 C1 C2
  • 56. Reliability? • The definitions of ordered multicast do not imply reliability. – Example: under total ordering, if a correct process p delivers message m and then delivers m′, a correct process q can deliver m without also delivering m′ or any other message ordered after m. • We can also form hybrids of ordered and reliable protocols. – In the literature, reliable totally ordered multicast is often referred to as atomic multicast. – Similarly, reliable causal multicast and reliable versions of the hybrid ordered multicasts can be formed. 56
  • 57. Bulletin Board Example • A bulletin board illustrates the desirability of consistency and, at minimum, FIFO ordering. – Users can best follow references to a user's preceding messages if that user's messages are delivered in FIFO order. Message 25 in figure 12.13 refers to message 24, and message 27 refers to message 23. • Reliable multicast is required if every user is to receive every posting eventually. • Note the further flexibility a Web Board gains by allowing messages to begin threads as replies to a particular message: messages then need not be displayed in the order they were delivered. • If total ordering is used, the item numbers on the left-hand side appear the same to all users, who can then refer unambiguously to “message number 24”. 57
  • 58. Figure 12.13 Display from bulletin board program 58 Bulletin board:os.interesting Item From Subject 23 A.Hanlon Mach 24 G.Joseph Microkernels 25 A.Hanlon Re: Microkernels 26 T.L’Heureux RPC performance 27 M.Walker Re: Mach end
  • 59. Implementing FIFO Ordering • FO-multicast and FO-deliver are achieved with sequence numbers: S_p^g and R_q^g held at process p are used exactly as in the protocol over IP multicast discussed above, and FIFO ordering for the messages from each process is maintained. • If R-multicast is used instead of B-multicast, we obtain reliable FIFO multicast. 59
  • 60. Implementing Total Ordering • The normal approach to total ordering is to assign totally ordered identifiers to multicast messages, using the identifiers to make ordering decisions. • One possible implementation is to use a sequencer process to assign identifiers. See figure 12.14. A drawback of this is that the sequencer can become a bottleneck. • An alternative is to have the processes collectively agree on identifiers. A simple algorithm is shown in figure 12.15. 60
  • 61. Using a Sequencer • A process wishing to TO-multicast a message m to group g attaches a unique identifier id(m) to it. • The messages for g are sent to sequencer(g) as well as to the members of g. (The sequencer may itself be chosen as a member of g.) • Process sequencer(g) maintains a group-specific sequence number s_g, which it uses to assign increasing and consecutive sequence numbers to the messages that it B-delivers. • It announces the sequence numbers by B-multicasting order messages to g. • A message remains in the hold-back queue until it can be TO-delivered according to its sequence number. • If the processes use a FIFO-ordered variant of B-multicast, then the totally ordered multicast is also causally ordered. 61
  • 62. Figure 12.14 Total ordering using a sequencer 62
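A toy version of the sequencer scheme of figure 12.14, including the hold-back behaviour at each member. Identifiers and class names are inventions of the sketch; real message transport is replaced by direct method calls:

```python
class Sequencer:
    """Assigns consecutive sequence numbers in B-delivery order."""

    def __init__(self):
        self.s = 0

    def order(self, msg_id):
        # In the real protocol this pair is B-multicast as an order message.
        self.s += 1
        return (msg_id, self.s)

class Member:
    """Group member: holds each message until its order message arrives
    and all messages with smaller sequence numbers have been delivered."""

    def __init__(self):
        self.pending = {}    # message id -> payload, awaiting an order
        self.orders = {}     # sequence number -> message id
        self.next_seq = 1
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.pending[msg_id] = payload
        self._try_deliver()

    def on_order(self, msg_id, seq):
        self.orders[seq] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # TO-deliver every message whose turn has come.
        while (self.next_seq in self.orders
               and self.orders[self.next_seq] in self.pending):
            mid = self.orders.pop(self.next_seq)
            self.delivered.append(self.pending.pop(mid))
            self.next_seq += 1
```

Two members that receive the same messages and order messages in different arrival orders still deliver in the same total order, which is the point of the scheme.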
  • 63. Variants • The obvious problem with the sequencer-based approach is that the sequencer may become a bottleneck. • Variants – Chang and Maxemchuk [1984] – Kaashoek et al. [1989] • Kaashoek et al. [1989] use hardware-based multicast where available—for example, on Ethernet. • In their simplest variant, processes send the message to be multicast to the sequencer one-to-one. The sequencer then multicasts the message itself, together with the identifier and sequence number. 63
  • 64. ISIS Algorithm for Total Ordering • Processes collectively agree on the assignment of sequence numbers to messages in a distributed manner. • Each process q in g keeps A_q^g, the largest sequence number it has observed so far for g, and P_q^g, its own largest proposed sequence number. • Algorithm: – p B-multicasts <m, i> to g, where i is a unique identifier for m. – Each process q replies to the sender p with a proposed sequence number for the message: P_q^g := max(A_q^g, P_q^g) + 1. – Each process provisionally assigns its proposed sequence number to the message and places it in its hold-back queue, which is ordered with the smallest sequence number at the front. – p collects all the proposed sequence numbers and selects the largest, a, as the next agreed sequence number. It then B-multicasts <i, a> to g. Each process q in g sets A_q^g := max(A_q^g, a) and attaches a to the message (identified by i). It reorders the message in the hold-back queue if the agreed sequence number differs from the proposed one. – When the message at the front of the hold-back queue has been assigned its agreed sequence number, it is transferred to the tail of the delivery queue. Messages that have been assigned their agreed sequence number but are not at the front of the hold-back queue are not yet transferred. 64
  • 65. Second Approach: Figure 12.15 The ISIS algorithm for total ordering (each of P1–P4 returns a proposed sequence number to the sender, which B-multicasts back the agreed sequence number, the largest of the proposals) 65
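The bookkeeping with A_q^g and P_q^g can be sketched as follows, with the sender gathering proposals synchronously (a simplification of the real message exchange; all names are inventions of the sketch):

```python
class ISISProcess:
    """Per-process state for the ISIS agreed-sequence-number scheme.

    A is the largest agreed number observed so far (A_q^g in the slide);
    P is this process's own largest proposed number (P_q^g).
    """

    def __init__(self):
        self.A = 0
        self.P = 0

    def propose(self):
        # Propose max(A, P) + 1 for the newly received message.
        self.P = max(self.A, self.P) + 1
        return self.P

    def learn_agreed(self, agreed):
        self.A = max(self.A, agreed)

def isis_agreed_number(group):
    """The sender collects every proposal and picks the largest as the
    agreed sequence number, then announces it to the group."""
    agreed = max(q.propose() for q in group)
    for q in group:
        q.learn_agreed(agreed)
    return agreed
```

Successive agreed numbers are strictly increasing, which is one half of the correctness argument on the next slide.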
  • 66. Correctness • If every process agrees on the same set of sequence numbers and delivers the messages in the corresponding order, then total order is satisfied. • Correct processes ultimately agree on the same set of sequence numbers; these are monotonically increasing; and no correct process can deliver a message prematurely. • For the last point, assume a message m1 has been assigned its agreed sequence number and has reached the front of the hold-back queue, and let m2 be a message behind it that still has only a proposed sequence number. Then: agreedSequence(m2) ≥ proposedSequence(m2) (by the algorithm above); proposedSequence(m2) > agreedSequence(m1) (since m1 is at the front of the queue). Therefore agreedSequence(m2) > agreedSequence(m1), so m1 may safely be delivered first. 66
  • 67. • This algorithm has higher latency than the sequencer-based algorithm, since three messages pass between the sender and the group before a message can be delivered. • The total ordering chosen by this algorithm is also not guaranteed to be causally or FIFO-ordered: any two messages are delivered in an essentially arbitrary total order, influenced by communication delays. 67
  • 68. Implementing Causal Ordering (Birman et al. [1991]) • Non-overlapping closed groups can have causally ordered multicasts using vector timestamps. • This algorithm orders only the happened-before relation caused by multicasts, and ignores one-to-one messages between processes. • Each process updates its vector timestamp before delivering a message, to maintain the count of causally precedent messages. • Operations: CO-multicast and CO-deliver. 68
  • 69. Implementing Causal Ordering (Birman et al. [1991]) • Logic • When a process pi B-delivers a message from pj, it must place it in the hold-back queue before it can CO-deliver it, until it is assured that it has delivered any messages that causally preceded it. To establish this, pi waits until: – (a) it has delivered any earlier message sent by pj, and – (b) it has delivered any message that pj had delivered at the time it multicast the message. • Both of these conditions can be detected by examining the vector timestamps. 69
  • 70. Figure 12.16 Causal ordering using vector timestamps 70
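Conditions (a) and (b) above translate directly into a test on vector timestamps. The sketch assumes, as in the Birman-style scheme, that a sender increments only its own entry before multicasting (the function name is an invention):

```python
def can_co_deliver(msg_vt, sender, my_vt):
    """CO-delivery test on vector timestamps (sketch).

    msg_vt: vector timestamp attached to the message;
    sender: index j of the sending process pj;
    my_vt:  the receiving process's own vector timestamp.
    (a) msg_vt[sender] == my_vt[sender] + 1: next message expected from pj.
    (b) every other entry of msg_vt <= my_vt: we have delivered
        everything pj had delivered when it multicast the message.
    """
    return (msg_vt[sender] == my_vt[sender] + 1 and
            all(msg_vt[k] <= my_vt[k]
                for k in range(len(my_vt)) if k != sender))
```

A message that causally follows an as-yet-undelivered message fails condition (b) and stays in the hold-back queue until the gap is filled.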
  • 71. • Note that if we substitute the R-multicast primitive in place of B-multicast, then we obtain a multicast that is both reliable and causally ordered. • Furthermore, if we combine the protocol for causal multicast with the sequencer-based protocol for totally ordered delivery, then we obtain message delivery that is both causal and total. 71
  • 72. Overlapping Groups • The assumption of non-overlapping groups is not satisfactory in real scenarios. • We have to consider global orders in which, if a message m is multicast to group g and a message m′ is multicast to group g′, then both messages are addressed to the members of g ∩ g′. 72
  • 73. Overlapping Groups • Global FIFO ordering – If a correct process issues multicast(g, m) and then multicast(g′, m′), then every correct process in g ∩ g′ that delivers m′ will deliver m before m′. • Global Causal Ordering – If multicast(g, m) → multicast(g′, m′), where → is the happened-before relation induced by any chain of multicast messages, then any correct process in g ∩ g′ that delivers m′ will deliver m before m′. • Pairwise Total Ordering – If a correct process delivers message m sent to g before it delivers m′ sent to g′, then any correct process in g ∩ g′ that delivers m′ will deliver m before m′. • Global Total Ordering – Let ‘<’ be the relation of ordering between delivery events. We require that ‘<’ obeys pairwise total ordering and that it is acyclic—under pairwise total ordering alone, ‘<’ is not guaranteed to be acyclic. 73
  • 74. • Consensus is the problem of a group of processes agreeing on a value that is proposed by one of the processes. • The classic formulation of this problem is the Byzantine Generals problem: a decision whether multiple armies should attack or retreat, assuming that united action will be more successful than some attacking and some retreating. • Another example might be spacecraft controllers deciding whether to proceed or abort. Failure handling during consensus is a key concern. 74 Consensus and Related Problems
  • 75. Agreement Problems • Consensus • Byzantine Generals • Interactive Consistency 75 Consensus and Related Problems
  • 76. Consensus System Model 1. System has collection of processes Pi (i=1,2,…..,n) 2. Processes communicate through message passing 3. Consensus is reached even in the presence of failures (f processes may fail). 4. Communication is reliable but processes may fail. 76
  • 77. Consensus Process 1. Each process begins in an undecided state 2. A value is proposed from a set of values 3. Processes communicate with each other 4. Each process sets the state of a decision variable di • Figure 12.17 shows three processes engaged in a consensus algorithm. Two processes propose “proceed.” One proposes “abort,” but then crashes. The two remaining processes decide proceed. 77
  • 78. Figure 12.17 Consensus for three processes: P1 proposes v1 = proceed, P2 proposes v2 = proceed, and P3 proposes v3 = abort but crashes; the consensus algorithm leaves P1 and P2 deciding d1 = d2 = proceed 78
  • 79. Requirements for Consensus • Termination—eventually each correct process sets its decision variable • Agreement—the decision value of all correct processes is the same • Integrity—if the correct processes all proposed the same value, then any correct process in the decided state has that value. 79
  • 80. Byzantine Generals • The Byzantine Empire was fraught with frequent infighting among rival military leaders for control of the empire. Where several generals had to cooperate to achieve an objective, a treacherous general could weaken or even eliminate a rival by retreating himself and encouraging another general to retreat, while encouraging the rival to attack. Without the expected support, the rival was likely to be defeated. The Byzantine Generals problem concerns decision making in anticipation of an attack. 80
  • 81. Formal Statement of Problem • Here is the Byzantine Generals problem: – Three or more generals must agree to attack or retreat – One general, the commander, issues the order – Other generals, the lieutenants, must decide to attack or retreat – One or more generals may be treacherous – A treacherous general tells one general to attack and another to retreat • Difference from consensus is that a single process supplies the value to agree on 81
  • 82. Byzantine General Requirements • Termination—eventually each correct process sets its decision variable • Agreement—the decision variable of all correct processes is the same • Integrity—if the commander is correct, then all correct processes agree on the value that the commander has proposed 82
  • 83. Interactive Consistency • A problem related to the Byzantine Generals problem is interactive consistency. In this, all correct processes agree on a vector of values, one for each process. This is called the decision vector. Required: – Termination—eventually each correct process sets its decision variable – Agreement—the decision vector of all correct processes is the same – Integrity—if Pi is correct, then all correct processes decide on vi as the ith component of their vector. 83
  • 84. Linking the problems • Consensus (C), Byzantine Generals (BG), and Interactive Consistency (IC) are all problems concerned with making decisions in the context of arbitrary or crash failures. • We can sometimes generate a solution for one problem in terms of another. For example, we can derive IC from BG by running BG N times, once with each process acting as commander. 84
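The IC-from-BG construction can be sketched as follows. The `BG` function here is a stand-in for any correct BG solution (it assumes no faulty processes, so every process simply adopts the commander's value); names are illustrative.

```python
# Derive Interactive Consistency by running Byzantine Generals N times,
# once with each process acting as commander proposing its own value.

def BG(commander, value, processes):
    """Stand-in for a BG solution: with no faulty processes, every
    process agrees on the commander's value."""
    return {p: value for p in processes}

def interactive_consistency(values, processes):
    vectors = {p: [None] * len(processes) for p in processes}
    for i, commander in enumerate(processes):
        agreed = BG(commander, values[commander], processes)
        for p in processes:
            vectors[p][i] = agreed[p]      # ith component of decision vector
    return vectors

procs = ["p1", "p2", "p3", "p4"]
vals = {"p1": 1, "p2": 2, "p3": 3, "p4": 4}
vecs = interactive_consistency(vals, procs)

# Agreement: the decision vector is the same at every correct process.
assert all(v == [1, 2, 3, 4] for v in vecs.values())
```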
  • 85. Derived Solutions • We can derive C from IC by running IC to produce a vector of values at each process, then applying the same function to the vector’s values to derive a single value. • We can derive BG from C by – The commander sends its proposed value to itself and to each remaining process – All processes run C with the values they received – Each process adopts the consensus value as its BG decision 85
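The C-from-IC step can be sketched with a majority function applied to the agreed decision vector. The tie-breaking rule below is an illustrative choice; any deterministic function shared by all correct processes works.

```python
# Derive Consensus from Interactive Consistency: every correct process
# holds the same decision vector, so applying the same function to it
# yields the same single value everywhere.
from collections import Counter

def majority(vector):
    """Most frequent value, breaking ties by taking the smallest."""
    counts = Counter(v for v in vector if v is not None)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

# The same vector at every correct process (IC is assumed already solved).
decision_vector = ["attack", "retreat", "attack"]
assert majority(decision_vector) == "attack"
```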
  • 86. Consensus in a Synchronous System • Figure 12.18 shows an algorithm that reaches consensus in a synchronous system in which up to f processes may crash. The algorithm proceeds in f+1 rounds; in each round, every correct process multicasts to the others any values it has not yet sent. • The algorithm guarantees that all surviving correct processes are in a position to agree. • Note: any algorithm that tolerates up to f crash failures requires at least f+1 rounds to guarantee agreement. 86
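A simulation sketch of this algorithm (the crash model is simplified: a crashed process drops out at a round boundary, whereas a real crash can interrupt a multicast partway through; names are illustrative):

```python
# Each process keeps a set of known values; for f+1 rounds it multicasts
# any values it has not yet sent; finally it decides on the minimum of
# its set, so all surviving correct processes decide identically.

def synchronous_consensus(initial, f, crash_after_round=None):
    """initial: {pid: proposed value};
    crash_after_round: {pid: last round in which that process takes part}."""
    crash_after_round = crash_after_round or {}
    values = {p: {v} for p, v in initial.items()}
    sent = {p: set() for p in initial}
    for rnd in range(1, f + 2):                       # f + 1 rounds
        alive = [p for p in initial
                 if crash_after_round.get(p, f + 1) >= rnd]
        outgoing = {}
        for p in alive:                               # values not yet multicast
            outgoing[p] = values[p] - sent[p]
            sent[p] |= outgoing[p]
        for p in alive:                               # synchronous delivery
            for q in alive:
                values[q] |= outgoing[p]
    survivors = [p for p in initial if p not in crash_after_round]
    return {p: min(values[p]) for p in survivors}

decisions = synchronous_consensus({"p1": 3, "p2": 1, "p3": 2}, f=1,
                                  crash_after_round={"p1": 1})
assert len(set(decisions.values())) == 1              # agreement among survivors
```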
  • 87. Figure 12.18 Consensus in a synchronous system 87
  • 88. Limits for solutions to Byzantine Generals • Some cases of the Byzantine Generals problem have no solution. • Lamport et al found that with only 3 processes, one of which may be faulty, there is no solution. • Pease et al generalized this: if the total number of processes is less than three times the number of failures plus one (N < 3f + 1), there is no solution. Thus there is a solution with 4 processes and 1 failure, using two rounds: in the first, the commander sends its value to each lieutenant; in the second, each lieutenant relays the value it received to the others. 88
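The two-round case for N = 4, f = 1 can be sketched for the harder scenario of a faulty commander (the values follow the pattern of Figure 12.20; the decision rule, majority with an agreed default when there is none, is the standard one, but the names here are illustrative):

```python
# Round 1: the commander sends a value to each lieutenant.
# Round 2: each correct lieutenant relays the value it received.
# Each lieutenant then decides by majority over the three values it holds.
from collections import Counter

def lieutenant_decision(received):
    value, n = Counter(received).most_common(1)[0]
    return value if n > 1 else None          # None = agreed default, no majority

# A faulty commander p1 sends conflicting values to the three lieutenants.
round1 = {"p2": "u", "p3": "w", "p4": "v"}

# Correct lieutenants relay faithfully, so each ends up holding all three
# values: its own plus the two relayed ones.
holds = {lt: sorted([round1[lt]] + [round1[o] for o in round1 if o != lt])
         for lt in round1}
decisions = {lt: lieutenant_decision(holds[lt]) for lt in holds}

# All loyal lieutenants hold the same multiset and reach the same decision,
# so agreement holds even though the commander is faulty.
assert len(set(decisions.values())) == 1
```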
  • 89. Figure 12.19 Three Byzantine generals (faulty processes are shown coloured) 89
  • 90. Figure 12.20 Four Byzantine generals (faulty processes are shown coloured) 90
  • 91. Asynchronous Systems • The solutions above to the consensus and Byzantine Generals problems apply only to synchronous systems. • Fischer et al found that no algorithm can guarantee consensus in an asynchronous system, even with a single crash failure. • This impossibility is circumvented in practice by masking faults or by using failure detectors. • There is also a partial solution, assuming an adversary process, based on introducing random values into the processes to prevent an effective thwarting strategy. This does not always reach consensus. 91
  • 92. Discussion Question • Why can’t consensus be guaranteed in an asynchronous environment? 92
  • 93. Bibliography • George Coulouris, Jean Dollimore and Tim Kindberg, Distributed Systems, Concepts and Design, Addison Wesley, Fourth Edition, 2005 • Figures from the Coulouris text are from the instructor’s guide and are copyrighted by Pearson Education 2005. • Fischer et al, Pease et al, and Lamport et al: See references in Coulouris text, pp. 859 ff. 93