2. Topics
• How do processes coordinate their actions
and agree on shared values?
• Mutual Exclusion Agreements
• Distributed Elections
• Multicast Communication
• Consensus
• Byzantine Agreement
• Interactive Consistency
3. The First Space Shuttle Flight -
1981
• The US Space Shuttle program used redundant systems
to manage the probability of failure in space, where no
repair crews, spare parts, or down-time were available. The
computer flight control system had four identical
computers, so that if one failed, the remaining three
could still determine a correct action by voting, even if
a second one subsequently failed.
• In addition, there was a backup system, developed by a
different contractor on different hardware with a
different operating system, ready to take over if the
primary system failed completely.
4. The Coordination Challenge
• Developing these five computer systems had many
challenges, including detecting faults, isolating problems,
and switching from one configuration to another. The
first space shuttle mission was delayed due to a failure in
the coordination and agreement protocols between the
redundant main flight system and its backup.
5. A Famous Software Failure
• On April 10, 1981, about 20 minutes prior to the
scheduled launching of the first flight of America's Space
Transportation System, astronauts and technicians
attempted to initialize the software system which "backs-
up" the quad-redundant primary software system ...and
could not. In fact, there was no possible way, it turns out,
that the Backup Flight Control System in the fifth onboard
computer could have been initialized properly with the
Primary Avionics Software System already executing in the
other four computers.
6. Detecting Failures
• Detecting, locating, and isolating a failure in a distributed
computer system is a challenge. The design of any
distributed algorithm should allow for fault detection and
consider failure mitigation procedures.
• Figure 12.1 gives an example of a failure in a network—a
crashed router that divides a network into two non-
communicating partitions. One common protocol for
detecting this type of network failure is a timeout
mechanism.
8. Detecting Failures
• Assumptions
• Each pair of processes is connected by a reliable
channel
• Network components may fail, but such failures are
masked by reliable communication protocols
• Processes fail only by crashing, unless stated
otherwise
9. Failure Detector
• A Failure Detector is a service that processes queries
about whether a particular process has crashed. It is often
implemented by a local object known as a Local Failure
Detector.
• Failure detectors are not necessarily accurate. For
example, a process that timed-out after 255 seconds
might have succeeded if allowed to proceed for 256
seconds. Most failure detectors fall into the category of
Unreliable Failure Detector.
• Although a failure detector acts for a collection of
processes, it may sometimes give different responses to
different processes.
10. Reliable Failure Detector
• A Reliable Failure Detector is a service that is always
accurate in detecting a process failure. A failure detector
is only as good as the information that is provided to it by
a process or about a process. Some categories of faults
lend themselves to easy detection, while others do not.
• As a human example, consider the two questions: “Should
I study tonight?” and “Is that light turned on?” It is usually
easier to get a definitive answer for the second question
than the first.
11. Algorithm for implementing unreliable
failure detector
• Every T seconds, each process p sends a “p is here” message to
every other process q.
• If q does not receive a “p is here” message within T + D seconds
(D is an estimate of the transmission delay) of the last one, then
the local failure detector at q reports that p is Suspected.
• However, if it subsequently receives a “p is here” message, then
it reports to q that p is OK.
If we choose small values of T and D, the failure detector is likely to suspect
non-crashed processes repeatedly. If T and D are large, crashed processes will
often still be reported as Unsuspected.
A reliable failure detector requires that the system be synchronous.
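The heartbeat scheme above can be sketched in a few lines. This is an illustrative toy, not a production detector: the class name is my own, and the explicit `now` arguments stand in for real clocks and message channels so the logic is testable.

```python
import time

class UnreliableFailureDetector:
    """Sketch: suspect a process if no 'p is here' heartbeat
    has arrived within T + D seconds (names are illustrative)."""

    def __init__(self, T, D):
        self.timeout = T + D          # suspicion threshold
        self.last_heard = {}          # process id -> time of last heartbeat

    def heartbeat(self, p, now=None):
        """Record receipt of a 'p is here' message."""
        self.last_heard[p] = time.time() if now is None else now

    def query(self, p, now=None):
        now = time.time() if now is None else now
        if p not in self.last_heard:
            return "Unsuspected"      # no evidence either way yet
        if now - self.last_heard[p] > self.timeout:
            return "Suspected"        # possibly wrong: message merely late
        return "Unsuspected"
```

Note that a later heartbeat silently rehabilitates a Suspected process, which is exactly why this detector is unreliable.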
12. Distributed Mutual Exclusion
• There is a need for distributed processes to
coordinate shared activities. For example, it is not
usually acceptable for two applications to update the
same record in a database file at the same time.
• One possible approach to preventing this is the Unix
daemon lockd, which places a file lock on a text file
while it is being written by a process so that no other
process can write to that file until the lock is
released.
13. Resource Managers
• In the case of lockd, the operating system functions
as a server or resource manager to provide the
service. Similar functions are routine on networks
such as Ethernets. A resource manager can keep
track of locks, simplifying the process.
• It is desirable to have a generic mechanism for
distributed mutual exclusion that does not require a
resource manager, i.e., one in which peer processes
coordinate their actions among themselves.
14. Mutual Exclusion Algorithms
• Mutual Exclusion Algorithms define critical sections
and allow only one process to access a resource in a
critical region at one time. There are three basic
operations for the algorithms:
– enter( ) to access the critical region
– resourceAccesses( ) to use the resources
– exit( ) to leave the critical section
15. Mutual Exclusion Requirements
• Mutual Exclusion Algorithms have two basic
requirements:
– (1) Safety—at most one process may execute in
the critical section at one time
– (2) Liveness—all requests to enter and exit the
critical section must eventually succeed. This
implies freedom from deadlock and starvation.
16. Fairness Conditions
• Starvation is the indefinite postponement of entry
for a process that has requested it. It is a fairness
condition. (Absence of starvation is fairness)
• Ordering is another fairness condition.
– (3) If one request to enter a critical section happened-
before another (ordering by actual physical time is not
possible), then entry is granted in that order.
Figure 12.2 shows a server managing happened-before
ordering.
17. Figure 12.2 Managing a mutual
exclusion token for processes
[Diagram: a server holds a queue of pending requests; processes p1–p4 send
“1. Request token”, the server responds with “3. Grant token”, and the current
holder sends “2. Release token”.]
18. Performance Criteria
• Mutual exclusion algorithms are evaluated by the
following criteria:
– Bandwidth consumed—proportional to the number of
messages sent for each entry and exit.
– Client delay incurred by a process at each entry and
exit.
– Throughput—the rate at which processes can use the
critical section. Synchronization delay is the time
between one process exiting the critical section and
the next entering it. Shorter delays imply greater
throughput.
19. Central Server Algorithm
• The simplest way to achieve mutual exclusion is to
establish a server that grants permission to enter the
critical section. A process requests entry and waits for a
reply. Conceptually, the reply is a token that grants
permission. If no other process holds the token, it is
granted immediately. Otherwise the request is queued,
and the process must wait until the token becomes
available after earlier requests have been served.
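The server’s bookkeeping can be sketched as follows. Names are illustrative, and a real implementation would run this logic behind a message-handling loop rather than direct method calls.

```python
from collections import deque

class TokenServer:
    """Sketch of the central-server mutual exclusion algorithm:
    a single token is granted to one process at a time, with
    later requests queued in FIFO order."""

    def __init__(self):
        self.holder = None        # process currently holding the token
        self.queue = deque()      # pending requests, oldest first

    def request(self, pid):
        """Returns True if the token is granted immediately."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)    # must wait for earlier requests
        return False

    def release(self, pid):
        """Returns the next process granted the token, or None."""
        assert self.holder == pid, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```

The two-message entry (request, grant) and one-message exit (release) make the client delay one round-trip, at the cost of the server being a bottleneck and single point of failure.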
20. Ring-based Algorithm
• A simple way to arrange mutual exclusion is to arrange requests in
a logical ring.
• Each process has a link to the next process.
• A token is passed around the ring.
• If the process receiving the token needs access to the critical
section, it enters the section, otherwise it passes the token to the
next process.
• Figure 12.3 shows a ring-based algorithm graphically.
• It is straightforward to verify that the safety and liveness
requirements are met by this algorithm; however, the
ordering condition is not satisfied, since the token visits
processes in ring order rather than request order.
• Network bandwidth is consumed continuously, even when no
process requires entry.
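A toy simulation of the token’s circulation (the helper name and call signature are my own) makes the entry order concrete: requests are served in ring order, not in the order they were made.

```python
def circulate_token(ring, wants_entry, start=0, max_hops=100):
    """Sketch of the ring algorithm: the token visits processes in
    ring order; a process that wants the critical section enters it
    for one visit, then passes the token on. Returns the order in
    which processes entered."""
    entered = []
    pending = set(wants_entry)
    i = start
    for _ in range(max_hops):
        p = ring[i % len(ring)]
        if p in pending:
            entered.append(p)     # enter critical section, then exit
            pending.discard(p)
        if not pending:
            break
        i += 1                    # pass token to the next process
    return entered
```

Even if p3 requested entry before p2, p2 enters first whenever the token reaches it first, which is why ordering is not guaranteed.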
21. Figure 12.3 A ring of processes
transferring a mutual exclusion token
[Diagram: processes p1, p2, p3, p4, …, pn arranged in a ring, passing the
token from each process to its neighbor.]
22. Using Multicast and Logical Clocks
• Ricart and Agrawala developed an algorithm to
implement mutual exclusion between N peer processes
based on multicast.
• A process that wants to enter a critical section multicasts
a request message, and can enter only when all other
processes have replied to it.
• Figure 12.4 shows the algorithm, and Figure 12.5
illustrates how multicast requests are synchronized. The
numbers 34 and 41 are logical timestamps; since 34 is
earlier, that request is granted access first.
23. Figure 12.4 Ricart and Agrawala’s
algorithm
On initialization
state := RELEASED;
To enter the section
state := WANTED;
Multicast request to all processes; request processing deferred here
T := request’s timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
then
queue request from pi without replying;
else
reply immediately to pi;
end if
To exit the critical section
state := RELEASED;
reply to any queued requests;
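The heart of Figure 12.4 is the receipt rule, which compares totally ordered (timestamp, process id) pairs to decide whether to defer a reply. A minimal sketch (the helper name is my own; Python compares the pairs lexicographically, giving the required total order):

```python
def defer(state, my_pair, req_pair):
    """Sketch of the Ricart-Agrawala receipt rule: defer the reply if
    we hold the section, or we want it and our (timestamp, pid) pair
    is smaller, i.e. our own request takes priority."""
    return state == "HELD" or (state == "WANTED" and my_pair < req_pair)
```

Because the pairs are totally ordered, two processes that both WANT the section can never defer to each other, which is exactly the safety argument on the next slide.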
25. • Safety
– If pi and pj were both to enter the critical section, then
each would have had to reply to the other. Since the
pairs <Ti, pi> are totally ordered, this is not possible.
• Liveness?
• Ordering?
• Gaining access to the critical section takes 2(N – 1)
messages.
• With hardware support for multicast, only N messages
are needed.
• The client delay is a round-trip time.
26. Efficiency
• The multicast algorithm improves on the ring
algorithm by avoiding the passing of tokens to
inactive processes. It also requires only a single
message transmission time for entry instead of a round
trip. But it still has a number of inefficiencies.
• Maekawa developed a voting algorithm (Figure 12.6)
that requires only a subset of the processes to grant
access, reducing the entry cost. Unfortunately, this
algorithm is deadlock-prone; Sanders later adapted it
to avoid deadlocks.
27. Figure 12.6 Maekawa’s algorithm
– part 1
On initialization
state := RELEASED;
voted := FALSE;
For pi to enter the critical section
state := WANTED;
Multicast request to all processes in Vi;
Wait until (number of replies received = K);
state := HELD;
On receipt of a request from pi at pj
if (state = HELD or voted = TRUE)
then
queue request from pi without replying;
else
send reply to pi;
voted := TRUE;
end if
For pi to exit the critical section
state := RELEASED;
Multicast release to all processes in Vi;
On receipt of a release from pi at pj
if (queue of requests is non-empty)
then
remove head of queue – from pk, say;
send reply to pk;
voted := TRUE;
else
voted := FALSE;
end if
28. Fault Tolerance
• What happens when messages are lost?
• What happens when a process crashes?
• None of the preceding algorithms tolerate the
loss of messages with unreliable channels. If
there is a reliable failure detector available, a
protocol would have to be developed that would
allow for failures at any point, including during a
recovery protocol.
29. Elections
• An algorithm to choose a process to play a role is
called an election algorithm. For example, a group of
peers may select one of themselves to act as server for
a mutual exclusion algorithm.
• A process that initiates a run of an election algorithm
calls the election. Multiple elections could be called at
the same time.
• A process that is engaged in an election algorithm at a
particular time is a participant; at other times, when it is
not engaged, it is a non-participant.
30. Assumptions
• Processes are arranged in a logical ring
– Each process pi has a communication channel to the next
process in the ring, p(i+1) mod N
• The system is asynchronous and there are no
failures
• Goal
– To elect the process with the largest identifier
31. Election Requirements
• A criterion is established for deciding an election. The
text refers to this as choosing the process with the largest
identifier, where the “identifier” and what counts as
“largest” are defined by the criterion.
• Safety—a participant process pi has electedi = ⊥ or
electedi = P, where P is the non-crashed process at the
end of the run with the largest identifier.
• Liveness—all processes pi participate and eventually
set electedi ≠ ⊥, or crash.
32. Performance of an Election Algorithm
• Network Bandwidth Utilization
– Proportional to the total number of messages sent
• Turnaround Time
– Number of serialized message transmission times
between the initiation and termination of a run
33. The Algorithm
• Initially, every process is marked as a non-participant.
• Any process can begin the algorithm; it marks itself a participant,
– places its identifier in an election message and sends it to its clockwise
neighbor.
• A receiving process compares the identifier in the message with its own.
– If the received identifier is greater, it simply forwards the election message to its
clockwise neighbor.
– If the received identifier is smaller and the receiver is not a participant, it
substitutes its own identifier in the message and forwards it to its clockwise
neighbor.
– If it is already a participant, it does not forward the message.
(On forwarding an election message in any case, a process marks itself a participant.)
• If the received identifier is that of the receiver itself, then this process's identifier
must be the greatest, and it becomes the coordinator.
– It marks itself a non-participant and sends an elected message to its neighbor.
• When a process pi receives an elected message, it marks itself a non-participant,
– sets its variable electedi to the identifier in the message and, unless it is the new
coordinator, forwards the elected message to its neighbor.
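The steps above can be condensed into a sketch that also counts messages. The function name and the return of a message count are my own framing; identifiers are listed in clockwise ring order.

```python
def ring_election(ids, starter):
    """Sketch of the ring-based election described above:
    ids lists identifiers in clockwise ring order; starter is the
    index of the initiating process. Returns (elected identifier,
    total messages sent)."""
    n = len(ids)
    carried = ids[starter]      # identifier in the circulating election message
    i = (starter + 1) % n
    messages = 0
    while True:
        messages += 1           # one election message forwarded to process i
        if carried == ids[i]:
            break               # own identifier returned: i is the coordinator
        carried = max(carried, ids[i])   # forward the larger identifier
        i = (i + 1) % n
    messages += n               # elected message circulates once around
    return carried, messages
```

The worst case, with the highest identifier at the starter’s anticlockwise neighbor, yields the 3N – 1 total derived on the analysis slide.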
34. Figure 12.7 A ring-based election in
progress
[Diagram: processes with identifiers 1, 3, 4, 9, 15, 17, 24 and 28 arranged in a
ring; the circulating election message currently carries identifier 24.]
Note: The election was started by process 17.
The highest process identifier encountered so far is 24.
Participant processes are shown darkened
35. Analysis
• When a single process starts the election, the worst case is
when its anti-clockwise neighbor is the process with the
highest identifier.
• It takes N – 1 messages to reach that neighbor; it will not
announce its election until its own identifier has traveled
around the ring in a further N messages; finally, N messages
announce the election.
• So 3N – 1 messages in all.
• The turnaround time is also 3N – 1 message transmission times.
• The algorithm tolerates no failures and hence is of little
practical use, but it is useful for understanding the properties
of election algorithms.
36. The Bully Algorithm
• Synchronous system, message delivery between processes is
reliable.
• Processes can crash during operation, and faults are detected by
timeouts.
• Unlike the ring-based algorithm, in which a process knows
only how to communicate with its neighbor, each process is
assumed to know which processes have higher identifiers
and to be able to communicate with all such processes.
• Election messages announce an election
• Answer messages reply to an election message
• Coordinator messages announce the elected process
• An election message contains the identifier of its sender. An
answer message is sent by a process with a higher identifier.
A process that receives no answer before its timeout
concludes that it has the highest identifier and becomes the
coordinator.
37. The Bully Algorithm
• A process starts an election when it detects failure of the coordinator through a
timeout (several processes may detect this concurrently). The local failure
detector uses the timeout T = 2T_trans + T_process.
• The process that knows it has the highest identifier can elect itself as
coordinator simply by sending a coordinator message to all processes with
lower identifiers.
• A process with a lower identifier begins an election by sending an election
message to the processes with higher identifiers and awaiting answer messages
in response.
• If none arrives within time T, the process considers itself the coordinator and
sends a coordinator message to all processes with lower identifiers.
• Otherwise, the process waits a further period T' for a coordinator message to
arrive from the new coordinator. If none arrives, it begins a new election.
• If pi receives a coordinator message, it sets its variable electedi to the identifier
of the coordinator contained within it and treats that process as the coordinator.
• If pi receives an election message, it sends back an answer message and begins
another election, unless it has already begun one.
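Under the idealized assumption that timeouts are accurate and the set of live processes is fixed for the duration of the run, the election outcome reduces to the following sketch. The function name and the recursive framing are mine; the recursion mirrors how each answering process starts its own election.

```python
def bully_election(alive_ids, starter):
    """Sketch of the bully algorithm's outcome: the starter sends
    election messages to all higher identifiers; if none answers
    within the timeout it wins, otherwise any live higher process
    answers and runs its own election. Returns the coordinator."""
    higher = [p for p in alive_ids if p > starter]
    if not higher:
        return starter            # no answer within T: starter wins
    # an answering process takes over and runs its own election;
    # recursing from the lowest answerer still reaches the maximum
    return bully_election(alive_ids, min(higher))
```

The elected coordinator is always the highest live identifier, which is what makes the takeover-on-recovery behavior on the next slide ("Why Bully?") correct.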
38. Figure 12.8 The bully algorithm
[Diagram, in four stages, showing the election of coordinator p2 after the
failure of p4 and then p3: election messages travel to higher-numbered
processes, answer messages come back from the live ones, and after the
timeouts expire p2, the highest surviving process, eventually sends the
coordinator message.]
39. Why Bully?
When a crashed process is replaced or recovers, it begins
an election. If it has the highest identifier, it decides that
it is the coordinator and announces this to the other
processes, even though a working coordinator may
already exist; it “bullies” the current coordinator out of
its role.
40. Multicast Communication
• Group or multicast communication requires
coordination and agreement:
• Agreement on the set of messages that every
process in the group should receive
• Ordering on the delivery of those messages
41. Multicast Communication
• Group Communication is challenging even when all
members of the group are static and aware of each
other.
• Dynamic groups, with processes joining and leaving
the group, are even more challenging. Most of the
challenges are concerned with efficiency and delivery
guarantees.
• Efficiency concerns include minimizing overhead and
increasing throughput and bandwidth utilization, using
hardware support wherever it is available.
42. Multicast Communication
• Delivery guarantees ensure that operations are
completed.
• If a process simply performs multiple one-to-one sends
to the other processes, there is no way to provide
delivery guarantees: if the sender fails halfway through,
some processes receive the message while others do not.
• The relative ordering of two such messages is also
undefined.
• IP multicast offers no reliability or delivery guarantees,
but stronger multicast guarantees can be built on top of
it.
43. Closed and Open Groups
• For multicast communications, a group is said to
be closed if only members of the group can
multicast to it. A process in a closed group sends
to itself any messages to the group. (See figure
12.9)
• A group is open if processes outside the group
can send to it. Some algorithms assume closed
groups while others assume open groups.
45. Basic Multicast
• Unlike IP multicast, basic multicast guarantees that correct
processes will eventually deliver the message, as long as the
multicaster does not crash.
To B-multicast(g, m): for each process p ∈ g, send(p, m);
On receive(m): B-deliver(m) at p.
• A naive acknowledgement scheme suffers from ack-implosion:
buffers at the sender overflow and some acks are dropped, so
messages are retransmitted, generating still more acks.
46. Reliable Multicast
• Simple multicasting is sending a message to every
process that is a member of a defined group. Reliable
multicasting requires these properties:
• Integrity—a correct process p delivers a message m at
most once.
• Validity—if a correct process multicasts a message m,
then it will eventually deliver m.
• Agreement—if a correct process delivers a message m,
then all other correct processes in group(m) will
eventually deliver m.
47. Reliable multicast algorithm
Agreement follows from the fact that every correct process B-multicasts the
message to the other processes after it has B-delivered it. If a correct
process does not R-deliver the message, then this can only be because it
never B-delivered it. That in turn can only be because no other correct
process B-delivered it either; therefore none will R-deliver it.
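The argument above can be made concrete with a small sketch of R-multicast built over B-multicast. Class names and the toy synchronous `Net` are my own; the essential step is that each process re-multicasts a message on first B-delivery, before R-delivering it.

```python
class RMulticast:
    """Sketch of reliable multicast over B-multicast: on first
    B-delivery each process re-multicasts the message before
    R-delivering it, so a message delivered anywhere is
    eventually delivered everywhere."""

    def __init__(self, pid, group):
        self.pid, self.group = pid, group
        self.received = set()          # messages already B-delivered
        self.delivered = []            # R-delivered messages, in order

    def r_multicast(self, net, m):
        net.b_multicast(self.group, (self.pid, m))

    def b_deliver(self, net, sender, m):
        if m in self.received:
            return                     # duplicate: ignore
        self.received.add(m)
        if sender != self.pid:         # re-multicast before delivering
            net.b_multicast(self.group, (self.pid, m))
        self.delivered.append(m)       # R-deliver

class Net:
    """Toy network: B-multicast delivers to all members immediately."""
    def __init__(self):
        self.procs = {}
    def b_multicast(self, group, pkt):
        sender, m = pkt
        for pid in group:
            self.procs[pid].b_deliver(self, sender, m)
```

Each message is re-multicast O(N) times, which is the inefficiency the IP-multicast-based protocol on the next slide is designed to avoid.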
48. Reliable Multicast over IP Multicast
• Ingredients: IP multicast, piggybacked acknowledgements, negative
acknowledgements, closed groups.
• Each process p maintains a sequence number S_g^p for each group g to which it
belongs. Initially it is zero.
• Each process also records R_g^q, the sequence number of the latest message it
has delivered from process q that was sent to group g.
• For p to R-multicast to g, it piggybacks onto the message the value S_g^p and
acknowledgements of the form <q, R_g^q>. An acknowledgement conveys, for
some sender q, the sequence number of the latest message from q destined for g
that p has delivered since it last multicast a message.
• The multicaster p then IP-multicasts the message with its piggybacked values to
g, and increments S_g^p by one.
• A process r R-delivers a message destined for g bearing the sequence number S
from p if and only if S = R_g^p + 1, and it increments R_g^p by one immediately
after delivery.
• If an arriving message has S ≤ R_g^p, then r has delivered the message before
and discards it.
• If S > R_g^p + 1, or if R > R_g^q for an enclosed acknowledgement <q, R>, then
there are one or more messages that r has not yet received. It keeps any message
with S > R_g^p + 1 in a hold-back queue until the intervening messages have
been delivered.
• It requests missing messages by sending negative acknowledgements, either to
the original sender or to a process from which it has received an
acknowledgement <q, R_g^q> with R_g^q no less than the required sequence
number.
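The per-sender delivery rule above can be sketched as follows. This toy models a single group and omits the piggybacked acknowledgements and the actual NACK transmission; class and method names are my own.

```python
class RIPMember:
    """Sketch of the delivery rule for reliable multicast over IP:
    deliver the message with sequence number s from p iff
    s == R[p] + 1; hold back larger s; discard s <= R[p] as
    duplicates; report gaps for negative acknowledgements."""

    def __init__(self):
        self.R = {}              # sender -> last delivered sequence number
        self.holdback = {}       # (sender, seq) -> message body
        self.delivered = []

    def receive(self, sender, s, m):
        if s <= self.R.get(sender, 0):
            return               # duplicate: already delivered
        self.holdback[(sender, s)] = m
        # drain consecutive messages from this sender
        while (sender, self.R.get(sender, 0) + 1) in self.holdback:
            nxt = self.R.get(sender, 0) + 1
            self.delivered.append(self.holdback.pop((sender, nxt)))
            self.R[sender] = nxt

    def missing(self, sender, latest):
        """Sequence numbers to request via negative acknowledgement,
        given that 'latest' is known to exist (e.g. from a piggybacked
        acknowledgement)."""
        return [s for s in range(self.R.get(sender, 0) + 1, latest + 1)
                if (sender, s) not in self.holdback]
```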
49. Figure 12.11 The hold-back queue for
arriving multicast messages
[Diagram: incoming messages enter message processing; a message waits in the
hold-back queue until its delivery guarantees are met, then moves to the
delivery queue and is delivered.]
50. Reliable Multicast
• Integrity—a correct process p delivers a message m at most once.
• Follows from detection of duplicates and underlying properties of IP-
multicast (uses checksums to expunge corrupted messages)
• Validity—if a correct process multicasts a message m, then it will
eventually deliver m.
• IP-multicast has that property
• Agreement—if a correct process delivers a message m, then all
other correct processes in group(m) will eventually deliver m.
• Agreement requires that a process can always detect missing messages,
which in turn means it must always receive some further message that
enables it to detect the omission.
• In the present protocol, we assume that each correct process multicasts
messages indefinitely.
• Second, a process must always be able to obtain a missing message,
i.e., processes are assumed to retain a copy of every message they have
delivered, indefinitely.
51. Uniform Property
The above definition of agreement refers only to the behavior of correct
processes, that is, processes that never fail. But what if a process crashes?
• In the above algorithm, a process may crash after it has R-delivered a
message but before re-multicasting it.
• Uniform agreement: if a process, whether it is correct or fails, delivers
message m, then all correct processes in group(m) will eventually
deliver m.
• Uniform agreement is useful in many applications.
• Example: banking servers. If an update is sent to a group of servers and a server
process crashes immediately after it delivers the update message, then a client
that accessed that server just before it crashed may have observed an update
that no other server ever processes, unless uniform agreement holds.
52. What if we reverse the lines
‘R-deliver m’
and
‘if (q ≠ p) then B-multicast(g, m); end if’?
53. Ordered Messages
It is often important that messages be delivered in order. There are three basic
types of ordering:
• FIFO—(first-in, first-out) if a correct process issues multicast(g, m) and then
multicast(g, m’), then every correct process that delivers m’ will deliver m
before m’.
• Causal—if multicast(g, m) → multicast(g, m’), where → is the happened-
before relation induced only by messages sent between the members of g,
then any correct process that delivers m’ will deliver m before m’.
• Total—if a correct process delivers message m before it delivers m’, then
any other correct process that delivers m’ will deliver m before m’.
• Hybrids: total-causal, total-FIFO
Assumption: any process belongs to at most one group.
54. Comments on Ordering
• Note that FIFO ordering and causal ordering are only
partial orders: not all messages are sent by the same
sending process, and some multicasts are concurrent,
not orderable by happened-before.
• In Figure 12.12, T1 and T2 show total ordering, F1 and F2
show FIFO ordering, and C1 and C3 show causal ordering.
Note that T1 and T2 are delivered in the opposite order to
the physical time of message creation: total ordering
demands consistency, but not any particular order.
55. Figure 12.12 Total, FIFO and causal
ordering of multicast messages
[Diagram: messages F1–F3, T1, T2 and C1–C3 exchanged among processes
P1, P2 and P3 over time.]
Notice the consistent ordering of totally ordered messages T1 and T2,
the FIFO-related messages F1 and F2, and the causally related messages
C1 and C3, and the otherwise arbitrary delivery ordering of messages.
56. Reliability?
• The definitions of ordered multicast do not imply reliability.
– Example: under total ordering, if a correct process p delivers
message m and then delivers m’, a correct process q may
deliver m without ever delivering m’ or any other message
ordered after m.
• We can also form hybrids of ordered and reliable protocols.
– In the literature, reliable totally ordered multicast is often
referred to as atomic multicast.
– Similarly, reliable causal multicast and reliable versions of
the hybrid ordered multicasts can be formed.
57. Bulletin Board Example
• A bulletin board illustrates the desirability of consistency and, at
minimum, FIFO ordering.
– Users can sensibly refer to preceding messages from a user only if the
messages from each user are delivered in FIFO order. Message 25 in
Figure 12.13 refers to message 24, and message 27 refers to message 23.
• Reliable multicast is required if every user is to receive every
posting eventually.
• Note the further flexibility a system such as Web Board offers by
allowing messages to be grouped into threads as replies to a
particular message: messages then need not be displayed in the
order in which they were delivered.
• If total ordering is used, the item numbers on the left-hand side
appear the same to all users, so a posting can be referred to
unambiguously as, say, “message number 24”.
58. Figure 12.13 Display from bulletin
board program

Bulletin board: os.interesting
Item  From          Subject
23    A.Hanlon      Mach
24    G.Joseph      Microkernels
25    A.Hanlon      Re: Microkernels
26    T.L’Heureux   RPC performance
27    M.Walker      Re: Mach
end
59. Implementing FIFO Ordering
• FO-multicast and FO-deliver are achieved with sequence
numbers.
• The counters S_g^p and R_g^q held at process p are used just
as in the protocol over IP multicast discussed previously, and
FIFO ordering of the messages from each process is
maintained.
• If R-multicast is used instead of B-multicast, we obtain
reliable FIFO multicast.
60. Implementing Total Ordering
• The normal approach to total ordering is to assign
totally ordered identifiers to multicast messages, using
the identifiers to make ordering decisions.
• One possible implementation is to use a sequencer
process to assign identifiers. See figure 12.14. A
drawback of this is that the sequencer can become a
bottleneck.
• An alternative is to have the processes collectively
agree on identifiers. A simple algorithm is shown in
figure 12.15.
61. Using a Sequencer
• A process wishing to TO-multicast a message m to group g attaches a
unique identifier id(m) to it.
• The messages for g are sent to sequencer(g) as well as to the members
of g. (The sequencer may itself be chosen as a member of g.)
• The process sequencer(g) maintains a group-specific sequence number
s_g, which it uses to assign increasing and consecutive sequence
numbers to the messages that it B-delivers.
• It announces the sequence numbers by B-multicasting order messages
to g.
• A message remains in the hold-back queue until it can be TO-delivered
according to its corresponding sequence number.
If the processes use a FIFO-ordered variant of B-multicast, then the
totally ordered multicast is also causally ordered.
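The sequencer scheme can be sketched with two small classes (names and method signatures are my own; message bodies and order messages are modeled as plain method calls rather than real multicasts):

```python
class Sequencer:
    """Sketch of sequencer(g): assigns consecutive sequence
    numbers to message identifiers as it B-delivers them."""
    def __init__(self):
        self.s = 0
    def order(self, msg_id):
        self.s += 1
        return (msg_id, self.s)       # the order message <id, seq>

class Member:
    """Sketch of a group member: holds messages back until the
    order message for the next expected sequence number arrives."""
    def __init__(self):
        self.next_seq = 1
        self.holdback = {}            # msg_id -> message body
        self.orders = {}              # seq -> msg_id
        self.delivered = []
    def b_deliver_msg(self, msg_id, body):
        self.holdback[msg_id] = body
        self._drain()
    def b_deliver_order(self, msg_id, seq):
        self.orders[seq] = msg_id
        self._drain()
    def _drain(self):
        # TO-deliver while both the order message and the body
        # for the next sequence number are present
        while self.next_seq in self.orders and \
              self.orders[self.next_seq] in self.holdback:
            msg_id = self.orders[self.next_seq]
            self.delivered.append(self.holdback.pop(msg_id))
            self.next_seq += 1
```

Because every member drains in sequence-number order, all members TO-deliver the same messages in the same order, regardless of the order in which bodies and order messages arrive.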
63. Variants
• An obvious problem with the sequencer-based approach is
that the sequencer may become a bottleneck.
• Variants
– Chang and Maxemchuk [1984]
– Kaashoek et al. [1989]
• Kaashoek et al. [1989] use hardware-based multicast where it
is available, for example on an Ethernet.
• In their simplest variant, processes send the message to be
multicast to the sequencer, one-to-one. The sequencer then
multicasts the message itself, together with its identifier and
sequence number.
64. ISIS Algorithm for Total Ordering
• Processes collectively agree on the assignment of sequence numbers
to messages in a distributed manner.
• Each process q in g keeps A_g^q, the largest sequence number it has
observed so far for g, and P_g^q, its own largest proposed sequence
number.
• Algorithm:
– p B-multicasts <m, i> to g, where i is a unique identifier for m.
– Each process q replies to the sender p with a proposal for the message’s
agreed sequence number: P_g^q := max(A_g^q, P_g^q) + 1.
– Each process provisionally assigns its proposed sequence number to the
message and places the message in its hold-back queue, which is ordered
with the smallest sequence number at the front.
– p collects all the proposed sequence numbers and selects the largest, a, as
the next agreed sequence number. It then B-multicasts <i, a> to g. Each
process q in g sets A_g^q := max(A_g^q, a) and attaches a to the message
(which is identified by i). It reorders the message in its hold-back queue if
the agreed sequence number differs from the proposed one.
– When the message at the front of the hold-back queue has been assigned
its agreed sequence number, it is transferred to the tail of the delivery
queue. Messages that have been assigned their agreed sequence numbers
but are not at the front of the hold-back queue are not yet transferred.
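The sequence-number negotiation at the heart of ISIS can be sketched as follows. This toy covers only the propose/agree exchange for one message at a time (class and function names are mine; the hold-back queue management is omitted):

```python
class IsisProcess:
    """Sketch of one group member's ISIS counters."""
    def __init__(self):
        self.A = 0   # largest agreed sequence number observed for the group
        self.P = 0   # this process's own largest proposed sequence number

    def propose(self):
        """Reply to the sender with a proposed sequence number."""
        self.P = max(self.A, self.P) + 1
        return self.P

    def agree(self, agreed):
        """Adopt the agreed sequence number announced by the sender."""
        self.A = max(self.A, agreed)

def isis_round(procs):
    """One negotiation: the sender collects all proposals and
    B-multicasts the maximum as the agreed sequence number."""
    agreed = max(q.propose() for q in procs)
    for q in procs:
        q.agree(agreed)
    return agreed
```

Taking the maximum of the proposals guarantees the agreed number is at least every process’s proposal, which is the inequality used in the correctness argument on the following slide.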
65. Second Approach
Figure 12.15 The ISIS algorithm for total ordering
[Diagram: (1) the sender multicasts the message to the group P1–P4;
(2) each process replies with its proposed sequence number;
(3) the sender multicasts the agreed sequence number.]
66. If every process agrees on the same set of sequence numbers and
delivers the messages in the corresponding order, then total order is
satisfied.
Correct processes ultimately agree on the same set of sequence numbers,
the numbers are monotonically increasing, and no correct process can
deliver a message prematurely.
Assume that a message m1 has been assigned an agreed sequence number
and has reached the front of the hold-back queue, and let m2 be any
message still awaiting its agreed sequence number. Then:
agreedSequence(m2) ≥ proposedSequence(m2)
(by the algorithm above)
proposedSequence(m2) > agreedSequence(m1)
(since m1 is at the front of the queue)
Therefore,
agreedSequence(m2) > agreedSequence(m1)
67. This algorithm has higher latency than the sequencer-based
algorithm, since three messages pass between the sender and
the group before a message can be delivered.
The total ordering chosen by this algorithm is also not
guaranteed to be causally or FIFO-ordered: any two messages
are delivered in an essentially arbitrary total order, influenced
by communication delays.
68. Implementing Causal Ordering
(Birman et al. [1991])
• Non-overlapping closed groups can have causally ordered
multicast using vector timestamps.
• The algorithm orders only the happened-before relationships
established by multicasts, and ignores one-to-one messages
between the processes.
• Each process updates its vector timestamp before delivering
a message, to maintain the count of causally preceding
messages.
• The operations are CO-multicast and CO-deliver.
69. Implementing Causal Ordering
(Birman et al. [1991])
• Logic
• When a process pi B-delivers a message from pj, it must
place the message in the hold-back queue before it can
CO-deliver it, until it is assured that it has delivered any
messages that causally preceded it. To establish this, pi
waits until:
– (a) it has delivered any earlier message sent by pj, and
– (b) it has delivered any message that pj had delivered at the
time it multicast the message.
• Both of these conditions can be checked by examining the
piggybacked vector timestamps.
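The two conditions translate directly into a vector-timestamp test. In the sketch below (class name and packet format are my own), condition (a) is `Vj[j] == V[j] + 1` and condition (b) is `Vj[k] <= V[k]` for all `k != j`:

```python
class CausalMulticast:
    """Sketch of causal ordering with vector timestamps: a message
    from pj is CO-delivered only once every causally preceding
    message has been delivered."""

    def __init__(self, i, n):
        self.i, self.V = i, [0] * n   # own index and vector timestamp
        self.holdback = []
        self.delivered = []

    def co_multicast(self, m):
        self.V[self.i] += 1
        return (self.i, list(self.V), m)   # piggyback the timestamp

    def b_deliver(self, pkt):
        self.holdback.append(pkt)
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for pkt in list(self.holdback):
                j, Vj, m = pkt
                ok = Vj[j] == self.V[j] + 1 and all(
                    Vj[k] <= self.V[k]
                    for k in range(len(Vj)) if k != j)
                if ok:                 # conditions (a) and (b) both hold
                    self.holdback.remove(pkt)
                    self.delivered.append(m)
                    self.V[j] += 1
                    progress = True
```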
71. • Note that if we substitute the R-multicast
primitive in place of B-multicast, we obtain a
multicast that is both reliable and causally
ordered.
• Furthermore, if we combine the protocol for
causal multicast with the sequencer-based
protocol for totally ordered delivery, we obtain
message delivery that is both causal and total.
72. Overlapping Groups
• The assumption of non-overlapping groups is not
satisfactory in real scenarios.
• We have to consider global orders, in which if a
message m is multicast to group g, and message m’
is multicast to group g’, then both messages are
addressed to the members of g ∩ g’.
73. Overlapping groups
• Global FIFO ordering
– If a correct process issues multicast(g, m) and then multicast(g’, m’),
then every correct process in g ∩ g’ that delivers m’ will deliver m
before m’.
• Global Causal Ordering
– If multicast(g, m) → multicast(g’, m’), where → is the happened-before
relation induced by any chain of multicast messages, then any correct
process in g ∩ g’ that delivers m’ will deliver m before m’.
• Pairwise Total Ordering
– If a correct process delivers message m sent to g before it delivers m’
sent to g’, then any correct process in g ∩ g’ that delivers m’ will
deliver m before m’.
• Global Total Ordering
– Let ‘<‘ be the relation of ordering between delivery events. We require
that ‘<‘ obeys pairwise total ordering and that it is acyclic; under
pairwise total ordering alone, ‘<‘ is not guaranteed to be acyclic.
73
74. Consensus and Related Problems
• Consensus is a process by which a group of processes
agrees on a value that is proposed by one of the processes.
• The classic formulation of this problem is the Byzantine
Generals problem: a decision whether multiple armies
should attack or retreat, assuming that united action will
be more successful than some attacking and some
retreating.
• Another example might be spacecraft controllers
deciding whether to proceed or abort. Failure handling
during consensus is a key concern.
74
76. Consensus
System Model
1. The system has a collection of processes pi (i = 1, 2, …, n)
2. Processes communicate through message passing
3. Consensus must be reached even in the presence of
failures (up to f processes may fail)
4. Communication is reliable, but processes may fail
76
77. Consensus Process
1. Each process begins in an undecided state
2. Each process proposes a value from a set of values
3. Processes communicate with each other, exchanging values
4. Each process sets the value of a decision variable di and
enters the decided state
• Figure 12.17 shows three processes engaged in a
consensus algorithm. Two processes propose
“proceed.” One proposes “abort,” but then crashes.
The two remaining processes decide proceed.
77
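The scenario on this slide can be played out in a few lines. This is a toy run, not a real protocol: each correct process multicasts its proposal, the crashed process's value is lost, and the survivors apply the same deterministic rule (here, majority) to what they received, so they decide alike.

```python
# Toy replay of the Figure 12.17 scenario: P1 and P2 propose "proceed",
# P3 proposes "abort" but crashes before its proposal reaches anyone.

def decide(received_values):
    """All correct processes apply the same deterministic rule
    (majority here), so equal inputs give equal decisions."""
    return max(set(received_values), key=received_values.count)

proposals = {"P1": "proceed", "P2": "proceed", "P3": "abort"}
crashed = {"P3"}

received = [v for p, v in proposals.items() if p not in crashed]
d1 = decide(received)   # P1's decision variable
d2 = decide(received)   # P2's decision variable
assert d1 == d2 == "proceed"
```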
78. Figure 12.17 Consensus for three
processes
78
[Figure: P1 (v1 = proceed), P2 (v2 = proceed), and P3 (v3 = abort, crashes)
run the consensus algorithm; P1 and P2 decide d1 := proceed, d2 := proceed.]
79. Requirements for Consensus
• Termination—eventually each correct process
sets its decision variable
• Agreement—the decision value of all correct
processes is the same
• Integrity—if the correct processes all
proposed the same value, then any correct
process in the decided state has that value.
79
80. Byzantine Generals
• The Byzantine Empire was fraught with frequent infighting
among rival military leaders for control of the empire.
Where several generals had to cooperate to achieve an
objective, a treacherous general could weaken or even
eliminate a rival by retreating and encouraging another
general to retreat while encouraging the rival to attack.
Without the expected support, the rival was likely to be
defeated. The Byzantine Generals problem concerns
decision making in anticipation of an attack.
80
81. Formal Statement of Problem
• Here is the Byzantine Generals problem:
– Three or more generals must agree to attack or
retreat
– One general, the commander, issues the order
– Other generals, the lieutenants, must decide to attack
or retreat
– One or more generals may be treacherous
– A treacherous general may tell one general to attack and
another to retreat
• The difference from consensus is that a single process
supplies the value to be agreed on
81
82. Byzantine General Requirements
• Termination—eventually each correct process
sets its decision variable
• Agreement—the decision variable of all
correct processes is the same
• Integrity—if the commander is correct, then
all correct processes agree on the value that
the commander has proposed
82
83. Interactive Consistency
• A problem related to the Byzantine Generals problem is
interactive consistency. In this problem, all correct
processes agree on a vector of values, one for each
process. This is called the decision vector. Requirements:
– Termination—eventually each correct process sets its
decision variable
– Agreement—the decision vector of all correct processes
is the same
– Integrity—if pi is correct, then all correct processes
decide on vi as the ith component of their vector.
83
84. Linking the problems
• Consensus (C), Byzantine Generals (BG), and Interactive
Consistency (IC) are all problems concerned with making
decisions in the context of arbitrary or crash failures.
• We can sometimes generate solutions for one problem in
terms of another. For example:
• We can derive IC from BG by running BG N times, once
for each process, with that process acting as commander.
84
85. Derived Solutions
• We can derive C from IC by running IC to produce a
vector of values at each process, then applying a
function (e.g., majority) to the vector’s values to derive
a single value.
• We can derive BG from C by
– having the commander send its proposed value to itself
and each remaining process
– having all processes run C with the values they received
– deriving the BG decision from the value that C decides
85
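The first derivation above amounts to applying one agreed function to the common decision vector. A minimal sketch, with majority (and a fixed tie-break) as the assumed function: since IC guarantees every correct process holds the same vector, applying the same deterministic function yields the same consensus value everywhere.

```python
# Sketch of deriving Consensus from Interactive Consistency: every correct
# process applies the same deterministic function to the agreed vector.

def consensus_from_vector(decision_vector):
    """Majority of the IC decision vector; sorting the candidate set
    first makes the tie-break deterministic across processes."""
    return max(sorted(set(decision_vector)), key=decision_vector.count)

# every correct process holds the same IC decision vector
vector = ["attack", "retreat", "attack"]
assert consensus_from_vector(vector) == "attack"
# identical inputs at two processes give identical decisions
assert consensus_from_vector(list(vector)) == consensus_from_vector(vector)
```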
86. Consensus in a Synchronous
System
• Figure 12.18 shows an algorithm that reaches consensus
in a synchronous system. Up to f processes may suffer
crash failures, and the algorithm proceeds in f + 1 rounds.
During each round, each of the correct processes
multicasts to the others any proposed values it has not
already sent.
• The algorithm guarantees that all surviving correct
processes are in a position to agree.
• Note: any algorithm that tolerates up to f crash failures
requires at least f + 1 rounds to guarantee agreement.
86
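A simplified simulation of this round structure (not Figure 12.18 itself, and with the simplifying assumption that a crashed process cleanly sends nothing from its crash round onward, whereas the hard case in practice is a crash mid-multicast): each round, survivors multicast the values they have not yet sent; after f + 1 rounds all correct processes hold the same set and decide on, say, its minimum.

```python
# Simplified f+1-round synchronous consensus with crash failures.
# proposals: pid -> proposed value; crash_round: pid -> first round in
# which that process sends nothing (clean-crash assumption).

def synchronous_consensus(proposals, f, crash_round=None):
    crash_round = crash_round or {}
    known = {p: {v} for p, v in proposals.items()}   # values seen so far
    sent = {p: set() for p in proposals}             # values already sent
    for r in range(1, f + 2):                        # rounds 1 .. f+1
        round_msgs = []
        for p in proposals:
            if crash_round.get(p, f + 2) <= r:       # crashed: silent
                continue
            round_msgs.append(known[p] - sent[p])    # multicast new values
            sent[p] |= known[p]
        for p in proposals:                          # synchronous delivery
            for new in round_msgs:
                known[p] |= new
    correct = [p for p in proposals if p not in crash_round]
    return {p: min(known[p]) for p in correct}       # agreed decision rule

# P3 crashes before round 1; the survivors still decide identically.
decisions = synchronous_consensus({"P1": 2, "P2": 3, "P3": 1},
                                  f=1, crash_round={"P3": 1})
assert len(set(decisions.values())) == 1
```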
88. Limits for solutions to Byzantine
Generals
• Some cases of the Byzantine Generals problem have no
solution.
• Lamport et al. found that with only three processes, one
of which is faulty, there is no solution.
• Pease et al. generalized this: if the total number of
processes is less than three times the number of failures
plus one (N < 3f + 1), there is no solution. Thus there is a
solution with four processes and one failure, using two
rounds: in the first, the commander sends the values,
while in the second, each lieutenant sends the values it
received.
88
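The two-round case for N = 4, f = 1 can be sketched directly, following the "oral messages" scheme of Lamport et al.: the commander sends its order, each lieutenant relays what it received, and every lieutenant takes the majority of the three values it holds. This is an illustrative simulation of one faulty lieutenant; a faulty commander is handled by the same majority rule.

```python
# Two-round Byzantine Generals sketch for 4 processes, 1 failure.
from collections import Counter

def om1(commander_sends, traitor=None, traitor_relay=None):
    """commander_sends: order sent to each of lieutenants 0..2.
    A traitorous lieutenant relays traitor_relay instead of the truth."""
    decisions = []
    for i in range(3):
        held = [commander_sends[i]]        # round 1: from the commander
        for j in range(3):                 # round 2: relays from peers
            if j == i:
                continue
            held.append(traitor_relay if j == traitor
                        else commander_sends[j])
        decisions.append(Counter(held).most_common(1)[0][0])
    return decisions

# Traitorous lieutenant 2 relays "retreat" although the commander said
# "attack": the two loyal lieutenants still agree on the true order.
d = om1(["attack", "attack", "attack"], traitor=2, traitor_relay="retreat")
assert d[0] == d[1] == "attack"

# Traitorous commander sending mixed orders: the loyal lieutenants'
# relays still drive them to a common majority decision.
d = om1(["attack", "attack", "retreat"])
assert d[0] == d[1] == d[2] == "attack"
```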
91. Asynchronous Systems
• The solutions above to the consensus and Byzantine
Generals problems apply only to synchronous systems.
• Fischer et al. showed that no algorithm can guarantee
consensus in an asynchronous system in which even one
process may crash.
• This impossibility result is circumvented in practice by
masking faults or by using failure detectors.
• There is also a partial solution, which assumes an
adversary process, based on introducing random values
into the protocol to prevent an effective thwarting
strategy. It does not always reach consensus.
91
93. Bibliography
• George Coulouris, Jean Dollimore, and Tim Kindberg,
Distributed Systems: Concepts and Design, Addison-
Wesley, Fourth Edition, 2005.
• Figures from the Coulouris text are from the
instructor’s guide and are copyrighted by Pearson
Education 2005.
• Fischer et al., Pease et al., and Lamport et al.: see
references in the Coulouris text, pp. 859 ff.
93