6. Distributed Operating Systems

CLOCK SYNCHRONIZATION
Time ordering and clock synchronization
Leader election
Mutual exclusion
Distributed transactions
Deadlock detection

  1. 1. DISTRIBUTED OPERATING SYSTEMS Sandeep Kumar Poonia
  2. 2. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS  Time ordering and clock synchronization  Leader election  Mutual exclusion  Distributed transactions  Deadlock detection
  3. 3. THE IMPORTANCE OF SYNCHRONIZATION  Because various components of a distributed system must cooperate and exchange information, synchronization is a necessity.  Various components of the system must agree on the timing and ordering of events. Imagine a banking system that did not track the timing and ordering of financial transactions. Similar chaos would ensue if distributed systems were not synchronized.  Constraints, both implicit and explicit, are therefore enforced to ensure synchronization of components.
  4. 4. CLOCK SYNCHRONIZATION  As in non-distributed systems, the knowledge of “when events occur” is necessary.  However, clock synchronization is often more difficult in distributed systems because there is no ideal time source, and because distributed algorithms must sometimes be used.  Distributed algorithms must overcome:  Scattering of information  Local, rather than global, decision-making
  5. 5. CLOCK SYNCHRONIZATION  Time is unambiguous in centralized systems  The system clock keeps time, and all entities use it for time  Distributed systems: each node has its own system clock  Crystal-based clocks are less accurate (about one part in a million)  Problem: an event that occurred after another may be assigned an earlier time
  6. 6. LACK OF GLOBAL TIME IN DS  It is impossible to guarantee that physical clocks run at the same frequency  Lack of global time can cause problems  Example: UNIX make  Edit output.c at a client  output.o is at a server (compiled at the server)  The client machine's clock can lag behind the server machine's clock, so the newly edited output.c may be stamped older than output.o and make will skip the recompilation
  7. 7. LACK OF GLOBAL TIME – EXAMPLE When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
  8. 8. LOGICAL CLOCKS  For many problems, internal consistency of clocks is important  Absolute time is less important  Use logical clocks  Key idea:  Clock synchronization need not be absolute  If two machines do not interact, no need to synchronize them  More importantly, processes need to agree on the order in which events occur rather than the time at which they occurred
  9. 9. EVENT ORDERING  Problem: define a total ordering of all events that occur in a system  Events in a single processor machine are totally ordered  In a distributed system:  No global clock, local clocks may be unsynchronized  Cannot order events on different machines using local times  Key idea [Lamport]  Processes exchange messages  A message must be sent before it is received  Send/receive used to order events (and synchronize clocks)
  10. 10. LOGICAL CLOCKS  Often, it is not necessary for a computer to know the exact time, only relative time. This is known as “logical time”.  Logical time is not based on timing but on the ordering of events.  Logical clocks can only advance forward, not in reverse.  Non-interacting processes cannot share a logical clock.  Computers generally obtain logical time using interrupts to update a software clock. The more interrupts (the more frequently time is updated), the higher the overhead.
  11. 11. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM  The most common logical clock synchronization algorithm for distributed systems is Lamport's Algorithm. It is used in situations where ordering is important but global time is not required.  Based on the “happens-before” relation:  Event A “happens-before” Event B (A→B) when all processes involved in a distributed system agree that event A occurred first, and B subsequently occurred.  This DOES NOT mean that Event A actually occurred before Event B in absolute clock time.
  12. 12. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM  A distributed system can use the “happens-before” relation when:  Events A and B are observed by the same process, or by multiple processes with the same global clock  Event A is the sending of a message and Event B is the receipt of that message, since a message cannot be received before it is sent  If two events do not communicate via messages, they are considered concurrent – because order cannot be determined and it does not matter. Concurrent events can be ignored.
  13. 13. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)  In the previous examples, if a happens-before b, then Clock C(a) < C(b)  If a and b are concurrent, C(a) = C(b) is possible  Two events can share the same clock value only if they occur on different systems, because within one process the clock ticks between successive events, and every message transfer between two systems takes at least one clock tick.  In Lamport's Algorithm, logical clock values for events may be changed, but always by moving the clock forward. Time values can never be decreased.  An additional refinement in the algorithm is often used:  If Event A and Event B are concurrent with C(a) = C(b), some unique property of the processes associated with these events can be used to choose a winner. This establishes a total ordering of all events.  Process ID is often used as the tiebreaker.
  14. 14. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)  Lamport's Algorithm can thus be used in distributed systems to ensure synchronization:  A logical clock is implemented in each node in the system.  Each node can determine the order in which events have occurred from its own point of view.  The logical clock of one node does not need to have any relation to real time or to any other node in the system.
  15. 15. EVENT ORDERING USING HB  Goal: define the notion of the time of an event such that  If A → B then C(A) < C(B)  If A and B are concurrent, then C(A) <, =, or > C(B)  Solution:  Each processor i maintains a logical clock LCi  Whenever an event occurs locally at i, LCi = LCi + 1  When i sends a message to j, piggyback LCi  When j receives a message from i  If LCj < LCi then LCj = LCi + 1, else do nothing  Claim: this algorithm meets the above goals
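The update rules on this slide map directly to code. A minimal sketch in Python (the class name and method split are illustrative, not from the deck); note that `max(local, received) + 1` covers both branches of the receive rule:

```python
class LamportClock:
    """Minimal Lamport logical clock for one process."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event (including a send) advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the timestamp is piggybacked on the message.
        return self.tick()

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp, then count the
        # receive itself as an event (the "LCj = LCi + 1" rule).
        self.time = max(self.time, msg_time) + 1
        return self.time

# Illustrative run: sender's clock at 6, receiver's lagging at 4.
p0, p1 = LamportClock(), LamportClock()
p0.time, p1.time = 6, 4
ts = p0.send()          # p0 advances to 7; message carries 7
print(p1.receive(ts))   # p1 jumps to 8 instead of ticking to 5
```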
  16. 16. PROCESSES EACH WITH THEIR OWN CLOCK • At time 6, Process 0 sends message A to Process 1 • It arrives at Process 1 at 16 (it took 10 ticks to make the journey) • Message B from 1 to 2 takes 16 ticks • Message C from 2 to 1 leaves at 60 and arrives at 56 - not possible • Message D from 1 to 0 leaves at 64 and arrives at 54 - not possible
  17. 17. LAMPORT'S ALGORITHM CORRECTS THE CLOCKS  Use the 'happened-before' relation  Each message carries the sending time (as per the sender's clock)  When it arrives, the receiver fast-forwards its clock to be one more than the sending time (between every two events, the clock must tick at least once)
  18. 18. PHYSICAL CLOCKS  The instantaneous time difference between two computer clocks is known as skew; the rate at which a clock gains or loses time relative to a reference is known as drift. Computer clock manufacturers specify a maximum drift rate in their products.  Computer clocks are among the least accurate modern timepieces.  Inside every computer is a quartz crystal oscillator used to keep time. These crystals cost only about 25 cents to produce.  Average loss of accuracy: 0.86 seconds per day  This skew is unacceptable for distributed systems. Several methods are now in use to attempt the synchronization of physical clocks in distributed systems:
  19. 19. PHYSICAL CLOCKS  Since the 17th century, time has been measured astronomically  Solar Day: interval between two consecutive transits of the sun  Solar Second: 1/86,400th of a solar day
  20. 20. PHYSICAL CLOCKS  1948: atomic clocks are invented  The most accurate clocks are atomic oscillators (one part in 10^13)  The BIH defines TAI (International Atomic Time)  86,400 TAI seconds is now about 3 msec less than a solar day  The BIH solves the problem by introducing a leap second whenever the discrepancy between TAI and solar time grows to 800 msec  The resulting time scale is called Universal Coordinated Time (UTC)  When the BIH announces a leap second, power companies raise their frequency to 61 Hz (in 60 Hz areas) or 51 Hz (in 50 Hz areas) for 60 or 50 seconds, to advance all the clocks in their distribution area.
  21. 21. PHYSICAL CLOCKS - UTC Coordinated Universal Time (UTC) is the international time standard. UTC is the current term for what was commonly referred to as Greenwich Mean Time (GMT). Zero hours UTC is midnight in Greenwich, England, which lies on the zero longitudinal meridian. UTC is based on a 24-hour clock.
  22. 22. PHYSICAL CLOCKS  Most clocks are less accurate (e.g., mechanical watches)  Computers use crystal-based clocks (accurate to about one part in a million)  Results in clock drift  How do you tell time?  Use astronomical metrics (solar day)  Coordinated universal time (UTC) – international standard based on atomic time  Add leap seconds to be consistent with astronomical time  UTC broadcast on radio (satellite and earth)  Receivers accurate to 0.1 – 10 ms  Need to synchronize machines with a master or with one another
  23. 23. CLOCK SYNCHRONIZATION  Each clock has a maximum drift rate ρ:  1 − ρ ≤ dC/dt ≤ 1 + ρ  Two clocks may drift apart by up to 2ρt in time t  To limit the skew to δ, resynchronize every δ/(2ρ) seconds
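Filling in the symbols with a number the deck itself quotes (crystal clocks accurate to about one part in a million, slide 22) and an assumed target skew δ of 1 ms:

```latex
\rho = 10^{-6},\qquad \delta = 10^{-3}\,\mathrm{s}
\quad\Rightarrow\quad
t_{\mathrm{resync}} = \frac{\delta}{2\rho}
  = \frac{10^{-3}}{2\times 10^{-6}} = 500\ \mathrm{s}
```

So with ordinary crystal clocks, millisecond-level agreement requires resynchronizing roughly every eight minutes.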
  24. 24. CRISTIAN'S ALGORITHM  Assuming there is one time server with UTC:  Each node in the distributed system periodically polls the time server.  The time is estimated as t + (Treq + Treply)/2, where t is the time returned by the server and Treq, Treply are the request and reply transit times  This process is repeated several times and an average is used.  The polling machine then attempts to adjust its time.  Disadvantages:  Must attempt to take the delay between the client and the time server into account  Single point of failure if the time server fails
  25. 25. CRISTIAN'S ALGORITHM  Synchronize machines to a time server with a UTC receiver  Machine P requests the time from the server every δ/(2ρ) seconds  Receives time t from the server; P sets its clock to t + treply, where treply is the time taken to send the reply to P  Use (treq + treply)/2 as an estimate of treply  Improve accuracy by making a series of measurements
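A client-side sketch of the algorithm, assuming a hypothetical `get_server_time()` RPC that returns the server's UTC value (here stubbed with the local clock so the snippet runs standalone). It uses the refinement from slide 27: keep the sample with the fastest round trip.

```python
import time

def get_server_time():
    # Stand-in for the RPC to the UTC time server; returns the local
    # clock so the sketch runs without a network.
    return time.time()

def cristian_offset(samples=5):
    """Estimate the correction to apply to the local clock."""
    best = None
    for _ in range(samples):
        t_req = time.time()            # local clock when request is sent
        t_server = get_server_time()   # time reported by the server
        t_resp = time.time()           # local clock when reply arrives
        rtt = t_resp - t_req
        # Assume symmetric delay: the reply took ~rtt/2 to arrive,
        # so the server's clock now reads about t_server + rtt/2.
        offset = t_server + rtt / 2 - t_resp
        if best is None or rtt < best[0]:
            best = (rtt, offset)       # fastest reply is most accurate
    return best[1]

print(cristian_offset())  # ~0 here, since the "server" is the local clock
```

A real client would then apply the offset gradually (slide 27's interrupt trick) rather than stepping the clock, so time never runs backward.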
  26. 26. PROBLEM WITH CRISTIAN'S ALGORITHM Major Problem  Time must never run backward  If the sender's clock is fast, CUTC will be smaller than the sender's current value of C Minor Problem  It takes nonzero time for the time server's reply to reach the sender  This delay may be large and vary with network load
  27. 27. SOLUTION Major Problem  Control the clock rate  Suppose the timer is set to generate 100 interrupts/sec  Normally each interrupt adds 10 msec to the time  To slow down, add only 9 msec per interrupt  To advance, add 11 msec per interrupt Minor Problem  Measure it  Make a series of measurements for accuracy  Discard the measurements that exceed a threshold value  The message that came back fastest can be taken to be the most accurate.
  28. 28. BERKELEY ALGORITHM  Used in systems without a UTC receiver  Keeps clocks synchronized with one another  One computer is the master, the others are slaves  Master periodically polls slaves for their times  Averages the times and returns differences to the slaves  Communication delays compensated as in Cristian's algorithm  Failure of master => election of a new master
  29. 29. BERKELEY ALGORITHM a) The time daemon asks all the other machines for their clock values b) The machines answer c) The time daemon tells everyone how to adjust their clock
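A sketch of one averaging round at the time daemon, with made-up clock readings standing in for the poll replies (the delay compensation mentioned above is omitted for brevity):

```python
def berkeley_round(master_time, slave_times):
    """Return the adjustment each machine should apply, keyed by name."""
    clocks = {"master": master_time, **slave_times}
    avg = sum(clocks.values()) / len(clocks)
    # Send each machine a relative adjustment rather than an absolute
    # time, so replies still in flight do not skew the result.
    return {name: avg - t for name, t in clocks.items()}

# Master reads 3:00 (180 min past noon); slaves read 2:50 and 3:25.
print(berkeley_round(180, {"slave1": 170, "slave2": 205}))
# -> {'master': 5.0, 'slave1': 15.0, 'slave2': -20.0}  (all move to 3:05)
```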
  30. 30. DECENTRALIZED AVERAGING ALGORITHM  Each machine on the distributed system has a daemon without UTC.  Periodically, at an agreed-upon fixed time, each machine broadcasts its local time.  Each machine calculates the correct time by averaging all results.
  31. 31. NETWORK TIME PROTOCOL (NTP)  Enables clients across the Internet to be synchronized accurately to UTC.  Overcomes large and variable message delays  Employs statistical techniques for filtering, based on past quality of servers and several other measures  Can survive lengthy losses of connectivity:  Redundant servers  Redundant paths to servers  Provides protection against malicious interference through authentication techniques
  32. 32. NETWORK TIME PROTOCOL (NTP) (CONT.)  Uses a hierarchy of servers located across the Internet. Primary servers are directly connected to a UTC time source.
  33. 33. NETWORK TIME PROTOCOL (NTP) (CONT.)  NTP has three modes:  Multicast Mode:  Suitable for user workstations on a LAN  One or more servers periodically multicasts the time to other machines on the network.  Procedure Call Mode:  Similar to Cristian's Algorithm  Provides higher accuracy than Multicast Mode because delays are compensated.  Symmetric Mode:  Pairs of servers exchange pairs of timing messages that contain time stamps of recent message events.  The most accurate, but also the most expensive mode Although NTP is quite advanced, there is still a drift of 20-35 milliseconds!
  34. 34. MORE PROBLEMS  Causality  Vector timestamps  Global state and termination detection  Election algorithms
  35. 35. LOGICAL CLOCKS  For many DS algorithms, associating an event to an absolute real time is not essential, we only need to know an unambiguous order of events  Lamport's timestamps  Vector timestamps
  36. 36. LOGICAL CLOCKS (CONT.)  Synchronization based on “relative time”.  “Relative time” may not relate to “real time”.  Example: Unix make (Is output.c updated after the generation of output.o?)  What's important is that the processes in the Distributed System agree on the ordering in which certain events occur.  Such “clocks” are referred to as Logical Clocks.
  37. 37. EXAMPLE: WHY ORDER MATTERS?  Replicated accounts in Jaipur (JP) and Bikaner (BN)  Two updates occur at the same time  Current balance: $1,000  Update 1: Add $100 at BN; Update 2: Add interest of 1% at JP  If each replica applies the updates in the order it receives them, BN ends with (1,000 + 100) × 1.01 = $1,111 while JP ends with 1,000 × 1.01 + 100 = $1,110  Whoops, inconsistent states!
  38. 38. LAMPORT ALGORITHM  Clock synchronization does not have to be exact  Synchronization not needed if there is no interaction between machines  Synchronization only needed when machines communicate  i.e. must only agree on ordering of interacting events
  39. 39. LAMPORT'S “HAPPENS-BEFORE” PARTIAL ORDER  Given two events e and e′, e < e′ if: 1. Same process: e <i e′, for some process Pi 2. Same message: e = send(m) and e′ = receive(m) for some message m 3. Transitivity: there is an event e* such that e < e* and e* < e′
  40. 40. CONCURRENT EVENTS  Given two events e and e′:  If neither e < e′ nor e′ < e, then e || e′ [Figure: space-time diagram of processes P1–P3 with events a–f and messages m1, m2]
  41. 41. LAMPORT LOGICAL CLOCKS  Substitute synchronized clocks with a global ordering of events:  ei < ej ⇒ LC(ei) < LC(ej)  LCi is a local clock: contains increasing values  each process i has its own LCi  Increment LCi on each event occurrence  within the same process i, if ej occurs before ek  LCi(ej) < LCi(ek)  If es is a send event and er receives that send, then  LCi(es) < LCj(er)
  42. 42. LAMPORT ALGORITHM  Each process increments its local clock between any two successive events  Each message contains a timestamp  Upon receiving a message, if the received timestamp is ahead, the receiver fast-forwards its clock to be one more than the sending time
  43. 43. LAMPORT ALGORITHM (CONT.)  Timestamp  Each event is given a timestamp t  If es is a send message m from pi, then t=LCi(es)  When pj receives m, set LCj value as follows  If t < LCj, increment LCj by one  Message regarded as next event on j  If t ≥ LCj, set LCj to t+1
  44. 44. LAMPORT'S ALGORITHM ANALYSIS (1)  Claim: ei < ej ⇒ LC(ei) < LC(ej)  Proof: by induction on the length of the sequence of events relating ei and ej [Figure: space-time diagram of P1–P3 with Lamport clock values 1–5 assigned to events a–g]
  45. 45. LAMPORT'S ALGORITHM ANALYSIS (2)  LC(ei) < LC(ej) ⇒ ei < ej?  Claim: if LC(ei) < LC(ej), then it is not necessarily true that ei < ej [Figure: counterexample diagram; two concurrent events receive distinct, ordered clock values]
  46. 46. TOTAL ORDERING OF EVENTS  Happens-before is only a partial order  Make the timestamp of an event e of process Pi be: (LC(e), i)  (a,b) < (c,d) iff a < c, or a = c and b < d
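In code, this refinement is just lexicographic comparison of (LC, process ID) pairs, e.g. in Python:

```python
# Timestamps as (Lamport clock value, process ID). Tuple comparison
# implements exactly the rule above: clock first, process ID as tiebreak.
a = (4, 2)   # clock 4 at process 2
b = (4, 3)   # clock 4 at process 3: concurrent with a, but ordered after it
c = (5, 1)
assert a < b < c
```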
  47. 47. APPLICATION: TOTALLY-ORDERED MULTICASTING  Message is timestamped with the sender's logical time  Message is multicast (including to the sender itself)  When a message is received  It is put into the local queue  Ordered according to timestamp  Acknowledgement is multicast  Message is delivered to applications only when  It is at the head of the queue  It has been acknowledged by all involved processes
  48. 48. APPLICATION: TOTALLY-ORDERED MULTICASTING  Update 1 is time-stamped and multicast. Added to local queues.  Update 2 is time-stamped and multicast. Added to local queues.  Acknowledgements for Update 2 sent/received. Update 2 can now be processed.  Acknowledgements for Update 1 sent/received. Update 1 can now be processed.  (Note: all queues are the same, as the timestamps have been used to ensure the “happens-before” relation holds.)
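A sketch of the hold-back queue at one process (class and callback names are illustrative; reliable multicast of both messages and acknowledgements is assumed, as the slides require):

```python
import heapq

class TotalOrderQueue:
    """Hold-back queue for totally-ordered multicast at one process."""
    def __init__(self, all_procs, deliver):
        self.all_procs = set(all_procs)
        self.deliver = deliver      # application-level delivery callback
        self.queue = []             # heap ordered by (timestamp, sender)
        self.acks = {}              # message id -> set of acknowledgers

    def on_multicast(self, ts, sender, payload):
        heapq.heappush(self.queue, ((ts, sender), payload))
        self.acks.setdefault((ts, sender), set())
        self._try_deliver()

    def on_ack(self, ts, sender, acker):
        self.acks.setdefault((ts, sender), set()).add(acker)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only from the head of the queue, and only once every
        # process has acknowledged it; this is what makes the order total.
        while self.queue:
            msg_id, payload = self.queue[0]
            if self.acks.get(msg_id) != self.all_procs:
                return
            heapq.heappop(self.queue)
            self.deliver(payload)

q = TotalOrderQueue({"P1", "P2"}, deliver=print)
q.on_multicast(1, "P1", "update 1")
q.on_ack(1, "P1", "P1")
q.on_ack(1, "P1", "P2")   # prints "update 1"
```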
  49. 49. LIMITATION OF LAMPORT'S ALGORITHM  ei < ej ⇒ LC(ei) < LC(ej)  However, LC(ei) < LC(ej) does not imply ei < ej  for instance, (1,1) < (1,3), but events a and e are concurrent [Figure: space-time diagram with (clock, process) timestamps from (1,1) to (5,3); a = (1,1) and e = (1,3) are concurrent]
  50. 50. VECTOR TIMESTAMPS  Pi's clock is a vector VTi[]  VTi[i] = number of events Pi has stamped  VTi[j] = what Pi thinks is the number of events Pj has stamped (i ≠ j)
  51. 51. VECTOR TIMESTAMPS (CONT.)  Initialization  the vector timestamp for each process is initialized to (0,0,…,0)  Local event  when an event occurs on process Pi, VTi[i] ← VTi[i] + 1  e.g., on processor 3, (1,2,1,3) → (1,2,2,3)
  52. 52. VECTOR TIMESTAMPS (CONT.)  Message passing  when Pi sends a message to Pj, the message has timestamp t[] = VTi[]  when Pj receives the message, it sets VTj[k] to max(VTj[k], t[k]), for k = 1, 2, …, N  e.g., P2 receives a message with timestamp (3,2,4) and P2's timestamp is (3,4,3); then P2 adjusts its timestamp to (3,4,4)
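The initialization, local-event, and message-passing rules of slides 51-52 fit in a small class; a sketch (names are illustrative):

```python
class VectorClock:
    """Vector timestamp for process `pid` in a system of n processes."""
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n            # initialized to (0,0,...,0)

    def local_event(self):
        self.vt[self.pid] += 1       # VTi[i] <- VTi[i] + 1

    def send(self):
        self.local_event()           # sending is itself an event
        return list(self.vt)         # the whole vector is piggybacked

    def receive(self, ts):
        # Componentwise max, slide 52's rule. (The variant on slide 59
        # also ticks the local entry to count the receive as an event.)
        self.vt = [max(a, b) for a, b in zip(self.vt, ts)]

# Slide 52's example: P2 at (3,4,3) receives a message stamped (3,2,4).
p2 = VectorClock(pid=1, n=3)
p2.vt = [3, 4, 3]
p2.receive([3, 2, 4])
print(p2.vt)   # -> [3, 4, 4]
```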
  53. 53. COMPARING VECTORS  VT1 = VT2 iff VT1[i] = VT2[i] for all i  VT1 ≤ VT2 iff VT1[i] ≤ VT2[i] for all i  VT1 < VT2 iff VT1 ≤ VT2 & VT1 ≠ VT2  for instance, (1, 2, 2) < (1, 3, 2)
  54. 54. VECTOR TIMESTAMP ANALYSIS  Claim: e < e′ iff e.VT < e′.VT [Figure: space-time diagram with vector timestamps [1,0,0] through [2,2,3] for events a–g]
  55. 55. APPLICATION: CAUSALLY-ORDERED MULTICASTING  For ordered delivery, we also need…  Multicast msgs (reliable but may be out-of-order)  Vi[i] is only incremented when sending  When k gets a msg from j, with timestamp ts, the msg is buffered until:  1: ts[j] = Vk[j] + 1  (this is the next timestamp that k is expecting from j)  2: ts[i] ≤ Vk[i] for all i ≠ j  (k has seen all msgs that were seen by j when j sent the msg)
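The two buffering conditions translate directly into a predicate; a sketch assuming, as the slide does, that Vi[i] counts only sends:

```python
def can_deliver(ts, sender, local_vt):
    """Causal-delivery test at a process whose vector clock is local_vt,
    for a message from process index `sender` carrying timestamp ts."""
    # 1: this is the next message expected from the sender
    next_from_sender = ts[sender] == local_vt[sender] + 1
    # 2: the receiver has seen everything the sender had seen
    seen_rest = all(ts[i] <= local_vt[i]
                    for i in range(len(ts)) if i != sender)
    return next_from_sender and seen_rest

# Slide 57's scenario: P2 (clock [0,0,0]) receives the reply r = [1,0,1]
# from P3 before the original post a = [1,0,0] from P1 arrives.
print(can_deliver([1, 0, 1], sender=2, local_vt=[0, 0, 0]))  # False: buffer r
print(can_deliver([1, 0, 0], sender=0, local_vt=[0, 0, 0]))  # True: deliver a
```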
  56. 56. CAUSALLY-ORDERED MULTICASTING [Figure: P1 multicasts “Post a” with timestamp [1,0,0]; P3 multicasts “r: Reply a” with timestamp [1,0,1]. Message a arrives at P2 before the reply r from P3 does.]
  57. 57. CAUSALLY-ORDERED MULTICASTING (CONT.) [Figure: message a arrives at P2 after the reply r = [1,0,1] from P3, so the reply is buffered and not delivered right away; once a = [1,0,0] is delivered, r is delivered.]
  58. 58. ORDERED COMMUNICATION  Totally ordered multicast  Use Lamport timestamps  Causally ordered multicast  Use vector timestamps
  59. 59. VECTOR CLOCKS  Each process i maintains a vector Vi  Vi[i] : number of events that have occurred at i  Vi[j] : number of events i knows have occurred at process j  Update vector clocks as follows  Local event: increment Vi[i]  Send a message: piggyback the entire vector V  Receipt of a message: Vj[k] = max(Vj[k], Vi[k])  Receiver is told about how many events the sender knows occurred at another process k  Also Vj[j] = Vj[j] + 1
  60. 60. GLOBAL STATE  Global state of a distributed system  Local state of each process  Messages sent but not received (state of the queues)  Many applications need to know the state of the system  Failure recovery, distributed deadlock detection  Problem: how can you figure out the state of a distributed system?  Each process is independent  No global clock or synchronization  Distributed snapshot: a consistent global state
  61. 61. GLOBAL STATE (1) a) A consistent cut b) An inconsistent cut
  62. 62. DISTRIBUTED SNAPSHOT ALGORITHM  Assume each process communicates with another process using unidirectional point-to-point channels (e.g., TCP connections)  Any process can initiate the algorithm  Checkpoint local state  Send marker on every outgoing channel  On receiving a marker  Checkpoint state if it is the first marker, send a marker on all outgoing channels, and save messages arriving on all other channels until:  A subsequent marker arrives on a channel: stop saving state for that channel
  63. 63. DISTRIBUTED SNAPSHOT  A process finishes when  It receives a marker on each incoming channel and processes them all  State: local state plus state of all channels  Send state to initiator  Any process can initiate a snapshot  Multiple snapshots may be in progress  Each is separate, and each is distinguished by tagging the marker with the initiator ID (and sequence number)
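A compact sketch of the per-process logic, under the slide's assumptions (unidirectional FIFO channels); the class and callback names are illustrative:

```python
MARKER = object()   # distinguished marker message

class SnapshotProcess:
    """One participant in the distributed snapshot algorithm above."""
    def __init__(self, state, incoming):
        self.state = state            # application state
        self.incoming = set(incoming) # names of incoming channels
        self.recorded_state = None    # local checkpoint
        self.channel_state = {}       # channel -> messages in flight
        self.open = set()             # channels still being recorded

    def start_snapshot(self, send_markers):
        self.recorded_state = self.state      # checkpoint local state
        self.open = set(self.incoming)
        self.channel_state = {c: [] for c in self.incoming}
        send_markers()                # marker on every outgoing channel

    def on_message(self, channel, msg, send_markers):
        """Returns True once this process's part of the snapshot is done."""
        if msg is MARKER:
            if self.recorded_state is None:   # first marker seen
                self.start_snapshot(send_markers)
            self.open.discard(channel)        # stop recording this channel
            return not self.open
        if self.recorded_state is not None and channel in self.open:
            # Message was in flight when the cut was taken: channel state.
            self.channel_state[channel].append(msg)
        # ...normal application processing of msg would go here...
        return False
```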
  64. 64. SNAPSHOT ALGORITHM EXAMPLE a) Organization of a process and channels for a distributed snapshot
  65. 65. SNAPSHOT ALGORITHM EXAMPLE b) Process Q receives a marker for the first time and records its local state c) Q records all incoming messages d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
  66. 66. TERMINATION DETECTION  Detecting the end of a distributed computation  Notation: let the sender be the predecessor, the receiver the successor  Two types of markers: Done and Continue  After finishing its part of the snapshot, process Q sends a Done or a Continue to its predecessor  Send a Done only when  All of Q's successors send a Done  Q has not received any message since it checkpointed its local state and received a marker on all incoming channels  Else send a Continue  Computation has terminated if the initiator receives Done messages from everyone
  67. 67. DISTRIBUTED SYNCHRONIZATION  Distributed system with multiple processes may need to share data or access shared data structures  Use critical sections with mutual exclusion  Single process with multiple threads  Semaphores, locks, monitors  How do you do this for multiple processes in a distributed system?  Processes may be running on different machines  Solution: lock mechanism for a distributed environment  Can be centralized or distributed
  68. 68. CENTRALIZED MUTUAL EXCLUSION  Assume processes are numbered  One process is elected coordinator (highest ID process)  Every process needs to check with coordinator before entering the critical section  To obtain exclusive access: send request, await reply  To release: send release message  Coordinator:  Receive request: if available and queue empty, send grant; if not, queue request  Receive release: remove next request from queue and send grant
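The coordinator's two message handlers are short; a sketch (the `grant` callback stands in for sending a grant message):

```python
from collections import deque

class Coordinator:
    """Centralized mutual-exclusion coordinator."""
    def __init__(self, grant):
        self.grant = grant        # callback: send grant to a process
        self.holder = None        # who is in the critical section
        self.waiting = deque()    # FIFO queue => fairness

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grant(pid)
        else:
            self.waiting.append(pid)   # no reply: requester stays blocked

    def on_release(self, pid):
        self.holder = None
        if self.waiting:
            self.holder = self.waiting.popleft()
            self.grant(self.holder)

c = Coordinator(grant=lambda pid: print("grant ->", pid))
c.on_request(1)    # grant -> 1
c.on_request(2)    # queued
c.on_release(1)    # grant -> 2
```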
  69. 69. MUTUAL EXCLUSION: A CENTRALIZED ALGORITHM a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply. c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2
  70. 70. PROPERTIES  Simulates a centralized lock using blocking calls  Fair: requests are granted the lock in the order they were received  Simple: three messages per use of a critical section (request, grant, release)  Shortcomings:  Single point of failure  How do you detect a dead coordinator?  A process cannot distinguish “lock in use” from a dead coordinator  No response from coordinator in either case  Performance bottleneck in large distributed systems
  71. 71. DISTRIBUTED ALGORITHM  [Ricart and Agrawala]: needs 2(n-1) messages  Based on event ordering and time stamps  Process k enters the critical section as follows  Generate new time stamp TSk = TSk + 1  Send request(k, TSk) to all other n-1 processes  Wait until reply(j) is received from all other processes  Enter critical section  Upon receiving a request message, process j  Sends reply if no contention  If already in the critical section, does not reply, queues the request  If it wants to enter, compares TSj with TSk and sends reply if TSk < TSj, else queues the request
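The receiver's decision reduces to one comparison on (timestamp, process ID); a sketch (the state names are illustrative):

```python
def should_reply_now(state, my_ts, my_pid, req_ts, req_pid):
    """Ricart-Agrawala receive rule: True = reply immediately,
    False = defer (queue) the reply until we leave the critical section."""
    if state == "RELEASED":        # no contention: always reply
        return True
    if state == "HELD":            # inside the critical section: defer
        return False
    # state == "WANTED": the lower (timestamp, pid) pair wins the tie.
    return (req_ts, req_pid) < (my_ts, my_pid)

print(should_reply_now("WANTED", my_ts=8, my_pid=2, req_ts=7, req_pid=5))  # True
print(should_reply_now("WANTED", my_ts=7, my_pid=2, req_ts=7, req_pid=5))  # False
```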
  72. 72. A DISTRIBUTED ALGORITHM a) Two processes want to enter the same critical region at the same moment. b) Process 0 has the lowest timestamp, so it wins. c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
  73. 73. PROPERTIES  Fully decentralized  N points of failure!  All processes are involved in all decisions  Any overloaded process can become a bottleneck
  74. 74. ELECTION ALGORITHMS  Many distributed algorithms need one process to act as coordinator  Doesn't matter which process does the job, just need to pick one  Election algorithms: technique to pick a unique coordinator (aka leader election)  Examples: take over the role of a failed process, pick a master in the Berkeley clock synchronization algorithm  Types of election algorithms: Bully and Ring algorithms
  75. 75. BULLY ALGORITHM  Each process has a unique numerical ID  Processes know the IDs and address of every other process  Communication is assumed reliable  Key idea: select the process with the highest ID  A process initiates an election if it just recovered from failure or if the coordinator failed  3 message types: election, OK, I won  Several processes can initiate an election simultaneously  Need a consistent result  O(n²) messages required with n processes
  76. 76. BULLY ALGORITHM DETAILS  Any process P can initiate an election  P sends Election messages to all processes with higher IDs and awaits OK messages  If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs  If it receives an OK, it drops out and waits for an I won  If a process receives an Election message, it returns an OK and starts an election of its own  If a process receives an I won, it treats the sender as coordinator
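A sketch that collapses the message exchange into direct liveness checks, just to show who wins (a real implementation would use timeouts on the OK and I-won messages):

```python
def bully_election(initiator, alive, all_ids):
    """Return the coordinator chosen when `initiator` starts an election.
    `alive` is the set of processes currently up."""
    higher = [p for p in all_ids if p > initiator and p in alive]
    if not higher:
        return initiator           # no OK arrives: initiator wins
    # Each higher process that answers OK holds its own election in
    # turn; the highest live ID ends up sending "I won" to everyone.
    return max(higher)

# The example on the next slides: 7 has crashed, 4 starts the election.
print(bully_election(4, alive={0, 1, 2, 3, 4, 5, 6}, all_ids=range(8)))  # -> 6
```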
  77. 77. BULLY ALGORITHM EXAMPLE  The bully election algorithm  Process 4 holds an election  Process 5 and 6 respond, telling 4 to stop  Now 5 and 6 each hold an election
  78. 78. BULLY ALGORITHM EXAMPLE d) Process 6 tells 5 to stop e) Process 6 wins and tells everyone
  79. 79. LAST CLASS  Vector timestamps  Global state  Distributed Snapshot  Election algorithms
  80. 80. TODAY: STILL MORE CANONICAL PROBLEMS  Election algorithms  Bully algorithm  Ring algorithm  Distributed synchronization and mutual exclusion  Distributed transactions
  81. 81. ELECTION ALGORITHMS  Many distributed algorithms need one process to act as coordinator  Doesn't matter which process does the job, just need to pick one  Election algorithms: technique to pick a unique coordinator (aka leader election)  Examples: take over the role of a failed process, pick a master in the Berkeley clock synchronization algorithm  Types of election algorithms: Bully and Ring
  82. 82. BULLY ALGORITHM  Each process has a unique numerical ID  Processes know the IDs and address of every other process  Communication is assumed reliable  Key idea: select the process with the highest ID  A process initiates an election if it just recovered from failure or if the coordinator failed  3 message types: election, OK, I won  Several processes can initiate an election simultaneously
  83. 83. BULLY ALGORITHM DETAILS  Any process P can initiate an election  P sends Election messages to all processes with higher IDs and awaits OK messages  If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs  If it receives an OK, it drops out and waits for an I won  If a process receives an Election message, it returns an OK and starts an election of its own  If a process receives an I won, it treats the sender as coordinator
  84. 84. BULLY ALGORITHM EXAMPLE  The bully election algorithm  Process 4 holds an election  Process 5 and 6 respond, telling 4 to stop  Now 5 and 6 each hold an election
  85. 85. BULLY ALGORITHM EXAMPLE d) Process 6 tells 5 to stop e) Process 6 wins and tells everyone
  86. 86. RING-BASED ELECTION  Processes have unique IDs and are arranged in a logical ring  Each process knows its neighbors  Select the process with the highest ID  Begin an election if just recovered or the coordinator has failed  Send an Election message to the closest downstream node that is alive  Sequentially poll each successor until a live node is found  Each process tags its ID on the message  Initiator picks the node with the highest ID and sends a coordinator message  Multiple elections can be in progress  Wastes network bandwidth but does no harm
  87. 87. A RING ALGORITHM  Election algorithm using a ring.
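A sketch of a single election pass, with dead nodes skipped as described on the previous slide (the list-based ring and `alive` set are illustrative):

```python
def ring_election(ring, alive, initiator):
    """Return the coordinator found by one trip around the ring.
    `ring` lists process IDs in ring order."""
    n = len(ring)
    i = ring.index(initiator)
    seen = [initiator]             # IDs tagged onto the election message
    while True:
        i = (i + 1) % n            # next downstream node
        if ring[i] not in alive:
            continue               # poll successors until one is alive
        if ring[i] == initiator:
            break                  # message has come full circle
        seen.append(ring[i])
    # Initiator picks the highest ID and circulates a coordinator message.
    return max(seen)

print(ring_election(list(range(8)), alive={0, 1, 2, 3, 4, 5, 6}, initiator=5))
# -> 6 (process 7 is down)
```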
  88. 88. COMPARISON  Assume n processes and one election in progress  Bully algorithm  Worst case: initiator is the node with the lowest ID  Triggers n-2 elections at higher ranked nodes: O(n²) msgs  Best case: immediate election: n-2 messages  Ring  2(n-1) messages always
  89. 89. A TOKEN RING ALGORITHM a) An unordered group of processes on a network. b) A logical ring constructed in software. • Use a token to arbitrate access to the critical section • Must wait for the token before entering the CS • Pass the token to the neighbor once done or if not interested • Detecting token loss is non-trivial
  90. 90. COMPARISON  A comparison of three mutual exclusion algorithms:  Centralized: 3 messages per entry/exit; delay before entry: 2 message times; problem: coordinator crash  Distributed: 2(n – 1) messages per entry/exit; delay before entry: 2(n – 1) message times; problem: crash of any process  Token ring: 1 to ∞ messages per entry/exit; delay before entry: 0 to n – 1 message times; problems: lost token, process crash
  91. 91. TRANSACTIONS  Transactions provide a higher-level mechanism for atomicity of processing in distributed systems  Have their origins in databases  Banking example: three accounts A: $100, B: $200, C: $300  Client 1: transfer $4 from A to B  Client 2: transfer $3 from C to B  The result can be inconsistent unless certain properties are enforced. Interleaved execution: Client 1 reads A ($100) and writes A ($96); Client 2 reads C ($300) and writes C ($297); Client 1 reads B ($200); Client 2 reads B ($200) and writes B ($203); Client 1 writes B ($204). Client 2's $3 deposit is lost.
  92. 92. ACID PROPERTIES  Atomic: all or nothing  Consistent: a transaction takes the system from one consistent state to another  Isolated: immediate effects are not visible to other transactions (serializable)  Durable: changes are permanent once the transaction completes (commits)  Serialized execution of the example: Client 1 reads A ($100), writes A ($96), reads B ($200), writes B ($204); then Client 2 reads C ($300), writes C ($297), reads B ($204), writes B ($207). Both deposits are reflected in B's final balance of $207.
