6. Distributed Operating Systems

CLOCK SYNCHRONIZATION
Time ordering and clock synchronization
Leader election
Mutual exclusion
Distributed transactions
Deadlock detection

  1. 1. DISTRIBUTED OPERATING SYSTEMS Sandeep Kumar Poonia
  2. 2. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS  Time ordering and clock synchronization  Leader election  Mutual exclusion  Distributed transactions  Deadlock detection
  3. 3. THE IMPORTANCE OF SYNCHRONIZATION  Because various components of a distributed system must cooperate and exchange information, synchronization is a necessity.  Various components of the system must agree on the timing and ordering of events. Imagine a banking system that did not track the timing and ordering of financial transactions. Similar chaos would ensue if distributed systems were not synchronized.  Constraints, both implicit and explicit, are therefore enforced to ensure synchronization of components.
  4. 4. CLOCK SYNCHRONIZATION  As in non-distributed systems, the knowledge of “when events occur” is necessary.  However, clock synchronization is often more difficult in distributed systems because there is no ideal time source, and because distributed algorithms must sometimes be used.  Distributed algorithms must overcome:  Scattering of information  Local, rather than global, decision-making
  5. 5. CLOCK SYNCHRONIZATION  Time is unambiguous in centralized systems  The system clock keeps time, and all entities use it for time  Distributed systems: each node has its own system clock  Crystal-based clocks are less accurate (about one part in a million)  Problem: an event that occurred after another may be assigned an earlier time
  6. 6. LACK OF GLOBAL TIME IN DS  It is impossible to guarantee that physical clocks run at the same frequency  Lack of global time can cause problems  Example: UNIX make  Edit output.c at a client  output.o is at a server (compiled at the server)  The client machine's clock can lag behind the server machine's clock, so the newly edited output.c may be stamped older than output.o and make will skip the recompilation
  7. 7. LACK OF GLOBAL TIME – EXAMPLE When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
  8. 8. LOGICAL CLOCKS  For many problems, internal consistency of clocks is important  Absolute time is less important  Use logical clocks  Key idea:  Clock synchronization need not be absolute  If two machines do not interact, no need to synchronize them  More importantly, processes need to agree on the order in which events occur rather than the time at which they occurred
  9. 9. EVENT ORDERING  Problem: define a total ordering of all events that occur in a system  Events in a single processor machine are totally ordered  In a distributed system:  No global clock, local clocks may be unsynchronized  Cannot order events on different machines using local times  Key idea [Lamport]  Processes exchange messages  A message must be sent before it is received  Send/receive used to order events (and synchronize clocks)
  10. 10. LOGICAL CLOCKS  Often, it is not necessary for a computer to know the exact time, only relative time. This is known as “logical time”.  Logical time is not based on timing but on the ordering of events.  Logical clocks can only advance forward, not in reverse.  Non-interacting processes cannot share a logical clock.  Computers generally obtain logical time using interrupts to update a software clock. The more interrupts (the more frequently time is updated), the higher the overhead.
  11. 11. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM  The most common logical clock synchronization algorithm for distributed systems is Lamport's Algorithm. It is used in situations where ordering is important but global time is not required.  Based on the “happens-before” relation:  Event A “happens-before” Event B (A→B) when all processes involved in a distributed system agree that event A occurred first, and B subsequently occurred.  This DOES NOT mean that Event A actually occurred before Event B in absolute clock time.
  12. 12. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM  A distributed system can use the “happens-before” relation when:  Events A and B are observed by the same process, or by multiple processes with the same global clock  Event A is the sending of a message and Event B is the receipt of that message, since a message cannot be received before it is sent  If two events do not communicate via messages, they are considered concurrent – because order cannot be determined and it does not matter. Concurrent events can be ignored.
  13. 13. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)  In the previous examples, if a happens-before b, then Clock C(a) < C(b)  If a and b are concurrent, C(a) = C(b) is possible  Two events can share the same clock value only if they occur on different systems, because within one process the clock ticks between successive events, and every message transfer between two systems takes at least one clock tick.  In Lamport's Algorithm, logical clock values for events may be changed, but always by moving the clock forward. Time values can never be decreased.  An additional refinement in the algorithm is often used:  If Event A and Event B are concurrent with C(a) = C(b), some unique property of the processes associated with these events can be used to choose a winner. This establishes a total ordering of all events.  Process ID is often used as the tiebreaker.
  14. 14. LAMPORT'S LOGICAL CLOCK SYNCHRONIZATION ALGORITHM (CONT.)  Lamport's Algorithm can thus be used in distributed systems to ensure synchronization:  A logical clock is implemented in each node in the system.  Each node can determine the order in which events have occurred from its own point of view.  The logical clock of one node does not need to have any relation to real time or to any other node in the system.
  15. 15. EVENT ORDERING USING HB  Goal: define the notion of the time of an event such that  If A → B then C(A) < C(B)  If A and B are concurrent, then C(A) <, =, or > C(B)  Solution:  Each processor i maintains a logical clock LCi  Whenever an event occurs locally at i, LCi = LCi + 1  When i sends a message to j, piggyback LCi  When j receives a message from i  If LCj < LCi then LCj = LCi + 1, else do nothing  Claim: this algorithm meets the above goals
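The update rules on this slide map directly to code. A minimal sketch in Python (the class name and method split are illustrative, not from the deck); note that `max(local, received) + 1` covers both branches of the receive rule:

```python
class LamportClock:
    """Minimal Lamport logical clock for one process."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event (including a send) advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the timestamp is piggybacked on the message.
        return self.tick()

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp, then count the
        # receive itself as an event (the "LCj = LCi + 1" rule).
        self.time = max(self.time, msg_time) + 1
        return self.time

# Illustrative run: sender's clock at 6, receiver's lagging at 4.
p0, p1 = LamportClock(), LamportClock()
p0.time, p1.time = 6, 4
ts = p0.send()          # p0 advances to 7; message carries 7
print(p1.receive(ts))   # p1 jumps to 8 instead of ticking to 5
```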
  16. 16. PROCESSES EACH WITH THEIR OWN CLOCK • At time 6, Process 0 sends message A to Process 1 • It arrives at Process 1 at 16 (it took 10 ticks to make the journey) • Message B from 1 to 2 takes 16 ticks • Message C from 2 to 1 leaves at 60 and arrives at 56 - not possible • Message D from 1 to 0 leaves at 64 and arrives at 54 - not possible
  17. 17. LAMPORT'S ALGORITHM CORRECTS THE CLOCKS  Use the 'happened-before' relation  Each message carries the sending time (as per the sender's clock)  When it arrives, the receiver fast-forwards its clock to be one more than the sending time (between every two events, the clock must tick at least once)
  18. 18. PHYSICAL CLOCKS  The instantaneous time difference between two computer clocks is known as skew; the rate at which a clock gains or loses time relative to a reference is known as drift. Computer clock manufacturers specify a maximum drift rate in their products.  Computer clocks are among the least accurate modern timepieces.  Inside every computer is a quartz crystal oscillator used to keep time. These crystals cost only about 25 cents to produce.  Average loss of accuracy: 0.86 seconds per day  This skew is unacceptable for distributed systems. Several methods are now in use to attempt the synchronization of physical clocks in distributed systems:
  19. 19. PHYSICAL CLOCKS  Since the 17th century, time has been measured astronomically  Solar Day: interval between two consecutive transits of the sun  Solar Second: 1/86,400th of a solar day
  20. 20. PHYSICAL CLOCKS  1948: atomic clocks are invented  The most accurate clocks are atomic oscillators (one part in 10^13)  The BIH defines TAI (International Atomic Time)  86,400 TAI seconds is now about 3 msec less than a solar day  The BIH solves the problem by introducing a leap second whenever the discrepancy between TAI and solar time grows to 800 msec  The resulting time scale is called Universal Coordinated Time (UTC)  When the BIH announces a leap second, power companies raise their frequency to 61 Hz (in 60 Hz areas) or 51 Hz (in 50 Hz areas) for 60 or 50 seconds, to advance all the clocks in their distribution area.
  21. 21. PHYSICAL CLOCKS - UTC Coordinated Universal Time (UTC) is the international time standard. UTC is the current term for what was commonly referred to as Greenwich Mean Time (GMT). Zero hours UTC is midnight in Greenwich, England, which lies on the zero longitudinal meridian. UTC is based on a 24-hour clock.
  22. 22. PHYSICAL CLOCKS  Most clocks are less accurate (e.g., mechanical watches)  Computers use crystal-based clocks (accurate to about one part in a million)  Results in clock drift  How do you tell time?  Use astronomical metrics (solar day)  Coordinated universal time (UTC) – international standard based on atomic time  Add leap seconds to be consistent with astronomical time  UTC broadcast on radio (satellite and earth)  Receivers accurate to 0.1 – 10 ms  Need to synchronize machines with a master or with one another
  23. 23. CLOCK SYNCHRONIZATION  Each clock has a maximum drift rate ρ:  1 − ρ ≤ dC/dt ≤ 1 + ρ  Two clocks may drift apart by up to 2ρt in time t  To limit the skew to δ, resynchronize every δ/(2ρ) seconds
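Filling in the symbols with a number the deck itself quotes (crystal clocks accurate to about one part in a million, slide 22) and an assumed target skew δ of 1 ms:

```latex
\rho = 10^{-6},\qquad \delta = 10^{-3}\,\mathrm{s}
\quad\Rightarrow\quad
t_{\mathrm{resync}} = \frac{\delta}{2\rho}
  = \frac{10^{-3}}{2\times 10^{-6}} = 500\ \mathrm{s}
```

So with ordinary crystal clocks, millisecond-level agreement requires resynchronizing roughly every eight minutes.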
  24. 24. CRISTIAN'S ALGORITHM  Assuming there is one time server with UTC:  Each node in the distributed system periodically polls the time server.  The time is estimated as t + (Treq + Treply)/2, where t is the time returned by the server and Treq, Treply are the request and reply transit times  This process is repeated several times and an average is used.  The polling machine then attempts to adjust its time.  Disadvantages:  Must attempt to take the delay between the client and the time server into account  Single point of failure if the time server fails
  25. 25. CRISTIAN'S ALGORITHM  Synchronize machines to a time server with a UTC receiver  Machine P requests the time from the server every δ/(2ρ) seconds  Receives time t from the server; P sets its clock to t + treply, where treply is the time taken to send the reply to P  Use (treq + treply)/2 as an estimate of treply  Improve accuracy by making a series of measurements
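A client-side sketch of the algorithm, assuming a hypothetical `get_server_time()` RPC that returns the server's UTC value (here stubbed with the local clock so the snippet runs standalone). It uses the refinement from slide 27: keep the sample with the fastest round trip.

```python
import time

def get_server_time():
    # Stand-in for the RPC to the UTC time server; returns the local
    # clock so the sketch runs without a network.
    return time.time()

def cristian_offset(samples=5):
    """Estimate the correction to apply to the local clock."""
    best = None
    for _ in range(samples):
        t_req = time.time()            # local clock when request is sent
        t_server = get_server_time()   # time reported by the server
        t_resp = time.time()           # local clock when reply arrives
        rtt = t_resp - t_req
        # Assume symmetric delay: the reply took ~rtt/2 to arrive,
        # so the server's clock now reads about t_server + rtt/2.
        offset = t_server + rtt / 2 - t_resp
        if best is None or rtt < best[0]:
            best = (rtt, offset)       # fastest reply is most accurate
    return best[1]

print(cristian_offset())  # ~0 here, since the "server" is the local clock
```

A real client would then apply the offset gradually (slide 27's interrupt trick) rather than stepping the clock, so time never runs backward.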
  26. 26. PROBLEM WITH CRISTIAN'S ALGORITHM Major Problem  Time must never run backward  If the sender's clock is fast, CUTC will be smaller than the sender's current value of C Minor Problem  It takes nonzero time for the time server's reply to reach the sender  This delay may be large and vary with network load
  27. 27. SOLUTION Major Problem  Control the clock rate  Suppose the timer is set to generate 100 interrupts/sec  Normally each interrupt adds 10 msec to the time  To slow down, add only 9 msec per interrupt  To advance, add 11 msec per interrupt Minor Problem  Measure it  Make a series of measurements for accuracy  Discard the measurements that exceed a threshold value  The message that came back fastest can be taken to be the most accurate.
  28. 28. BERKELEY ALGORITHM  Used in systems without a UTC receiver  Keeps clocks synchronized with one another  One computer is the master, the others are slaves  Master periodically polls slaves for their times  Averages the times and returns differences to the slaves  Communication delays compensated as in Cristian's algorithm  Failure of master => election of a new master
  29. 29. BERKELEY ALGORITHM a) The time daemon asks all the other machines for their clock values b) The machines answer c) The time daemon tells everyone how to adjust their clock
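A sketch of one averaging round at the time daemon, with made-up clock readings standing in for the poll replies (the delay compensation mentioned above is omitted for brevity):

```python
def berkeley_round(master_time, slave_times):
    """Return the adjustment each machine should apply, keyed by name."""
    clocks = {"master": master_time, **slave_times}
    avg = sum(clocks.values()) / len(clocks)
    # Send each machine a relative adjustment rather than an absolute
    # time, so replies still in flight do not skew the result.
    return {name: avg - t for name, t in clocks.items()}

# Master reads 3:00 (180 min past noon); slaves read 2:50 and 3:25.
print(berkeley_round(180, {"slave1": 170, "slave2": 205}))
# -> {'master': 5.0, 'slave1': 15.0, 'slave2': -20.0}  (all move to 3:05)
```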
  30. 30. DECENTRALIZED AVERAGING ALGORITHM  Each machine on the distributed system has a daemon without UTC.  Periodically, at an agreed-upon fixed time, each machine broadcasts its local time.  Each machine calculates the correct time by averaging all results.
  31. 31. NETWORK TIME PROTOCOL (NTP)  Enables clients across the Internet to be synchronized accurately to UTC.  Overcomes large and variable message delays  Employs statistical techniques for filtering, based on past quality of servers and several other measures  Can survive lengthy losses of connectivity:  Redundant servers  Redundant paths to servers  Provides protection against malicious interference through authentication techniques
  32. 32. NETWORK TIME PROTOCOL (NTP) (CONT.)  Uses a hierarchy of servers located across the Internet. Primary servers are directly connected to a UTC time source.
  33. 33. NETWORK TIME PROTOCOL (NTP) (CONT.)  NTP has three modes:  Multicast Mode:  Suitable for user workstations on a LAN  One or more servers periodically multicasts the time to other machines on the network.  Procedure Call Mode:  Similar to Cristian's Algorithm  Provides higher accuracy than Multicast Mode because delays are compensated.  Symmetric Mode:  Pairs of servers exchange pairs of timing messages that contain time stamps of recent message events.  The most accurate, but also the most expensive mode Although NTP is quite advanced, there is still a drift of 20-35 milliseconds!
  34. 34. MORE PROBLEMS  Causality  Vector timestamps  Global state and termination detection  Election algorithms
  35. 35. LOGICAL CLOCKS  For many DS algorithms, associating an event to an absolute real time is not essential, we only need to know an unambiguous order of events  Lamport's timestamps  Vector timestamps
  36. 36. LOGICAL CLOCKS (CONT.)  Synchronization based on “relative time”.  “Relative time” may not relate to “real time”.  Example: Unix make (Is output.c updated after the generation of output.o?)  What's important is that the processes in the Distributed System agree on the ordering in which certain events occur.  Such “clocks” are referred to as Logical Clocks.
  37. 37. EXAMPLE: WHY ORDER MATTERS?  Replicated accounts in Jaipur (JP) and Bikaner (BN)  Two updates occur at the same time  Current balance: $1,000  Update 1: Add $100 at BN; Update 2: Add interest of 1% at JP  If each replica applies the updates in the order it receives them, BN ends with (1,000 + 100) × 1.01 = $1,111 while JP ends with 1,000 × 1.01 + 100 = $1,110  Whoops, inconsistent states!
  38. 38. LAMPORT ALGORITHM  Clock synchronization does not have to be exact  Synchronization not needed if there is no interaction between machines  Synchronization only needed when machines communicate  i.e. must only agree on ordering of interacting events
  39. 39. LAMPORT'S “HAPPENS-BEFORE” PARTIAL ORDER  Given two events e and e′, e < e′ if: 1. Same process: e <i e′, for some process Pi 2. Same message: e = send(m) and e′ = receive(m) for some message m 3. Transitivity: there is an event e* such that e < e* and e* < e′
  40. 40. CONCURRENT EVENTS  Given two events e and e′:  If neither e < e′ nor e′ < e, then e || e′ [Figure: space-time diagram of processes P1–P3 with events a–f and messages m1, m2]
  41. 41. LAMPORT LOGICAL CLOCKS  Substitute synchronized clocks with a global ordering of events:  ei < ej ⇒ LC(ei) < LC(ej)  LCi is a local clock: contains increasing values  each process i has its own LCi  Increment LCi on each event occurrence  within the same process i, if ej occurs before ek  LCi(ej) < LCi(ek)  If es is a send event and er receives that send, then  LCi(es) < LCj(er)
  42. 42. LAMPORT ALGORITHM  Each process increments its local clock between any two successive events  Each message contains a timestamp  Upon receiving a message, if the received timestamp is ahead, the receiver fast-forwards its clock to be one more than the sending time
  43. 43. LAMPORT ALGORITHM (CONT.)  Timestamp  Each event is given a timestamp t  If es is a send message m from pi, then t=LCi(es)  When pj receives m, set LCj value as follows  If t < LCj, increment LCj by one  Message regarded as next event on j  If t ≥ LCj, set LCj to t+1
  44. 44. LAMPORT'S ALGORITHM ANALYSIS (1)  Claim: ei < ej ⇒ LC(ei) < LC(ej)  Proof: by induction on the length of the sequence of events relating ei and ej [Figure: space-time diagram of P1–P3 with Lamport clock values 1–5 assigned to events a–g]
  45. 45. LAMPORT'S ALGORITHM ANALYSIS (2)  LC(ei) < LC(ej) ⇒ ei < ej?  Claim: if LC(ei) < LC(ej), then it is not necessarily true that ei < ej [Figure: counterexample diagram; two concurrent events receive distinct, ordered clock values]
  46. 46. TOTAL ORDERING OF EVENTS  Happens-before is only a partial order  Make the timestamp of an event e of process Pi be: (LC(e), i)  (a,b) < (c,d) iff a < c, or a = c and b < d
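In code, this refinement is just lexicographic comparison of (LC, process ID) pairs, e.g. in Python:

```python
# Timestamps as (Lamport clock value, process ID). Tuple comparison
# implements exactly the rule above: clock first, process ID as tiebreak.
a = (4, 2)   # clock 4 at process 2
b = (4, 3)   # clock 4 at process 3: concurrent with a, but ordered after it
c = (5, 1)
assert a < b < c
```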
  47. 47. APPLICATION: TOTALLY-ORDERED MULTICASTING  Message is timestamped with the sender's logical time  Message is multicast (including to the sender itself)  When a message is received  It is put into the local queue  Ordered according to timestamp  Acknowledgement is multicast  Message is delivered to applications only when  It is at the head of the queue  It has been acknowledged by all involved processes
  48. 48. APPLICATION: TOTALLY-ORDERED MULTICASTING  Update 1 is time-stamped and multicast. Added to local queues.  Update 2 is time-stamped and multicast. Added to local queues.  Acknowledgements for Update 2 sent/received. Update 2 can now be processed.  Acknowledgements for Update 1 sent/received. Update 1 can now be processed.  (Note: all queues are the same, as the timestamps have been used to ensure the “happens-before” relation holds.)
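A sketch of the hold-back queue at one process (class and callback names are illustrative; reliable multicast of both messages and acknowledgements is assumed, as the slides require):

```python
import heapq

class TotalOrderQueue:
    """Hold-back queue for totally-ordered multicast at one process."""
    def __init__(self, all_procs, deliver):
        self.all_procs = set(all_procs)
        self.deliver = deliver      # application-level delivery callback
        self.queue = []             # heap ordered by (timestamp, sender)
        self.acks = {}              # message id -> set of acknowledgers

    def on_multicast(self, ts, sender, payload):
        heapq.heappush(self.queue, ((ts, sender), payload))
        self.acks.setdefault((ts, sender), set())
        self._try_deliver()

    def on_ack(self, ts, sender, acker):
        self.acks.setdefault((ts, sender), set()).add(acker)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only from the head of the queue, and only once every
        # process has acknowledged it; this is what makes the order total.
        while self.queue:
            msg_id, payload = self.queue[0]
            if self.acks.get(msg_id) != self.all_procs:
                return
            heapq.heappop(self.queue)
            self.deliver(payload)

q = TotalOrderQueue({"P1", "P2"}, deliver=print)
q.on_multicast(1, "P1", "update 1")
q.on_ack(1, "P1", "P1")
q.on_ack(1, "P1", "P2")   # prints "update 1"
```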
  49. 49. LIMITATION OF LAMPORT'S ALGORITHM  ei < ej ⇒ LC(ei) < LC(ej)  However, LC(ei) < LC(ej) does not imply ei < ej  for instance, (1,1) < (1,3), but events a and e are concurrent [Figure: space-time diagram with (clock, process) timestamps from (1,1) to (5,3); a = (1,1) and e = (1,3) are concurrent]
  50. 50. VECTOR TIMESTAMPS  Pi's clock is a vector VTi[]  VTi[i] = number of events Pi has stamped  VTi[j] = what Pi thinks is the number of events Pj has stamped (i ≠ j)
  51. 51. VECTOR TIMESTAMPS (CONT.)  Initialization  the vector timestamp for each process is initialized to (0,0,…,0)  Local event  when an event occurs on process Pi, VTi[i] ← VTi[i] + 1  e.g., on processor 3, (1,2,1,3) → (1,2,2,3)
  52. 52. VECTOR TIMESTAMPS (CONT.)  Message passing  when Pi sends a message to Pj, the message has timestamp t[] = VTi[]  when Pj receives the message, it sets VTj[k] to max(VTj[k], t[k]), for k = 1, 2, …, N  e.g., P2 receives a message with timestamp (3,2,4) and P2's timestamp is (3,4,3); then P2 adjusts its timestamp to (3,4,4)
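The initialization, local-event, and message-passing rules of slides 51-52 fit in a small class; a sketch (names are illustrative):

```python
class VectorClock:
    """Vector timestamp for process `pid` in a system of n processes."""
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n            # initialized to (0,0,...,0)

    def local_event(self):
        self.vt[self.pid] += 1       # VTi[i] <- VTi[i] + 1

    def send(self):
        self.local_event()           # sending is itself an event
        return list(self.vt)         # the whole vector is piggybacked

    def receive(self, ts):
        # Componentwise max, slide 52's rule. (The variant on slide 59
        # also ticks the local entry to count the receive as an event.)
        self.vt = [max(a, b) for a, b in zip(self.vt, ts)]

# Slide 52's example: P2 at (3,4,3) receives a message stamped (3,2,4).
p2 = VectorClock(pid=1, n=3)
p2.vt = [3, 4, 3]
p2.receive([3, 2, 4])
print(p2.vt)   # -> [3, 4, 4]
```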
  53. 53. COMPARING VECTORS  VT1 = VT2 iff VT1[i] = VT2[i] for all i  VT1 ≤ VT2 iff VT1[i] ≤ VT2[i] for all i  VT1 < VT2 iff VT1 ≤ VT2 & VT1 ≠ VT2  for instance, (1, 2, 2) < (1, 3, 2)
  54. 54. VECTOR TIMESTAMP ANALYSIS  Claim: e < e′ iff e.VT < e′.VT [Figure: space-time diagram with vector timestamps [1,0,0] through [2,2,3] for events a–g]
  55. 55. APPLICATION: CAUSALLY-ORDERED MULTICASTING  For ordered delivery, we also need…  Multicast msgs (reliable but may be out-of-order)  Vi[i] is only incremented when sending  When k gets a msg from j, with timestamp ts, the msg is buffered until:  1: ts[j] = Vk[j] + 1  (this is the next timestamp that k is expecting from j)  2: ts[i] ≤ Vk[i] for all i ≠ j  (k has seen all msgs that were seen by j when j sent the msg)
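The two buffering conditions translate directly into a predicate; a sketch assuming, as the slide does, that Vi[i] counts only sends:

```python
def can_deliver(ts, sender, local_vt):
    """Causal-delivery test at a process whose vector clock is local_vt,
    for a message from process index `sender` carrying timestamp ts."""
    # 1: this is the next message expected from the sender
    next_from_sender = ts[sender] == local_vt[sender] + 1
    # 2: the receiver has seen everything the sender had seen
    seen_rest = all(ts[i] <= local_vt[i]
                    for i in range(len(ts)) if i != sender)
    return next_from_sender and seen_rest

# Slide 57's scenario: P2 (clock [0,0,0]) receives the reply r = [1,0,1]
# from P3 before the original post a = [1,0,0] from P1 arrives.
print(can_deliver([1, 0, 1], sender=2, local_vt=[0, 0, 0]))  # False: buffer r
print(can_deliver([1, 0, 0], sender=0, local_vt=[0, 0, 0]))  # True: deliver a
```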
  56. 56. CAUSALLY-ORDERED MULTICASTING [Figure: P1 multicasts “Post a” with timestamp [1,0,0]; P3 multicasts “r: Reply a” with timestamp [1,0,1]. Message a arrives at P2 before the reply r from P3 does.]
  57. 57. CAUSALLY-ORDERED MULTICASTING (CONT.) [Figure: message a arrives at P2 after the reply r = [1,0,1] from P3, so the reply is buffered and not delivered right away; once a = [1,0,0] is delivered, r is delivered.]
  58. 58. ORDERED COMMUNICATION  Totally ordered multicast  Use Lamport timestamps  Causally ordered multicast  Use vector timestamps
  59. 59. VECTOR CLOCKS  Each process i maintains a vector Vi  Vi[i] : number of events that have occurred at i  Vi[j] : number of events i knows have occurred at process j  Update vector clocks as follows  Local event: increment Vi[i]  Send a message: piggyback the entire vector V  Receipt of a message: Vj[k] = max(Vj[k], Vi[k])  Receiver is told about how many events the sender knows occurred at another process k  Also Vj[j] = Vj[j] + 1
  60. 60. GLOBAL STATE  Global state of a distributed system  Local state of each process  Messages sent but not received (state of the queues)  Many applications need to know the state of the system  Failure recovery, distributed deadlock detection  Problem: how can you figure out the state of a distributed system?  Each process is independent  No global clock or synchronization  Distributed snapshot: a consistent global state
  61. 61. GLOBAL STATE (1) a) A consistent cut b) An inconsistent cut
  62. 62. DISTRIBUTED SNAPSHOT ALGORITHM  Assume each process communicates with another process using unidirectional point-to-point channels (e.g., TCP connections)  Any process can initiate the algorithm  Checkpoint local state  Send marker on every outgoing channel  On receiving a marker  Checkpoint state if it is the first marker, send a marker on all outgoing channels, and save messages arriving on all other channels until:  A subsequent marker arrives on a channel: stop saving state for that channel
  63. 63. DISTRIBUTED SNAPSHOT  A process finishes when  It receives a marker on each incoming channel and processes them all  State: local state plus state of all channels  Send state to initiator  Any process can initiate a snapshot  Multiple snapshots may be in progress  Each is separate, and each is distinguished by tagging the marker with the initiator ID (and sequence number)
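A compact sketch of the per-process logic, under the slide's assumptions (unidirectional FIFO channels); the class and callback names are illustrative:

```python
MARKER = object()   # distinguished marker message

class SnapshotProcess:
    """One participant in the distributed snapshot algorithm above."""
    def __init__(self, state, incoming):
        self.state = state            # application state
        self.incoming = set(incoming) # names of incoming channels
        self.recorded_state = None    # local checkpoint
        self.channel_state = {}       # channel -> messages in flight
        self.open = set()             # channels still being recorded

    def start_snapshot(self, send_markers):
        self.recorded_state = self.state      # checkpoint local state
        self.open = set(self.incoming)
        self.channel_state = {c: [] for c in self.incoming}
        send_markers()                # marker on every outgoing channel

    def on_message(self, channel, msg, send_markers):
        """Returns True once this process's part of the snapshot is done."""
        if msg is MARKER:
            if self.recorded_state is None:   # first marker seen
                self.start_snapshot(send_markers)
            self.open.discard(channel)        # stop recording this channel
            return not self.open
        if self.recorded_state is not None and channel in self.open:
            # Message was in flight when the cut was taken: channel state.
            self.channel_state[channel].append(msg)
        # ...normal application processing of msg would go here...
        return False
```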
  64. 64. SNAPSHOT ALGORITHM EXAMPLE a) Organization of a process and channels for a distributed snapshot
  65. 65. SNAPSHOT ALGORITHM EXAMPLE b) Process Q receives a marker for the first time and records its local state c) Q records all incoming messages d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
  66. 66. TERMINATION DETECTION  Detecting the end of a distributed computation  Notation: let the sender be the predecessor, the receiver the successor  Two types of markers: Done and Continue  After finishing its part of the snapshot, process Q sends a Done or a Continue to its predecessor  Send a Done only when  All of Q's successors send a Done  Q has not received any message since it checkpointed its local state and received a marker on all incoming channels  Else send a Continue  Computation has terminated if the initiator receives Done messages from everyone
  67. 67. DISTRIBUTED SYNCHRONIZATION  Distributed system with multiple processes may need to share data or access shared data structures  Use critical sections with mutual exclusion  Single process with multiple threads  Semaphores, locks, monitors  How do you do this for multiple processes in a distributed system?  Processes may be running on different machines  Solution: lock mechanism for a distributed environment  Can be centralized or distributed
  68. 68. CENTRALIZED MUTUAL EXCLUSION  Assume processes are numbered  One process is elected coordinator (highest ID process)  Every process needs to check with coordinator before entering the critical section  To obtain exclusive access: send request, await reply  To release: send release message  Coordinator:  Receive request: if available and queue empty, send grant; if not, queue request  Receive release: remove next request from queue and send grant
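The coordinator's two message handlers are short; a sketch (the `grant` callback stands in for sending a grant message):

```python
from collections import deque

class Coordinator:
    """Centralized mutual-exclusion coordinator."""
    def __init__(self, grant):
        self.grant = grant        # callback: send grant to a process
        self.holder = None        # who is in the critical section
        self.waiting = deque()    # FIFO queue => fairness

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grant(pid)
        else:
            self.waiting.append(pid)   # no reply: requester stays blocked

    def on_release(self, pid):
        self.holder = None
        if self.waiting:
            self.holder = self.waiting.popleft()
            self.grant(self.holder)

c = Coordinator(grant=lambda pid: print("grant ->", pid))
c.on_request(1)    # grant -> 1
c.on_request(2)    # queued
c.on_release(1)    # grant -> 2
```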
  69. 69. MUTUAL EXCLUSION: A CENTRALIZED ALGORITHM a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply. c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2
  70. 70. PROPERTIES  Simulates a centralized lock using blocking calls  Fair: requests are granted the lock in the order they were received  Simple: three messages per use of a critical section (request, grant, release)  Shortcomings:  Single point of failure  How do you detect a dead coordinator?  A process cannot distinguish “lock in use” from a dead coordinator  No response from coordinator in either case  Performance bottleneck in large distributed systems
  71. 71. DISTRIBUTED ALGORITHM  [Ricart and Agrawala]: needs 2(n-1) messages  Based on event ordering and time stamps  Process k enters the critical section as follows  Generate new time stamp TSk = TSk + 1  Send request(k, TSk) to all other n-1 processes  Wait until reply(j) is received from all other processes  Enter critical section  Upon receiving a request message, process j  Sends reply if no contention  If already in the critical section, does not reply, queues the request  If it wants to enter, compares TSj with TSk and sends reply if TSk < TSj, else queues the request
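The receiver's decision reduces to one comparison on (timestamp, process ID); a sketch (the state names are illustrative):

```python
def should_reply_now(state, my_ts, my_pid, req_ts, req_pid):
    """Ricart-Agrawala receive rule: True = reply immediately,
    False = defer (queue) the reply until we leave the critical section."""
    if state == "RELEASED":        # no contention: always reply
        return True
    if state == "HELD":            # inside the critical section: defer
        return False
    # state == "WANTED": the lower (timestamp, pid) pair wins the tie.
    return (req_ts, req_pid) < (my_ts, my_pid)

print(should_reply_now("WANTED", my_ts=8, my_pid=2, req_ts=7, req_pid=5))  # True
print(should_reply_now("WANTED", my_ts=7, my_pid=2, req_ts=7, req_pid=5))  # False
```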
  72. 72. A DISTRIBUTED ALGORITHM a) Two processes want to enter the same critical region at the same moment. b) Process 0 has the lowest timestamp, so it wins. c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
  73. 73. PROPERTIES  Fully decentralized  N points of failure!  All processes are involved in all decisions  Any overloaded process can become a bottleneck
  74. 74. ELECTION ALGORITHMS  Many distributed algorithms need one process to act as coordinator  Doesn't matter which process does the job, just need to pick one  Election algorithms: technique to pick a unique coordinator (aka leader election)  Examples: take over the role of a failed process, pick a master in the Berkeley clock synchronization algorithm  Types of election algorithms: Bully and Ring algorithms
  75. 75. BULLY ALGORITHM  Each process has a unique numerical ID  Processes know the IDs and address of every other process  Communication is assumed reliable  Key idea: select the process with the highest ID  A process initiates an election if it just recovered from failure or if the coordinator failed  3 message types: election, OK, I won  Several processes can initiate an election simultaneously  Need a consistent result  O(n²) messages required with n processes
  76. 76. BULLY ALGORITHM DETAILS  Any process P can initiate an election  P sends Election messages to all processes with higher IDs and awaits OK messages  If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs  If it receives an OK, it drops out and waits for an I won  If a process receives an Election message, it returns an OK and starts an election of its own  If a process receives an I won, it treats the sender as coordinator
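A sketch that collapses the message exchange into direct liveness checks, just to show who wins (a real implementation would use timeouts on the OK and I-won messages):

```python
def bully_election(initiator, alive, all_ids):
    """Return the coordinator chosen when `initiator` starts an election.
    `alive` is the set of processes currently up."""
    higher = [p for p in all_ids if p > initiator and p in alive]
    if not higher:
        return initiator           # no OK arrives: initiator wins
    # Each higher process that answers OK holds its own election in
    # turn; the highest live ID ends up sending "I won" to everyone.
    return max(higher)

# The example on the next slides: 7 has crashed, 4 starts the election.
print(bully_election(4, alive={0, 1, 2, 3, 4, 5, 6}, all_ids=range(8)))  # -> 6
```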
  77. 77. BULLY ALGORITHM EXAMPLE  The bully election algorithm  Process 4 holds an election  Process 5 and 6 respond, telling 4 to stop  Now 5 and 6 each hold an election
  78. 78. BULLY ALGORITHM EXAMPLE d) Process 6 tells 5 to stop e) Process 6 wins and tells everyone
  79. 79. LAST CLASS  Vector timestamps  Global state  Distributed Snapshot  Election algorithms
  80. 80. TODAY: STILL MORE CANONICAL PROBLEMS  Election algorithms  Bully algorithm  Ring algorithm  Distributed synchronization and mutual exclusion  Distributed transactions
  81. 81. ELECTION ALGORITHMS  Many distributed algorithms need one process to act as coordinator  Doesn't matter which process does the job, just need to pick one  Election algorithms: technique to pick a unique coordinator (aka leader election)  Examples: take over the role of a failed process, pick a master in the Berkeley clock synchronization algorithm  Types of election algorithms: Bully and Ring
  82. 82. BULLY ALGORITHM  Each process has a unique numerical ID  Processes know the IDs and address of every other process  Communication is assumed reliable  Key idea: select the process with the highest ID  A process initiates an election if it just recovered from failure or if the coordinator failed  3 message types: election, OK, I won  Several processes can initiate an election simultaneously
  83. 83. BULLY ALGORITHM DETAILS  Any process P can initiate an election  P sends Election messages to all processes with higher IDs and awaits OK messages  If no OK messages arrive, P becomes coordinator and sends I won messages to all processes with lower IDs  If it receives an OK, it drops out and waits for an I won  If a process receives an Election message, it returns an OK and starts an election of its own  If a process receives an I won, it treats the sender as coordinator
  84. 84. BULLY ALGORITHM EXAMPLE  The bully election algorithm  Process 4 holds an election  Process 5 and 6 respond, telling 4 to stop  Now 5 and 6 each hold an election
  85. 85. BULLY ALGORITHM EXAMPLE d) Process 6 tells 5 to stop e) Process 6 wins and tells everyone
  86. 86. RING-BASED ELECTION  Processes have unique IDs and are arranged in a logical ring  Each process knows its neighbors  Select the process with the highest ID  Begin an election if just recovered or the coordinator has failed  Send an Election message to the closest downstream node that is alive  Sequentially poll each successor until a live node is found  Each process tags its ID on the message  Initiator picks the node with the highest ID and sends a coordinator message  Multiple elections can be in progress  Wastes network bandwidth but does no harm
  87. 87. A RING ALGORITHM  Election algorithm using a ring.
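A sketch of a single election pass, with dead nodes skipped as described on the previous slide (the list-based ring and `alive` set are illustrative):

```python
def ring_election(ring, alive, initiator):
    """Return the coordinator found by one trip around the ring.
    `ring` lists process IDs in ring order."""
    n = len(ring)
    i = ring.index(initiator)
    seen = [initiator]             # IDs tagged onto the election message
    while True:
        i = (i + 1) % n            # next downstream node
        if ring[i] not in alive:
            continue               # poll successors until one is alive
        if ring[i] == initiator:
            break                  # message has come full circle
        seen.append(ring[i])
    # Initiator picks the highest ID and circulates a coordinator message.
    return max(seen)

print(ring_election(list(range(8)), alive={0, 1, 2, 3, 4, 5, 6}, initiator=5))
# -> 6 (process 7 is down)
```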
  88. 88. COMPARISON  Assume n processes and one election in progress  Bully algorithm  Worst case: initiator is the node with the lowest ID  Triggers n-2 elections at higher ranked nodes: O(n²) msgs  Best case: immediate election: n-2 messages  Ring  2(n-1) messages always
  89. 89. A TOKEN RING ALGORITHM a) An unordered group of processes on a network. b) A logical ring constructed in software. • Use a token to arbitrate access to the critical section • Must wait for the token before entering the CS • Pass the token to the neighbor once done or if not interested • Detecting token loss is non-trivial
  90. 90. COMPARISON  A comparison of three mutual exclusion algorithms:  Centralized: 3 messages per entry/exit; delay before entry: 2 message times; problem: coordinator crash  Distributed: 2(n – 1) messages per entry/exit; delay before entry: 2(n – 1) message times; problem: crash of any process  Token ring: 1 to ∞ messages per entry/exit; delay before entry: 0 to n – 1 message times; problems: lost token, process crash
  91. 91. TRANSACTIONS  Transactions provide a higher-level mechanism for atomicity of processing in distributed systems  Have their origins in databases  Banking example: three accounts A: $100, B: $200, C: $300  Client 1: transfer $4 from A to B  Client 2: transfer $3 from C to B  The result can be inconsistent unless certain properties are enforced. Interleaved execution: Client 1 reads A ($100) and writes A ($96); Client 2 reads C ($300) and writes C ($297); Client 1 reads B ($200); Client 2 reads B ($200) and writes B ($203); Client 1 writes B ($204). Client 2's $3 deposit is lost.
  92. 92. ACID PROPERTIES  Atomic: all or nothing  Consistent: a transaction takes the system from one consistent state to another  Isolated: immediate effects are not visible to other transactions (serializable)  Durable: changes are permanent once the transaction completes (commits)  Serialized execution of the example: Client 1 reads A ($100), writes A ($96), reads B ($200), writes B ($204); then Client 2 reads C ($300), writes C ($297), reads B ($204), writes B ($207). Both deposits are reflected in B's final balance of $207.
