Published in: Technology

  • 1. Distributed Systems
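  • 2. Advantages
      • Resource Sharing: when processes at different sites are connected, user A can use user B's resources
      • Computation Speedup: a hard, complex problem is divided among several processors and solved jointly
      • Reliability: each processor has its own independent memory, so a failed processor does not affect the work of the others, and the others can help with recovery
      • Communication: any connected users can communicate and consult with each other over the network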
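  • 3. Types of operating system support
      • Data migration
          • Site A ─Data→ Site B
          • What is transferred depends on need, but the format must be consistent to avoid losing data
      • Computation migration
          • The user sends a command over the network to a remote processor
          • The remote processor executes it using its local resources
          • The result is sent back to the user
      • Process migration
          • A process is sent over the network to run at a remote site. Reasons for doing so:
              • Load Balancing
              • Computation Speedup
              • Hardware / Software Preference
              • Data Access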
  • 4. C Socket for Windows
  • 5. C Socket for Windows: Server.c
        #include <winsock2.h>
        #include <stdio.h>
        int main() {
            SOCKET server_sockfd, client_sockfd;
            int server_len, client_len;
            struct sockaddr_in server_address, client_address;
            // Register the Winsock DLL
            WSADATA wsadata;
            WSAStartup(0x101, (LPWSADATA)&wsadata);
            // Create the server socket
            server_sockfd = socket(AF_INET, SOCK_STREAM, 0);
            // AF_INET (use IPv4); SOCK_STREAM; 0 (i.e. TCP)
  • 6. C Socket for Windows: Server.c
            server_address.sin_family = AF_INET;
            server_address.sin_addr.s_addr = inet_addr("");   // the address string is empty in the source
            server_address.sin_port = htons(1234);            // port in network byte order
            server_len = sizeof(server_address);
            bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
            listen(server_sockfd, 5);                         // 5 = length of the pending-connection queue
  • 7. C Socket for Windows: Server.c
            while (1) {
                char ch;
                printf("Server waiting...\n");
                client_len = sizeof(client_address);
                client_sockfd = accept(server_sockfd, (struct sockaddr *)&client_address, &client_len);
                recv(client_sockfd, &ch, 1, 0);   // receive 'A'
                ch++;                             // 'A' -> 'B'
                send(client_sockfd, &ch, 1, 0);   // send 'B'
                closesocket(client_sockfd);
            }
            WSACleanup();   // outside the loop: calling it per client would shut Winsock down after the first connection
            return 0;
        }
  • 8. C Socket for Windows: Client.c
        #include <winsock2.h>
        #include <stdio.h>
        #include <stdlib.h>   // system()
        int main() {
            SOCKET sockfd;
            int len, result;
            struct sockaddr_in address;
            char ch = 'A';
            WSADATA wsadata;
            WSAStartup(0x202, (LPWSADATA)&wsadata);
            sockfd = socket(AF_INET, SOCK_STREAM, 0);
            address.sin_family = AF_INET;
  • 9. C Socket for Windows: Client.c
            address.sin_addr.s_addr = inet_addr("");   // the address string is empty in the source
            address.sin_port = htons(1234);            // must match the server's port, in network byte order
            len = sizeof(address);
            connect(sockfd, (struct sockaddr *)&address, len);
            send(sockfd, &ch, 1, 0);
            recv(sockfd, &ch, 1, 0);
            printf("char from server = %c\n", ch);
            closesocket(sockfd);
            WSACleanup();
            system("pause");
            return 0;
        }
  • 10. Client and server with threads
      • [Figure: on the client, thread 1 generates results and thread 2 makes requests to the server; on the server, an input-output thread handles receipt and queuing of requests, which are served by N threads. Source: Distributed Systems: Concepts and Design]
  • 11. Alternative server threading architectures
      • a. Thread-per-request: an I/O thread hands requests to workers, which access the remote objects
      • b. Thread-per-connection: per-connection threads access the remote objects
      • c. Thread-per-object: an I/O thread hands requests to per-object threads
      • Source: Distributed Systems: Concepts and Design
  • 12. C Thread (link with -lpthreadGC2)
  • 13. C Thread: pthread.c
        #include <stdio.h>
        #include <stdlib.h>    /* system() */
        #include <unistd.h>    /* sleep() */
        #include <pthread.h>
        void *thread_func(void *arg);
        char message[] = "Hello World";
        int main() {
            pthread_t thread;
            void *thread_result;
            /* start the thread and pass it the message */
            pthread_create(&thread, NULL, thread_func, (void *)message);
            printf("Waiting for thread to finish...\n");
  • 14. C Thread: pthread.c
            /* block until the thread exits and collect its return value */
            pthread_join(thread, &thread_result);
            printf("Thread joined, it returned %s\n", (char *)thread_result);
            system("pause");
            return 0;
        }
        void *thread_func(void *arg) {
            printf("thread %s is running\n", (char *)arg);
            sleep(3);
            pthread_exit("Thank you for using CPU time\n");
        }
  • 15. Java TCP Socket (per-connection threads): Client.java
        import java.net.*;
        import java.io.*;
        public class Client {
            public static void main(String args[]) {
                Socket s = null;
                try {
                    int serverPort = 1234;
                    s = new Socket("localhost", serverPort);
                    DataInputStream in = new DataInputStream(s.getInputStream());
                    DataOutputStream out = new DataOutputStream(s.getOutputStream());
                    out.writeUTF("Hello");
                    String data = in.readUTF();
                    System.out.println("Received: " + data);
                    s.close();
                } catch (IOException e) {
                    System.out.println(e.getMessage());
                } finally {
                    if (s != null)
                        try { s.close(); } catch (IOException e) {}
                }
            }
        }
  • 16. Java TCP Socket (per-connection threads): Server.java
        import java.net.*;
        import java.io.*;
        public class Server {
            public static void main(String args[]) {
                try {
                    int serverPort = 1234;
                    ServerSocket listenSocket = new ServerSocket(serverPort);
                    while (true) {
                        // one Connection thread per accepted client
                        Socket clientSocket = listenSocket.accept();
                        Connection c = new Connection(clientSocket);
                    }
                } catch (IOException e) {
                    System.out.println(e.getMessage());
                }
            }
        }
  • 17. Java TCP Socket (per-connection threads): Connection.java
        import java.net.*;
        import java.io.*;
        class Connection extends Thread {
            DataInputStream in;
            DataOutputStream out;
            Socket clientSocket;
            public Connection(Socket ClientSocket) {
                try {
                    clientSocket = ClientSocket;
                    in = new DataInputStream(clientSocket.getInputStream());
                    out = new DataOutputStream(clientSocket.getOutputStream());
                    this.start();
                } catch (IOException e) {
                    System.out.println(e.getMessage());
                }
            }
            public void run() {
                try {
                    String data = in.readUTF();
                    out.writeUTF("client data is " + data);
                } catch (IOException e) {
                    System.out.println(e.getMessage());
                } finally {
                    try { clientSocket.close(); } catch (IOException e) {}
                }
            }
        }
  • 18. Types of clock synchronization
      • External: synchronize all clocks against a single one, usually the one with external, accurate time information
      • Internal: synchronize all clocks among themselves
      • At least time monotonicity must be preserved
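  • 19. Types of clock synchronization
      • External (accuracy): clocks are synchronized to the time of an authoritative source
      • Each system clock $C_i$ differs by at most $D_{ext}$, at every point in the synchronization interval, from an external UTC source $S$: $|S - C_i| < D_{ext}$ for all $i$
      • [Figure: clocks C1, C2, C3 and the external source S]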
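  • 20. Types of clock synchronization
      • Internal (agreement): the clocks cooperate to synchronize with each other
      • Any two system clocks $C_i$ and $C_j$ differ by at most $D_{int}$ from each other at every point in the synchronization interval: $|C_j - C_i| < D_{int}$ for all $i$ and $j$
      • [Figure: clocks C1, C2, C3]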
  • 21. Types of clock synchronization
      • $D_{ext}$ and $D_{int}$ are synchronization bounds
      • $D_{int} \le 2 D_{ext}$
      • Max-Synch-interval = $D_{int} / 2D_{ext}$
      • It means:
          • If two events have single-value timestamps which differ by less than some value, we CAN'T SAY in which order the events occurred.
          • With interval timestamps, when intervals overlap, we CAN'T SAY in which order the events occurred.
  • 22. Clock synchronization in a synchronous system
      • [Figure: A sends its clock time $T_A$ to B; the message takes $T_{trans}$ of real time and arrives when A's clock reads $T_A + T_{trans}$ and B's clock reads $T_B$]
      • $T_{min} < T_{trans} < T_{max}$
      • The estimate $T_{trans} = (T_{min} + T_{max})/2$ is wrong by at most $(T_{max} - T_{min})/2$
      • If A sends its clock time $T_A$ to B, B can set its clock to $T_A + (T_{min} + T_{max})/2$
      • A and B are then synchronized with bound $(T_{max} - T_{min})/2$
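A worked instance of the bound on this slide may help. The numbers are hypothetical and chosen only for illustration: $T_{min} = 2$ ms, $T_{max} = 10$ ms, and A sends $T_A = 1000$ ms.

```latex
\begin{align*}
T_{trans} &\approx \frac{T_{min} + T_{max}}{2} = \frac{2 + 10}{2} = 6\ \text{ms} \\
T_B &\leftarrow T_A + 6 = 1006\ \text{ms} \\
|\text{error}| &\le \frac{T_{max} - T_{min}}{2} = \frac{10 - 2}{2} = 4\ \text{ms}
\end{align*}
```

Whatever the true delay was within $[2, 10]$ ms, B's clock ends up within 4 ms of A's.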
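  • 23. Clock synchronization in an asynchronous system
      • [Figure: A sends a request at $T_A$ and receives B's reply, carrying $T_B$, at $T'_A$; the round trip takes $T_{round}$]
      • In an asynchronous system, we have no $T_{max}$
      • How can A synchronize with B?
      • By using the round-trip time $T_{round} = T'_A - T_A$ in Cristian's algorithm: A sets its clock to $T_B + T_{round}/2$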
  • 24. Java RMI (External Clock Synchronization)
  • 25. Java RMI (External Clock Synchronization)
        // Clock.java: the remote interface
        import java.rmi.*;
        public interface Clock extends Remote {
            String getTime() throws RemoteException;
        }

        // ClockImpl.java: the remote object
        import java.rmi.*;
        import java.rmi.server.*;
        import java.util.*;
        public class ClockImpl extends UnicastRemoteObject implements Clock {
            public ClockImpl() throws RemoteException { super(); }
            public String getTime() {
                Date d = new Date();
                return d.toString();
            }
        }
  • 26. Java RMI (External Clock Synchronization)
        // ClockServer.java: binds the clock object in the RMI registry
        import java.rmi.*;
        public class ClockServer {
            public ClockServer() {
                try {
                    Clock c = new ClockImpl();
                    Naming.rebind("//localhost/ClockService", c);
                } catch (Exception e) {
                    System.out.print(e.getMessage());
                }
            }
            public static void main(String args[]) {
                new ClockServer();
            }
        }
  • 27. Java RMI (External Clock Synchronization)
        // ClockClient.java: looks up the clock service and prints its time
        import java.rmi.*;
        public class ClockClient {
            public static void main(String args[]) {
                try {
                    Clock c = (Clock) Naming.lookup("//localhost/ClockService");
                    System.out.println(c.getTime());
                } catch (Exception e) {
                    System.out.print(e.getMessage());
                }
            }
        }
  • 28. Logical time
      • One aspect of clock synchronization is to provide a mechanism whereby systems can assign sequence numbers ("timestamps") to messages upon which all cooperating processes can agree.
      • Leslie Lamport (1978) showed that clock synchronization need not be absolute. His two important points lead to "causality".
      • First point: if two processes do not interact, their clocks need not be synchronized; they can operate concurrently without fear of interfering with each other.
      • Second (critical) point: it is not important that all processes agree on the time, but rather that they agree on the order in which events occur.
      • Such "clocks" are referred to as logical clocks.
      • Logical time is based on the happens-before relationship.
  • 29. Event Ordering
      • Happens-before and concurrent events illustrated [figure not recoverable]
      • There is no causal path from e1 to e2 nor from e2 to e1: e1 and e2 are concurrent
      • There is no causal path from e1 to e6 nor from e6 to e1: e1 and e6 are concurrent
      • There is no causal path from e2 to e6 nor from e6 to e2: e2 and e6 are concurrent
      • Types of events: Send, Receive, Internal (change of state)
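The happens-before ordering on these slides is commonly realized with a Lamport logical clock. The following is a minimal sketch, not code from the deck; the class and method names (`LamportClock`, `tick`, `receive`) are assumptions chosen for illustration.

```java
// Minimal Lamport logical clock (illustrative sketch; all names are assumptions).
class LamportClock {
    private long time = 0;

    // Internal event or send event: advance the clock by one tick.
    long tick() {
        return ++time;
    }

    // Receive event: jump past the timestamp carried by the message,
    // so the receive is ordered after the corresponding send.
    long receive(long messageTimestamp) {
        time = Math.max(time, messageTimestamp) + 1;
        return time;
    }

    long now() {
        return time;
    }

    public static void main(String[] args) {
        LamportClock p = new LamportClock();
        LamportClock q = new LamportClock();
        long sent = p.tick();             // p stamps an outgoing message
        q.tick();                         // q performs two internal events
        q.tick();
        long received = q.receive(sent);  // q's clock moves past both its own time and the stamp
        System.out.println("send=" + sent + " receive=" + received);
    }
}
```

If event a happens before event b, this scheme gives a the smaller timestamp; the converse does not hold, so concurrent events may still receive ordered timestamps.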
  • 30. Co-ordination
      • Difficulties in distributed systems
          • Centralised solutions not appropriate: communications bottleneck
          • Fixed master-slave arrangements not appropriate: process crashes
          • Varying network topologies: ring, tree, arbitrary; connectivity problems
          • Failures must be tolerated if possible: link failures, process crashes
          • Impossibility results: in the presence of failures, especially in the asynchronous model
  • 31. Mutual Exclusion
      • Requirements
          • Safety: at most one process may execute in the CS at any time
          • Liveness: every request to enter and exit a CS is eventually granted
          • Ordering (desirable): requests to enter are granted according to causality order (FIFO)
      • Synchronization schemes

        Synchronization scheme     | Centralized                 | Distributed
        Based on mutual exclusion  | Central process             | Circulating token
        No mutual exclusion        | Physical clock, event count | Physical clocks, logical clocks
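  • 32. Mutual Exclusion
      • Three classes of implementation
      • Centralized Approach
          • When P1 wants to enter the critical section, it sends a Request message to the coordinator C. If the critical section is free, C sends back a Reply message granting permission, and P1 enters.
          • If P2 then also wants to enter the critical section, C places P2's Request in the waiting queue.
          • When P1 leaves the critical section, it sends a Release message to C, and C sends a Reply to the process that owns the next Request in the waiting queue.
      • Distributed Approach
          • Timestamps are compared
          • Each node must know the names of all nodes on the network and must announce its own name to the others, so nodes should be added as rarely as possible
          • When a node fails, the system should notify the other nodes immediately and repair it, so the nodes need regular maintenance
          • A process that has not yet entered the critical section will often stall waiting for other processes
      • Token Passing Approach
          • A suitable path is needed so that no node starves
          • If the token is lost, the system should create a new token
          • If a node on the path fails, the system should rebuild the best new path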
  • 33. Atomicity
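  • 34. Two-Phase Commit Protocol
      • [Figure: phase 1. The coordinator logs <prepare T> and sends prepare(T); each participant either logs <ready T> and answers ready(T), or logs <no T> and answers abort(T)]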
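  • 35. Two-Phase Commit Protocol
      • [Figure: phase 2. The coordinator logs <commit T> and sends commit(T), or logs <abort T> and sends abort(T); each participant answers acknowledge(T); the coordinator then logs <complete T>]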
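The coordinator's decision between the two phases can be sketched as follows. This is an illustrative sketch only: the names `Vote`, `Decision` and `decide` are assumptions, and a participant that fails to answer is assumed to be recorded as `NO` by the caller.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the coordinator's decision rule in two-phase commit (names are assumptions).
class TwoPhaseCommitSketch {
    enum Vote { READY, NO }
    enum Decision { COMMIT, ABORT }

    // Phase 1 collects one vote per participant.
    // Phase 2 commits only if every participant answered READY.
    static Decision decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.READY) {
                return Decision.ABORT;   // a single <no T> aborts the whole transaction
            }
        }
        return Decision.COMMIT;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(Vote.READY, Vote.READY)));
        System.out.println(decide(Arrays.asList(Vote.READY, Vote.NO)));
    }
}
```

The sketch leaves out logging and acknowledgements, which are what make the protocol recoverable after a crash.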
  • 36. Failure Handling in 2PC 
  • 37. Failure Handling in 2PC 
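  • 38. Deadlock Prevention and Avoidance
      • Resource Ordering Algorithm
          • All resources on the network are placed in a global resource ordering, according to the work we expect to do, and each is given a unique number
          • While a process holds resource i, it may not request any resource numbered lower than i, which reduces the chance of circular wait
          • Simple to implement; requires little overhead
      • Banker's Algorithm
          • The distributed system selects the most suitable process to act as the banker, which manages all resources on the network and allocates them to the processes
      • (New) Timestamp Priority Algorithm
          • The timestamp (TS) of each process on the network is used as its priority number
          • The smaller the TS, the higher the priority (the earlier the process started)
          • Only a higher-priority process may request resources from a lower-priority one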
  • 39. Timestamp Priority Algorithm
      • [Figure: processes labelled TR=5, TR=10, TR=10, TR=15]
  • 40. Deadlock Detection
      • Local Wait-For Graph
      • Global Wait-For Graph
          • Centralized Approach
          • Distributed Approach
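Deadlock detection on a wait-for graph amounts to finding a cycle. Below is a minimal sketch under the assumption that the global graph has already been assembled in one place (the centralized approach); the class name `WaitForGraph` and the adjacency-array representation are illustrative choices, not from the deck.

```java
// Cycle detection on a wait-for graph via depth-first search (illustrative sketch).
class WaitForGraph {
    // waitsFor[p] lists the processes that process p is waiting for.
    // A cycle in this graph means the processes on it are deadlocked.
    static boolean hasDeadlock(int[][] waitsFor) {
        int[] color = new int[waitsFor.length]; // 0 = unvisited, 1 = on the DFS stack, 2 = finished
        for (int p = 0; p < waitsFor.length; p++) {
            if (color[p] == 0 && dfs(waitsFor, p, color)) {
                return true;
            }
        }
        return false;
    }

    private static boolean dfs(int[][] g, int p, int[] color) {
        color[p] = 1;
        for (int q : g[p]) {
            if (color[q] == 1) return true;                 // back edge: cycle found
            if (color[q] == 0 && dfs(g, q, color)) return true;
        }
        color[p] = 2;
        return false;
    }

    public static void main(String[] args) {
        int[][] cyclic = { {1}, {2}, {0} };   // P0 -> P1 -> P2 -> P0
        int[][] chain  = { {1}, {2}, {} };    // P0 -> P1 -> P2
        System.out.println(hasDeadlock(cyclic) + " " + hasDeadlock(chain));
    }
}
```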
  • 41. Basic Distributed Algorithms
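  • 42. Complexity Measures
      • Computational Rounds
          • Synchronous algorithms: rounds are measured with a clock
          • Asynchronous algorithms: the number of rounds is determined by the number of waves of events propagated through the network
      • Local Running Time
      • Space
          • Global: the total space used by all computers
          • Local: the space needed by each computer
      • Message Complexity
          • The total number of messages sent by the computers
          • A message M sent over p edges has message complexity p|M|, where |M| is the length of M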
  • 43. Basic Distributed Algorithms
      • Ring Leader
      • Tree Leader
      • BFS
      • MST
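  • 44. Ring Leader
      • Each process sends its id to the next process in the ring
      • In each following round, every process does the following:
          • Receives an identifier id from the previous process
          • Compares that id with its own identifier
          • Sends the smaller of the two values to the next process in the ring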
  • 45. Algorithm RingLeader(id):
        Input: The unique identifier, id, for the processor running
        Output: The smallest identifier of a processor in the ring
        M ← [Candidate is id]
        Send message M to the successor processor in the ring
        done ← false
        repeat
            Get message M from the predecessor processor in the ring
            if M = [Candidate is i] then
                if i = id then
                    M ← [Leader is id]
                    done ← true
  • 46. Algorithm (continued)
                else
                    m ← min{i, id}
                    M ← [Candidate is m]
            else {M is a "Leader is" message}
                done ← true
            Send message M to the next processor in the ring
        until done
        return M
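The RingLeader pseudocode can be exercised with a small single-machine simulation. This is a sketch, assuming a synchronous ring with unique ids in which processor `p` sends to processor `(p + 1) % n`; the class name `RingLeaderSim` and the array-based message passing are illustrative, not part of the original algorithm text.

```java
// Round-by-round simulation of RingLeader on a synchronous ring (illustrative sketch).
class RingLeaderSim {
    // ids[p] is the unique id of processor p; returns the id elected as leader.
    static int elect(int[] ids) {
        int n = ids.length;
        int[] msgValue = ids.clone();           // round 0: everyone sends [Candidate is id]
        boolean[] msgIsLeader = new boolean[n]; // true when the message in flight is "Leader is"
        boolean[] done = new boolean[n];
        int leader = -1;
        int remaining = n;
        while (remaining > 0) {
            int[] nextValue = new int[n];
            boolean[] nextIsLeader = new boolean[n];
            for (int p = 0; p < n; p++) {
                if (done[p]) continue;
                int pred = (p + n - 1) % n;
                int i = msgValue[pred];         // message sent by the predecessor last round
                if (msgIsLeader[pred]) {        // "Leader is" message: forward it and stop
                    nextValue[p] = i;
                    nextIsLeader[p] = true;
                    done[p] = true;
                    remaining--;
                } else if (i == ids[p]) {       // own id came back around: p is the leader
                    leader = i;
                    nextValue[p] = i;
                    nextIsLeader[p] = true;
                    done[p] = true;
                    remaining--;
                } else {                        // pass on the smaller candidate
                    nextValue[p] = Math.min(i, ids[p]);
                }
            }
            msgValue = nextValue;
            msgIsLeader = nextIsLeader;
        }
        return leader;
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[] {5, 3, 9, 1, 7}));
    }
}
```

The smallest id travels once around the ring before its owner recognizes it, and the "Leader is" message then needs a second lap, which matches the roughly 2N rounds quoted in the analysis.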
  • 47. Analysis
      • Computational Rounds: O(N) (at most 2N rounds)
      • Local Running Time: O(N)
      • Local Space: O(1)
      • Message Complexity: O(N^2)
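  • 48. Tree Leader
      • Assume the network is a free tree
          • Natural starting points: the external nodes
      • Asynchronous
          • Message check: has a message been sent on a given edge and arrived at this node?
      • Two phases
          • Accumulation Phase: ids flow in from the external nodes of the tree, and the smallest id seen is recorded; this finds the leader
          • Broadcast Phase: the leader id is broadcast out to the external nodes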
  • 49. Algorithm TreeLeader(id):
        Input: The unique identifier, id, for the processor running
        Output: The smallest identifier of a processor in the tree
        {Accumulation Phase}
        Let d be the number of neighbors of processor id
        m ← 0 {counter for messages received}
        ℓ ← id {tentative leader}
        repeat {begin a new round}
            for each neighbor j do
                check if a message from processor j has arrived
                if a message M = [Candidate is i] from j has arrived then
                    ℓ ← min{i, ℓ}
                    m ← m + 1
  • 50. Algorithm (continued)
        until m ≥ d − 1
        if m = d then
            M ← [Leader is ℓ]
            for each neighbor j do
                send message M to processor j
            return M {M is a "Leader is" message}
        else
            M ← [Candidate is ℓ]
            send M to the neighbor k that has not sent a message yet
  • 51. Algorithm (continued)
        {Broadcast Phase}
        repeat {begin a new round}
            check if a message from processor k has arrived
            if a message M from k has arrived then
                m ← m + 1
                if M = [Candidate is i] then
                    ℓ ← min{i, ℓ}
                    M ← [Leader is ℓ]
                    for each neighbor j do
                        send message M to processor j
  • 52. Algorithm (continued)
                else {M is a "Leader is" message}
                    for each neighbor j ≠ k do
                        send message M to processor j
        until m = d
        return M {M is a "Leader is" message}
  • 53. Analysis
      • d_i is the number of neighbors of processor i
      • Computational Rounds: O(D)
      • Local Running Time: O(d_i D)
      • Local Space: O(d_i)
      • Message Complexity: O(N)
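  • 54. Tree Leader
      • Synchronous
          • Like the ripples raised by a stone thrown into a pond
          • The diameter is the length of the longest path between any two nodes in the graph
          • The number of rounds is the diameter
      • Two phases
          • Accumulation Phase: toward the center
          • Broadcast Phase: propagating outward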
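  • 55. Breadth-first Search
      • Node s is designated as the source node
      • Synchronous
          • The search spreads outward as a wave
          • The BFS tree is built layer by layer, from the top down
          • Each node v sends a message to the neighbors that have not contacted v before
          • Every node v other than s must choose another node as its parent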
  • 56. Algorithm SynchronousBFS(v,s):
        Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
        Output: For each node v, its parent in a BFS tree rooted at s
        repeat {begin a new round}
            if v = s or v has received a message from one of its neighbors then
                set parent(v) to be a node requesting v to become its child (or null, if v = s)
                for each node w adjacent to v that has not contacted v yet do
                    send a message to w asking w to become a child of v
        until v = s or v has received a message
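The wave behaviour of SynchronousBFS can be simulated centrally, one loop iteration per round. This is a sketch, not a distributed implementation: the adjacency-array input, the class name `SyncBfsSim`, and the convention that a node accepts the first invitation it sees are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Centralized simulation of the synchronous BFS wave (illustrative sketch).
class SyncBfsSim {
    // adj[v] lists the neighbors of v. Returns parent[], with parent[s] = -1
    // and -2 for any node that was never reached.
    static int[] bfsParents(int[][] adj, int s) {
        int[] parent = new int[adj.length];
        Arrays.fill(parent, -2);
        parent[s] = -1;                          // the start node has no parent
        List<Integer> frontier = new ArrayList<>();
        frontier.add(s);
        while (!frontier.isEmpty()) {            // one iteration = one synchronous round
            List<Integer> next = new ArrayList<>();
            for (int v : frontier) {
                for (int w : adj[v]) {           // v invites its neighbors to become children
                    if (parent[w] == -2) {       // w accepts the first invitation it receives
                        parent[w] = v;
                        next.add(w);
                    }
                }
            }
            frontier = next;                     // nodes reached this round send in the next one
        }
        return parent;
    }

    public static void main(String[] args) {
        int[][] square = { {1, 2}, {0, 3}, {0, 3}, {1, 2} };
        System.out.println(Arrays.toString(bfsParents(square, 0)));
    }
}
```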
  • 57. Analysis
      • n nodes, m edges
      • Computational Rounds
      • Local Running Time
      • Local Space
      • Message Complexity: O(n+m)
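  • 58. Breadth-first Search
      • Asynchronous
          • Requires every processor to know the total number of processes in the network
          • The root node s sends out a "pulse" message to trigger the other processes to start the next round of the overall computation
      • Combines
          • A pulse-down that travels from the root s down the BFS tree
          • A pulse-up that travels from the external nodes of the BFS tree up to the root s
          • A new pulse-down is issued only after the pulse-up has been received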
  • 59. Algorithm AsynchronousBFS(v,s):
        Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
        Output: For each node v, its parent in a BFS tree rooted at s
        C ← ∅ {verified BFS children for v}
        set A to be the set of neighbors of v
        repeat {begin a new round}
            if parent(v) is defined or v = s then
                if parent(v) is defined then
                    wait for a pulse-down message from parent(v)
  • 60. Algorithm (continued)
                if C is not empty then {v is an internal node in the BFS tree}
                    send a pulse-down message to all nodes in C
                    wait for a pulse-up message from all nodes in C
                else {v is an external node in the BFS tree}
                    for each node u in A do
                        send a make-child message to u
  • 61. Algorithm (continued)
                    for each node u in A do
                        get a message M from u and remove u from A
                        if M is an accept-child message then
                            add u to C
                send a pulse-up message to parent(v)
            else {v ≠ s has no parent yet}
                for each node w in A do
                    if w has sent v a make-child message then
                        remove w from A {w is no longer a candidate child for v}
  • 62. Algorithm (continued)
                        if parent(v) is undefined then
                            parent(v) ← w
                            send an accept-child message to w
                        else
                            send a reject-child message to w
        until (v has received message done) or (v = s and has pulsed-down n−1 times)
        send a done message to all the nodes in C
  • 63. Analysis
      • n nodes, m edges
      • Computational Rounds
      • Local Running Time
      • Local Space
      • Message Complexity: O(n^2 + m)
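  • 64. Minimum Spanning Tree
      • Based on the efficient sequential algorithm for finding an MST (the slide calls it "Baruskal"; the distributed steps below follow Barůvka's algorithm, while the pseudocode on the next slides is Kruskal's)
      • The distributed algorithm in the synchronous model:
          • Determine all connected components
          • For each connected component, find the minimum-weight edge joining it to another component
          • Add that edge, merging the two components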
  • 65. Algorithm KruskalMST(G):
        Input: A simple connected weighted graph G with n vertices and m edges
        Output: A minimum spanning tree T for G
        for each vertex v in G do
            define an elementary cluster C(v) ← {v}
        initialize a priority queue Q to contain all edges in G, using the weights as keys
        T ← ∅
  • 66. Algorithm KruskalMST(G) (continued)
        while T has fewer than n−1 edges do
            (u,v) ← Q.removeMin()
            Let C(v) be the cluster containing v, and let C(u) be the cluster containing u
            if C(v) ≠ C(u) then
                Add edge (v,u) to T
                Merge C(v) and C(u) into one cluster, that is, union C(v) and C(u)
        return tree T
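The KruskalMST pseudocode translates fairly directly into runnable code. The sketch below assumes vertices numbered 0..n−1 and edges given as `{u, v, weight}` triples; clusters are tracked with a simple union-find array, which is an implementation choice not stated on the slides, and the method returns the total MST weight rather than the tree itself.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Sequential Kruskal MST with a union-find cluster structure (illustrative sketch).
class KruskalSketch {
    // Find the representative of v's cluster, shortening the path as we go.
    static int find(int[] cluster, int v) {
        while (cluster[v] != v) {
            cluster[v] = cluster[cluster[v]];
            v = cluster[v];
        }
        return v;
    }

    // edges[k] = {u, v, weight}. Returns the total weight of a minimum spanning tree
    // of a connected graph with n vertices.
    static int mstWeight(int n, int[][] edges) {
        int[] cluster = new int[n];
        for (int v = 0; v < n; v++) cluster[v] = v;          // elementary clusters C(v) = {v}
        PriorityQueue<int[]> q =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[2]));
        q.addAll(Arrays.asList(edges));                      // weights are the keys
        int total = 0;
        int used = 0;
        while (used < n - 1 && !q.isEmpty()) {
            int[] e = q.poll();                              // Q.removeMin()
            int cu = find(cluster, e[0]);
            int cv = find(cluster, e[1]);
            if (cu != cv) {                                  // different clusters: keep the edge
                cluster[cu] = cv;                            // merge C(u) and C(v)
                total += e[2];
                used++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 1, 1}, {1, 2, 2}, {2, 3, 3}, {0, 3, 4}, {0, 2, 5} };
        System.out.println(mstWeight(4, edges));
    }
}
```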
  • 67. Analysis
      • n nodes, m edges
      • Computational Rounds: O(log n)
      • Local Running Time
      • Local Space: O(m)
      • Message Complexity: O(m log n)
  • 68. Clock Synchronization Algorithms
  • 69. Synchronization Algorithms
      • Multicast
      • Uses a central time server to synchronize clocks
          • Cristian's algorithm (centralised)
          • Berkeley algorithm (centralised)
      • The Network Time Protocol (decentralised)
  • 70. Cristian's Algorithm (1989)
      • Uses a time server to synchronize clocks; the server holds the reference time
      • Clients ask the time server for the time
          • The period depends on the maximum clock drift and the accuracy required
      • Clients receive the value and may:
          • use it as it is
          • add the known minimum network delay
          • add half the time between this send and receive
      • For links with symmetrical latency:
          • RTT = response-received-time − request-sent-time
          • adjusted-local-time =
              • server-timestamp + minimum network delay, or
              • server-timestamp + (RTT / 2), or
              • server-timestamp + (RTT − server-latency) / 2
          • local-clock-error = adjusted-local-time − local-time
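The round-trip formulas for Cristian's algorithm can be written out as a few lines of arithmetic. This is a minimal sketch, assuming all times are in milliseconds and latency is symmetrical; the class and parameter names are illustrative.

```java
// Client-side arithmetic of Cristian's algorithm (illustrative sketch; names are assumptions).
class CristianEstimate {
    // requestSent and responseReceived are read from the client's clock;
    // serverTimestamp is the time reported by the server;
    // serverLatency is the server's processing time (pass 0 if unknown).
    static long adjustedTime(long requestSent, long responseReceived,
                             long serverTimestamp, long serverLatency) {
        long rtt = responseReceived - requestSent;
        return serverTimestamp + (rtt - serverLatency) / 2;
    }

    // How far the local clock is from the adjusted estimate.
    static long clockError(long adjustedLocalTime, long localTime) {
        return adjustedLocalTime - localTime;
    }

    public static void main(String[] args) {
        long adjusted = adjustedTime(1000, 1020, 5000, 0);   // RTT = 20, half of it is 10
        System.out.println(adjusted + " error=" + clockError(adjusted, 1020));
    }
}
```

With a request sent at 1000, a response received at 1020 and a server timestamp of 5000, the estimate is 5010, so the local clock is 3990 ms behind.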
  • 71. Berkeley algorithm (Gusella & Zatti, 1989)
      • if no machines have receivers, …
      • The Berkeley algorithm uses a designated server to synchronize
      • The designated server polls or broadcasts to all machines for their time, adjusts the times received for RTT and latency, averages the times, and tells each machine how to adjust
      • Polling is done using Cristian's algorithm
      • The average time is more accurate, but still drifts
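The averaging step of the Berkeley algorithm can be sketched as below. The sketch assumes the readings have already been corrected for round-trip delay and that no outliers are discarded; the class name `BerkeleyAverage` and the use of plain integer averaging are illustrative simplifications.

```java
import java.util.Arrays;

// Averaging step of the Berkeley algorithm (illustrative sketch; names are assumptions).
class BerkeleyAverage {
    // clocks[k] is the (delay-corrected) reading of machine k, with the designated
    // server's own clock included. Returns the adjustment each machine should apply.
    static long[] adjustments(long[] clocks) {
        long sum = 0;
        for (long c : clocks) sum += c;
        long average = sum / clocks.length;
        long[] adjust = new long[clocks.length];
        for (int k = 0; k < clocks.length; k++) {
            adjust[k] = average - clocks[k];   // positive: advance the clock, negative: slow it down
        }
        return adjust;
    }

    public static void main(String[] args) {
        // Readings in minutes past midnight: 3:00, 3:25 and 2:50.
        System.out.println(Arrays.toString(adjustments(new long[] {180, 205, 170})));
    }
}
```

Sending each machine a relative adjustment instead of an absolute time avoids adding another transmission delay to the value.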
  • 72. Network Time Protocol
      • NTP is one of the best known and most widely implemented decentralised algorithms
      • Used for time synchronization on the Internet
      • [Figure: tree of servers by level. Level 1: primary server, direct synchronization. Level 2: secondary servers, synchronized by the primary server. Level 3: tertiary servers, synchronized by the secondary servers]
  • 73. Mutual Exclusion Algorithms
  • 74. Assumptions
      • Each pair of processes is connected by reliable channels (such as TCP).
      • Messages are eventually delivered to the recipient's input buffer.
      • Processes will not fail.
      • There is agreement on how a resource is identified
          • Pass the identifier with requests
  • 75. Exclusive Access Algorithm
      • Centralized Algorithm
      • Token Ring Algorithm
      • Lamport Algorithm (Timestamp Approach)
      • Ricart & Agrawala Algorithm
      • Leader Election Algorithms
          • Bully Algorithm
          • Ring Algorithm
          • Chang & Roberts Algorithm
          • Itai & Rodeh Algorithm
  • 76. Centralized Algorithm
      • Operations
          1. Request resource: send a request to the coordinator to enter the CS
          2. Wait for response
          3. Receive grant: the coordinator grants permission to enter the CS and keeps a queue of requests to enter the CS
          4. Access resource
          5. Release resource: send a release message to inform the coordinator
      • Safety, liveness and order are guaranteed
      • Delay
          • Client and synchronization: one round-trip time (release + grant)
      • [Figure: process P exchanges Request(R), Grant(R) and Release(R) with coordinator C; the coordinator holds a queue of requests (4, 2) from processes P1 to P4]
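The coordinator's bookkeeping in the centralized algorithm is small enough to sketch directly. The following is an in-memory illustration only: the class name `Coordinator`, integer process ids, and return values standing in for Grant messages are all assumptions, and no networking is shown.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// State kept by the coordinator in the centralized algorithm (illustrative sketch).
class Coordinator {
    private Integer holder = null;                          // process currently in the CS, if any
    private final Deque<Integer> queue = new ArrayDeque<>(); // pending requests, in arrival order

    // Request: returns true if the grant is sent immediately, false if the request is queued.
    boolean request(int pid) {
        if (holder == null) {
            holder = pid;
            return true;
        }
        queue.addLast(pid);
        return false;
    }

    // Release: returns the process that receives the next grant, or null if nobody is waiting.
    Integer release(int pid) {
        if (holder == null || holder != pid) {
            throw new IllegalStateException("process " + pid + " does not hold the resource");
        }
        holder = queue.pollFirst();
        return holder;
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        System.out.println(c.request(1));   // granted at once
        System.out.println(c.request(2));   // queued behind process 1
        System.out.println(c.release(1));   // grant passes to process 2
    }
}
```

Because grants are issued in queue order, requests are served in the order the coordinator received them.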
  • 77. Token Ring Algorithm
      • Operations
          • For each CS a token is used.
          • Only the process holding the token can enter the CS.
          • To exit the CS, the process sends the token on to its neighbor.
          • If a process does not need to enter the CS when it receives the token, it forwards the token to the next neighbor.
      • Only one process holds the token at any time, which guarantees mutual exclusion
      • The order is well-defined, so starvation cannot occur
      • If the token is lost (e.g. the process died), it must be regenerated
      • Safety and liveness are guaranteed, but ordering is not.
      • Delay
          • Client: 0 to N message transmissions.
          • Synchronization: between one process's exit from the CS and the next process's entry, there are between 1 and N message transmissions.
  • 78. Lamport Algorithm
      • A total ordering of requests is established by logical timestamps.
      • Each process maintains a request queue (of mutual exclusion requests)
      • Requesting the CS, Pi
          • multicasts "request" (i, Ti) to all processes (Ti is the local Lamport time)
          • places the request on its own queue
          • waits until all processes "reply"
      • Entering the CS, Pi
          • has received a message (ack or release) from every other process with a timestamp larger than Ti
      • Releasing the CS, Pi
          • removes the request from its queue
          • sends a timestamped release message
          • Removing the released request from a receiver's queue may leave that receiver's own entry with the earliest timestamp, enabling it to access the critical section
  • 79. Ricart & Agrawala Algorithm
      • Uses reliable multicast and logical clocks
      • A process wants to enter the critical section
          • It composes a message containing:
              • Identifier (machine ID, process ID)
              • Name of resource
              • Current time
          • It sends the request to all processes and waits until everyone gives permission
      • When a process receives a request
          • If the receiver is not interested: send OK to the sender
          • If the receiver is in the critical section: do not reply; add the request to the queue
          • If the receiver has just sent a request as well:
              • Compare the timestamps of the received and sent messages: the earliest wins
              • If the receiver is the loser, it sends OK; if the receiver is the winner, it does not reply and queues the request
      • When done with the critical section: send OK to all queued requests
  • 80. Ricart & Agrawala Algorithm
        On initialization
            state := RELEASED;
        To enter the critical section
            state := WANTED;
            Multicast request to all processes;    {request processing deferred here}
            T := request's timestamp;
            Wait until (number of replies received = (N − 1));
            state := HELD;
        On receipt of a request <Ti, pi> at pj (i ≠ j)
            if (state = HELD) or ((state = WANTED) and ((T, pj) < (Ti, pi))) then
                queue request from pi without replying;
            else
                reply immediately to pi;
        To exit the critical section
            state := RELEASED;
            reply to any queued requests;
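The "on receipt of a request" rule is the heart of the algorithm and can be isolated as a pure function. This is a sketch: the names `RicartAgrawalaRule`, `State` and `defer` are assumptions, and the pair comparison (T, pj) < (Ti, pi) is taken to be lexicographic, with the process id breaking timestamp ties.

```java
// Decision rule applied by p_j when a request <ti, pi> arrives (illustrative sketch).
class RicartAgrawalaRule {
    enum State { RELEASED, WANTED, HELD }

    // Returns true when p_j must queue the request without replying,
    // false when it should reply immediately.
    // t is the timestamp of p_j's own outstanding request (ignored unless state is WANTED).
    static boolean defer(State state, long t, int pj, long ti, int pi) {
        boolean ownRequestFirst = (t < ti) || (t == ti && pj < pi);   // (T, pj) < (Ti, pi)
        return state == State.HELD || (state == State.WANTED && ownRequestFirst);
    }

    public static void main(String[] args) {
        // P2 requested at time 78 and receives P1's request stamped 87: P2 goes first.
        System.out.println(defer(State.WANTED, 78, 2, 87, 1));
        // P1 requested at time 87 and receives P2's request stamped 78: P1 replies at once.
        System.out.println(defer(State.WANTED, 87, 1, 78, 2));
    }
}
```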
  • 81. Ricart & Agrawala Algorithm
      • Safety, liveness, and ordering are guaranteed.
      • It takes 2(N−1) messages per entry operation (N−1 multicast requests + N−1 replies); N messages if the underlying network supports multicast. [3(N−1) in Lamport's algorithm]
      • Delay
          • Client: one round-trip time
          • Synchronization: one message transmission time
      • [Figure: processes P1, P2, P3. P2's message has timestamp 78 and P1's message has timestamp 87. P2 cannot send a reply to P1 yet, because P1's timestamp is greater than P2's. P2 changes to "held"; P1 remains in "wanted" until P2 sends its reply]
  • 82. Leader Election Algorithms
      • The problem
          • N processes, which may or may not have unique IDs (UIDs)
          • For simplicity, assume no crashes
          • A unique master coordinator must be chosen amongst the processes
      • Requirements
          • Every process knows P, the identity of the leader, where P is a unique process id (usually the maximum) or is as yet undefined.
          • All processes participate and eventually discover the identity of the leader (which cannot be undefined).
          • When a coordinator fails, the algorithm must elect the active process with the largest priority number
      • Two types of algorithm
          • Bully: "the biggest guy in town wins"
          • Ring: a logical, cyclic grouping
  • 83. Bully Algorithm
      • Assumptions
          • Synchronous system
          • All messages arrive within Ttrans units of time.
          • A reply is dispatched within Tprocess units of time of the receipt of a message.
          • If no response is received within 2Ttrans + Tprocess, the node is assumed to be dead.
      • A process that knows it has the highest id elects itself coordinator and sends a coordinator message to all processes with lower ids
      • When a process P notices that the coordinator has not responded to requests for too long, it initiates an election
      • When P holds an election, it sends election messages to the other processes
          • If nobody responds, P becomes the coordinator
          • If a higher-numbered process answers, P's job is done
  • 84. Bully Algorithm
      • Performance
          • Best case scenario: the process with the second highest id notices the failure of the coordinator and elects itself.
              • N−2 coordinator messages are sent.
              • Turnaround time is one message transmission time.
          • Worst case scenario: the process with the lowest id detects the failure.
              • N−1 processes altogether begin elections, each sending messages to processes with higher ids.
              • The message overhead is O(N^2).
              • Turnaround time is approximately 5 message transmission times.
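The outcome of a bully election can be sketched with a deliberately simplified, sequential model. Assumptions: process ids are the array indices, the `alive` array stands in for timeout-based failure detection, and only the first responding higher process continues the election, whereas in the real algorithm every responder starts its own election concurrently. The elected coordinator is the same in both cases.

```java
// Sequential model of a bully election (illustrative sketch; concurrency is not modelled).
class BullySketch {
    // alive[id] tells whether process id is up. Returns the id of the elected coordinator.
    static int elect(boolean[] alive, int initiator) {
        for (int higher = initiator + 1; higher < alive.length; higher++) {
            if (alive[higher]) {
                // A higher process answered: the initiator's job is done,
                // and the higher process carries the election on.
                return elect(alive, higher);
            }
        }
        // Nobody with a higher id answered: the initiator becomes coordinator.
        return initiator;
    }

    public static void main(String[] args) {
        boolean[] alive = { true, true, true, false, true, false }; // processes 3 and 5 are down
        System.out.println(elect(alive, 1));
    }
}
```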
  • 85. Ring Algorithm
      • No token is used in this algorithm
      • When the algorithm ends, every process holds an active list (consisting of the priority numbers of all active processes in the system)
      • If process Pi detects a coordinator failure, it creates a new, initially empty active list, sends the message elect(i) to its right neighbor, and adds the number i to its active list
      • If Pi receives a message elect(j) from the process on its left, it must respond in one of three ways:
          • If this is the first elect message it has seen or sent, Pi creates a new active list with the numbers i and j, and sends the message elect(i) followed by elect(j)
          • If i ≠ j, Pi adds j to its active list and forwards the message to its right neighbor
          • If i = j, Pi has received its own message elect(i). The active list for Pi now contains the numbers of all the active processes in the system, so Pi can determine the largest number in the active list to identify the new coordinator process.
  • 86. Chang & Roberts Algorithm
      • Assumptions
          • Unidirectional ring
          • Asynchronous system
          • Each process has a UID
      • Election
          • Initially each process is a non-participant
          • Determine the leader (election message):
              • The initiator becomes a participant and passes its own UID on to its neighbour
              • When a non-participant receives an election message, it forwards the maximum of its own and the received UID and becomes a participant
              • A participant does not forward an election message that carries a UID smaller than its own
          • Announce the winner (elected message):
              • When a participant receives an election message with its own UID, it becomes the leader and a non-participant, and forwards its UID in an elected message
              • Otherwise, a process records the leader's UID, becomes a non-participant and forwards the elected message
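The election phase of Chang & Roberts can be traced with a short simulation. This sketch assumes a single initiator and unique UIDs, so the case of a participant discarding a smaller UID never arises; the class name `ChangRobertsSim` and the array model of the ring (process `k` forwards to process `(k + 1) % n`) are illustrative.

```java
// Election phase of Chang & Roberts with one initiator (illustrative sketch).
class ChangRobertsSim {
    // uids[k] is the unique id of the k-th process on the ring.
    // Returns the UID of the elected leader.
    static int elect(int[] uids, int initiator) {
        int n = uids.length;
        int carried = uids[initiator];              // the initiator passes on its own UID
        int pos = (initiator + 1) % n;
        while (carried != uids[pos]) {              // stop when a process sees its own UID again
            carried = Math.max(carried, uids[pos]); // forward the larger of the two UIDs
            pos = (pos + 1) % n;
        }
        return carried;                             // process pos is the leader
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[] {3, 8, 5, 2}, 0));
    }
}
```

In this trace the message picks up the maximum UID on its first lap and is recognized by its owner on the second, after which the elected message (not modelled here) would circulate.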
  • 87. Itai & Rodeh Algorithm
      • Assumptions
          • Unidirectional ring
          • Synchronous system
          • Processes do not have UIDs
      • Election
          • Each process selects an ID at random from the set {1, ..., K}
              • Non-unique, but fast
          • Processes pass all IDs around the ring
          • After one round, if there exists a unique ID, elect the maximum unique ID
          • Otherwise, repeat
      • How do we know the algorithm terminates?
          • From probabilities: if you keep flipping a fair coin, then with probability 1 you eventually get tails after some number of heads