Ajay D. Kshemkalyani and Mukesh Singhal, Distributed Computing: Principles, Algorithms, and Systems (2008)

Transcript

  • 1. © A. Kshemkalyani and M. Singhal, 2005. Limited circulation within a classroom, after seeking permission from the authors. 1
  • 2. DISTRIBUTED COMPUTING: PRINCIPLES, ALGORITHMS, and SYSTEMS. Ajay D. Kshemkalyani, Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607. Mukesh Singhal, Department of Computer Science, University of Kentucky, Lexington, KY 40506. January 29, 2007
  • 3. “To my father Shri Digambar and my mother Shrimati Vimala.” -Ajay D. Kshemkalyani “To my mother Chandra Prabha Singhal, my father Brij Mohan Singhal, and my daughters Meenakshi, Malvika, and Priyanka.” -Mukesh Singhal i
  • 4. Preface Background The field of Distributed Computing covers “all aspects of computing and information access across multiple processing elements connected by any form of communication networks, whether local or wide-area in the coverage”. Since the advent of the Internet in the 1970s, there has been a steady growth of new applications requiring distributed processing. This growth was enabled by advances in networking and hardware technology, the falling cost of hardware, and greater end-user awareness. These factors have made distributed computing a cost-effective, high-performance, and fault-tolerant reality. Around the turn of the millennium, there was an explosive growth in the size and efficiency of the Internet, and a matching expansion of access to networked resources through the World Wide Web, all across the world. Coupled with an equally dramatic growth in the wireless and mobile networking areas, and plummeting prices of bandwidth and storage devices, we are witnessing a rapid spurt in distributed applications and an accompanying interest in the field of distributed computing in universities, government organizations, and private institutions. Advances in hardware technology have suddenly made sensor networking a reality, and embedded and sensor networks are rapidly becoming an integral part of each person’s life – from the home network with its interconnected gadgets, to the automobile communicating by GPS (Global Positioning System), to the fully networked office with RFID monitoring. In the emerging global village, distributed computing will be the centerpiece of all computing and information access sub-disciplines within computer science. Clearly, this is a very important field. Moreover, this evolving field is characterized by a diverse range of challenges whose solutions must rest on solid foundational principles. The field of distributed computing is very important, and there is a huge demand for a good comprehensive book. This book covers all important topics in depth, providing both breadth and depth of coverage together with clarity of explanation and ease of understanding. The book will be particularly valuable to the academic community and the computer industry at large. Writing such a comprehensive book is a herculean task and a huge undertaking. There is a deep sense of satisfaction in knowing that we were able to complete this major task and perform this service to the community. Description, Approach, and Features The book focuses on the fundamental principles and models underlying all aspects of distributed computing. It addresses the principles underlying the theory, algorithms, and systems aspects of distributed computing. The presentation of the algorithms is lucid, explaining the main ideas and the intuition with figures and simple explanations rather than getting entangled in intimidating notation and lengthy, hard-to-follow rigorous proofs. The selection of chapter themes is broad and comprehensive, and the book covers all important topics in depth. The selection of algorithms within each chapter has been done carefully to elucidate new and important techniques of algorithm design. Although the book focuses on foundational aspects and algorithms for distributed computing, it thoroughly addresses all practical systems-like problems (e.g., mutual exclusion, deadlock detection, termination detection, failure ii
  • 5. recovery, authentication, global state and time, etc.) by presenting the theory behind and algorithms for such problems. The book is written keeping in mind the impact of emerging topics such as peer-to-peer computing and network security on the foundational aspects of distributed computing. Chapters include figures, examples, exercise problems, a summary, and bibliography/references. An index of terms is included. Readership This book is aimed as a textbook for • Graduate students and senior-level undergraduate students in Computer Science and Computer Engineering. • Graduate students in Electrical Engineering and Mathematics. As wireless networks, peer-to-peer networks, and mobile computing continue to grow in importance, an increasing number of students from Electrical Engineering departments will also find this book necessary. • Practitioners, systems designers/programmers, and consultants in industry and research labs, who will find the book a very useful reference because it contains state-of-the-art algorithms and principles for addressing various design issues in distributed systems, as well as the latest references. The breadth and depth of coverage, accompanied by clarity of explanation and ease of understanding, should make the book a widely adopted textbook. Hard and soft prerequisites for the use of this book include • an undergraduate course in algorithms (required); • undergraduate courses in operating systems and computer networks (useful); • a reasonable familiarity with programming. We have aimed for a comprehensive book that serves as a single source covering distributed computing models and algorithms, offering both depth and breadth of coverage with clear and easy explanations. None of the existing textbooks on distributed computing provides all of these features. Acknowledgements This book grew from the notes used in the graduate courses on Distributed Computing at the Ohio State University, the University of Illinois at Chicago, and the University of Kentucky. We would like to thank the graduate students at these schools for their contributions to the book in many ways. The book is based on the published research results of numerous researchers in the field. We have made every effort to present the material in our own language and have given credit to the original sources of the information. We would like to thank all researchers whose work is reported in this book. Finally, we would like to thank the staff of Cambridge University Press for providing us with excellent support for the publication of the book. iii
  • 6. Access to Resources The following websites will be maintained for the book. Any errors and comments should be sent to ajayk@cs.uic.edu or singhal@cs.uky.edu. Further information about the book can be obtained from the authors’ web pages. • http://www.cs.uic.edu/∼ajayk/DCS-Book • http://www.cs.uky.edu/∼singhal/DCS-Book Ajay D. Kshemkalyani Mukesh Singhal iv
  • 7. Contents 1 Introduction 1 1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Relation to Computer System Components . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Relation to Parallel Multiprocessor/Multicomputer Systems . . . . . . . . . . . . . 5 1.4.1 Characteristics of Parallel Systems . . . . . . . . . . . . . . . . . . . . . . 5 1.4.2 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4.3 Coupling, Parallelism, Concurrency, and Granularity . . . . . . . . . . . . 10 1.5 Message Passing Systems versus Shared Memory Systems . . . . . . . . . . . . . 13 1.5.1 Emulating message-passing on shared memory system (MP → SM). . . . 13 1.5.2 Emulating shared memory on a message-passing system (SM → MP). . . 13 1.6 Primitives for Distributed Communication . . . . . . . . . . . . . . . . . . . . . . 14 1.6.1 Blocking/Nonblocking, Synchronous/Asynchronous Primitives . . . . . . 14 1.6.2 Processor Synchrony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.6.3 Libraries and Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.7 Synchronous versus Asynchronous Executions . . . . . . . . . . . . . . . . . . . 18 1.7.1 Emulating an asynchronous system by a synchronous system (A → S). . . 19 1.7.2 Emulating a synchronous system by an asynchronous system (S → A). . . 19 1.7.3 Emulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.8 Design Issues and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 1.8.1 Distributed Systems Challenges from a System Perspective . . . . . . . . . 21 1.8.2 Algorithmic Challenges in Distributed Computing . . . . . . . . . . . . . 22 1.8.3 Applications of Distributed Computing and Newer Challenges . . . . . . . 28 1.9 Selection and Coverage of Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 1.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.11 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.12 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2 A Model of Distributed Computations 37 2.1 A Distributed Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2 A Model of Distributed Executions . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.3 Models of Communication Network . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.4 Global State of a Distributed System . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.4.1 Global State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.5 Cuts of a Distributed Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 42 v
  • 8. 2.6 Past and Future Cones of an Event . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.7 Models of Process Communications . . . . . . . . . . . . . . . . . . . . . . . . . 44 3 Logical Time 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 A Framework for a System of Logical Clocks . . . . . . . . . . . . . . . . . . . . 48 3.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Implementing Logical Clocks . . . . . . . . . . . . . . . . . . . . . . . . 49 3.3 Scalar Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.3.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Vector Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.3 On the Size of Vector Clocks . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.5 Efficient Implementations of Vector Clocks . . . . . . . . . . . . . . . . . . . . . 56 3.5.1 Singhal-Kshemkalyani’s Differential Technique . . . . . . . . . . . . . . . 56 3.5.2 Fowler-Zwaenepoel’s Direct-Dependency Technique . . . . . . . . . . . . 58 3.6 Jard-Jourdan’s Adaptive Technique . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.7 Matrix Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.7.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.7.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.8 Virtual Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.8.1 Virtual Time Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.8.2 Comparison with Lamport’s Logical Clocks . . . . . . . . . . . . . . . . . 66 3.8.3 Time Warp Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.8.4 The Local Control Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 67 3.8.5 Global Control Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.8.6 An Example: Distributed Discrete Event Simulations . . . . . . . . . . . . 71 3.9 Physical Clock Synchronization: NTP . . . . . . . . . . . . . . . . . . . . . . . . 72 3.9.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.9.2 Definitions and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.9.3 Clock Inaccuracies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.11 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.12 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4 Global State and Snapshot Recording Algorithms 82 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.2 System Model and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2.2 A Consistent Global State . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.2.3 Interpretation in Terms of Cuts . . . . . . . . . . . . . . . 
. . . . . . . . . 86 4.2.4 Issues in Recording a Global State . . . . . . . . . . . . . . . . . . . . . . 86 4.3 Snapshot Algorithms for FIFO Channels . . . . . . . . . . . . . . . . . . . . . . . 87 vi
  • 9. 4.3.1 Chandy-Lamport Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.3.2 Properties of the Recorded Global State . . . . . . . . . . . . . . . . . . . 89 4.4 Variations of the Chandy-Lamport Algorithm . . . . . . . . . . . . . . . . . . . . 91 4.4.1 Spezialetti-Kearns Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.2 Venkatesan’s Incremental Snapshot Algorithm . . . . . . . . . . . . . . . 92 4.4.3 Helary’s Wave Synchronization Method . . . . . . . . . . . . . . . . . . . 93 4.5 Snapshot Algorithms for Non-FIFO Channels . . . . . . . . . . . . . . . . . . . . 94 4.5.1 Lai-Yang Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.2 Li et al.’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.5.3 Mattern’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.6 Snapshots in a Causal Delivery System . . . . . . . . . . . . . . . . . . . . . . . . 98 4.6.1 Process State Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.6.2 Channel State Recording in Acharya-Badrinath Algorithm . . . . . . . . . 99 4.6.3 Channel State Recording in Alagar-Venkatesan Algorithm . . . . . . . . . 100 4.7 Monitoring Global State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.8 Necessary and Sufficient Conditions for Consistent Global Snapshots . . . . . . . 102 4.8.1 Zigzag Paths and Consistent Global Snapshots . . . . . . . . . . . . . . . 103 4.9 Finding Consistent Global Snapshots in a Distributed Computation . . . . . . . . . 106 4.9.1 Finding Consistent Global Snapshots . . . . . . . . . . . . . . . . . . . . 106 4.9.2 Manivannan-Netzer-Singhal Algorithm for Enumerating Consistent Snap- shots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.9.3 Finding Z-paths in a Distributed Computation . . . . . . . . . . . . . . . . 110 4.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.11 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.12 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5 Terminology and Basic Algorithms 118 5.1 Topology Abstraction and Overlays . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.2 Classifications and Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.2.1 Application Executions and Control Algorithm Executions . . . . . . . . . 120 5.2.2 Centralized and Distributed Algorithms . . . . . . . . . . . . . . . . . . . 120 5.2.3 Symmetric and Asymmetric Algorithms . . . . . . . . . . . . . . . . . . . 121 5.2.4 Anonymous Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.2.5 Uniform Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.2.6 Adaptive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.2.7 Deterministic Versus Nondeterministic Executions . . . . . . . . . . . . . 122 5.2.8 Execution Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.2.9 Synchronous and Asynchronous Systems . . . . . . . . . . . . . . . . . . 124 5.2.10 Online versus Offline Algorithms . . . . . . . . . . . . . . . . . . . . . . 124 5.2.11 Failure Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.2.12 Wait-free algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.2.13 Communication Channels . . . . . . . . . . . . . . . . . . . . . . . . . . 
126 5.3 Complexity Measures and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.4 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.5 Elementary Graph Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 vii
  • 10. 5.5.1 Synchronous Single-Initiator Spanning Tree Algorithm Using Flooding: Algorithm 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.5.2 Asynchronous Single-Initiator Spanning Tree Algorithm Using Flooding: Algorithm I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.5.3 Asynchronous Concurrent-Initiator Spanning Tree Algorithm Using Flood- ing: Algorithm II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.5.4 Asynchronous Concurrent-Initiator Depth First Search Spanning Tree Al- gorithm: Algorithm III . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 5.5.5 Broadcast and Convergecast on a Tree . . . . . . . . . . . . . . . . . . . . 137 5.5.6 Single Source Shortest Path Algorithm: Synchronous Bellman-Ford . . . . 140 5.5.7 Distance Vector Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.5.8 Single Source Shortest Path Algorithm: Asynchronous Bellman-Ford . . . 141 5.5.9 All Sources Shortest Paths: Asynchronous Distributed Floyd-Warshall . . . 142 5.5.10 Asynchronous and Synchronous Constrained Flooding (w/o a Spanning Tree) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.5.11 Minimum Weight Spanning Tree (MST) Algorithm in a Synchronous System146 5.5.12 Minimum Weight Spanning Tree (MST) in an Asynchronous System . . . 152 5.6 Synchronizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 5.7 Maximal Independent Set (MIS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 5.8 Connected Dominating Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 5.9 Compact Routing Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.10 Leader Election . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.11 Challenges in Designing Distributed Graph Algorithms . . . . . . . . . . . . . . . 164 5.12 Object Replication Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.12.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.12.2 Algorithm Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 5.12.3 Reads and Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 5.12.4 Converging to an Replication Scheme . . . . . . . . . . . . . . . . . . . . 167 5.13 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 5.14 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 5.15 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6 Message Ordering and Group Communication 179 6.1 Message Ordering Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.1.1 Asynchronous Executions . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.1.2 FIFO Executions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.1.3 Causally Ordered (CO) Executions . . . . . . . . . . . . . . . . . . . . . 181 6.1.4 Synchronous Execution (SYNC) . . . . . . . . . . . . . . . . . . . . . . . 184 6.2 Asynchronous Execution with Synchronous Communication . . . . . . . . . . . . 184 6.2.1 Executions Realizable with Synchronous Communication (RSC) . . . . . . 185 6.2.2 Hierarchy of Ordering Paradigms . . . . . . . . . . . . . . . . . . . . . . 188 6.2.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
189 6.3 Synchronous Program Order on an Asynchronous System . . . . . . . . . . . . . . 190 6.3.1 Rendezvous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 6.3.2 Algorithm for Binary Rendezvous . . . . . . . . . . . . . . . . . . . . . . 192 viii
  • 11. 6.4 Group Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 6.5 Causal Order (CO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 6.5.1 The Raynal-Schiper-Toueg Algorithm . . . . . . . . . . . . . . . . . . . . 197 6.5.2 The Kshemkalyani-Singhal Optimal Algorithm . . . . . . . . . . . . . . . 198 6.6 Total Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 6.6.1 Centralized Algorithm for Total Order . . . . . . . . . . . . . . . . . . . . 205 6.6.2 Three-Phase Distributed Algorithm . . . . . . . . . . . . . . . . . . . . . 206 6.7 A Nomenclature For Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 6.8 Propagation Trees For Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.9 Classification of Application-Level Multicast Algorithms . . . . . . . . . . . . . . 214 6.10 Semantics of Fault-Tolerant Group Communication . . . . . . . . . . . . . . . . . 216 6.11 Distributed Multicast Algorithms At The Network Layer . . . . . . . . . . . . . . 218 6.11.1 Reverse Path Forwarding (RPF) For Constrained Flooding . . . . . . . . . 219 6.11.2 Steiner Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 6.11.3 Multicast Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 6.11.4 Delay-Bounded Steiner Trees . . . . . . . . . . . . . . . . . . . . . . . . 221 6.11.5 Core-Based Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 6.12 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 6.13 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 6.14 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 7 Termination Detection 230 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.2 System Model of a Distributed Computation . . . . . . . . . . . . . . . . . . . . . 231 7.3 Termination Detection Using Distributed Snapshots . . . . . . . . . . . . . . . . . 232 7.3.1 Informal Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 7.3.2 Formal Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 7.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 7.4 Termination Detection by Weight Throwing . . . . . . . . . . . . . . . . . . . . . 234 7.4.1 Formal Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 7.4.2 Correctness of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 235 7.5 A Spanning-Tree-Based Termination Detection Algorithm . . . . . . . . . . . . . 236 7.5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 7.5.2 A Simple Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 7.5.3 The Correct Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 7.5.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 7.5.5 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 7.6 Message-Optimal Termination Detection . . . . . . . . . . . . . . . . . . . . . . . 242 7.6.1 The Main Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 7.6.2 Formal Description of the Algorithm . . . . . . . . . . . . . . . . . . . . 243 7.6.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
245 7.7 Termination Detection in a Very General Distributed Computing Model . . . . . . 245 7.7.1 Model Definition and Assumptions . . . . . . . . . . . . . . . . . . . . . 246 7.7.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 7.7.3 Termination Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 ix
  • 12. 7.7.4 A Static Termination Detection Algorithm . . . . . . . . . . . . . . . . . . 247 7.7.5 A Dynamic Termination Detection Algorithm . . . . . . . . . . . . . . . . 249 7.8 Termination Detection in the Atomic Computation Model . . . . . . . . . . . . . . 251 7.8.1 The Atomic Model of Execution . . . . . . . . . . . . . . . . . . . . . . . 252 7.8.2 A Naive Counting Method . . . . . . . . . . . . . . . . . . . . . . . . . . 252 7.8.3 The Four Counter Method . . . . . . . . . . . . . . . . . . . . . . . . . . 253 7.8.4 The Sceptic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 7.8.5 The Time Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 7.8.6 Vector Counters Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 7.8.7 A Channel Counting Method . . . . . . . . . . . . . . . . . . . . . . . . . 258 7.9 Termination Detection in a Faulty Distributed System . . . . . . . . . . . . . . . . 260 7.9.1 Flow Detecting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 7.9.2 Taking Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 7.9.3 Description of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 263 7.9.4 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266 7.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 7.11 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 8 Reasoning with Knowledge 271 8.1 The Muddy Children Puzzle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 8.2 Logic of Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 8.2.1 Knowledge Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 8.2.2 The Muddy Children Puzzle Again . . . . . . . . . . . . . . . . . . . . . 273 8.2.3 Kripke Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 8.2.4 Muddy Children Puzzle using Kripke Structures . . . . . . . . . . . . . . 275 8.2.5 Properties of Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 8.3 Knowledge in Synchronous Systems . . . . . . . . . . . . . . . . . . . . . . . . . 277 8.4 Knowledge in Asynchronous Systems . . . . . . . . . . . . . . . . . . . . . . . . 278 8.4.1 Logic and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278 8.4.2 Agreement in Asynchronous Systems . . . . . . . . . . . . . . . . . . . . 279 8.4.3 Variants of Common Knowledge . . . . . . . . . . . . . . . . . . . . . . . 280 8.4.4 Concurrent Common Knowledge . . . . . . . . . . . . . . . . . . . . . . 281 8.5 Knowledge Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 8.6 Knowledge and Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 8.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288 8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 8.9 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 9 Distributed Mutual Exclusion Algorithms 293 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 9.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 9.2.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 9.2.2 Requirements of Mutual Exclusion Algorithms . . . . . . . . . . . . . 
. . 294 9.2.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 9.3 Lamport’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 x
  • 13. 9.4 Ricart-Agrawala Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300 9.4.1 Description of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 300 9.5 Singhal’s Dynamic Information-Structure Algorithm . . . . . . . . . . . . . . . . 303 9.5.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 9.5.2 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 9.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 9.5.4 Adaptivity in Heterogeneous Traffic Patterns . . . . . . . . . . . . . . . . 309 9.6 Lodha and Kshemkalyani’s Fair Mutual Exclusion Algorithm . . . . . . . . . . . . 309 9.6.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 9.6.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 9.6.3 Safety, Fairness and Liveness . . . . . . . . . . . . . . . . . . . . . . . . 313 9.6.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 9.6.5 Message Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 9.7 Quorum-Based Mutual Exclusion Algorithms . . . . . . . . . . . . . . . . . . . . 316 9.8 Maekawa’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 9.8.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 9.8.2 Problem of Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 9.9 Agarwal-El Abbadi Quorum-Based Algorithm . . . . . . . . . . . . . . . . . . . . 320 9.9.1 Constructing a tree-structured quorum . . . . . . . . . . . . . . . . . . . . 320 9.9.2 Analysis of the algorithm for constructing tree-structured quorums . . . . . 321 9.9.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321 9.9.4 Examples of Tree-Structured Quorums . . . . . . . . . . . . . . . . . . . 322 9.9.5 The Algorithm for Distributed Mutual Exclusion . . . . . . . . . . . . . . 323 9.9.6 Correctness proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 9.9.7 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 9.10 Token-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 9.11 Suzuki-Kasami’s Broadcast Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 325 9.12 Raymond’s Tree-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 327 9.12.1 The HOLDER Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 328 9.12.2 The Operation of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 329 9.12.3 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333 9.12.4 Cost and Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . 335 9.12.5 Algorithm Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 9.12.6 Node Failures and Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 336 9.13 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 9.14 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 10 Deadlock Detection in Distributed Systems 342 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 10.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 10.2.1 Wait-For-Graph (WFG) . . . . . . . . . . . . . . . . . . . . . . . . . . . 
343 10.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 10.3.1 Deadlock Handling Strategies . . . . . . . . . . . . . . . . . . . . . . . . 343 10.3.2 Issues in Deadlock Detection . . . . . . . . . . . . . . . . . . . . . . . . . 344 10.4 Models of Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 xi
  • 14. 10.4.1 The Single Resource Model . . . . . . . . . . . . . . . . . . . . . . . . . 345 10.4.2 The AND Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346 10.4.3 The OR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346 10.4.4 The AND-OR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 10.4.5 The p q Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 10.4.6 Unrestricted Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 10.5 Knapp’s Classification of Distributed Deadlock Detection Algorithms . . . . . . . 348 10.5.1 Path-Pushing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 10.5.2 Edge-Chasing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 348 10.5.3 Diffusing Computations Based Algorithms . . . . . . . . . . . . . . . . . 348 10.5.4 Global State Detection Based Algorithms . . . . . . . . . . . . . . . . . . 349 10.6 Mitchell and Merritt’s Algorithm for the Single-Resource Model . . . . . . . . . . 349 10.7 Chandy-Misra-Haas Algorithm for the AND Model . . . . . . . . . . . . . . . . . 352 10.8 Chandy-Misra-Haas Algorithm for the OR Model . . . . . . . . . . . . . . . . . . 353 10.9 Kshemkalyani-Singhal Algorithm for P-out-of-Q Model . . . . . . . . . . . . . . 355 10.9.1 Informal Description of the Algorithm . . . . . . . . . . . . . . . . . . . . 357 10.9.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 10.9.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 10.10Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 10.11Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 10.12Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 11 Global Predicate Detection 370 11.1 Stable and Unstable Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370 11.1.1 Stable Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 11.1.2 Unstable Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372 11.2 Modalities on Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 11.2.1 Complexity of Predicate Detection . . . . . . . . . . . . . . . . . . . . . . 374 11.3 Centralized Algorithm for Relational Predicates . . . . . . . . . . . . . . . . . . . 374 11.4 Conjunctive Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 11.4.1 Interval-based Centralized Algorithm for Conjunctive Predicates . . . . . . 379 11.4.2 Global State based Centralized Algorithm for Possibly(φ), where φ is conjunctive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 11.5 Distributed Algorithms for Conjunctive Predicates . . . . . . . . . . . . . . . . . . 384 11.5.1 Distributed State-based Token Algorithm for Possibly(φ), where φ is Con- junctive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 11.5.2 Distributed Interval-based Token Algorithm for Definitely(φ), where φ is Conjunctive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386 11.5.3 Distributed Interval-based Piggybacking Algorithm for Possibly(φ), where φ is Conjunctive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 11.6 Further Classification of Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . 393 11.7 Chapter Summary . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . 394 11.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 11.9 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 xii
  • 15. 12 Distributed Shared Memory 399 12.1 Abstraction and Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 12.2 Memory Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 12.2.1 Strict consistency/Atomic consistency/Linearizability . . . . . . . . . . . . 403 12.2.2 Sequential Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 12.2.3 Causal Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409 12.2.4 PRAM (Pipelined RAM) or Processor Consistency . . . . . . . . . . . . . 411 12.2.5 Slow Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 12.2.6 Hierarchy of Consistency Models . . . . . . . . . . . . . . . . . . . . . . 413 12.2.7 Other Models based on Synchronization Instructions . . . . . . . . . . . . 414 12.3 Shared Memory Mutual Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 416 12.3.1 Lamport’s Bakery Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 416 12.3.2 Lamport’s WRWR Mechanism and Fast Mutual Exclusion . . . . . . . . . 418 12.3.3 Hardware support for mutual exclusion . . . . . . . . . . . . . . . . . . . 421 12.4 Wait-freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 12.5 Register Hierarchy and Wait-free Simulations . . . . . . . . . . . . . . . . . . . . 423 12.5.1 Construction 1: SRSW Safe to MRSW Safe . . . . . . . . . . . . . . . . . 426 12.5.2 Construction 2: SRSW Regular to MRSW Regular . . . . . . . . . . . . . 427 12.5.3 Construction 3: Boolean MRSW Safe to integer-valued MRSW Safe . . . . 427 12.5.4 Construction 4: Boolean MRSW Safe to boolean MRSW Regular . . . . . 427 12.5.5 Construction 5: Boolean MRSW Regular to integer-valued MRSW Regular 428 12.5.6 Construction 6: Boolean MRSW Regular to integer-valued MRSW Atomic 429 12.5.7 Construction 7: Integer MRSW Atomic to integer MRMW Atomic . . . . 432 12.5.8 Construction 8: Integer SRSW Atomic to integer MRSW Atomic . . . . . 433 12.6 Wait-free Atomic Snapshots of Shared Objects . . . . . . . . . . . . . . . . . . . 435 12.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 12.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440 12.9 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440 13 Checkpointing and Rollback Recovery 445 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445 13.2 Background and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 13.2.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 13.2.2 A Local Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 13.2.3 Consistent System States . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 13.2.4 Interactions with the Outside World . . . . . . . . . . . . . . . . . . . . . 448 13.2.5 Different Types of Messages . . . . . . . . . . . . . . . . . . . . . . . . . 449 13.3 Issues in Failure Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 13.4 Checkpoint Based Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 13.4.1 Uncoordinated Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . 452 13.4.2 Coordinated Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . 454 13.4.3 Impossibility of Min Process Non-blocking Checkpointing . . . . . . . . . 
456 13.4.4 Communication-Induced Checkpointing . . . . . . . . . . . . . . . . . . . 456 13.5 Log-based Rollback Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 13.5.1 Deterministic and Nondeterministic Events . . . . . . . . . . . . . . . . . 458 xiii
  • 16. 13.5.2 Pessimistic Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 13.5.3 Optimistic Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461 13.5.4 Causal Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 13.6 Koo-Toueg Coordinated Checkpointing Algorithm . . . . . . . . . . . . . . . . . 463 13.6.1 The Checkpointing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 463 13.6.2 The Rollback Recovery Algorithm . . . . . . . . . . . . . . . . . . . . . . 465 13.7 Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery . 466 13.7.1 System Model and Assumptions . . . . . . . . . . . . . . . . . . . . . . . 466 13.7.2 Asynchronous Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . 467 13.7.3 The Recovery Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 467 13.7.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469 13.8 Manivannan-Singhal Quasi-Synchronous Checkpointing Algorithm . . . . . . . . 470 13.8.1 Checkpointing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 471 13.8.2 Recovery Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 13.8.3 Comprehensive Message Handling . . . . . . . . . . . . . . . . . . . . . . 476 13.9 Peterson-Kearns Algorithm Based on Vector Time . . . . . . . . . . . . . . . . . . 479 13.9.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479 13.9.2 Informal Description of the Algorithm . . . . . . . . . . . . . . . . . . . . 480 13.9.3 Formal Description of the RollBack Protocol . . . . . . . . . . . . . . . . 482 13.9.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483 13.9.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484 13.10Helary-Mostefaoui-Netzer-Raynal Communication-induced Protocol . . . . . . . . 486 13.10.1 Design Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487 13.10.2 The Checkpointing Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 491 13.11Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494 13.12Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 14 Consensus and Agreement Algorithms 500 14.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500 14.1.1 The Byzantine Agreement and Other Problems . . . . . . . . . . . . . . . 502 14.1.2 Equivalence of the Problems and Notations . . . . . . . . . . . . . . . . . 503 14.2 Overview of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 14.3 Agreement in a Failure-Free System (Synchronous or Asynchronous) . . . . . . . 505 14.4 Agreement in (Message-Passing) Synchronous Systems with Failures . . . . . . . 506 14.4.1 Consensus Algorithm for Crash Failures (Synchronous System) . . . . . . 506 14.4.2 Consensus Algorithms for Byzantine Failures (Synchronous System) . . . 507 14.4.3 Upper Bound on Byzantine Processes . . . . . . . . . . . . . . . . . . . . 507 14.4.4 Byzantine Agreement Tree Algorithm: Exponential (Synchronous System) 509 14.5 Agreement in Asynchronous Message-Passing Systems with Failures . . . . . . . . 518 14.5.1 Impossibility Result for the Consensus Problem . . . . . . . . . . . . . . . 518 14.5.2 Terminating Reliable Broadcast . . . . . . . . . . . . . . . . . . . . . . . 520 14.5.3 Distributed Transaction Commit . . . . . . . . . . . 
. . . . . . . . . . . . 521 14.5.4 k-set consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 14.5.5 Approximate Agreement . . . . . . . . . . . . . . . . . . . . . . . . . . . 522 14.5.6 Renaming Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527 xiv
  • 17. 14.5.7 Reliable Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532 14.6 Wait-free Shared Memory Consensus in Asynchronous Systems . . . . . . . . . . 533 14.6.1 Impossibility Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533 14.6.2 Consensus Numbers and Consensus Hierarchy . . . . . . . . . . . . . . . 536 14.6.3 Universality of Consensus Objects . . . . . . . . . . . . . . . . . . . . . . 540 14.6.4 Shared Memory k-set Consensus . . . . . . . . . . . . . . . . . . . . . . . 545 14.6.5 Shared Memory Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . 545 14.6.6 Shared Memory Renaming using Splitters . . . . . . . . . . . . . . . . . . 547 14.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 14.8 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550 14.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552 15 Failure Detectors 555 15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 15.2 Unreliable Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556 15.2.1 The System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556 15.2.2 Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 15.2.3 Completeness and Accuracy Properties . . . . . . . . . . . . . . . . . . . 557 15.2.4 Types of Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . 560 15.2.5 Reducibility of Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . 560 15.2.6 Reducing Weak Failure Detector W to a Strong Failure Detector S . . . . . 561 15.2.7 Reducing an Eventually Weak Failure Detector ♦W to an Eventually Strong Failure Detector ♦S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563 15.3 The Consensus Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565 15.3.1 Solutions to the Consensus Problem . . . . . . . . . . . . . . . . . . . . . 566 15.3.2 A Solution Using Strong Failure Detector S . . . . . . . . . . . . . . . . . 566 15.3.3 A Solution Using Eventually Strong Failure Detector ♦S . . . . . . . . . . 568 15.4 Atomic Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570 15.5 A Solution to Atomic Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . 571 15.6 The Weakest Failure Detectors to Solve Fundamental Agreement Problems . . . . 573 15.6.1 Realistic Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . 574 15.6.2 The weakest failure detector for consensus . . . . . . . . . . . . . . . . . 575 15.6.3 The Weakest Failure Detector for Terminating Reliable Broadcast . . . . . 576 15.7 An Implementation of a Failure Detector . . . . . . . . . . . . . . . . . . . . . . . 576 15.8 An Adaptive Failure Detection Protocol . . . . . . . . . . . . . . . . . . . . . . . 578 15.8.1 Lazy Failure Detection Protocol (FDL) . . . . . . . . . . . . . . . . . . . 579 15.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 15.10Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 16 Authentication in Distributed System 586 16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586 16.2 Background and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586 16.2.1 Basis of Authentication . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 587 16.2.2 Types of Principals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587 16.2.3 A Simple Classification of Authentication Protocols . . . . . . . . . . . . 588 xv
  • 18. 16.2.4 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 16.2.5 Design Principles for Cryptographic Protocols . . . . . . . . . . . . . . . 589 16.3 Protocols Based on Symmetric Cryptosystems . . . . . . . . . . . . . . . . . . . . 590 16.3.1 Basic Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 16.3.2 Modified Protocol with Nonce . . . . . . . . . . . . . . . . . . . . . . . . 591 16.3.3 Wide-Mouth Frog Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 592 16.3.4 A Protocol Based On an Authentication Server . . . . . . . . . . . . . . . 593 16.3.5 One-Time Password Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 594 16.3.6 Otway-Rees Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 16.3.7 Kerberos Authentication Service . . . . . . . . . . . . . . . . . . . . . . . 597 16.4 Protocols Based on Asymmetric Cryptosystems . . . . . . . . . . . . . . . . . . . 602 16.4.1 The Basic Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 16.4.2 A Modified Protocol with a Certification Authority . . . . . . . . . . . . . 602 16.4.3 Needham and Schroeder Protocol . . . . . . . . . . . . . . . . . . . . . . 603 16.4.4 SSL Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 16.5 Password-based Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 16.5.1 Encrypted Key Exchange (EKE) Protocol . . . . . . . . . . . . . . . . . . 609 16.5.2 Secure Remote Password (SRP) Protocol . . . . . . . . . . . . . . . . . . 610 16.6 Authentication Protocol Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . 611 16.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 16.8 Exercise Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 17 Self-Stabilization 619 17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 17.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620 17.3 Definition of Self-Stabilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 17.4 Issues in the design of self-stabilization algorithms . . . . . . . . . . . . . . . . . 624 17.4.1 The Number of States in Each of the Individual Units . . . . . . . . . . . . 625 17.4.2 Uniform Vs. Non-uniform Networks . . . . . . . . . . . . . . . . . . . . 631 17.4.3 Central and Distributed Demons . . . . . . . . . . . . . . . . . . . . . . . 632 17.4.4 Reducing the number of states in a token ring . . . . . . . . . . . . . . . . 633 17.4.5 Shared memory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 633 17.4.6 Mutual Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 17.4.7 Costs of self-stabilization . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 17.5 Methodologies for designing self-stabilizing systems . . . . . . . . . . . . . . . . 635 17.6 Communication Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637 17.7 Self-Stabilizing Distributed Spanning Trees . . . . . . . . . . . . . . . . . . . . . 638 17.8 Self-Stabilizing Algorithms for Spanning-tree Construction . . . . . . . . . . . . . 640 17.8.1 Dolev, Israeli, and Moran Algorithm . . . . . . . . . . . . . . . . . . . . . 640 17.8.2 Afek, Kutten, and Yung Algorithm for Spanning-tree Construction . . . . . 642 17.8.3 Arora and Gouda Algorithm for Spanning-tree Construction . . . . . . . . 
643 17.8.4 Huang et al. Algorithms for Spanning-tree Construction . . . . . . . . . . 643 17.8.5 Afek and Bremler Algorithm for Spanning-tree Construction . . . . . . . . 644 17.9 An anonymous self-stabilizing algorithm for 1-maximal independent set in trees . . 645 17.10A Probabilistic Self-Stabilizing Leader Election Algorithm . . . . . . . . . . . . . 646 xvi
• 19. 17.11 The role of compilers in self-stabilization 649
17.11.1 Compilers for sequential programs 649
17.11.2 Compilers for asynchronous message passing systems 650
17.11.3 Compilers for asynchronous shared memory systems 651
17.12 Self-stabilization as a Solution to Fault Tolerance 652
17.13 Factors Preventing Self-Stabilization 654
17.14 Limitations of Self-Stabilization 655
17.15 Chapter Summary 657
17.16 Bibliographic Notes 657
17.17 Exercise Problems 658
18 Peer-to-Peer Computing and Overlay Graphs 666
18.1 Introduction 666
18.1.1 Napster 667
18.1.2 Application Layer Overlays 667
18.2 Data Indexing and Overlays 668
18.2.1 Distributed Indexing 669
18.3 Unstructured Overlays 670
18.3.1 Unstructured Overlays: Properties 670
18.3.2 Gnutella 670
18.3.3 Search in Gnutella and Unstructured Overlays 671
18.3.4 Replication Strategies 673
18.3.5 Implementing Replication Strategies 675
18.4 Chord Distributed Hash Table 676
18.4.1 Overview 676
18.4.2 Simple lookup 677
18.4.3 Scalable Lookup 678
18.4.4 Managing Churn 678
18.4.5 Complexity 682
18.5 Content Addressable Networks: CAN 683
18.5.1 Overview 683
18.5.2 CAN Initialization 684
18.5.3 CAN Routing 685
18.5.4 CAN Maintenance 686
18.5.5 CAN Optimizations 688
18.5.6 CAN Complexity 689
18.6 Tapestry 689
18.6.1 Overview 689
18.6.2 Overlay and Routing 689
18.6.3 Object Publication and Object Search 692
18.6.4 Node Insertion 693
18.6.5 Node Deletion 694
18.7 Some Other Challenges in P2P System Design 695
18.7.1 Fairness: A Game Theory Application 695
18.7.2 Trust or Reputation Management 696
• 20. 18.8 Tradeoffs between Table Storage and Route Lengths 696
18.8.1 Unifying DHT Protocols 696
18.8.2 Bounds on DHT Storage and Routing Distance 697
18.9 Graph Structures of Complex Networks 699
18.10 Internet graphs 701
18.10.1 Basic Laws and their Definitions 701
18.10.2 Properties of the Internet 702
18.10.3 Error and Attack Tolerance of Complex Networks 704
18.11 Random Graphs 707
18.11.1 Graph Model 707
18.11.2 Graph Degree Distribution 708
18.11.3 Graph Diameter 708
18.11.4 Graph Clustering Coefficient 708
18.11.5 Generalized Random Graph Networks 709
18.12 Small-world Networks 709
18.13 Scale-free Networks 710
18.13.1 Master-equation approach 710
18.13.2 Rate-equation approach 711
18.14 Evolving Networks 712
18.14.1 Extended Barabasi-Albert Model 713
18.15 Chapter Summary 714
18.16 Exercise Problems 716
18.17 Bibliographic Notes 716
  • 21. Chapter 1 Introduction 1.1 Definition A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. Distributed systems have been in existence since the start of the universe. From a school of fish to a flock of birds and entire ecosystems of microorganisms, there is communication among mobile intelligent agents in nature. With the widespread proliferation of the internet and the emerging global village, the notion of distributed computing systems as a useful and widely deployed tool is becoming a reality. For computing systems, a distributed system has been characterized in one of several ways. • You know you are using one when the crash of a computer you have never heard of prevents you from doing work. (Lamport) • A collection of computers that do not share common memory or a common physical clock, and that communicate by message passing over a communication network; and each com- puter has its own memory and runs its own operating system. Typically the computers are semi-autonomous and are loosely coupled while they cooperate to address a problem collec- tively. (Singhal-Shivaratri [10]) • A collection of independent computers that appears to the users of the system as a single coherent computer. (Tanenbaum [11]) • A term that describes a wide range of computers, from weakly coupled systems such as wide-area networks to strongly coupled systems such as local area networks to very strongly- coupled systems such as multiprocessor systems. (Goscinski [6]) A distributed system can be characterized as a collection of mostly autonomous processors communicating over a communication network and having the following features. • No common physical clock. This is an important assumption because it introduces the ele- ment of “distribution” in the system and gives rise to the inherent asynchrony amongst the processors. 1
• 22. Figure 1.1: A distributed system that connects processors (P) and memory banks (M) by a communication network (WAN/LAN).
• No shared memory. This is a key feature that makes message passing necessary for communication. This feature implies the absence of a common physical clock. It may be noted that a distributed system may still provide the abstraction of a common address space via the distributed shared memory abstraction. Several aspects of shared memory multiprocessor systems have also been studied in the distributed computing literature.
• Geographical separation. The more widely separated the processors are geographically, the more representative the system is of a distributed system. However, it is not necessary for the processors to be on a wide-area network (WAN). Recently, the Network/Cluster of Workstations (NOW/COW) configuration connecting processors on a LAN has also increasingly been regarded as a small distributed system. This NOW configuration is becoming popular because of the low-cost, high-speed off-the-shelf processors now available. The Google search engine is based on the NOW architecture.
• Autonomy and heterogeneity. The processors are “loosely coupled” in that they have different speeds and each can be running a different operating system. They are usually not part of a dedicated system, but cooperate with one another by offering services or solving a problem jointly.
1.2 Relation to Computer System Components
A typical distributed system is shown in Figure 1.1. Each computer has a memory-processing unit and the computers are connected by a communication network. Figure 1.2 shows the relationships of the software components that run on each of the computers and use the local operating system and network protocol stack for their functioning. The distributed software is also termed middleware. A distributed execution is the execution of processes across the distributed system to collaboratively achieve a common goal. An execution is also sometimes termed a computation or a run. The distributed system uses a layered architecture to break down the complexity of system design. The middleware is the distributed software that drives the distributed system, while pro- 2
  • 23. protocols distributed Extent ofDistributed application Network layer (middleware libraries) Application layer Data link layer Transport layer Networkprotocolstack Distributed software system Operating Figure 1.2: Interaction of the software components at each processor. viding transparency of heterogeneity at the platform level. Figure 1.2 schematically shows the interaction of this software with these system components at each processor. Here we assume that the middleware layer does not contain the traditional application layer functions of the network protocol stack, such as http, mail, ftp, and telnet. Various primitives and calls to functions defined in various libraries of the middleware layer are embedded in the user program code. There exist several libraries to choose from to invoke primitives for the more common functions – such as reli- able and ordered multicasting, of the middleware layer. There are several standards such as Object Management Group’s (OMG) Common Object Request Broker Architecture (CORBA), and the Remote Procedure Call (RPC) mechanism. The RPC mechanism conceptually works like a local procedure call, with the difference that the procedure code may reside on a remote machine, and the RPC software sends a message across the network to invoke the remote procedure. It then awaits a reply, after which the procedure call completes from the perspective of the program that invoked it. Currently deployed commercial versions of middleware often use CORBA, DCOM (Distributed Component Object Model), Java, and RMI (Remote Method Invocation) technolo- gies. The Message-Passing Interface (MPI) developed in the research community is an example of an interface for various communication functions. 1.3 Motivation The motivation for using a distributed system is some or all of the following requirements. 1. Inherently distributed computations. In many applications such as money transfer in bank- ing, or reaching consensus among parties that are geographically distant, the computation is inherently distributed. 2. Resource sharing. Resources such as peripherals, complete data sets in databases, special libraries, as well as data (variable/files) cannot be fully replicated at all the sites because it is often neither practical nor cost-effective. Further, they cannot be placed at a single site because access to that site might prove to be a bottleneck. Therefore, such resources 3
  • 24. are typically distributed across the system. For example, distributed databases such as DB2 partition the data sets across several servers, in addition to replicating them at a few sites for rapid access as well as reliability. 3. Access to geographically remote data and resources. In many scenarios, the data cannot be replicated at every site participating in the distributed execution because it may be too large or too sensitive to be replicated. For example, payroll data within a multinational corporation is both too large and too sensitive to be replicated at every branch office/site. It is therefore stored at a central server which can be queried by branch offices. Similarly, special resources such as supercomputers exist only in certain locations, and to access such supercomputers, users need to log in remotely. Advances in the design of resource-constrained mobile devices as well as in wireless tech- nology using which these devices communicate have given further impetus to the importance of distributed protocols and middleware. 4. Enhanced reliability. A distributed system has the inherent potential to provide increased reliability because of the possibility of replicating resources and executions, as well as the reality that geographically distributed resources are not likely to crash/malfunction at the same time under normal circumstances. Reliability entails several aspects. • Availability, i.e., the resource should be accessible at all times. • Integrity, i.e., the value/state of the resource should be correct, in the face of concurrent access from multiple processors, as per the semantics expected by the application. • Fault-tolerance, i.e., the ability to recover from system failures, where such failures may be defined to occur in one of many failure models, which we will study in Chap- ter 14. 5. Increased Performance/Cost ratio. By resource sharing and accessing geographically remote data and resources, the Performance/Cost ratio is increased. Although higher throughput has not necessarily been the main objective behind using a distributed system, nevertheless, a task can be partitioned across the various computers in the distributed system. Such a configuration provides a better Performance/Cost ratio than using special parallel machines. This is particularly true of the NOW configuration. In addition to meeting the above requirements, a distributed system also offers the following ad- vantages. 6. Scalability. As the processors are usually connected by a wide-area network, adding more processors does not pose a direct bottleneck for the communication network. 7. Modularity and incremental expandability. Heterogeneous processors may be easily added into the system without affecting the performance, as long as those processors are running the same middleware algorithms. Similarly, existing processors may be easily replaced by other processors. 4
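To make the flavour of the RPC mechanism described in Section 1.2 concrete, the following is a minimal illustrative sketch using Python's standard xmlrpc modules. The procedure name transfer, the host, and the port are assumptions made only for this example; they are not taken from any particular middleware product discussed above.

# Minimal RPC sketch (the procedure name, host, and port are illustrative assumptions).
from xmlrpc.server import SimpleXMLRPCServer

def transfer(account_from, account_to, amount):
    # Executes on the server; the remote caller sees only the return value.
    return {"from": account_from, "to": account_to, "amount": amount, "status": "ok"}

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False, allow_none=True)
server.register_function(transfer, "transfer")
# server.serve_forever()            # run this in the server process

# Client side, in a separate process:
# from xmlrpc.client import ServerProxy
# proxy = ServerProxy("http://localhost:8000/")
# result = proxy.transfer("A", "B", 100)   # looks like a local procedure call; a
#                                          # request message is sent under the covers

The client-side call looks like an ordinary local procedure call; the marshalling of arguments and the request/reply messages across the network are handled under the covers by the RPC library, as described above.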
• 25. Figure 1.3: Two standard architectures for parallel systems (P: processor, M: memory). (a) Uniform memory access (UMA) multiprocessor system. (b) Non-uniform memory access (NUMA) multiprocessor. In both architectures, the processors may locally cache data from memory.
1.4 Relation to Parallel Multiprocessor/Multicomputer Systems
The characteristics of a distributed system were identified above. A typical distributed system would look as shown in Figure 1.1. However, how does one classify a system that meets some but not all of the characteristics? Is the system still a distributed system, or does it become a parallel multiprocessor system? To better answer these questions, we first examine the architecture of parallel systems, and then examine some well-known taxonomies for multiprocessor/multicomputer systems.
1.4.1 Characteristics of Parallel Systems
A parallel system may be broadly classified as belonging to one of three types.
1. A multiprocessor system is a parallel system in which the multiple processors have direct access to shared memory which forms a common address space. The architecture is shown in Figure 1.3(a). Such processors usually do not have a common clock. A multiprocessor system usually corresponds to a uniform memory access (UMA) architecture in which the access latency, i.e., the waiting time to complete an access to any memory location from any processor, is the same. The processors are in very close physical proximity, are usually very tightly coupled (homogeneous hardware and software), and are connected by an interconnection network. Interprocess communication across processors is traditionally through read and write operations on the shared memory, although the use of message-passing primitives such as those provided by MPI is also possible (using emulation on the shared memory). All the processors usually run the same operating system, and both the hardware and software are very tightly coupled. The processors are usually of the same type, and are housed within the same box/container with a shared memory among them. The interconnection network to access the memory may 5
• 26. be a bus, although for greater efficiency, it is usually a multistage switch with a symmetric and regular design.
Figure 1.4 shows two popular interconnection networks – the Omega network and the Butterfly network, each of which is a multi-stage network formed of 2x2 switching elements. Each 2x2 switch allows data on either of the two input wires to be switched to the upper or the lower output wire. However, in a single step, only one data unit can be sent on an output wire. So if the data from both the input wires is to be routed to an output wire in a single step, there is a collision. Various techniques such as buffering or more elaborate interconnection designs can address collisions. Each 2x2 switch is represented as a rectangle in the figure. Furthermore, an n-input and n-output network uses log2 n stages and log2 n bits for addressing. Routing in the 2x2 switch at stage k uses only the kth bit, and hence can be done at clock speed in hardware. The multi-stage networks can be constructed recursively, and the interconnection pattern between any two stages can be expressed using an iterative or a recursive generating function. Besides the Omega and Butterfly (banyan) networks, other examples of multistage interconnection networks are the Benes and the shuffle-exchange networks. Each of these has very interesting mathematical properties that allow rich connectivity between the processor bank and memory bank.
Omega interconnection function. The Omega network which connects n processors to n memory units has (n/2) · log2 n switching elements of size 2x2 arranged in log2 n stages. Between each pair of adjacent stages of the Omega network, a link exists between output i of a stage and the input j to the next stage according to the following perfect shuffle pattern, which is a left-rotation operation on the binary representation of i to get j. The iterative generation function is as follows:
j = 2i, for 0 ≤ i ≤ n/2 − 1
j = 2i + 1 − n, for n/2 ≤ i ≤ n − 1 (1.1)
Consider any stage of switches. Informally, the upper (lower) input lines for each switch come in sequential order from the upper (lower) half of the switches in the earlier stage. With respect to the Omega network in Figure 1.4(a), n = 8. Hence, for any stage, for the outputs i, where 0 ≤ i ≤ 3, the output i is connected to input 2i of the next stage. For 4 ≤ i ≤ 7, the output i of any stage is connected to input 2i + 1 − n of the next stage.
Omega routing function. The routing function from input line i to output line j considers only j and the stage number s, where s ∈ [0, log2 n − 1]. In a stage s switch, if the (s + 1)th MSB of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire.
Butterfly interconnection function. Unlike the Omega network, the generation of the interconnection pattern between a pair of adjacent stages depends not only on n but also on the stage number s. The recursive expression is as follows. Let there be M = n/2 switches per stage, and let a switch be denoted by the tuple x, s , where x ∈ [0, M − 1] and stage s ∈ [0, log2 n − 1]. 6
  • 27. The two outgoing edges from any switch x, s are as follows. There is an edge from switch x, s to switch y, s + 1 if (i) x = y or (ii) x XOR y has exactly one 1 bit, which is in the (s + 1)th MSB. For stage s, apply the rule above for M/2s switches. Whether the two incoming connections go to the upper or the lower input port is not impor- tant because of the routing function, given below. Example. Consider the Butterfly network in Figure 1.4(b), n = 8 and M = 4. There are three stages, s = 0, 1, 2, and the interconnection pattern is defined between s = 0 and s = 1 and between s = 1 and s = 2. The switch number x varies from 0 to 3 in each stage, i.e., x is a 2-bit string. (Note that unlike the Omega network formulation using input and output lines given above, this formulation uses switch numbers. Exercise 5 asks you to prove a formulation of the Omega interconnection pattern using switch numbers instead of input and output port numbers.) Consider the first stage interconnection (s = 0) of a butterfly of size M, and hence having log2 2M stages. For stage s = 0, as per rule (i), the first output line from switch 00 goes to input line of switch 00 of stage s = 1. As per rule (ii), the second output line of switch 00 goes to input line of switch 10 of stage s = 1. Similarly, x, s = (01) has one output line go to an input line of switch (11) in stage s = 1. The other connections in this stage can be determined similarly. For stage s = 1 connecting to stage s = 2, we apply the rules considering only M/21 = M/2 switches, i.e., builds 2 butterflies of size M/2 - the ”upper half" and the ”lower half" switches. The recursion terminates for M/2s = 1, when there is a single switch. Butterfly routing function. In a stage s switch, if the s + 1th MSB of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire. Observe that for the Butterfly and the Omega networks, the paths from the different inputs to any one output form a spanning tree. This implies that collisions will occur when data is destined to the same output line. However, the advantage is that data can be combined at the switches if the application semantics (e.g., summation of numbers) are known. 2. A multicomputer parallel system is a parallel system in which the multiple processors do not have direct access to shared memory. The memory of the multiple processors may or may not form a common address space. Such computers usually do not have a common clock. The architecture is shown in Figure 1.3(b). The processors are in close physical proximity and are usually very tightly coupled (homoge- nous hardware and software), and connected by an interconnection network. The processors communicate either via a common address space or via message-passing. A multicomputer system that has a common address space usually corresponds to a non-uniform memory ac- cess (NUMA) architecture in which the latency to access various shared memory locations from the different processors varies. Examples of parallel multicomputers are: the NYU Ultracomputer and the Sequent shared memory machines, the CM* Connection machine and processors configured in regular and symmetrical topologies such as an array or mesh, ring, torus, cube, and hypercube (message- passing machines). The regular and symmetrical topologies have interesting mathematical 7
• 28. Figure 1.4: Interconnection networks for shared memory multiprocessor systems. (a) Three-stage Omega network for n = 8 processors P0–P7 and memory banks M0–M7 (M = 4). (b) Three-stage Butterfly network for n = 8 processors P0–P7 and memory banks M0–M7 (M = 4).
properties that enable very easy routing and provide many rich features such as alternate routing.
Figure 1.5(a) shows a wrap-around 4x4 mesh. For a k × k mesh, which will contain k^2 processors, the maximum path length between any two processors is 2⌊k/2⌋. Routing can be done along the Manhattan grid. Figure 1.5(b) shows a 4-dimensional hypercube. A k-dimensional hypercube has 2^k processor-and-memory units. Each such unit is a node in the hypercube, and has a unique k-bit label. Each of the k dimensions is associated with a bit position in the label. The labels of any two adjacent nodes are identical except for the bit position corresponding to the dimension in which the two nodes differ. Thus, the processors are labelled such that the shortest path between any two processors is the Hamming distance (defined as the number of bit positions in which the two equal-sized bit strings differ) between the processor labels. This is clearly bounded by k.
Example. Nodes 0101 and 1100 have a Hamming distance of 2. The shortest path between them has length 2.
Routing in the hypercube is done hop-by-hop. At any hop, the message can be sent along any dimension corresponding to the bit position in which the current node’s address and the destination address differ. The 4-D hypercube shown in the figure is formed by connecting the corresponding edges of two 3-D hypercubes (corresponding to the left and the right “cubes” in the figure) along the fourth dimension; the labels of the 4-D hypercube are formed by prepending a ‘0’ to the labels of the left 3-D hypercube and prepending a ‘1’ to the labels of the right 3-D hypercube. This can be extended to construct hypercubes of higher dimensions. Observe that there are multiple routes between any pair of nodes, which provides fault-tolerance as well as a congestion control mechanism. The hypercube and its variant topologies have very interesting mathematical properties with implications for routing and fault-tolerance.
3. Array processors belong to a class of parallel computers that are physically co-located, are 8
  • 29. 0010 0111 0011 0101 (b)(a) processor + memory 1100 1000 1110 1010 1111 1011 1001 1101 0001 01100100 0000 Figure 1.5: Some popular topologies for multicomputer shared-memory machines. (a) Wrap- around 2D-mesh, also known as torus. (b) Hypercube of dimension 4. very tightly coupled, and have a common system clock (but may not share memory and communicate by passing data using messages). Array processors and systolic arrays that perform tightly synchronized processing and data exchange in lock-step for applications such as DSP and image processing belong to this category. These applications usually involve a large number of iterations on the data. This class of parallel systems has a very niche market. The distinction between UMA multiprocessors on the one hand, and NUMA and message- passing multicomputers on the other hand, is important because the algorithm design and data and task partitioning among the processors must account for the variable and unpredictable latencies in accessing memory/communication. As compared to UMA systems and array processors, NUMA and message-passing multicomputer systems are less suitable when the degree of granularity of accessing shared data and communication is very fine. The primary and most efficacious use of parallel systems is for obtaining a higher throughput by dividing the computational workload among the processors. The tasks that are most amenable to higher speedups on parallel systems are those that can be partitioned into subtasks very nicely, involving much number-crunching and relatively little communication for synchronization. Once the task has been decomposed, the processors perform large vector, array and matrix computations that are common in scientific applications. Searching through large state spaces can be performed with significant speedup on parallel machines. While such parallel machines were an object of much theoretical and systems research in the 1980s and early 1990s, they have not proved to be economically viable for two related reasons. First, the overall market for the applications that can potentially attain high speedups is relatively small. Second, due to economy of scale and the high processing power offered by relatively inexpensive off-the-shelf networked PCs, specialized parallel machines are not cost-effective to manufacture. They additionally require special compiler and other system support for maximum throughput. 9
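Returning to the interconnection rules described earlier in this section, both the Omega-network perfect shuffle of equation (1.1) and hop-by-hop hypercube routing by Hamming distance can be captured in a few lines. The following Python fragment is only an illustrative sketch; the helper names are ours, not the book’s.

# Illustrative sketches of the interconnection rules discussed above; helper names are ours.

def omega_connection(i, n):
    # Perfect shuffle of equation (1.1): output i of a stage is wired to input j of
    # the next stage; this is a left rotation of the log2(n)-bit label of i.
    return 2 * i if i < n // 2 else 2 * i + 1 - n

def hypercube_route(src, dst, k):
    # Hop-by-hop routing: at each hop, flip one bit position in which the current
    # node's label and the destination label differ.
    assert 0 <= src < 2 ** k and 0 <= dst < 2 ** k
    path, cur = [src], src
    while cur != dst:
        diff = cur ^ dst              # bit positions still to be corrected
        cur ^= diff & -diff           # move along the lowest differing dimension
        path.append(cur)
    return path

print([omega_connection(i, 8) for i in range(8)])              # [0, 2, 4, 6, 1, 3, 5, 7]
print([format(x, "04b") for x in hypercube_route(0b0101, 0b1100, 4)])
# Nodes 0101 and 1100 are at Hamming distance 2, so the route has two hops:
# ['0101', '0100', '1100']

The choice of always flipping the lowest differing bit here is arbitrary; as noted above, any differing dimension may be chosen at each hop, which is precisely what gives the hypercube multiple routes between a pair of nodes and hence fault tolerance.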
  • 30. 1.4.2 Flynn’s Taxonomy Flynn identified four processing modes, based on whether the processors execute the same or different instruction streams at the same time, and whether or not the processors processed the same (identical) data at the same time. It is instructive to examine this classification to understand the range of options used for configuring systems. Single Instruction stream, Single Data stream (SISD). This mode corresponds to the conven- tional processing in the von Neumann paradigm with a single CPU, and a single memory unit connected by a system bus. Single Instruction stream, Multiple Data stream (SIMD). This mode corresponds to the pro- cessing by multiple homogenous processors which execute in lock-step on different data items. Applications that involve operations on large arrays and matrices, such as scientific applications, can best exploit systems that provide the SIMD mode of operation because the data sets can be partitioned easily. Several of the earliest parallel computers, such as Illiac-IV, MPP, CM2, and MasPar MP-1 were SIMD machines. Vector processors, array processors and systolic arrays also belong to the SIMD class of processing. Recent SIMD architectures include co-processing units such as the MMX units in Intel processors (e.g., Pentium with the Streaming SIMD Extensions (SSE) options) and DSP chips such as the Sharc. Multiple Instruction stream, Single Data stream (MISD). This mode corresponds to the exe- cution of different operations in parallel on the same data. This is a specialized mode of operation with limited but niche applications, e.g., visualization. Multiple Instruction stream, Multiple Data stream (MIMD). In this mode, the various proces- sors execute different code on different data. This is the mode of operation in distributed systems as well as in the vast majority of parallel systems. There is no common clock among the system processors. Sun Ultra servers, multicomputer PCs, and IBM SP machines are example machines that execute in MIMD mode. SIMD, MISD, MIMD architectures are illustrated in Figure 1.6. MIMD architectures are most general and allow much flexibility in partitioning code and data to be processed, among the proces- sors. MIMD architectures also include the classically understood mode of execution in distributed systems. 1.4.3 Coupling, Parallelism, Concurrency, and Granularity 1.4.3.1 Coupling. The degree of coupling among a set of modules, whether hardware or software, is measured in terms of the interdependency and binding and/or homogeneity among the modules. When the degree of coupling is high (low), the modules are said to be tightly (loosely) coupled. SIMD and MISD architectures generally tend to be tightly coupled because of the common clocking of the shared instruction stream or the shared data stream. Here we briefly examine various MIMD architectures in terms of coupling. 10
• 31. Figure 1.6: Flynn’s taxonomy of SIMD, MIMD, and MISD architectures for multiprocessor/multicomputer systems (C: control unit, P: processing unit, I: instruction stream, D: data stream).
1. Tightly-coupled multiprocessors (with UMA shared memory). These may be either switch-based (e.g., NYU Ultracomputer, and RP3) or bus-based (e.g., Sequent, Encore).
2. Tightly-coupled multiprocessors (with NUMA shared memory or that communicate by message passing). Examples are the SGI Origin 2000 and the Sun Ultra HPC servers (that communicate via NUMA shared memory), and the hypercube and the torus (that communicate by message passing).
3. Loosely-coupled multicomputers (without shared memory) physically co-located. These may be bus-based (e.g., NOW connected by a LAN or Myrinet card) or using a more general communication network, and the processors may be heterogeneous. In such systems, processors neither share memory nor have a common clock, and hence may be classified as distributed systems – however, the processors are very close to one another, which is characteristic of a parallel system. As the communication latency may be significantly lower than in wide-area distributed systems, the solution approaches to various problems may be different for such systems than for wide-area distributed systems.
4. Loosely-coupled multicomputers (without shared memory and without common clock) that are physically remote. These correspond to the conventional notion of distributed systems.
1.4.3.2 Parallelism or speedup of a program on a specific system. This is a measure of the relative speedup of a specific program on a given machine. The speedup depends on the number of processors and the mapping of the code to the processors. It is expressed as the ratio of the time T(1) with a single processor to the time T(n) with n processors.
1.4.3.3 Parallelism within a parallel/distributed program. This is an aggregate measure of the percentage of time that all the processors are executing CPU instructions productively, as opposed to waiting for communication (either via shared memory or message-passing) operations to complete. The term is traditionally used to characterize parallel 11
  • 32. programs. If the aggregate measure is a function of only the code, then the parallelism is indepen- dent of the architecture. Otherwise, this definition degenerates to the definition of parallelism in Section 1.4.3.2. 1.4.3.4 Concurrency of a program. This is a broader term that means roughly the same as parallelism of a program, but is used in the context of distributed programs. The parallelism/concurrency in a parallel/distributed program can be measured by the ratio of the number of local (non-communication and non-shared memory ac- cess) operations to the total number of operations, including the communication or shared memory access operations. 1.4.3.5 Granularity of a program. The relative measure of the amount of computation to the amount of communication within the parallel/distributed program is termed as granularity. If the degree of parallelism is coarse-grained (fine-grained), there are relatively many more (fewer) productive CPU instruction executions, com- pared to the number of times the processors communicate (either via shared memory or message- passing) and wait to get synchronized with the other processors. Programs with fine-grained par- allelism are best suited for tightly coupled systems. These typically include SIMD and MISD architectures, tightly coupled MIMD multiprocessors (that have shared memory), and loosely- coupled multicomputers (without shared memory) that are physically colocated. If programs with fine-grained parallelism were run over loosely-coupled multiprocessors that are physically remote, the latency delays for the frequent communication over the WAN would significantly degrade the overall throughput. As a corollary, it follows that on such loosely-coupled multicomputers, pro- grams with a coarse-grained communication/message-passing granularity will incur substantially less overhead. Figure 1.2 showed the relationships between the local operating system, the middleware imple- menting the distributed software, and the network protocol stack. Before moving on, we identify various classes of multiprocessor/multicomputer operating systems. • The operating system running on loosely coupled processors (i.e., heterogenous and/or geo- graphically distant processors), which are themselves running loosely coupled software (i.e., software that is heterogenous) is classified as a Network Operating System. In this case, the application cannot run any significant distributed function that is not provided by the Application Layer of the network protocol stacks on the various processors. • The operating system running on loosely coupled processors, which are running tightly cou- pled software (i.e., the middleware software on the processors is homogenous) is classified as a Distributed Operating System. • The operating system running on tightly coupled processors, which are themselves running tightly coupled software is classified as a Multiprocessor Operating System. Such a parallel system can run sophisticated algorithms contained in the tightly coupled software. 12
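As a small numerical illustration of the measures defined in Section 1.4.3, consider the toy figures below; they are invented purely for the example and are not taken from the text.

# Toy numbers, assumed only to illustrate the definitions of Section 1.4.3.
T1, Tn, n = 64.0, 10.0, 8              # runtime with 1 processor and with n processors
speedup = T1 / Tn                      # speedup of the program on this system: 6.4

local_ops, comm_ops = 9000, 1000       # operation counts for one execution (assumed)
concurrency = local_ops / (local_ops + comm_ops)   # 0.9

cpu_time, comm_time = 45.0, 5.0        # time spent computing vs. communicating/waiting
granularity = cpu_time / comm_time     # 9.0: relatively coarse-grained, hence well
                                       # suited to loosely coupled multicomputers
print(speedup, concurrency, granularity)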
  • 33. 1.5 Message Passing Systems versus Shared Memory Systems Shared memory systems are those in which there is a (common) shared address space throughout the system. Communication among processors takes place via shared data variables, and control variables for synchronization among the processors. Semaphores and monitors that were origi- nally designed for shared memory uniprocessors and multiprocessors are examples of how syn- chronization can be achieved in shared memory systems. All multicomputer (NUMA as well as message-passing) systems that do not have a shared address space provided by the underlying architecture and hardware necessarily communicate by message passing. Conceptually, program- mers find it easier to program using shared memory than by message passing. For this and several other reasons that we examine later, the abstraction called shared memory is sometimes provided to simulate a shared address space. For a distributed system, this abstraction is called distributed shared memory. Implementing this abstraction has a certain cost but it simplifies the task of the application programmer. There also exists a well-known folklore result that communication via message-passing can be simulated by communication via shared memory and vice-versa. There- fore, the two paradigms are equivalent. 1.5.1 Emulating message-passing on shared memory system (MP → SM). The shared address space can be partitioned into disjoint parts, one part being assigned to each processor. “Send” and “Receive” operations can be implemented by writing to and reading from the destination/sender processor’s address space, respectively. Specifically, a separate location can be reserved as the mailbox for each ordered pair of processes. A Pi–Pj message-passing can be emulated by a Write by Pi to the mailbox and then a Read by Pj from the mailbox. In the simplest case, these mailboxes can be assumed to have unbounded size. The write and read operations need to be controlled using synchronization primitives to inform the receiver/sender after the data has been sent/received. 1.5.2 Emulating shared memory on a message-passing system (SM → MP). This involves the use of “Send” and “Receive” operations for “Write” and “Read” operations. Each shared location can be modeled as a separate process; “Write” to a shared location is emulated by sending an update message to the corresponding owner process; a “Read” to a shared location is emulated by sending a query message to the owner process. As accessing another processor’s memory requires Send and Receive operations, this emulation is expensive. Although emulating shared memory might seem to be more attractive from a programmer’s perspective, it must be remembered that in a distributed system, it is only an abstraction. Thus, the latencies involved in read and write operations may be high even when using shared memory emulation because the read and write operations are implemented by using network-wide communication under the covers. An application can of course use a combination of shared memory and message-passing. In a MIMD message-passing multicomputer system, each “processor” may be a tightly-coupled mul- tiprocessor system with shared memory. Within the multiprocessor system, the processors com- municate via shared memory. Between two computers, the communication is by message passing. As message-passing systems are more common and more suited for wide-area distributed systems, 13
  • 34. we will consider message-passing systems more extensively than we consider shared memory sys- tems. 1.6 Primitives for Distributed Communication 1.6.1 Blocking/Nonblocking, Synchronous/Asynchronous Primitives Message send and message receive communication primitives are denoted Send() and Receive(), respectively. A Send primitive has at least two parameters - the destination, and the buffer in the user space, containing the data to be sent. Similarly, a Receive primitive has at least two parameters - the source from which the data is to be received (this could be a wildcard), and the user buffer into which the data is to be received. There are two ways of sending data when the Send primitive is invoked - the buffered option and the unbuffered option. The buffered option which is the standard option copies the data from the user buffer to the kernel buffer. The data later gets copied from the kernel buffer onto the network. In the unbuffered option, the data gets copied directly from the user buffer onto the network. For the Receive primitive, the buffered option is usually required because the data may already have arrived when the primitive is invoked, and needs a storage place in the kernel. The following are some definitions of blocking/nonblocking and synchronous/asynchronous primitives. Synchronous primitives. A Send or a Receive primitive is synchronous if both the Send() and Receive() handshake with each other. The processing for the Send primitive completes only after the invoking processor learns that the other corresponding Receive primitive has also been invoked and that the receive operation has been completed. The processing for the Receive primitive completes when the data to be received is copied into the receiver’s user buffer. Asynchronous primitives. A Send primitive is said to be asynchronous if control returns back to the invoking process after the data item to be sent has been copied out of the user-specified buffer. It does not make sense to define asynchronous Receive primitives. Blocking primitives. A primitive is blocking if control returns to the invoking process after the processing for the primitive (whether in synchronous or asynchronous mode) completes. Nonblocking primitives. A primitive is nonblocking if control returns back to the invoking pro- cess immediately after invocation, even though the operation has not completed. For a non- blocking Send, control returns to the process even before the data is copied out of the user buffer. For a nonblocking Receive, control returns to the process even before the data may have arrived from the sender. For nonblocking primitives, a return parameter on the primitive call returns a system-generated handle which can be later used to check the status of completion of the call. The process can check for the completion of the call in two ways. First, it can keep checking (in a loop or periodically) if the handle has been flagged or posted. Second, it can issue a Wait with a list 14
  • 35. Send(X, destination, handlek) //handlek is a return parameter ... ... Wait(handle1, handle2, . . ., handlek, . . . , handlem) //Wait always blocks Figure 1.7: A nonblocking send primitive. When the Wait call returns, at least one of its parameters is posted. of handles as parameters. The Wait call usually blocks until one of the parameter handles is posted. Presumably after issuing the primitive in nonblocking mode, the process has done whatever actions it could and now needs to know the status of completion of the call, there- fore using a blocking Wait() call is usual programming practice. The code for a nonblocking Send would look as shown in Figure 1.7. If at the time that Wait() is issued, the processing for the primitive (whether synchronous or asynchronous) has completed, the Wait returns immediately. The completion of the pro- cessing of the primitive is detectable by checking the value of handlek. If the processing of the primitive has not completed, the Wait blocks and waits for a signal to wake it up. When the processing for the primitive completes, the communication subsystem software sets the value of handlek and wakes up (signals) any process with a Wait call blocked on this handlek. This is called posting the completion of the operation. There are therefore four versions of the Send primitive – synchronous blocking, synchronous nonblocking, asynchronous blocking, and asynchronous nonblocking. For the Receive primitive, there are the blocking synchronous and nonblocking synchronous versions. These versions of the primitives are illustrated in Figure 1.8 using a timing diagram. Here, three time lines are shown for each process: (1) for the process execution, (2) for the user buffer from/to which data is sent/received, and (3) for the kernel/communication subsystem. Blocking synchronous Send (Figure 1.8(a)): The data gets copied from the user buffer to the kernel buffer and is then sent over the network. After the data is copied to the receiver’s system buffer, an acknowledgement back to the sender causes control to return to the process that invoked the Send operation and completes the Send. Nonblocking synchronous Send (Figure 1.8(b)): Control returns back to the invoking process as soon as the copy of data from the user buffer to the kernel buffer is initiated. A parameter in the nonblocking call also gets set with the handle of a location that the user process can later check for the completion of the synchronous send operation. The location gets posted after an acknowledgement returns from the receiver, as per the semantics described for (a). The user process can keep checking for the completion of the nonblocking synchronous Send by testing the returned handle, or it can invoke the blocking Wait operation on the returned handle. Blocking asynchronous Send (Figure 1.8(c)): The user process that invokes the Send is blocked until the data is copied from the user’s buffer to the kernel buffer. (For the unbuffered option, the user process that invokes the Send is blocked until the data is copied from the user’s buffer to the network.) 15
• 36. Figure 1.8: Blocking/nonblocking and synchronous/asynchronous primitives. Process Pi is sending and process Pj is receiving. (a) Blocking synchronous Send and blocking (synchronous) Receive. (b) Nonblocking synchronous Send and nonblocking (synchronous) Receive. (c) Blocking asynchronous Send. (d) Nonblocking asynchronous Send. (In the timing diagrams, S and R mark the points at which the Send and Receive primitives are issued, S_C and R_C the points at which their processing completes, W marks a Wait that the process may issue to check the completion of a nonblocking operation, and P marks the posting of the completion of a previously initiated nonblocking operation; the diagrams also show the durations for which the issuing process is blocked and for which data is copied from or to the user buffer.)
Nonblocking asynchronous Send (Figure 1.8(d)): The user process that invokes the Send is blocked until the transfer of the data from the user’s buffer to the kernel buffer is initiated. (For the unbuffered option, the user process that invokes the Send is blocked until the transfer of the data from the user’s buffer to the network is initiated.) Control returns to the user process as soon as this transfer is initiated, and a parameter in the nonblocking call also gets set with the handle of a location that the user process can check later using the Wait operation for the completion of the asynchronous Send operation. The asynchronous Send completes when the data has been copied out of the user’s buffer. The checking for the completion may be necessary if the user wants to reuse the buffer from which the data was sent.
Blocking Receive (Figure 1.8(a)): The Receive call blocks until the data expected arrives and is written in the specified user buffer. Then control is returned to the user process. 16
  • 37. Nonblocking Receive (Figure 1.8(b)): The Receive call will cause the kernel to register the call and return the handle of a location that the user process can later check for the completion of the nonblocking Receive operation. This location gets posted by the kernel after the expected data arrives and is copied to the user-specified buffer. The user process can keep checking for the completion of the nonblocking Receive by invoking the Wait operation on the returned handle. (If the data has already arrived when the call is made, it would be pending in some kernel buffer, and still needs to be copied to the user buffer.) A synchronous Send is easier to use from a programmer’s perspective because the handshake between the Send and the Receive makes the communication appear instantaneous, thereby sim- plifying the program logic. The “instantaneity” is, of course, only an illusion, as can be seen from Figure 1.8(a) and (b). In fact, the Receive may not get issued until much after the data arrives at Pj, in which case the data arrived would have to be buffered in the system buffer at Pj and not in the user buffer. All this while, the sender would remain blocked. Thus, a synchronous Send lowers the efficiency within process Pi. The nonblocking asynchronous Send (see Figure 1.8(d)) is useful when a large data item is be- ing sent because it allows the process to perform other instructions in parallel with the completion of the Send. The nonblocking synchronous Send (see Figure 1.8(b)) also avoids the potentially large delays for handshaking, particularly when the receiver has not yet issued the Receive call. The nonblocking Receive (see Figure 1.8(b)) is useful when a large data item is being received and/or when the sender has not yet issued the Send call, because it allows the process to perform other instructions in parallel with the completion of the Receive. Note that if the data has already arrived, it is stored in the kernel buffer, and it may take a while to copy it to the user buffer speci- fied in the Receive call. For nonblocking calls, however, the burden on the programmer increases because he has to keep track of the completion of such operations in order to meaningfully reuse (write to or read from) the user buffers again. Thus, conceptually, blocking primitives are easier to use. 1.6.2 Processor Synchrony As opposed to the classification of synchronous and asynchronous communication primitives, there is also the classification of synchronous versus asynchronous processors. Processor synchrony in- dicates that all the processors execute in lock-step with their clocks synchronized. As this syn- chrony is not attainable in a distributed system, what is more generally indicated is that for a large granularity of code, usually termed as a step, the processors are synchronized. This abstraction is implemented using some form of barrier synchronization to ensure that no processor begins exe- cuting the next step of code until all the processors have completed executing the previous steps of code assigned to each of the processors. 1.6.3 Libraries and Standards The previous subsections identified the main principles underlying all communication primitives. In this subsection, we briefly mention some publicly available interfaces that embody some of the above concepts. 17
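Before turning to specific libraries, the handle-based pattern of Figures 1.7 and 1.8 can be sketched with Python threads standing in for the kernel and a queue standing in for the network. The buffer handling and all names below are our own simplifications for illustration, not an actual communication subsystem.

# Simulated nonblocking asynchronous Send with a completion handle (cf. Figure 1.7).
# A thread pool plays the role of the kernel and a queue the role of the network;
# all names here are illustrative assumptions.
import queue
from concurrent.futures import ThreadPoolExecutor, wait

network = queue.Queue()
kernel = ThreadPoolExecutor(max_workers=2)

def _kernel_send(dest, data):
    network.put((dest, bytes(data)))   # copy the data out of the user buffer
    return len(data)                   # completing the future = posting the handle

def send_nonblocking(dest, user_buffer):
    # Control returns immediately; the returned handle is posted on completion.
    return kernel.submit(_kernel_send, dest, user_buffer)

user_buffer = bytearray(b"hello")
handle = send_nonblocking("Pj", user_buffer)   # returns at once
# ... the process can do other useful work here ...
wait([handle])                      # Wait(handle): blocks until the send is posted
user_buffer[:] = b"world"           # only now is it safe to reuse the user buffer
print(network.get())                # ('Pj', b'hello')

In this sketch a blocking Send corresponds to calling handle.result() immediately after issuing the primitive; the four Send versions of Figure 1.8 differ essentially in where and for how long this waiting happens.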
  • 38. There exists a wide range of primitives for message-passing. Many commercial software prod- ucts (banking, payroll, etc. applications) use proprietary primitive libraries supplied with the soft- ware marketed by the vendors (e.g., the IBM CICS software which has a very widely installed customer base worldwide uses its own primitives). The Message-Passing Interface (MPI) library and the PVM (Parallel Virtual Machine) library are used largely by the scientific community, but other alternative libraries exist. Commercial software is often written using the Remote Proce- dure Calls (RPC) mechanism in which procedures that potentially reside across the network are invoked transparently to the user, in the same manner that a local procedure is invoked. Under the covers, socket primitives or socket-like transport layer primitives are invoked to call the proce- dure remotely. There exist many implementations of RPC - for example, Sun RPC, and Distributed Computing Environment (DCE) RPC. “Messaging” and “streaming” are two other mechanisms for communication. With the growth of object based software, libraries for Remote Method Invocation (RMI) and Remote Object Invocation (ROI) with their own set of primitives are being proposed and standardized by different agencies. CORBA (Common Object Request Broker Architecture) and DCOM (Distributed Component Object Model) are two other standardized architectures with their own set of primitives. Additionally, several projects in the research stage are designing their own flavour of communication primitives. 1.7 Synchronous versus Asynchronous Executions internal event send event receive event P P P P 0 1 2 3 m1 m7 m4 m2 m6 m5m3 Figure 1.9: An example of an asynchronous execution in a message-passing system. A timing diagram is used to illustrate the execution. In addition to the two classifications of processor synchrony/ asynchrony and of synchronous/ asynchronous communication primitives, there is another classification, namely that of synchronous/ asynchronous executions. • An asynchronous execution is an execution in which (1) there is no processor synchrony and there is no bound on the drift rate of processor clocks, (2) message delays (transmission + propagation times) are finite but unbounded, and (3) there is no upper bound on the time taken by a process to execute a step. An example asynchronous execution with four processes P0 to P3 is shown in Figure 1.9. The arrows denote the messages; the head and tail of an arrow mark the send and receive event for that message, denoted by a circle and vertical 18
  • 39. line, respectively. Non-communication events, also termed as internal events, are shown by shaded circles. • A synchronous execution is an execution in which (1) processors are synchronized and the clock drift rates between any two processors is bounded, (2) message delivery (transmission + delivery) times are such that they occur in one logical step or round, and (3) there is a known upper bound on the time taken by a process to execute a step. An example of a synchronous execution with four processes P0 to P3 is shown in Figure 1.10. The arrows denote the messages. It is easier to design and verify algorithms assuming synchronous executions because of the coordinated nature of the executions at all the processes. However, there is a hurdle to having a truly synchronous execution. It is practically difficult to build a completely synchronous sys- tem, and have the messages delivered within a bounded time. Therefore, this synchrony has to be simulated under the covers, and will inevitably involve delaying or blocking some processes for some time durations. Thus, synchronous execution is an abstraction that needs to be provided to the programs. When implementing this abstraction, observe that the fewer the steps or “synchro- nizations” of the processors, the lower the delays and costs. If processors are allowed to have an asynchronous execution for a period of time and then they synchronize, then the granularity of the synchrony is coarse. This is really a virtually synchronous execution, and the abstraction is sometimes termed as virtual synchrony. Ideally, many programs want the processes to execute a series of instructions in rounds (also termed as steps or phases) asynchronously, with the require- ment that after each round/step/phase, all the processes should be synchronized and all messages sent should be delivered. This is the commonly understood notion of a synchronous execution. Within each round/phase/step, there may be a finite and bounded number of sequential sub-rounds (or sub-phases or sub-steps) that processes execute. Each sub-round is assumed to send at most one message per process; hence the message(s) sent will reach in a single message hop. The timing diagram of an example synchronous execution is shown in Figure 1.10. In this system, there are four nodes P0 to P3. In each round, process Pi sends a message to P(i+1) mod 4 and P(i−1) mod 4 and calculates some application-specific function on the received values. The messages are sent as per the code given in function Sync_Execution. There are k rounds in the execution. 1.7.1 Emulating an asynchronous system by a synchronous system (A → S). An asynchronous program (written for an asynchronous system) can be emulated on a synchronous system fairly trivially as the synchronous system is a special case of an asynchronous system – all communication finishes within the same round in which it is initiated. 1.7.2 Emulating a synchronous system by an asynchronous system (S → A). A synchronous program (written for a synchronous system) can be emulated on an asynchronous system using a tool called synchronizer, to be studied in Chapter 2. 19
• 40. (1) Sync_Execution(int k, n) // There are k rounds.
(2) for r = 1 to k do
(3) process i sends a message to (i + 1) mod n and (i − 1) mod n;
(4) each process i receives a message from (i + 1) mod n and (i − 1) mod n;
(5) compute some application-specific function on the received values.
Figure 1.10: An example of a synchronous execution in a message-passing system, showing three phases (rounds) at processes P0–P3. All the messages sent in a round are received within that same round.
1.7.3 Emulations. Section 1.5 showed how a shared memory system could be emulated by a message-passing system, and vice versa. We now have four broad classes of programs, as shown in Figure 1.11: asynchronous message-passing (AMP), synchronous message-passing (SMP), asynchronous shared memory (ASM), and synchronous shared memory (SSM). Using the emulations shown, any class can be emulated by any other. If system A can be emulated by system B, then if a problem is not solvable in B, it is also not solvable in A. Likewise, if a problem is solvable in A, it is also solvable in B. Hence, in a sense, all four classes are equivalent in terms of “computability” – what can and cannot be computed – in failure-free systems. However, in fault-prone systems, this is not the case as we will see in Chapter 14; a synchronous system offers more computability than an asynchronous system.
Figure 1.11: Emulations among the principal system classes (AMP, SMP, ASM, SSM) in a failure-free system, connected by the MP→SM, SM→MP, A→S, and S→A emulations. 20
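The round structure of Figure 1.10 can be simulated on an asynchronous platform in the spirit of such an emulation. The following sketch uses Python threads, per-process mailboxes, and a barrier (the barrier also illustrates the notion of processor synchrony from Section 1.6.2); the function names and the min() aggregation are illustrative choices of ours, not the book’s code.

# Simulating the k-round synchronous execution of Figure 1.10 with threads.
# The barrier enforces the round boundaries; names and the aggregation function
# are illustrative assumptions.
import threading, queue

n, k = 4, 3
inboxes = [queue.Queue() for _ in range(n)]      # one mailbox per process
round_barrier = threading.Barrier(n)

def sync_execution(i, value):
    for r in range(k):
        inboxes[(i + 1) % n].put(value)          # send to both ring neighbours
        inboxes[(i - 1) % n].put(value)
        round_barrier.wait()                     # all sends of this round are done
        received = [inboxes[i].get(), inboxes[i].get()]
        value = min([value] + received)          # some application-specific function
        round_barrier.wait()                     # round boundary: all receives done
    print(f"P{i}: {value}")

threads = [threading.Thread(target=sync_execution, args=(i, 10 * i)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each pair of barrier calls implements one round: no process begins the sends of round r + 1 until every process has received all the messages of round r, which is exactly the property assumed of a synchronous execution above.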
  • 41. 1.8 Design Issues and Challenges Distributed computing systems have been in widespread existence since the 1970s when the inter- net and ARPANET came into being. At the time, the primary issues in the design of the distributed systems included providing access to remote data in the face of failures, file system design and directory structure design. While these continue to be important issues, many newer issues have surfaced as the widespread proliferation of the high-speed high-bandwidth internet and distributed applications continues rapidly. Below we describe the important design issues and challenges after categorizing them as (i) having a greater component related to systems design and operating systems design, or (ii) having a greater component related to algorithm design, or (iii) emerging from recent technology advances and/or driven by new applications. There is some overlap between these categories. However, it is useful to identify these categories because of the chasm among the (i) the systems community, (ii) the theoretical algorithms community within distributed computing, and (iii) the forces driving the emerging applications and technology. For example, the current practice of distributed computing follows the client-server architecture to a large degree, whereas that receives scant attention in the theoretical distributed algorithms community. Two reasons for this chasm are as follows. First, an overwhelming number of applications outside the scientific computing community of users of distributed systems are business applications for which simple models are adequate. For example, the client server model has been firmly entrenched with the legacy applications first developed by the Blue Chip companies (e.g., HP, IBM, Wang, DEC which is now Compaq, Microsoft) since the 1970s and 1980s. This model is largely adequate for traditional business applications. Second, the state of the practice is largely controlled by industry standards, which do not necessarily choose the “technically best” solution. 1.8.1 Distributed Systems Challenges from a System Perspective The following functions must be addressed when designing and building a distributed system. • Communication. This task involves designing appropriate mechanisms for communication among the processes in the network. Some example mechanisms are: Remote Procedure Call (RPC), Remote Object Invocation (ROI), message-oriented communication versus stream- oriented communication. • Processes. Some of the issues involved are: management of processes and threads at clients/servers; code migration; and the design of software and mobile agents. • Naming. Devising easy to use and robust schemes for names, identifiers, and addresses is essential for locating resources and processes in a transparent and scalable manner. Naming in mobile systems provides additional challenges because naming cannot easily be tied to any static geographical topology. • Synchronization. Mechanisms for synchronization or coordination among the processes are essential. Mutual exclusion is the classical example of synchronization, but many other forms of synchronization, such as leader election are also needed. In addition, synchronizing physical clocks, and devising logical clocks that capture the essence of the passage of time, as well as global state recording algorithms all require different forms of synchronization. 21
  • 42. • Data storage and access. Schemes for data storage, and implicitly for accessing the data in a fast and scalable manner across the network, are important for efficiency. Traditional issues such as file system design have to be reconsidered in the setting of a distributed system. • Consistency and replication. To avoid bottlenecks, to provide fast access to data, and to provide scalability, replication of data objects is highly desirable. This leads to issues of managing the replicas, and dealing with consistency among the replicas/caches in a distributed setting. A simple example issue is deciding the level of granularity (i.e., size) of data access. • Fault tolerance. Fault tolerance requires maintaining correct and efficient operation in spite of any failures of links, nodes, and processes. Process resilience, reliable communication, distributed commit, checkpointing and recovery, agreement and consensus, failure detection, and self-stabilization are some of the mechanisms to provide fault-tolerance. • Security. Distributed systems security involves various aspects of cryptography, secure channels, access control, key management – generation and distribution, authorization, and secure group management. • Applications Programming Interface (API) and transparency. The API for communication and other specialized services is important for the ease of use and wider adoption of the distributed systems services by non-technical users. Transparency deals with hiding the implementation policies from the user, and can be classified as follows. Access transparency hides differences in data representation on different systems and provides uniform operations to access system resources. Location transparency makes the locations of resources transparent to the users. Migration transparency allows relocating resources without the users noticing it. The ability to relocate the resources as they are being accessed is relocation transparency. Replication transparency does not let the user become aware of any replication. Concurrency transparency deals with masking the concurrent use of shared resources for the user. Failure transparency refers to the system being reliable and fault-tolerant. • Scalability and modularity. The algorithms, data (objects), and services must be as distributed as possible. Various techniques such as replication, caching and cache management, and asynchronous processing help to achieve scalability. Some of the recent experiments in designing large-scale distributed systems include the Globe project at Vrije University [32], and the Globus project [33]. The Grid infrastructure for large-scale distributed computing is a very ambitious project that has gained significant attention to date [35, 34]. All these projects attempt to provide the above listed functions as efficiently as possible. 1.8.2 Algorithmic Challenges in Distributed Computing The previous section addresses the challenges in designing distributed systems from a system building perspective. In this section, we briefly summarize the key algorithmic challenges in distributed computing. 22
  • 43. Designing useful execution models and frameworks. The interleaving model and partial order model are two widely adopted models of distributed system executions. They have proved to be particularly useful for operational reasoning and the design of distributed algorithms. The Input/Output automata model and the TLA (Temporal Logic of Actions) are two other exam- ples of models that provide different degrees of infrastructure for reasoning more formally with and proving the correctness of distributed programs. Dynamic distributed graph algorithms and distributed routing algorithms. The distributed sys- tem is modeled as a distributed graph, and the graph algorithms form the building blocks for a large number of higher level communication, data dissemination, object location, and object search functions. The algorithms need to deal with dynamically changing graph character- istics, such as to model varying link loads in a routing algorithm. The efficiency of these algorithms impacts not only the user-perceived latency but also the traffic and hence the load or congestion in the network. Hence, the design of efficient distributed graph algorithms is of paramount importance. Time and global state in a distributed system. The processes in the system are spread across three-dimensional physical space. Another dimension, time, has to be superimposed uni- formly across space. The challenges pertain to providing accurate physical time, and to providing a variant of time, called logical time. Logical time is relative time, and elimi- nates the overheads of providing physical time for applications where physical time is not required. More importantly, logical time can (i) capture the logic and inter-process depen- dencies within the distributed program, and also (ii) track the relative progress at each pro- cess. Observing the global state of the system (across space) also involves the time dimension for consistent observation. Due to the inherent distributed nature of the system, it is not possible for any one process to directly observe a meaningful global state across all the processes, without using extra state-gathering effort which needs to be done in a coordinated manner. Deriving appropriate measures of concurrency also involves the time dimension, as judging the independence of different threads of execution depends not only on the program logic but also on execution speeds within the logical threads, and communication speeds among threads. Synchronization/ coordination mechanisms. The processes must be allowed to execute concur- rently, except when they need to synchronize to exchange information, i.e., communicate about shared data. Synchronization is essential for the distributed processes to overcome the limited observation of the system state from the viewpoint of any one process. Over- coming this limited observation is necessary for taking any actions that would impact other processes. The synchronization mechanisms can also be viewed as resource management and concurrency management mechanisms to streamline the behavior of the processes that would otherwise act independently. Here are some examples of problems requiring synchro- nization. • Physical clock synchronization. Physical clocks ususally diverge in their values due to hardware limitations. Keeping them synchronized is a fundamental challenge to maintain common time. 23
  • 44. • Leader election. All the processes need to agree on which process will play the role of a distinguished process – called a leader process. A leader is necessary even for many distributed algorithms because there is often some asymmetry – as in initiating some action like a broadcast or collecting the state of the system, or in “regenerating” a token that gets “lost” in the system. • Mutual exclusion. This is clearly a synchronization problem because access to the critical resource(s) has to be coordinated. • Deadlock detection and resolution. Deadlock detection should be coordinated to avoid duplicate work, and deadlock resolution should be coordinated to avoid unnecessary aborts of processes. • Termination detection. This requires cooperation among the processes to detect the specific global state of quiescence. • Garbage collection. Garbage refers to objects that are no longer in use and that are not pointed to by any other process. Detecting garbage requires coordination among the processes. Group communication, multicast, and ordered message delivery. A group is a collection of pro- cesses that share a common context and collaborate on a common task within an application domain. Specific algorithms need to be designed to enable efficient group communication and group management wherein processes can join and leave groups dynamically, or even fail. When multiple processes send messages concurrently, different recipients may receive the messages in different orders, possibly violating the semantics of the distributed program. Hence, formal specifications of the semantics of ordered delivery need to be formulated, and then implemented. Monitoring distributed events and predicates. Predicates defined on program variables that are local to different processes are useful for specifying conditions on the global system state, and are useful for applications such as debugging, sensing the environment, and in industrial process control. On-line algorithms for monitoring such predicates are hence important. An important paradigm for monitoring distributed events is that of event streaming, wherein streams of relevant events reported from different processes are examined collectively to detect predicates. Typically, the specification of such predicates uses physical or logical time relationships. Distributed program design and verification tools. Methodically designed and verifiably cor- rect programs can greatly reduce the overhead of software design, debugging, and engineer- ing. Designing mechanisms to achieve these design and verification goals is a challenge. Debugging distributed programs. Debugging sequential programs is hard; debugging distributed programs is that much harder because of the concurrency in actions and the ensuing uncer- tainty due to the large number of possible executions defined by the interleaved concurrent actions. Adequate debugging mechanisms and tools need to be designed to meet this chal- lenge. Data replication, consistency models, and caching. Fast access to data and other resources re- quires them to be replicated in the distributed system. Managing such replicas in the face 24
  • 45. of updates introduces the problems of ensuring consistency among the replicas and cached copies. Additionally, placement of the replicas in the systems is also a challenge because resources usually cannot be freely replicated. World Wide Web design – caching, searching, scheduling. The Web is an example of a widespread distributed system with a direct interface to the end user, wherein the operations are predominantly read-intensive on most objects. The issues of object replication and caching discussed above have to be tailored to the web. Further, prefetching of objects, when access patterns and other characteristics of the objects are known, can also be performed. An example of where prefetching can be used is the case of subscribing to Content Distribution Servers. Minimizing response time to minimize user-perceived latencies is an important challenge. Object search and navigation on the web are important functions in the operation of the web, and are very resource-intensive. Designing mechanisms to do this efficiently and accurately is a great challenge. Distributed shared memory abstraction. A shared memory abstraction simplifies the task of the programmer because he has to deal only with Read and Write operations, and no message communication primitives. However, under the covers in the middleware layer, the abstraction of a shared address space has to be implemented by using message-passing. Hence, in terms of overheads, the shared memory abstraction is not less expensive. • Wait-free algorithms. Wait-freedom, which can be informally defined as the ability of a process to complete its execution irrespective of the actions of other processes, gained prominence in the design of algorithms to control access to shared resources in the shared memory abstraction. It corresponds to (n − 1)-fault resilience in an n-process system and is an important principle in fault-tolerant system design. While wait-free algorithms are highly desirable, they are also expensive, and designing low overhead wait-free algorithms is a challenge. • Mutual exclusion. A first course in Operating Systems covers the basic algorithms (such as the Bakery algorithm and using semaphores) for mutual exclusion in a multiprocessing (uniprocessor or multiprocessor) shared memory setting. More sophisticated algorithms – such as those based on hardware primitives, fast mutual exclusion, and wait-free algorithms – will be covered in this book. • Register constructions. In light of promising and emerging technologies of tomorrow – such as biocomputing and quantum computing – that can alter the present foundations of computer “hardware” design, we need to revisit the assumptions of memory access of current systems that are exclusively based on the semiconductor technology and the von Neumann architecture. Specifically, the assumption of single/multiport memory with serial access via the bus in tight synchronization with the system hardware clock may not be valid given the possibility of “unrestricted” and “overlapping” concurrent access to the same memory location. The study of register constructions deals with the design of registers from scratch, with very weak assumptions on the accesses allowed to a register. This field forms a foundation for future architectures that allow concurrent access even to primitive units of memory (independent of technology) without any restrictions on the concurrency permitted. 25
  • 46. • Consistency models. For multiple copies of a variable/object, varying degrees of con- sistency among the replicas can be allowed. These represent a trade-off of coherence versus cost of implementation. Clearly, a strict definition of consistency (such as in a uniprocessor system) would be expensive to implement in terms of high latency, high message overhead, and low concurrency. Hence, relaxed but still meaningful models of consistency are desirable. Reliable and fault-tolerant distributed systems. A reliable and fault-tolerant environment has multiple requirements and aspects, and these can be addressed using various strategies. • Consensus algorithms. All algorithms ultimately rely on message-passing, and the recipients take actions based on the contents of the received messages. Consensus algorithms allow correctly functioning processes to reach agreement among themselves in spite of the existence of some malicious (adversarial) processes whose identities are not known to the correctly functioning processes. The goal of the malicious processes is to prevent the correctly functioning processes from reaching agreement. The malicious processes operate by sending messages with misleading information, to confuse the correctly functioning processes. • Replication and replica management. Replication (as in having backup servers) is a classical method of providing fault-tolerance. The Triple Modular Redundancy (TMR) technique has long been used in software as well as hardware installations. More so- phisticated and efficient mechanisms for replication are the subject of study here. • Voting and quorum systems. Providing redundancy in the active (e.g., processes) or passive (e.g., hardware resources) components in the system and then performing vot- ing based on some quorum criterion is a classical way of dealing with fault-tolerance. Designing efficient algorithms for this purpose is the challenge. • Distributed databases and distributed commit. For distributed databases, the traditional properties of the transaction (A.C.I.D. – Atomicity, Consistency, Isolation, Durability) need to be preserved in the distributed setting. The field of traditional “transaction commit” protocols is a fairly matura area. Transactional properties can also be viewed as having a counterpart for guarantees on message delivery in group communication in the presence of failures. Results developed in one field can be adapted to the other. • Self-stabilizing systems. All system executions have associated good (or legal) states and bad (or illegal) states; during correct functioning, the system makes transitions among the good states. Faults, internal or external to the program and system, may cause a bad state to arise in the execution. A self-stabilizing algorithm is any algorithm that is guaranteed to eventually take the system to a good state even if a bad state were to arise due to some error. Self-stabilizing algorithms require some in-built redundancy to track additional variables of the state or by doing extra work. Designing efficient self-stabilizing algorithms is a challenge. • Checkpointing and recovery algorithms. Checkpointing involves periodically record- ing state on secondary storage so that in case of a failure, the entire computation is not lost but can be recovered from one of the recently taken checkpoints. Checkpointing 26
  • 47. in a distributed environment is difficult because if the checkpoints at the different pro- cesses are not coordinated, the local checkpoints may become useless because they are inconsistent with the checkpoints at other processes. • Failure detectors. A fundamental limitation of asynchronous distributed systems is that there is no theoretical bound on the message transmission times. Hence, it is impossible to distinguish a sent-but-not-yet-arrived message from a message that was never sent. This implies that it is impossible using message transmission to determine whether some other process across the network is alive or has failed. Failure detectors represent a class of algorithms that probabilistically suspect another process as having failed (such as after timing out after non-receipt of a message for some time), and then converge on a determination of the up/down status of the suspected process. Load balancing. The goal of load balancing is to gain higher throughput, and reduce the user- perceived latency. Load balancing may be necessary because of a variety of factors such as: high network traffic or high request rate causing the network connection to be a bottleneck, or high computational load. A common situation where load balancing is used is in server farms, where the objective is to service incoming client requests with the least turnaround time. Several results from traditional operating systems can be used here, although they need to be adapted to the specifics of the distributed environment. The following are some forms of load balancing. • Data migration. The ability to move data (which may be replicated) around in the system, based on the access pattern of the users. • Computation migration. The ability to relocate processes in order to perform a redis- tribution of the workload. • Distributed scheduling. This achieves a better turnaround time for the users by using idle processing power in the system more efficiently. Real-time scheduling. Real-time scheduling is important for mission-critical applications, to ac- complish the task execution on schedule. The problem becomes more challenging in a distributed system where a global view of the system state is absent. On-line or dynamic changes to the schedule are also harder to make without a global view of the state. Furthermore, message propagation delays which are network-dependent are hard to con- trol or predict, which makes meeting real-time guarantees that are inherently dependent on communication among the processes harder. Although networks offering Quality-of-Service guarantees can be used, they alleviate the uncertainty in propagation delays only to a limited extent. Further, such networks may not always be available. Performance. Although high throughput is not the primary goal of using a distributed system, achieving good performance is important. In large distributed systems, network latency (propagation and transmission times) and access to shared resources can lead to large delays which must be minimized. The user-perceived turn-around time is very important. The following are some example issues arise in determining the performance. 27
  • 48. • Metrics. Appropriate metrics must be defined or identified for measuring the performance of theoretical distributed algorithms, as well as for implementations of such algorithms. The former would involve various complexity measures on the metrics, whereas the latter would involve various system and statistical metrics. • Measurement methods/tools. As a real distributed system is a complex entity and has to deal with all the difficulties that arise in measuring performance over a WAN/the Internet, appropriate methodologies and tools must be developed for measuring the performance metrics. 1.8.3 Applications of Distributed Computing and Newer Challenges Mobile systems. Mobile systems typically use wireless communication, which is based on electromagnetic waves and utilizes a shared broadcast medium. Hence, the characteristics of communication are different; many issues such as range of transmission and power of transmission come into play, besides various engineering issues such as battery power conservation, interfacing with the wired internet, signal processing and interference. From a computer science perspective, there is a rich set of problems such as routing, location management, channel allocation, localization and position estimation, and the overall management of mobility. There are two popular architectures for a mobile network. The first is the base-station approach, also known as the cellular approach, wherein a cell, which is the geographical region within range of a static but powerful base transmission station, is associated with that base station. All mobile processes in that cell communicate with the rest of the system via the base station. The second approach is the ad-hoc network approach where there is no base station (which essentially acted as a centralized node for its cell). All responsibility for communication is distributed among the mobile nodes, wherein mobile nodes have to participate in routing by forwarding packets of other pairs of communicating nodes. Clearly, this is a complex model. It poses many graph-theoretical challenges from a computer science perspective, in addition to various engineering challenges. Sensor networks. A sensor is a processor with an electro-mechanical interface that is capable of sensing physical parameters, such as temperature, velocity, pressure, humidity, and chemicals. Recent developments in cost-effective hardware technology have made it possible to deploy very large numbers (of the order of 10^6 or higher) of low-cost sensors. An important paradigm for monitoring distributed events is that of event streaming, defined earlier. The streaming data reported from a sensor network differs from the streaming data reported by “computer processes” in that the events reported by a sensor network are in the environment, external to the computer network and processes. This limits the nature of information about the reported event in a sensor network. Sensor networks have a wide range of applications. Sensors may be mobile or static; sensors may communicate wirelessly, although they may also communicate across a wire when they are statically installed. Sensors may have to self-configure to form an ad-hoc network, which introduces a whole new set of challenges, such as position estimation and time estimation. 28
  • 49. Ubiquitous or pervasive computing. Ubiquitous systems represent a class of computing where the processors embedded in and seamlessly pervading through the environment perform application functions in the background, much like in the sci-fi movies. The intelligent home and the smart workplace are some examples of ubiquitous environments currently under intense research and development. Ubiquitous systems are essentially distributed systems; recent advances in technology allow them to leverage wireless communication and sensor and actuator mechanisms. They can be self-organizing and network-centric, while also being resource constrained. Such systems are typically characterized as having many small processors operating collectively in a dynamic ambient network. The processors may be connected to more powerful networks and processing resources in the background for processing and collating data. Peer-to-peer computing. Peer-to-peer (P2P) computing represents computing over an application layer network wherein all interactions among the processors are at a “peer” level, without any hierarchy among the processors. Thus, all processors are equal and play a symmetric role in the computation. P2P computing arose as a paradigm shift from client-server computing where the roles among the processors are essentially asymmetrical. P2P networks are typically self-organizing, and may or may not have a regular structure to the network. No central directories (such as those used in Domain Name Servers) for name resolution and object lookup are allowed. Some of the key challenges include: object storage mechanisms, efficient object lookup and retrieval in a scalable manner; dynamic reconfiguration with nodes as well as objects joining and leaving the network randomly; replication strategies to expedite object search; tradeoffs between object size, latency, and table sizes; anonymity, privacy, and security. Publish-subscribe, content distribution, and multimedia. With the explosion in the amount of information, there is a greater need to receive and access only information of interest. Such information can be specified using filters. In a dynamic environment where the information constantly fluctuates (varying stock prices is a typical example), there needs to be: (i) an efficient mechanism for distributing this information (publish), (ii) an efficient mechanism to allow end users to indicate interest in receiving specific kinds of information (subscribe), and (iii) an efficient mechanism for aggregating large volumes of published information and filtering it as per the user’s subscription filter. Content distribution refers to a class of mechanisms, primarily in the web and P2P computing context, whereby specific information which can be broadly characterized by a set of parameters is to be distributed to interested processes. Clearly, there is overlap between content distribution mechanisms and publish-subscribe mechanisms. When the content involves multimedia data, special requirements such as the following arise: multimedia data is usually very large and information-intensive, requires compression, and often requires special synchronization during storage and playback. Distributed agents. Agents are software processes or robots that can move around the system to do specific tasks for which they are specially programmed. The name “agent” derives from the fact that the agents do work on behalf of some broader objective.
Agents collect and process information, and can exchange such information with other agents. Often, the agents 29
  • 50. cooperate as in an ant colony, but they can also have friendly competition, as in a free market economy. Challenges in distributed agent systems include coordination mechanisms among the agents, controlling the mobility of the agents, and their software design and interfaces. Research in agents is inter-disciplinary: spanning artificial intelligence, mobile computing, economic market models, software engineering, and distributed computing. Distributed data mining. Data mining algorithms examine large amounts of data to detect pat- terns and trends in the data, to mine or extract useful information. A traditional example is: examining the purchasing patterns of customers in order to profile the customers and enhance the efficacy of directed marketing schemes. The mining can be done by applying database and artificial intelligence techniques to a data repository. In many situations, the data is nec- essarily distributed and cannot be collected in a single repository, as in banking applications where the data is private and sensitive, or in atmospheric weather prediction where the data sets are far too massive to collect and process at a single repository in real-time. In such cases, efficient distributed data mining algorithms are required. Grid computing. Analogous to the electrical power distribution grid, it is envisaged that the in- formation and computing grid will become a reality some day. Very simply stated, idle CPU cycles of machines connected to the network will be available to others. Many challenges in making grid computing a reality include: scheduling jobs in such a distributed environment, a framework for implementing Quality of Service and real-time guarantees, and of course, security of individual machines as well as of jobs being executed in this setting. Security in distributed systems. The traditional challenges of security in a distributed setting in- clude: confidentiality (ensuring that only authorized processes can access certain informa- tion), authentication (ensuring the source of received information and the identity of the sending process), and availability (maintaining allowed access to services despite malicious actions). The goal is to meet these challenges with efficient and scalable solutions. These basic challenges have been addressed in traditional distributed settings. For the newer dis- tributed architectures, such as wireless, peer-to-peer, grid, and pervasive computing dis- cussed in this subsection), these challenges become more interesting due to factors such as a resource-constrained environment, a broadcast medium, the lack of structure, and the lack of trust in the network. 1.9 Selection and Coverage of Topics This is a long list of topics and difficult to cover in a single textbook. This book covers a broad selection of topics from the above list, in order to present the fundamental principles underlying the various topics. The goal has been to select topics that will give a good understanding of the field, and of the techniques used to design solutions. Some topics that have been omitted are interdisciplinary, across fields within computer science. An example is load balancing, which is traditionally covered in detail in a course on Parallel Processing. As the focus of distributed systems has shifted away from gaining higher efficiency to providing better services and fault-tolerance, the importance of load balancing in distributed computing has diminished. Another example is mobile systems. A mobile system is a distributed 30
  • 51. system having certain unique characteristics, and there are courses devoted specifically to mobile systems. 1.10 Chapter Summary This chapter first characterized distributed systems by looking at various informal definitions based on functional aspects. It then looked at various architectures of multiple processor systems, and the requirements that have traditionally driven distributed systems. The relationship of a distributed system to “middleware”, the operating system, and the network protocol stack provided a different perspective on a distributed system. The relationship between parallel systems and distributed systems, covering aspects such as degrees of software and hardware coupling, and the relative placement of the processors, memory units, and interconnection networks, was examined in detail. There is some overlap between the fields of parallel computing and distributed computing, and hence it is important to understand their relationship clearly. For example, various interconnection networks such as the Omega network, the Butterfly network, and the hypercube network were designed for parallel computing, but they are recently finding surprising applications in the design of application-level overlay networks for distributed computing. The traditional taxonomy of multiple processor systems by Flynn was also studied. Important concepts such as the degree of parallelism and of concurrency, and the degree of coupling were also introduced informally. The chapter then introduced three fundamental concepts in distributed computing. The first concept is the paradigm of shared memory communication versus message-passing communication. The second concept is the paradigm of synchronous executions and asynchronous executions. For both these concepts, emulation of one paradigm by another was studied for error-free systems. The third concept was that of synchronous and asynchronous send communication primitives, of synchronous receive communication primitives, and of blocking and nonblocking send and receive communication primitives. The chapter then presented design issues and challenges in the field of distributed computing. The challenges were classified as (i) being important from a systems design perspective, or (ii) being important from an algorithmic perspective, or (iii) those that are driven by new applications and emerging technologies. This classification is not orthogonal and is somewhat subjective. The various topics that will be covered in the rest of the book are portrayed on a miniature canvas in the section on the design issues and challenges. 1.11 Bibliographic Notes The selection of topics and material for this book has been shaped by the authors’ perception of the importance of various subjects, as well as the coverage by the existing textbooks. There are many books on distributed computing and distributed systems. Attiya and Welch [1] and Lynch [7] provide a formal theoretical treatment of the field. The books by Barbosa [2] and Tel [12] focus on algorithms. The books by Chow and Johnson [3], Coulouris, Dollimore, and Kindberg [4], Garg [5], Goscinski [6], Mullender [8], Raynal [9], Singhal and Shivaratri [10], and Tanenbaum and van Steen [11] provide a blend of theoretical and systems issues. 31
  • 52. Much of the material in this introductory chapter is based on well understood concepts and paradigms in the distributed systems community, and is difficult to attribute to any particular source. A recent overview of the challenges in middleware design from a systems perspective is given in the special issue by Lea, Vinoski, and Vogels [30]. An overview of the Common Object Request Broker Architecture (CORBA) of the Object Management Group (OMG) is given by Vinoski [22]. The Distributed Component Object Model (DCOM) from Microsoft, Sun’s Java Remote Method Invocation (RMI), and CORBA are analyzed in perspective by Campbell, Coulson, and Kounavis [26]. A detailed treatment of CORBA, RMI, and RPC is given by Coulouris, Dollimore, and Kindberg [4]. The Open Software Foundation’s Distributed Computing Environment (DCE) is described in [31]; DCE is not likely to enjoy a continuing support base. Descriptions of the Message Passing Interface can be found in Snir et al. [23] and Gropp et al. [24]. The Parallel Virtual Machine (PVM) framework for parallel distributed programming is described by Sunderam [29]. The discussion of parallel processing, and of the UMA and NUMA parallel architectures, is based on Kumar, Grama, Gupta, and Karypis [18]. The properties of the hypercube architecture are surveyed by Feng [27] and Harary, Hayes, and Wu [28]. The multi-stage interconnection architectures – the Omega (Benes) [14], the Butterfly [16], and the Clos [15] – were proposed in the papers indicated. A good overview of multistage interconnection networks is given by Wu and Feng [25]. Flynn’s taxonomy of multiprocessors is based on [21]. The discussion on blocking/nonblocking primitives as well as synchronous and asynchronous primitives is extended from Cypher and Leu [20]. The section on design issues and challenges is based on the vast research literature in the area. The Globe architecture is described by van Steen, Homburg, and Tanenbaum [32]. The Globus architecture is described by Foster and Kesselman [33]. The grid infrastructure and the distributed computing vision for the 21st century are described by Foster and Kesselman [35] and by Foster [34]. The World-Wide Web is an excellent example of a distributed system that has largely evolved on its own; Tim Berners-Lee is credited with seeding the WWW project; its early description is given by Berners-Lee, Cailliau, Luotonen, Nielsen, and Secret [36]. 1.12 Exercise Problems 1. What are the main differences between a parallel system and a distributed system? 2. Identify some distributed applications in the scientific and commercial application areas. For each application, determine which of the motivating factors listed in Section 1.3 are important for building the application over a distributed system. 3. Draw the Omega and Butterfly networks for n = 16 inputs and outputs. 4. For the Omega and Butterfly networks shown in Figure 1.4, trace the paths from P5 to M2, and from P6 to M1. 5. Formulate the interconnection function for the Omega network having n inputs and outputs, only in terms of the M = n/2 switch numbers in each stage. (Hint: Follow an approach similar to the Butterfly network formulation.) 32
  • 53. 6. In Figure 1.4, observe that the paths from input 000 to output 111 and from input 101 to output 110 have a common edge. Therefore, simultaneous transmission over these paths is not possible; one path blocks another. Hence, the Omega and Butterfly networks are classified as blocking interconnection networks. Let Π(n) be any permutation on {0, . . . , n − 1}, mapping the input domain to the output range. A nonblocking interconnection network allows simultaneous transmission from the inputs to the outputs for any permutation. Consider the network built as follows. Take the image of a butterfly in a vertical mirror, and append this mirror image to the output of a butterfly. Hence, for n inputs and outputs, there will be 2 log2 n stages. Prove that this network is nonblocking. 7. The Baseline Clos network has an interconnection generation function defined as follows. Let there be M = n/2 switches per stage, and let a switch be denoted by the tuple ⟨x, s⟩, where x ∈ [0, M − 1] and stage s ∈ [0, log2 n − 1]. There is an edge from switch ⟨x, s⟩ to switch ⟨y, s + 1⟩ if (i) y is the cyclic right-shift of the (log2 n − s) least significant bits of x, or (ii) y is the cyclic right-shift of the (log2 n − s) least significant bits of x′, where x′ is obtained by complementing the LSB of x. Draw the interconnection diagram for the Clos network having n = 16 inputs and outputs, i.e., having 8 switches in each of the 4 stages. 8. Two interconnection networks are isomorphic if there is a 1:1 mapping f between the switches such that for any switches x and y that are connected to each other in adjacent stages in one network, f(x) and f(y) are also connected in the other network. Show that the Omega, Butterfly, and Clos (Baseline) networks are isomorphic to each other. 9. Explain why a Receive call cannot be asynchronous. 10. What are the three aspects of reliability? Is it possible to order them in different ways in terms of importance, based on different applications’ requirements? Justify your answer by giving examples of different applications. 11. Figure 1.11 shows the emulations among the principal system classes in a failure-free system. (a) Which of these emulations are possible in a failure-prone system? Explain. (b) Which of these emulations are not possible in a failure-prone system? Explain. 12. Examine the impact of unreliable links and node failures on each of the challenges listed in Section 1.8.2. 33
  • 54. Bibliography [1] H. Attiya, J. Welch, Distributed Computing: Fundamentals, Simulations, and Advanced Topics, Wiley Inter-Science, 2nd edition, 2004. [2] V. Barbosa, An Introduction to Distributed Algorithms, MIT Press, 1996. [3] R. Chow and D. Johnson, Distributed Operating Systems and Algorithms, Addison-Wesley, 1997. [4] G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design, Addison-Wesley, 3rd edition, 2001. [5] V. Garg, Elements of Distributed Computing, John Wiley, 2003. [6] A. Goscinski, Distributed Operating Systems: The Logical Design, Addison-Wesley, 1991. [7] N. Lynch, Distributed Algorithms, Morgan Kaufmann, 1996. [8] S. Mullender, Distributed Systems, Addison-Wesley, 2nd edition, 1993. [9] M. Raynal, Distributed Algorithms and Protocols, John Wiley, 1988. [10] M. Singhal and N. Shivaratri, Advanced Concepts in Operating Systems, McGraw Hill, 1994. [11] A. Tanenbaum, M. Van Steen, Distributed Systems: Principles and Paradigms, Prentice-Hall, 2003. [12] G. Tel, Introduction to Distributed Algorithms, Cambridge University Press, 1994. [13] A. Ananda, B. Tay, E. Koh, A survey of asynchronous remote procedure calls, ACM SIGOPS Operating Systems Review, 26(2): 92-109, 1992. [14] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965. [15] C. Clos, A study of non-blocking switching networks, Bell System Technical Journal, 32: 406-424, 1953. [16] J. W. Cooley, J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 19: 297-301, 1965. 34
  • 55. [17] A. Birrell, B. Nelson, Implementing remote procedure calls, ACM Transactions on Computer Systems, 2(1): 39-59, 1984. [18] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin-Cummings, 2nd edition, 2003. [19] A. Tanenbaum, Computer Networks, 3rd edition, Prentice-Hall PTR, NJ, 1996. [20] R. Cypher, E. Leu, The semantics of blocking and nonblocking send and receive primitives, 8th International Symposium on Parallel Processing, 729-735, 1994. [21] M. Flynn, Some computer organizations and their effectiveness, IEEE Trans. Comput., Vol. C-21, pp. 94, 1972. [22] S. Vinoski, CORBA: Integrating diverse applications within heterogeneous distributed environments, IEEE Communications Magazine, 35(2): 1997. [23] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, 1996. [24] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994. [25] C.L. Wu, T.-Y. Feng, On a class of multistage interconnection networks, IEEE Transactions on Computers, Vol. C-29, pp. 694-702, Aug. 1980. [26] A. Campbell, G. Coulson, M. Kounavis, Managing complexity: Middleware explained, IT Professional Magazine, October 1999. [27] T. Y. Feng, A survey of interconnection networks, IEEE Comput., 14, pp. 12-27, Dec. 1981. [28] F. Harary, J.P. Hayes, H. Wu, A survey of the theory of hypercube graphs, Computational Mathematical Applications, 15(4): 277-289, 1988. [29] V. Sunderam, PVM: A framework for parallel distributed computing, Concurrency - Practice and Experience, 2(4): 315-339, 1990. [30] D. Lea, S. Vinoski, W. Vogels, Guest editors’ introduction: Asynchronous middleware and services, IEEE Internet Computing, 10(1): 14-17, 2006. [31] J. Shirley, W. Hu, D. Magid, Guide to Writing DCE Applications, O’Reilly and Associates, Inc., ISBN 1-56592-045-7, USA. [32] M. van Steen, P. Homburg, A. Tanenbaum, Globe: A wide-area distributed system, IEEE Concurrency, 70-78, 1999. [33] I. Foster, C. Kesselman, Globus: A metacomputing infrastructure toolkit, International Journal of Supercomputer Applications, 11(2): 115-128, 1997. [34] I. Foster, The Grid: A new infrastructure for 21st century science, Physics Today, 55(2): 42-47, 2002. 35
  • 56. [35] I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1998. [36] T. Berners-Lee, R. Cailliau, A. Luotonen, H. Nielsen, A. Secret, The World-Wide Web, Com- munications of the ACM, 37(8): 76-82, 1994. 36
  • 57. Chapter 2 A Model of Distributed Computations A distributed system consists of a set of processors that are connected by a communication net- work. The communication network provides the facility of information exchange among proces- sors. The communication delay is finite but unpredictable. The processors do not share a common global memory and communicate solely by passing messages over the communication network. There is no physical global clock in the system to which processes have instantaneous access. The communication medium may deliver messages out of order, messages may be lost, garbled, or duplicated due to timeout and retransmission, processors may fail, and communication links may go down. The system can be modeled as a directed graph in which vertices represent the processes and edges represent unidirectional communication channels. A distributed application runs as a collection of processes on a distributed system. This chapter presents a model of a distributed computation and introduces several terms, concepts, and notations that will be used in the subsequent chapters. 2.1 A Distributed Program A distributed program is composed of a set of n asynchronous processes p1, p2, ..., pi, ..., pn that communicate by message passing over the communication network. Without loss of generality, we assume that each process is running on a different processor. The processes do not share a global memory and communicate solely by passing messages. Let Cij denote the channel from process pi to process pj and let mij denote a message sent by pi to pj. The communication delay is finite and unpredictable. Also, these processes do not share a global clock that is instantaneously accessible to these processes. Process execution and message transfer are asynchronous – a process may execute an action spontaneously and a process sending a message does not wait for the delivery of the message to be complete. The global state of a distributed computation is composed of the states of the processes and the communication channels [2]. The state of a process is characterized by the state of its local memory and depends upon the context. The state of a channel is characterized by the set of messages in transit in the channel. 37
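The system model of this section can be summarized in a few lines of code. The following Python sketch is only illustrative (the class and method names are ours, not the book's): n processes with private local state, a unidirectional channel C_ij for each ordered pair of processes, non-blocking sends, and a network that delivers in-transit messages after finite but unpredictable delays (for simplicity, each channel here happens to be FIFO).

from collections import deque
import random

# A minimal, illustrative model of the system described in this section: n asynchronous
# processes, a unidirectional channel C_ij for every ordered pair (i, j), no shared
# memory and no global clock. Delivery delays are finite but unpredictable.

class DistributedSystem:
    def __init__(self, n):
        self.n = n
        self.local_state = [dict() for _ in range(n)]            # private memory of p_i
        self.channel = {(i, j): deque() for i in range(n)
                        for j in range(n) if i != j}             # C_ij

    def send(self, i, j, msg):
        # Non-blocking send: the message m_ij is simply placed in transit on C_ij.
        self.channel[(i, j)].append(msg)

    def deliver_some(self):
        # The network picks an arbitrary non-empty channel, modeling unpredictable delay.
        in_transit = [c for c in self.channel if self.channel[c]]
        if not in_transit:
            return None
        i, j = random.choice(in_transit)
        msg = self.channel[(i, j)].popleft()
        self.local_state[j]["last_from_%d" % i] = msg            # a receive updates p_j only
        return (i, j, msg)

if __name__ == "__main__":
    system = DistributedSystem(3)
    system.send(0, 1, "m01")
    system.send(2, 1, "m21")
    while system.deliver_some():
        pass
    print(system.local_state[1])

The point of the sketch is the absence of any shared memory or global clock: the only way state flows from p_i to p_j is by a message placed on the channel C_ij and delivered at some later, unpredictable time.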
  • 58. 2.2 A Model of Distributed Executions The execution of a process consists of a sequential execution of its actions. The actions are atomic and the actions of a process are modeled as three types of events, namely, internal events, message send events, and message receive events. Let e_i^x denote the x-th event at process p_i. Subscripts and/or superscripts will be dropped when they are irrelevant or are clear from the context. For a message m, let send(m) and rec(m) denote its send and receive events, respectively. The occurrence of events changes the states of respective processes and channels, thus causing transitions in the global system state. An internal event changes the state of the process at which it occurs. A send event (or a receive event) changes the state of the process that sends (or receives) the message and the state of the channel on which the message is sent (or received). An internal event only affects the process at which it occurs. The events at a process are linearly ordered by their order of occurrence. The execution of process p_i produces a sequence of events e_i^1, e_i^2, ..., e_i^x, e_i^{x+1}, ... and is denoted by H_i, where H_i = (h_i, →_i); h_i is the set of events produced by p_i and the binary relation →_i defines a linear order on these events. Relation →_i expresses causal dependencies among the events of p_i. The send and the receive events signify the flow of information between processes and establish causal dependency from the sender process to the receiver process. A relation →_msg that captures the causal dependency due to message exchange is defined as follows: for every message m that is exchanged between two processes, we have send(m) →_msg rec(m). Relation →_msg defines causal dependencies between the pairs of corresponding send and receive events. [Figure: three process lines p1, p2, p3 with events e_i^1, e_i^2, ... plotted against time, and slant arrows showing message transfers.] Figure 2.1: The space-time diagram of a distributed execution. The evolution of a distributed execution is depicted by a space-time diagram. Figure 2.1 shows the space-time diagram of a distributed execution involving three processes. A horizontal line represents the progress of the process; a dot indicates an event; a slant arrow indicates a message transfer. Generally, the execution of an event takes a finite amount of time; however, since we 38
  • 59. assume that an event execution is atomic (hence, indivisible and instantaneous), it is justified to denote it as a dot on a process line. In this figure, for process p1, the second event is a message send event, the third event is an internal event, and the fourth event is a message receive event. Causal Precedence Relation The execution of a distributed application results in a set of distributed events produced by the processes. Let H = ∪_i h_i denote the set of events executed in a distributed computation. Next, we define a binary relation on the set H, denoted as →, that expresses causal dependencies between events in the distributed execution. For all e_i^x, e_j^y ∈ H: e_i^x → e_j^y ⇔ (e_i^x →_i e_j^y, i.e., (i = j) ∧ (x < y)), or (e_i^x →_msg e_j^y), or (∃e_k^z ∈ H : e_i^x → e_k^z ∧ e_k^z → e_j^y). The causal precedence relation induces an irreflexive partial order on the events of a distributed computation [6] that is denoted as H = (H, →). Note that the relation → is Lamport’s “happens before” relation¹ [4]. For any two events e_i and e_j, if e_i → e_j, then event e_j is directly or transitively dependent on event e_i; graphically, it means that there exists a path consisting of message arrows and process-line segments (along increasing time) in the space-time diagram that starts at e_i and ends at e_j. For example, in Figure 2.1, e_1^1 → e_3^3 and e_3^3 → e_2^6. Note that relation → denotes flow of information in a distributed computation and e_i → e_j dictates that all the information available at e_i is potentially accessible at e_j. For example, in Figure 2.1, event e_2^6 has the knowledge of all other events shown in the figure. For any two events e_i and e_j, e_i ↛ e_j denotes the fact that event e_j is not directly or transitively dependent on event e_i. That is, event e_i does not causally affect event e_j. Event e_j is not aware of the execution of e_i or any event executed after e_i on the same process. For example, in Figure 2.1, e_1^3 ↛ e_3^3 and e_2^4 ↛ e_3^1. Note the following two rules: • For any two events e_i and e_j, e_i → e_j ⇒ e_j ↛ e_i. • For any two events e_i and e_j, e_i ↛ e_j ⇏ e_j → e_i. For any two events e_i and e_j, if e_i ↛ e_j and e_j ↛ e_i, then events e_i and e_j are said to be concurrent, and the relation is denoted as e_i ∥ e_j. In the execution of Figure 2.1, e_1^3 ∥ e_3^3 and e_2^4 ∥ e_3^1. Note that relation ∥ is not transitive; that is, (e_i ∥ e_j) ∧ (e_j ∥ e_k) ⇏ e_i ∥ e_k. For example, in Figure 2.1, e_3^3 ∥ e_2^4 and e_2^4 ∥ e_1^5; however, e_3^3 ∦ e_1^5. Note that for any two events e_i and e_j in a distributed execution, either e_i → e_j, or e_j → e_i, or e_i ∥ e_j. ¹ In Lamport’s “happens before” relation, an event e1 happens before an event e2, denoted by e1 → e2, if (a) e1 occurs before e2 on the same process, or (b) e1 is the send event of a message and e2 is the receive event of that message, or (c) ∃e′ | e1 happens before e′ and e′ happens before e2. 39
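For a finite execution, the relation → can be computed mechanically: take the union of the process orders →_i and the message edges →_msg, and close it transitively. The Python sketch below is illustrative rather than canonical (the function names happens_before and concurrent are ours); it also shows how the concurrency relation ∥ falls out as the absence of causal precedence in either direction.

from itertools import product

# Events are (process, index) pairs; an illustrative sketch, not the book's algorithm.

def happens_before(events, messages):
    """Return the set of ordered pairs (a, b) such that a -> b.
    events:   iterable of (process_id, index) pairs
    messages: iterable of (send_event, receive_event) pairs, i.e., ->msg edges
    """
    events = set(events)
    # Base cases: same-process order (i == j and x < y) plus the message edges.
    edges = {(a, b) for a, b in product(events, events)
             if a[0] == b[0] and a[1] < b[1]}
    edges |= set(messages)
    # Transitive closure gives the full causal precedence relation.
    changed = True
    while changed:
        changed = False
        for a, b in product(events, events):
            if (a, b) not in edges and any((a, c) in edges and (c, b) in edges
                                           for c in events):
                edges.add((a, b))
                changed = True
    return edges

def concurrent(a, b, hb):
    # a || b iff neither a -> b nor b -> a.
    return (a, b) not in hb and (b, a) not in hb

if __name__ == "__main__":
    evts = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1)]
    msgs = [((1, 2), (2, 3))]                  # send at e_1^2, receive at e_2^3
    hb = happens_before(evts, msgs)
    print(((1, 1), (2, 3)) in hb)              # True: via process order and the message
    print(concurrent((1, 1), (3, 1), hb))      # True: no causal path either way

On the small execution in the example, e_1^1 → e_2^3 holds through the message edge, while e_1^1 ∥ e_3^1 since no message connects p_1 and p_3.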
  • 60. Logical vs. Physical Concurrency In a distributed computation, two events are logically concurrent if and only if they do not causally affect each other. Physical concurrency, on the other hand, has a connotation that the events occur at the same instant in physical time. Note that two or more events may be logically concurrent even though they do not occur at the same instant in physical time. For example, in Figure 2.1, the events in the set {e_1^3, e_2^4, e_3^3} are logically concurrent, but they occurred at different instants in physical time. However, note that if the processor speeds and message delays had been different, the execution of these events could very well have coincided in physical time. Whether a set of logically concurrent events coincides in physical time, or in what order in physical time the events occur, does not change the outcome of the computation. Therefore, even though a set of logically concurrent events may not have occurred at the same instant in physical time, for all practical and theoretical purposes, we can assume that these events occurred at the same instant in physical time. 2.3 Models of Communication Network There are several models of the service provided by communication networks, namely, FIFO, Non-FIFO, and causal ordering. In the FIFO model, each channel acts as a first-in first-out message queue and thus, message ordering is preserved by a channel. In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in a random order. The “causal ordering” model [1] is based on Lamport’s “happens before” relation. A system that supports the causal ordering model satisfies the following property: CO: For any two messages m_ij and m_kj, if send(m_ij) → send(m_kj), then rec(m_ij) → rec(m_kj). That is, this property ensures that causally related messages destined to the same destination are delivered in an order that is consistent with their causality relation. Causally ordered delivery of messages implies FIFO message delivery. Furthermore, note that CO ⊂ FIFO ⊂ Non-FIFO. The causal ordering model is useful in developing distributed algorithms. Generally, it considerably simplifies the design of distributed algorithms because it provides a built-in synchronization. For example, in replicated database systems, it is important that every process responsible for updating a replica receives the updates in the same order to maintain database consistency. Without causal ordering, each update must be checked to ensure that database consistency is not being violated. Causal ordering eliminates the need for such checks. 2.4 Global State of a Distributed System The global state of a distributed system is a collection of the local states of its components, namely, the processes and the communication channels [2], [3]. The state of a process at any time is defined by the contents of processor registers, stacks, local memory, etc. and depends on the local context of the distributed application. The state of a channel is given by the set of messages in transit in the channel. The occurrence of events changes the states of respective processes and channels, thus causing transitions in the global system state. For example, an internal event changes the state of the process 40
  • 61. at which it occurs. A send event (or a receive event) changes the state of the process that sends (or receives) the message and the state of the channel on which the message is sent (or received). Let LS_i^x denote the state of process p_i after the occurrence of event e_i^x and before the event e_i^{x+1}. LS_i^0 denotes the initial state of process p_i. LS_i^x is a result of the execution of all the events executed by process p_i till e_i^x. Let send(m) ≤ LS_i^x denote the fact that ∃y : 1 ≤ y ≤ x :: e_i^y = send(m). Likewise, let rec(m) ≰ LS_i^x denote the fact that ∀y : 1 ≤ y ≤ x :: e_i^y ≠ rec(m). The state of a channel is difficult to state formally because a channel is a distributed entity and its state depends upon the states of the processes it connects. Let SC_ij^{x,y} denote the state of a channel C_ij, defined as follows: SC_ij^{x,y} = {m_ij | send(m_ij) ≤ LS_i^x ∧ rec(m_ij) ≰ LS_j^y}. Thus, channel state SC_ij^{x,y} denotes all messages that p_i sent up to event e_i^x and which process p_j had not received until event e_j^y. 2.4.1 Global State The global state of a distributed system is a collection of the local states of the processes and the channels. Notationally, global state GS is defined as GS = {∪_i LS_i^{x_i}, ∪_{j,k} SC_jk^{y_j,z_k}}. For a global snapshot to be meaningful, the states of all the components of the distributed system must be recorded at the same instant. This would be possible if the local clocks at processes were perfectly synchronized or if there were a global system clock that could be instantaneously read by the processes. However, both are impossible. However, it turns out that even if the state of all the components in a distributed system has not been recorded at the same instant, such a state will be meaningful provided every message that is recorded as received is also recorded as sent. The basic idea is that an effect should not be present without its cause. A message cannot be received if it was not sent; that is, the state should not violate causality. Such states are called consistent global states and are meaningful global states. Inconsistent global states are not meaningful in the sense that a distributed system can never be in an inconsistent state. A global state GS = {∪_i LS_i^{x_i}, ∪_{j,k} SC_jk^{y_j,z_k}} is a consistent global state iff it satisfies the following condition: ∀LS_i^{x_i}, ∀SC_ik^{y_i,z_k} :: y_i = x_i. That is, channel state SC_ik^{y_i,z_k} and process state LS_k^{z_k} must not include any message that process p_i sent after executing event e_i^{x_i}, and must include all messages that process p_i sent up to the execution of event e_i^{x_i}. In the distributed execution of Figure 2.2, a global state GS1 consisting of local states {LS_1^1, LS_2^3, LS_3^3, LS_4^2} is inconsistent because the state of p2 has recorded the receipt of message m_12, whereas the state of p1 has not recorded its send. On the contrary, a global state GS2 consisting of local states {LS_1^2, LS_2^4, LS_3^4, LS_4^2} is consistent; all the channels are empty except C_21, which contains message m_21. 41
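The consistency condition lends itself to a direct mechanical check: a recorded global state is consistent iff no message is recorded as received without also being recorded as sent. The Python sketch below is illustrative (the function names and the event indices in the example are ours, chosen only to mimic the flavour of GS1 and GS2); it represents a global state by the index of the last recorded event at each process.

# Illustrative check (not the book's algorithm): a recorded global state assigns to
# each process the index of its last recorded event; it is consistent iff every
# message whose receive is recorded also has its send recorded.

def is_consistent(cut, messages):
    """cut: dict process_id -> index of last recorded event (0 = nothing recorded)
    messages: list of ((pi, x_send), (pj, y_recv)) pairs."""
    for (pi, x_send), (pj, y_recv) in messages:
        received = y_recv <= cut.get(pj, 0)
        sent = x_send <= cut.get(pi, 0)
        if received and not sent:            # an effect recorded without its cause
            return False
    return True

def channel_states(cut, messages):
    """Messages in transit: sent within the recorded state but not yet received in it."""
    return [m for m in messages
            if m[0][1] <= cut.get(m[0][0], 0) and m[1][1] > cut.get(m[1][0], 0)]

if __name__ == "__main__":
    # Event indices below are illustrative, chosen to mimic GS1 and GS2 of Figure 2.2.
    msgs = [((1, 2), (2, 2)),    # m_12: sent as p1's 2nd event, received as p2's 2nd event
            ((2, 3), (1, 4))]    # m_21: sent as p2's 3rd event, received as p1's 4th event
    gs1 = {1: 1, 2: 3, 3: 3, 4: 2}   # p2 records rec(m_12) but p1 does not record send(m_12)
    gs2 = {1: 2, 2: 4, 3: 4, 4: 2}   # consistent; m_21 is in transit on C_21
    print(is_consistent(gs1, msgs))   # False
    print(is_consistent(gs2, msgs))   # True
    print(channel_states(gs2, msgs))  # [((2, 3), (1, 4))], i.e., m_21

The helper channel_states recovers the recorded channel contents: for the consistent state in the example, the only non-empty channel is the one carrying m_21, matching the description of GS2 above.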
• 62. Figure 2.2: The space-time diagram of a distributed execution. A global state GS = {∪_i LS_i^{x_i}, ∪_{j,k} SC_{jk}^{y_j,z_k}} is transitless iff ∀i, ∀j: 1 ≤ i, j ≤ n :: SC_{ij}^{y_i,z_j} = φ. Thus, all channels are recorded as empty in a transitless global state. Note that a transitless global state is always a consistent global state. A global state is strongly consistent iff it is transitless as well as consistent. Note that in Figure 2.2, the global state consisting of local states {LS_1^2, LS_2^3, LS_3^4, LS_4^2} is strongly consistent. Recording the global state of a distributed system is an important paradigm when one is interested in analyzing, monitoring, testing, or verifying properties of distributed applications, systems, and algorithms. The design of efficient methods for recording the global state of a distributed system is an important problem. 2.5 Cuts of a Distributed Computation In the space-time diagram of a distributed computation, a zigzag line joining one arbitrary point on each process line is termed a cut in the computation. Such a line slices the space-time diagram, and thus the set of events in the distributed computation, into a PAST and a FUTURE. The PAST contains all the events to the left of the cut and the FUTURE contains all the events to the right of the cut. For a cut C, let PAST(C) and FUTURE(C) denote the set of events in the PAST and FUTURE of C, respectively. Every cut corresponds to a global state and every global state can be graphically represented as a cut in the computation’s space-time diagram [6]. Definition: If e_i^{Max_PAST_i(C)} denotes the latest event at process p_i that is in the PAST of a cut C, then the global state represented by the cut is {∪_i LS_i^{Max_PAST_i(C)}, ∪_{j,k} SC_{jk}^{y_j,z_k}}, where SC_{jk}^{y_j,z_k} = {m | send(m) ∈ PAST(C) ∧ rec(m) ∈ FUTURE(C)}. A consistent global state corresponds to a cut in which every message received in the PAST of the cut was sent in the PAST of that cut. Such a cut is known as a consistent cut. All messages that cross the cut from the PAST to the FUTURE are in transit in the corresponding consistent
• 63. global state. Figure 2.3: Illustration of cuts in a distributed execution. A cut is inconsistent if a message crosses the cut from the FUTURE to the PAST. For example, the space-time diagram of Figure 2.3 shows two cuts, C_1 and C_2. C_1 is an inconsistent cut, whereas C_2 is a consistent cut. Note that these two cuts respectively correspond to the two global states GS1 and GS2 identified in the previous subsection. Cuts in a space-time diagram provide a powerful graphical aid in representing and reasoning about the global states of a computation. 2.6 Past and Future Cones of an Event In a distributed computation, an event e_j could have been affected only by events e_i such that e_i → e_j; all the information available at such an event e_i could be made accessible at e_j. All such events e_i belong to the past of e_j [6]. Let Past(e_j) denote all events in the past of e_j in a computation (H, →). Then, Past(e_j) = {e_i ∈ H | e_i → e_j}. Figure 2.4 shows the past of an event e_j. Let Past_i(e_j) be the set of all those events of Past(e_j) that are on process p_i. Clearly, Past_i(e_j) is a totally ordered set, ordered by the relation →_i, whose maximal element is denoted by max(Past_i(e_j)). Obviously, max(Past_i(e_j)) is the latest event at process p_i that affected event e_j (see Figure 2.4). Note that max(Past_i(e_j)) is always a message send event. Let Max_Past(e_j) = ∪_i {max(Past_i(e_j))}. Max_Past(e_j) consists of the latest event at every process that affected event e_j and is referred to as the surface of the past cone of e_j [6]. Note that Max_Past(e_j) is a consistent cut [7]. Past(e_j) represents all events on the past light cone that affect e_j. The future of an event is defined similarly to its past. The future of an event e_j, denoted by Future(e_j), contains all events e_i that are causally affected by e_j (see Figure 2.4). In a computation (H, →), Future(e_j) is defined as Future(e_j) = {e_i ∈ H | e_j → e_i}.
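Since the past and future cones are simply reachability sets over the causality relation, they can be computed directly from the causal edges of the space-time diagram (program order plus send-receive pairs). The sketch below is illustrative only; the graph encoding, event names, and function names are our assumptions, not the book's notation.

```python
# Minimal sketch (illustrative): computing Past(e) and Future(e) by graph
# reachability over the causality relation. The causal edges are the
# program-order successors plus the send -> receive message edges.

from collections import defaultdict

def cones(events, edges, e):
    """events: iterable of event ids; edges: (a, b) pairs where a -> b is a
       direct causal edge; e: the event whose cones are wanted."""
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
        pred[b].add(a)

    def reach(start, adj):
        seen, stack = set(), [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    past, future = reach(e, pred), reach(e, succ)
    concurrent = set(events) - past - future - {e}   # H minus the two cones
    return past, future, concurrent

# Two processes: e11 -> e12 on p1, e21 -> e22 on p2, and a message e11 -> e22.
events = ["e11", "e12", "e21", "e22"]
edges = [("e11", "e12"), ("e21", "e22"), ("e11", "e22")]
print(cones(events, edges, "e22"))   # past = {e11, e21}, future = {}, concurrent = {e12}
```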
• 64. Figure 2.4: Illustration of past and future cones in a distributed computation. Likewise, we can define Future_i(e_j) as the set of those events of Future(e_j) that are on process p_i, and min(Future_i(e_j)) as the first event on process p_i that is affected by e_j. Note that min(Future_i(e_j)) is always a message receive event. Likewise, Min_Future(e_j), defined as ∪_i {min(Future_i(e_j))}, consists of the first event at every process that is causally affected by event e_j and is referred to as the surface of the future cone of e_j [6]. It denotes a consistent cut in the computation [7]. Future(e_j) represents all events on the future light cone that are affected by e_j. It is obvious that all events at a process p_i that occurred after max(Past_i(e_j)) but before min(Future_i(e_j)) are concurrent with e_j. Therefore, all and only those events of computation H that belong to the set “H − Past(e_j) − Future(e_j)” are concurrent with event e_j. 2.7 Models of Process Communications There are two basic models of process communications [8] – synchronous and asynchronous. The synchronous communication model is a blocking type: on a message send, the sender process blocks until the message has been received by the receiver process. The sender process resumes execution only after it learns that the receiver process has accepted the message. Thus, the sender and the receiver processes must synchronize to exchange a message. On the other hand, the asynchronous communication model is a non-blocking type, where the sender and the receiver do not synchronize to exchange a message. After having sent a message, the sender process does not wait for the message to be delivered to the receiver process. The message is buffered by the system and is delivered to the receiver process when it is ready to accept the message. A buffer overflow may occur if a process sends a large number of messages in a burst to another process. Neither of the communication models is superior to the other. Asynchronous communication provides higher parallelism because the sender process can execute while the message is in transit
• 65. to the receiver. However, an implementation of asynchronous communication requires more complex buffer management. In addition, due to the higher degree of parallelism and non-determinism, it is much more difficult to design, verify, and implement distributed algorithms for asynchronous communication. The state space of such algorithms is likely to be much larger. Synchronous communication is simpler to handle and implement. However, due to frequent blocking, it is likely to have poor performance and is likely to be more prone to deadlocks.
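To make the contrast between the two models concrete, the following sketch models both kinds of channels with Python's standard library. The class names are ours, not the book's, and the synchronous variant assumes a single sender and a single receiver.

```python
# Minimal sketch (illustrative) of the two communication models. An
# asynchronous send just enqueues the message into a system buffer and returns;
# a synchronous send blocks until the receiver has actually taken the message
# out (a rendezvous). Assumes one sender and one receiver per channel.

import queue
import threading

class AsyncChannel:
    def __init__(self, capacity=1024):
        self.buf = queue.Queue(maxsize=capacity)   # bounded buffer: overflow is possible
    def send(self, msg):
        self.buf.put(msg)                          # returns immediately unless the buffer is full
    def recv(self):
        return self.buf.get()

class SyncChannel:
    def __init__(self):
        self.buf = queue.Queue(maxsize=1)
        self.accepted = threading.Event()
    def send(self, msg):
        self.accepted.clear()
        self.buf.put(msg)
        self.accepted.wait()                       # block until the receiver accepts the message
    def recv(self):
        msg = self.buf.get()
        self.accepted.set()
        return msg

# Usage sketch: the synchronous sender resumes only after the receiver accepts.
ch = SyncChannel()
t = threading.Thread(target=lambda: print("received", ch.recv()))
t.start()
ch.send("hello")    # blocks until the receiver thread has taken the message
t.join()
```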
• 66. Bibliography [1] K. Birman, T. Joseph, Reliable Communication in the Presence of Failures, ACM Transactions on Computer Systems, 47-76, 3, 1987. [2] K.M. Chandy, L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, 63-75, 3(1), 1985. [3] A. Kshemkalyani, M. Raynal, and M. Singhal, “Global Snapshots of a Distributed System”, Distributed Systems Engineering Journal, Vol 2, No 4, December 1995, pp. 224-233. [4] L. Lamport, Time, Clocks and the Ordering of Events in a Distributed System, Comm. ACM, Vol. 21, July 1978, pp. 558-564. [5] A. Lynch, Distributed Processing Solves Main-frame Problems, Data Communications, pp. 17-22, December 1976. [6] F. Mattern, Virtual Time and Global States of Distributed Systems, Proc. “Parallel and Distributed Algorithms” Conf. (Cosnard, Quinton, Raynal, Robert, Eds.), North-Holland, 1988, pp. 215-226. [7] P. Panangaden and K. Taylor, Concurrent Common Knowledge: A New Definition of Agreement for Asynchronous Events, Proc. of the 5th Symp. on Principles of Distributed Computing, pp. 197-209, 1988. [8] Sol M. Shatz, Communication Mechanisms for Programming Distributed Systems, IEEE Computer, June 1984, pp. 21-28.
• 67. Chapter 3 Logical Time 3.1 Introduction The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems. Usually causality is tracked using physical time. However, in distributed systems, it is not possible to have global physical time; it is possible to realize only an approximation of it. As asynchronous distributed computations make progress in spurts, it turns out that logical time, which advances in jumps, is sufficient to capture the fundamental monotonicity property associated with causality in distributed systems. This chapter discusses three ways to implement logical time (namely, scalar time, vector time, and matrix time) that have been proposed to capture causality between events of a distributed computation. Causality (or the causal precedence relation) among events in a distributed system is a powerful concept in reasoning, analyzing, and drawing inferences about a computation. The knowledge of the causal precedence relation among the events of processes helps solve a variety of problems in distributed systems. Examples of some of these problems are as follows: • Distributed algorithm design: The knowledge of the causal precedence relation among events helps ensure liveness and fairness in mutual exclusion algorithms, helps maintain consistency in replicated databases, and helps design correct deadlock detection algorithms to avoid phantom and undetected deadlocks. • Tracking of dependent events: In distributed debugging, the knowledge of the causal dependency among events helps construct a consistent state for resuming reexecution; in failure recovery, it helps build a checkpoint; in replicated databases, it aids in the detection of file inconsistencies in case of a network partitioning. • Knowledge about the progress: The knowledge of the causal dependency among events helps measure the progress of processes in the distributed computation. This is useful in discarding obsolete information, garbage collection, and termination detection. • Concurrency measure: The knowledge of how many events are causally dependent is useful in measuring the amount of concurrency in a computation. All events that are not causally related can be executed concurrently. Thus, an analysis of the causality in a computation gives an idea of the concurrency in the program.
• 68. The concept of causality is widely used by human beings, often unconsciously, in planning, scheduling, and execution of a chore or an enterprise, in determining the infeasibility of a plan or the innocence of an accused. In day-to-day life, the global time used to deduce the causality relation is obtained from loosely synchronized clocks (i.e., wrist watches, wall clocks). However, in distributed computing systems, the rate of occurrence of events is several orders of magnitude higher and the event execution time is several orders of magnitude smaller. Consequently, if the physical clocks are not precisely synchronized, the causality relation between events may not be accurately captured. Network time protocols [18], which can maintain time accurate to a few tens of milliseconds on the Internet, are not adequate to capture the causality relation in distributed systems. However, in a distributed computation, generally the progress is made in spurts and the interaction between processes occurs in spurts. Consequently, it turns out that in a distributed computation, the causality relation between events produced by a program execution and its fundamental monotonicity property can be accurately captured by logical clocks. In a system of logical clocks, every process has a logical clock that is advanced using a set of rules. Every event is assigned a timestamp and the causality relation between events can generally be inferred from their timestamps. The timestamps assigned to events obey the fundamental monotonicity property; that is, if an event a causally affects an event b, then the timestamp of a is smaller than the timestamp of b. This chapter first presents a general framework of a system of logical clocks in distributed systems and then discusses three ways to implement logical time in a distributed system. In the first method, Lamport’s scalar clocks, the time is represented by non-negative integers; in the second method, the time is represented by a vector of non-negative integers; in the third method, the time is represented as a matrix of non-negative integers. We also discuss methods for efficient implementation of systems of vector clocks. The chapter ends with a discussion of virtual time, its implementation using the time-warp mechanism, and a brief discussion of physical clock synchronization and the Network Time Protocol. 3.2 A Framework for a System of Logical Clocks 3.2.1 Definition A system of logical clocks consists of a time domain T and a logical clock C [23]. Elements of T form a partially ordered set over a relation <. This relation is usually called happened before or causal precedence. Intuitively, this relation is analogous to the earlier than relation provided by physical time. The logical clock C is a function that maps an event e in a distributed system to an element in the time domain T, denoted as C(e) and called the timestamp of e, and is defined as follows: C : H → T such that the following property is satisfied: for two events e_i and e_j, e_i → e_j ⇒ C(e_i) < C(e_j). This monotonicity property is called the clock consistency condition. When T and C satisfy the following condition,
• 69. for two events e_i and e_j, e_i → e_j ⇔ C(e_i) < C(e_j), the system of clocks is said to be strongly consistent. 3.2.2 Implementing Logical Clocks Implementation of logical clocks requires addressing two issues [23]: data structures local to every process to represent logical time, and a protocol (set of rules) to update the data structures to ensure the consistency condition. Each process p_i maintains data structures that allow it the following two capabilities: • A local logical clock, denoted by lc_i, that helps process p_i measure its own progress. • A logical global clock, denoted by gc_i, that is a representation of process p_i’s local view of the logical global time. It allows this process to assign consistent timestamps to its local events. Typically, lc_i is a part of gc_i. The protocol ensures that a process’s logical clock, and thus its view of the global time, is managed consistently. The protocol consists of the following two rules: • R1: This rule governs how the local logical clock is updated by a process when it executes an event (send, receive, or internal). • R2: This rule governs how a process updates its global logical clock to update its view of the global time and global progress. It dictates what information about the logical time is piggybacked in a message and how this information is used by the receiving process to update its view of the global time. Systems of logical clocks differ in their representation of logical time and also in the protocol to update the logical clocks. However, all logical clock systems implement rules R1 and R2 and consequently ensure the fundamental monotonicity property associated with causality. Moreover, each particular logical clock system provides its users with some additional properties. 3.3 Scalar Time 3.3.1 Definition The scalar time representation was proposed by Lamport in 1978 [11] as an attempt to totally order events in a distributed system. The time domain in this representation is the set of non-negative integers. The logical local clock of a process p_i and its local view of the global time are squashed into one integer variable C_i. Rules R1 and R2 to update the clocks are as follows: • R1: Before executing an event (send, receive, or internal), process p_i executes the following: C_i := C_i + d (d > 0)
• 70. In general, every time R1 is executed, d can have a different value, and this value may be application-dependent. However, typically d is kept at 1 because this is able to identify the time of each event uniquely at a process, while keeping the rate of increase of the clock values to its lowest level. • R2: Each message piggybacks the clock value of its sender at sending time. When a process p_i receives a message with timestamp C_msg, it executes the following actions: – C_i := max(C_i, C_msg) – Execute R1. – Deliver the message. Figure 3.1 shows the evolution of scalar time with d=1. Figure 3.1: Evolution of scalar time. 3.3.2 Basic Properties Consistency Property Clearly, scalar clocks satisfy the monotonicity and hence the consistency property: for two events e_i and e_j, e_i → e_j ⇒ C(e_i) < C(e_j). Total Ordering Scalar clocks can be used to totally order events in a distributed system [11]. The main problem in totally ordering events is that two or more events at different processes may have identical timestamps. (Note that for two events e1 and e2, C(e1) = C(e2) ⇒ e1 ∥ e2.) For example, in Figure 3.1, the third event of process P1 and the second event of process P2 have identical scalar timestamps. Thus, a tie-breaking mechanism is needed to order such events. Typically, a tie is broken as follows: process identifiers are linearly ordered and a tie among events with identical scalar timestamps is broken on the basis of their process identifiers. The lower the process identifier in the ranking, the higher the priority.
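Rules R1 and R2 with d = 1, together with the (t, i) tie-breaking total order defined in the next paragraph, can be captured in a few lines. The sketch below is illustrative only; the class and method names are ours, not the book's.

```python
# Minimal sketch (names are ours) of Lamport's scalar clock: rules R1 and R2
# with d = 1, plus the (t, i) tie-breaking total order described next.

class ScalarClock:
    def __init__(self, pid):
        self.pid = pid
        self.c = 0

    def local_event(self):           # R1
        self.c += 1
        return (self.c, self.pid)    # timestamp of the event

    def send_event(self):            # R1, then piggyback C_i on the outgoing message
        self.c += 1
        return self.c

    def receive_event(self, c_msg):  # R2: take the max, execute R1, then deliver
        self.c = max(self.c, c_msg)
        self.c += 1
        return (self.c, self.pid)

def precedes(ts1, ts2):
    """Total order on timestamps (t, i): compare times, break ties by process id."""
    return ts1 < ts2                 # Python tuple comparison is (h < k) or (h = k and i < j)

# Example: p2 receives a message stamped 2 from p1 while its own clock is 1.
p2 = ScalarClock(pid=2)
p2.local_event()                     # clock becomes 1
print(p2.receive_event(c_msg=2))     # clock becomes max(1, 2) + 1 = 3 -> (3, 2)
```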
• 71. The timestamp of an event is denoted by a tuple (t, i), where t is its time of occurrence and i is the identity of the process where it occurred. The total order relation ≺ on two events x and y with timestamps (h,i) and (k,j), respectively, is defined as follows: x ≺ y ⇔ (h < k or (h = k and i < j)). Since events that occur at the same logical scalar time are independent (i.e., they are not causally related), they can be ordered using any arbitrary criterion without violating the causality relation →. Therefore, a total order is consistent with the causality relation “→”. Note that x ≺ y ⇒ (x → y ∨ x ∥ y). A total order is generally used to ensure liveness properties in distributed algorithms. Requests are timestamped and served according to the total order based on these timestamps [11]. Event Counting If the increment value d is always 1, the scalar time has the following interesting property: if event e has a timestamp h, then h−1 represents the minimum logical duration, counted in units of events, required before producing the event e [6]; we call it the height of the event e. In other words, h−1 events have been produced sequentially before the event e regardless of the processes that produced these events. For example, in Figure 3.1, five events precede event b on the longest causal path ending at b. No Strong Consistency The system of scalar clocks is not strongly consistent; that is, for two events e_i and e_j, C(e_i) < C(e_j) ⇏ e_i → e_j. For example, in Figure 3.1, the third event of process P1 has a smaller scalar timestamp than the third event of process P2. However, the former did not happen before the latter. The reason that scalar clocks are not strongly consistent is that the logical local clock and logical global clock of a process are squashed into one, resulting in the loss of causal dependency information among events at different processes. For example, in Figure 3.1, when process P2 receives the first message from process P1, it updates its clock to 3, forgetting that the timestamp of the latest event at P1 on which it depends is 2. 3.4 Vector Time 3.4.1 Definition The system of vector clocks was developed independently by Fidge [6], Mattern [14] and Schmuck [27]. In the system of vector clocks, the time domain is represented by a set of n-dimensional non-negative integer vectors. Each process p_i maintains a vector vt_i[1..n], where vt_i[i] is the local logical clock of p_i and describes the logical time progress at process p_i. vt_i[j] represents process p_i’s latest knowledge of process p_j’s local time. If vt_i[j] = x, then process p_i knows that the local time at process p_j has progressed up to x. The entire vector vt_i constitutes p_i’s view of the global logical time and is used to timestamp events. Process p_i uses the following two rules R1 and R2 to update its clock:
• 72. • R1: Before executing an event, process p_i updates its local logical time as follows: vt_i[i] := vt_i[i] + d (d > 0) • R2: Each message m is piggybacked with the vector clock vt of the sender process at sending time. On the receipt of such a message (m, vt), process p_i executes the following sequence of actions: – Update its global logical time as follows: ∀k, 1 ≤ k ≤ n : vt_i[k] := max(vt_i[k], vt[k]) – Execute R1. – Deliver the message m. The timestamp associated with an event is the value of the vector clock of its process when the event is executed. Figure 3.2 shows an example of the progress of vector clocks with the increment value d=1. Initially, a vector clock is [0, 0, 0, . . . , 0]. The following relations are defined to compare two vector timestamps, vh and vk: vh = vk ⇔ ∀x : vh[x] = vk[x]; vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]; vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]; vh ∥ vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh). Figure 3.2: Evolution of vector time.
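A direct transcription of rules R1 and R2 for vector clocks, again with d = 1, is sketched below. The class and method names are ours, and processes are indexed from 0 for convenience; this is an illustrative sketch, not the book's code.

```python
# Minimal sketch (illustrative) of the vector clock rules R1 and R2 with d = 1
# for a system of n processes, indexed 0..n-1.

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n

    def tick(self):                     # R1: advance the local component
        self.vt[self.pid] += 1

    def send(self):                     # R1, then piggyback a copy of vt on the message
        self.tick()
        return list(self.vt)

    def receive(self, vt_msg):          # R2: component-wise max, then R1, then deliver
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.tick()
        return list(self.vt)            # timestamp assigned to the receive event

# Example with n = 3: p0 sends a message to p1.
p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
m = p0.send()                           # p0's clock: [1, 0, 0]
print(p1.receive(m))                    # p1's clock: [1, 1, 0]
```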
• 73. 3.4.2 Basic Properties Isomorphism Recall that relation “→” induces a partial order on the set of events that are produced by a distributed execution. If events in a distributed system are timestamped using a system of vector clocks, we have the following property. If two events x and y have timestamps vh and vk, respectively, then x → y ⇔ vh < vk and x ∥ y ⇔ vh ∥ vk. Thus, there is an isomorphism between the set of partially ordered events produced by a distributed computation and their vector timestamps. This is a very powerful, useful, and interesting property of vector clocks. If the process at which an event occurred is known, the test to compare two timestamps can be simplified as follows: if events x and y respectively occurred at processes p_i and p_j and are assigned timestamps vh and vk, respectively, then x → y ⇔ vh[i] ≤ vk[i] and x ∥ y ⇔ (vh[i] > vk[i]) ∧ (vh[j] < vk[j]). Strong Consistency The system of vector clocks is strongly consistent; thus, by examining the vector timestamps of two events, we can determine if the events are causally related. However, Charron-Bost showed that the dimension of vector clocks cannot be less than n, the total number of processes in the distributed computation, for this property to hold [4]. Event Counting If d is always 1 in rule R1, then the i-th component of the vector clock at process p_i, vt_i[i], denotes the number of events that have occurred at p_i until that instant. So, if an event e has timestamp vh, then vh[j] denotes the number of events executed by process p_j that causally precede e. Clearly, Σ_j vh[j] − 1 represents the total number of events that causally precede e in the distributed computation. Applications Since vector time tracks causal dependencies exactly, it finds a wide variety of applications. For example, vector clocks are used in distributed debugging, implementations of causal ordering communication and causal distributed shared memory, establishment of global breakpoints, and in determining the consistency of checkpoints in optimistic recovery.
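The comparison relations and the simplified causality test can be written down almost verbatim. The following sketch is illustrative; the function names are our assumptions, and the simplified test presumes the two events are distinct and their originating processes are known.

```python
# Minimal sketch (illustrative) of comparing vector timestamps: the general
# happened-before test, the concurrency test, and the simplified test used
# when the originating processes of the events are known.

def vt_less(vh, vk):
    """vh < vk: component-wise <= and strictly smaller in at least one component."""
    return all(a <= b for a, b in zip(vh, vk)) and any(a < b for a, b in zip(vh, vk))

def happened_before(vh, vk):
    return vt_less(vh, vk)              # x -> y  iff  vh < vk  (isomorphism property)

def concurrent(vh, vk):
    return not vt_less(vh, vk) and not vt_less(vk, vh)

def happened_before_known(vh, vk, i):
    """Simplified test for distinct events, x at process i: x -> y iff vh[i] <= vk[i]."""
    return vh[i] <= vk[i]

print(happened_before([1, 0, 0], [1, 1, 0]))   # True
print(concurrent([1, 0, 0], [0, 1, 0]))        # True
```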
• 74. A Brief Historical Perspective of Vector Clocks Although the theory associated with vector clocks was first developed in 1988 independently by Fidge and Mattern, vector clocks were informally introduced and used by several researchers earlier. Parker et al. [20] used a rudimentary vector clock system to detect inconsistencies of replicated files due to network partitioning. Liskov and Ladin [13] proposed a vector clock system to define highly available distributed services. A similar system of clocks was used by Strom and Yemini [30] to keep track of the causal dependencies between events in their optimistic recovery algorithm, and by Raynal to prevent drift between logical clocks [21]. Singhal [28] used vector clocks coupled with a boolean vector to determine the currency of a critical section execution request by detecting the causality relation between a critical section request and its execution. 3.4.3 On the Size of Vector Clocks An important question to ask is whether vector clocks of size n are necessary in a computation consisting of n processes. To answer this, we examine the usage of vector clocks. • A vector clock provides the latest known local time at each other process. If this information in the clock is to be used to explicitly track the progress at every other process, then a vector clock of size n is necessary. • A popular use of vector clocks is to determine the causality between a pair of events. Given any events e and f, the test e ≺ f if and only if T(e) < T(f) requires a comparison of the vector clocks of e and f. Although it appears that a clock of size n is necessary, that is not quite accurate. It can be shown that a size equal to the dimension of the partial order (E, ≺) is necessary, where the upper bound on this dimension is n. This is explained below. To understand this result on the size of clocks for determining causality between a pair of events, we first introduce some definitions. A linear extension of a partial order (E, ≺) is a linear ordering of E that is consistent with the partial order, i.e., if two events are ordered in the partial order, they are also ordered in the linear order. A linear extension can be viewed as projecting all the events from the different processes onto a single time axis. However, the linear order will necessarily introduce an ordering between each pair of events, and some of these orderings are not in the partial order. Also observe that different linear extensions are possible in general. Let P denote the set of tuples in the partial order defined by the causality relation; so there is a tuple (e, f) in P for each pair of events e and f such that e ≺ f. Let L1, L2, . . . denote the sets of tuples in different linear extensions of this partial order. The set P is contained in the set obtained by taking the intersection of any such collection of linear extensions L1, L2, . . .. This is because each Li must contain all the tuples, i.e., causality dependencies, that are in P. The dimension of a partial order is the minimum number of linear extensions whose intersection gives exactly the partial order. Consider a client-server interaction between a pair of processes. Queries to the server and responses to the client occur in strict alternating sequences. Although n = 2, all the events are strictly ordered, and there is only one linear order of all the events that is consistent with the “partial” order. Hence the dimension of this “partial order” is 1.
A scalar clock such as one implemented by Lamport’s scalar clock rules is adequate to determine e ≺ f for any events e and f in this execution.
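As a small illustration (ours, with hypothetical event names), the sketch below assigns Lamport timestamps along the single possible order of events in such a client-server exchange. Because the execution is totally ordered, comparing the scalar timestamps decides precedence for every pair of events; this is exactly what breaks down once concurrent events appear, as in the execution considered next.

```python
# Minimal self-contained sketch (illustrative): in the strictly alternating
# client-server execution every pair of events is causally ordered, so a single
# scalar timestamp per event already decides e < f for any two events e and f.

# Lamport clocks along the only possible order of events (d = 1):
# the client sends query q1, the server receives it and sends reply r1, and so on.
events = ["send(q1)", "recv(q1)", "send(r1)", "recv(r1)", "send(q2)", "recv(q2)"]
scalar = {e: t for t, e in enumerate(events, start=1)}

def precedes(e, f):
    # Valid here precisely because the execution is totally ordered (dimension 1);
    # with concurrent events (e.g., two crossing sends) a scalar cannot decide this.
    return scalar[e] < scalar[f]

print(precedes("send(q1)", "recv(r1)"))   # True
print(precedes("send(r1)", "recv(q1)"))   # False
```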
• 75. Now consider an execution on processes P1 and P2 such that each sends a message to the other before receiving the other’s message. The two send events are concurrent, as are the two receive events. To determine the causality between the send events or between the receive events, it is not sufficient to use a single integer; a vector clock of size n = 2 is necessary. This execution exhibits the graphical property called a crown, wherein there are some messages m_0, . . . , m_{n−1} such that Send(m_i) ≺ Receive(m_{(i+1) mod n}) for all i from 0 to n − 1. A crown of n messages has dimension n. We introduced the notion of crown and studied its properties in Chapter 3. Figure 3.3: Example illustrating the dimension of an execution (E, ≺); part (ii) shows two linear extensions of the execution in part (i), ⟨c, e, f, a, b, d, g, h, i, j⟩ and ⟨a, b, c, d, g, h, i, e, j, f⟩, and the range of occurrence of events c, e, and f. For n = 4 processes, the dimension is 2. For a complex execution, it is not straightforward to determine the dimension of the partial order. Figure 3.3 shows an execution involving four processes. However, the dimension of this partial order is two. To see this informally, consider the longest chain ⟨a, b, d, g, h, i, j⟩. There are events outside this chain that can yield multiple linear extensions. Hence, the dimension is more than 1. The right side of Figure 3.3 shows the earliest possible and the latest possible occurrences of the events not in this chain, with respect to the events in this chain. Let L1 be ⟨c, e, f, a, b, d, g, h, i, j⟩, which contains the following tuples that are not in P: (c, a), (c, b), (c, d), (c, g), (c, h), (c, i), (c, j), (e, a), (e, b), (e, d), (e, g), (e, h), (e, i), (e, j), (f, a), (f, b), (f, d), (f, g), (f, h), (f, i), (f, j). Let L2 be ⟨a, b, c, d, g, h, i, e, j, f⟩, which contains the following tuples not in P: (a, c), (b, c), (c, d), (c, g), (c, h), (c, i), (c, j), (a, e), (b, e), (d, e), (g, e), (h, e), (i, e), (e, j), (a, f