PhD Thesis

  • Rolando da Silva Martins. On the Integration of Real-Time and Fault-Tolerance in P2P Middleware. Departamento de Ciência de Computadores, Faculdade de Ciências da Universidade do Porto, 2012
  • Rolando da Silva Martins. On the Integration of Real-Time and Fault-Tolerance in P2P Middleware. Thesis submitted to the Faculdade de Ciências da Universidade do Porto for the degree of Doctor in Computer Science. Advisors: Prof. Fernando Silva and Prof. Luís Lopes. Departamento de Ciência de Computadores, Faculdade de Ciências da Universidade do Porto, May 2012
  • To my wife Liliana, for her endless love, support, and encouragement.
  • Acknowledgments

    "Imagination is everything. It is the preview of life's coming attractions." (Albert Einstein)

    To my soul-mate Liliana, for her endless support in the best and worst of times. Her unconditional love and support helped me to overcome the most daunting adversities and challenges.

    I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva and Paulo Paixão, for their vision and support that allowed me to pursue this Ph.D.

    I would like to acknowledge the financial support from EFACEC, Sistemas de Engenharia, S.A. and FCT - Fundação para a Ciência e Tecnologia, with Ph.D. grant SFRH/BDE/15644/2006.

    I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva, for their endless effort and teaching over the past four years. Luís, thank you for steering me when my mind entered a code frenzy, and for teaching me how to put my thoughts into words. Fernando, your keen eye is always able to understand the "big picture"; this was vital to detect and prevent the pitfalls of building large and complex middleware systems. To both, I thank you for opening the door of CRACS to me. I had an incredible time working with you.

    A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor. She opened the door of CMU to me and helped to shape my work at crucial stages. Priya, I had a fantastic time mind-storming with you; each time I managed to learn something new and exciting. Thank you for sharing with me your insights on MEAD's architecture, and your knowledge of fault-tolerance and real-time.

    Luís, Fernando and Priya, I hope someday to be able to repay your generosity and friendship. It is inspirational to see your passion for your work, and your continuous effort to help others.

    I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and functionalities of MapReduce, and Professor Alysson Bessani, for his thoughts on my work and for his insights on byzantine failures and consensus protocols.

    I also would like to thank CRACS members, Professors Ricardo Rocha, Eduardo Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work.

    A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.
  • Abstract

    "All is worthwhile if the soul is not small." (Fernando Pessoa)

    The development and management of large-scale information systems, such as high-speed transportation networks, are pushing the limits of the current state-of-the-art in middleware frameworks. These systems are not only subject to hardware failures, but also impose stringent constraints on the software used for management and therefore on the underlying middleware framework. In particular, fulfilling the Quality-of-Service (QoS) demands of services in such systems requires simultaneous run-time support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that remains a challenge for current middleware frameworks. Fault-tolerance support is usually introduced in the form of expensive high-level services arranged in a client-server architecture. This approach is inadequate if one wishes to support real-time tasks due to the expensive cross-layer communication and resource consumption involved.

    In this thesis we design and implement Stheno, a general purpose P2P middleware architecture. Stheno innovates by integrating both FT and soft-RT in the architecture, by: (a) implementing FT support at a much lower level in the middleware on top of a suitable network abstraction; (b) using the peer-to-peer mesh services to support FT; and (c) supporting real-time services through a QoS daemon that manages the underlying kernel-level resource reservation infrastructure (CPU time), while simultaneously (d) providing support for multi-core computing and traffic demultiplexing. Stheno is able to minimize resource consumption and latencies from FT mechanisms and allows RT services to perform within QoS limits.

    Stheno has a service-oriented architecture that does not limit the type of service that can be deployed in the middleware. Current middleware systems, in contrast, do not provide a flexible service framework, as their architecture is normally designed to support a specific application domain, for example, the Remote Procedure Call (RPC) service. Stheno is able to transparently deploy a new service within the infrastructure without user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable node to deploy the service with the specified level of QoS limits.

    We thoroughly evaluate Stheno, namely the major overlay mechanisms, such as membership, discovery and service deployment, and the impact of FT over RT, with and without resource reservation, and compare it with other closely related middleware frameworks. Results showed that Stheno is able to sustain RT performance while simultaneously providing FT support. The performance of the resource reservation infrastructure enabled Stheno to maintain this behavior even under heavy load.
  • Acronyms

    API     Application Programming Interface
    BFT     Byzantine Fault-Tolerance
    CCM     CORBA Component Model
    CID     Cell Identifier
    CORBA   Common Object Request Broker Architecture
    COTS    Commercial Off-The-Shelf
    DBMS    Database Management Systems
    DDS     Data Distribution Service
    DHT     Distributed Hash Table
    DOC     Distributed Object Computing
    DRE     Distributed Real-Time and Embedded
    DSMS    Data Stream Management Systems
    EDF     Earliest Deadline First
    EM/EC   Execution Model/Execution Context
    FT      Fault-Tolerance
    IDL     Interface Description Language
    IID     Instance Identifier
    IPC     Inter-Process Communication
    IaaS    Infrastructure as a Service
    J2SE    Java 2 Standard Edition
    JMS     Java Message Service
    JRTS    Java Real-Time System
    JVM     Java Virtual Machine
    JeOS    Just Enough Operating System
    KVM     Kernel-based Virtual Machine
    LFU     Least Frequently Used
    LRU     Least Recently Used
    LwCCM   Lightweight CORBA Component Model
    MOM     Message-Oriented Middleware
    NSIS    Next Steps in Signaling
    OID     Object Identifier
    OMA     Object Management Architecture
    OS      Operating Systems
    PID     Peer Identifier
    POSIX   Portable Operating System Interface
    PoL     Place of Launch
    QoS     Quality-of-Service
    RGID    Replication Group Identifier
    RMI     Remote Method Invocation
    RPC     Remote Procedure Call
    RSVP    Resource Reservation Protocol
    RTSJ    Real-Time Specification for Java
    RT      Real-Time
    SAP     Service Access Point
    SID     Service Identifier
    SLA     Service Level Agreement
    SSD     Solid State Disk
    TDMA    Time Division Multiple Access
    TSS     Thread-Specific Storage
    UUID    Universally Unique Identifier
    VM      Virtual Machine
    VoD     Video on Demand
  • Contents

    Acknowledgments
    Abstract
    Acronyms
    List of Tables
    List of Figures
    List of Algorithms
    List of Listings
    1 Introduction
        1.1 Motivation
        1.2 Challenges and Opportunities
        1.3 Problem Definition
        1.4 Assumptions and Non-Goals
        1.5 Contributions
        1.6 Thesis Outline
    2 Overview of Related Work
        2.1 Overview
        2.2 RT+FT Middleware Systems
            2.2.1 Special Purpose RT+FT Systems
            2.2.2 CORBA-based Real-Time Fault-Tolerant Systems
        2.3 P2P+RT Middleware Systems
            2.3.1 Streaming
            2.3.2 QoS-Aware P2P
        2.4 P2P+FT Middleware Systems
            2.4.1 Publish-subscribe
            2.4.2 Resource Computing
            2.4.3 Storage
        2.5 P2P+RT+FT Middleware Systems
        2.6 A Closer Look at TAO, MEAD and ICE
            2.6.1 TAO
            2.6.2 MEAD
            2.6.3 ICE
        2.7 Summary
    3 Architecture
        3.1 Stheno's System Architecture
            3.1.1 Application and Services
            3.1.2 Core
            3.1.3 P2P Overlay and FT Configuration
            3.1.4 Support Framework
            3.1.5 Operating System Interface
        3.2 Programming Model
            3.2.1 Runtime Interface
            3.2.2 Overlay Interface
            3.2.3 Core Interface
        3.3 Fundamental Runtime Operations
            3.3.1 Runtime Creation and Bootstrapping
            3.3.2 Service Infrastructure
            3.3.3 Client Mechanisms
        3.4 Summary
    4 Implementation
        4.1 Overlay Implementation
            4.1.1 Overlay Bootstrap
            4.1.2 Mesh Service
            4.1.3 Discovery Service
            4.1.4 Fault-Tolerance Service
        4.2 Implementation of Services
            4.2.1 Remote Procedure Call
            4.2.2 Actuator
            4.2.3 Streaming
        4.3 Support for Multi-Core Computing
            4.3.1 Object-Based Interactions
            4.3.2 CPU Partitioning
            4.3.3 Threading Strategies
            4.3.4 An Execution Model for Multi-Core Computing
        4.4 Runtime Bootstrap Parameters
        4.5 Summary
    5 Evaluation
        5.1 Evaluation Setup
            5.1.1 Physical Infrastructure
            5.1.2 Overlay Setup
        5.2 Benchmarks
            5.2.1 Overlay Benchmarks
            5.2.2 Services Benchmarks
            5.2.3 Load Generator
        5.3 Overlay Evaluation
            5.3.1 Membership Performance
            5.3.2 Query Performance
            5.3.3 Service Deployment Performance
        5.4 Services Evaluation
            5.4.1 Impact of Fault-Tolerance Mechanisms in Service Latency
            5.4.2 Real-Time and Resource Reservation Evaluation
        5.5 Summary
    6 Conclusions and Future Work
        6.1 Conclusions
        6.2 Future Work
        6.3 Personal Notes
    References
  • List of Tables

    4.1 Runtime and overlay parameters.
  • List of Figures

    1.1 Oporto's light-train network.
    2.1 Middleware system classes.
    2.2 TAO's architectural layout.
    2.3 FLARe's architectural layout.
    2.4 MEAD's architectural layout.
    3.1 Stheno overview.
    3.2 Application Layer.
    3.3 Stheno's organization overview.
    3.4 Core Layer.
    3.5 QoS Infrastructure.
    3.6 Overlay Layer.
    3.7 Examples of mesh topologies.
    3.8 Querying in different topologies.
    3.9 Support framework layer.
    3.10 QoS daemon resource distribution layout.
    3.11 End-to-end network reservation.
    3.12 Operating system interface.
    3.13 Interactions between layers.
    3.14 Multiple processes runtime usage.
    3.15 Creating and bootstrapping of a runtime.
    3.16 Local service creation.
    3.17 Finding a suitable deployment site.
    3.18 Remote service creation without fault-tolerance.
    3.19 Remote service creation with fault-tolerance: primary-node side.
    3.20 Remote service creation with fault-tolerance: replica creation.
    3.21 Client creation and bootstrap sequence.
    4.1 The peer-to-peer overlay architecture.
    4.2 The overlay bootstrap.
    4.3 The cell overview.
    4.4 The initial binding process for a new peer.
    4.5 The final join process for a new peer.
    4.6 Overview of the cell group communications.
    4.7 Cell discovery and management entities.
    4.8 Failure handling for non-coordinator (left) and coordinator (right) peers.
    4.9 Cell failure (left) and subsequent mesh tree rebinding (right).
    4.10 Discovery service implementation.
    4.11 Fault-Tolerance service overview.
    4.12 Creation of a replication group.
    4.13 Replication group binding overview.
    4.14 The addition of a new replica to the replication group.
    4.15 The control and data communication groups.
    4.16 Semi-active replication protocol layout.
    4.17 Recovery process within a replication group.
    4.18 RPC service layout.
    4.19 RPC invocation types.
    4.20 RPC service architecture without (left) and with (right) semi-active FT.
    4.21 RPC service with passive replication.
    4.22 Actuator service layout.
    4.23 Actuator service overview.
    4.24 Actuator fault-tolerance support.
    4.25 Streaming service layout.
    4.26 Streaming service architecture.
    4.27 Streaming service with fault-tolerance support.
    4.28 Object-to-Object interactions.
    4.29 Examples of CPU Partitioning.
    4.30 Object-to-Object interactions with different partitions.
    4.31 Threading strategies.
    4.32 End-to-End QoS propagation.
    4.33 RPC service using CPU partitioning on a quad-core processor.
    4.34 Invocation across two distinct partitions.
    4.35 Execution Model Pattern.
    4.36 RPC implementation using the EM/EC pattern.
    5.1 Overlay evaluation setup.
    5.2 Physical evaluation setup.
    5.3 Overview of the overlay benchmarks.
    5.4 Network organization for the service benchmarks.
    5.5 Overlay bind (left) and rebind (right) performance.
    5.6 Overlay query performance.
    5.7 Overlay service deployment performance.
    5.8 Service rebind time (left) and latency (right).
    5.9 Rebind time and latency results with resource reservation.
    5.10 Missed deadlines without (left) and with (right) resource reservation.
    5.11 Invocation latency without (left) and with (right) resource reservation.
    5.12 RPC invocation latency comparing with reference middlewares (without fault-tolerance).
  • List of Algorithms

    4.1 Overlay bootstrap algorithm.
    4.2 Mesh startup.
    4.3 Cell initialization.
    4.4 Cell group communications: receiving-end.
    4.5 Cell group communications: sending-end.
    4.6 Cell Discovery.
    4.7 Cell fault handling.
    4.8 Cell fault handling (continuation).
    4.9 Discovery service.
    4.10 Creation and joining within a replication group.
    4.11 Primary bootstrap within a replication group.
    4.12 Fault-Tolerance resource discovery mechanism.
    4.13 Replica startup.
    4.14 Replica request handling.
    4.15 Support for semi-active replication.
    4.16 Fault detection and recovery.
    4.17 A RPC object implementation.
    4.18 RPC service bootstrap.
    4.19 RPC service implementation.
    4.20 RPC client implementation.
    4.21 Semi-active replication implementation.
    4.22 Service's replication callback.
    4.23 Passive Fault-Tolerance implementation.
    4.24 Actuator service bootstrap.
    4.25 Actuator service implementation.
    4.26 Actuator client implementation.
    4.27 Stream service bootstrap.
    4.28 Stream service implementation.
    4.29 Stream client implementation.
    4.30 Joining an Execution Model.
    4.31 Execution Context stack management.
    4.32 Implementation of the EM/EC pattern in the RPC service.
  • List of Listings

    3.1 Overlay plugin and runtime bootstrap.
    3.2 Transparent service creation.
    3.3 Service creation with explicit and transparent deployments.
    3.4 Service creation with Fault-Tolerance support.
    3.5 Service client creation.
    4.1 A RPC IDL example.
  • 1 Introduction

    "Most of the important things in the world have been accomplished by people who have kept trying when there seemed to be no hope at all." (Dale Carnegie)

    1.1 Motivation

    The development and management of large-scale information systems are pushing the limits of the current state-of-the-art in middleware frameworks. At EFACEC¹, we have to handle a multitude of application domains, including: information systems used to manage public, high-speed transportation networks; automated power management systems to handle smart grids; and power supply systems to monitor power supply units through embedded sensors. Such systems typically transfer large amounts of streaming data; have erratic periods of extreme network activity; are subject to relatively common hardware failures that last for comparatively long periods; and require low jitter and fast response times for safety reasons, for example, in vehicle coordination.

    Target Systems

    The main motivation for this PhD thesis was the need to address the requirements of the public transportation solutions at EFACEC, more specifically, the light-train systems. One such system is deployed in Oporto's light-train network and is composed of 5 lines, 70 stations and approximately 200 sensors (partially illustrated in Figure 1.1). Each station is managed by a computational node, which we designate as a peer, that is responsible for managing all the local audio, video, display panels, and low-level sensors such as track sensors for detecting inbound and outbound trains.

    ¹ EFACEC, the largest Portuguese Group in the field of electricity, with a strong presence in systems engineering, namely in public transportation and energy systems, employs around 3000 people and has a turnover of almost 1000 million euro; it is established in more than 50 countries and exports almost half of its production (c.f.
  • The system supports three types of traffic: normal, for regular operations over the system, such as playing an audio message in a station through an audio codec; critical, medium-priority traffic comprised of urgent events, such as an equipment malfunction notification; and alarms, high-priority traffic that notifies critical events, such as low-level sensor events. Independently of the traffic type (e.g., event, RPC operation), the system requires that any operation be completed within 2 seconds.

    From the point of view of distributed architectures, the current deployments would be best matched with P2P infrastructures that are resilient, that allow resources (e.g., a sensor connected through a serial link to a peer) to be seamlessly mapped to the logical topology, the mesh, and that also provide support for real-time (RT) and fault-tolerant (FT) services. Support for both RT and FT is fundamental to meet system requirements. Moreover, the next generation of light-train solutions requires deployments across cities and regions that can be overwhelmingly large. This introduces the need for a scalable hierarchical abstraction, the cell, that is composed of several peers that cooperate to maintain a portion of the mesh.

    Figure 1.1: Oporto's light-train network.

    1.2 Challenges and Opportunities

    The requirements from our target systems pose a significant number of challenges. The presence of FT mechanisms, especially those using space redundancy [1], introduces the need for multiple copies of the same resource (replicas), and these, in turn, ultimately lead to greater resource consumption.

    FT also introduces overheads in the form of latency, which is another important constraint when dealing with RT systems. When an operation is performed, irrespective of whether it is real-time or not, any state change that it causes must be propagated among the replicas through a replication algorithm, which introduces an additional source of latency. Furthermore, the recovery time, that is, the time that the system needs to recover from a fault, is an additional source of latency for real-time operations. There are well-known replication styles that offer different trade-offs between state consistency and latency.

    Our target systems have different traffic types with distinct deadline requirements that must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet networking) and software (e.g., Linux). This requires that the RT mechanisms leverage the available resources, through resource reservation, while providing different threading strategies that allow different trade-offs between latency and throughput.

    To overcome the overhead introduced by the FT mechanisms, it must be possible to employ a replication algorithm that does not compromise the RT requirements. Replication algorithms that offer a higher degree of consistency introduce a higher level of latency [1, 2] that may be prohibitive for certain traffic types. On the other hand, certain replication algorithms exhibit lower resource consumption and latency at the expense of a longer recovery time, which may also be prohibitive.

    Considering current state-of-the-art research, we see many opportunities to address the previous challenges. One is the use of COTS operating systems, which allow for a faster implementation time, and thus a smaller development cost, while offering the necessary infrastructure to build a new middleware system.

    P2P networks can be used to provide a resilient infrastructure that mirrors the physical deployments of our target systems. Furthermore, different P2P topologies offer different trade-offs between self-healing, resource consumption and latency in end-to-end operations.
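As a rough illustration of the latency sources discussed in this section, the sketch below adds a replication overhead to an operation's service time and checks the result against the 2-second bound required by the target systems, both without faults and in the worst case where the recovery time must also be paid. The profile names and every numeric value are assumptions chosen for illustration; they are not measurements from this thesis.

```python
from dataclasses import dataclass

# Bound stated for the target systems: every operation completes within 2 seconds.
DEADLINE_MS = 2000


@dataclass(frozen=True)
class ReplicationProfile:
    """Latency profile of a replication style (all values are illustrative assumptions)."""
    name: str
    replication_ms: int  # added to every operation to propagate state to the replicas
    recovery_ms: int     # added only when a fault occurs while the operation is in flight


def fault_free_latency(service_ms: int, profile: ReplicationProfile) -> int:
    """Latency of an operation when no fault occurs."""
    return service_ms + profile.replication_ms


def worst_case_latency(service_ms: int, profile: ReplicationProfile) -> int:
    """Latency of an operation that also has to wait for recovery from a fault."""
    return fault_free_latency(service_ms, profile) + profile.recovery_ms


def meets_deadline(latency_ms: int, deadline_ms: int = DEADLINE_MS) -> bool:
    """True when the latency fits within the deadline."""
    return latency_ms <= deadline_ms


# Hypothetical profiles: more consistency costs latency per operation,
# less overhead per operation costs a longer recovery.
HIGH_CONSISTENCY = ReplicationProfile("high-consistency", replication_ms=300, recovery_ms=100)
LOW_OVERHEAD = ReplicationProfile("low-overhead", replication_ms=50, recovery_ms=1500)

if __name__ == "__main__":
    for profile in (HIGH_CONSISTENCY, LOW_OVERHEAD):
        worst = worst_case_latency(600, profile)
        print(profile.name, fault_free_latency(600, profile), worst, meets_deadline(worst))
```

With an assumed service time of 600 ms, the high-consistency profile stays within the bound even when a fault occurs, whereas the low-overhead profile is faster in the fault-free case but exceeds the bound once its longer recovery time is included. This is the trade-off described above, reduced to simple arithmetic.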
    Moreover, by directly implementing FT on the P2P infrastructure we hope to lower resource usage and latency to allow the integration of RT. By using proven replication algorithms [1, 2] that offer well-known trade-offs regarding consistency, resource consumption and latency, we can focus on the actual problem of integrating real-time and fault-tolerance within a P2P infrastructure.

    On the other hand, RT support can be achieved through the implementation of different threading strategies, resource reservation (through Linux's Control Groups) and by avoiding traffic multiplexing through the use of different access points to handle different traffic priorities. Whilst the use of Earliest Deadline First (EDF) scheduling would provide greater RT guarantees, this goal will not be pursued due to the lack of maturity of the current EDF implementations in Linux (our reference COTS operating system). Because we are limited to priority-based scheduling and resource reservation, we can only partially support our goal of providing end-to-end guarantees; more specifically, we enhance our RT guarantees through the use of RT scheduling policies with over-provisioning to ensure that deadlines are met.

    1.3 Problem Definition

    The work presented in this thesis focuses on the integration of Real-Time (RT) and Fault-Tolerance (FT) in a scalable general purpose middleware system. This goal can only be achieved if the following premises are valid: (a) the FT infrastructure cannot interfere with RT behavior, independently of the replication policy; (b) the network model must be able to scale; and (c) ultimately, FT mechanisms need to be efficient and aware of the underlying infrastructure, i.e. network model, operating system and physical environment.

    Our problem definition is a direct consequence of the requirements from our target systems, and it can be summarized with the following question: "Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?"

    In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms in a middleware is fundamental for its successful integration with soft real-time support. Our approach is novel in that it explores peer-to-peer networking as a means to implement generic, transparent, lightweight fault-tolerance support. We do this by directly embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of their scalable, decentralized and resilient nature. For example, peer-to-peer networks readily provide the functionality required to maintain and locate redundant copies of resources.
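To make the last point concrete, the sketch below shows one common way a peer-to-peer overlay can locate redundant copies of a resource: peers and resource keys are hashed onto a ring, and the copies of a resource live on the first few distinct peers found clockwise from its key. This is a generic consistent-hashing illustration and not a description of the overlay used in this thesis; the peer names, the `replica_peers` helper and the choice of SHA-1 are assumptions.

```python
import hashlib
from bisect import bisect_right


def ring_position(name: str) -> int:
    """Map a name to a position on the hash ring (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def replica_peers(resource: str, peers: list[str], copies: int) -> list[str]:
    """Return the peers holding `resource`: the first `copies` peers clockwise from its key."""
    ring = sorted(set(peers), key=ring_position)
    if not ring:
        return []
    positions = [ring_position(peer) for peer in ring]
    # Index of the first peer strictly after the resource key, wrapping around the ring.
    start = bisect_right(positions, ring_position(resource)) % len(ring)
    count = min(copies, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    stations = [f"station-{i:02d}" for i in range(1, 6)]  # hypothetical peer names
    print(replica_peers("track-sensor-17", stations, copies=3))
```

Any peer that knows the membership can compute the same list locally, so locating a copy needs no central directory. When a holder fails, the remaining holders keep their role and the next peer clockwise takes over the missing copy.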
Given their dynamic and adaptive nature, they are promising infrastructures for developing lightweight fault-tolerant and soft real-time middleware.

Despite these a priori advantages, mainstream generic peer-to-peer middleware systems for QoS computing are, to our knowledge, unavailable. Motivated by this state of affairs, by the limitations of the current infrastructure for the information system we are managing at EFACEC (based on CORBA technology) and, last but not least, by the comparative advantages of flexible peer-to-peer network architectures, we have designed and implemented a prototype service-oriented peer-to-peer middleware framework.

The networking layer relies on a modular infrastructure that can handle multiple peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided
  • at this level through the implementation of efficient and resilient services for, e.g., resource discovery, messaging and routing. The kernel of the middleware system (the runtime) is implemented on top of these overlays and uses the above mentioned peer-to-peer functionalities to provide developers with APIs for customization of QoS policies for services (e.g. bandwidth reservation, CPU/core reservation, scheduling strategy, number of replicas). This approach was inspired by that of TAO [3], which allows distinct strategies for the execution of tasks by threads to be defined.

1.4 Assumptions and Non-Goals

The distributed model used in this thesis is based on a partially asynchronous computing model, as defined in [2], extended with fault detectors.

The services and P2P plugin implemented in this thesis only support crash failures. We consider a crash failure [1] to be characterized as a complete shutdown of a computing instance in the event of a failure, ceasing to interact any further with the remaining entities of the distributed system.

Timing faults are handled differently by services and the P2P plugin. In our service implementations a timing fault is logged (for analysis) with no other action being performed, whereas in the P2P layer we consider a timing fault as a crash failure, i.e., if the remote creation of a service exceeds its deadline, the peer is considered crashed. This method is also called process controlled crash, or crash control, as defined in [4]. In this thesis, we adopted a more relaxed version.
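As an illustration of the P2P-layer rule stated above, the following minimal Python sketch maps a missed deadline to a crash suspicion. It is not taken from the thesis implementation: `Membership`, `report_timing_fault` and `create_remote_service` are hypothetical names, and `request` is a stand-in callable for the remote service creation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Membership:
    """Hypothetical view of the overlay; not the thesis's membership service."""
    alive: set = field(default_factory=set)
    suspected: set = field(default_factory=set)

    def report_timing_fault(self, peer):
        # P2P-layer rule from the text: a missed deadline counts as a crash.
        self.alive.discard(peer)
        self.suspected.add(peer)


def create_remote_service(membership, peer, request, deadline_s,
                          clock=time.monotonic):
    """Run `request` (a stand-in for the remote call) and enforce its deadline."""
    start = clock()
    result = request()
    elapsed = clock() - start
    if elapsed > deadline_s:
        # The reply arrived too late, so the peer is treated as crashed.
        membership.report_timing_fault(peer)
        return None
    return result
```

Passing the clock as a parameter is a design choice of this sketch only; it lets the rule be exercised without real delays. A service-level variant would, according to the text, merely log the timing fault.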
If a peer is wrongly suspected of being crashed, it does not get killed or commit suicide; instead it gets shunned, that is, the peer is expelled from the overlay and is forced to rejoin it; more precisely, it must rebind using the membership service in the P2P layer.

The fault model used was motivated by the author's experience on several field deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife Light Rail solutions [5]. Due to the use of highly redundant hardware solutions, such as redundant power supplies and redundant 10-Gbit network ring links, network failures tend to be short. The most common cause for downtime is related to software bugs, which mostly result in a crashing computing node. While simultaneous failures can happen, they are considered rare events.

We also assume that the resource-reservation mechanisms are always available.

In this thesis we do not address value faults and byzantine faults, as they are not a
  • requirement for our target systems. Furthermore, we do not provide a formal specification and verification of the system. While this would be beneficial to assess system correctness, we had to limit the scope of this thesis. Nevertheless, we provide an empirical evaluation of the system.

We also do not address hard real-time because of the lack of mature support for EDF scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized implementation, but only a proof-of-concept to validate our approach. Testing the system in a production environment is left for future work.

1.5 Contributions

Before undertaking the task of building an entirely new middleware system from scratch, we explored current solutions, presented in Chapter 2, to see if any of them could support the requirements from our target system. As we did not find any suitable solution, we then assessed if it was possible to extend an available solution to meet those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7] within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time, fault-tolerance and P2P requires fine-grained control of resources that is not possible with the use of “black-box” solutions; for example, it is impossible to have out-of-the-box support for resource reservation in JGroups.

Given these assessments, we have designed and implemented Stheno, which to the best of our knowledge is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems.

For that, a complete architectural design is proposed that addresses the levels of the software stack, including kernel space, network, runtime and services, to achieve a seamless integration.
The list of contributions includes: (a) a full specification of a user Application Programming Interface (API); (b) a pluggable P2P network infrastructure aiming to better adjust to the target application; (c) support for configurable FT on the P2P layer with the goal of providing lightweight FT mechanisms that fully enable RT behavior, and; (d) integration of resource reservation at all the levels of the runtime, enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.

Previous work [8, 9, 10] on resource reservation focused uniquely on CPU provisioning for real-time systems. In this thesis we present Euryale, a QoS network-oriented
  • framework that features resource reservation with support for a broader range of subsystems, including CPU, memory, I/O and network bandwidth, for a general purpose operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS daemon that handles admission and management of QoS requests.

Current well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12] and Thread-per-Request [13], offer well-known trade-offs between latency and resource usage [3, 14]. However, they do not support resource reservation, namely, CPU partitioning. In order to overcome this limitation, this thesis provides an additional contribution with the introduction of a novel design pattern (Chapter 4) that is able to integrate multi-core computing with resource reservation within a configurable framework that supports these well-known threading strategies. For example, when a client connects to a service it can specify, through the QoS real-time parameters, the particular threading strategy that best meets its requirements.

We present a full implementation that covers all the previous architectural features, including a complete overlay implementation, inspired by the P3 [15] topology, that seamlessly integrates RT and FT.

To evaluate our implementation and justify our claims, we present a complete evaluation for both mechanisms. The impact of the resource reservation mechanism is also evaluated, as well as a comparative evaluation of RT performance against state-of-the-art middleware systems. The experimental results show that Stheno meets and exceeds target system requirements for end-to-end latency and fail-over latency.

1.6 Thesis Outline

The focus of this thesis is on the design, implementation and evaluation of a scalable general purpose middleware that provides the seamless integration of RT and FT.
The remainder of this thesis is organized as follows.

Chapter 2: Overview of Related Work.

This chapter presents an overview of related middleware systems that exhibit support for RT, FT and P2P, the mandatory requirements from our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements, or in its absence, identifying a current solution that could be extended in order to avoid creating a new middleware solution from scratch.

Chapter 3: Architecture.
  • Chapter 3 describes the runtime architecture of the proposed middleware. We start by providing a detailed insight into the architecture, covering all layers present in the runtime. Special attention is given to the presentation of the QoS and resource reservation infrastructure. This is followed by an overview of the programming model that describes the most important interfaces present in the runtime, as well as the interactions that occur between them. The chapter ends with the description of the fundamental runtime operations, namely: the creation of services with and without FT support, deployment strategy, and client creation.

Chapter 4: Implementation.

Chapter 4 describes the implementation of a prototype based on the aforementioned architecture, and is divided into four parts. In the first part, we present a complete implementation of a P2P overlay that is inspired by the P3 [15] topology, while providing some insight into the limitations of the current prototype. The second part of this chapter focuses on the implementation of three types of user services, namely, Remote Procedure Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in Chapter 5. In the third part, we describe our support for multi-core computing, through the presentation of a novel design pattern, the Execution Model/Context. This design pattern is able to integrate resource reservation, especially CPU partitioning, with different well-known (and configurable) threading strategies. The fourth and final part of this chapter describes the most relevant parameters used in the bootstrap of the runtime.

Chapter 5: Evaluation.

The experimental results are presented in this chapter. It starts by providing details of the physical setup used throughout the evaluation. Then it describes the parameters used in the testbed suite, which is composed of the three services previously described in Chapter 4.
We then focus on presenting the results for the benchmarks, including the assessment of the impact of FT on RT, and the impact of the resource reservation infrastructure in the overall performance. The chapter ends with a comparative evaluation against well-known middleware systems.

Chapter 6: Conclusion and Future Work.

This last chapter presents the concluding remarks. It highlights the contributions of the proposed and implemented middleware, and provides
  • –By failing to prepare, you are preparing to fail.
Benjamin Franklin

2 Overview of Related Work

2.1 Overview

This chapter presents an overview of the state-of-the-art on related middleware systems. As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory requirements from our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements, or in its absence, to identify a current solution that could be extended, and thus avoid the creation of a new middleware solution from the ground up. For that reason, we have focused on the intersecting domains, namely, RT+FT, RT+P2P and FT+P2P, since the systems contained in these domains are closer to meeting the requirements of our target system.

From a historic perspective, the origins of modern middleware systems can be traced back to the 1980s, with the introduction of the concept of ubiquitous computing, in which computational resources are accessible and seen as ordinary commodities such as electricity or tap water [2]. Furthermore, the interaction between these resources and the users was governed by the client-server model [16] and a supporting protocol called RPC [17]. The client-server model is still the most prevalent paradigm in current distributed systems.

An important architecture for client-server systems was introduced with the Common Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not address real-time or fault-tolerance. Only recently were both real-time and fault-tolerance specifications finalized, but they remained mutually exclusive. This means that a system supporting the real-time specification will not be able to support the fault-tolerance

  • specification, and vice-versa. Nevertheless, seminal work has already addressed these limitations and offered systems supporting both features, namely, TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as a Java alternative capable of providing a more flexible and easy-to-use environment.

[Figure 2.1: Middleware system classes. Diagram labels: RT, FT, P2P, RT+FT, RT+P2P, FT+P2P, RT+FT+P2P, DDS, Video Streaming, CORBA RT, CORBA FT, Pastry, Distributed storage, Stheno.]

In recent years, CORBA entered a steady decline [20] in favor of web-oriented platforms, such as J2EE [21], .NET [22] and SOAP [23], and P2P systems. The web-oriented platforms, such as the JBoss [24] application server, aim to integrate availability with scalability, but they remain unable to support real-time. Moreover, while partitioning offers a clean approach to improve scalability, it fails to support large scale distributed systems [2]. Alternatively, P2P systems focused on providing logical organizations, i.e., meshes, that abstract the underlying physical deployment while providing a decentralized architecture for increased resiliency. These systems focused initially on resilient distributed storage solutions, such as Dynamo [25], but progressively evolved to support soft real-time systems, such as video streaming [26].

More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed message passing infrastructure based on an asynchronous interaction model that is able to overcome the scaling issues present in RPC. A considerable number of implementations exist, including Tibco [28], Websphere MQ [29] and Java Messaging Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere Application Server.

A substantial body of research has focused on the integration of real-time within
  • CORBA-based middleware, such as TAO [3] (which later addressed the integration of fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34] and OpenSplice [35], appeared as a way to overcome the current lack of support for real-time applications in SOA-based middleware systems.

The introduction of fault-tolerance in middleware systems also remains an active topic of research. CORBA-based middleware systems were a fertile ground to test fault-tolerance techniques in a general purpose platform, resulting in the creation of the CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports scalability and availability through partitioning. Each partition is supported by a group communication framework based on the virtual synchrony model, more specifically, the JGroups [7] group communication framework.

2.2 RT+FT Middleware Systems

This section overviews systems that provide simultaneous support for real-time and fault-tolerance. These systems are divided into special purpose solutions, designed for specific application domains, and CORBA-based solutions, aimed at general purpose computing.

2.2.1 Special Purpose RT+FT Systems

Special purpose real-time fault-tolerant systems introduced concepts and implementation strategies that are still relevant in current state-of-the-art middleware systems.

Armada

Armada [37] focused on providing middleware services and a communication infrastructure to support FT and RT semantics for distributed real-time systems.
This was pursued in two ways, which we now describe.

The first contribution was the introduction of a communication infrastructure that is able to provide end-to-end QoS guarantees, in both unicast and multicast primitives. This was supported by control signaling and a QoS-sensitive data transfer (as in the newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).
  • The network infrastructure used a reservation mechanism based on the EDF scheduling policy that was built on top of the Mach OS priority-based scheduling. The initial implementation was done at the user level but subsequently migrated to the kernel level with the goal of reducing latency.

Many of the architectural decisions regarding RT support were based on the operating system available at the time, mainly Mach OS. Despite the advantages of a micro-kernel approach, its application remains restricted by the underlying cost associated with message passing and context switching. Instead, a large body of research has been carried out on monolithic kernels, especially Linux, that are able to offer the advantages of the micro-kernel approach, through the introduction of kernel modules, and the speed of monolithic kernels.

The second contribution came in the form of a group communication infrastructure based on a ring topology that ensured the delivery of messages in a reliable and totally ordered fashion within a bounded time. It also had support for membership management that offered consistent views of the group through the detection of process and communication failures. These group communication mechanisms enabled the support for FT through the use of a passive replication scheme that allowed for some inconsistencies between the primary and the replicas, where the states of the replicas could lag behind the state of the primary, up to a bounded time window.

Mars

Mars [38] provided support for the analysis and deployment of synchronous hard real-time systems through a static off-line scheduler for the CPU and the Time Division Multiple Access (TDMA) bus. Mars is able to offer FT support through the use of active redundancy on the TDMA bus, i.e. sending multiple copies of the same message, and self-checking mechanisms.
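The two mechanisms attributed to Mars above, a statically computed TDMA slot table and redundant copies of each message, can be illustrated with a small Python sketch. It is only an illustration under simplifying assumptions: the `TdmaSchedule` layout, the slot length and the `receive` helper are hypothetical and do not reproduce Mars's actual protocol or frame format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TdmaSchedule:
    """Static slot table computed off-line (hypothetical layout)."""
    slot_owners: tuple   # node that owns each slot within one round
    slot_len_us: int     # fixed slot duration in microseconds

    def owner_at(self, t_us):
        """Return the node allowed to transmit at absolute time t_us."""
        slot = (t_us // self.slot_len_us) % len(self.slot_owners)
        return self.slot_owners[slot]


def receive(copies):
    """Active redundancy: accept the message if any redundant copy arrived.

    `copies` holds the payloads received for one message, in slot order;
    None marks a copy that was lost on the bus.
    """
    for payload in copies:
        if payload is not None:
            return payload
    return None


# Node "A" owns two consecutive slots, so it can send the same message twice.
schedule = TdmaSchedule(slot_owners=("A", "A", "B", "C"), slot_len_us=100)
```

Because the table is fixed before the system starts, every node can derive the current sender from the time alone, which is what makes the bus access deterministic in this simplified model.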
Deterministic communications are achieved through the use of a time-triggered protocol.

The project focused on RT process control, where all the intervening entities are known in advance. It therefore does not offer any type of support for the dynamic admission of new components, nor does it support on-the-fly fault-recovery.

ROAFTS

The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for distributed RT applications, consisting of a network of Time-triggered Message-triggered Objects [41] (TMOs), whose execution is managed by a TMO support manager. The FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic
  • fault server, and; (b) a network surveillance [42] manager. Fault-detection is assured by the network surveillance TMO, and used by the generic fault server to change the FT policy with the goal of preserving RT semantics. The system assumes that RT can live with lesser reliability assurances from the middleware under highly dynamic environments.

Maruti

Maruti [43] aimed to provide a development framework and an infrastructure for the deployment of hard real-time applications within a reactive environment, focusing on real-time requirements on a single-processor system. The reactive model is able to offer runtime decisions on the admission of new processing requests without producing adverse effects on the scheduling of existing requests. Fault-tolerance is achieved by redundant computation. A configuration language allows the deployment of replicating modules and services.

Delta-4

Delta-4 [44] provided an in-depth characterization of fault assumptions, for both the host and the network. It also demonstrated various techniques for handling them, namely, passive and active replication for fail-silent hosts and byzantine agreement for fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance Architecture (XPA) [45] that aimed to provide real-time support to the Delta-4 framework through the introduction of the Leader/Follower replication model (better known as semi-active replication) for fail-silent hosts. This work also led to the extension of the communication system to support additional communication primitives (the original work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN, and AtLeastTo.

2.2.2 CORBA-based RT+FT Systems

The support for RT and FT in general purpose distributed platforms remains mostly restricted to CORBA.
While some work was carried out by Sun to introduce RT support for Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46, 47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant implementations are Sun's Java Real-Time System (JRTS) [48] and IBM's Websphere Real-Time VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted to provide support for RT in a J2EE environment. Nevertheless, this support seems to be confined to the introduction of a deterministic garbage collector, through the use of
  • the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage collection [51].

Previous work on the integration of RT and FT in CORBA-based systems can be categorized into three distinct approaches: (a) integration, where the base ORB is modified; (b) services, systems that rely on high-level services to provide FT (and indirectly, RT), and; (c) interception, systems that perform interception of client requests to provide transparent FT and RT.

Integration Approach

Past work on the integration of fault-tolerance in CORBA-like systems was done in Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of the CORBA-FT standard [55, 36], and it focused on enhancing the Object Management Architecture (OMA) to support transparent and non-transparent fault-tolerance capabilities. Instead of using message queues or transaction monitors [56], it relied on object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of the Ensemble [59] group communication system, that was used by Electra [52] in the Quality of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an efficient, extensible and non-disruptive integration of the object layers with the low-level QoS system properties. The AQuA [54] system uses both QuO and Maestro on top of the Ensemble communication groups, to provide a flexible and modular approach that is able to adapt to faults and changes in the application requirements. Within its framework a QuO runtime accepts availability requests from the application and relays them to a dependability manager, which is responsible for leveraging the requests from multiple QuO runtimes.

TAO+QuO

The work done in [61] focused on the integration of QoS mechanisms, for both CPU and network resources while supporting both priority- and reservation-based QoS semantics, with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more precisely, TAO [3].
The underlying QoS infrastructure was provided by QuO [60]. The priority-based approach was built on top of the RT-CORBA specification, and it defined a set of standard features in order to provide end-to-end predictability for operations within a fixed priority context [62]. The CPU priority-based resource management is left to the scheduling of the underlying Operating System (OS), whereas the network priority-based management is achieved through the use of the DiffServ architecture [63], by setting the DSCP codepoint in the IP header of the GIOP requests. Based on various factors, the QuO runtime can dynamically change this priority to adjust to
  • environment changes. Alternatively, the network reservation-based approach relies on the RSVP [64] signaling protocol to guarantee the desired network bandwidth between hosts. The QuO runtime monitors the RSVP connections and makes adjustments to overcome abnormal conditions. For example, in a video service it can drop frames to maintain stability. The CPU reservation is made using reservation mechanisms present in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation policies. This was done to preserve the end-to-end QoS semantics that are only available at a higher level of the middleware.

CIAO+QuO

CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on top of TAO [3] that aims to alleviate the complexity of integrating real-time features in DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems, of which TAO is an example, offer configurable policies and mechanisms for QoS, namely real-time, but lack a programming model that is capable of separating systemic aspects from application logic. Furthermore, QoS provisioning must be done in an end-to-end fashion, thus having to be applied to several interacting components. It is difficult, or nearly impossible, to properly configure a component without taking into account the QoS semantics of interacting entities. Developers using standard DOC middleware systems are susceptible to producing misconfigurations that cause overall system misbehavior. CIAO overcomes these limitations by applying a wide range of aspect-oriented development techniques that support the composition of real-time semantics without intertwining configuration concerns.
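The DiffServ marking mentioned for the priority-based approach of TAO+QuO can be illustrated at the socket level. The sketch below is an assumption-laden Python example, not code from TAO or QuO: the helper names `dscp_to_tos` and `mark_socket` are hypothetical, and it relies only on the standard `IP_TOS` socket option for IPv4, whose upper six bits carry the DSCP.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding code point (RFC 3246)


def dscp_to_tos(dscp):
    """Shift a DSCP value into the upper six bits of the former TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits")
    return dscp << 2


def mark_socket(sock, dscp):
    """Ask the OS to stamp outgoing IPv4 packets of `sock` with `dscp`."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


if __name__ == "__main__":
    # Illustrative use: mark a UDP socket for low-latency traffic.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        mark_socket(s, DSCP_EF)
```

Whether routers honor the marking depends on the DiffServ configuration of the network; the socket option only sets the field on outgoing packets.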
The support for CIAO's CCM architecture was done in CORFU [66] and is described below.

Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67]. The integration of QuO's infrastructure into CIAO enhanced its limited static QoS provisioning into a total provisioning middleware that is also able to accommodate dynamic and adaptive QoS provisioning. For example, the setup of an RSVP [64] connection would require explicit configuration from the developer, defeating the purpose of CIAO. Nevertheless, while CIAO is able to compose QuO components, Qoskets [68], it does not provide a solution for component cross-cutting.

DynamicTAO

DynamicTAO [69] focused on providing a reflective model middleware that extends TAO to support on-the-fly dynamic reconfiguration of its component behavior and resource management through meta-interfaces. It allows the application to inspect the internal state/configuration and, if necessary, to reconfigure it in order to adapt
  • to environment changes. Subsequently, it is possible to select networking protocols, encoding and security policies to improve the overall system performance in the presence of unexpected events.

Service-based Approach

An alternative, high-level service approach for CORBA fault-tolerance was taken by Distributed Object-Oriented Reliable Service (DOORS) [70], Object Group Service (OGS) [71], and Newtop Object Group Service [72]. DOORS focused on providing replica management, fault-detection and fault-recovery as a CORBA high-level service. It did group communication and it mainly focused on passive replication, but allowed the developer to select the desired level of reliability (number of replicas), replication policy, fault-detection mechanism, e.g. a SNMP enhanced fault-detection, and recovery strategy. OGS improved over prior approaches by using a group communication protocol that imposes consensus semantics. Instead of adopting an integrated approach, group communication services are transparent to the ORB, by providing a request level bridging. Newtop followed a similar approach to OGS but augmented the support for network partition, allowing the newly formed sub-groups to continue to operate.

TAO

TAO [3] is a CORBA middleware with support for RT and FT that is compliant with the OMG's standards for CORBA-RT [73] and CORBA-FT [36]. The support for RT includes priority propagation, explicit binding, and RT thread pools. FT is supported through the use of a high level service, the Replication Manager, that sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure, acting as a rendezvous for all the remaining components, more precisely, monitors that watch the status of the replicas, replica factories that allow the creation of new replicas, and fault notifiers that inform the manager of failed replicas.
TAO's architecture is further detailed in Section 2.6 of Chapter 3.

FLARe and CORFU

FLARe [74] focuses on proactively adapting the replication group to underlying changes in resource availability. To minimize resource usage, it only supports passive replication [75]. Its implementation is based on TAO [3]. It adds the following new components to the existing architecture: (a) a Replication Manager high level service that decides on the strategy to be employed to address the changes in resource availability and faults; (b) a client interceptor that redirects invocations to the active primary; (c) a redirection agent that receives updates from the Replication Manager and is used by the interceptor,
  • and; (d) a resource monitor that watches the load on nodes and periodically notifies the Replication Manager. In the presence of faulty conditions, such as the overload of a node, the Replication Manager adapts the replication group to the changing conditions, by activating replicas on nodes that have a lower resource usage and, additionally, changing the location of the primary node to a more suitable placement.

CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides fail-stop behavior, that is, when one component in a failover unit fails, all the remaining components are stopped, allowing for a clean switch to a new unit. This is achieved through a fault mapping facility that allows the correspondence of the object failure to the respective plan(s), with the subsequent component shutdown.

DeCoRAM

The DeCoRAM system [77] aims to provide RT and FT properties through a resource-aware configuration, executed using a deployment infrastructure. The class of supported systems is confined to closed DRE systems, where the number of tasks and their respective execution and resource requirements are known a priori and remain invariant throughout the system's life-cycle. As the tasks and resources are static, it is possible to optimize the allocation of the replicas on available nodes. The allocation algorithm is configurable, allowing a user to choose the best approach for a particular application domain. DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-Time, and Resource Awareness Reconciliation Intelligence) that addresses the optimization problem, while satisfying both RT and FT system constraints.
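To make the replica allocation problem concrete, the sketch below places a primary and its backups on distinct nodes with a greedy first-fit heuristic. This is explicitly not FERRARI, whose details are not given here; `place_replicas`, the integer utilization units and the assumption that every replica reserves the full utilization of its task are simplifications introduced for illustration.

```python
def place_replicas(tasks, nodes, replicas=2):
    """Greedy first-fit placement (illustrative, not DeCoRAM's algorithm).

    tasks: {task_name: cpu_utilization}
    nodes: {node_name: cpu_capacity}
    Each task gets `replicas` copies (primary first) on distinct nodes.
    Returns {task_name: [node, ...]} or raises RuntimeError when a task
    cannot be placed.
    """
    free = dict(nodes)  # remaining capacity per node
    plan = {}
    # Place the most demanding tasks first, a common bin-packing heuristic.
    for task, util in sorted(tasks.items(), key=lambda kv: -kv[1]):
        chosen = []
        for node in free:
            if len(chosen) == replicas:
                break
            # Simplification: backups reserve as much CPU as the primary.
            if free[node] >= util:
                chosen.append(node)
        if len(chosen) < replicas:
            raise RuntimeError(f"cannot place {task}")
        for node in chosen:
            free[node] -= util
        plan[task] = chosen
    return plan
```

Since the task set is static in a closed DRE system, such a plan can be computed once, before deployment. A resource-aware algorithm would typically exploit the lower cost of passive backups rather than reserving full capacity for each copy, as this sketch does.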
Because of the limited resources normally available on DRE systems, DeCoRAM only supports passive replication [75], thus avoiding the high overhead associated with active replication [78]. The allocation algorithm calculates the component inter-dependencies and deploys the execution plan using the underlying middleware infrastructure, which is provided by FLARe [74].

Interception-based Approach

The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered multicast protocol. This approach relieved the developer from having to deal with low-level mechanisms for supporting fault-tolerance. In order to maintain compatibility with the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and fault notifier to developers. However, the main infrastructure components are located below the ORB for both efficiency and transparency purposes. These components
  • include logging-recovery mechanisms, replication mechanisms, and interceptors. The replication mechanisms provide support for warm and cold passive replication and active replication. The interceptor captures the CORBA IIOP requests and replies (based on TCP/IP) and redirects them to the fault-tolerance infrastructure. The logging-recovery mechanisms are responsible for managing the logging and checkpointing, and for performing the recovery protocols.

MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way by enhancing distributed RT systems with (a) a transparent, although tunable FT, that is (b) proactively dependable through (c) resource awareness, that has (d) scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as proof-of-concept. The paper makes an important contribution by leveraging fault-tolerance resource consumption for providing RT behavior. MEAD is detailed further in Section 2.6.

2.3 P2P+RT Middleware Systems

While most of the focus on P2P systems has been on the support of FT, there is a growing interest in using these systems for RT applications, namely, in streaming and QoS support. This section provides an overview of P2P systems that support RT.

2.3.1 Streaming

Streaming, and especially Video on Demand (VoD), was a natural evolution of the first file sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the Internet, it is now possible to deliver high-quality multimedia streaming solutions to the end-user. These focus on providing near soft real-time performance, resorting to streams split through the use of distributed P2P storage and redundant network channels.

PPTV

The work done in [26] provides the background for the analysis, design and behavior of VoD systems, focusing on the PPTV system [83].
An overview of the different replication strategies and their respective trade-offs is presented, namely, Least Recently Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation based on the local cache completion and on the availability to demand ratio (ATD).
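To make the difference between the two replacement strategies concrete, the following is a minimal sketch of plain LRU and plain LFU eviction. It is an illustration only: the class names and the `access` method are hypothetical, and the weighted LFU estimation based on cache completion and ATD used in the cited work is not reproduced here.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the entry that was accessed least recently."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # ordered from oldest to newest access

    def access(self, key):
        """Record an access and return the evicted key, or None."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return None
        evicted = None
        if len(self.items) >= self.capacity:
            evicted, _ = self.items.popitem(last=False)  # drop the oldest
        self.items[key] = True
        return evicted


class LFUCache:
    """Evicts the entry with the lowest access count (unweighted)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def access(self, key):
        """Record an access and return the evicted key, or None."""
        if key in self.counts:
            self.counts[key] += 1
            return None
        evicted = None
        if len(self.counts) >= self.capacity:
            # Ties are broken in favour of the entry inserted first.
            evicted = min(self.counts, key=self.counts.get)
            del self.counts[evicted]
        self.counts[key] = 1
        return evicted


# Hypothetical access trace: chunk "a" is popular but not the latest.
trace = ["a", "a", "a", "b", "c"]
lru, lfu = LRUCache(2), LFUCache(2)
lru_evictions = [lru.access(k) for k in trace]
lfu_evictions = [lfu.access(k) for k in trace]
```

On this trace the two policies disagree: LRU discards the popular but older chunk, whereas LFU discards the chunk that was requested only once.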
  • Each stream is divided into chunks. The size of these chunks has a direct influence on the efficiency of the streaming, with smaller pieces facilitating replication and thus overall system load-balancing, whereas bigger pieces decrease the resource overhead associated with piece management and bandwidth consumption due to less protocol control. To allow for a more efficient piece selection, three algorithms are proposed: sequential, rarest first and anchor-based. To ensure real-time behavior the system is able to offer different levels of aggressiveness, including: simultaneously sending requests of the same type to neighboring peers; simultaneously sending different content requests to multiple peers; and requesting from a single peer (making a more conservative use of resources).

Thicket

Efficient data dissemination over unstructured P2P was addressed by Thicket [84]. The work used multiple trees to ensure efficient usage of resources while providing redundancy in the presence of node failure. In order to improve load-balancing across the nodes, the protocol tries to minimize the existence of nodes that act as interior nodes on several trees, thus reducing the load produced from forwarding messages. The protocol also defines a reconfiguration algorithm for balancing the load across neighbor nodes and a tree repair procedure to handle tree partitions. Results show that the protocol is able to quickly recover from a large number of simultaneous node failures and balance the load across existing nodes.

2.3.2 QoS-Aware P2P

Until recently, P2P systems have been focused on providing resiliency and throughput, and thus have not addressed the increasing need for QoS in latency-sensitive applications, such as VoD.

QRON

QRON [85] aimed to provide a general unified framework, in contrast to application-specific overlays.
The overlay brokers (OBs), present at each autonomous system in the Internet, support QoS routing for overlay applications through resource negotiation and allocation, and topology discovery. The main goal of QRON is to find a path that satisfies the QoS requirements, while balancing the overlay traffic across the OBs and overlay links. For this it proposes two distinct algorithms, a “modified shortest distance path” (MSDP) and a “proportional bandwidth shortest path” (PBSP).
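The general idea of finding an overlay path that satisfies a QoS requirement can be sketched as a constrained shortest-path search. The sketch below is not MSDP or PBSP; it is a plain Dijkstra search that ignores links whose available bandwidth is below the request. The function name, the link tuple layout and the sample topology are all assumptions made for illustration.

```python
import heapq


def find_qos_path(links, src, dst, min_bandwidth):
    """Return the lowest-delay path from src to dst that only uses links
    with at least min_bandwidth available, or None if no such path exists.

    links maps a node to a list of (neighbor, delay, available_bandwidth).
    """
    best = {src: 0.0}   # lowest known delay to each reached node
    prev = {}           # predecessor of each node on its best path
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, delay, bandwidth in links.get(node, []):
            if bandwidth < min_bandwidth:
                continue  # link cannot honour the requested bandwidth
            candidate = dist + delay
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical overlay: A-B-D is fast but narrow, A-C-D is slower but wide.
overlay = {
    "A": [("B", 1, 5), ("C", 2, 20)],
    "B": [("D", 1, 5)],
    "C": [("D", 2, 20)],
    "D": [],
}
low_demand = find_qos_path(overlay, "A", "D", min_bandwidth=1)
high_demand = find_qos_path(overlay, "A", "D", min_bandwidth=10)
impossible = find_qos_path(overlay, "A", "D", min_bandwidth=50)
```

A request with a low bandwidth demand takes the fast route, a demanding one is routed over the wider links, and a request that no path can satisfy is rejected. Balancing traffic across brokers, which QRON also targets, is outside the scope of this sketch.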
  • GlueQoS

GlueQoS [86] focused on the dynamic and symmetric QoS negotiation between QoS features from two communicating processes. It provides a declarative language that allows the specification of the feature QoS set (and possible conflicts) and a runtime negotiation mechanism that finds a set of QoS features that is valid at both ends of the interacting components. Contrary to aspect-oriented programming [65], which only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that remains valid throughout the duration of the session between a client and a server.

2.4 P2P+FT Middleware Systems

The research on P2P systems has been largely dominated by the pursuit of fault-tolerance, such as in distributed storage, mainly due to the resilient and decentralized nature of P2P infrastructures.

2.4.1 Publish-subscribe

P2P publish-subscribe systems are a set of P2P systems that implement a message pattern where the publishers (senders) do not have a predefined set of subscribers (receivers) for their messages. Instead, the subscribers must first register their interests with the target publisher, before starting to receive published messages. This decoupling between publishers and subscribers allows for better scalability and, ultimately, performance.

Scribe

Scribe [87] aimed to provide a large scale event notification infrastructure, built on top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to support topics and subscriptions and to build multicast trees. Fault-tolerance is provided by the self-organizing capabilities of Pastry, through the adaptation to network failures and subsequent multicast tree repair. The event dissemination performed is best-effort oriented and without any delivery order guarantees. Nevertheless, it is possible to enhance Scribe to support consistent ordering through the implementation of a sequential time stamping at the root of the topic.
To ensure strong consistency and tolerate topic root node failures, an implementation of a consensus algorithm such as Paxos [89] is needed across the set of replicas (of the topic root).
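The sequential stamping at the topic root mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not Scribe's API: `TopicRoot` and `OrderedSubscriber` are hypothetical names, the root is a single non-replicated process (so the consensus step needed to survive root failures is not shown), and message loss is not handled.

```python
class TopicRoot:
    """Hypothetical topic root that assigns a global sequence number."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, event):
        """Attach the next sequence number to an event."""
        stamped = (self.next_seq, event)
        self.next_seq += 1
        return stamped


class OrderedSubscriber:
    """Delivers events in sequence order, buffering early arrivals."""

    def __init__(self):
        self.expected = 0     # next sequence number to deliver
        self.pending = {}     # out-of-order events, keyed by sequence number
        self.delivered = []

    def receive(self, stamped):
        seq, event = stamped
        self.pending[seq] = event
        # Deliver every event that is now contiguous with what was delivered.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


root = TopicRoot()
stamped_events = [root.stamp(name) for name in ("e0", "e1", "e2")]

subscriber = OrderedSubscriber()
# The best-effort multicast tree may reorder messages in transit.
for index in (2, 0, 1):
    subscriber.receive(stamped_events[index])
```

Every subscriber that applies the same rule delivers the events in the order chosen by the root, regardless of the order in which the multicast tree hands them over.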
  • Hermes

Hermes [90] focused on providing a distributed event-based middleware with an underlying P2P overlay for scalability and reliability. Inspired by work done in Distributed Hash Table (DHT) overlay routing [88, 91], it also has some notions of rendezvous similar to [81]. It bridges the gap between programming language type semantics and low-level event primitives, by introducing the concepts of event-type and event-attributes, which have some common ground with the Interface Description Language (IDL) within the RPC context. In order to improve performance, it is possible in the subscription process to attach a filter expression to the event attributes. Several algorithms are proposed for improving availability, but they all provide weak consistency properties.

2.4.2 Resource Computing

There is a growing interest in harvesting and managing the spare computing power from the increasing number of networked devices, both public and private, as reported in [92, 93, 94, 95]. Some relevant examples are:

BOINC

BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facilitate the harvesting of public resource computing by the scientific research community. BOINC implements a redundant computing mechanism to prevent malicious or erroneous computational results. Each project specifies the number of results that should be created for each “workunit”, i.e. the basic unit of computation to be performed. When some number of the results are available, an application specific function is called to evaluate the results and possibly choose a canonical result. If no consensus is achieved, or if the results simply fail, a new set of results is computed. This process repeats until a successful consensus is achieved or an application defined timeout occurs.

P2P-MapReduce

Developed at Google, MapReduce [97] is a programming model that is able to parallelize the processing of large data sets in a distributed environment.
It follows a master-slave model, where a master distributes the data set across a set of slaves, returning at the end the computational results (from the map or reduce tasks). MapReduce provides fault-tolerance for slave nodes by reassigning the failed job to an alternative active slave, but lacks support for master failures. P2P-MapReduce [98] provides fault-tolerance by resorting to two distinct P2P overlays, one containing the current available masters in
  • the system, and the other with the active slaves. When a user submits a MapReduce job, it queries the master overlay for a list of the available masters (ordered by their workload). It then selects a master node and the number of replicas. After this, the master node notifies its replicas that they will participate in the current job. A master node is responsible for periodically synchronizing the state of the job over its replica set. In case of failure, a distributed procedure is executed to elect the new master across the active replicas. Finally, the master selects the set of slaves from the slave overlay, using a performance metric based on workload and CPU performance, and starts the computation.

2.4.3 Storage

Storage systems were one of the most prevalent applications on first generation P2P systems. Evolving from early file-sharing systems, and with the help of DHT middlewares, they have now become the choice for large-scale storage systems in both industry and academia.

openDHT

Work done in [99] aimed to provide a lightweight framework for P2P storage using DHTs (such as in [88, 91]) in a public environment. The key challenge was to handle mutually untrusting clients, while guaranteeing fairness in the access and allocation of storage. The work was able to provide fair access to the underlying storage capacity, under the assumption that storage capacity is free. Because of its intrinsically fair approach, the system is unable to provide any type of Service Level Agreement (SLA) to the clients, thus reducing the domain of applications that can use it.

Dynamo

Recent research on data storage [25] and distribution at Amazon focuses on key-value approaches using P2P overlays, more precisely DHTs, to overcome the well explored limitation of simultaneously providing high availability and strong consistency (through synchronous replication) [100, 101].
The approach taken was to use an optimistic replication scheme that relied on asynchronous replica synchronization (also known as passive replication). The consistency conflicts between different replicas, which are caused by network and server failures, are resolved at ‘read time’, as opposed to the more traditional ‘write time’ strategy, with this being done to maximize the write availability in the system. Such conflicts are resolved by the services, allowing for a more efficient resolution (although the system offers a default ‘last value holds’ strategy
  • to the services). Dynamo offers efficient key-value storage, while maximizing the availability of write operations. Nevertheless, the ring based overlay hampers the scalability of the system and, depending on the partitioning strategy used, the membership process does not seem efficient.

2.5 P2P+RT+FT Middleware Systems

These types of systems offer a natural evolution over previous FT-RT middleware systems. They aim to provide scalability and resilience through a P2P network infrastructure that is able to provide lightweight FT mechanisms, allowing them to support soft RT semantics. We first proposed an architecture [102, 103] for a general purpose middleware that aimed to integrate FT into the P2P network layer, while being able to provide RT support. The first implementation of the architecture, in Java, was done in DAEM [6, 104]. This work used a hierarchical tree P2P overlay based on P3 [15]. The FT support was performed at all levels of the tree, resulting in a high availability rate, but the use of JGroups [7] for maintaining strong consistency, both for mesh and service data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a major impact on availability when they occurred near the root node, as they produced a cascade failure. Initial support for RT was provided, but the high overhead of the replication infrastructure limited its applicability.

2.6 A Closer Look at TAO, MEAD and ICE

This section provides a closer look at middleware systems that have provided us with several strategies and insights that we used to design and implement Stheno, our middleware solution that is able to support RT, FT and P2P.

All the referred systems, namely TAO, MEAD, and ICE, share a service oriented architecture with a client-server network model. In terms of RT, both TAO and MEAD support the RT-CORBA standard, while ICE only supports best-effort invocations.
As for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid approach that combines both low and high-level services.
  • 2.6.1 TAO

TAO is a classical RPC middleware and therefore only supports the client-server network model. Name resolution is provided by a high-level service, representing a clear point-of-failure and a bottleneck.

RT Support. TAO supports the RT CORBA specification 1.0, with the most important features being: (a) priority propagation; (b) explicit binding, and; (c) RT thread pools.

The priority propagation ensures that a request maintains its priority across a chain of invocations. A client issues a request to an Object A, that in turn issues an invocation to another Object B. The request priority at Object A is then used to make the invocation at Object B. There are two types of propagation: server declared priorities, and client propagated priorities. In the first type, a server dictates the priority that will be used when processing an incoming invocation. In the other type, the priority of the invocation is encoded within the request, so the server processes the request at the priority specified by the client.

A source of unbounded priority inversion is the use of multiplexed communication channels. To overcome this, the RT CORBA specification defines that the network channels should be pre-established, avoiding the latency caused by their creation. This model allows two possible policies: (a) a private connection between the client and the server, or; (b) a priority banded connection that can be shared but limits the priority of the requests that can be made on it.

In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with the support of a reactor (an object that handles network event de-multiplexing), and is normally associated with an acceptor (an entity that handles the incoming connections), a connection cache, and a memory pool. In classic CORBA a high priority thread can be delayed by a low priority one, leading to priority inversion.
So, in an effort to avoid this unwanted side-effect, the RT-CORBA specification defines the concept of thread pool lanes.

All the threads belonging to a thread pool lane have the same priority, and so only process invocations that have the same priority (or a band that contains that priority). Because each lane has its own acceptor, memory pool and reactor, the risk of priority inversion is greatly minimized at the expense of greater resource usage overhead.

FT Support. In an effort to combine RT and FT semantics, the replication style
  • proposed, semi-active, was heavily based on Delta4 [45]. This strategy avoids the latency associated with both warm and cold passive replication [105] and the high overhead and non-determinism of active replication, but represents an extension to the FT specification.

Figure 2.2: TAO’s architectural layout (adapted from [3]).

Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved through the use of a set of high-level services built on top of TAO. These services include a Fault Notifier, a Fault Detector and a Replication Manager.

The Replication Manager is the central component of the FT infrastructure. It acts as a central rendezvous for the remaining FT components, and it is responsible for managing the life-cycle of the replication groups (creation/destruction) and for performing group maintenance, that is, the election of a new primary, the removal of faulty replicas, and the updating of group information.

It is composed of three sub-components: (a) a Group Manager, that manages the group membership operations (adds and removes elements), allows the change of the primary of a given group (for passive replication only), and allows manipulation and retrieval of group member localization; (b) a Property Manager, that allows the manipulation of
  • replication properties, like replication style; and (c) a Generic Factory, the entry point for creating and destroying objects.

The Fault Detector is the most basic component of the FT infrastructure. Its role is to monitor components, processes and processing nodes and report eventual failures to the Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards them to the Replication Manager.

The FT bootstrapping sequence is as follows:
(a) The Naming Service is started.
(b) The Replication Manager is started.
(c) The Fault Notifier is started.
(d) The Fault Notifier finds the Replication Manager and registers itself with it.
(e) As a response, the Replication Manager connects as a consumer to the Fault Notifier.
(f) On each node that is going to participate, a Fault Detector Factory and a Replica Factory are started, which in turn register themselves with the Replication Manager.
(g) A group creation request is made to the Replication Manager (by a foreign entity, referred to as the Object Group Creator), followed by a request for the list of available Fault Detector Factories and Replica Factories.
(h) This is followed by a request to create an object group in the Generic Factory.
(i) The Object Group Creator then bootstraps the desired number of replicas using the Replica Factory at each target node. In turn, each Replica Factory creates the actual replica and, at the same time, starts a Fault Detector at each site using the Fault Detector Factory. Each one of these detectors finds the Replication Manager, retrieves the reference to the Fault Notifier and connects to it as a supplier.
(j) Each replica is added to the object group by the Object Group Creator, using the Group Manager at the Replication Manager.
(k) At this point, a client is started, retrieves the object reference from the naming service, and makes an invocation to that group.
This is then carried out by the primary of the replication group.

Proactive FT Support. An alternative approach has been proposed by FLARe [74], which focuses on proactively adapting the replication group to the load present in the system. The replication style is limited to semi-active replication using state-transfer, which is commonly referred to simply as passive replication.

Figure 2.3 shows the architectural overview of FLARe. This new architecture adds three new components to TAO’s FT infrastructure: (a) a client interceptor, that redirects the invocations to the proper server, as the initial reference could have been changed by the proactive strategy in response to a load change; (b) a redirection agent that receives the updates with these changes from the Replication Manager; and (c) a resource monitor that monitors the load on a processing node and sends periodic
  • updates to the Replication Manager.

Figure 2.3: FLARe’s architectural layout (adapted from [74]).

In the presence of abnormal load fluctuations the Replication Manager changes the replication group to adapt to these new conditions, by creating replicas on lower usage nodes and, if required, by changing the primary to a more suitable replica.

TAO’s fault tolerance support relies on a centralized infrastructure, with its main component, the Replication Manager, representing a major obstacle to the system’s scalability and resiliency. No mechanisms are provided to replicate this entity.

2.6.2 MEAD

MEAD focused on providing fault-tolerance support in a non-intrusive way for enhancing distributed RT systems by providing a transparent, although tunable FT, that is proactively dependable through resource awareness, and that has scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as proof-of-concept.

Transparent Proactive FT Support. MEAD’s architecture contains three major components, namely, the Proactive FT Manager, the MEAD Recovery Manager and the
  • MEAD Interceptor. The underlying communication is provided by Spread, a group communication framework that offers reliable totally ordered multicast, for guaranteeing consistency for both component and node membership.

The MEAD Interceptor provides the usual interception of system calls between the application and the underlying operating system. This approach allows a transparent and non-intrusive way to enhance the middleware with fault-tolerance.

Figure 2.4: MEAD’s architectural layout (adapted from [14]).

Figure 2.4 shows the architectural overview of MEAD. The main component of the MEAD system is the Proactive FT Manager, which is embedded within the interceptors in both server and client. It is responsible for monitoring the resource usage at each server and for initiating a proactive recovery scheme based on a two-step threshold. When the resource usage gets higher than the first threshold, the proactive manager sends a request to the MEAD Recovery Manager to launch a new replica. If the usage gets higher than the second threshold, then the proactive manager starts migrating the replica’s clients to the next non-faulty replica server.

The MEAD Recovery Manager has some similarities with the Replication Manager of CORBA-FT, as it also must launch new replicas in the presence of failures (node or server). In MEAD, the recovery manager does not follow a centralized architecture, as in TAO or FLARe, where all the components of the FT infrastructure are connected to the replication manager. Instead, they are connected by a reliable totally ordered group communication framework that establishes an implicit agreement at each communication round. These frameworks also provide a notion of view, i.e. an instantaneous
  • snapshot of the group membership, and notifications of any membership change. This allows the MEAD Recovery Manager to detect a failed server and respawn a new replica, maintaining the desired number of replicas.

The choice by developers of the FT properties, e.g. replication style, without an evaluation of object state size and resource usage, can severely affect the overall performance and reliability. The only way to balance these two orthogonal domains, reliability and (real-time) performance, is to take into account the object’s resource usage, the system’s resource availability, and the target level of reliability and recovery-time.

To overcome this issue, MEAD introduced an FT Advisor. This advisor profiles the object for a certain period of time to assess its resource usage, e.g. CPU, network bandwidth, etc., and invocation ratio. Using this, the advisor can provide advice on the proper settings of FT properties. For example, if an object uses little computation time and has a large state, then active replication is the most suitable replication style.

The replication style is not the only choice considered. For passive replication there are two options that are of relevance: checkpointing and fault-detection. The periodicity of checkpointing affects the delay window of the consistency between the primary and the replicas. A higher checkpointing frequency results in a smaller window, i.e. the inconsistent state lasts for a shorter time, but brings a larger resource overhead, as more CPU and network bandwidth are needed. The fault-detection period directly impacts the recovery-time, as a larger period between fault-detection inspections results in a larger recovery time.

The fault advisor continuously and periodically provides feedback to the runtime with more accurate suggestions, adjusting to changes in resource usage and availability.

Normally, active replication support is restricted to deterministic single threaded applications.
MEAD’s last contribution comes in the form of support for non-deterministic applications under active replication. To achieve this, MEAD uses source-code analysis to detect points in the source code that introduce non-determinism, e.g. system calls like gettimeofday. These non-deterministic points are stored in a data structure and are embedded within invocations and replies, so they must be stored locally in both clients and servers. The reason behind this necessity resides in the way active replication works. A client makes an invocation, which is multicast to the replicas. Each replica processes the request, storing the non-deterministic data locally and piggybacking it on the reply that is sent back to the client. The client picks the first reply, and stores the non-deterministic data locally. The client piggybacks this information on the next invocation it makes. When the replicas receive this invocation, they retrieve this non-deterministic
  • information and update their internal state, except the replica whose reply was chosen by the client.

The recovery manager is not replicated, which turns it into a single point of failure. The use of a reliable totally ordered group communication framework partially improves the decentralization of the infrastructure, but the recovery manager still acts as a centralized unit, resulting in a negative impact on the overall system scalability. In systems that are prone to a large churn rate, group communication could result in partitions, as we assessed in DAEM [6]. These partitions could result in a major outage, compromising the reliability and real-time performance.

2.6.3 ICE

ICE [106] provides a lightweight RPC based middleware that aims to overcome the inefficiency, such as redundancy, present in the CORBA specification. For that purpose, ICE provides an efficient communication protocol and data encoding. It does not support any kind of RT semantics.

The support for FT present in ICE is minimal, and is restricted to naming for replication groups, that is, when a client tries to resolve the replication group name, it receives a list with all the server instances that belong to the group (i.e. the endpoints). On the other hand, it does not support any type of replication style or even synchronization primitives, leaving this to the applications.

ICE does not provide an infrastructure to support very large scale systems. Its registry, which plays the role of the CORBA Naming Service, constitutes a bottleneck and a possible single point of failure. The reliability of the registry can be improved by the addition of standby instances, in a master-slave relation.

2.7 Summary

The goal of this chapter was to search for a suitable solution that could address all the requirements of our target system, that is, a middleware system capable of simultaneously supporting RT+FT+P2P.
As no solution was found, we focused on systems that belong to the intersecting domains, namely, RT+FT, P2P+FT and P2P+RT, to see if we could extend one of them and avoid designing and implementing a new middleware from scratch.
  • In our previous work, DAEM [102, 103], we used some off-the-shelf components, e.g., JGroups [7] to manage replication groups, but realized that in order to integrate real-time and fault-tolerance within a P2P infrastructure, we would have to completely control the underlying infrastructure, with fine grain management over all the resources available in the system. The use of COTS software components creates a “black-box” effect that introduces sources of unpredictable behavior and non-determinism, which undermine any attempt to support real-time. For that reason, it was unavoidable to create a solution from scratch.

Using the insights learned from several inspirational middleware systems, namely TAO, MEAD, and ICE, we have designed, in Chapter 3, and implemented, in Chapter 4, Stheno, which to the best of our knowledge is the first middleware system that simultaneously supports RT and FT within a P2P infrastructure.
  • 3 Architecture

–If you can’t explain it simply, you don’t understand it well enough. Albert Einstein

The implementation of increasingly complex systems at EFACEC is currently limited by the capabilities of the supporting middleware infrastructure. These systems include public information systems for public transportation, automated power grid management and automated substation management for railways. The use of service oriented architectures is an effective approach to reduce the complexity of such systems. However, the increasing demand for guarantees on the fulfillment of SLAs can only be met with a middleware platform that is able to provide QoS computing while enforcing a resilient behavior.

Some middleware systems [14, 3] already addressed this problem by offering soft real-time computing and fault-tolerance support. Nevertheless, their support for real-time computing is limited, as they do not provide any type of isolation. For example, a service can hog the CPU and effectively starve the remaining services. The support for fault-tolerance is restricted to crash-failures and the implementation of the fault-tolerance mechanisms is normally accomplished through the use of high-level services. However, these high-level services cause a significant amount of overhead, due to cross-layering, limiting the real-time capabilities of these middleware systems.

These systems also used a centralized networking model that is susceptible to a single point-of-failure and offers limited scalability. For example, the CORBA naming service reflects these limitations, where a crash failure can effectively stop an entire system because of the absence of the name resolution mechanism.

This chapter describes the architectural overview of a new general purpose P2P middleware that addresses the aforementioned problems.
The resilient nature of P2P overlays enables us to overcome the limitations of current approaches by offering a decentralized and reconfigurable fault resistant architecture that avoids bottlenecks, and thus
  • enhances overall performance.

Stheno, our middleware platform, is able to provide QoS computing with support for resource reservation through the implementation of a QoS daemon. This daemon is responsible for the admission and distribution of the available resources among the components of the middleware. Furthermore, it also interacts with the low-level resource reservation mechanisms of the operating system to perform the actual reservations. With this support, we provide proper isolation that is able to accommodate soft real-time tasks and thus provide guarantees on SLAs. While we currently only support CPU reservation, the architecture was designed to be extensible and subsequently support additional sub-systems, such as memory or networking resource reservations.

Notwithstanding, the real-time capabilities are limited by the amount of resources that are needed to provide fault-tolerance. To overcome the current limitations of providing fault-tolerance through the use of expensive high-level services, we propose the integration of the fault-tolerance mechanisms directly in the overlay layer. This provides two advantages over the previous approaches: 1) it allows the implementation of lightweight fault-tolerance mechanisms by reducing cross-layering, and; 2) the replica placement can be optimized using the knowledge of the overlay’s topology. Previous systems relied on manual bootstrap of replicas, such as TAO [3], or required the presence of additional high-level services to perform load balancing across the replica set, as in FLARe [74].

While the work presented in this thesis only implements semi-active replication [44], we designed a modular and flexible fault-tolerance infrastructure that is able to accommodate other types of replication policies, such as passive replication [75] and active replication [78].

Our architectural design also considered future support for virtualization.
However, instead of providing virtualization as a service, as is done in cloud computing platforms [107], our goal is to support lightweight virtualized services to offer out-of-the-box fault-tolerance support for legacy services through the use of the live-migration mechanisms present in current hypervisors, such as KVM [108] and Xen [109]. This can be achieved through the use of Just Enough Operating System (JeOS) [110], which enables the creation of small-footprint virtual machines, a critical requirement for migrating virtual machines.

Finally, in order to minimize the effort required to port the runtime to a new operating system, we used the ACE framework [111], which abstracts the underlying operating system infrastructure.
  • 3.1 Stheno’s System Architecture

In order to contextualize our approach, we present our solution applied to one of our target systems, the public information system of Oporto’s light-train network. As shown in Figure 3.1, the network uses a hierarchical tree-based topology, based on the P3 overlay [15], where each cell represents a portion of the mesh space that is maintained (replicated) by a group of peers. These peers provide the computational resources needed to maintain the light-train stations and host services within the system. Additionally, there are also sensors that connect to the system through peers. They offer an abstraction of several low-level activities, such as traffic track sensors and video camera streams. A detailed discussion of the implementation of the overlay is provided in Chapter 4.

Figure 3.1: Stheno overview.

The middleware’s runtime provides the infrastructure that allows users to launch and manipulate services, while hiding the interaction with the low-level peer-to-peer overlay and operating system mechanisms. It is based on a five-layer model, as shown in Figure 3.1.

The bottom layer, Operating System Interface, encapsulates the Linux operating system and the ACE [111] network framework. The Support Framework is built on top of the bottom layer, and offers a set of high-level abstractions for efficient, modular component design. The P2P Layer and FT Configuration contains all the peer-to-peer overlay infrastructure components and provides a communication abstraction and FT configuration to the upper layers. The runtime can be loaded with a specific overlay implementation at bootstrap. The middleware is parametric in the choice of overlay: overlays are provided as plugins and can be loaded dynamically. The Core layer represents the kernel of the runtime, and is responsible for managing all the resources allocated to the middleware and the peer-to-peer overlay. Finally, the Application and Services layer is composed of the applications and services that run on top of the middleware.

Next, we describe the organization of each layer, as well as their inter-dependencies. In an effort to improve the overall comprehension of the runtime, the layers are presented using a top-down approach, starting at the application level, continuing through the core and overlay layers, and ending at the operating system interface.

3.1.1 Application and Services

One of the most fundamental problems when developing a general-purpose middleware system is its ability to expose functionalities and configuration options to the user. This layer achieves that goal through the introduction of high-level APIs that allow users to query and configure the different layers of the runtime. For example, in our target system, a system operator may create a video streaming service from a light-train station and set the frame rate and replication style.

The service represents the main abstraction of the middleware, and is shown in Figure 3.4.
A developer who wishes to deploy an application has to use this abstraction. The node hosting a service guarantees that its QoS requirements (CPU, network, memory and I/O) are met throughout the service’s entire life-cycle. The CPU subsystem offers an exception to this definition. It allows the creation of best-effort computing tasks that, as the name implies, do not have any QoS guarantees. These are normally associated with helper mechanisms, such as logging.

A service can be statically or dynamically loaded into the middleware. Dynamic services are encapsulated in a meta-archive called a Stheno Service Archive, which has the .ssa file extension and uses the ZIP archive format. Such an archive contains a service implementation (plugin) that may be loaded by the runtime. This solution allows the runtime to dynamically retrieve a missing service implementation and load it on-the-fly.
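The archive handling described above can be illustrated with a minimal sketch. It relies only on what the text states: a service archive is a ZIP file carrying a plugin. The internal layout used here (a `manifest.json` entry with `sid` and `plugin` fields) and the function names are hypothetical conventions chosen for illustration, not Stheno’s actual format or API.

```python
import io
import json
import zipfile


def build_ssa(sid, plugin_name, plugin_bytes):
    """Build an in-memory service archive (hypothetical layout: manifest + plugin)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The manifest format is an assumption made for this sketch.
        zf.writestr("manifest.json", json.dumps({"sid": sid, "plugin": plugin_name}))
        zf.writestr(plugin_name, plugin_bytes)
    return buf.getvalue()


def read_ssa(data):
    """Validate an archive and return (sid, plugin_bytes)."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        raise ValueError("not a ZIP-based service archive")
    with zipfile.ZipFile(buf) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        return manifest["sid"], zf.read(manifest["plugin"])
```

A runtime that receives a request for an unknown service could, under these assumptions, fetch the archive bytes from a peer, call `read_ssa`, and hand the extracted plugin to its loader.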
  • Figure 3.2: Application Layer.

Figure 3.3: Stheno’s organization overview.

Each service is identified in the middleware system by a Service Identifier (SID) that uniquely identifies the service implementation, and an Instance Identifier (IID) that identifies a particular instance of a service in the system, as any given service implementation can have multiple instances running simultaneously (Figure 3.3).

An IID is unique across the peer-to-peer overlay. Therefore, at any given time, an instance is running in only one peer, uniquely identified by a Peer Identifier (PID), although during its lifespan the instance can migrate from one peer to another. A PID can only be allocated to one cell, identified by a Cell Identifier (CID), at a time. However, this membership can change dynamically during the peer’s lifespan.

A cell can be seen as a set of peers that are organized to maintain a partition of the overlay space. These cells can be loosely coupled, for example, Gnutella peers partition the overlay space in an ad-hoc fashion, or they can follow a structured topology. Other overlays [15] have a hierarchical tree of cells, and in each cell the peers cooperate with the purpose of maintaining a portion of the overlay tree. In turn, the cells cooperate among themselves to maintain the global tree topology.

Some services can be deployed strictly as daemons. This class of services does not offer any type of external interaction. Nevertheless, a service usually provides some sort of interaction that is abstracted in the form of a client.

Using the RPC service as an example, a client is a broker between the user and the server, marshaling the request and unmarshaling the reply. Another example is a video streaming client that connects to a streaming service with the purpose of receiving a video stream, acting as a stream sink.

The interaction with a service, through a client, is only possible if the service provides one or more Service Access Points (SAPs). These SAPs provide the entry points that support such interactions, with each one providing a specific QoS. For example, an RPC service can provide two SAPs, one for low-priority invocations and the other for high-priority invocations.

When a user (through a client) wants to contact a service instance, it first has to know which SAPs are available in that particular instance. To accomplish that, the user must use the discovery service to query for the active access points of a particular instance of a service.

To summarize, the responsibilities of a service are the following: define the amount of resources that it will need throughout its life-cycle; manage multiple SAPs, and; provide a client implementation.

3.1.2 Core

One important issue is how to deal with the different real-time and fault-tolerance requirements of different services, which, in turn, are requested by different users. In order to address this issue, the core, shown in Figure 3.4, is responsible for the overall management of all assigned resources, including overlays and services.
The resource reservation mechanisms are not controlled directly by the runtime, but by a resource reservation daemon, shown as QoS Daemon, that is responsible for managing the available low-level resources.

This approach enables multiple runtimes to coexist within the same physical host and further allows foreign applications to use the resource reservation infrastructure. The runtime core merely acts as a broker for any resource reservation request initiated by any of its applications or overlay services.
  • Figure 3.4: Core Layer.

The most important roles performed by the core are the following: a) maintain the information of all local active service instances; b) act as a regulator, deciding on the acceptance of new local service instances, and; c) provide a resource reservation broker. The management of the active service instances is done through the Service Manager.

Service Manager

The service manager is responsible for managing all the local services of a runtime. A service can be loaded into an active runtime in one of two ways: it can be locally bootstrapped at start-up, as is the case for static services that are loaded when the runtime bootstraps, or; it can be dynamically loaded in response to a local or remote request. The request for the creation of a new service instance can be initiated locally, by the user or by a local service, or remotely, when a peer requests it through the overlay infrastructure.

This remote service creation is delegated to the overlay mesh service, which in turn uses the overlay’s inner infrastructure to accomplish this task. The implementation of these mechanisms is detailed in Chapter 4.

The service manager is composed of two entities, a service factory and a service bookkeeper. The service factory is a repository of known service implementations that can be manipulated dynamically, allowing the insertion and removal of service implementations. The service bookkeeper manages the information, such as SAPs, about the active service instances that are running locally.
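The split between service factory and service bookkeeper can be sketched as follows. This is an illustrative model only: the class and method names, and the way an IID is derived from the SID, are assumptions made for the example and are not taken from Stheno’s implementation (in particular, a real IID must be unique across the whole overlay, not just locally).

```python
import itertools


class ServiceFactory:
    """Repository of known service implementations, keyed by SID."""

    def __init__(self):
        self._impls = {}

    def register(self, sid, constructor):
        self._impls[sid] = constructor

    def unregister(self, sid):
        self._impls.pop(sid, None)

    def create(self, sid):
        if sid not in self._impls:
            raise KeyError(f"unknown service implementation: {sid}")
        return self._impls[sid]()


class ServiceBookkeeper:
    """Tracks locally running instances: IID -> (SID, service object, SAPs)."""

    def __init__(self):
        self._instances = {}

    def add(self, iid, sid, service, saps):
        self._instances[iid] = (sid, service, list(saps))

    def remove(self, iid):
        return self._instances.pop(iid)

    def saps(self, iid):
        return list(self._instances[iid][2])

    def instances_of(self, sid):
        return sorted(i for i, (s, _, _) in self._instances.items() if s == sid)


class ServiceManager:
    """Combines the factory and the bookkeeper for one runtime."""

    def __init__(self):
        self.factory = ServiceFactory()
        self.bookkeeper = ServiceBookkeeper()
        self._counter = itertools.count(1)

    def start(self, sid, saps):
        service = self.factory.create(sid)
        iid = f"{sid}#{next(self._counter)}"  # hypothetical, locally unique IID
        self.bookkeeper.add(iid, sid, service, saps)
        return iid

    def stop(self, iid):
        self.bookkeeper.remove(iid)
```

Keeping the two entities separate means that implementations can be added or removed without touching the records of running instances, which matches the dynamic manipulation described above.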
  • QoS Controller

The QoS Controller, shown in Figure 3.5, acts as a proxy between the components of the runtime and the QoS daemon. Each component has access to a set of resources that are assigned at creation time. A component uses its resources through a QoS Client that was previously assigned to it by the QoS Controller. A resource reservation request is created by a QoS Client and is then re-routed by the QoS Controller to the QoS daemon. In the current implementation, the allocation assigned to each component is static. Dynamic reassignment of the resources allocated to a component is left for future work.

Figure 3.5: QoS Infrastructure.

Section 3.1.4 provides the details on the QoS and resource reservation infrastructure, in particular the internals of the QoS daemon.

3.1.3 P2P Overlay and FT Configuration

Our target systems require that the middleware be able to adapt its P2P networking layer to mimic the physical deployment while, at the same time, providing the fault-tolerance configuration options to meet application needs.

The overlay layer is based on a plugin infrastructure that enables a flexible deployment of the middleware for different application domains. For example, in our flagship solution, Oporto’s light-train network, we used a P3-based plugin implementation that mirrors the regional hierarchy of the system. Additionally, the FT configuration options passed by the user, for example, the requirement to keep a service replicated across 3 replicas using semi-active replication, are delegated to the FT service within the P2P overlay.
  • Because of this flexibility, the runtime does not bootstrap with a specific overlay implementation by default; it is left to the user to choose the most suitable P2P implementation for the particular target case. Figure 3.6 shows the components that form the overlay abstraction layer.

Figure 3.6: Overlay Layer.

Every overlay implementation must provide the following services: (a) Mesh, responsible for membership and overlay management; (b) Discovery, used to discover resources and data across the overlay, and; (c) FT (Fault-Tolerance), used to manage and negotiate the fault-tolerance policies across the overlay.

Mesh Service

The mesh service is responsible for managing the overlay topology and providing support for the remote creation and removal of services. The management of the overlay topology is supported through the membership and recovery mechanisms. The membership mechanism must allow the entrance and departure of peers while maintaining the consistency of the mesh topology. At the same time, the recovery mechanism has to perform the necessary rebind and reconfiguration to ensure that the mesh topology remains valid even in the presence of severe faults.

An overlay plugin is free to implement the membership and recovery mechanisms that best fit its needs. This was motivated by the goal of minimizing the restrictions placed on the overlay topology, thereby increasing the range of systems supported by Stheno.

Figure 3.7 shows four possible implementation approaches. A portal can be used to act as a gatekeeper [112] (shown in Figure 3.7a), resembling the approach taken by most web services. This can be suitable for systems that do not have a high churn rate. On the other hand, systems that need highly available and decentralized architectures may use multicast mechanisms to detect other nodes present in the system [15] (shown in Figure 3.7b). Nevertheless, some systems require bounded times for operations such as queries. This can be accomplished with the introduction of cells (also known as federations), such as in Gnutella [81] (shown in Figure 3.7c), or alternatively, by imposing some kind of well-defined inter-peer relationship, such as in Chord [113] (shown in Figure 3.7d).

Figure 3.7: Examples of mesh topologies (panels a to d).

Discovery Service

The discovery service offers an abstraction that allows the execution of queries on the underlying overlay. As with the mesh service, each overlay plugin is free to implement the discovery service as best suits the needs of the target system. Figure 3.8 shows the execution of a query under some possible topologies.

Figure 3.8: Querying in different topologies. (a) Hierarchical overlay topology. (b) Ad-hoc overlay topology. (c) DHT overlay topology.

The main goals defined for the discovery service are the following: performing synchronous and asynchronous querying with QoS awareness, and; handling query requests from neighboring peers while respecting the QoS associated with each request.

Fault-Tolerance Service

The FT infrastructure is based on replication groups. These groups can be defined as a set of cooperating peers that have the common goal of providing reliability to a high-level service. In current middleware systems, FT support is implemented through a set of high-level services that use the underlying primitives, for example, TAO [3]. Our approach makes a fundamental shift from this principle, by embedding FT support in the overlay layer.

The integration of FT in the overlay reduces the overhead of cross-layering that is associated with the use of high-level services. Furthermore, this approach also enables the runtime to make replica placement decisions that are aware of the overlay topology. This awareness allows for a better balance between the target reliability and resource usage.

The FT service is responsible for the creation and removal of replication groups. However, the management of a replication group is self-contained, that is, the FT service delegates all the logistics to the replication group itself. This allows further extensibility of the replication infrastructure, and also allows different types of replication strategies to coexist inside the FT service, so that each service can use the replication policy that best meets its requirements.

The assumptions made in the design of each service limit the type of fault-tolerance policies that can be used. For example, if a service needs to maintain a high level of availability, then it should use active replication [78] in order to minimize recovery time. For these reasons, we designed an architecture that provides a flexible framework, where different fault-tolerance policies can be implemented. In Chapter 4 we provide an example of an FT implementation.

3.1.4 Support Framework

Our target system has different RT requirements for different tasks. For example, a critical event is the highest-priority traffic present in the system and is highly sensitive to latency.
To ensure that the 2 second deadline is met, it is necessary to reserve the CPU needed to process the events and, at the same time, to employ a threading strategy that minimizes latency (at the expense of throughput), such as Thread-per-Connection [12].

The support framework provides the necessary infrastructure to address these issues by offering a set of packages that provide high-level abstractions for different threading strategies, network communication and QoS management, in particular the mechanisms for resource reservation. Figure 3.9 shows the components of the support framework. It makes three key contributions: (a) a novel and extensible infrastructure for resource reservation and QoS; (b) a novel design pattern for multi-core computing; and (c) an extensible monitoring facility. Before delving into the details of these components, we first present the framework’s packages. In an effort to improve its maintainability, the framework uses a package-like schema, with the following layout:
  • Figure 3.9: Support framework layer.

    • common - this package includes support for integer conversion, backtrace (for debugging), state management, synchronization primitives, and exception handling;
    • network - this package has support for networking, namely stream and datagram sockets, packet-oriented sockets, low-level network utilities, and request support;
    • event - this package implements the event interface, a fundamental component for network-oriented programming;
    • qos - this package implements the resource reservation infrastructure, namely the QoS daemon and client, as well as the QoS primitives that are used by the threading package, such as scheduling information;
    • serialization - this package includes a serialization interface and provides a default serialization implementation;
    • threading - this package offers several scheduling strategies, including: Leader-Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-Request [13]. All of these strategies are implemented using the Execution Model - Execution Context design pattern;
    • tools - the tools package includes the loader and the monitoring sub-packages, which contain a load injector and a resource monitoring daemon, respectively.

The most prominent package in the framework is the resource reservation and QoS infrastructure. It provides the low-level support that is necessary for the integration of RT and FT into the middleware’s runtime. Next, we present an overview of the inner workings of each of the components and reason about their implications for several aspects of a real-time fault-tolerant middleware.

The QoS and Resource Reservation Infrastructure

One of the key aspects of real-time systems is the ability to fulfill an SLA even in the presence of an adverse environment. Adversities can be caused by system overload, bugs, or malicious attacks, and can occur in the form of rogue services, device drivers, or kernel modules.

The only viable way to obtain deterministic behavior is to isolate the various components present in the system. This type of containment can be achieved by using a virtual machine, although this only works for user-space applications and services, or by using the low-level infrastructure provided by the underlying operating system, such as Control Groups [115] or Zones [116]. Control Groups is a modular and extensible resource management facility provided by the Linux kernel, while Zones is a similar but less powerful implementation for the Solaris operating system.

These types of mechanisms are normally associated with static provisioning and left to system administrators to manage. This is clearly not a suitable approach for the complex and dynamic environments that are the focus of this work. To overcome this limitation, we designed and implemented a novel QoS daemon that manages the available resources in the Linux operating system.

The goal of the QoS daemon is to provide an admission control and management facility that governs the underlying Control Groups infrastructure. There are four main QoS subsystems: CPU, I/O, memory and network. At this time, we have only fully implemented the CPU subsystem. The remaining subsystems have only preliminary support.

All the subsystems supported by Control Groups follow a hierarchical tree approach to the distribution of their resources (Figure 3.10).
Each node of the tree represents a group that contains a set of threads that share the available resources of the group. For example, if a CPU group has 50% of the CPU resources, then all the threads of the group share those resources. As usual, the distribution of the CPU time among the threads is performed by the underlying CPU scheduler.

CPU subsystem

We define three types of threads: (a) best-effort threads, which do not have real-time requirements and are expected to run as soon as possible but without any deadline constraints; (b) soft real-time threads, which have a defined deadline but for which a deadline miss does not produce a system failure, and; (c) isolated soft real-time threads, which are placed in isolated core(s) in order to prevent entanglement with other activities of the operating system (interrupt handling from other cores, network multi-queue processing, etc.), resulting in less latency and jitter and thus providing better assurance on the fulfillment of deadlines.

However, there is another type of thread that is not currently supported by the middleware: hard real-time threads. A failure to meet the deadline of one of these threads could result in a catastrophic failure; such threads are normally associated with critical systems such as railway signalling or avionics. Ongoing work on EDF scheduling seems to offer a solid way to provide hard real-time support in Linux [117, 118], and a recent validation seems to confirm our beliefs [119]. We plan to extend our support to accommodate threads that are governed by deadlines instead of priorities.

To simplify the explanation of the CPU subsystem, we describe it as one entity although, in reality, it is composed of two separate, closely related groups: CPU Partitioning (also known as cpusets) and Real-Time Group Scheduling. The first group is responsible for providing isolation, commonly known as shielding, of subsets of the available cores, while the second group provides resource reservation guarantees to RT threads, that is, it is responsible for controlling the amount of CPU for each reservation.

Figure 3.10: QoS daemon resource distribution layout.
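The hierarchical distribution of CPU reservations, where each group can only hand out what it received from its parent, can be sketched as a small admission-control tree. This is a simplified model under stated assumptions: reservations are expressed as microseconds of runtime per fixed period (the 1 s period is an assumed value), and the class is hypothetical rather than the QoS daemon’s actual code. On Linux, values of this kind would typically end up in the Real-Time Group Scheduling files `cpu.rt_period_us` and `cpu.rt_runtime_us` of a control group.

```python
PERIOD_US = 1_000_000  # assumed reservation period (1 second)


class CpuGroup:
    """A node in a hierarchical CPU reservation tree."""

    def __init__(self, name, runtime_us, parent=None):
        self.name = name
        self.runtime_us = runtime_us  # CPU time reserved per period
        self.children = []
        if parent is not None:
            parent._admit(self)

    def _admit(self, child):
        # Admission control: a child may only take what is still unreserved.
        if child.runtime_us > self.available_us():
            raise ValueError(f"{child.name}: not enough CPU left in {self.name}")
        self.children.append(child)

    def available_us(self):
        """Runtime not yet distributed to child groups."""
        return self.runtime_us - sum(c.runtime_us for c in self.children)

    def share(self):
        """Fraction of one period reserved for this group."""
        return self.runtime_us / PERIOD_US
```

Rejecting a request at admission time, instead of letting groups overcommit, is what allows a reservation to be treated as a guarantee for the threads inside the group.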
  • Figure 3.10 illustrates a possible resource reservation schema. The nodes RA and RB represent two runtimes present in the same physical host, while S1 and S2 represent services running under these two runtimes. The node shown as OS represents the resources allocated to the operating system. The P2P node represents the resources allocated to the overlay. For the sake of clarity, we do not present the distribution of the overlay’s resources among its services. Each of the runtimes has to request a provision of resources for later distribution among its services. Later, in Chapter 5, we present the results that assess the potential of this approach.

I/O subsystem

Although it is not yet implemented, the architecture leaves room for the I/O subsystem, which is responsible for managing the I/O bandwidth of each device individually. I/O reservation can be accomplished either by specifying weights, or by specifying read and write bandwidth limits and operations per second (IOPS).

When weights are used to perform I/O reservation, groups with greater weights receive a larger I/O time quantum from the I/O scheduler. This approach is used for best-effort scenarios, which do not suit our purposes. In order to provide real-time behavior, it is necessary to enforce I/O usage limits on both bandwidth and IOPS. Services that manage large streams of information, such as video streaming, do not issue a high number of I/O operations, but instead need a high amount of bandwidth.
However, low-latency data-centric services like Database Management Systems (DBMS) [120] or Data Stream Management Systems (DSMS) [121, 122] exhibit the opposite behavior: they do not need a high amount of bandwidth but, instead, require a high number of IOPS.

I/O contention can be caused by a high-consumption service that starves other services in the system, either by depleting the I/O bandwidth or by saturating the device with a number of I/O requests that exceeds its operational capabilities, such as the length of the request queue.

The progressive introduction of Solid State Disk (SSD) technology into traditional storage devices like hard drives is reshaping the approach taken to this type of resource [123]. These new devices are capable of unprecedented levels of performance, especially in terms of latency, where they are able to offer a hundredfold reduction in access times. The elimination of the mechanical components allows SSDs to offer low-latency read/write operations and deterministic behavior. An evaluation of these features is left for future work.
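The need to enforce limits on both bandwidth and IOPS, as argued in the I/O subsystem discussion, can be made concrete with a small admission check. This is a sketch only: the class names are hypothetical and the device capacities used in the usage note are invented figures, since the I/O subsystem is described in the text as not yet implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoReservation:
    name: str
    bytes_per_sec: int  # required bandwidth
    iops: int           # required I/O operations per second


class IoDevice:
    """Admission control for a single device: both limits must hold."""

    def __init__(self, max_bytes_per_sec, max_iops):
        self.max_bytes_per_sec = max_bytes_per_sec
        self.max_iops = max_iops
        self.reservations = []

    def admit(self, request):
        bandwidth = request.bytes_per_sec + sum(r.bytes_per_sec for r in self.reservations)
        operations = request.iops + sum(r.iops for r in self.reservations)
        if bandwidth > self.max_bytes_per_sec or operations > self.max_iops:
            return False  # either dimension alone is enough to reject
        self.reservations.append(request)
        return True
```

With an assumed device of 100 MB/s and 5000 IOPS, a streaming service asking for 60 MB/s and 200 IOPS and a database asking for 5 MB/s and 4500 IOPS both fit, even though each one nearly exhausts a different dimension; a second service of either kind would then be rejected.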
  • Memory subsystem

A substantial number of system faults are caused by memory depletion, normally associated with bugs, ill-defined applications, and system overuse. When an operating system reaches a critical level of free memory, it tries to free all non-essential memory, such as program caches. If this is not sufficient, then a fail-over mechanism is started. In the Linux operating system, this mechanism (the out-of-memory killer) consists of killing processes, selected heuristically, in order to release allocated memory, in an effort to prevent the inevitable system crash.

The runtime ensures that it has access to the memory it needs throughout its life-cycle by requesting a static provision from the memory subsystem. In the memory subsystem, each group reserves a portion of the total system memory, following a hierarchical distribution model, which allows the runtime to further distribute the provisioned memory among its different components, such as the P2P overlay layer and the user services.

Network subsystem

Each group in the network subsystem tags the packets generated by its threads with an ID, allowing tc (the Linux traffic controller) to identify the packets of a particular group. With this mapping it is possible to associate different priorities and scheduling policies with different groups.

This approach deals with the local aspects of network reservation, that is, the sending and receiving on the local network interfaces, but this is not sufficient to guarantee end-to-end network QoS. In order to provide this, all the hops between the two peers, such as routers, must accept and enforce the target QoS reservation. An example of an end-to-end QoS reservation is depicted in Figure 3.11.
Figure 3.11: End-to-end network reservation.

In the future, we intend to provide an end-to-end QoS signaling protocol capable of providing QoS tunnels across a segment of a network, using a protocol such as RSVP [64] or NSIS [124].

Monitoring Infrastructure

The monitoring infrastructure audits the resource usage of the underlying OS, such as CPU, memory, storage, etc. The monitoring data is gathered using the information exposed by the /proc pseudo-filesystem. For this to work, the Linux kernel must be configured to expose this information.
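As an illustration of gathering monitoring data from /proc, the following sketch derives CPU utilization from two samples of the aggregate `cpu` line of `/proc/stat`, whose fields are cumulative times for user, nice, system, idle, iowait, irq, softirq and steal, followed by guest counters. The function names are hypothetical, and this is not the monitoring daemon’s actual code.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Idle time is idle + iowait (iowait is absent on very old kernels).
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    # Guest time is already accounted for in user/nice, so only the first
    # eight counters are summed.
    total = sum(values[:8])
    return total - idle, total


def cpu_utilization(before, after):
    """Fraction of CPU time spent busy between two samples of the 'cpu' line."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 0.0 if elapsed <= 0 else (busy1 - busy0) / elapsed
```

On a live system the two samples would be obtained by reading the first line of `/proc/stat` at two points in time; a histogram can then be built by repeating this at a fixed sampling interval.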
  • The main goal of the infrastructure is to provide a resource usage histogram (currently it supports CPU and memory) that can be used for both off-line (log audit) and real-time analysis. The log analysis is helpful in detecting abnormal behaviors that are normally caused by bugs (such as memory leaks).

Currently, we use a reactive fault-detection model that only acts after a fault has occurred. With a real-time monitoring infrastructure it is possible to evolve to a more efficient proactive fault-detection model. Using a proactive approach, the runtime could predict imminent faults and take actions to eliminate, or at least minimize, the consequences of such events. For example, if a runtime detects that its storage unit, such as a hard drive, is exhibiting an increasing number of bad blocks, it could decide to migrate its services to other nodes in the overlay.

3.1.5 Operating System Interface

Our target systems can be supported by a collection of heterogeneous machines with different operating systems, so it was crucial to develop a portable runtime implementation. Additionally, fine-grained control over all the resources available in the system is paramount to achieve real-time support.
For example, in order to maintain a highly critical surveillance feed, the middleware must be able to reserve (provision) the CPU time needed to process the video frames within a predefined deadline.

To meet this goal, we chose to control and monitor the underlying resources from user space (shown in Figure 3.12), avoiding the use of specialized kernel modules. To complement this approach, we use ACE [111], a portable network framework that offers a common API that abstracts the low-level system calls offered by the different operating systems, namely, thread handling (including priorities), networking and I/O. Furthermore, ACE also provides several high-level design patterns, such as the reactor/connector design pattern, that enable the development of modular systems capable of offering high levels of performance.

The resource reservation mechanisms, including CPU partitioning, are not covered by any of the Portable Operating System Interface (POSIX) standards, so there is no common API to access them. The Linux operating system, on which our current implementation is based, provides access to the low-level resource reservation mechanisms, via the Control Groups infrastructure, through the manipulation of a set of files exposed by the cgroup pseudo-filesystem.

Nevertheless, low-level RT support in Linux is not provided out-of-the-box. A careful selection of the kernel version and a proper configuration are required. An initial evaluation was performed for kernel 2.6.33 with the rt-preempt patch (usually referred to as kernel 2.6.33-rt), but its support for Control Groups revealed several issues, resulting in unstable systems.

Figure 3.12: Operating system interface.

A second kernel version was evaluated, kernel 2.6.39-git12, which already supports almost every feature present in the rt-preempt patch and provides flawless support for Control Groups.

The Linux kernel supports a wide range of parameters that can be adjusted. However, only a small subset had a significant impact on the overall system performance and stability under RT, most notably:

    • DynTicks - dynamic ticks improve on the former static checking of timer events (usually 100, 250, 500 or 1000 Hz), allowing for a significant power reduction but, more importantly, a reduction of kernel latencies;
    • Memory allocator - the two most relevant are the SLAB [125] and SLUB [126] memory allocators. They both manage caches of objects, thus allowing for efficient allocations. SLUB is an evolution of SLAB, offering a more efficient and scalable implementation that reduces queuing and general overhead;
    • RCU - Read-Copy Update [127] is a synchronization mechanism that allows reads to be performed concurrently with updates. Kernel 2.6.39-git12 offers a novel RCU feature, the “RCU preemption priority boosting” [128]. This feature enables a task that wants to synchronize the RCU to boost the priority of all the sleeping readers to match the caller’s priority.
3.2 Programming Model

Access to the runtime capabilities is safeguarded by a set of interfaces. The main purpose of these interfaces is to provide disciplined access to resources while providing interoperability between the runtime and services that are not collocated within the same memory address space. Furthermore, they also allow a better modularization of the components of the runtime. Figure 3.13 shows the interactions between the components of the architecture through these interfaces.

Figure 3.13: Interactions between layers.

User applications and services access the runtime through the Runtime Interface. The direct control of the overlay is restricted to the core of the runtime. Access to the P2P overlay, for both services and users, is only allowed through the Overlay Interface (described in Section 3.2.2), which is accessible from the Runtime Interface. An overlay is likewise restricted in its access to the core of the runtime: it can only reach the core through the Core Interface (described in Section 3.2.3), which avoids malicious use of runtime resources by overlay plugins.

3.2.1 Runtime Interface

The Runtime Interface is the main interface available to users and services. It provides proxy-type support, allowing them to interact with runtimes that are not in the same address space through an Inter-Process Communication (IPC) mechanism. While multiple runtimes can exist in a single host, this results in a redundant resource
consumption. Our approach reduces the number of coexisting runtimes, resulting in lower resource consumption.

Figure 3.14 shows the access to the runtime from different processes. The runtime is initially bootstrapped in process 1. A virtualized service that uses the Kernel-based Virtual Machine (KVM) hypervisor is contained in process 2. Process 3 shows an additional user and service using the runtime of process 1. Support for additional languages was also considered in the design of the architecture. Processes 4 and 5 show services running inside a Java Virtual Machine (JVM) and a .NET Virtual Machine, respectively. While we only show one runtime in this example, the QoS daemon, allocated in process 6, is able to support multiple runtimes.

Figure 3.14: Multiple processes runtime usage.

The operations supported by the Runtime Interface are the following: (a) bootstrap new runtimes; (b) access previously bootstrapped runtimes; (c) start and stop services, both local and remote; (d) attach new overlay plugins on-the-fly; (e) allow access to the overlay, through the Overlay Interface; and (f) create clients to interact with service instances.

3.2.2 Overlay Interface

The main goal of the Overlay Interface is to provide disciplined access to a subset of the underlying overlay infrastructure, supporting performance goals, for example avoiding
lengthy code paths that can lead to the creation of hot paths, while enforcing proper isolation and thus preventing misuse of shared resources by rogue or misbehaving services or applications.

The Overlay Interface provides access to the overlay Mesh and Discovery services. These services and the overall overlay architecture were described in Section 3.1.3. We plan to extend the architecture to provide access to the underlying FT service to allow the dynamic manipulation of the replication policy used in a replication group. However, as different replication policies have different resource requirements, it is necessary to provide additional support for dynamic changes in resource reservation assignments, that is, increasing or decreasing the amount of resources associated with a resource reservation. To address this, we plan to enhance our QoS daemon in order to provide the necessary support.

3.2.3 Core Interface

Every overlay implementation has to interact with the core. This interaction is mediated through the Core Interface, which is only accessible to the overlay plugin.

The operations supported by the Core Interface are the following: (a) start and stop local services; (b) create replicas for the fault-tolerance service; and (c) retrieve information about service instances and resource availability.

The creation and destruction of local services are issued by the Mesh service upon the reception of requests from remote peers. These requests are then redirected to the core of the runtime by the Core Interface. When a new service is created, the core requests the Service Manager to create a new service instance and makes the necessary QoS resource reservations with the QoS daemon, through the QoS client.
On the other hand, when destroying a service instance, the core just has to request the removal of the instance by the Service Manager.

The creation and removal of replicas are issued by the FT service upon the reception of requests from a replication group. Such requests are normally made by the coordinator of the replication group, although this is implementation dependent. As with the previous case, the request for the creation or removal of a replica is handled by the core, after being redirected by the Core Interface. In the case of the removal of a replica, the core forwards the request to the proper replication group through the fault-tolerance service. On the other hand, in the case of the creation of a new replica, the core makes the QoS resource reservations that are needed to maintain both the service instance (that will
act as a replica of the primary service instance) and the replication group, that is, the infrastructure necessary to enforce the replication mechanisms. The retrieval of information about service instances and resource availability is used by the Discovery service in response to queries.

3.3 Fundamental Runtime Operations

The runtime manages resources, services and clients. Its main operations are: the initial runtime creation and corresponding bootstrap; the creation of local and remote services, with and without fault-tolerance; and the creation of clients for user services.

3.3.1 Runtime Creation and Bootstrapping

The creation and initialization of the runtime, normally designated as bootstrap, is a three-phase process, as shown in Figure 3.15.

Figure 3.15: Creating and bootstrapping of a runtime (panels a, b and c).

The creation of the middleware, shown in Figure 3.15a, is accomplished by the user through the Runtime Interface. At this point, the runtime does not have an active overlay infrastructure. The user is responsible for choosing a suitable overlay implementation (plugin) and for attaching it to the runtime (shown in Figure 3.15b). In the final
phase, the user bootstraps the newly created runtime (depicted in Figure 3.15c). This bootstrapping process is governed by the core. If the runtime is configured to use QoS reservation then the core connects to the QoS daemon and reserves the necessary resources. Otherwise, step 2 is omitted, and no interaction is made with the QoS daemon.

Listing 3.1: Overlay plugin and runtime bootstrap.
    1 RuntimeInterface* runtime = 0;
    2 try {
    3     runtime = RuntimeInterface::createRuntime();
    4     Overlay* overlay = createOverlay();
    5     runtime->attachOverlay(overlay);
    6     runtime->start(args);
    7 } catch (RuntimeException& ex) {
    8     Log("Runtime creation failed"); // handle error
    9 }

The code snippet necessary to create and bootstrap a runtime is shown in Listing 3.1. Line 3 shows the creation of the runtime, as previously illustrated in Figure 3.15a. At this time, only the basic infrastructure is created and the runtime is still not bootstrapped. This is followed by the creation of the chosen overlay implementation, which is then attached to the runtime, in lines 4 and 5 (corresponding to the illustration of Figure 3.15b). Finally, the whole process is completed, in line 6, with the bootstrap of the runtime, which implicitly bootstraps the overlay (as shown in Figure 3.15c).

3.3.2 Service Infrastructure

The life cycle of a service starts with its creation and terminates with its destruction. The service infrastructure provides the user with such mechanisms. This section starts with an in-depth view of the local creation of services, as first introduced in Section 3.1.1. Then follows a detailed view of the mechanisms that regulate the creation of remote services with and without FT support. It concludes with the complete outline of the service deployment mechanisms.

Local Service Creation

The steps involved in instantiating a new local service are depicted in Figure 3.16.
The user, through the Runtime Interface, requests the creation (and bootstrap) of a new local service instance (step 1). The core of the runtime redirects the request to the service manager for further handling (step 2). The first step to be taken by the service
manager is to determine if the service implementation is known. If the service is not known, then the core tries to find the respective implementation using the discovery service in the overlay (step omitted). If the implementation is found then it is transferred back to the requesting peer and the service creation can continue. Otherwise the creation of the service is aborted.

Figure 3.16: Local service creation.

If the runtime was bootstrapped with resource reservation enabled then, once the service implementation is retrieved, it is possible to retrieve its QoS requirements. Knowing these requirements, the runtime tries to reserve the corresponding resources through a QoS client (shown as dashed lines in steps 3 and 4). If the requested resources are available then the service is instantiated, otherwise the service creation is aborted. If the resources are available but the service does not successfully start, all the associated resource reservations are released.

If, on the other hand, the runtime does not have the resource reservation infrastructure enabled, then once the service implementation is known and retrieved, the core can immediately instantiate a local service instance.

Listing 3.2 has the code snippet necessary to bootstrap a new local service instance. Line 1 shows the initialization of the service parameters, which are wrapped by a smart pointer variable, allowing for safe manipulation by the runtime. The actual service creation is done in line 4 and is performed by the startService() method, which takes the following parameters: the SID of the service to be created; the service parameters; and the peer where the service is to be launched, which in this case is given by the Universally Unique Identifier (UUID) of the local runtime. Upon the successful creation of the service instance, the parameter iid of the call startService() will contain its instance identifier.

Listing 3.2: Transparent service creation.
    1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
    2 try {
    3     UUIDPtr iid;
    4     runtime->startService(sid, paramsPtr, runtime->getUUID(), iid);
    5 } catch (ServiceException& ex) {
    6     Log("Service creation failed"); // handle error
    7 }

Remote Service Creation

There are two distinct approaches to create remote services. A user can either explicitly specify the peer to host the service or, alternatively, leave the decision of finding a suitable place for hosting the service to the middleware. This last approach is the default way to bootstrap services.

Figure 3.17: Finding a suitable deployment site.

Figure 3.17 shows the mechanism associated with the search for a suitable place to deploy a new service instance, within a hierarchical mesh overlay, where each level of the tree is maintained by a cell. Cells are logical constructions that maintain portions of the overlay space and provide mesh resilience.

The requesting peer uses the discovery service of the overlay to perform a Place of Launch (PoL) query. This query retrieves the information about a suitable hosting peer. However, as previously stated, the resolution of the query is totally dependent on the overlay implementation. In the example provided by Figure 3.17, the query issued by peer A is relayed until it reaches peer C. This peer is able to satisfy the query and replies back to peer B, which in turn replies back to peer A. After receiving the query reply, indicating peer D as the deployment site, peer A requests a remote service creation at peer D. We describe an implementation for this mechanism in Chapter 4.
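The relay of a Place of Launch query through a hierarchical overlay can be sketched as follows. This is only an illustration under stated assumptions: the OverlayNode type, the free CPU table and the rule of relaying the query to the parent are hypothetical, since the actual resolution strategy depends on the overlay implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical view of one node of the hierarchical overlay.
struct OverlayNode {
    std::string id;
    const OverlayNode* parent;                // nullptr at the root
    std::map<std::string, int> knownFreeCpu;  // peer id -> free CPU (%) known here
};

// Relays the query upwards until some node knows a peer with at least
// `requiredCpu` percent of free CPU. Returns the chosen peer id, or an
// empty string when the query reaches the root unanswered.
std::string resolvePlaceOfLaunch(const OverlayNode& start, int requiredCpu) {
    for (const OverlayNode* node = &start; node != nullptr; node = node->parent) {
        for (const auto& entry : node->knownFreeCpu) {
            if (entry.second >= requiredCpu) {
                return entry.first;  // this node can satisfy the query
            }
        }
    }
    return std::string();
}
```

With a chain A, B, C in which only C knows a peer D with enough free CPU, a query started at A is relayed twice before D is returned, which matches the shape of the example in Figure 3.17.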
Figure 3.18: Remote service creation without fault-tolerance.

In order to create a remote service (Figure 3.18), the user at peer A makes the request through the Runtime Interface (steps 1 and 2). The core of the runtime uses its mesh service to request the creation of the wanted service from the remote peer (steps 3 and 4). The mesh service of the remote peer, after receiving the request for the creation of a new service instance, uses the Core Interface (step 5) to redirect the request to the core of its runtime (step 6). At this point, the remote peer uses the previously described procedure for local service creation (Figure 3.16). The dashed lines represent the optional use of resource reservation.

The code snippet shown in Listing 3.3 creates two remote service instances: one uses explicit deployment and the other uses transparent deployment. Line 1 shows the initialization of the service parameters that are used in the creation of both service instances. Line 5 shows the creation of a remote service instance using explicit deployment. The remote peer that will host the instance is given by the remotePeerUUID variable. Line 7 shows the creation of a remote service instance when using transparent deployment. Upon the successful creation of the service instance, the last parameter used in the call to startService() contains the instance identifier for the newly created service instance.

Listing 3.3: Service creation with explicit and transparent deployments.
    1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
    2 try {
    3     UUIDPtr explicitIID, transparentIID;
    4     // explicit deployment
    5     runtime->startService(sid, paramsPtr, remotePeerUUID, explicitIID);
    6     // or, transparent deployment
    7     runtime->startService(sid, paramsPtr, transparentIID);
    8 } catch (ServiceException& ex) {
    9     Log("Service creation failed"); // handle error
   10 }

Remote Service Creation With Fault-Tolerance

When creating a remote service with fault-tolerance (Figure 3.19), in response to a request from another peer (steps 1 to 4), the remote peer acts as the main instance, also known as the primary node, for that service (steps 5 to 8). Before being able to instantiate the service, the primary node first has to find the placement for the requested number of replicas (step omitted). This process is delegated to and governed by the FT service (step 9). As before, the dashed lines (step 8) represent optional paths, taken only when resource reservation is used.

Figure 3.19: Remote service creation with fault-tolerance: primary-node side.

The fault-tolerance service, using its underlying mechanisms, which are dependent on the implementation of the overlay, tries to find the optimal placements on the mesh to instantiate the needed replicas. In a typical implementation this is normally accomplished through the use of the discovery service. Depending on the overlay topology, finding the optimal placement can be intractable, as in ad-hoc topologies, so systems often implement more structured topologies or heuristics.

Given the modularity of the architecture, it is possible to configure, for each service, the type of fault-tolerance strategy to be used, such as semi-active or passive replication, allowing a better fit to the service’s needs.
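As an illustration of the kind of heuristic mentioned above, the following sketch picks replica hosts greedily by free CPU. It is not the placement algorithm of the middleware: the Candidate type, the single CPU metric and the greedy rule are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of a peer returned by a discovery query.
struct Candidate {
    std::string peer;
    int freeCpu;  // free CPU, in percent
};

// Greedy heuristic: discard peers that cannot satisfy the reservation, then
// take the `replicas` peers with the most free CPU. Returns an empty vector
// when there are not enough suitable peers.
std::vector<std::string> pickReplicaHosts(std::vector<Candidate> candidates,
                                          std::size_t replicas,
                                          int requiredCpu) {
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [requiredCpu](const Candidate& c) {
                           return c.freeCpu < requiredCpu;
                       }),
        candidates.end());
    if (candidates.size() < replicas) {
        return {};
    }
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.freeCpu > b.freeCpu;
                     });
    std::vector<std::string> hosts;
    for (std::size_t i = 0; i < replicas; ++i) {
        hosts.push_back(candidates[i].peer);
    }
    return hosts;
}
```

A greedy choice like this is cheap to compute but ignores other concerns, such as network distance between replicas, that a real placement policy would likely weigh.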
The primary node, using the FT service, creates the replication group that will support replication for the service. To create the replication group, the FT service uses the placement information to create replicas for the group.

Figure 3.20: Remote service creation with fault-tolerance: replica creation.

The process of creating a new replica is shown in Figure 3.20. After receiving the request to join the replication group through the FT service (steps 1 to 2), the replica proceeds as previously described for the local service creation (steps 3 to 6). We describe the algorithms that materialize the behavior for the different types of replication policies in Chapter 4.

Listing 3.4: Service creation with Fault-Tolerance support.
    1 FTServiceParams* ftParams = createFTParams(nbrOfReplicas, FT::SEMI_ACTIVE_REPLICATION);
    2 ServiceParamsPtr paramsPtr(new ServiceParams(sid, ftParams));
    3 try {
    4     UUIDPtr iid;
    5     runtime->startService(sid, paramsPtr, iid);
    6 } catch (ServiceException& ex) {
    7     Log("Service creation failed"); // handle error
    8 }

Listing 3.4 shows the code snippet that is necessary to bootstrap a remote service with FT support. Line 1 shows the initialization of the FT parameters with a total of nbrOfReplicas replicas and using semi-active replication. The actual service creation is done in line 5. Upon the successful creation of the service instance, the parameter iid of the call startService() will contain its instance identifier, and sid the system-wide identifier for the service.
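Each replica follows the local creation path described earlier, in which reservations are released when the service fails to start. A minimal sketch of that reserve-then-start step is given below. The CpuBudget class is a hypothetical stand-in for the accounting done by the QoS daemon, and startWithReservation is not part of the middleware API.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the QoS daemon: tracks free CPU in percent.
class CpuBudget {
public:
    explicit CpuBudget(int total) : free_(total) {}

    bool reserve(int amount) {
        if (amount > free_) {
            return false;  // not enough resources left
        }
        free_ -= amount;
        return true;
    }

    void release(int amount) { free_ += amount; }
    int available() const { return free_; }

private:
    int free_;
};

// Reserves the resources, then starts the service. The reservation is kept
// only when the service starts successfully; otherwise it is released.
bool startWithReservation(CpuBudget& budget, int requiredCpu,
                          const std::function<bool()>& startService) {
    if (!budget.reserve(requiredCpu)) {
        return false;  // resources unavailable: creation is aborted
    }
    if (!startService()) {
        budget.release(requiredCpu);  // start failed: undo the reservation
        return false;
    }
    return true;
}
```

For instance, calling startWithReservation(budget, 40, start) on a budget of 100 leaves 60 available when start succeeds and leaves the budget unchanged when it fails.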
3.3.3 Client Mechanisms

The interactions between a user and a service instance are supported by a client. A client is a proxy between the user and a service instance that is responsible for handling all the underlying communication and resource reservation mechanisms. The runtime provides a flexible infrastructure that does not impose any type of architectural restrictions on either the design of a client or the type of interaction that can take place. Figure 3.21 shows the creation and bootstrap sequence of a client.

Figure 3.21: Client creation and bootstrap sequence (panels a and b).

The creation of a client, shown in Figure 3.21a, starts with the user requesting a new client through the Runtime Interface. Upon receiving the client creation request, the core of the runtime uses the service factory to check if the service implementation is known. If it is known then the core of the runtime returns a new client to the user from the service implementation, otherwise the creation of the client is aborted.

After retrieving the client, the user must find a suitable service instance to connect to (shown in Figure 3.21b). After retrieving the Overlay Interface through the Runtime Interface, the user uses the discovery service to search for a suitable instance (the calling path is identified as 1). This is followed by the reply from the discovery service that is returned to the user (calling path identified as 2), in this case indicating peer B as the owner of a service instance.

If the user wishes to use resource reservation, then it must use the underlying resource
reservation infrastructure. This optional step is shown as a dashed line (calling path identified as 3). To finish the bootstrap sequence, the user must use the information about the service instance that was returned by the discovery service and connect to the service (step 4).

Listing 3.5 shows the code snippet necessary to create a client, using the RPC service as an example.

Listing 3.5: Service client creation.
    1 try {
    2     ClientParamsPtr paramsPtr(new ClientParams(QoS::RT, CPUQoS::MAX_RT_PRIO));
    3     ServiceClient* client = runtime->getClient(sid, iid, paramsPtr);
    4     RPCServiceClient* rpcClient = static_cast<RPCServiceClient*>(client);
    5     RPCTestObjectClient* rpcTestObjectClient = new RPCTestObjectClient(rpcClient);
    6     rpcTestObjectClient->ping();
    7 } catch (ServiceException& ex) {
    8     Log("Client creation failed"); // handle error
    9 }

Prior to the actual creation of the client, the user must initialize the ClientParamsPtr parameter with the desired QoS properties. In line 2 of Listing 3.5, this parameter is initialized to use the maximum RT priority. The actual creation of the client is done in line 3. The runtime returns a generic ServiceClient pointer that must be downcast to the proper client implementation. In line 4, the generic pointer is converted to a generic RPC client, which manages the low-level infrastructure that handles invocations and replies. Line 5 shows the creation of the RPC “stub”, which is responsible for marshaling requests and unmarshaling replies. During the creation of the stub, the generic RPC client is attached to it.
Line 6 shows an actual one-way RPC invocation of a ping operation.

3.4 Summary

This chapter started by presenting the architecture of the runtime of a P2P middleware, providing an overview of all the layers that compose the runtime: the applications layer, which contains all the services and users that run on top of the middleware; the core layer, which is responsible for the overall management of the runtime; the overlay abstraction layer, which provides the abstractions to the low-level P2P services; the support framework, which provides a set of high-level abstractions for network communications and QoS management; and
the Linux/ACE layer, which provides an abstraction of the underlying Linux operating system through the ACE framework.

We then provided a detailed insight into the programming model, exposing the interfaces that must be used to access the runtime capabilities. Furthermore, we described the advantages of these programming interfaces, specifically their ability to provide modularity, interoperability and controlled access to runtime resources.

The chapter ended with an overview of the fundamental operations present in the middleware, namely: runtime creation and bootstrap; local service creation; remote service creation with and without FT; and client creation.
4 Implementation

–With great power comes great responsibility.
Voltaire

This chapter presents the implementation details of the runtime, focusing on the underlying mechanisms that are present in the P2P services of our overlay implementation. Additionally, we present three service implementations that showcase the runtime capabilities, more precisely, an RPC-like service, an actuator service and a streaming service.

4.1 Overlay Implementation

Figure 4.1: The peer-to-peer overlay architecture.

As a proof-of-concept for this prototype, we have chosen the P3 [15] topology, which follows a hierarchical tree P2P mesh. A representation of such a topology is shown in Figure 4.1. There are three different types of peers present in our implementation:
peers, coordinator peers and leaf peers. The peers are responsible for maintaining the organization of the overlay and for providing access points to the overlay for leaf peers.

Each node in a P3 network corresponds to a cell, a set of peers that collaborate to maintain a portion of the overlay. Cells are logical constructions that provide overlay resilience and are central in our implementation of fault-tolerance mechanisms. Each cell is coordinated by one peer, denominated the coordinator peer. Every other peer in the cell is connected to the coordinator, allowing for efficient group communication. If the coordinator fails, one of the peers in the cell takes its place and becomes the new coordinator. The communication between distinct cells is accomplished through point-to-point connections (TCP/IP sockets) between the coordinators of the cells.

The last type of peer present in the overlay is known as the leaf peer. These peers do not have any type of responsibilities in maintaining the mesh. Typically, they use the overlay capabilities, for instance, to advertise the presence of a sensor or simply to act as a client. This type of peer does not host any user service, but instead relies on the overlay to host them.

The original P3 topology [15] follows a hierarchical organization that had a significant problem. When a coordinator of a cell crashes it causes a cascade failure, with its children coordinators propagating the failure to the remaining sub-trees. We explored this problem in previous work [6], and concluded that it was directly linked to the rigid naming scheme of the P3 architecture. In case of a cell failure, the cell and its sub-trees would have to perform a complete rebind to the mesh, and thus had to contact the root node of the tree to find a new suitable position.
This caused two obvious problems: the overhead (and time) of rebinding all the cells, and the bottleneck in the root node.

To avoid these limitations, we modified the original P3 topology. The problems associated with the rigid naming scheme of P3 were avoided through the design and implementation of a new faulty architecture. This type of architecture focuses on reducing the impact of faults, as it assumes that they happen frequently, taking special care to eliminate, or at least minimize, the occurrence of cascade failures. To achieve this, the middleware introduces a new flexible naming scheme that removes all inter-dependencies between cells, and therefore allows the migration of entire sub-trees between different portions of the tree.

The developer, however, is free to implement any type of topology and behavior within an overlay implementation for the middleware; it only has to implement the Overlay Interface. This interface is composed of three basic P2P services. The mesh service, described in sub-section 4.1.2, handles all the management for the overlay. In a sense
it is the most fundamental service since it provides the infrastructure for all the other services. The discovery service, detailed in sub-section 4.1.3, supports the infrastructure for generic querying. Last, the FT service provides the infrastructure for the fault-tolerance mechanisms present in the overlay and is described later in this chapter.

4.1.1 Overlay Bootstrap

The bootstrap of an overlay is requested by the core of the runtime on behalf of the user. Figure 4.2 illustrates this bootstrap process. The overlay sequentially bootstraps the mesh (step 1), discovery (step 2) and fault-tolerance (step 3) services.

Figure 4.2: The overlay bootstrap.

The bootstrap process is implemented by the Overlay:start() procedure and it is shown in Algorithm 4.1. This procedure starts the mesh, discovery and FT services. The order in which the services are started is conditioned by the dependencies between the services, as both the discovery and fault-tolerance services need the information about SAPs of homologous services in neighbor peers, and this information is provided by the mesh service.

Algorithm 4.1: Overlay bootstrap algorithm
    1 procedure Overlay:start()
    2     for service in [Mesh, Discovery, Fault-Tolerance] do
    3         service.start()
    4     end for
    5 end procedure
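The ordered start of Algorithm 4.1 can be written in C++ roughly as follows. This is a sketch only: the OverlayService class is a hypothetical stand-in that just records the order in which the services are started, and it does not reflect the actual service classes of the overlay.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an overlay service; start() only records its name.
class OverlayService {
public:
    OverlayService(std::string name, std::vector<std::string>& startLog)
        : name_(std::move(name)), startLog_(startLog) {}

    void start() { startLog_.push_back(name_); }

private:
    std::string name_;
    std::vector<std::string>& startLog_;
};

class Overlay {
public:
    explicit Overlay(std::vector<std::string>& startLog)
        : mesh_("Mesh", startLog),
          discovery_("Discovery", startLog),
          faultTolerance_("Fault-Tolerance", startLog) {}

    // Mirrors Algorithm 4.1: the mesh is started first because the other two
    // services depend on the information it provides.
    void start() {
        OverlayService* const ordered[] = {&mesh_, &discovery_, &faultTolerance_};
        for (OverlayService* service : ordered) {
            service->start();
        }
    }

private:
    OverlayService mesh_;
    OverlayService discovery_;
    OverlayService faultTolerance_;
};
```

Keeping the order in a single array makes the dependency between the services explicit in one place, which is the point the algorithm makes.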
4.1.2 Mesh Service

The mesh service is the central component in our overlay implementation. It acts as an overlay manager and it is also responsible for the creation and removal of high-level services from the overlay, as previously described in Chapter 3.

A mesh service must extend the Mesh Interface, but is free to implement any type of organizational logic. Nevertheless, in a typical implementation, the mesh service normally has a mesh discovery sub-service, responsible for providing a dynamic discovery mechanism for peers in the overlay. The discovery service, described in Section 4.1.3, in contrast provides a generic infrastructure capable of handling high-level queries. It is not possible to use the generic discovery service to search for peers in the overlay because of the dependencies between the mesh and discovery services, as explained previously.

A possible implementation for this type of mesh discovery mechanism could be accomplished through the use of a well-known portal. This has the advantage of being simple to implement but inherently represents both a bottleneck and a single point of failure. To overcome these limitations, our overlay has a discovery mechanism, one in each cell, that uses low-level multicast sockets to provide a distributed and efficient mesh discovery implementation. Figure 4.3 provides an overview of the major components in a cell. Each peer participating in a cell has a cell object that contains a cell discovery object and a cell group object: the cell object provides a global view of the cell to the local peer; the cell discovery object provides the support, through the use of multicast sockets, for the cell discovery mechanism; and the cell group object provides the group communications within the cell.

Figure 4.3: The cell overview.

Building and membership

The membership mechanism allows a peer to join the peer-to-peer overlay (Figure 4.4). The process starts with a request for a binding cell (step 1).
This request has to be
made to the root cell, which in turn replies with a tuple comprising a suitable cell, its corresponding coordinator, and the parent cell and coordinator (if available). The next step is the active binding (step 2), which is further sub-divided into two possibilities (steps 2-a and 2-b).

The multicast address used for the root cell discovery is a static and well-known value. While this can be seen as a single point of failure, in the presence of a cell crash, that is, when all the peers in a cell have crashed, the root cell is replaced by one of its children cells that belong to the first level of the tree. The process behind failure handling and recovery is described further below.

Figure 4.4: The initial binding process for a new peer.

Upon receiving the reply, and if the returned cell exists (step 2-a), the joining peer connects to the coordinator (step 3-a). Otherwise, if the cell is new, it becomes the coordinator for the cell (step 2-b). If the target cell is not the root cell, and if the peer is the coordinator of the cell, then it connects to the coordinator peer of its parent cell (step 3-b).

To finalize the binding process, the peer has to formalize its membership by sending a join message, as illustrated in Figure 4.5. At this point, the peer sends a join message to its parent (step 1), if it is the coordinator of the cell, or sends the message
to the coordinator of the cell (step 1-a), which forwards it to its parent (step 1-b). This message is propagated through the overlay until it reaches the root cell. It is the responsibility of the root cell to validate the join request and to reply accordingly. The reply is propagated through the overlay downwards to the joining peer (step 3). After this, the peer is part of the overlay.

Figure 4.5: The final join process for a new peer.

The mesh construction algorithm is depicted in Algorithm 4.2. To enter the mesh, a new peer calls the Mesh:start() procedure, which creates a cell discovery object for accessing the root cell discovery infrastructure (line 2), which has a well-known multicast address. This object is then used to request a cell to which the joining node will connect itself, by making a call to the cellRootDiscoveryObj.requestCell() procedure (shown in Algorithm 4.6, lines 1-8). This procedure multicasts a discovery message that tries to find the peer-to-peer overlay. If it fails, then no peer is present in the root cell, and the call to the Cell:requestCell() procedure returns the information associated with the root cell, more specifically, the well-known multicast address used for the root cell discovery. Otherwise, the appropriate bind information is returned. Using this binding information, a new cell object is created and initialized (lines 4-5).
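The fallback behavior of the cell request can be sketched as below. The BindInfo fields, the callback that stands in for the multicast exchange, and the addresses used in the usage example are all hypothetical; the actual procedure is the one referenced in the text.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, reduced form of the binding information.
struct BindInfo {
    std::string cellDiscoveryAddress;
    bool isCoordinator;
    bool isRoot;
};

// `multicastQuery` stands in for the multicast discovery exchange: it fills
// in the binding information and returns true when some peer answers.
// When nobody answers, the joining peer falls back to the root cell, which
// it will coordinate, using the well-known root discovery address.
BindInfo requestCell(const std::function<bool(BindInfo&)>& multicastQuery,
                     const std::string& rootDiscoveryAddress) {
    BindInfo info{std::string(), false, false};
    if (multicastQuery(info)) {
        return info;  // the overlay exists: use the returned binding
    }
    return BindInfo{rootDiscoveryAddress, true, true};
}
```

For example, with a query that never gets an answer, requestCell returns a binding flagged as root and coordinator, which corresponds to the first peer creating the overlay.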
Algorithm 4.2: Mesh startup
    1 procedure Mesh:start()
    2     cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
    3     bindInfo ← cellRootDiscoveryObj.requestCell()
    4     cellObj ← Cell:createCellObject()
    5     cellObj.start(bindInfo)
    6 end procedure

Cell Bootstrap

The binding information returned by the cell discovery mechanism has all the information needed for the cell initialization (as shown in Figure 4.4). In Algorithm 4.3, we show the algorithm that rules the initialization of a cell.

Algorithm 4.3: Cell initialization
      var: this // The current cell object
    1 procedure Cell:start(bindInfo)
    2     bindingCellInfo ← bindInfo.getBindingCellInfo()
    3     if not bindingCellInfo.isCoordinator() then
    4         cellGroupObj ← Cell:bindToCoordinatorPeer(bindingCellInfo.getCoordInfo())
    5     else
    6         parentPeerInfo ← ∅
    7         if not bindingCellInfo.isRoot() then
    8             parentPeerInfo ← bindInfo.getParentCellCoordInfo()
    9         end if
   10         cellGroupObj ← Cell:createCellGroup(parentPeerInfo)
   11     end if
   12     cellGroupObj.requestJoin()
   13     cellDiscoveryAddr ← bindingCellInfo.getCellDiscoveryAddress()
   14     cellDiscoveryObj ← Cell:createCellDiscovery(cellDiscoveryAddr)
   15     this.attach(cellDiscoveryObj)
   16 end procedure

The bootstrap of the cell object is performed using the Cell:start() procedure, which takes bindInfo as its argument. This bootstrap process is dependent on the state of the target cell. The call to the bindingCellInfo.isCoordinator() method indicates whether the local peer is the coordinator of this cell. If the peer is not the coordinator peer (Figure 4.4, step 2-a) for the cell then it has to join the cell group by binding to the cell group’s coordinator peer (line 4). On the other hand, if the peer is the coordinator (Figure 4.4, step 2-b), then it checks if the cell is the root. If the peer is in the root cell, then the bootstrap is finished, otherwise it must connect to its parent cell coordinator, and link the newly created cell to its parent cell (lines 5-10).
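The branch taken in lines 3 to 11 of Algorithm 4.3 can be condensed into a small decision function. The enumeration and the reduced BindingCellInfo structure below are hypothetical names introduced for this sketch; they only encode the three cases described in the text.

```cpp
#include <cassert>

// The three binding cases described for a newly arrived peer.
enum class BindAction {
    JoinCoordinator,       // not the coordinator: bind to the cell coordinator
    CreateRootCell,        // coordinator of the root cell: no parent to contact
    CreateCellUnderParent  // coordinator of a non-root cell: link to the parent
};

// Hypothetical, reduced form of the binding cell information.
struct BindingCellInfo {
    bool isCoordinator;
    bool isRoot;
};

BindAction decideBindAction(const BindingCellInfo& info) {
    if (!info.isCoordinator) {
        return BindAction::JoinCoordinator;
    }
    return info.isRoot ? BindAction::CreateRootCell
                       : BindAction::CreateCellUnderParent;
}
```

A peer that is not the coordinator always joins the existing coordinator, regardless of whether the cell is the root, which is why the isRoot flag is only inspected on the coordinator branch.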
Regardless of whether the newly arrived peer is a non-coordinator in a cell group, or a coordinator of a non-root cell, it must propagate its membership by using a join message. In line 12, the call to cellGroupObj.requestJoin() initiates the process. The join process is depicted in Figure 4.5, while the cell group communication is shown in Figure 4.6.

Lines 13-15 show the creation of the cell discovery object that will be associated with this cell, with the multicast address being provided by the bindingCellInfo. After the creation, it is attached to the cell in line 15, enabling the cell to handle cell discovery requests.

Cell State and Communications

When a peer is running inside a cell, it is either a coordinator or a non-coordinator peer providing redundancy to the coordinator. Any external peer that connects to the cell must connect through the coordinator peer. It is the coordinator's responsibility to validate any incoming request. If the request is valid and accepted, the coordinator sends the request to its parent (if applicable). After receiving the reply from its parent, the coordinator updates the state of the cell by synchronizing with all the active peers. This synchronization is done using our group communication infrastructure, which is shown in Figure 4.6.

The synchronization process inside a cell can be divided into two cases, depending on whether the synchronization is initiated by the coordinator or by a non-coordinator peer. When the synchronization is initiated by the coordinator peer, shown in Figure 4.6a, it starts by sending the request to its parent peer (step 1), which is recursively sent towards the root cell (step 2). After the root cell is synchronized, that is, after the request is sent to all active peers and their replies have been received, an acknowledgment message is sent downwards to the originating cell (step 3).
Upon receiving the acknowledgment from its parent, each coordinator peer repeats the same process, that is, it synchronizes its cell (steps 4 and 5) and sends an acknowledgment downwards (step 6). When the acknowledgment reaches the originating cell, the request is synchronized (steps 7 and 8).

The synchronization process can be performed either in parallel or sequentially. Although we do not provide benchmarks, we have done a preliminary empirical assessment of the optimal transmission strategy. Early testing shows that for a small number of peers, the best transmission strategy is to send the requests sequentially. However, for a larger number of peers, the best transmission strategy is to send them in parallel, by using a pool of threads to perform the transmissions simultaneously. This behavior can be explained by the overhead associated with enqueuing the send requests in multiple threads. However, as the number of peers increases, the cost of sending the requests sequentially surpasses the overhead of the parallel transmission.

Figure 4.6: Overview of the cell group communications. (a) Synchronization initiated by the coordinator. (b) Synchronization initiated by a follower.
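The choice between sequential and parallel transmission discussed above can be illustrated with a small Python sketch. It is only an illustration under assumptions: the broadcast function, the send callback and the PARALLEL_THRESHOLD value are hypothetical, since no measured cut-over point is reported.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed cut-over point; the text reports no measured value.
PARALLEL_THRESHOLD = 8


def broadcast(peers: List[str],
              send: Callable[[str], bool],
              threshold: int = PARALLEL_THRESHOLD) -> Dict[str, bool]:
    """Send to every peer and return a peer -> success map."""
    if len(peers) < threshold:
        # Few peers: a plain loop avoids the cost of queuing work on threads.
        return {peer: send(peer) for peer in peers}
    # Many peers: overlap the transmissions using a pool of threads.
    with ThreadPoolExecutor(max_workers=min(32, len(peers))) as pool:
        results = list(pool.map(send, peers))  # map() preserves input order
    return dict(zip(peers, results))
```

In a real deployment the threshold would have to be tuned against the cost of a single send, which is why it is exposed as a parameter here.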
Figure 4.6b shows the communication steps required when the synchronization is initiated by a non-coordinator peer. Here, the peer must send the request to the coordinator peer (step 1). Upon receiving the request, the coordinator peer performs the same process as in Figure 4.6a. It starts by propagating the request towards the root cell (steps 2 and 3), with the respective acknowledgment being sent after the root cell synchronizes (step 4). All the coordinator peers that belong to the cells between the root cell and the originating cell synchronize the request within their cell after receiving the acknowledgment from their parent. When the acknowledgment reaches the originating cell, the coordinator peer spreads the request through the remaining active peers and waits for their replies (steps 8 and 9). Last, the coordinator peer sends an acknowledgment back to the originating peer (step 10).

The cell communication algorithms are shown in Algorithms 4.4 and 4.5, and they expose the previously described roles that are present in the architecture: the coordinator and non-coordinator roles.

Algorithm 4.4: Cell group communications: receiving-end
var: this // the current cell communication group object
var: cellObj // the cell object associated with the communication group
var: coordinatorPeer // the cell coordinator peer
 1 procedure CellGroup:coordinatorHandleMsg(peer,msg)
 2   if not msg.isAckMessage() then
 3     ackMessage ← cellObj.processMsg(msg)
 4     if not isRoot() then
 5       request ← this.getParentPeer().sendMessage(msg)
 6       request.waitForCompletion()
 7       if request.failed() then
 8         this.handleParentFailure()
 9       end if
10     end if
11     this.sendMessage(msg)
12     peer.sendMessage(ackMessage)
13   else
14     this.updatePendingRequests(msg)
15   end if
16 end procedure
17 procedure CellGroup:nonCoordinatorHandleMsg(msg)
18   if not msg.isAckMessage() then
19     ackMessage ← cellObj.processMsg(msg)
20     coordinatorPeer.sendMessage(ackMessage)
21   else
22     this.updatePendingRequests(msg)
23   end if
24 end procedure
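The ordering enforced by the coordinator in Algorithm 4.4 (parent first, then the local cell, then the acknowledgment) can be sketched in Python. This is a simplified in-memory model under assumptions: the CellCoordinator class and its trace list are hypothetical, real peers would exchange network messages, and failures are not simulated.

```python
from typing import List, Optional, Sequence


class CellCoordinator:
    """Hypothetical in-memory coordinator of one cell."""

    def __init__(self, name: str, trace: List[str],
                 parent: Optional["CellCoordinator"] = None,
                 followers: Sequence[str] = ()):
        self.name = name
        self.trace = trace                 # shared list recording each send
        self.parent = parent               # coordinator of the parent cell
        self.followers = list(followers)   # non-coordinator peers of this cell

    def handle(self, msg: str, source: Optional[str] = None) -> bool:
        """Synchronize the parent first, then the local cell, then acknowledge."""
        if self.parent is not None:
            # Lines 5-9: forward upwards and wait for the parent's reply.
            # This model never fails; a False here is where recovery would start.
            if not self.parent.handle(msg, source=self.name):
                return False
        # Line 11: synchronize every follower except the sender of the message.
        for follower in self.followers:
            if follower != source:
                self.trace.append(f"{self.name}->{follower}:{msg}")
        # Line 12: acknowledge to the requester.
        return True
```

With a root cell holding follower r1 and a child cell holding followers a and b, a join sent by a is recorded at the root before it reaches b, which matches the top-down synchronization order of Figure 4.6b.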
If a peer is the coordinator of the cell, then all the incoming messages (from the cell or from children cells) are processed by the CellGroup:coordinatorHandleMsg() procedure; otherwise, the CellGroup:nonCoordinatorHandleMsg() procedure is used to process the incoming messages.

In the CellGroup:coordinatorHandleMsg() procedure, the coordinator receives a new message and processes it if it is not an acknowledgment (line 3). After the message is processed and validated by the coordinator (line 3), and if the coordinator does not belong to the root cell, then it must forward the message to its parent cell coordinator and wait for the acknowledgment (lines 5-6), with the process recursively updating the cells until the root node is reached. If the synchronization with the parent fails, then the coordinator enters a recovery stage by executing the handleParentFailure() procedure (lines 7-9); parent failure recovery is detailed below. After synchronizing with its parent, the coordinator uses the CellGroup:sendMessage() procedure to send the message to the peers, thus synchronizing the state among all the active peers present in the cell (line 11). The last remaining step is to send the reply message back to the requesting peer (line 12). On the other hand, if the coordinator received an acknowledgment, then it updates any pending request (lines 13-15).

If the peer is not the coordinator of the cell, then all the incoming messages are processed by the CellGroup:nonCoordinatorHandleMsg() procedure. If the message is not an acknowledgment, then the cell object processes it and updates its internal state (line 19), reflecting the changes performed globally in the cell. After this update, an acknowledgment is sent back to the coordinator peer (line 20). Otherwise, the message received was an acknowledgment and is used to update any pending request (lines 21-23).

The CellGroup:sendMessage() procedure, in Algorithm 4.5, illustrates the process of sending a message within a cell.
If the message is being sent by the coordinator (lines 2-15) and it originated in another peer, then the coordinator removes that peer from the sending set (lines 3 and 4; illustrated in step 1 in Figure 4.6a and step 2 in Figure 4.6b). The message is sent to all the peers present in the set, with each pending request being stored in an auxiliary list (lines 5-9). The coordinator then waits for the completion of all the pending requests (line 10). For each request that failed, the coordinator removes the peer associated with that request from the list containing all the active peers (lines 11-15).

On the other hand, if the message is being sent by a non-coordinator peer, then it is forwarded to the coordinator of the cell (line 17). After sending the message, the peer waits for the acknowledgment from the coordinator (line 18). The synchronization is then handled by the coordinator through the CellGroup:coordinatorHandleMsg() procedure (previously shown in Algorithm 4.4). If the request fails, it is assumed that the coordinator has crashed. In order to recover the cell from this faulty state, the CellGroup:handleCoordinatorFailure() procedure is triggered (lines 19-21).

Algorithm 4.5: Cell group communications: sending-end
var: this // the current cell group communications object
var: peers // the active, non-coordinator, peer client list
var: coordinatorPeer // the coordinator peer client
 1 procedure CellGroup:sendMessage(msg)
 2   if this.isLocalPeerGroupCoordinator() then
 3     sendList ← peers
 4     sendList.remove(msg.getSourcePeer())
 5     cellRequestList ← ∅
 6     for peer in sendList do
 7       cellRequest ← peer.sendMessage(msg)
 8       cellRequestList.add(cellRequest)
 9     end for
10     cellRequestList.waitForCompletion()
11     for cellRequest in cellRequestList do
12       if cellRequest.failed() then
13         peers.remove(cellRequest.getPeer())
14       end if
15     end for
16   else
17     cellRequest ← coordinatorPeer.sendMessage(msg)
18     cellRequest.waitForCompletion()
19     if cellRequest.failed() then
20       this.handleCoordinatorFailure()
21     end if
22   end if
23 end procedure

Cell Discovery Mechanism

The goal of the cell discovery mechanism is to allow the discovery of peers in a cell. The cell discovery object implements this sub-service, and uses low-level multicast sockets to achieve an efficient implementation. The cell membership management is accomplished through the use of the join, leave and rebind operations. These operations are implemented through the cell group object. Both mechanisms are presented in Figure 4.7.

The algorithms that implement the cell discovery mechanisms are presented in Algorithm 4.6. When a peer wants to join the mesh, it first has to find a suitable cell to bind
to. This is achieved through the call to the CellDiscovery:requestCell() procedure (lines 1-8) on the root cell, which in turn sends a cell request message. The call will be serviced by any of the peers in the cell. If there are no peers in the root cell, the procedure returns the root cell identifier (line 4). Otherwise, it returns an appropriate place in the mesh tree to position the requesting peer (line 6). The parameter peerType denotes the type of node that is joining the cell, and it can be either a peer or a leaf peer.

Figure 4.7: Cell discovery and management entities.

Algorithm 4.6: Cell Discovery
var: cellObj // the cell object
var: discoveryMC // the discovery low-level multicast socket
 1 procedure CellDiscovery:requestCell(peerType)
 2   request ← discoveryMC.sendRequestCell(peerType)
 3   if request.failed() then
 4     return Cell:createRootInfo()
 5   else
 6     return request.getCellInfo()
 7   end if
 8 end procedure
 9 procedure CellDiscovery:RequestParent(peerType)
10   request ← discoveryMC.requestParent(peerType)
11   request.waitForCompletion()
12   return request.getParent()
13 end procedure
14 procedure CellDiscovery:handleDiscoveryMsg(peer,msg)
15   switch(msg.getType())
16     case(RequestCell)
17       if not cellObj.isRoot() then
18         return
19       end if
20       replyRequestCellMsg ← cellObj.getCell(msg.getPeerInfo())
21       peer.sendMessage(replyRequestCellMsg)
22     end case
23     case(RequestParent)
24       replyRequestParentMsg ← cellObj.getParent(msg.getPeerInfo())
25       peer.sendMessage(replyRequestParentMsg)
26     end case
27   end switch
28 end procedure

The optimal position for a new peer depends on the strategy used and the type of peer. For a new peer, and given a tree-like topology, we first try to occupy the top of the tree, aiming to improve the resiliency of the overlay.

The procedure CellDiscovery:handleDiscoveryMsg() (lines 14-28) is the call-back that is executed on the cell's active peers to process the discovery requests. The cell discovery mechanism supports two types of messages: the request for a cell (lines 16-22) and the request for a new parent (lines 23-26).

The request for a cell is only valid in the root cell; otherwise the request is simply discarded (lines 17-19). The restriction of this operation to the root cell allows us to provide a better balance of the mesh tree, because the root cell is the only part of the tree that has full knowledge of the overlay. A suitable cell is found using the cellObj.getCell() procedure. The reply message containing the binding information is sent to the requesting peer (lines 20-21).

However, if the incoming request is for a new parent, then a suitable parent is found through the call to the cellObj.getParent() procedure, with the result being sent to the originating peer (lines 24 and 25). The request for a new parent is issued when the parent peer of a cell fails. The coordinator of the cell must be able to find the parent peer within the parent cell, if available, by using the CellDiscovery:RequestParent() procedure.

Faults and Recovery

Faults arise for various reasons, ranging from hardware failures, which include peer hardware failures and network outages, to software bugs.
We considered three types of faults: peer crash, coordinator peer crash, and cell crash.

Figure 4.8 illustrates the fault handling processes in the presence of a fault in a cell. When a non-coordinator peer crashes in a cell, shown in Figure 4.8a, the coordinator peer issues a leavePeer request to the upper part of the tree (step 2), notifying the departure of the crashed peer. After the acknowledgment from the parent peer has been received (step 3), the coordinator peer notifies the active peers in the cell of the crashed peer (steps 4 and 5).
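The handling of a non-coordinator crash described above can be sketched in Python as follows. This is a minimal in-memory model under assumptions: the Cell class, its member list (with the first entry acting as coordinator) and the trace labels are hypothetical, and the sketch assumes that every cell on the path to the root synchronizes its own peers before acknowledging.

```python
from typing import List, Optional


class Cell:
    """Hypothetical cell model: members are kept in the order they joined."""

    def __init__(self, name: str, members: List[str],
                 parent: Optional["Cell"] = None):
        self.name = name
        self.members = list(members)  # members[0] acts as the coordinator
        self.parent = parent

    def leave_peer(self, crashed: str, trace: List[str]) -> None:
        """Handle the crash of a non-coordinator peer, as in Figure 4.8a."""
        if crashed in self.members:
            self.members.remove(crashed)
        if self.parent is not None:
            # Steps 2-3: notify the upper part of the tree and wait for its reply.
            self.parent.leave_peer(crashed, trace)
            trace.append(f"ack:{self.parent.name}->{self.name}")
        # Steps 4-5: tell the remaining active peers about the departure.
        for peer in self.members[1:]:
            trace.append(f"notify:{self.name}->{peer}")
```

The recursion makes the ordering explicit: the local peers are only notified once the acknowledgment from the parent has been recorded.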
Figure 4.8: Failure handling for non-coordinator (left) and coordinator (right) peers.

On the other hand, when a failure happens in the cell's coordinator peer, shown in Figure 4.8b, one of the other peers in the cell takes its place as the new coordinator. After detecting the failure of the coordinator (step 1), the peer that is next-in-line, according to the order in which the peers entered the cell, succeeds it and becomes the new coordinator. The coordinator of the parent cell also detects the crashed coordinator peer, and sends a notification towards the root cell (steps omitted). The newly elected coordinator peer sends a rebind request to the parent coordinator and waits for the acknowledgment (steps 2 and 3), informing it that it is the new coordinator of the cell. Furthermore, each active peer in the cell rebinds to the new coordinator, as will any coordinator belonging to a child cell. These rebind requests are also sent towards the root and fully acknowledged (steps 4 to 7).

As said, the coordinator peers from the children cells try to rebind to the parent's cell. If there are no more peers in the parent's cell, then the cell has crashed and the coordinators of the children cells have to contact the root node of the tree to request a new suitable placement, that is, a new cell. At this point, it is possible for the children cells, and their sub-trees, to migrate to their new location, effectively avoiding the costly rebinding process that would arise from forcing every peer to individually rebind to the mesh.

Figure 4.9: Cell failure (left) and subsequent mesh tree rebinding (right).

Figure 4.9a shows the instance when the coordinator peer crashes. Because it was the only active peer in the cell, this resulted in a cell crash, as no more peers were available in the cell. The reconfigured P2P network is shown in Figure 4.9b.

Algorithms 4.7 and 4.8 show the algorithms that govern the fault-handling mechanism. When a TCP/IP connection closes without proper shutdown, the peer is assumed to have crashed. Within a cell, the coordinator peer monitors all active peers, and in turn, they monitor the coordinator peer. The Cell:onPeerFailureHandler() procedure is called by the coordinator when any of the active peers has failed, or it is called by all the active peers when the coordinator has failed. Furthermore, when a parent coordinator detects that a child coordinator has failed, it calls the Cell:onChildFailureHandler() procedure. On the other hand, every child coordinator peer calls the Cell:onParentFailureHandler() procedure when it detects that its parent coordinator has crashed.

When a peer crashes, there are two possible scenarios. The first is the crash of a non-coordinator peer of the cell, shown in Figure 4.8a, and the second is the crash of a coordinator peer, shown in Figure 4.8b. When a non-coordinator peer crashes, the coordinator peer of that cell calls the Cell:leavePeer() procedure at line 10 of the Cell:onPeerFailureHandler() procedure. It starts by removing the information about the peer (line 39) and then sends the notification to the parent coordinator peer and waits for the acknowledgment (lines 40 to 42). After the acknowledgment has been received, the coordinator synchronizes the cell by issuing a
departure notification through the cell group communication infrastructure (line 43). No additional recovery is necessary at this point.

Algorithm 4.7: Cell fault handling.
var: this // the current cell object
var: cellGroupObj // the cell communication group object
 1 procedure Cell:onPeerFailureHandler(peerInfo)
 2   if peerInfo.isCoordinator() then
 3     this.removePeerInfo(peerInfo)
 4     if this.isNewCoordinator() then
 5       this.rebindParentPeer(this.getParentInfo())
 6     else
 7       this.rebindCoordinatorPeer()
 8     end if
 9   else
10     this.leavePeer(peerInfo)
11   end if
12 end procedure
13 procedure Cell:onParentFailureHandler(peerInfo)
14   cellDiscoveryObj ← Cell:createCellDiscovery(peerInfo.getCellInfo())
15   newParentInfo ← cellDiscoveryObj.requestParent()
16   if newParentInfo ≠ ∅ then
17     this.rebindParentPeer(newParentInfo)
18   else
19     cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
20     newParentInfo ← cellRootDiscoveryObj.requestParent()
21     this.rebindParentPeer(newParentInfo)
22   end if
23 end procedure
24 procedure Cell:onChildFailureHandler(peerInfo)
25   leavePeer(peerInfo)
26 end procedure

When the crashed peer was coordinating the cell, each active peer remaining in the cell calls the Cell:onPeerFailureHandler() procedure (lines 2 to 9). They start by removing the information about the crashed peer (line 3). The peer that is next-in-line to succeed the coordinator peer calls the Cell:rebindParentPeer() procedure (line 5). In turn, all the remaining active peers in that cell call the Cell:rebindCoordinatorPeer() procedure (line 7) in order to connect to the new coordinator peer. The Cell:rebindParentPeer() procedure starts by connecting to the parent coordinator peer (line 28), and then issues a rebind notification to it and waits for the acknowledgment (lines 29 to 31).

Algorithm 4.8: Cell fault handling (continuation).
27 procedure Cell:rebindParentPeer(parentInfo)
28   this.connectToParentPeer(parentInfo)
29   rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
30   request ← this.getParentPeer().sendMessage(rebindMsg)
31   request.waitForCompletion()
32 end procedure
33 procedure Cell:rebindCoordinatorPeer()
34   this.connectToCoordinator(this.getCoordinatorInfo())
35   rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
36   cellGroupObj.sendMessage(rebindMsg)
37 end procedure
38 procedure Cell:leavePeer(peerInfo)
39   this.removePeerInfo(peerInfo)
40   leaveMsg ← Cell:createLeaveMsg(peerInfo)
41   request ← this.getParentPeer().sendMessage(leaveMsg)
42   request.waitForCompletion()
43   cellGroupObj.sendMessage(leaveMsg)
44 end procedure

On the other hand, the Cell:rebindCoordinatorPeer() procedure starts by connecting to the new coordinator peer (line 34), and then issues a rebind notification to the coordinator through the cell group communication infrastructure (lines 35 and 36).

At the same time, the parent coordinator peer and all the children coordinator peers also detect that the coordinator peer has crashed. In the first case, the parent coordinator peer, through the Cell:onChildFailureHandler() procedure, issues a notification to the topmost portion of the tree informing of the departure of the crashed peer (followed by the synchronization within its own cell). This is accomplished through the Cell:leavePeer() procedure (line 25). The children coordinators, upon detecting the failure of their parent coordinator, call the Cell:onParentFailureHandler() procedure. The procedure starts by trying to discover a new parent in the same cell of the crashed coordinator (lines 14 and 15). If there is an active coordinator in that cell, then the child coordinator rebinds by calling the Cell:rebindParentPeer() procedure (line 17).
If there is no such coordinator available, the child coordinator contacts the root cell to ask for a new parent, and thus a new placement in the mesh, and rebinds to it, also using the Cell:rebindParentPeer() procedure (lines 19 to 21).

4.1.3 Discovery Service

The Discovery service provides a generic infrastructure for locating resources in the overlay, such as the location of service instances, whereas the previously described cell discovery infrastructure only provides the mechanisms to locate peers within a cell.

Figure 4.10: Discovery service implementation.

The overlay Discovery service is shown in Figure 4.10. A user in peer A issues a query through the Runtime Interface and Overlay Interface. The runtime of peer A tries first to resolve it locally. If it is unable to resolve the query locally, then it must forward the query to its parent coordinator, peer B. If peer B is unable to resolve the query, then the request is forwarded to its parent coordinator, in this case peer C. If peer C is unable to resolve the query, then a failure reply is sent downwards to the originating peer.

Furthermore, the querying process can be generalized in the following manner. Upon the reception of a discovery request, the runtime tries first to resolve it locally, in the peer, and only when this is not possible does it propagate the request to the cell's coordinator. If the coordinator is also unable to reply to the request, the request is propagated once more to its parent cell coordinator and the process is repeated recursively until a coordinator peer is able to reply. If this process reaches a point where there is no parent coordinator available (root node for the sub-tree), the process fails and a failure reply is sent downwards to the originating peer.

Algorithm 4.9 illustrates the algorithms that implement the behavior of the discovery service. The discovery service allows the execution of synchronous and asynchronous queries. The procedure Discovery:executeQuery() performs synchronous queries. The current implementation redirects the query towards the root cell. This was done for the sake of simplicity, but is going to be revised in the future.
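The upward propagation of a query described above can be sketched in Python. This is a minimal illustration under assumptions: the DiscoveryNode class, its dictionary-based registry and the upstream link are hypothetical, and remote calls are replaced by direct method calls.

```python
from typing import Dict, Optional


class DiscoveryNode:
    """Hypothetical peer holding a local registry and a link to its coordinator."""

    def __init__(self, name: str, registry: Dict[str, str],
                 upstream: Optional["DiscoveryNode"] = None):
        self.name = name
        self.registry = registry  # resources known locally: key -> location
        self.upstream = upstream  # cell coordinator, or parent cell coordinator

    def execute_query(self, key: str) -> Optional[str]:
        """Resolve locally if possible, otherwise ask the next coordinator up."""
        local = self.registry.get(key)
        if local is not None:
            return local
        if self.upstream is None:
            # No coordinator left to ask: the query fails.
            return None
        return self.upstream.execute_query(key)
```

For a chain A, B, C as in Figure 4.10, a resource registered only at C is found from A after two forwards, and an unknown resource yields None, which plays the role of the failure reply.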
Algorithm 4.9: Discovery service.
var: this // the current discovery service object
var: mesh // the mesh service
 1 procedure Discovery:executeQuery(query,qos)
 2   queryResult ← this.executeLocalQuery()
 3   if queryResult ≠ ∅ then
 4     return(queryResult)
 5   end if
 6   coordinatorUUID ← ∅
 7   if not mesh.getCell().isCoordinator() then
 8     coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
 9   else
10     coordinatorUUID ← mesh.getCell().getParentUUID()
11   end if
12   if coordinatorUUID = ∅ then
13     return(∅)
14   end if
15   coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
16   coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP,qos)
17   return(coordDiscoveryClient.executeQuery(query,qos))
18 end procedure
19 procedure Discovery:executeAsyncQuery(query,qos)
20   queryResult ← this.executeLocalQuery()
21   if queryResult ≠ ∅ then
22     future ← this.createFutureWithResult(queryResult)
23     return(future)
24   end if
25   coordinatorUUID ← ∅
26   if not mesh.getCell().isCoordinator() then
27     coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
28   else
29     coordinatorUUID ← mesh.getCell().getParentUUID()
30   end if
31   if coordinatorUUID = ∅ then
32     future ← this.createFutureWithResult(∅)
33     return(future)
34   end if
35   coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
36   coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP,qos)
37   return(coordDiscoveryClient.executeAsyncQuery(query,qos))
38 end procedure
39 procedure Discovery:handleQuery(peer,query,qos)
40   queryResult ← this.executeQuery(query,qos)
41   queryReplyMessage ← Discovery:createQueryReplyMessage(queryResult)
42   peer.sendMessage(queryReplyMessage)
43 end procedure
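The future-based variant of Algorithm 4.9 (lines 19-38) can be illustrated with Python's standard concurrent.futures module. This is a sketch under assumptions: execute_async_query, completed_future and the lookup callables are hypothetical names, and the thread pool merely stands in for the asynchronous client of the coordinator.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]

# Small pool standing in for the asynchronous client of the coordinator.
_pool = ThreadPoolExecutor(max_workers=2)


def completed_future(value: Optional[str]) -> Future:
    """Wrap an already known value in a resolved future."""
    future: Future = Future()
    future.set_result(value)
    return future


def execute_async_query(key: str, local: Lookup,
                        coordinator: Optional[Lookup] = None) -> Future:
    """Return a future for the query result, as in executeAsyncQuery()."""
    result = local(key)
    if result is not None:
        # Resolved locally: hand back a future that is already complete.
        return completed_future(result)
    if coordinator is None:
        # No coordinator available: a completed future carrying the failure.
        return completed_future(None)
    # Otherwise delegate to the coordinator and return its pending future.
    return _pool.submit(coordinator, key)
```

The caller always receives a future, so local hits, failures and remote lookups are handled through the same result() call.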
The procedure starts by trying to resolve the query locally and, if successful, returns the result (lines 2 to 5). Otherwise, the query must be propagated throughout the overlay. If the peer is not the coordinator of the cell, then the coordinator of the cell will be used as gateway for the propagation of the query. On the other hand, if the peer is the coordinator of the cell, then the coordinator of the parent cell is used (lines 6 to 11). If no such coordinator is available, then the query fails (lines 12 to 14). Otherwise, the SAP information of the coordinator, which can be either the coordinator of the current cell or the coordinator of the parent cell, is retrieved using the mesh service (line 15), followed by the creation of a client to the Discovery service of that coordinator (line 16). At line 17, we use the client to redirect the request to that coordinator and return the result.

The procedure Discovery:executeAsyncQuery() provides the asynchronous version of the querying primitive. It follows the same approach as the synchronous version, with some slight differences. Instead of returning the result of the query, it returns a future, which acts as a placeholder for the query result, notifying the owner when that data is available. If the query can be resolved locally, then a future is created with the query result and returned (lines 21 to 24). As with the synchronous querying, this is followed by the retrieval of the UUID of either the coordinator of the cell, if the peer is not the coordinator of the cell, or the coordinator of the parent cell. If no coordinator is available, then the procedure fails and a token reflecting this failure is created and returned (lines 26 to 34).
Otherwise, a client is created to the coordinator after the retrieval of the necessary information about the SAP of that coordinator. Last, the procedure returns the future created by the asynchronous querying on the coordinator's client (lines 35 to 37).

The procedure Discovery:handleQuery() is the call-back that is executed to handle the query requests from the follower peers of the cell, or from children peers that belong to children cells. The Discovery:executeQuery() procedure, previously described in Algorithm 4.9, is used to process an incoming query. If the query fails, a failure message is created. If not, the query result is attached to a reply message. The reply message is finally sent to the requesting peer.

4.1.4 Fault-Tolerance Service

Our FT infrastructure is based on replication groups. These groups can be defined as a set of cooperating peers that have the common goal of providing reliability to a high-level service. Previous work [3, 14] implemented FT support through a set of
high-level services that used the underlying primitives of the middleware. Our approach (cf. Chapter 3) makes a fundamental shift from this principle, by embedding lightweight FT support at the overlay layer.

The management of the replication group is self-contained, in the sense that the FT service delegates all the logistics to the replication group. This allows further extensibility of the replication infrastructure, and also allows the co-existence of multiple types of replication strategies inside the FT service.

The integration of FT in the overlay reduces the overhead of cross-layering that is associated with the use of high-level services. Furthermore, this approach also enables the runtime to make decisions on the placement of replicas that are aware of the overlay topology. This awareness can allow a better balance between the target reliability and resource usage. For example, placing replicas in different geographic locations leads to better reliability, but can be limited by the availability of bandwidth over WAN links.

Figure 4.11: Fault-Tolerance service overview.

Figure 4.11 shows an overview of the FT service, more specifically, of the bootstrap process of a replicated service. It starts with a peer, in this case referred to as client, requesting the creation of a replicated service from peer B. This request is delegated to the mesh service. At this point, peer B receives the request and verifies if it is able to host the service. If enough resources are available for hosting the service, which will act as the primary service instance, then the core requests the FT service to create a replication group that will support the replication infrastructure for the service.

The FT service creates a new replication group object, which will oversee the management of the replication group, acting as its primary.
Using the fault-tolerance parameters that were passed by the core, the primary of the replication group finds the necessary number of replicas across the overlay using the discovery service (this interaction is omitted). After finding the suitable deployment peers, the primary sends requests to the remote FT services to join the replication group, as replicas. Each remote peer verifies if it has the necessary resources to host the replica, and if so, the core creates a replication group object that will act as a replica in the replication group. This process ends with the replica binding to the primary of the replication group.

Replication Group Management

The management of a replication group includes the creation and removal of replicas. Furthermore, a replication group is also responsible for providing the fail-over mechanisms that allow the recovery from faults that occur in participating peers.

Figure 4.12: Creation of a replication group.

Figure 4.12 illustrates the creation of a replication group with one replica. The process starts with a user requesting the creation of a service with FT support (step 1). The core of the runtime processes the request and creates a service instance that will act as the Primary service instance (step 2). If configured, the core will make the necessary reservations by interacting with the QoS client. The core proceeds to create a replication group that will provide fault-tolerance support to the service (step 3).

After creating the replication group object, and finding a suitable deployment site (omitted), the core requests the addition of a replica to the newly created replication group, through the fault-tolerance service (step 4). The handleFTMsg procedure is the call-back that is responsible for handling these types of requests.

After receiving and accepting the request for the creation of a replica, the peer denominated as Replica creates a service instance that will act as a replica to the primary
service instance (step 5). This is followed by the creation of a replication group object that will act as a replica in the existing group. In order to complete the join to the replication group, the replica issues a join request to the primary of the replication group, which is maintained by the primary peer (steps 6-7).

Because the example given in Figure 4.12 only has one replica, there is no need to advertise the arrival of a new replica. However, in the presence of a larger group, each newly added replica has to be advertised in the replication group.

Figure 4.13: Replication group binding overview.

Figure 4.13 depicts the existing bindings within a replication group with multiple replicas. The primary of the replication group, the peer that is managing the group and is responsible for hosting the primary service, has active binds to all the replicas, which are the peers that host a replica service.

The replicas are shown from left to right, denoting their order of entrance in the replication group. If the primary fails, the leftmost replica is elected as the new primary. Furthermore, each replica pre-binds to all the replicas that are placed on its right. These pre-binds allow the monitoring of the neighboring peers for failures and reduce the latency of the binding process.

Figure 4.14 shows the details of the process involved in the creation of a new replica. Following a request for the creation of a new replica by the primary (shown in Figure 4.12, step 4), the new replica joins the replication group (step 1).

When the primary adds a new replica to the group, it first starts by binding to it (step 2). If this initialization is successful, the primary sends a message notifying the remaining replicas that a new replica was added (step 3). Upon the arrival of this message, each replica pre-binds to the new replica, and if this is done successfully, each replica replies back to the primary with an acceptance message (steps 4-5). Otherwise,
  • a rejection message is sent back to the primary and the addition of the new replica is aborted (omitted).

Figure 4.14: The addition of a new replica to the replication group.

Fault-Tolerance Algorithms

The fault-tolerance service handles three types of requests: the creation of a new replication group, which is performed by the primary; the addition of a new replica to an existing replication group, requested by the primary to a new replica; and the removal of an existing replication group. The procedures FT:createReplicationGroup(), FT:joinReplicationGroup() and FT:removeReplicationGroup() handle these requests, respectively, and are shown in Algorithm 4.10.

When a service creation request is made locally or remotely, through the mesh service, the core verifies if the necessary resources are available, and if so, creates a service instance to be used by the replication group. Following this, the core creates the replication group through the procedure FT:createReplicationGroup() (shown in Figure 4.12, step 3). Acting on behalf of the core, the FT service creates the replication group primary that will construct and manage the replication group.

This procedure takes as input the following parameters: svc, the service instance that will act as the primary; params, the service parameters used in the creation of the primary and replicas; and qos, a QoS broker to be used by the replication group. After the replication group has been created (line 2), the output variable rgid is initialized with the Replication Group Identifier (RGID) and the group is added to the group manager (line 3) and bootstrapped (line 4).

The fault-tolerance requests are handled by the FT:handleFTMsg() procedure. Upon the reception of a request to host a new replica (lines 18-22), the FT service redirects the request to the core of the runtime, by calling the joinReplicationGroup() procedure of the Core Interface (line 20). The core of the runtime first verifies the availability of
  • Algorithm 4.10: Creation and joining within a replication group
    var: this // the current FT service
    var: ftGroupObj // the replication communication group
    var: groupManager // the FT replication group manager
 1  procedure FT:createReplicationGroup(svc,params,rgid,qos)
 2      ftGroupObj ← this.createPrimaryFTGroupObj(svc,params,rgid,qos)
 3      groupManager.addGroup(ftGroupObj)
 4      ftGroupObj.start()
 5  end procedure
 6  procedure FT:joinReplicationGroup(svc,params,rgid,primary,replicas,qos)
 7      ftGroupObj ← this.createReplicaFTGroupObj(svc,params,rgid,primary,replicas,qos)
 8      groupManager.addGroup(ftGroupObj)
 9      ftGroupObj.start()
10  end procedure
11  procedure FT:removeReplicationGroup(rgid)
12      ftGroupObj ← groupManager.getGroup(rgid)
13      ftGroupObj.stop()
14      groupManager.removeGroup(ftGroupObj)
15  end procedure
16  procedure FT:handleFTMsg(peer,msg)
17      switch(msg.getType())
18          case(JoinFTGroup)
19              (primary,replicas,rgid,sid,params) ← msg.getReplicaInfo()
20              getCoreInterface().joinReplicationGroup(primary,replicas,rgid,sid,params)
21              peer.sendMessage(FT:createAckMessage(msg))
22          end case
23          case(RemoveFTGroup)
24              ftGroupObj ← groupManager.getGroup(rgid)
25              ftGroupObj.stop()
26              groupManager.removeGroup(ftGroupObj)
27          end case
28      end switch
29  end procedure

resources to run the replica, and if they are available, it requests the FT service to join the replication group. This is implemented by the FT:joinReplicationGroup() procedure (shown in Figure 4.12, step 6), which takes as input the following parameters: svc, the service instance that will act as a replica; params, the service parameters used in the creation of the primary and replicas; qos, a QoS broker to be used by the replication group object; rgid, the RGID of the replication group; primary, which holds the primary info; and replicas, which holds the current replicas info.
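The bookkeeping performed by the three FT procedures of Algorithm 4.10 can be illustrated with a minimal Python sketch. It is not the middleware's actual API: the class names, the dictionary standing in for the group manager, and the string identifiers are all assumptions made for illustration, and resource checks, QoS and networking are left out.

```python
# Hypothetical sketch of the FT service bookkeeping in Algorithm 4.10.
# All class and method names are illustrative, not the thesis API.

class ReplicationGroup:
    def __init__(self, rgid, role, primary=None, replicas=()):
        self.rgid = rgid
        self.role = role                  # "primary" or "replica"
        self.primary = primary            # primary info (None on the primary itself)
        self.replicas = list(replicas)    # replicas known at join time
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class FTService:
    def __init__(self):
        self.groups = {}                  # plays the role of groupManager

    def create_replication_group(self, rgid):
        group = ReplicationGroup(rgid, "primary")
        self.groups[rgid] = group         # groupManager.addGroup
        group.start()
        return group

    def join_replication_group(self, rgid, primary, replicas):
        group = ReplicationGroup(rgid, "replica", primary, replicas)
        self.groups[rgid] = group
        group.start()
        return group

    def remove_replication_group(self, rgid):
        group = self.groups.pop(rgid)     # getGroup + removeGroup
        group.stop()
        return group


primary_ft = FTService()
replica_ft = FTService()
primary_ft.create_replication_group("rg-1")
joined = replica_ft.join_replication_group("rg-1", primary="peer-A", replicas=[])
removed = replica_ft.remove_replication_group("rg-1")
```

In this sketch each peer owns its own FTService, so the primary and the replica hold different group objects that share the same RGID, mirroring the primary and replica group objects described above.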
  • Replication Group Algorithms

The replication group is the core of the replication infrastructure. It enforces the behavior that was requested in the creation of the replicated service, such as the number of replicas or replication policy.

Algorithm 4.11: Primary bootstrap within a replication group
    var: this // the local instance of the replication group
    var: ft // the fault-tolerance service
    var: rgControlGroup // the replication control group
 1  procedure FTGroup:startPrimary()
 2      this.openSAPs()
 3      (sid,params) ← this.getServiceInfo()
 4      nbrOfReplicas ← params.getFTParams().getReplicaCount()
 5      deployPeers ← ft.findResources(sid,params,nbrOfReplicas)
 6      for peer in deployPeers do
 7          replica ← this.createReplicaObject(peer)
 8          rgControlGroup.addReplica(replica.getInfo())
 9          this.addToReplicaList(replica)
10      end for
11      this.getService().setReplicationGroup(this)
12  end procedure

Algorithm 4.11 details the initialization procedure of a primary within a replication group. The FTGroup:startPrimary() procedure shows the bootstrap sequence of a primary. It starts by initializing two distinct access points, one for data and the other for control (line 2). This separation was made to prevent multiplexing of control and data requests, which could lead to priority inversion or increased latency in the processing of requests. More specifically, the control SAP is used to manage the organization of the replication group, such as addition and removal of replicas and election of a new primary, while the data SAP is used to implement the “actual” FT protocol.

Figure 4.15 illustrates the control and data communication groups. The dashed lines represent pre-binds that are made to minimize recovery time.
When the primary of a replication group fails, the necessary TCP/IP connections are already in place, so when the replica that is next-in-line becomes the new primary, it can immediately recover the replication group.

After this initial setup, the primary calls the FT:findResources() procedure (shown in Algorithm 4.12) to search for suitable deployment sites to create the replicas. The total number of replicas is enclosed within the fault-tolerance parameters, which in turn belong to the service parameters (lines 3-4).
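The ordering rule behind the pre-binds (replicas are kept in join order, each replica pre-binds to those on its right, and the leftmost replica takes over) can be sketched as follows. This is a simplified illustration and not the thesis implementation: the GroupView class and the peer names are hypothetical, and no real connections are opened.

```python
# Illustrative only: replicas are kept in join order and the oldest
# surviving replica takes over when the primary fails.

class GroupView:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)    # leftmost = oldest replica

    def prebind_targets(self, peer):
        """Replicas to the right of `peer`, which it pre-binds to."""
        index = self.replicas.index(peer)
        return self.replicas[index + 1:]

    def on_peer_failure(self, failed):
        if failed == self.primary:
            # Promote the leftmost replica; its pre-binds already exist.
            self.primary = self.replicas.pop(0)
        else:
            self.replicas.remove(failed)
        return self.primary


view = GroupView("P", ["R1", "R2", "R3"])
targets = view.prebind_targets("R1")      # R1 pre-binds to R2 and R3
new_primary = view.on_peer_failure("P")   # R1 is next-in-line
view.on_peer_failure("R3")                # a replica failure only shrinks the list
```

Because R1 already holds pre-binds to every replica on its right, promoting it requires no new connections in this model, which is the latency argument made above.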
  • Figure 4.15: The control and data communication groups.

After retrieving the list of suitable deployment sites at line 5, the primary creates and binds each replica (line 7). Each newly added replica is synchronized with the existing replicas in the replication group, using the control group infrastructure (line 8). Subsequently, the new replica is added to the replica list (line 9). Last, the replication group is attached to the service instance, allowing the service to access the underlying FT infrastructure (line 11). If any of the previously mentioned operations fails, the whole bootstrap process fails.

Algorithm 4.12: Fault-Tolerance resource discovery mechanism.
    var: this // the current FT service object
    var: discovery // the discovery service
    var: mesh // the mesh service
 1  procedure FT:findResources(sid,params,nbrOfReplicas)
 2      peerList ← ∅
 3      for i ← 1 to nbrOfReplicas do
 4          filterList ← peerList
 5          query ← this.createPoLQuery(mesh.getUUID(),sid,filterList)
 6          queryReply ← discovery.executeQuery(query)
 7          peerList.add(queryReply.getPeerInfo())
 8      end for
 9      return(peerList)
10  end procedure

In order to bootstrap a replica, a suitable place must be found. Algorithm 4.12 shows the details of the mechanism that is responsible for finding suitable peers to host new replicas. The process is exposed by the FT:findResources() procedure. This procedure returns a list containing the peers, found across the overlay, that are able to host a replica. To prevent duplication of replicas on the same runtime, a filter list is added to each query. The initialization of this list is performed at line 4, and is
  • updated every time a query is performed, avoiding duplication of peers. The actual query is created in line 5, through the use of the FT:createPoLQuery() procedure. The short name PoL stands for Place of Deployment, and refers to the runtime where a service, or in this case the replica, will be launched. At this point, the FT service uses the discovery service to perform the query (line 6), adding the reply to the peer list (line 7) in case of success. If this query fails, the FT:findResources() procedure fails.

Algorithm 4.13: Replica startup.
    var: this // the local instance of the replication group object
 1  procedure FTGroup:startReplica()
 2      this.openSAPs()
 3      this.getService().setReplicationGroup(this)
 4  end procedure

The startup of a replica is detailed in the FTGroup:startReplica() procedure in Algorithm 4.13. The replica starts by opening the control and data access points (line 2). This enables the primary of the group to bind to the replica (shown in Algorithm 4.11). Last, the replication group is attached to the replica service (line 3).

Algorithm 4.14: Replica request handling
    var: this // the local instance of the replication group object
 1  procedure FTGroup:replicaHandleControlMsg(primaryPeer,msg)
 2      switch(msg.getType())
 3          case(AddReplica)
 4              replicaInfo ← msg.getReplicaInfo()
 5              replica ← this.prebindControlAndDataToReplica(replicaInfo)
 6              this.addToReplicaList(replica)
 7              ackMessage ← FTGroup:createAckMessage(msg)
 8              primaryPeer.sendMessage(ackMessage)
 9          end case
10          case(RemoveReplica)
11              replicaInfo ← msg.getReplicaInfo()
12              this.removeFromReplicaList(replicaInfo)
13              ackMessage ← FTGroup:createAckMessage(msg)
14              primaryPeer.sendMessage(ackMessage)
15          end case
16      end switch
17  end procedure

Algorithm 4.14 shows the FTGroup:replicaHandleControlMsg() call-back that is responsible for handling the control requests, in a peer that is acting as a replica within
  • a replication group. The notification messages sent by the primary that inform of the arrival of new replicas to the replication group are handled in lines 3-9. Upon receiving the request, each replica pre-binds to the new replica (line 5) and adds it to the replica list (line 6). This ends with a reply message being sent to the primary peer (lines 7-8).

The removal of a replica from the replication group is handled in lines 10-15. When removing the replica from the list (line 12), all associated pre-binds (control and data) are closed. The process ends with an acknowledgment being sent to the primary peer.

Support for the Replication Protocol

Our current implementation only supports semi-active replication [44]. In this type of replication, the primary instance of the service, after receiving and processing a request from a client, replicates the new state across all the active replicas. As soon as the replication ends, an acknowledgment is sent back to the client.

Figure 4.16: Semi-active replication protocol layout.

Figure 4.16 illustrates the implementation of the semi-active replication policy. When the primary service instance wants to replicate its state, it uses the replicate() procedure within the replication group (step 1). The replication group then uses the data group to synchronize the new state among the replicas (step 2). Each replica handles the replication request through the replicaHandleDataMsg() procedure. This takes the replication data and calls the onReplication() procedure (step 3). The service, after synchronizing into the new state, issues an acknowledgment through the replication group (step 4).

The actual replication protocol support is detailed in Algorithm 4.15. When a primary service needs to synchronize some data, which can be individual actions, such as RPC invocations, or state transfers (partial or complete), it uses the FTGroup:replicate()
  • procedure. The underlying replication group, depending on its policy, synchronizes the replication data with all the replicas. For example, if the replication group is configured to use semi-active replication, then when the FTGroup:replicate() procedure is called (by the primary), the group immediately spreads the data. Alternatively, if passive replication were in place, the replication group would buffer the data until the current synchronization period expires and only then synchronize it.

Each replica executes the FTGroup:replicaHandleDataMsg() call-back to handle the arrival of replication data. Upon arrival, the replication data is sent to the replica service instance to be processed.

Algorithm 4.15: Support for semi-active replication.
    var: this // the local instance of the replication group object
    var: rgDataGroup // the replication data group
 1  procedure FTGroup:replicate(buffer)
 2      rgDataGroup.replicate(buffer)
 3  end procedure
 4  procedure FTGroup:replicaHandleDataMsg(primaryPeer,msg)
 5      switch(msg.getType())
 6          case(Replication)
 7              buffer ← msg.getBuffer()
 8              replicationAckMsg ← this.getService().onReplication(buffer)
 9              primaryPeer.sendMessage(replicationAckMsg)
10          end case
11      end switch
12  end procedure

Fault Detection and Recovery in Replication Groups

The fault detection and recovery mechanisms within a replication group are implementation dependent. Figure 4.17 illustrates the recovery process within our current implementation. After detecting the failure of the primary (step 1), the replica that is next-in-line to become the new primary assumes the leadership of the replication group by sending a notification to all active replicas, informing them that it has assumed the coordination (step 2). Next, the new primary notifies its service instance, which was acting as a replica instance, that it became the primary service instance (step 3).
At this point, the primary node updates the information about the service, allowing any existing client to retrieve this information and rebind to the new primary. This is accomplished through the use of the changeIIDOfService() procedure of the Core Interface. For the sake of simplicity, we omit the additional steps required to perform
  • this update in the mesh.

Figure 4.17: Recovery process within a replication group.

Algorithm 4.16 details the detection and recovery call-backs that are used by the participants of the replication group. The procedure FTGroup:onPeerFailureHandler() is called when a bind or a pre-bind is closed, that is, when a peer has crashed. If the failing peer was the current primary of the group (line 2), then the leftmost replica (line 3), the oldest replica in the group, is elected leader. If the executing peer is the new primary (line 4), then it must notify the service instance that it became the primary (line 5). The new primary sends a notification to all the active replicas informing them that it is ready to continue with the replication policy (line 6). This is followed by an update containing the information about the new primary (lines 7 to 8). However, if the faulty peer was not the primary, then it is just a matter of removing the binding information associated with the crashed peer (line 11).

4.2 Implementation of Services

EFACEC operates on several domains, including information systems used to manage public high-speed transportation networks, robotics and smart (energy) grids. Despite their differences, these systems have many common requirements and problems, such as: the need to transfer large sets of data; intermittent network activity, which can lead to data bursts; exposure to common hardware failures, whose duration can range from short (for example, a network reconfiguration caused by a link failure) to extended outages, such as fires; and the need for low jitter and low latency for safety reasons, such as vehicle coordination. The pursuit of these characteristics puts a tremendous stress
  • Algorithm 4.16: Fault detection and recovery
    var: this // the local instance of the replication group object
    var: mesh // the mesh service
    var: rgControlGroup // the replication control group
    var: service // the replicated service
    var: replicas // the replica list
    var: rgid // the replication group UUID
 1  procedure FTGroup:onPeerFailureHandler(peerID)
 2      if this.isPeerPrimary(peerID) then
 3          primaryPeer ← replicas.pop()
 4          if primaryPeer.getUUID() = mesh.getUUID() then
 5              this.fireOnChangeToPrimary()
 6              rgControlGroup.sendNewPrimaryInfo()
 7              iid ← service.getIID()
 8              this.getCoreInterface().changeIIDOfService(sid,iid,rgid)
 9          end if
10      else
11          replicas.remove(peerID)
12      end if
13  end procedure
14  procedure FTGroup:fireOnChangeToPrimary
15      serviceChangeStatus ← service.changeToPrimaryRole()
16      return(serviceChangeStatus)
17  end procedure

on both software and hardware infrastructures, and particularly on the management middleware platform.

Our middleware architecture is able to support different types of services. To showcase some possible implementations, we present three distinct services: 1) RPC, the classical remote procedure call service; 2) Actuator, which allows the execution of commands on a set of sensors; and 3) Streaming, which allows data streaming from a sensor to a client. The RPC service is a standard in every middleware platform, whereas both the Actuator and Streaming services were designed to resemble current systems for public information management that were deployed in the Dublin and Tenerife metropolitan infrastructures. These services will form the basis for the evaluation of the middleware to be presented in Chapter

Remote Procedure Call

The RPC service, depicted in Figure 4.18, allows the execution of a procedure in a foreign address space, relieving the programmer of the burden of coding the remote
  • interactions. The service uses fault-tolerance in the common way, with the primary being the main service site, updating all the replicas that belong to the replication group according to the group’s replication policy. The current implementation only supports semi-active [44] replication, where the primary updates all replicas upon the reception of a new invocation, and only replies to the client when all the replicas acknowledge the update. On the other hand, if the RPC service is bootstrapped without fault-tolerance, then the service executes a client invocation and replies immediately, as no replication is involved. Figure 4.18 shows the RPC service deployed with two replicas across the overlay.

Figure 4.18: RPC service layout.

The RPC service is divided in two layers. The topmost level contains the user-defined objects, referred to as servers. The servers are the building blocks of the RPC service, providing object-oriented semantics similar to CORBA. For now, they are statically linked, at compile time, to the RPC service. We have plans to expand this in the future. The bottommost level contains the server adapter, also known as server manager, that is responsible for managing these user objects. The main functions of the server adapter include the registration and removal of objects, and the retrieval of the proper object to handle an incoming invocation.

In order to fully support object semantics, RPC has two distinct invocation types, one-way and two-way invocations. One-way invocations do not return a value to the client. Two-way invocations return a value back to the client that is dependent on the particular operation.

Figures 4.19a and 4.19b show the interaction with a client while performing one-way and two-way invocations, respectively. After receiving an invocation from a Service
  • (a) RPC one-way invocation. (b) RPC two-way invocation.

Figure 4.19: RPC invocation types.

Access Point (SAP), through the handleRPCServiceMsg call-back, the server adapter redirects the request to the target object (server) that performs the call to the requested method. If it is a one-way invocation then the server only has to call the target method using the input arguments (handled by the handleOneWayInvocation() method). Otherwise, the server invokes the method, also using the input arguments, and sends back the output values to the invoker (the handleTwoWayInvocation() procedure handles this case).

Listing 4.1: An RPC IDL example.
    interface Counter {
        void increment();
        int sum(int num);
    };

Listing 4.1 shows the IDL definition for a simple server that provides two basic operations over a counter variable. The one-way Counter:increment() procedure increments the counter by one, whereas the two-way Counter:sum() procedure adds a given number to the counter variable and returns the new total.

Algorithm 4.17 shows an implementation of the Counter server, which is normally called the RPC skeleton. The Counter:handleOneWayInvocation() procedure handles one-way invocations. It starts by performing a look-up that checks if the requested procedure exists in the object (error handling was omitted), which is followed by the call to the target procedure (lines 3 to 5). The only available one-way procedure is Counter:increment(), which increments sumTotal, the counter variable (lines 18 to 20).
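To make the one-way and two-way dispatch concrete, the Counter interface of Listing 4.1 can be sketched in Python. This is an illustration under stated assumptions rather than the thesis skeleton: the numeric procedure identifiers are arbitrary, and Python's struct module with a 4-byte network-order integer stands in for the middleware's serialization layer.

```python
import struct

# Hypothetical procedure identifiers; the numeric values are arbitrary.
PROC_INCREMENT_PID = 1
PROC_SUM_PID = 2


class Counter:
    def __init__(self):
        self.sum_total = 0

    def increment(self):
        self.sum_total += 1

    def sum(self, num):
        self.sum_total += num
        return self.sum_total

    def handle_one_way_invocation(self, pid, args):
        # One-way: run the procedure, nothing is returned to the client.
        if pid == PROC_INCREMENT_PID:
            self.increment()
        else:
            raise ValueError(f"unknown one-way procedure {pid}")

    def handle_two_way_invocation(self, pid, args):
        # Two-way: unmarshal the argument, call, marshal the result.
        if pid == PROC_SUM_PID:
            (num,) = struct.unpack("!i", args)
            return struct.pack("!i", self.sum(num))
        raise ValueError(f"unknown two-way procedure {pid}")


counter = Counter()
counter.handle_one_way_invocation(PROC_INCREMENT_PID, b"")
reply = counter.handle_two_way_invocation(PROC_SUM_PID, struct.pack("!i", 41))
total = struct.unpack("!i", reply)[0]
```

Unlike the pseudocode, this sketch raises an error for unknown procedure identifiers instead of omitting error handling.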
  • Algorithm 4.17: An RPC object implementation.
    var: this // the current RPC server object
    constant: PROC_INCREMENT_PID // one-way INCREMENT procedure identification
    constant: PROC_SUM_PID // two-way SUM procedure identification
    constant: COUNTER_OID // the object identification
    var: sumTotal // the accumulator variable
 1  procedure Counter:handleOneWayInvocation(pid,args)
 2      switch(pid)
 3          case(PROC_INCREMENT_PID)
 4              this.increment()
 5          end case
 6      end switch
 7  end procedure
 8  procedure Counter:handleTwoWayInvocation(pid,args)
 9      switch(pid)
10          case(PROC_SUM_PID)
11              num ← RPCSerialization:unmarshall(INT_TYPE,args)
12              result ← this.sum(num)
13              output ← RPCSerialization:marshall(INT_TYPE,result)
14              return output
15          end case
16      end switch
17  end procedure
18  procedure Counter:increment()
19      sumTotal ← sumTotal + 1
20  end procedure
21  procedure Counter:sum(num)
22      sumTotal ← sumTotal + num
23      return sumTotal
24  end procedure
25  procedure Counter:getOID()
26      return COUNTER_OID
27  end procedure
28  procedure Counter:getState()
29      state ← RPCSerialization:marshall(INT_TYPE,sumTotal)
30      return state
31  end procedure
32  procedure Counter:setState(state)
33      sumTotal ← RPCSerialization:unmarshall(INT_TYPE,state)
34  end procedure

On the other hand, the Counter:handleTwoWayInvocation() procedure handles the two-way invocations. It also checks if the requested procedure exists and then performs the two-way invocation (lines 10 to 15). The Counter:sum() procedure has one input variable, which has to be unmarshalled from the arguments (args) serialization
  • buffer (line 11). This is followed by a call to the Counter:sum() procedure using the unmarshalled argument num (line 12). The result from the call to the procedure is then marshalled into the serialization buffer output (line 13) and returned (line 14).

The Counter:getOID() function returns the Object Identifier (OID) of the object; in this example it returns the COUNTER_OID constant. The state of the Counter object is returned by the Counter:getState() procedure. In this implementation it returns the total state, and for this it only has to marshall sumTotal into a serialization buffer and return it. The counterpart of this procedure, Counter:setState(), performs the opposite action. It takes a serialization buffer containing the state, unmarshalls it and updates the local object.

Algorithm 4.18: RPC service bootstrap.
 1  procedure RPCService:open()
 2      hrt ← createQoSEndpoint(HRT, MAX_RT_PRIO)
 3      srt ← createQoSEndpoint(SRT, MED_RT_PRIO)
 4      be ← createQoSEndpoint(BE, BE_PRIO)
 5      sapQoSList ← {hrt,srt,be}
 6      serviceSAPs ← createRPCSAPs(sapQoSList)
 7
 8  end procedure

The RPC service is responsible for performing the invocations and for the management of objects. We start by presenting its bootstrap sequence. Algorithm 4.18 shows the opening sequence for the RPC service, exposed by the RPCService:open() procedure. Lines 2 to 5 show the creation of the list containing the QoS endpoint properties. This is followed by the creation of the SAPs and their respective bootstrap (lines 6 to 7). The information characterizing the SAPs is associated with the IID of the RPC service, by the runtime, so when a client resolves a service identifier it also retrieves the associated SAP information.

Algorithm 4.19 details the most relevant aspects of the RPC implementation. The procedure RPCService:handleRPCServiceMsg() is the call-back that handles all incoming invocations (issued by the lower-level SAP infrastructure).
The procedure takes as input two arguments: channel, the TCP/IP channel used to support the invocation; and invocation, which contains all the information relevant to the invocation.

The invocation argument is decomposed into five separate variables (line 2): iid, the invocation identification that is used in the reply to the client; type, which indicates the type of invocation (one-way or two-way); oid, the object/server identification; pid, which identifies the procedure to be invoked; and args, the arguments to be used in the
  • Algorithm 4.19: RPC service implementation.
 1  procedure RPCService:handleRPCServiceMsg(channel,invocation)
 2      (iid,type,oid,pid,args) ← invocation
 3      output ← handleInvocation(type,oid,pid,args)
 4      if RPCService:isFTEnabled() then
 5          RPCService:getReplicationGroup().replicate(getState())
 6      end if
 7      if type = TwoWay then
 8          channel.replyInvocation(iid,output)
 9      end if
10  end procedure
11  procedure RPCService:handleInvocation(type,oid,pid,args)
12      rpcObject ← getRPCObject(oid)
13      switch(type)
14          case(OneWay)
15              rpcObject.handleOneWayInvocation(pid,args)
16              return ∅
17          end case
18          case(TwoWay)
19              return rpcObject.handleTwoWayInvocation(pid,args)
20          end case
21      end switch
22  end procedure

invocation.

The actual invocation is delegated to the RPCService:handleInvocation() procedure (lines 11-22). After retrieving the object associated with the invocation (line 12), the procedure checks the type of the invocation and performs the corresponding action. If it is a one-way invocation then it simply delegates it to the object to perform the invocation (lines 14-17). If it is a two-way invocation then the results of the operation are returned back to the RPCService:handleRPCServiceMsg() procedure (lines 18-20).

After the invocation, and if the RPC service was bootstrapped with fault-tolerance (lines 4-6), the state of the RPC service is synchronized across the replica set by the replication group infrastructure (line 5). If the invocation returns an output value (two-way invocations), it is then sent back to the client (line 8).

The creation of an RPC client was already described in Chapter 3, more specifically in Listing 3.5. The bootstrap and invocation procedures of the RPC client are shown in Algorithm 4.20. The bootstrap sequence of the RPC client is implemented within the RPCServiceClient:open() procedure, which takes as input parameters: the sid of the RPC service; the iid of the instance that the client will bind to; and the client parameters.
The initial step is to retrieve the information associated with the
  • Algorithm 4.20: RPC client implementation.
    var: this // the current RPC client object
    var: channel // the low level connection object
 1  procedure RPCServiceClient:open(sid, iid, clientParams)
 2      queryInstanceInfoQuery ← this.createFindInstanceQuery(sid,iid)
 3      discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
 4      queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
 5      channel ← this.createRPCChannel(queryInstanceInfo.getSAPs(),clientParams)
 6  end procedure
 7  procedure RPCServiceClient:twoWayInvocation(oid,pid,args)
 8      return (channel.twoWayInvocation(oid,pid,args))
 9  end procedure
10  procedure RPCServiceClient:oneWayInvocation(oid,pid,args)
11      channel.oneWayInvocation(oid,pid,args)
12  end procedure

service instance (lines 2-4). It first starts by creating the query message, through the RPCServiceClient:createFindInstanceQuery() procedure, using the sid and iid arguments (line 2). This is followed by the retrieval of a reference to the discovery service (line 3), which is necessary to execute the query (line 4). This process ends with the creation of the network channel using the query reply, with the information about the available access points, and the selected level of QoS that is enclosed within the client parameters (line 5).

The RPCServiceClient:twoWayInvocation() procedure (lines 7-9) is used to perform two-way invocations, while the RPCServiceClient:oneWayInvocation() procedure (lines 10-12) handles one-way invocations. They both use the RPC network channel to perform the low-level remote invocation, that is, creating the packet and sending it through the network channel.
Contrary to its one-way counterpart, the two-way operation must wait for the reply packet before returning to the caller.

Semi-Active Fault-Tolerance Support

The middleware offers an extensible fault-tolerance infrastructure that is able to accommodate different types of replication policies.

Figure 4.20 depicts the fault-tolerance policy currently implemented in the overlay. Figure 4.20a) shows the RPC service without FT support. In this case, upon the reception of an invocation, the RPC service executes the invocation and replies immediately to the client, as no replication is to be performed.

Figure 4.20b) shows the RPC service with semi-active fault-tolerance enabled. The primary
  • (a) (b)

Figure 4.20: RPC service architecture without (left) and with (right) semi-active FT.

node, upon reception of an invocation (step 1), uses the replication group to update all the replicas (steps 2 and 3). After the replication is completed, that is, when all the acknowledgments have been received by the primary node (steps 4 and 5), it sends the result of the invocation back to the RPC client (step 6).

Algorithm 4.21: Semi-active replication implementation.
 1  procedure SemiActiveReplicationGroup:replicate(replicationObject)
 2      if IsPrimary() then
 3          replicationRequestList ← ∅
 4          for replica in replicaGroup do
 5              replicationRequest ← replica.sendMessage(replicationObject)
 6              replicationRequestList.add(replicationRequest)
 7          end for
 8          replicationRequestList.waitForCompletion()
 9      end if
10  end procedure

Algorithm 4.21 shows the algorithm used to implement semi-active replication. This procedure is only called in the primary peer of the replication group. After receiving a replication object, the replicate procedure sends a replication message to all the replicas that are present in the replication group (the acknowledgments were omitted for clarity).

Algorithm 4.22 shows the RPCService:onReplication call-back that is used by the
  • Algorithm 4.22: Service’s replication callback.
    var: this // the current RPC service object
 1  procedure RPCService:onReplication(replicationObject)
 2      switch(replicationObject.getType())
 3          case(State)
 4              this.setState(replicationObject)
 5              return ∅
 6          end case
 7          case(Invocation)
 8              (iid,type,oid,pid,args) ← replicationObject
 9              return this.handleInvocation(type,oid,pid,args)
10          end case
11      end switch
12  end procedure
13  procedure RPCService:setState(replicationObject)
14      (oid,state) ← replicationObject
15      rpcObject ← this.getRPCObject(oid)
16      rpcObject.setState(state)
17  end procedure

replication group to perform the state update. In the current implementation, we perform replication by synchronizing the state of the RPC service among the members of the replication group (lines 3-6). The RPCService:setState() procedure retrieves the object identification and state serialization buffer from the replicationObject variable (line 14). This is followed by the look-up of the target object (line 15), which is then used to update the state of the object (line 16). Our RPC implementation can be further extended to support replication based on the execution of the invocations. We present a possible implementation in lines 7 to 10.

However, this implementation is only valid for single-threaded object implementations without non-deterministic source code, such as code using the gettimeofday system call. The presence of multiple threads in a replica can alter the sequence of state updates, as the thread scheduling is controlled by the underlying operating system, and can lead to inconsistent states. The presence of non-deterministic source code in the server’s implementation can lead to inconsistent states if the replication is based on the re-execution of the invocations by each replica.
For example, if a server implementation uses the gettimeofday system call, then the execution of this system call will have a different value on each replica, leading to an inconsistent state. Several techniques have been proposed to address these problems [14, 129, 130].
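The semi-active flow of Algorithm 4.21 and the State branch of Algorithm 4.22 can be sketched in a few lines of Python. This is only an illustrative sketch, not the middleware's code: the `Replica` class, the tuple layout of the replication object, the `"ack"` return value and the use of a thread pool in place of network messages are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class Replica:
    """Hypothetical stand-in for one replica of the RPC service."""

    def __init__(self):
        self.objects = {}  # oid -> last replicated state

    def on_replication(self, replication_object):
        # Mirrors only the State branch of the replication call-back.
        kind, payload = replication_object
        if kind != "state":
            raise ValueError(f"unsupported replication object: {kind}")
        oid, state = payload
        self.objects[oid] = state
        return "ack"  # assumed acknowledgment value


class SemiActiveReplicationGroup:
    def __init__(self, replicas, is_primary=True):
        self.replicas = list(replicas)
        self.is_primary = is_primary
        # The pool stands in for asynchronous message sends (an assumption).
        self.pool = ThreadPoolExecutor(max_workers=max(1, len(self.replicas)))

    def replicate(self, replication_object):
        """Send the object to every replica and block until all have acknowledged."""
        if not self.is_primary:
            return []
        requests = [
            self.pool.submit(replica.on_replication, replication_object)
            for replica in self.replicas
        ]
        wait(requests)  # plays the role of waitForCompletion()
        return [request.result() for request in requests]  # re-raises replica errors


replicas = [Replica(), Replica()]
group = SemiActiveReplicationGroup(replicas)
acks = group.replicate(("state", ("obj-1", {"counter": 7})))
group.pool.shutdown()
```

Because the primary blocks in `replicate` until every acknowledgment has arrived, the reply to the client is delayed by the slowest replica, which is the latency cost that passive replication later relaxes.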
Fault-Tolerance Infrastructure Extensibility

To illustrate the extensibility of our fault-tolerance infrastructure, we provide the algorithms necessary to implement passive replication. An overview of the architecture of both policies is shown in Figures 4.21 and ??, respectively.

Passive Replication

Figure 4.21: RPC service with passive replication.

Passive replication [75] is interesting from the point of view of RT integration because it is associated with lower latency and lower resource requirements, such as CPU, as shown in our previous work [6]. However, this is only feasible through the relaxation of the state consistency among the replication group members. This is accomplished by avoiding immediate replication, as performed in semi-active replication. Instead, after receiving an invocation (step 1), the replication data is buffered and periodically sent to the replicas (step 2). Because the primary node does not need to wait for the acknowledgments, it can immediately return the result of the invocation to the RPC client (step 3). Each replica periodically receives the updates (step 4), processes them, and acknowledges them back to the primary of the replication group (step 5).

Algorithm 4.23 shows the algorithms needed to provide passive replication. The PassiveReplicationGroup:replicate() procedure, instead of immediately replicating the data, as done in semi-active replication, queues the data for later replication. The replication is performed periodically, using a user-defined period, by the PassiveReplicationGroup:timer() procedure (lines 6-14). To achieve a better throughput, it sends a batch message containing all the replication data that was
previously enqueued. In order to use passive replication, support for a batch message is introduced in the RPCService:onReplication procedure (lines 18-32). For each item contained in the batch message, it checks whether it is a state transfer or an invocation. In the case of a state transfer, it updates the service using the setState() procedure (line 23). Otherwise, it is handling an invocation request, so it performs the invocation and stores the result in the replyList variable (lines 25 to 28), which is used to return the output values for all the batched invocations to the replication group infrastructure (and is sent back to the primary).

Algorithm 4.23: Passive Fault-Tolerance implementation.
   var: this // the current passive replication group object
 1 procedure PassiveReplicationGroup:replicate(replicationObject)
 2   if this.IsPrimary() then
 3     this.enqueue(replicationObject)
 4   end if
 5 end procedure
 6 procedure PassiveReplicationGroup:timer(replicationObject)
 7   replicationBatch ← this.dequeueAll()
 8   replicationRequestList ← ∅
 9   for replica in replicaGroup do
10     replicationRequest ← replica.sendMessage(replicationBatch)
11     replicationRequestList.add(replicationRequest)
12   end for
13   replicationRequestList.waitForCompletion()
14 end procedure
15 procedure RPCService:onReplication(replicationObject)
16   switch(replicationObject.getType())
17     ... (continuation of Algorithm 4.22)
18     case(BatchMessage)
19       replyList ← ∅
20       for item in replicationObject do
21         switch(item.getType())
22           case(State)
23             this.setState(item)
24           end case
25           case(Invocation)
26             (iid,type,oid,pid,args) ← item
27             replyList.add(handleInvocation(iid,type,oid,pid,args))
28           end case
29         end switch
30       end for
31       return replyList
32     end case
33   end switch
34 end procedure
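The buffering and batching behaviour described for passive replication can be illustrated with a minimal Python sketch. It is written under stated assumptions and is not the thesis implementation: `BatchReplica`, the tuple layout of the replication objects and the direct method call that replaces the network message are hypothetical, and `timer()` is invoked by hand here instead of by a periodic timer so that the example stays deterministic.

```python
import threading


class BatchReplica:
    """Hypothetical replica that applies a whole batch of state updates at once."""

    def __init__(self):
        self.objects = {}  # oid -> last replicated state
        self.batches_received = 0

    def on_replication(self, batch):
        self.batches_received += 1
        for kind, payload in batch:
            if kind == "state":
                oid, state = payload
                self.objects[oid] = state
        return "ack"  # assumed acknowledgment value


class PassiveReplicationGroup:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.pending = []  # replication data buffered since the last flush
        self.lock = threading.Lock()  # replicate() and timer() may run on different threads

    def replicate(self, replication_object):
        # Only buffers the data: the primary can reply to its client immediately.
        with self.lock:
            self.pending.append(replication_object)

    def timer(self):
        # Would be driven by a user-defined period; sends one batch per replica.
        with self.lock:
            batch, self.pending = self.pending, []
        if not batch:
            return []
        return [replica.on_replication(batch) for replica in self.replicas]


replica = BatchReplica()
group = PassiveReplicationGroup([replica])
for counter in (1, 2, 3):
    group.replicate(("state", ("obj-1", {"counter": counter})))
state_before_flush = dict(replica.objects)  # replicas lag behind until the timer fires
group.timer()
```

The window between `replicate` and the next `timer` call is exactly where the relaxed consistency shows up: a primary crash in that window loses the buffered updates.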
4.2.2 Actuator

One of the most important services in public information systems, for both railroads and light trains, is the display of information at train stations about inbound and outbound compositions, such as their track number and estimated time of arrival. The actuator service allows a client to execute a command in a set of sensor nodes, such as displaying a string in a set of information panels. These panels are implemented by leaf peers.

Figure 4.22: Actuator service layout.

Figure 4.22 shows the deployment of an actuator service instance using two replicas. The primary instance binds to each panel in the set, while the replicas make pre-bind connections (shown as dashed lines).

Figure 4.23 shows an overview of the actuator service. The client starts by choosing and binding to the appropriate SAP of the actuator service. To display a message on the set of panels, the client sends a command (step 1) to the actuator service. After receiving the command, the service sends it to the sensors (step 2), waits for their acknowledgments (step 3), and then acknowledges the client itself (step 4).

Algorithm 4.24 shows the initial setup of the actuator service. As with the RPC service, the initial steps focus on the construction and initialization of the service access points (lines 2-7). If needed, a service must extend the generic class and augment it with the service-specific arguments. Unlike the RPC service, the actuator service makes use of this capability by introducing an additional panel list parameter. Before processing
this information, the actuator service must downcast the serviceArgs to its concrete implementation (line 8). Then, using the panel list, the actuator creates a network channel for each of the panels and stores them in a list (lines 9-12).

Figure 4.23: Actuator service overview.

Algorithm 4.24: Actuator service bootstrap.
   var: this // the current actuator service object
   var: panelGroup // the panel communication group object
 1 procedure ActuatorService:open(serviceArgs)
 2   hrt ← createQoSEndpoint(HRT, MAX_RT_PRIO)
 3   srt ← createQoSEndpoint(SRT, MED_RT_PRIO)
 4   be ← createQoSEndpoint(BE, BE_PRIO)
 5   sapQoSList ← {hrt,srt,be}
 6   serviceSAPs ← createActuatorSAPs(sapQoSList)
 7
 8   actuatorServiceArgs ← downcast(serviceArgs)
 9   for panel in actuatorServiceArgs.getPanelList() do
10     panelChannel ← createPanelChannel(panel)
11     panelGroup.add(panelChannel)
12   end for
13 end procedure

Algorithm 4.25 shows the main algorithm present in the actuator service. The procedure ActuatorService:handleAction() is the call-back that is executed upon the reception of a new action by the actuator service. The actuator spreads the action across all the panels (shown in Figure 4.23 as steps 2 and 3), using the channels previously
created in the bootstrap of the service (lines 2-6). Each failed panel is removed from the service panel list (line 11) and stored in an auxiliary list (line 12). The procedure ends with the creation of an acknowledgment message containing the list of failed panels, which is sent back to the client (lines 15 and 16).

Algorithm 4.25: Actuator service implementation.
   var: this // the current actuator service object
   var: panelGroup // the panel communication group object
 1 procedure ActuatorService:handleAction(action,channel)
 2   actionRequestList ← ∅
 3   for panel in panelGroup do
 4     actionRequest ← panel.sendMessage(action)
 5     actionRequestList.add(actionRequest)
 6   end for
 7   actionRequestList.waitForCompletion()
 8   failedPanels ← ∅
 9   for actionRequest in actionRequestList do
10     if actionRequest.failed() then
11       panelGroup.remove(actionRequest.getPanel())
12       failedPanels.add(actionRequest.getPanel())
13     end if
14   end for
15   ackMessage ← ActuatorService:createAckMessage(failedPanels)
16   channel.replyMessage(ackMessage)
17 end procedure

Algorithm 4.26: Actuator client implementation.
   var: this // the current actuator client object
   var: channel // the low level connection object
 1 procedure ActuatorServiceClient:open(sid, iid, clientParams)
 2   queryInstanceInfoQuery ← ActuatorSvcClient:createFindInstanceQuery(sid,iid)
 3   discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
 4   queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
 5   channel ← this.createActuatorChannel(queryInstanceInfo.getSAPs(),clientParams)
 6 end procedure
 7 procedure ActuatorServiceClient:action(action)
 8   actionRequest ← channel.sendMessage(action)
 9   actionRequest.waitForCompletion()
10 end procedure

Algorithm 4.26 shows the initialization of the actuator client and the implementation of the action operation. The ActuatorServiceClient:open() procedure exposes the bootstrap of the client, following the same implementation as the RPC service. A query
to find the information about the service instance is created and sent over the discovery service. Using the localization information retrieved in the query reply, a channel is created to that instance (lines 2-5). The low-level socket operations are handled by the ActuatorServiceClient:action() procedure (shown in Figure 4.23 in step 1). It sends the action through the channel and waits for the corresponding acknowledgment.

Actuator Fault-Tolerance

The service does not use the fault-tolerance support for data synchronization (as in the RPC service), but instead uses the replicas to pre-bind to the panels to minimize the recovery time. Figure 4.24 shows the architectural details of the actuator service with FT support.

Figure 4.24: Actuator fault-tolerance support.

In the event of a failure of the primary peer, the newly elected primary already has pre-binds to all the panels in the set, thus minimizing recovery latency. After rebinding to the new primary, the client reissues the failed action. While we could do the same using multiple service instances, the actuator client would have to know about these multiple instances, and switch among them in the presence of failures. Thus, using the fault-tolerance infrastructure avoids this issue, and allows the client to transparently switch over to the new running primary.
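The action fan-out of Algorithm 4.25 and the pre-bound takeover described above can be approximated with the following Python sketch. It is a simplified illustration under assumptions: the `Panel` class, the use of `ConnectionError` to signal a failed panel, the dictionary used as acknowledgment message and the example display string are all hypothetical, and the sends are sequential rather than asynchronous.

```python
class Panel:
    """Hypothetical information panel; healthy=False simulates an unreachable panel."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.text = None

    def send(self, action):
        if not self.healthy:
            raise ConnectionError(f"panel {self.name} unreachable")
        self.text = action


class ActuatorService:
    def __init__(self, panels):
        # The primary binds to every panel; a replica holds the same channels pre-bound.
        self.panel_group = list(panels)

    def handle_action(self, action):
        failed = []
        for panel in self.panel_group:
            try:
                panel.send(action)
            except ConnectionError:
                failed.append(panel)
        # Failed panels are dropped from the group and reported back to the client.
        for panel in failed:
            self.panel_group.remove(panel)
        return {"failed_panels": [panel.name for panel in failed]}


panels = [Panel("p1"), Panel("p2", healthy=False), Panel("p3")]
primary = ActuatorService(panels)
replica = ActuatorService(panels)  # pre-bound to the same panels, idle until elected

ack = primary.handle_action("Track 2: arriving in 5 min")
# After a primary crash the client would rebind and reissue the action on the new primary.
retry_ack = replica.handle_action("Track 2: arriving in 5 min")
```

Since the replica already holds its panel channels, the reissued action in this sketch pays no connection setup cost, which is the recovery-latency benefit the text attributes to pre-binding.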
4.2.3 Streaming

The streaming of both video and audio in public information systems is an important component in the management of train stations, especially in CCTV systems. The streaming service allows the streaming of a data flow, such as video and audio, from streamers to clients. While there is a considerable amount of work addressing streaming over P2P networks [131, 132], we have chosen to implement it at a higher level to allow us to provide an alternative example of an efficient streaming implementation, with fault-tolerance support, on a general-purpose middleware system.

Figure 4.25: Streaming service layout.

Figure 4.25 shows the deployment of a streaming service instance using two replicas. A leaf peer, denominated as streamer, connects to all the members of the replication group.

Figure 4.26 shows the architectural details of the streaming service. At bootstrap, the streaming service connects to the streamer (step 1) and starts receiving the stream (step 2). Afterwards, a client connects to the streaming service and requests a stream (step 3). The server allocates a stream session and the client starts receiving the stream from the service (step 4).

Each client is handled by a stream session, which was designed to support transcoding. The term transcoding refers to the capability of converting a stream from one encoding, such as raw data, to a different encoding, such as the H.264 standard [133]. The use of transcoding allows the streaming service to soften the compression ratio of streams
to support lower performance computing devices. At the same time, it also enables a reduction of bandwidth usage, through a higher compression ratio, for high performance computing devices. However, our current implementation does not perform any encoding; that is, it applies the identity filter.

Figure 4.26: Streaming service.

Algorithm 4.27: Stream service bootstrap.
   var: this // the current streaming service object
   var: streamServiceArgs // the streaming service arguments
   var: streamChannel // the streamer channel object
 1 procedure StreamService:open(serviceArgs)
 2   hrt ← createQoSEndpoint(HRT,MAX_RT_PRIO)
 3   srt ← createQoSEndpoint(SRT,MED_RT_PRIO)
 4   be ← createQoSEndpoint(BE,BE_PRIO)
 5   sapQoSList ← {hrt,srt,be}
 6   serviceSAPs ← this.createStreamSAPs(sapQoSList)
 7
 8   streamServiceArgs ← downcast(serviceArgs)
 9   streamerInfo ← streamServiceArgs.getStreamerInfo()
10   streamChannel ← this.createStreamerChannel(streamerInfo)
11 end procedure

Algorithm 4.27 exposes the initialization process of the stream service. The bootstrap process of the stream service is detailed in procedure StreamService:open(). The initial setup creates and bootstraps the service access points (lines 2-7). The stream service uses one additional parameter, the streamer endpoint. This parameter is used
to create a stream channel to the streamer (lines 8-10).

Algorithm 4.28: Stream service implementation.
   var: this // the current streaming service object
   var: streamSessions // the streaming session list
   var: streamStore // the stream circular buffer
   var: streamChannel // the streamer channel object
 1 procedure StreamService:handleNewStreamServiceClient(client,sessionQoS)
 2   streamSessions.add(createSession(client,sessionQoS))
 3 end procedure
 4 procedure StreamService:handleStreamerFrame(streamFrame)
 5   for session in streamSessions do
 6     session.processFrame(streamFrame)
 7   end for
 8 end procedure
 9 procedure StreamSession:processFrame(streamFrame)
10   streamStore.add(streamFrame)
11   streamChannel.sendFrame(streamFrame)
12 end procedure

Algorithm 4.28 starts by exposing the procedure that handles a new incoming stream client, StreamService:handleNewStreamServiceClient(). Upon the arrival of a new client, the stream service creates a new session and stores it. The StreamService:handleStreamerFrame() procedure handles incoming frames from the streamer. When the service receives a new frame, it updates every active session (lines 5 to 7) through the StreamSession:processFrame() procedure. Currently, a session only stores the received frames in a circular buffer (whose size is pre-defined), which will eventually replace older frames with newer ones. The purpose of this buffer is to suppress frame loss in the presence of a primary crash, allowing the client to request older frames to repair the damaged stream.

Algorithm 4.29 starts by describing the initialization process of the stream client. This initialization follows the same sequence as the previously described clients. It retrieves the information about the service instance, and then uses it to create a channel to the service instance.
Upon the reception of a new frame by the stream client, the StreamServiceClient:handleStreamFrame() procedure is executed.

Streaming Fault-Tolerance

Figure 4.27 shows the fault-tolerance support within the streaming service. The primary server and the replicas all connect to the streamer, and receive the stream in parallel
(step 1). Each of the replicas stores the stream flow up to a maximum configurable time, for example 5 minutes (step 2). When a stream client connects to the stream service, it binds to the primary instance and starts receiving the data stream (step 3). When a fault occurs in the primary, the client rebinds to the newly elected primary of the replication group. As the client rebinds, it must inform the new primary of the last frame it received. The new primary, through a new stream session, calculates the missing data and sends it back to the client, thereafter resuming the normal stream flow.

Figure 4.27: Streaming service with fault-tolerance support.

Algorithm 4.29: Stream client implementation.
   var: this // the current streaming client object
   var: channel // the low level connection object
 1 procedure StreamServiceClient:open(sid,iid,clientParams)
 2   queryInstanceInfoQuery ← StreamSvcClient:createFindInstanceQuery(sid,iid)
 3   discovery ← getRuntime().getOverlayInterface().getDiscovery()
 4   queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
 5   channel ← this.createStreamChannel(queryInstanceInfo.getSAPs(),clientParams)
 6 end procedure
 7 procedure StreamServiceClient:handleStreamFrame(streamFrame)
 8   // application specific...
 9 end procedure
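The recovery step, in which the new primary computes the frames the client missed, can be illustrated with a small Python sketch of the circular buffer. It rests on two assumptions that the text does not state: frames carry a monotonically increasing sequence number, and the buffer is bounded by frame count rather than by time. The names `StreamStore` and `frames_after` are hypothetical.

```python
from collections import deque


class StreamStore:
    """Bounded frame store: once full, the oldest frame is dropped for each new one."""

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def add(self, seq, data):
        self.frames.append((seq, data))

    def frames_after(self, last_seq):
        # Frames the client has not seen yet, given the last sequence number it received.
        return [frame for frame in self.frames if frame[0] > last_seq]


# A replica keeps receiving the stream in parallel with the primary.
store = StreamStore(capacity=4)
for seq in range(1, 7):
    store.add(seq, f"frame-{seq}".encode())

# After a primary crash, the client rebinds and reports that frame 4 was the last one received.
missing = store.frames_after(4)
oldest_available = store.frames[0][0]
```

A client that reports a sequence number older than `oldest_available` can only be served from that point on, so the buffer size bounds how long a failover may take before frames are lost for good.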
4.3 Support for Multi-Core Computing

The evolution of microprocessors has focused on the support for multi-core architectures as a way to scale past the current physical limits in manufacturing. This brings new challenges to systems programmers, as they must be able to deal with an ever-increasing potential for parallelism.

While coarse-grained parallelism can already be handled with current development frameworks, such as MPI and OpenMP, these are aimed at best-effort tasks that do not have a notion of deadline, and therefore are unable to support real-time. Furthermore, their programming model is based on a set of low-level primitives that do not offer any type of object-oriented programming support.

On the other hand, object-oriented programming languages provide very limited support for specifying object-to-object interactions and almost no parallelism support. For example, in C/C++ parallelism is achieved through the use of threads or processes that are implemented with low-level C primitives that do not have any type of object awareness.

For these reasons, fine-grained parallelism is hard to implement in a flexible and modular fashion. While a considerable amount of research work has been done on threading strategies with object awareness, such as the leader-followers pattern [11], they do not offer support for resource reservation or regulated access between objects.

4.3.1 Object-Based Interactions

The object-oriented paradigm is based on the principle of using objects, which are data structures containing data fields and methods, to develop computer programs. The methods of an object allow manipulation of its internal state, which is composed of its data fields. However, object-to-object interaction is not addressed by the object-oriented paradigm. Recent work on component middleware systems [65, 66] addressed this issue through the use of component-oriented models.
However, component-based programming offers a high-level approach that in our view is not able to address important low-level object-to-object interactions, such as CPU partitioning and fine-grained parallelism.

The implementation of fine-grained parallelism frameworks has to support object-to-object interactions that include direct and deferred calls. With direct calls (shown
in Figure 4.28a), the caller object enters the object-space of the callee, which might be guarded through a mutex, and performs the target action. On the other hand, when the target object enforces deferred calling (shown in Figure 4.28b), the caller object is unable to perform the operation directly and must queue it. These requests are then handled by a thread of the target object. The caller does not enter the callee object-space. This pattern is commonly known as Active Object [114, 13].

Figure 4.28: Object-to-Object interactions. (a) Direct calling. (b) Deferred calling.

4.3.2 CPU Partitioning

CPU partitioning is an approach based on the isolation of individual cores or processors to perform specific tasks, and is normally used to isolate real-time threads from potential interference from other non-real-time threads. Despite the large body of research on real-time middleware systems that use general-purpose operating systems over Commercial-Off-The-Shelf (COTS) hardware [3, 65, 66], to our knowledge, no real-time middleware system, especially when combined with FT support, has ever employed a CPU partitioning scheme (shielding) to further enhance real-time performance.

Figure 4.29 shows possible examples of CPU partitioning for microprocessors with 4 (Figure 4.29a), 6 (Figure 4.29b) and 8 cores (Figure 4.29c). A more detailed explanation of the resource reservation mechanisms is provided in Section 3.1.4. For now it suffices to say that the partitions designated with OS contain the threads that belong to the underlying operating system (in this case Linux). The partitions BE & RT contain the threads for best-effort and soft real-time, and finally, the Isolated RT indicates that
the partitions have dedicated cores that only host soft real-time threads, reducing the scheduling latency caused by the switching between best-effort and real-time threads.

Figure 4.29: Examples of CPU Partitioning. (a) Quad-core partitioning. (b) Six-core partitioning. (c) Eight-core partitioning.

Our runtime can be seen as a set of low-level services that offers a set of high-level abstractions for the implementation of high-level services. It was necessary to create a mechanism that regulates access between the services, that is, the interactions between objects running on different partitions, in order to preserve the QoS parameters of each individual service.

Figure 4.30 revisits the object-to-object interactions with the introduction of CPU partitioning. Figure 4.30a shows object A making a direct call to operation op_b1() in object B. This normally implies that operation op_b1() has a mutex to guard any critical data structures. Even with priority boosting schemes, such as priority inheritance, the use of mutexes can cause unbounded latencies. Consequently, this would break the isolation of partition Isolated RT and would defeat the purpose of using CPU partitioning. In order to improve throughput, real-time threads can be co-located with non-real-time threads to maximize the use of the cores allocated to a particular partition. The disadvantage of this approach is that the real-time threads are no longer in an isolated environment, and so the scheduling of non-real-time threads can cause interference in real-time threads. As shown in Figure 4.30b, a direct call involving objects within the same partition is a valid option.

The use of deferred calling (shown in Figure 4.30c) avoids the problems of direct calling when objects are allocated in different partitions. The call from object A is serialized and queued, and a future is associated with the pending request. This call is later
handled by a thread belonging to object B that dequeues it and executes the request, updating the future with the respective result. The thread of object A, which was waiting on the future, is woken and returns to op_a1(). This execution model is commonly referred to as worker-master [13].

Figure 4.30: Object-to-Object interactions with different partitions. (a) Direct calling with different partitions. (b) Direct calling within the same partition. (c) Deferred calling with different partitions.

4.3.3 Threading Strategies

A threading strategy defines how several threads interact in order to fulfill a goal, with each strategy offering a trade-off between latency and throughput. Figure 4.31 presents several well-known strategies that are implemented in our Support Framework, namely: a) Leader-Followers [11]; b) Thread-Pool [114]; c) Thread-per-Connection [12]; and d) Thread-per-Request [13].

Leader-Followers (LF)

The leader-followers pattern (c.f. Figure 4.31a) [11] was designed to reduce context
switching overhead when multiple threads access a shared resource, such as a set of file descriptors. This is a special kind of thread-pool where threads take turns as leaders, in order to access the shared resource. If the shared resource is a descriptor set, such as sockets, then, when a new event happens on a descriptor, the leader thread is notified by the select system call. At this point, the leader removes the descriptor from the set, elects a new leader, and then resumes the processing of the request associated with the event.

Figure 4.31: Threading strategies. (a) (b) (c) (d)

In this case, our default implementation allows foreign threads to join the leader-followers execution model. After joining, the foreign thread is inserted in the followers thread set, waiting for its turn to become a leader and process pending work. As soon as the event reaches a final state (success, error or timeout), the foreign thread is removed from the followers set.

Thread-Pool (TP)

The thread-pool pattern (c.f. Figure 4.31b) [114] consists of a set of pre-spawned threads that normally are synchronized by a barrier primitive, such as select and read. This pattern avoids the overhead and latency of dynamically creating threads to handle client requests, but results in a loss of flexibility. In general, however, it is possible to adjust the size of the pool in order to cope with environment changes.

Thread-per-Connection (TPC)

The thread-per-connection pattern (c.f. Figure 4.31c) [12] aims to provide minimum
latency time by avoiding request multiplexing, at the cost of having a dedicated thread per connection.

Every SAP has a listening socket that is responsible for accepting new TCP/IP connections and that is usually managed by an Acceptor design pattern [13]. After accepting a new connection, the Acceptor creates a new thread that will handle the connection throughout its life-cycle. Given the one-to-one match between thread and connection, it is not possible to allow foreign threads into the execution model without breaking the correctness of the connection object, as it is not configured to allow multiple accesses to low-level primitives such as the read system call. Because of this, any foreign thread that invokes a synchronous operation on the connection object has its request queued. The request is later processed by the thread that owns the connection.

Thread-per-Request (TPR)

The thread-per-request pattern (c.f. Figure 4.31d) [13] focuses on minimizing thread usage, while trying to maximize the overall throughput on a set of network sockets. This design pattern results from a combination of a low-level thread-per-connection strategy with a high-level thread-pool strategy. This strategy is also referred to as Half-Sync/Half-Async [13].

The role of the thread-per-connection strategy is to read and parse incoming packets, and to enqueue them into an input queue to be processed by the workers of the thread-pool. When a worker thread wants to send a packet, it also has to enqueue the packet into an output queue.

Minimization of Network Induced Priority Inversion

Providing end-to-end QoS in a distributed environment needs a vertical approach, starting at the network level (inside the OS layer). Previous research [134] focused on the minimization of network induced priority inversion, through the enhancement of Solaris's network stack to support QoS.
Additional work [3] extended this approach to the runtime level by providing separate access points for requests of different priority.

Building on these principles, our runtime was built to preserve end-to-end QoS semantics. To that end, each service publishes a set of access points, with associated QoS, that serve as entry points for client requests, thus avoiding request multiplexing. This approach was based on TAO's work on the minimization of priority inversion [3] caused by the use of network multiplexing. The service access points are served by a threading strategy that is statically configured during the bootstrap of the runtime.
However, as TAO was designed to accommodate only one type of service, that is, the RPC service, it did not address the following aspects: service inter-dependencies and resource reservation, more precisely, CPU shielding. In our middleware, each SAP is served by an execution model, offering a flexible behavior.

Figure 4.32: End-to-End QoS propagation.

4.3.4 An Execution Model for Multi-Core Computing

The lack of a design pattern capable of providing a flexible behavior that leverages the use of multi-core processors through CPU reservation and partitioning, while providing support for a configurable threading strategy, motivated the creation of the Execution Model/Context design pattern.

Figure 4.33: RPC service using CPU partitioning on a quad-core processor.

Figure 4.33 shows an overview of the RPC service while using CPU partitioning. The Isolated RT partition, containing core 1, supports the handling of high priority RT invocations, whereas the BE & RT partition, containing cores 2 and 3, supports the
handling of medium priority RT invocations and best-effort invocations. Each SAP features a thread-per-connection (TPC) threading strategy, but they can use any of the previously described strategies.

Figure 4.34: Invocation across two distinct partitions.

Figure 4.34 shows the interaction of a medium priority RT invocation, which is handled by a thread that belongs to a med RT SAP that resides in the BE & RT partition, with a high priority server that resides in the Isolated RT partition. While any thread belonging to the high RT SAP could directly interact with a high priority server, as they reside in the same partition, this should not happen when the interaction is originated by a thread belonging to a different partition. This last interaction could cause a priority inversion on the threading strategy that supports the high priority server.

The first part of the execution model/context pattern, the execution model sub-pattern, allows an entity to regulate the acceptance of foreign threads, that is, the threads that belong to other execution models, within its computing model. The rationale behind this principle resides in the fact that an application might reside in a dedicated core and the interaction with a foreign thread could cause cache line thrashing, or simply break the isolation for some real-time threads.

The second sub-pattern is the execution context. Its role is to efficiently manage the call stack through the use of Thread-Specific Storage (TSS). This allows the execution model to retrieve the necessary information about a thread, for example the partition that is assigned to the thread, and use it to regulate the behavior of the thread that is
interacting with it. For example, it prevents an isolated real-time thread that belongs to an isolated execution model hosted on an isolated real-time partition from participating in a foreign execution model, which would break the isolation principle and result in non-deterministic behavior (for example, by propagating interrupts from non-isolated real-time threads into the isolated core).

The internals of the Execution Model/Execution Context (EM/EC) design pattern are depicted in Figure 4.35, showing the interaction between three distinct execution models. When a thread that belongs to EM0 calls an operation on EM1, it effectively enters a new computational domain. An operation can either be synchronous or asynchronous. If it is asynchronous, then the requesting thread from EM0 will not participate in the computing effort of EM1.

Figure 4.35: Execution Model Pattern.

On the other hand, if the operation is synchronous, then it must check whether the last EM, the top of the execution context calling stack, allows its thread to participate in the threading strategy of EM1. If the thread is allowed to join the threading strategy, then it participates in the computing effort until the operation reaches a final state (that is, operation successful, error, or timeout). When it reaches the final state, it backtracks to the requesting EM, in this case EM0, by popping the context from the stack. The operation being performed on EM1 could continue the call chain by executing an operation on EM2, and if so, this process would repeat itself.

If the requesting EM0 does not allow its threads to join EM1, then the operation must be enqueued for future processing by a thread within the threading strategy of EM1. If EM1 embodies a passive entity, i.e. an object that does not have active threads running
inside its scope, then the EM is considered a NO-OP EM. In this scenario, it is not possible to enqueue the request because there are no threads to process it, so an error is returned to EM0 (this should only happen in the case of a configuration error). Otherwise, if EM1 is an active object and the queue buffer is not full, the request is enqueued and a reply future is created. As the operation has synchronous semantics, the thread (that belongs to EM0) must wait for the future to reach its final state before returning to its originating EM.

Algorithm 4.30: Joining an Execution Model.
   var: this // the current Execution Model object
 1 procedure ExecutionModel:join(event,timeout)
 2   ec ← TSS:getExecutionContext()
 3   topEM ← ec.peekExecutionModel()
 4   joinable ← topEM.allowsMigration(this)
 5   if not joinable then
 6     throw(ExecutionModelException)
 7   end if
 8   try
 9     ec.pushEnvironment(this,event,timeout)
10     ts ← this.getThreadingStrategy()
11     ts.join(event,timeout)
12   catch(ThreadingStrategyException)
13     ec.popEnvironment()
14     throw(ExecutionModelException)
15   catch(ExecutionContextException)
16     throw(ExecutionModelException)
17   end try
18 end procedure

Algorithm 4.30 presents the ExecutionModel:join() procedure, which acts as the entry point for every thread wanting to join the execution model. The procedure takes two arguments, an event and a timeout. The event represents an uncompleted operation belonging to the execution model, e.g. an unreceived packet reply from a socket, that must be completed before the deadline given by timeout. It starts by retrieving the Execution Context stored in Thread-Specific Storage (TSS) (line 2). This allows the execution context to be private to the thread that owns it, avoiding synchronized access to this data. At line 3, we retrieve the current, and also the last, execution model that the thread has entered.
If this last execution model does not allow its threads to migrate to the new execution model, then an exception is raised and the join process is aborted. Otherwise, the thread joins the new execution model, by first pushing onto the call stack the information regarding the join (a new tuple containing the new execution model, event and timeout) (line 9). This is followed by the thread
  • joining the threading strategy (Lines 10-11). If the threading strategy does not allow the thread to join it, then an exception is raised and the join is aborted. Independently of the success or failure of the join, the call stack is popped, thus eliminating the information regarding this completed join.

Algorithm 4.31: Execution Context stack management.
    var: this  // the current Execution Context object
    var: stack // the environment stack object
 1  procedure ExecutionContext:pushEnvironment(em, event, timeout)
 2      topEnv ←
 3      if timeout > topEnv.getTimeout() then
 4          throw(ExecutionModelException)
 5      end if
 6      if topEnv.getExecutionModel() = em & topEnv.getEvent() = event then
 7          topEnv.incrementNestingCounter()
 8      else
 9          nesting_counter ← 1
10          context ← createContextItem(em, event, timeout, nesting_counter)
11          stack.push(context)
12      end if
13  end procedure
14  procedure ExecutionContext:popEnvironment()
15      topEnv ←
16      topEnv.decrementNestingCounter()
17      if topEnv.getNestingCounter() = ∅ then
18          stack.pop()
19      end if
20  end procedure
21  procedure ExecutionContext:peekExecutionModel()
22      return
23  end procedure

Algorithm 4.31 shows the most relevant procedures of the execution context. The ExecutionContext:pushEnvironment() procedure is responsible for pushing a new execution environment onto the call stack. It starts by checking whether the timeout belonging to the new environment violates the previously established deadline (that belongs to the last execution model), and if that is the case, an exception is raised (Lines 3-5). If a thread is recursive, i.e. it enters the same execution model multiple times, then instead of creating a new execution environment and pushing it onto the stack, the procedure simply increments a nesting counter that represents the number of times a thread has reentered this execution domain (Lines 6-7). Otherwise (Lines 9-11), a new execution environment (with the nesting counter set to 1) is created and pushed onto the stack. 
The ExecutionContext:popEnvironment() procedure eliminates the top
  • execution environment present in the call stack. It starts by decrementing the nesting counter, and if it is equal to 0 then no recursive threads are present and the stack can safely be popped. Otherwise, no further action is taken. The remaining procedure, ExecutionContext:peekExecutionModel(), is an auxiliary procedure used to peek at the top execution model associated with the current thread.

Applying the EM/EC Pattern to the RPC Service

Figure 4.36 shows the RPC service using the EM/EC pattern. Each service access point (SAP) is served by a thread-per-connection strategy in which a dedicated thread, normally known as the Acceptor [13], handles new connections and spawns a new thread for each new client connection. Furthermore, the RPC service uses two CPU partitions: an Isolated RT partition for supporting high priority RT invocations, and a BE & RT partition for supporting medium priority RT and best-effort invocations.

Figure 4.36: RPC implementation using the EM/EC pattern.

In Figure 4.36, each priority lane, the logical composition of the low-level socket handling with the high-level server handling, is managed through a single execution model. Each connection is handled by a thread that, after reading an invocation packet, uses the server adapter to locate the target server and performs the invocation. As this approach does not enqueue requests between the layers, it does not introduce additional sources of latency. However, if the SAP that received the invocation request does not belong to
  • the same partition as the target server, then the request is enqueued in the execution model containing the server. The invocation is later dequeued, in this case by the thread that is handling the SAP, and executed. The reply is then enqueued in the execution model that originated the invocation.

Algorithm 4.32: Implementation of the EM/EC pattern in the RPC service.
    var: thisSocket  // the current RPC socket object
    var: thisService // the current RPC service object
    var: timeout     // the timeout associated with the invocation
    var: rpcService  // RPC service instance
 1  procedure RPCServiceSocket:handleInput()
 2      invocation ← getReadPacketFromSocket()
 3      rpcService.handleRPCServiceMsg(thisSocket, invocation)
 4  end procedure
 5  procedure RPCServiceObject:handleTwoWayInvocation(pid, args)
 6      try
 7          event ← createInvocationEvent(pid, args)
 8          thisService.getExecutionModel().join(event, timeout)
 9          return event.getOutput()
10      catch(ExecutionModelException ex)
11          event.wait(timeout)
12          return event.getOutput()
13      end try
14  end procedure

Algorithm 4.32 provides the main details of the EM/EC pattern implementation in the RPC service. The RPCServiceSocket:handleInput() procedure is the callback used by the thread managing the connection when an input event has occurred in the socket. After the packet is read from the socket, its processing is delegated to the upper level of the service, through the RPCService:handleRPCServiceMsg() procedure (shown previously in Algorithm 4.19). The server adapter is a bridge between the layers, and is shown with a dashed outline. It starts by locating the server object and delegating the invocation to it. The handling of a two-way invocation is implemented in the RPCServiceObject:handleTwoWayInvocation() procedure (the one-way invocation was omitted for clarity). 
If the invocation originated from a thread belonging to the server’s priority lane, more specifically from the socket that is handling the connection, then it is able to join the execution model of the server and help with the computation (lines 7 to 9). On the other hand, if the invocation originated from a thread belonging to an execution model outside the server’s partition, then the request is queued. After the threading strategy of the server executes the invocation, the request
  • is signaled as completed. At this point, the thread that originated the request is woken in the wait() procedure (line 11) and the output is returned (line 12).

4.4 Runtime Bootstrap Parameters

The bootstrap of the core is implemented in the method Core:open(args) and adjusts the behavior of the runtime during its life-cycle. The arguments are passed to the core through command line options. Table 4.1 shows the most relevant arguments present in the system.

    Property                    Meaning                               Default
    General
      use_resource_reservation  Enables resource reservation          true
      rr_runtime                Maximum global CPU runtime            10
      rr_period                 Maximum global CPU period             100
    Overlay specific
      default_interface         Default NIC                           eth0
      cell_multicast_interface  Default NIC for multicast             eth0
      cell_root_discovery_ip    IP address for root cell discovery
      cell_root_discovery_port  Port address for root cell discovery  2001
      tree_span_i               Tree span at level i                  2
      cell_peers_i              Maximum peers at tree level i         2
      cell_leafs_i              Maximum leafs at tree level i         80

    Table 4.1: Runtime and overlay parameters.

One of the most important flags in the system is the resource reservation support flag, controlled by the --resource_reservation command line option. Upon initialization, and if the resource reservation support flag is activated (the default behavior), the core creates a QoS client and connects to the resource reservation daemon. The --rr_runtime parameter controls how much CPU time can be spent running in each computational period, which in turn is defined by the --rr_period parameter. Both parameters are expressed in microseconds and are used to configure the underlying Linux control groups.
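The relation between the two reservation parameters can be illustrated with a small calculation. This is only a sketch: the helper name `reserved_cpu_share` is hypothetical and not part of the runtime, and the only values taken from the text are the defaults of Table 4.1 (a runtime of 10 and a period of 100).

```python
def reserved_cpu_share(runtime_us: int, period_us: int) -> float:
    """Fraction of every period that the reservation is allowed to run."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if not 0 <= runtime_us <= period_us:
        raise ValueError("runtime must lie between 0 and the period")
    return runtime_us / period_us


# Defaults from Table 4.1: runtime 10, period 100.
share = reserved_cpu_share(10, 100)
print(f"{share:.0%}")  # prints "10%"
```

Under these defaults the runtime would be entitled to 10% of the CPU time in each period, which is consistent with the 10% reservation used later in the evaluation.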
  • The overlay is controlled by a set of specific command line options. The default network interface card (NIC) to be used in network communications is controlled by the --default_interface parameter. The --cell_multicast_interface parameter defines the network interface card to be used by the cell discovery mechanism. Furthermore, --cell_root_discovery_ip and --cell_root_discovery_port are used to specify the IP address and port of the root multicast group. The --tree_span_i parameter specifies the tree span for the i-th level of the tree. The --cell_peers_i parameter specifies the maximum number of peers in each cell at tree level i. Last, the maximum number of leaf peers for every cell in tree level i is controlled by the --cell_leafs_i parameter.

It is possible to automatically bootstrap an overlay during the initialization of the runtime, using the --overlay command line option. For example, using --overlay=p3, the core will look for a “” in the current directory, and bootstrap it. Alternatively, it is possible to programmatically attach an overlay to the runtime, cf. Listing 3.1 in Chapter 3.

4.5 Summary

This chapter provided an overall view of the implementation of the runtime. 
We presented an overlay implementation inspired by the P3 topology, detailing the three mandatory peer-to-peer services: mesh, discovery, and fault-tolerance.

The chapter also presented three high-level services that provide a proof-of-concept for our runtime architecture, namely: an RPC service that implements the traditional remote procedure call; an Actuator service that exemplifies an aggregation service that uses the FT service solely to minimize rebind latency; and a Streaming service that offers buffering capabilities to ensure stream integrity even in the presence of faults.

Furthermore, the chapter provided an overview of the challenges faced in supporting multi-core computing, followed by the presentation of our novel design pattern, the Execution Model/Context, which provides an integrated solution for supporting multi-core computing.

Last, the chapter ended with a short description of the options that may be used when bootstrapping the runtime.
  • 5 Evaluation

–Success consists in being successful, not in having potential for success. Any wide piece of ground is the potential site of a palace, but there’s no palace till it’s built. Fernando Pessoa

This chapter provides an evaluation of the real-time performance of the middleware in the presence of the fault-tolerance and resource reservation mechanisms. The chapter highlights the performance of the two most important parts of the system, the overlay and the high-level services. This evaluation uses a set of benchmarks that characterize key aspects of the infrastructure. The assessment of the overlay infrastructure focuses on (a) membership (and recovery time), (b) query behavior, and (c) service deployment performance. The evaluation of the high-level services focuses on (d) the impact of FT on service performance, (e) the impact of multiple clients (using RPC as the test case), and finally, (f) a comparison with other platforms.

5.1 Evaluation Setup

The evaluation setup is composed of the physical infrastructure and the overlay configuration used to produce the benchmark results discussed throughout this chapter.

5.1.1 Physical Infrastructure

The physical infrastructure used to evaluate the middleware prototype consists of a cluster of 20 quad-core nodes, equipped with AMD Phenom II X4 920 CPUs at 2.8 GHz and 4 GB of memory, totaling 80 cores and 80 GB of memory. Each node was installed with Ubuntu 10.10 and kernel 2.6.39-git12. Despite our earlier efforts to use the real-time patch for Linux, known as the rt-preempt patch [135], this was not possible due to bugs in the control group infrastructure. The purpose of this patch is to reduce
  • the number and length of non-preemptive sections in the Linux kernel, resulting in less scheduling latency and jitter. Nevertheless, the 2.6.39 version incorporates most of the advancements brought by the rt-branch, namely, threaded-irqs [136]. The physical network infrastructure was a 100 Mbit/s Ethernet with a star topology.

5.1.2 Overlay Setup

At bootstrap, the middleware starts by building a peer-to-peer overlay with a user-specified number of peers and leaf peers. The peers are grouped in cells that are created according to the rules of the underlying P2P framework, described in Chapter 4. Overlay properties control the tree span and the maximum number of peers per cell at any given depth.

Figure 5.1: Overlay evaluation setup.

Figure 5.1 shows the configuration used for all the benchmarks performed on the overlay. The overlay forms a binary tree: the first level, composed of the root cell, has four peers, each cell on the second level has three peers, and each cell on the third and last level has two peers.

Figure 5.2 shows the physical layout used for the evaluation. Each peer is launched in a separate node of the cluster, for a total of 18 cluster nodes. On the other hand, all the leaf peers are launched in a single node. Last, the clients are either launched in the same node where the leaf peers were launched, or in the remaining free node of the cluster. The clients and leaf peers were allocated to the same node to provide accurate measurements in services, such as the streaming service, where the stream of data only goes one way. Otherwise, the physical clocks of both client and leaf nodes would have to be accurately synchronized through specialized hardware.
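The figure of 18 cluster nodes follows directly from the overlay configuration described above. The short sketch below reproduces that arithmetic; the function name `peers_in_overlay` is hypothetical, and it assumes a complete tree in which every cell has `tree_span` child cells and is filled to the stated number of peers.

```python
from typing import Sequence


def peers_in_overlay(tree_span: int, peers_per_cell: Sequence[int]) -> int:
    """Total peers in a complete tree of cells; index 0 is the root level."""
    total = 0
    for level, peers in enumerate(peers_per_cell):
        cells = tree_span ** level  # 1, 2, 4, ... cells for a binary tree
        total += cells * peers
    return total


# Binary tree with 4, 3 and 2 peers per cell on levels one to three.
print(peers_in_overlay(2, [4, 3, 2]))  # 1*4 + 2*3 + 4*2 = 18
```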
  • Figure 5.2: Physical evaluation setup.

5.2 Benchmarks

We divided the benchmark suite into two separate categories, one focusing on the low-level overlay performance and the other on the high-level services. The main objective is to isolate key mechanisms, especially at the overlay level, that may interfere with the behavior of the services. A second objective is to create a solid benchmark facility to assess the impact of future overlay implementations on the overall middleware performance.

5.2.1 Overlay Benchmarks

The following benchmarks were designed to evaluate the performance of a P2P overlay implementation. Figure 5.3 shows an overview of the different overlay benchmarks.

Membership Bind and Recovery

To evaluate the performance of the membership mechanism, we take two measurements: (a) the bind time, which reflects the time a node takes to negotiate its entry into the mesh, and (b) the rebind time, which comprises the recovery and rebinding (renegotiation) time that a node must undertake to deal with a faulty environment (Figure 5.3a). In our P2P overlay, this failure happens when a coordinator node crashes, leading to a fault on the containing cell, and subsequently to a fault in the tree mesh. The faulty cell recovers by electing a new coordinator node, allowing the child subtrees to rebind to the recovered cell. The time that it takes for a child subtree to rebind to the new coordinator is directly related to the size of its state (the serialized contents of the
  • subtree), thus the larger the subtree, the longer it will take to transfer its state to the new coordinator. So, in order to evaluate the worst case scenario, after building the mesh, the coordinator of the root cell is crashed, forcing a rebind of the first level cells.

Figure 5.3: Overview of the overlay benchmarks. (a) Membership bind & recovery. (b) Querying. (c) Service deployment.

Querying

One of the most fundamental aspects of P2P is its ability to efficiently find resources in the network. Given this, a measurement of the search mechanism is important to assess the performance of a given P2P implementation. To assess the worst case scenario, we focused on measuring the Place of Launch (PoL) query, as shown in Figure 5.3b. In our current P2P implementation, a query is handled only at the root cell, since it has a better account of the resource usage across the mesh tree.

Service Deployment

In a cloud-like environment it is important to deploy services quickly, and so the goal of this benchmark is to profile the performance of such a mechanism in our overlay. This benchmark measures the latency associated with a service bootstrap with and without
  • FT. Figure 5.3c represents a request to launch a service on a peer that is yet to be discovered. After the peer is found by the PoL query, the service is started. When a service is to be bootstrapped without FT support, the source creating the service only has to request one PoL query, as no replicas are going to be bootstrapped. Otherwise, the primary of the replication group has to issue the same number of PoL queries as the number of replicas that it is bootstrapping.

5.2.2 Services Benchmarks

We wanted to evaluate the following parameters: (a) the impact of fault-tolerance mechanisms on priority-based real-time tasks; (b) the impact of fault-tolerance on isolated real-time tasks; (c) a preliminary (latency only) comparison with other mainstream middleware systems, such as TAO, ICE and RMI. We implemented three simple services to serve as benchmarks and one to inject load in the peers.

The maximum allowed priority for all benchmarks is 48. Priorities above 48, and up to 99, are reserved for the various low-level Linux kernel threads, namely, the cgroup manager and irq handlers.

Figure 5.4: Network organization for the service benchmarks. (a) RPC. (b) Actuator. (c) Streaming.

RPC

The RPC service (Figure 5.4a) executes a procedure in a foreign address space. This is a standard service in any middleware system. A primary server receives a call from a client, executes it, and updates the state in all service replicas. When all replicas acknowledge the update, the primary server then replies to the client. In the absence of
  • fault-tolerance mechanisms, the primary server executes the procedure and immediately replies to the client.

To evaluate the RPC service we used the maximum available priority of 48. The remote procedure simply increments a counter and returns the value. We performed 1000 RPC calls in each run, with an invocation rate of 250 per second.

Actuator

The actuator service (Figure 5.4b) allows a client to execute a command in a set of panels controlled by leaf peers. This is used by EFACEC to display information about incoming and departing trains in a train station. After receiving the command, the primary server sends it to the panels, waits for their acknowledgments, and then acknowledges the client itself. The service does not use the fault-tolerance support for data synchronization (as in the RPC service), but instead pre-binds the replicas to the panels in the set.

We used 80 panels, and a string of 14 bytes. The 80 panels are representative of a large real-world public information system in a light train network. The string length represents the average size in current systems at EFACEC. We issued 1000 commands in each run, with an invocation rate of 250 per second.

Streaming

This service (Figure 5.4c) allows the streaming of a data flow (e.g. video, audio, events) from leaf peers to a client. This type of service is used by EFACEC to send and receive streams from train stations, namely, to implement the CCTV subsystem. The primary server and the replicas all connect to the leaf peers, and receive the stream in parallel. Each of the replicas stores the stream flow up to a maximum pre-defined time, for example 5 minutes. When a fault occurs in the primary, the client rebinds to the newly elected primary of the replication group. As the client rebinds, it must inform the new primary of the last frame it received. 
The new primary then calculates the missing data and sends it back to the client, thereafter resuming the normal stream flow.

We used a stream of 24 frames per second, each 4 Kbytes long, resulting in a bitrate of 768 Kbit per second. For example, this bitrate allows for a medium quality MPEG-4 stream with a 480 x 272 resolution, matching the video stream used by EFACEC’s CCTV. The client and leaf peers are located on the same machine, as this allows the determination of the one-way latency and jitter for the traffic. The stream was transmitted for 4 seconds in each run.
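The stream parameters and the replay step performed by a newly elected primary can be sketched as follows. This is an illustrative sketch only: `ReplicaBuffer`, `missing_after` and the use of integer sequence numbers to identify frames are assumptions made for the example, not the service's actual interface. Only the frame rate, the frame size, the 5 minute retention and the 4 second run come from the text.

```python
from collections import deque

FRAME_RATE = 24        # frames per second (from the text)
FRAME_SIZE_KBYTES = 4  # Kbytes per frame (from the text)


def bitrate_kbit_per_s(frame_rate: int, frame_kbytes: int) -> int:
    """Bitrate of the stream in Kbit per second (8 bits per byte)."""
    return frame_rate * frame_kbytes * 8


class ReplicaBuffer:
    """Keeps the most recent frames so that a new primary can replay them."""

    def __init__(self, retention_s: int, frame_rate: int = FRAME_RATE):
        # Older frames are dropped once the retention window is full.
        self.frames = deque(maxlen=retention_s * frame_rate)

    def store(self, seq: int, payload: bytes) -> None:
        self.frames.append((seq, payload))

    def missing_after(self, last_seq_seen: int):
        """Frames the client has not received yet, oldest first."""
        return [(s, p) for s, p in self.frames if s > last_seq_seen]


print(bitrate_kbit_per_s(FRAME_RATE, FRAME_SIZE_KBYTES))  # 24 * 4 * 8 = 768

# A replica buffers a 4 second run; the client last saw frame 40.
replica = ReplicaBuffer(retention_s=300)
for seq in range(4 * FRAME_RATE):
    replica.store(seq, b"\x00" * (FRAME_SIZE_KBYTES * 1024))
missing = replica.missing_after(40)
print(len(missing))  # frames 41 to 95
```

A bounded deque keeps the memory used by each replica proportional to the retention window, at the cost of losing frames older than that window.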
  • 5.2.3 Load Generator

Complex distributed systems are prone to be affected by the presence of rogue services that can become a source of latency and jitter. We evaluate the impact of the presence of such entities by introducing a load generator service in each peer. The latter spawns as many threads as the logical core count of the CPU. Unless explicitly mentioned, the threads are allocated to the SCHED_FIFO scheduling class, with priority 48. This scheduling policy represents the worst case scenario of unwanted computation. Given a desired load percentage p (in terms of the total available CPU time), each thread continuously generates random time intervals (up to a configurable maximum of 5ms). For each interval, it computes how long it must stay busy so that the load is p. For example, if the desired load is 75% and the value generated is 4ms, then the load generator must compute for 3ms and sleep for the remainder of that time lapse.

Each benchmark was run with increasing load values (in steps of 5%), up to a maximum of 95%. For each of these configurations we ran the benchmark 16 times, and computed the average and the 95% confidence intervals (represented as error bars). A vertical dashed line at a load of 90% is used as a reference for the case where resource reservation is enabled.

5.3 Overlay Evaluation

This section presents the results for the runs of the three benchmarks designed for evaluating the P2P overlay performance, namely: membership (bind and recovery), query, and service deployment. The plots that present latency and jitter use a logarithmic scale, which distorts the error bars and may be misleading when interpreting the results.

5.3.1 Membership Performance

These experiments estimate the impact on membership bind and recovery time, while the peers are exposed to an increasing load. The membership mechanisms run with maximum priority (48) on each peer of the overlay.
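The busy/sleep split used by each load generator thread can be written down as a short calculation. The sketch below only computes the schedule; it does not busy-wait, sleep, or set any scheduling policy. The names `split_interval` and `schedule`, and the use of a uniform distribution for the random intervals, are assumptions made for illustration.

```python
import random


def split_interval(load: float, interval_ms: float):
    """Return (busy_ms, sleep_ms) so that busy time is `load` of the interval."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be a fraction between 0 and 1")
    busy = load * interval_ms
    return busy, interval_ms - busy


def schedule(load: float, max_interval_ms: float = 5.0, steps: int = 1000, rng=None):
    """Yield busy/sleep pairs for randomly sized intervals (uniform: an assumption)."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        yield split_interval(load, rng.uniform(0.0, max_interval_ms))


# The example from the text: a 75% load over a 4 ms interval.
busy, idle = split_interval(0.75, 4.0)
print(busy, idle)  # 3.0 1.0
```

Because every interval is split in the same proportion, the aggregate load over any number of intervals stays at the requested fraction regardless of how the interval lengths are drawn.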
  • Figure 5.5: Overlay bind (left) and rebind (right) performance. Panels (a) and (b) plot latency (ms) against load (%), with (Res.) and without (No Res.) resource reservation.

These two measurements are a key factor in the overall performance of the higher layers of the middleware, because a node is only fully functional when it is connected to the mesh. In the presence of a fault it is important to be able to quickly rebind and recover, to minimize the down time of the P2P low-level services. This, in turn, can become a source of latency for the high-level services, for example, the RPC service.

The membership bind time, depicted in Figure 5.5a, shows a linear increase in bind latency when resource reservation is not enabled. This is expected: as the load increases, it creates additional interference on the threads of the mesh service. When the resource reservation mechanisms are enabled, the mesh service uses a portion of the resource reservation allocated to the runtime. The use of the resource reservation mechanisms allows for an almost constant latency, with some minor jitter at loads higher than 80%.

The rebind performance exhibits a similar behavior to the bind performance, although it exhibits a lower latency at loads below 80%. As with the bind benchmark, enabling the resource reservation mechanisms allows for a near constant rebind latency with very small jitter.

5.3.2 Query Performance

The query performance is one of the most crucial aspects of every overlay implementation, because it is the basis of resource discovery. Figure 5.6 shows the result of
  • performing the PoL query with and without resource reservation.

Figure 5.6: Overlay query performance. The plot shows latency (ms) against load (%), with (Res.) and without (No Res.) resource reservation.

The evaluation results show that up to loads of 70%, the use of resource reservation introduces a small overhead, as shown by the higher level of latency. This is explained by the fact that the execution model uses a Thread-per-Connection policy (without a connection pool), where a peer creates a new connection (using the desired level of QoS) to perform a query. When a neighbor peer receives a new connection (from the discovery service), it has to spawn a new thread to deal with the request. This process is repeated until a peer is able to handle the query, or the root cell is reached and a failure message is sent back to the originator peer. When using resource reservation, the creation of new threads must undergo an additional submission phase with the QoS daemon, and subsequently, within the QoS infrastructure in Linux (control groups), which increases the latency. Nevertheless, from 70% to 95%, the resource reservation mechanism is able to provide a stable behavior. In contrast, in the absence of the resource reservation mechanism, the query latency reaches a maximum of 400ms when the peers are subjected to a load of 95%.

5.3.3 Service Deployment Performance

The quick allocation of services, and ultimately of resources, is a crucial aspect of scalable middleware infrastructures. Figure 5.7 shows the evaluation results for service deployment, with a varying number of replicas.
  • Figure 5.7: Overlay service deployment performance. The plot shows latency (ms) against load (%), with (Res.) and without (No Res.) resource reservation, for deployments with no FT and with 1, 2 and 4 replicas.

The results show that without resource reservation, the system exhibits a linear increase of deployment time starting at loads of 30%, following the linear increase of the load injected in the system. Associated with this high latency, the results show a high jitter throughout the service deployment. The maximum value registered was near 10s, for the deployment of the service with 4 replicas without resource reservation and a load of 95%. On the other hand, when the discovery service used the resource reservation mechanism, it exhibited a near constant behavior, only showing a small increase of the deployment time when the service is deployed with FT. The increasing number of replicas brings additional latency to the deployment, as more queries need to be performed to discover additional sites to deploy the replicas. Naturally, the deployment of 4 replicas takes additional time, resulting in a maximum of around 100ms, which is still a 100-fold improvement over the 4 replica deployment without resource reservation. To conclude, the results show negligible jitter in all the deployment configurations when the resource reservation mechanism is activated.

5.4 Services Evaluation

Several aspects influence the behavior of the high-level services. Here, we present the two most important aspects, the impact of FT mechanisms on service latency and the impact of resource reservation while enforcing FT policies. Additionally, we present
  • results that characterize the impact of the presence of multiple clients, using RPC as the test case. The evaluation of the system ends with a preliminary comparison with other closely related middleware systems.

5.4.1 Impact of FT Mechanisms in Service Latency

These experiments estimate the impact of the FT mechanisms on service latency and rebind latency, as the peers are subjected to increasing load. The services run with maximum priority (48) without resource reservation. To assess the scalability of the FT mechanisms we also vary the size of the replication group for the service through 2, 3 and 5 (1 primary server + 1, 2, 4 replicas). For the rebind latency, in the middle of the run, we crash the primary server. This is accomplished by invoking an auxiliary RPC object, initially loaded in every peer of the system. Finally, as a baseline reference we present the results obtained with the same benchmarks but with all FT mechanisms disabled. In this case, no fault is injected, as no fault-tolerance is active.

The results for the runs can be seen in Figure 5.8. In general, the rebind latency presents a steeper increase when compared to the invocation latency, although the differences with a varying number of replicas are masked by jitter. The rebind process involves several steps: failure detection; election of a new primary server; discovery of the new primary server; and transfer of lost data. In each step, the increasing load introduces a new source of latency and jitter that adds to the overall rebind time. In this implementation the client must use the discovery service of the mesh to find the new primary server. This step could be optimized, for example, by keeping track of the replicas in the client. Despite this, the rebind latency remains fairly constant up to loads of 40% to 45%. 
The minimum and maximum rebind latencies for the RPC, Actuator and Streaming services are, respectively: 5.9ms, 5.7ms, 7.2ms, and 2823ms, 2068ms, 1087ms.

The invocation latencies depicted in Figure 5.8 show that up to loads of 35% the FT mechanisms introduce low overhead and low jitter. In the case of the RPC benchmark, which uses a more complex replica synchronization protocol, the overhead remains a constant factor in direct proportion to the number of replicas relative to the baseline case (no FT). The Actuator and Streaming services, with their simple (or non-existent) data synchronization protocols, follow the baseline very closely. Despite this, the Streaming service is far more CPU intensive than the Actuator service and therefore shows more impact from increasing loads. The minimum and maximum invocation latencies measured for the RPC, Actuator and Streaming services are, respectively: 0.1ms, 1.5ms, 1.1ms, and
  • Figure 5.8: Service rebind time (left) and latency (right). Panels for the RPC, Actuator and Stream services plot latency (ms) against load (%); the rebind latency panels show 1, 2 and 4 replicas, and the invocation latency panels show no FT and 1, 2 and 4 replicas.
  • 259ms, 19ms, 96ms.

5.4.2 Real-Time and Resource Reservation Evaluation

In these runs we use the middleware’s QoS daemon to isolate the services by reserving at least 10% of the available CPU time for the runtime that executes the service. The remaining 90% is used for operating system tasks and for the Load Generator service. The rest of the scenario is the same as in the previous set of runs.

Impact of FT in Service Latency with Reservation

The results for the runs can be seen in Figure 5.9. The fact that the services are now isolated, at least in terms of CPU, from the remainder of the system contributes to their almost constant latencies and stability (low jitter) with increasing peer loads. The invocation latency also shows the natural increase with the number of replicas. The minimum and maximum rebind latencies for the RPC, Actuator and Streaming services are, respectively: 9.2ms, 10.2ms, 10.9ms, and 15.8ms, 18.7ms, 21.9ms. The minimum and maximum invocation latencies for the RPC, Actuator and Streaming services are, respectively: 0.1ms, 4.8ms, 1.1ms, and 1.0ms, 5.9ms, 1.9ms.

Relative to the previous set of runs, the latencies for low values of peer load with resource reservation activated are somewhat higher. For example, the ratios between the minimum rebind latencies with and without reservation for RPC, Actuator and Streaming are, respectively: 1.6, 1.8, and 1.5. This is explained by the overhead introduced by the reservation mechanisms (previously explained in Chapter 3). This overhead has a higher impact on the rebind latency than on the invocation latency, because the rebind process has a much shorter duration, and therefore the overhead represents a larger fraction of the total time. 
In other words, the overhead of the resource reservation setup on the invocation latency is amortized across the duration of the benchmark, such as the 1000 calls performed to the RPC service.
Impact of Multiple Clients in RPC Latency
To evaluate the performance of the middleware in the presence of multiple clients with different priorities, we extended the RPC benchmark and introduced three service access points (SAPs) with distinct priorities, more precisely, 48, 24, and 0. The first two access points are served by a thread-per-connection model with scheduling class SCHED_FIFO (and priorities 48 and 24, respectively). The remaining SAP is served by threads with scheduling class SCHED_OTHER (with static priority 0). This benchmark allows us to
  • Figure 5.9: Rebind time and latency results with resource reservation. [Plots for the RPC, Actuator and Stream services: rebind latency (left) and invocation latency (right), in ms, against load (%), with curves for no FT and for 1, 2 and 4 replicas.]
  • measure the impact of multiple clients on RT performance, especially the impact of low priority clients on high priority clients.
As with the previous RPC benchmark, the remote procedure increments a counter, but before returning the value, it continuously computes a batch of arithmetic operations for 10ms. The objective is to evaluate the RT performance of the Linux scheduler and control groups. We used three clients with priorities 48, 24 and 0, and performed 1000 RPC calls in each run, with an invocation rate of 25 per second (corresponding to a deadline of 40ms). To evaluate the impact of different load conditions, we performed the benchmark using three load generator configurations, with priorities 48, 24 and 0.
Figures 5.10a, 5.10c and 5.10e show the number of deadlines missed by each client without resource reservation, under an increasing load of priorities 0, 24 and 48, respectively, whereas Figures 5.10b, 5.10d and 5.10f show the number of deadlines missed under the same conditions but with resource reservation enabled.
Without resource reservation, and if the load generator uses SCHED_OTHER threads with priority 0, the Linux scheduler is able to avoid any deadline miss. This is the expected outcome for clients using priorities 24 and 48, as they are served by SCHED_FIFO threads that are always scheduled ahead of any other scheduling class. The client using priority 0 (and associated SCHED_OTHER threads) is also able to avoid any miss. This is explained by the good implementation of Linux’s fair scheduler, which is able to handle loads of up to 95% of CPU time.
When the load generator uses SCHED_FIFO threads, the behavior starts to degrade with loads higher than 35%. In both cases, the client with priority 0 misses approximately 70% of its deadlines when the load is 95% of CPU time. This is explained by the CPU starvation caused by the load generator’s high priority RT threads.
When the load generator uses priority 24, the client that uses priority 48 should not have any deadline misses. However, this is not the case. The client that uses priority 48 also experiences missed deadlines, although on a much smaller scale. This is due to priority inversion at the network interface card driver (whose IRQ is handled by a high priority kernel thread).
When the load generator uses priority 48 (Figure 5.10e), this priority inversion is exacerbated. In addition, the race between the load generator threads interferes with the remaining threads, due to their SCHED_FIFO scheduling. Threads of this type are only preempted by higher priority threads; otherwise, they keep running until they voluntarily relinquish the CPU. But, as the load generator threads used the maximum
  • Figure 5.10: Missed deadlines without (left) and with (right) resource reservation. [Panels (a) to (f): number of missed deadlines for the clients with priorities 48, 24 and 0 against load (%), for load priority 0 (a, b), load priority 24 (c, d) and load priority 48 (e, f).]
  • permitted priority, this caused unbounded latency on the middleware threads (even on the high priority threads).
With resource reservation, and when the load generator used priority 0, there were a few unexpected missed deadlines. We speculate that a possible explanation is that we use a thread-per-connection strategy that creates a new thread for each new connection, with each new thread being submitted to the QoS daemon. This adds latency to the service, and can cause some missed deadlines in the first invocations from the client. When the load generator uses priorities 24 and 48, it worsens the latency associated with the acceptance of new threads by the QoS daemon. However, additional analysis of the Linux kernel is still required to validate this hypothesis.
Figures 5.11a, 5.11c and 5.11e show the invocation latencies for each client without resource reservation, under an increasing load of priorities 0, 24 and 48, whereas Figures 5.11b, 5.11d and 5.11f show the invocation latencies with resource reservation enabled.
With the load generator using priority 0 (Figures 5.11a and 5.11b, without and with resource reservation, respectively), the load only interferes with invocations using priority 0. When the load generator uses SCHED_FIFO threads with priorities 24 and 48, without the resource reservation mechanisms (Figures 5.11c and 5.11e), the performance starts to degrade at 35% of load. The client using priority 48 in Figure 5.11c should have a near constant invocation latency but, due to priority inversion, it presents a linear increase (although a much smaller one than the other two priorities). Figure 5.11e shows the expected behavior, with the load generator threads (using priority 48) causing a gradual latency increase in all the clients.
Figures 5.11d and 5.11f show the middleware performance with resource reservation enabled under load priorities of 24 and 48, respectively.
A scheduling artifact is noticeable in the invocations using priority 0: instead of remaining constant, their latency decreases as the load increases. The workload introduced by the RT threads of the load generator on the control group infrastructure, which is continuously forced to perform load balancing across the scheduling domains, causes a small jitter on clients with priorities 24 and 48.
RPC Performance Comparison with Other Platforms
Figure 5.12 shows the measured invocation latencies for the RPC service as implemented in our middleware and in other mainstream platforms, using only one client and one server and making 1000 RPC invocations at a rate of 250 invocations per second.
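The deadline accounting used in the multiple-client benchmark can be illustrated with a short sketch. It assumes that a call misses its deadline when its measured latency exceeds the invocation period (1/rate), which is consistent with the 25 invocations per second and 40ms deadline stated for that benchmark. The function names and the sample latencies are hypothetical; they are not taken from the Stheno code base or from the measured data.

```python
def deadline_ms(rate_per_second: float) -> float:
    """Return the per-call deadline, in ms, implied by a fixed invocation rate."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_per_second


def count_missed(latencies_ms, rate_per_second: float) -> int:
    """Count the calls whose latency exceeds the deadline implied by the rate."""
    limit = deadline_ms(rate_per_second)
    return sum(1 for latency in latencies_ms if latency > limit)


# 25 calls per second gives the 40 ms deadline used in the benchmark.
print(deadline_ms(25))  # 40.0

# Hypothetical latencies (ms): two of the four values exceed 40 ms.
print(count_missed([12.0, 39.9, 40.1, 75.0], 25))  # 2
```

Under the same assumption, a rate of 250 invocations per second would correspond to a 4ms period, although the text does not state a deadline for the platform comparison.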
  • Figure 5.11: Invocation latency without (left) and with (right) resource reservation. [Panels (a) to (f): invocation latency (ms) per client priority against load (%), for load priority 0 (a, b), load priority 24 (c, d) and load priority 48 (e, f).]
  • Figure 5.12: RPC invocation latency compared with reference middlewares (without fault-tolerance). [Plot: latency (ms) against load (%) for Stheno without reservation, Stheno with reservation, TAO, ICE and RMI.]
As expected, RMI, implemented with Java SE, has the worst behavior, with minimum and maximum latencies of, respectively, 0.3ms and 8.9ms. TAO was optimized for real-time tasks by using the CORBA-RT extension, exhibiting minimum and maximum latencies of, respectively, 0.3ms and 6.5ms. TAO’s results were hampered by its strict support for the (bloated) IIOP specification. ICE, while less stable than TAO, is overall more efficient, with minimum and maximum latencies of, respectively, 0.1ms and 7.8ms. Despite the absence of RT support in ICE, its lightweight implementation (it does not use IIOP) provides good performance for low values of load. Our middleware implementation is able to offer minimum and maximum latencies of, respectively, 0.1ms and 14.6ms, without resource reservation. With resource reservation we achieve a maximum latency of just 0.1ms, by effectively isolating the service in terms of required resources. Our implementation without resource reservation exhibits a mixed performance. Up to 40% of load, it compares very favorably to the other platforms, but above this limit, it starts to degrade more quickly. We attribute this behavior to the overhead associated with the time it takes to create a new thread to handle an incoming connection (a consequence of using the Thread-per-Connection strategy). Nevertheless, our performance is comparable with TAO’s. Above the 60% load threshold, all systems without resource reservation have their performance severely hampered by the Load Generator. Our system, with resource reservation enabled, is able to sustain high levels of performance by shielding the service from resource starvation, offering, at 95% of load,
  • a 55-fold improvement over the second best system (TAO), and a 77-fold improvement over the worst system (RMI).
5.5 Summary
This chapter provided an insight into the performance of several key components of our middleware infrastructure, more precisely, the low-level overlay and the high-level service layer. The benchmarks presented focused on highlighting crucial characteristics of both levels. At the overlay level, we focused on three aspects: membership behavior, query performance, and service deployment time. At the service layer, we focused on exposing the effects of our lightweight FT infrastructure on service performance, as well as the impact of the resource reservation mechanisms on both RT and FT performance. To put the performance of our system in context, we presented two additional evaluations. The first exhibits the effects of the presence of multiple clients (with distinct priorities) in the RPC service, a common practice for this type of service, such as in [3]. The last evaluation presented an RT performance comparison with other closely related systems.
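The access-point configuration of the multiple-client benchmark (priorities 48 and 24 served under SCHED_FIFO, priority 0 served under SCHED_OTHER) can be sketched as a simple mapping. This is a minimal illustration that assumes a Linux host and uses Python's os module; the function names are hypothetical and the code is not part of Stheno, whose threads are configured by the middleware itself.

```python
import os

# Linux scheduling policies. The numeric fallbacks (1 and 0) are the Linux
# values and are only used where the os module does not expose the constants.
SCHED_FIFO = getattr(os, "SCHED_FIFO", 1)
SCHED_OTHER = getattr(os, "SCHED_OTHER", 0)


def sap_policy(priority: int):
    """Map a service access point priority to (policy, static priority)."""
    if not 0 <= priority <= 99:
        raise ValueError("priority must be in 0..99")
    if priority > 0:
        return SCHED_FIFO, priority   # real-time class, fixed priority
    return SCHED_OTHER, 0             # default time-sharing class


def apply_to_caller(priority: int) -> bool:
    """Try to apply the policy to the caller; return False if it is refused."""
    policy, static_priority = sap_policy(priority)
    try:
        os.sched_setscheduler(0, policy, os.sched_param(static_priority))
        return True
    except (AttributeError, OSError):
        # Not available on this platform, or missing real-time privileges.
        return False


# The three access points used in the benchmark.
for sap_priority in (48, 24, 0):
    print(sap_priority, sap_policy(sap_priority))
```

Requesting SCHED_FIFO normally requires elevated privileges (or a suitable real-time priority limit), which is why the sketch reports a refusal instead of failing.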
  • –Success consists in being successful, not in having potential for success. Any wide piece of ground is the potential site of a palace, but there’s no palace till it’s built. Fernando Pessoa
6 Conclusions and Future Work
6.1 Conclusions
In this thesis we have designed and implemented Stheno, which to the best of our knowledge is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support of current solutions for the timing, reliability and physical deployment characteristics of our target systems, as shown in the survey on related work.
Our hypothesis is that it is possible to effectively and efficiently integrate real-time support with fault-tolerance mechanisms in a middleware system using an approach fundamentally distinct from current solutions. Our solution involves: (a) implementing FT support at a low level in the middleware, albeit on top of a suitable network abstraction to maintain transparency; (b) using the peer-to-peer mesh services to support FT; and (c) supporting real-time services through kernel-level resource reservation mechanisms.
The proposed architecture offers a flexible design that is able to support different fault-tolerance policies, including semi-active and passive. The runtime’s programming model details the most important interfaces and their interactions. It was designed to provide the necessary infrastructure for allowing users and services to interact with runtimes that are not in the same address space, thus allowing for a reduction in the resource footprint. Furthermore, it also provides support for additional languages.
We provide a complete implementation of a P2P overlay for efficient, transparent and configurable fault-tolerance, and support real-time through the use of resource reservation, network communication demultiplexing, and multi-core computing. The
  • CHAPTER 6. CONCLUSIONS AND FUTURE WORK
support for resource reservation was achieved through the implementation of a QoS daemon that manages and interacts with the low-level QoS infrastructure present in the Linux kernel. The multiplexing of requests can force high priority requests to miss their deadlines, because of the FIFO nature of network communications. To avoid this, our implementation allows services to define multiple access points, each one specifying a priority and a threading strategy. Last, to properly integrate resource reservation and the different threading strategies in a multi-core computing context, we have designed a novel design pattern, the Execution Model/Context. Fault-tolerance is efficiently implemented using the P2P overlay, and both the fault-tolerance strategy and the number of replicas are configurable per service. The current prototype has a code base of almost 1000 files and contains around 55000 lines of code.
The experiments show that Stheno meets and exceeds target system requirements for end-to-end latency and fail-over latency, thus validating our approach of implementing fault-tolerance mechanisms directly over the peer-to-peer overlay infrastructure. In particular, it is possible to isolate real-time tasks from system overhead, even in the presence of high loads and faults. Although the support for proactive fault-tolerance is still absent from the current implementation, we were able to mitigate the impact of faults in the system by providing proper isolation between the low-level P2P services and the user’s high-level services. This was mainly accomplished with the introduction of separate communication channels for both service types. We are able to maintain performance in user services even in the presence of major mesh rebinds.
When taken as a whole, these evaluation results are promising and support the idea that the approach followed is valid.
In summary, to the best of our knowledge, Stheno is the first system that supports:
Configurable Architecture. The architecture of our middleware platform is open, in the sense that it offers an adjustable and modular design that is able to accommodate a wide range of application domains. Instead of focusing on a specific application domain, such as RPC, we designed a service-oriented platform that offers a computational environment that seamlessly integrates both fault-tolerance and real-time. Furthermore, Stheno supports configurability at multiple levels: P2P, real-time and fault-tolerance.
P2P. Our infrastructure, based on pluggable P2P overlays, offers a resilient behavior that can be adjusted to meet the overall system requirements. The selection between different overlay topologies, structured or unstructured, allows a software architect to balance resource consumption, overall performance and resiliency.
Fault-Tolerance. We have implemented a lightweight fault-tolerance infrastructure
  • 6.2. FUTURE WORK
directly in the P2P overlay, currently supporting semi-active replication, that is able to provide minimum overhead, thus enhancing real-time performance. Nevertheless, a great effort was made to allow the support of additional replication policies, such as passive replication and active replication.
Real-Time Behavior. Our platform is able to offer resource reservation through the implementation of a QoS daemon that manages the available resources and interacts with the low-level resource reservation infrastructure provided by the Linux kernel. Furthermore, our architecture decouples control and data information flows through the introduction of distinct service access points (SAPs). These SAPs are served by a configurable threading strategy with an associated priority. Last, we introduced a novel design pattern, the Execution Model/Context, that is able to integrate resource reservation with distinct threading strategies, namely, Leader-Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-Request [13], that focus on the support for multi-core computing.
6.2 Future Work
The work accomplished in this thesis opens paths in several research domains.
Real-Time. An interesting challenge in the RT domain is to enhance the middleware with support for EDF [117] and study the limitations of implementing hard real-time tasks in a general purpose operating system, such as Linux. A derivative work is to study the implications of isolating low-level hardware interrupts and to measure the impact of different runtimes and periods in EDF tasks.
An in-depth study of the impact of CPU architecture, especially cache topology, on real-time performance and resource reservation behavior would contribute to improving the deployment of distributed RT systems.
Fault-Tolerance.
An interesting idea, originating from the collaboration with Prof. Priya Narasimhan, consists in providing support for multiple overlays to further enhance dependability. This opens several challenges: (a) correlating faults from different overlays with the goal of identifying the root causes, (b) choosing the optimal deployment site for service bootstrap, (c) enhancing the current state of the art in fault-tolerance with support for inter-overlay replication groups, that is, the placement of replicas across a distinct set of overlays, and (d) identifying nodes that are common to several overlays, as
  • they diminish FT capabilities.
Currently, we use a reactive fault-detection model that only acts after a fault has happened. Using a proactive approach, the runtime can predict imminent faults and take actions to eliminate, or at least minimize, the consequences of such events. A possible way to accomplish this involves using a combination of real-time resource monitoring analysis and gossip-based network monitoring.
The addition of new replicas to a replication group still poses a significant challenge in distributed RT systems. The disturbance caused by the initialization process of the new replica can be mitigated by a two-phase process. In the first phase, if there is no checkpoint available, the replication group would have to create one. The existing replicas would then split the checkpoint state between themselves, thereby relieving the primary of further overhead. In the second phase, all the replicas would transfer their portion of the checkpoint state to the joining replica. This would end with the primary providing the delta between the checkpoint state and the current state. This would greatly minimize the interference in the primary node, especially with very large states.
Byzantine Fault-Tolerance. The introduction of Byzantine Fault-Tolerance (BFT) still poses a significant challenge. The integration of BFT with RT would represent the next evolution in terms of FT. We would like to assess the impact of recent BFT replication protocols, such as Zyzzyva [137] and Aardvark [138], on real-time performance.
Virtualization. Current virtualization solutions focus on providing on-demand Virtual Machines (VMs) for end-user QoS, such as Amazon EC2. A lower-level approach can be taken by using lightweight VMs to provide a virtualized environment for runtime (user) services, allowing the support for legacy services.
This also allows the migration of services without having to implement FT awareness in the service. A second benefit of having support for virtualized services is the inherent support for providing strong isolation to services. This can be used as a way to prevent malicious servers from compromising the entire node.
A broad study on the possibility of having RT performance on the currently available hypervisors is needed to assess the feasibility of having RT virtualized services. To the best of our knowledge, no RT support has ever been attempted in lightweight virtualization hypervisors, such as Kernel Virtual-Machine (KVM) [108]. We speculate that the use of CPU isolation could make this feasible, possibly allowing the introduction of RT semantics to the Infrastructure as a Service (IaaS) paradigm. The recent developments on virtualization at the operating system level [139], by the Linux-CR project [140], could represent an interesting alternative to lightweight virtualization hypervisors. Because no latency is added to the middleware runtime, the real-time behavior should be preserved. Furthermore, only the state of the application is serialized, resulting in less overhead for the operating system and producing smaller state images that should provide a more efficient way of migrating runtimes between nodes, with a subsequent improvement in the recovery time.
6.3 Personal Notes
The main motivation for undertaking this PhD was the desire to solve the problems created by the requirements of our target systems, and it can be summarized with the following question: “Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?”.
Doing research on middleware systems is a difficult, yet rewarding, task. We feel that all the major goals of this PhD were met, and the author has gained an invaluable insight into the vast and complex domain of distributed computing.
From a computer science standpoint, the full implementation of a new P2P middleware platform that is able to offer seamless integration of both real-time and fault-tolerance was only possible with a thorough analysis of all the mechanisms involved, as well as their inter-dependencies. Eventually, this work will lead to further research on operating systems, parallel and distributed computing, and software engineering.
From the early stages of this PhD there has been an increasing focus on the support for adaptive behavior.
The ultimate goal is to balance fault-tolerance assurances with real-time performance, in order to meet the requirements of the target system. One of the most prevalent applications for this type of research is Cloud Computing. We hope that our work provides an open adaptive framework that allows researchers and developers to customize the behavior of the middleware to best suit their needs, while benefiting from a resilient and distributed network layer built on top of P2P overlays.
The evolution of middleware systems, and in particular the ones that pursue simultaneous support of both real-time and fault-tolerance, has been gradually focusing on the
  • efficient implementation of Byzantine fault-tolerance. The practical implementation of such systems constitutes a promising and exciting research field. Another promising research field is related to the introduction of hard real-time support in general purpose middleware systems while supporting the dynamic insertion and removal of services.
I hope to have the opportunity to contribute to these exciting research challenges.
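The two-phase replica initialization proposed in Section 6.2 can be sketched in a few lines. This is only an illustration of the idea under stated assumptions: the service state is modeled as a key-value dictionary, the checkpoint is split round-robin among the existing replicas (the primary included), the delta contains only additions and updates, and all names are hypothetical. It is not an implemented Stheno feature.

```python
def split_checkpoint(checkpoint: dict, replicas: list) -> dict:
    """Phase 1: assign each checkpoint key to one existing replica, round-robin."""
    if not replicas:
        raise ValueError("at least one existing replica is required")
    shares = {name: {} for name in replicas}
    for index, key in enumerate(sorted(checkpoint)):
        owner = replicas[index % len(replicas)]
        shares[owner][key] = checkpoint[key]
    return shares


def join_replica(shares: dict, delta: dict) -> dict:
    """Phase 2: merge every replica's share, then apply the primary's delta."""
    state = {}
    for share in shares.values():
        state.update(share)   # each replica transfers only its own portion
    state.update(delta)       # changes made after the checkpoint was taken
    return state


checkpoint = {"a": 1, "b": 2, "c": 3, "d": 4}
shares = split_checkpoint(checkpoint, ["primary", "r1", "r2"])
delta = {"b": 20, "e": 5}     # hypothetical updates since the checkpoint
print(join_replica(shares, delta))
```

With this split the primary transfers only its share of the checkpoint plus the delta, which is the source of the reduced interference argued for in the text.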
  • References
[1] Paulo Veríssimo and Luís Rodrigues. Distributed Systems for System Architects. Kluwer Academic Publishers, Norwell, MA, USA, 2001.
[2] Kenneth Birman. Guide to Reliable Distributed Systems. Texts in Computer Science. Springer, 2012.
[3] Douglas Schmidt, David Levine, and Sumedh Mungee. The Design of the TAO Real-Time Object Request Broker. Computer Communications, 21(4):294–324, 1998.
[4] Xavier Defago. Agreement-Related Problems: from Semi-Passive Replication to Totally Ordered Broadcast. PhD thesis, École Polytechnique Fédérale de Lausanne, August 2000.
[5] EFACEC, S.A. EFACEC Markets. presentationlayer/efacec_mercado_00.aspx?idioma=2&area=8&local=302&mercado=55. [Online; accessed 17-October-2011].
[6] Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. Lightweight Fault-Tolerance for Peer-to-Peer Middleware. In The First International Workshop on Issues in Computing over Emerging Mobile Networks (C-EMNs’10), In Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems (SRDS’10), pages 313–317, November 2010.
[7] Bela Ban. Design and Implementation of a Reliable Group Communication Toolkit for Java. Technical report, Cornell University, September 1998.
[8] Chen Lee, Ragunathan Rajkumar, and Cliff Mercer. Experiences with Processor Reservation and Dynamic QOS in Real-Time Mach. Proceedings of Multimedia Japan 96, April 1996.
[9] Hideyuki Tokuda, Tatsuo Nakajima, and Prithvi Rao. Real-Time Mach: Towards a Predictable Real-Time System. In USENIX MACH Symposium, pages 73–82, October 1990.
[10] Luigi Palopoli, Tommaso Cucinotta, Luca Marzario, and Giuseppe Lipari. AQuoSA - Adaptive Quality of Service Architecture. Software: Practice and Experience, 39(1):1–31, April 2009.
  • [11] Douglas Schmidt, Carlos O’Ryan, Irfan Pyarali, Michael Kircher, and Frank Buschmann. Leader/Followers: A Design Pattern for Efficient Multi-threaded Event Demultiplexing and Dispatching. In Proceedings of the 7th Conference on Pattern Languages of Programs (PLoP’01), August 2001.
[12] Douglas Schmidt and Steve Vinoski. Comparing Alternative Programming Techniques for Multithreaded CORBA Servers. C++ Report, 8(7):47–56, July 1996.
[13] Douglas Schmidt and Charles Cranor. Half-Sync/Half-Async: An Architectural Pattern for Efficient and Well-Structured Concurrent I/O. In Proceedings of the 2nd Annual Conference on the Pattern Languages of Programs (PLoP’95), pages 1–10, 1995.
[14] Priya Narasimhan, Tudor Dumitraş, Aaron Paulos, Soila Pertet, Carlos Reverte, Joseph Slember, and Deepti Srivastava. MEAD: Support for Real-Time Fault-Tolerant CORBA: Research Articles. Concurrency and Computation: Practice & Experience, 17(12):1527–1545, October 2005.
[15] Licínio Oliveira, Luís Lopes, and Fernando Silva. P3: Parallel Peer to Peer - An Internet Parallel Programming Environment. In Workshop on Web Engineering & Peer-to-Peer Computing, part of Networking 2002, volume 2376 of Lecture Notes in Computer Science, pages 274–288. Springer-Verlag, May 2002.
[16] James E. White. A High-Level Framework for Network-Based Resource Sharing. In Proceedings of the June 7-10, 1976, National Computer Conference and Exposition (AFIPS’76), pages 561–570, New York, NY, USA, 1976. ACM.
[17] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, 2(1):39–59, February 1984.
[18] Object Management Group. CORBA Specification. OMG Technical Committee Document:, Aug 1991. [Online; accessed 17-October-2011].
[19] Ann Wollrath, Roger Riggs, and Jim Waldo. A Distributed Object Model for the Java System. Computing Systems, 9(4):265–290, 1996.
[20] Michi Henning. The Rise and Fall of CORBA.
Communications of the ACM, 51(8):52–57, August 2008.
  • [21] Enterprise Team, Vlada Matena, Eduardo Pelegri-Llopart, Mark Hapner, James Davidson, and Larry Cable. Java 2 Enterprise Edition Specifications. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[22] A. Wigley, M. Sutton, S. Wheelwright, R. Burbidge, and R. Mcloud. Microsoft .Net Compact Framework: Core Reference. Microsoft Press, Redmond, WA, USA, 2002.
[23] Don Box, David Ehnebuske, Gopal Kakivaya, Andrew Layman, Noah Mendelsohn, Henrik Nielsen, Satish Thatte, and Dave Winer. Simple Object Access Protocol (SOAP) 1.1. W3C note, World Wide Web Consortium, May 2000. [Online; accessed 17-October-2011].
[24] Marc Fleury and Francisco Reverbel. The JBoss Extensible Server. In Proceedings of the 4th ACM/IFIP/USENIX International Middleware Conference (Middleware’03), pages 344–373, New York, NY, USA, 2003. Springer-Verlag New York, Inc.
[25] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s Highly Available Key-value Store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP’07), pages 205–220, October 2007.
[26] Yan Huang, Tom Fu, Dah-Ming Chiu, John Lui, and Cheng Huang. Challenges, Design and Analysis of a Large-Scale P2P-VOD System. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM ’08), pages 375–388, New York, NY, USA, August 2008. ACM.
[27] Edward Curry. Message-Oriented Middleware, pages 1–28. John Wiley & Sons, Ltd, 2005.
[28] Tibco. Tibco Rendezvous. rendezvous/. [Online; accessed 17-October-2011].
[29] IBM. WebSphere MQ. [Online; accessed 17-October-2011].
[30] Richard Monson-Haefel and David Chappell. Java Message Service. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2000.
  • [31] JCP. JAIN SLEE v1.1 Specification. JCP Document: http://download., Jul 2008. [Online; accessed 17-October-2011].
[32] Mobicents. The Open Source SLEE and SIP Server. http://www.mobicents.org/. [Online; accessed 17-October-2011].
[33] Object Management Group. OpenDDS. [Online; accessed 17-October-2011].
[34] RTI. Connext DDS. [Online; accessed 17-October-2011].
[35] Douglas C. Schmidt and Hans van’t Hag. Addressing the challenges of mission-critical information management in next-generation net-centric pub/sub systems with opensplice dds. In IPDPS, pages 1–8, 2008.
[36] Object Management Group. Fault Tolerant CORBA Specification. OMG Technical Committee Document:, May 2010. [Online; accessed 17-October-2011].
[37] Tarek Abdelzaher, Scott Dawson, Wu Feng, Farnam Jahanian, S. Johnson, Ashish Mehra, Todd Mitton, Anees Shaikh, Kang Shin, Zhiheng Wang, Hengming Zou, M. Bjorkland, and Pedro Marron. ARMADA Middleware and Communication Services. Real-Time Systems, 16:127–153, 1999.
[38] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger. Distributed Fault-Tolerant Real-Time Systems: the Mars Approach. Micro, IEEE, 9(1):25–40, February 1989.
[39] Kane Kim. ROAFTS: A Middleware Architecture for Real-Time Object-Oriented Adaptive Fault Tolerance Support. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium (HASE’98), page 50. IEEE Computer Society, November 1998.
[40] Eltefaat Shokri, Patrick Crane, Kane Kim, and Chittur Subbaraman. Architecture of ROAFTS/Solaris: A Solaris-Based Middleware for Real-Time Object-Oriented Adaptive Fault Tolerance Support. In COMPSAC, pages 90–98. IEEE Computer Society, 1998.
[41] Kane Kim and Chittur Subbaraman. Fault-Tolerant Real-Time Objects. Communications of the ACM, 40(1):75–82, 1997.
[42] Kane Kim and Chittur Subbaraman. A Supervisor-Based Semi-Centralized Network Surveillance Scheme and the Fault Detection Latency Bound. In Proceedings of the 16th Symposium on Reliable Distributed Systems (SRDS’97), pages 146–155, October 1997.
[43] Manas Saksena, James da Silva, and Ashok Agrawala. Design and Implementation of Maruti-II. In Sang Son, editor, Advances in Real-Time Systems, pages 73–102. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.
[44] David Powell, Gottfried Bonn, D. Seaton, Paulo Veríssimo, and François Waeselynck. The Delta-4 Approach to Dependability in Open Distributed Computing Systems. In Proceedings of the 18th Annual International Symposium on Fault-Tolerant Computing (FTCS’88), pages 246–251, Tokyo, Japan, 1988. IEEE Computer Society Press.
[45] P. Bond, P. Barrett, A. Hilborne, Luís Rodrigues, D. Seaton, N. Speirs, and Paulo Veríssimo. The Delta-4 Extra Performance Architecture (XPA). 20th International Symposium on Fault-Tolerant Computing, pages 481–488, 1990.
[46] James Gosling and Greg Bollella. The Real-Time Specification for Java. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[47] Greg Bollella, James Gosling, Ben Brosgol, P. Dibble, Steve Furr, David Hardin, and Mark Turnbull. The Real-Time Specification for Java. The Java Series. Addison-Wesley, 2000.
[48] Peter Dibble. Real-Time Java Platform Programming. BookSurge Publishing, 2nd edition, 2008.
[49] Joshua Auerbach, David Bacon, Daniel Iercan, Christoph Kirsch, V. Rajan, Harald Roeck, and Rainer Trummer. Java Takes Flight: Time-Portable Real-Time Programming with Exotasks. In Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES’07), pages 51–62, New York, NY, USA, 2007. ACM.
[50] Joshua Auerbach, David Bacon, Bob Blainey, Perry Cheng, Michael Dawson, Mike Fulton, David Grove, Darren Hart, and Mark Stoodley. Design and Implementation of a Comprehensive Real-time Java Virtual Machine. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software (EMSOFT’07), pages 249–258, New York, NY, USA, 2007. ACM.
[51] Introduction to WebLogic Real-Time. 01/wlrt/docs10/pdf/intro_wlrt.pdf. [Online; accessed 17-October-2011].
[52] Silvano Maffeis. Adding Group Communication and Fault-Tolerance to CORBA. In USENIX Conference on Object-Oriented Technologies, 1995.
[53] Alexey Vaysburd and Kenneth Birman. Building Reliable Adaptive Distributed Objects with the Maestro Tools. In Proceedings of Workshop on Dependable Distributed Object Systems (OOPSLA’97), 1997.
[54] Yansong Ren, David Bakken, Tod Courtney, Michel Cukier, David Karr, Paul Rubel, Chetan Sabnis, William Sanders, Richard Schantz, and Mouna Seri. AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects. IEEE Transactions on Computers, 52:31–50, January 2003.
[55] Balachandran Natarajan, Aniruddha Gokhale, Shalini Yajnik, and Douglas Schmidt. DOORS: Towards High-Performance Fault Tolerant CORBA. In Proceedings of International Symposium on Distributed Objects and Applications (DOA’00), pages 39–48, 2000.
[56] Silvano Maffeis and Douglas Schmidt. Constructing Reliable Distributed Communications Systems with CORBA. IEEE Communications Magazine, 35(2):56–61, February 1997.
[57] Robbert van Renesse, Kenneth Birman, and Silvano Maffeis. Horus: A Flexible Group Communication System. Communications of the ACM, 39(4):76–83, November 1996.
[58] Kenneth Birman and Robbert van Renesse. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1994.
[59] Robbert van Renesse, Kenneth Birman, Mark Hayden, Alexey Vaysburd, and David Karr. Building Adaptive Systems Using Ensemble. Software–Practice and Experience, 28(8):963–979, August 1998.
[60] Thomas C. Bressoud. TFT: A Software System for Application-Transparent Fault Tolerance. In Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing (FTCS’98), pages 128–137, 1998.
[61] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna Krishnamurthy, and Irfan Pyarali. Flexible and Adaptive QoS Control for Distributed Real-Time and Embedded Middleware. In Markus Endler and Douglas Schmidt, editors, Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware’03), volume 2672 of Lecture Notes in Computer Science, pages 374–393. Springer, June 2003.
[62] Douglas Schmidt and Fred Kuhns. An Overview of the Real-Time CORBA Specification. IEEE Computer, 33(6):56–63, June 2000.
[63] IETF. An Architecture for Differentiated Services. rfc2475.txt. [Online; accessed 17-October-2011].
[64] Lixia Zhang, Stephen Deering, Deborah Estrin, Scott Shenker, and Daniel Zappala. RSVP: A New Resource ReSerVation Protocol. IEEE Network, 7(5):8–18, 1993.
[65] Nanbor Wang, Christopher Gill, Douglas Schmidt, and Venkita Subramonian. Configuring Real-Time Aspects in Component Middleware. In CoopIS/DOA/ODBASE (2), pages 1520–1537, 2004.
[66] Friedhelm Wolf, Jaiganesh Balasubramanian, Aniruddha Gokhale, and Douglas Schmidt. Component Replication Based on Failover Units. In Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA’09), pages 99–108, August 2009.
[67] Nanbor Wang, Douglas Schmidt, Aniruddha Gokhale, Christopher Gill, Balachandran Natarajan, Craig Rodrigues, Joseph Loyall, and Richard Schantz. Total Quality of Service Provisioning in Middleware and Applications. Microprocessors and Microsystems, 26:9–10, 2003.
[68] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna Krishnamurthy, and Irfan Pyarali. Flexible and Adaptive QoS Control for Distributed Real-Time and Embedded Middleware. In Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware (Middleware’03), pages 374–393, New York, NY, USA, June 2003. Springer-Verlag New York, Inc.
[69] Fabio Kon, Fabio Costa, Gordon Blair, and Roy Campbell. The Case for Reflective Middleware. Communications of the ACM, 45:33–38, June 2002.
[70] Jürgen Schonwalder, Sachin Garg, Yennun Huang, Aad van Moorsel, and Shalini Yajnik. A Management Interface for Distributed Fault Tolerance CORBA Services. In Proceedings of the IEEE Third International Workshop on Systems Management (SMW’98), pages 98–107, Washington, DC, USA, April 1998.
[71] Pascal Felber, Benoit Garbinato, and Rachid Guerraoui. The Design of a CORBA Group Communication Service. In Proceedings of the 15th Symposium on Reliable Distributed Systems (SRDS’96), Washington, DC, USA, October 1996. IEEE Computer Society.
[72] Graham Morgan, Santosh Shrivastava, Paul Ezhilchelvan, and Mark Little. Design and Implementation of a CORBA Fault-Tolerant Object Group Service. In Proceedings of the 2nd IFIP WG 6.1 International Working Conference on Distributed Applications and Interoperable Systems (DAIS’99), pages 361–374, Deventer, The Netherlands, 1999. Kluwer, B.V.
[73] Object Management Group. Real-time CORBA Specification. OMG Technical Committee Document:, January 2005. [Online; accessed 17-October-2011].
[74] Jaiganesh Balasubramanian. FLARe: a Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time and Embedded Systems. In Proceedings of the 4th Middleware Doctoral Symposium (MDS’07), pages 17:1–17:6, New York, NY, USA, November 2007. ACM.
[75] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. The Primary-Backup Approach. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.
[76] Object Management Group. Light Weight CORBA Component Model Revised Submission. OMG Technical Committee Document: spec/CCM/3.0/PDF/, June 2002. [Online; accessed 17-October-2011].
[77] Jaiganesh Balasubramanian, Aniruddha Gokhale, Abhishek Dubey, Friedhelm Wolf, Chenyang Lu, Christopher Gill, and Douglas Schmidt. Middleware for Resource-Aware Deployment and Configuration of Fault-Tolerant Real-time Systems. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’10), pages 69–78. IEEE Computer Society, April 2010.
[78] Fred Schneider. Replication Management using the State-machine Approach. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.
[79] Louise Moser, P. Michael Melliar-Smith, and Priya Narasimhan. A Fault Tolerance Framework for CORBA. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing (FTCS’99), Washington, DC, USA, 1999. IEEE Computer Society.
[80] Priya Narasimhan, Louise Moser, and P. Michael Melliar-Smith. Strongly Consistent Replication and Recovery of Fault-Tolerant CORBA Applications. Computer System Science and Engineering Journal, 17, 2002.
[81] Justin Frankel and Tom Pepper. Gnutella Specification. http://www. [Online; accessed 17-October-2011].
[82] Yoram Kulbak and Danny Bickson. The eMule Protocol Specification, January 2005. [Online; accessed 17-October-2011].
[83] PPLive. PPTV. [Online; accessed 17-October-2011].
[84] Mario Ferreira, João Leitão, and Luís Rodrigues. Thicket: A Protocol for Building and Maintaining Multiple Trees in a P2P Overlay. In Proceedings of the 29th International Symposium on Reliable Distributed Systems (SRDS’10), pages 293–302. IEEE, November 2010.
[85] Zhi Li and Prasant Mohapatra. QRON: QoS-aware Routing in Overlay Networks. IEEE Journal on Selected Areas in Communications, 22(1):29–40, January 2004.
[86] Eric Wohlstadter, Stefan Tai, Thomas Mikalsen, Isabelle Rouvellou, and Premkumar Devanbu. GlueQoS: Middleware to Sweeten Quality-of-Service Policy Interactions. In Proceedings of the 26th International Conference on Software Engineering (ICSE’04), pages 189–199, May 2004.
[87] Anthony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. SCRIBE: The Design of a Large-Scale Event Notification Infrastructure. In Proceedings of the 3rd International COST264 Workshop on Networked Group Communication (NGC’01), pages 30–43, November 2001.
[88] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In Proceedings of the 2nd ACM/IFIP/USENIX International Middleware Conference (Middleware’01), pages 329–350, November 2001.
[89] Leslie Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems, 16:133–169, May 1998.
[90] Peter Pietzuch and Jean Bacon. Hermes: A Distributed Event-Based Middleware Architecture. In ICDCS Workshops, pages 611–618. IEEE Computer Society, July 2002.
[91] Ben Zhao, Ling Huang, Jeremy Stribling, Sean Rhea, Anthony Joseph, and John Kubiatowicz. Tapestry: A Resilient Global-Scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, June 2003.
[92] David Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer. SETI@home: an Experiment in Public-Resource Computing. Communications of the ACM, 45:56–61, November 2002.
[93] Björn Knutsson, Honghui Lu, Wei Xu, and Bryan Hopkins. Peer-to-peer Support for Massively Multiplayer Games. In Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’04), volume 1, March 2004.
[94] Gilles Fedak, Cécile Germain, Vincent Néri, and Franck Cappello. XtremWeb: a Generic Global Computing System. In Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing, pages 582–587, May 2001.
[95] Andrew Chien, Brad Calder, Stephen Elbert, and Karan Bhatia. Entropia: Architecture and Performance of an Enterprise Desktop Grid System. Journal of Parallel and Distributed Computing, 63:597–610, May 2003.
[96] David Anderson. BOINC: A System for Public-Resource Computing and Storage. In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (GRID’04), pages 4–10, Washington, DC, USA, November 2004. IEEE Computer Society.
[97] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51:107–113, January 2008.
[98] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. Adapting MapReduce for Dynamic Environments Using a Peer-to-Peer Model. In Proceedings of the 1st Workshop on Cloud Computing and its Applications (CCA’08), Chicago, USA, October 2008.
[99] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Harlan Yu. OpenDHT: A Public DHT Service and Its Uses. In Roch Guérin, Ramesh Govindan, and Greg Minshall, editors, Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’05), pages 73–84. ACM, August 2005.
[100] Philip Bernstein and Nathan Goodman. An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases. ACM Transactions on Database Systems, 9(4):596–615, 1984.
[101] Bruce Lindsay, Patricia Selinger, Cesare Galtieri, Jim Gray, Raymond Lorie, T. G. Price, Franco Putzolu, and Bradford Wade. Notes on Distributed Databases. Technical report, International Business Machines (IBM), San Jose, Research Laboratory (CA), July 1979.
[102] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-to-Peer Middleware Platform for QoS and Soft Real-Time Computing. Technical Report DCC-2008-02, Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, April 2008. Available at
[103] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-To-Peer Middleware Platform for Fault-Tolerant, QoS, Real-Time Computing. In Proceedings of the 2nd Workshop on Middleware-Application Interaction, part of DisCoTec 2008, pages 1–6, New York, NY, USA, June 2008. ACM.
[104] Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. On the Impact of Fault-Tolerance Mechanisms in a Peer-to-Peer Middleware with QoS Constraints. Technical Report DCC-2010-02, Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, April 2010. Available at
[105] Aniruddha Gokhale, Balachandran Natarajan, Douglas Schmidt, and Joseph Cross. Towards Real-Time Fault-Tolerant CORBA Middleware. Cluster Computing, 7(4):331–346, September 2004.
[106] Michi Henning. A New Approach to Object-Oriented Middleware. IEEE Internet Computing, 8(1):66–75, January 2004.
[107] Daniel Nurmi, Richard Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-Source Cloud-Computing System. In Franck Cappello, Cho-Li Wang, and Rajkumar Buyya, editors, Proceedings of the 9th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid’09), pages 124–131. IEEE Computer Society, May 2009.
[108] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. KVM: the Linux Virtual Machine Monitor. In Proceedings of the 9th Ottawa Linux Symposium (OLS’07), June 2007.
[109] Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Ian Pratt, Andrew Warfield, Paul Barham, and Rolf Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’03), October 2003.
[110] Canonical Ltd. JeOS and “vmbuilder”. serverguide/C/jeos-and-vmbuilder.html. [Online; accessed 17-October-2011].
[111] Douglas Schmidt. An Architectural Overview of the ACE Framework. ;login: the USENIX Association newsletter, 24(1), January 1999.
[112] Francisco Curbera, Matthew Duftler, Rania Khalaf, William Nagy, Nirmal Mukhi, and Sanjiva Weerawarana. Unraveling the Web Services Web: An Introduction to SOAP, WSDL, and UDDI. IEEE Distributed Systems Online, 3(4), 2002.
[113] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the ACM Special Interest Group on Data Communication Conference (SIGCOMM’01), volume 31, 4 of Computer Communication Review, pages 149–160. ACM Press, August 2001.
[114] Greg Lavender and Douglas Schmidt. Active Object: an Object Behavioral Pattern for Concurrent Programming. In Proceedings of the 2nd Conference on Pattern Languages of Programs (PLoP’95), September 1995.
[115] Linux kernel 2.6.39. Real-Time Group Scheduling. doc/Documentation/scheduler/sched-rt-group.txt, 2009. [Online; accessed 17-October-2011].
[116] Yuan Xu. A Study of Scalability and Performance of Solaris Zones, April 2007.
[117] Dario Faggioli, Michael Trimarchi, and Fabio Checconi. An Implementation of the Earliest Deadline First Algorithm in Linux. In Sung Shin and Sascha Ossowski, editors, Proceedings of the 24th ACM Symposium on Applied Computing (SAC’09), pages 1984–1989. ACM, March 2009.
[118] Nicola Manica, Luca Abeni, and Luigi Palopoli. Reservation-Based Interrupt Scheduling. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’10), pages 46–55. IEEE Computer Society, April 2010.
[119] Shinpei Kato, Yutaka Ishikawa, and Ragunathan Rajkumar. CPU Scheduling and Memory Management for Interactive Real-Time Applications. Real-Time Systems, pages 1–35, 2011.
[120] Michael Stonebraker and Greg Kemnitz. The POSTGRES Next Generation Database Management System. Communications of the ACM, 34:78–92, October 1991.
[121] Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. StreamCloud: A Large Scale Data Streaming System. In Proceedings of the IEEE 30th International Conference on Distributed Computing Systems (ICDCS’10), pages 126–137, Washington, DC, USA, June 2010. IEEE Computer Society.
[122] Levent Gurgen, Claudia Roncancio, Cyril Labbé, André Bottaro, and Vincent Olive. SStreaMWare: a Service Oriented Middleware for Heterogeneous Sensor Data Management. In Proceedings of the 5th International Conference on Pervasive Services (ICPS’08), pages 121–130, New York, NY, USA, July 2008. ACM.
[123] Adrian Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh Gupta, Allan Snavely, and Steven Swanson. Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing. In Proceedings of the 23rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10), pages 1–11, Washington, DC, USA, November 2010. IEEE Computer Society.
[124] Maxweel Carmo, Bruno Carvalho, Jorge Sá Silva, Edmundo Monteiro, Paulo Simões, Marília Curado, and Fernando Boavida. NSIS-Based Quality of Service and Resource Allocation in Ethernet Networks. In Torsten Braun, Georg Carle, Sonia Fahmy, and Yevgeni Koucheryavy, editors, Proceedings of the 4th International Conference on Wired/Wireless Internet Communications (WWIC’06), volume 3970 of Lecture Notes in Computer Science, pages 132–142. Springer, 2006.
[125] Jeff Bonwick. The Slab Allocator: An Object-Caching Kernel Memory Allocator. In USENIX Summer, pages 87–98, 1994.
[126] Christoph Lameter. The SLUB Allocator. Articles/229096/, March 2007. [Online; accessed 17-October-2011].
[127] Dinakar Guniguntala, Paul McKenney, Josh Triplett, and Jonathan Walpole. The Read-Copy-Update Mechanism for Supporting Real-Time Applications on Shared-Memory Multiprocessor Systems with Linux. IBM Systems Journal, 47:221–236, April 2008.
[128] Steven Rostedt. RCU Preemption Priority Boosting. http://lwn.net/Articles/252837/, October 2007. [Online; accessed 17-October-2011].
[129] Claudio Basile, Keith Whisnant, Zbigniew Kalbarczyk, and Ravishankar Iyer. Loose Synchronization of Multithreaded Replicas. In Proceedings of the 21st International Symposium on Reliable Distributed Systems (SRDS’02), pages 250–255, October 2002.
[130] Claudio Basile, Zbigniew Kalbarczyk, and Ravishankar Iyer. A Preemptive Deterministic Scheduling Algorithm for Multithreaded Replicas. In Proceedings of the 33rd International Conference on Dependable Systems and Networks (DSN’03), pages 149–158, June 2003.
[131] Guang Tan, Stephen Jarvis, and Daniel Spooner. Improving the Fault Resilience of Overlay Multicast for Media Streaming. IEEE Transactions on Parallel and Distributed Systems, 18(6):721–734, June 2007.
[132] Irena Trajkovska, Rodriguez Salvachua, and Alberto Velasco. A Novel P2P and Cloud Computing Hybrid Architecture for Multimedia Streaming with QoS Cost Functions. In Proceedings of the International Conference on Multimedia (MM’10), pages 1227–1230, New York, NY, USA, October 2010. ACM.
[133] Thomas Wiegand, Gary Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[134] Fred Kuhns, Douglas Schmidt, and David Levine. The Design and Performance of a Real-Time I/O Subsystem. In Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium (RTAS’99), pages 154–163, June 1999.
[135] Real-Time Preempt Linux Kernel Patch. pub/linux/kernel/projects/rt/. [Online; accessed 17-October-2011].
[136] Moving Interrupts to Threads. [Online; accessed 17-October-2011].
[137] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine Fault Tolerance. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP’07), pages 45–58, New York, NY, USA, 2007. ACM.
[138] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI’09), pages 153–168, Berkeley, CA, USA, 2009. USENIX Association.
[139] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers Checkpointing and Live Migration. In Proceedings of the 10th Annual Linux Symposium (OLS’08), July 2008.
[140] Oren Laadan and Serge Hallyn. Linux-CR: Transparent Application Checkpoint-Restart in Linux. In Proceedings of the 12th Ottawa Linux Symposium (OLS’10), July 2010.