Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
2012
Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Thesis submitted to the Faculdade de Ciências da Universidade do Porto to obtain the degree of Doctor in Computer Science

Advisors: Prof. Fernando Silva and Prof. Luís Lopes

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
May 2012
To my wife Liliana, for her endless love, support, and encouragement.
Acknowledgments

–Imagination is everything. It is the preview of life's coming attractions.
Albert Einstein

To my soul-mate Liliana, for her endless support in the best and worst of times. Her unconditional love and support helped me to overcome the most daunting adversities and challenges.

I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva and Paulo Paixão, for their vision and support that allowed me to pursue this Ph.D.

I would like to thank the financial support from EFACEC, Sistemas de Engenharia, S.A. and FCT - Fundação para a Ciência e a Tecnologia, through Ph.D. grant SFRH/BDE/15644/2006.

I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva, for their endless effort and teaching over the past four years. Luís, thank you for steering me when my mind entered a code frenzy, and for teaching me how to put my thoughts into words. Fernando, your keen eye is always able to understand the "big picture"; this was vital to detect and prevent the pitfalls of building large and complex middleware systems. To both, I thank you for opening the door of CRACS to me. I had an incredible time working with you.

A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor. She opened the door of CMU to me and helped to shape my work at crucial stages. Priya, I had a fantastic time brainstorming with you; each time I managed to learn something new and exciting. Thank you for sharing with me your insights on MEAD's architecture, and your knowledge of fault-tolerance and real-time.

Luís, Fernando and Priya, I hope someday to be able to repay your generosity and friendship. It is inspirational to see your passion for your work, and your continuous effort in helping others.

I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and functionality of MapReduce, and Professor Alysson Bessani, for his thoughts on my work and for his insights on Byzantine failures and consensus protocols.

I also would like to thank the CRACS members, Professors Ricardo Rocha, Eduardo Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work. A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.
Abstract

–All is worthwhile if the soul is not small.
Fernando Pessoa

The development and management of large-scale information systems, such as high-speed transportation networks, are pushing the limits of the current state-of-the-art in middleware frameworks. These systems are not only subject to hardware failures, but also impose stringent constraints on the software used for management and therefore on the underlying middleware framework. In particular, fulfilling the Quality-of-Service (QoS) demands of services in such systems requires simultaneous run-time support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that remains a challenge for current middleware frameworks. Fault-tolerance support is usually introduced in the form of expensive high-level services arranged in a client-server architecture. This approach is inadequate if one wishes to support real-time tasks, due to the expensive cross-layer communication and resource consumption involved.

In this thesis we design and implement Stheno, a general purpose P2P middleware architecture. Stheno innovates by integrating both FT and soft-RT in the architecture, by: (a) implementing FT support at a much lower level in the middleware, on top of a suitable network abstraction; (b) using the peer-to-peer mesh services to support FT; (c) supporting real-time services through a QoS daemon that manages the underlying kernel-level resource reservation infrastructure (CPU time); while simultaneously (d) providing support for multi-core computing and traffic demultiplexing. Stheno is able to minimize the resource consumption and latencies of the FT mechanisms and allows RT services to perform within QoS limits.

Stheno has a service-oriented architecture that does not limit the type of service that can be deployed in the middleware. In contrast, current middleware systems do not provide a flexible service framework, as their architecture is normally designed to support a specific application domain, for example, the Remote Procedure Call (RPC) service. Stheno is able to transparently deploy a new service within the infrastructure without user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable node to deploy the service with the specified QoS limits.

We thoroughly evaluate Stheno, namely the major overlay mechanisms, such as membership, discovery and service deployment, and the impact of FT on RT, with and without resource reservation, and compare it with other closely related middleware frameworks. Results showed that Stheno is able to sustain RT performance while simultaneously providing FT support. The performance of the resource reservation infrastructure enabled Stheno to maintain this behavior even under heavy load.
Acronyms

API    Application Programming Interface
BFT    Byzantine Fault-Tolerance
CCM    CORBA Component Model
CID    Cell Identifier
CORBA  Common Object Request Broker Architecture
COTS   Commercial Off-The-Shelf
DBMS   Database Management Systems
DDS    Data Distribution Service
DHT    Distributed Hash Table
DOC    Distributed Object Computing
DRE    Distributed Real-Time and Embedded
DSMS   Data Stream Management Systems
EDF    Earliest Deadline First
EM/EC  Execution Model/Execution Context
FT     Fault-Tolerance
IDL    Interface Description Language
IID    Instance Identifier
IPC    Inter-Process Communication
IaaS   Infrastructure as a Service
J2SE   Java 2 Standard Edition
JMS    Java Message Service
JRTS   Java Real-Time System
JVM    Java Virtual Machine
JeOS   Just Enough Operating System
KVM    Kernel Virtual-Machine
LFU    Least Frequently Used
LRU    Least Recently Used
LwCCM  Lightweight CORBA Component Model
MOM    Message-Oriented Middleware
NSIS   Next Steps in Signaling
OID    Object Identifier
OMA    Object Management Architecture
OS     Operating System
PID    Peer Identifier
POSIX  Portable Operating System Interface
PoL    Place of Launch
QoS    Quality-of-Service
RGID   Replication Group Identifier
RMI    Remote Method Invocation
RPC    Remote Procedure Call
RSVP   Resource Reservation Protocol
RTSJ   Real-Time Specification for Java
RT     Real-Time
SAP    Service Access Point
SID    Service Identifier
SLA    Service Level Agreement
SSD    Solid State Disk
TDMA   Time Division Multiple Access
TSS    Thread-Specific Storage
UUID   Universally Unique Identifier
VM     Virtual Machine
VoD    Video on Demand
Contents

Acknowledgments
Abstract
Acronyms
List of Tables
List of Figures
List of Algorithms
List of Listings

1 Introduction
  1.1 Motivation
  1.2 Challenges and Opportunities
  1.3 Problem Definition
  1.4 Assumptions and Non-Goals
  1.5 Contributions
  1.6 Thesis Outline

2 Overview of Related Work
  2.1 Overview
  2.2 RT+FT Middleware Systems
    2.2.1 Special Purpose RT+FT Systems
    2.2.2 CORBA-based Real-Time Fault-Tolerant Systems
  2.3 P2P+RT Middleware Systems
    2.3.1 Streaming
    2.3.2 QoS-Aware P2P
  2.4 P2P+FT Middleware Systems
    2.4.1 Publish-subscribe
    2.4.2 Resource Computing
    2.4.3 Storage
  2.5 P2P+RT+FT Middleware Systems
  2.6 A Closer Look at TAO, MEAD and ICE
    2.6.1 TAO
    2.6.2 MEAD
    2.6.3 ICE
  2.7 Summary

3 Architecture
  3.1 Stheno's System Architecture
    3.1.1 Application and Services
    3.1.2 Core
    3.1.3 P2P Overlay and FT Configuration
    3.1.4 Support Framework
    3.1.5 Operating System Interface
  3.2 Programming Model
    3.2.1 Runtime Interface
    3.2.2 Overlay Interface
    3.2.3 Core Interface
  3.3 Fundamental Runtime Operations
    3.3.1 Runtime Creation and Bootstrapping
    3.3.2 Service Infrastructure
    3.3.3 Client Mechanisms
  3.4 Summary

4 Implementation
  4.1 Overlay Implementation
    4.1.1 Overlay Bootstrap
    4.1.2 Mesh Service
    4.1.3 Discovery Service
    4.1.4 Fault-Tolerance Service
  4.2 Implementation of Services
    4.2.1 Remote Procedure Call
    4.2.2 Actuator
    4.2.3 Streaming
  4.3 Support for Multi-Core Computing
    4.3.1 Object-Based Interactions
    4.3.2 CPU Partitioning
    4.3.3 Threading Strategies
    4.3.4 An Execution Model for Multi-Core Computing
  4.4 Runtime Bootstrap Parameters
  4.5 Summary

5 Evaluation
  5.1 Evaluation Setup
    5.1.1 Physical Infrastructure
    5.1.2 Overlay Setup
  5.2 Benchmarks
    5.2.1 Overlay Benchmarks
    5.2.2 Services Benchmarks
    5.2.3 Load Generator
  5.3 Overlay Evaluation
    5.3.1 Membership Performance
    5.3.2 Query Performance
    5.3.3 Service Deployment Performance
  5.4 Services Evaluation
    5.4.1 Impact of Fault-Tolerance Mechanisms in Service Latency
    5.4.2 Real-Time and Resource Reservation Evaluation
  5.5 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
  6.3 Personal Notes

References
List of Tables

4.1 Runtime and overlay parameters.
List of Figures

1.1 Oporto's light-train network.
2.1 Middleware system classes.
2.2 TAO's architectural layout.
2.3 FLARe's architectural layout.
2.4 MEAD's architectural layout.
3.1 Stheno overview.
3.2 Application Layer.
3.3 Stheno's organization overview.
3.4 Core Layer.
3.5 QoS Infrastructure.
3.6 Overlay Layer.
3.7 Examples of mesh topologies.
3.8 Querying in different topologies.
3.9 Support framework layer.
3.10 QoS daemon resource distribution layout.
3.11 End-to-end network reservation.
3.12 Operating system interface.
3.13 Interactions between layers.
3.14 Multiple processes runtime usage.
3.15 Creating and bootstrapping of a runtime.
3.16 Local service creation.
3.17 Finding a suitable deployment site.
3.18 Remote service creation without fault-tolerance.
3.19 Remote service creation with fault-tolerance: primary-node side.
3.20 Remote service creation with fault-tolerance: replica creation.
3.21 Client creation and bootstrap sequence.
4.1 The peer-to-peer overlay architecture.
4.2 The overlay bootstrap.
4.3 The cell overview.
4.4 The initial binding process for a new peer.
4.5 The final join process for a new peer.
4.6 Overview of the cell group communications.
4.7 Cell discovery and management entities.
4.8 Failure handling for non-coordinator (left) and coordinator (right) peers.
4.9 Cell failure (left) and subsequent mesh tree rebinding (right).
4.10 Discovery service implementation.
4.11 Fault-Tolerance service overview.
4.12 Creation of a replication group.
4.13 Replication group binding overview.
4.14 The addition of a new replica to the replication group.
4.15 The control and data communication groups.
4.16 Semi-active replication protocol layout.
4.17 Recovery process within a replication group.
4.18 RPC service layout.
4.19 RPC invocation types.
4.20 RPC service architecture without (left) and with (right) semi-active FT.
4.21 RPC service with passive replication.
4.22 Actuator service layout.
4.23 Actuator service overview.
4.24 Actuator fault-tolerance support.
4.25 Streaming service layout.
4.26 Streaming service architecture.
4.27 Streaming service with fault-tolerance support.
4.28 Object-to-Object interactions.
4.29 Examples of CPU Partitioning.
4.30 Object-to-Object interactions with different partitions.
4.31 Threading strategies.
4.32 End-to-End QoS propagation.
4.33 RPC service using CPU partitioning on a quad-core processor.
4.34 Invocation across two distinct partitions.
4.35 Execution Model Pattern.
4.36 RPC implementation using the EM/EC pattern.
5.1 Overlay evaluation setup.
5.2 Physical evaluation setup.
5.3 Overview of the overlay benchmarks.
5.4 Network organization for the service benchmarks.
5.5 Overlay bind (left) and rebind (right) performance.
5.6 Overlay query performance.
5.7 Overlay service deployment performance.
5.8 Service rebind time (left) and latency (right).
5.9 Rebind time and latency results with resource reservation.
5.10 Missed deadlines without (left) and with (right) resource reservation.
5.11 Invocation latency without (left) and with (right) resource reservation.
5.12 RPC invocation latency compared with reference middlewares (without fault-tolerance).
List of Algorithms

4.1 Overlay bootstrap algorithm
4.2 Mesh startup
4.3 Cell initialization
4.4 Cell group communications: receiving-end
4.5 Cell group communications: sending-end
4.6 Cell discovery
4.7 Cell fault handling
4.8 Cell fault handling (continuation)
4.9 Discovery service
4.10 Creation and joining within a replication group
4.11 Primary bootstrap within a replication group
4.12 Fault-Tolerance resource discovery mechanism
4.13 Replica startup
4.14 Replica request handling
4.15 Support for semi-active replication
4.16 Fault detection and recovery
4.17 A RPC object implementation
4.18 RPC service bootstrap
4.19 RPC service implementation
4.20 RPC client implementation
4.21 Semi-active replication implementation
4.22 Service's replication callback
4.23 Passive Fault-Tolerance implementation
4.24 Actuator service bootstrap
4.25 Actuator service implementation
4.26 Actuator client implementation
4.27 Stream service bootstrap
4.28 Stream service implementation
4.29 Stream client implementation
4.30 Joining an Execution Model
4.31 Execution Context stack management
4.32 Implementation of the EM/EC pattern in the RPC service
List of Listings

3.1 Overlay plugin and runtime bootstrap
3.2 Transparent service creation
3.3 Service creation with explicit and transparent deployments
3.4 Service creation with Fault-Tolerance support
3.5 Service client creation
4.1 A RPC IDL example
1 Introduction

–Most of the important things in the world have been accomplished by people who have kept trying when there seemed to be no hope at all.
Dale Carnegie

1.1 Motivation

The development and management of large-scale information systems is pushing the limits of the current state-of-the-art in middleware frameworks. At EFACEC, we have to handle a multitude of application domains, including: information systems used to manage public, high-speed transportation networks; automated power management systems that handle smart grids; and power supply systems that monitor power supply units through embedded sensors. Such systems typically transfer large amounts of streaming data; have erratic periods of extreme network activity; are subject to relatively common hardware failures, and for comparatively long periods; and require low jitter and fast response times for safety reasons, for example, in vehicle coordination.

(EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems engineering, namely in public transportation and energy systems, employs around 3000 people and has a turnover of almost 1000 million euro; it is established in more than 50 countries and exports almost half of its production, cf. http://www.efacec.com.)

Target Systems

The main motivation for this PhD thesis was the need to address the requirements of the public transportation solutions at EFACEC, more specifically, the light-train systems. One such system is deployed in Oporto's light-train network and is composed of 5 lines, 70 stations and approximately 200 sensors (partially illustrated in Figure 1.1). Each station is managed by a computational node, which we designate as a peer, that is responsible for managing all the local audio, video and display panels, and low-level sensors such as track sensors for detecting inbound and outbound trains.
The system supports three types of traffic: normal, for regular operations over the system, such as playing an audio message in a station through an audio codec; critical, medium-priority traffic comprised of urgent events, such as an equipment malfunction notification; and alarms, high-priority traffic that notifies critical events, such as low-level sensor events. Independently of the traffic type (e.g., event, RPC operation), the system requires that any operation must be completed within 2 seconds.

From the point of view of distributed architectures, the current deployments would be best matched with P2P infrastructures that are resilient, allow resources (e.g., a sensor connected through a serial link to a peer) to be seamlessly mapped to the logical topology, the mesh, and also provide support for real-time (RT) and fault-tolerant (FT) services. Support for both RT and FT is fundamental to meet system requirements. Moreover, the next generation of light-train solutions requires deployments across cities and regions that can be overwhelmingly large. This introduces the need for a scalable hierarchical abstraction, the cell, which is composed of several peers that cooperate to maintain a portion of the mesh.

[Figure 1.1: Oporto's light-train network.]
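To make these requirements concrete, the sketch below encodes the three traffic classes and the 2-second completion bound as they might appear in application code. It is purely illustrative: the type and function names are hypothetical, not part of Stheno's API.

#include <chrono>
#include <cstdint>

// The three traffic classes described above, in increasing order of priority.
enum class TrafficClass : std::uint8_t {
    Normal,    // regular operations, e.g., playing an audio message in a station
    Critical,  // urgent events, e.g., an equipment malfunction notification
    Alarm      // critical events, e.g., low-level track sensor triggers
};

// Any operation, regardless of its traffic class, must complete within 2 seconds.
constexpr std::chrono::milliseconds kOperationDeadline{2000};

// True if an operation that started at 'start' has exceeded the deadline.
inline bool missedDeadline(std::chrono::steady_clock::time_point start) {
    return std::chrono::steady_clock::now() - start > kOperationDeadline;
}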
1.2 Challenges and Opportunities

The requirements of our target systems pose a significant number of challenges. The presence of FT mechanisms, especially those using space redundancy [1], introduces the need for multiple copies of the same resource (replicas), and these, in turn, ultimately lead to greater resource consumption.

FT also introduces overhead in the form of latency, another important constraint when dealing with RT systems. When an operation is performed, irrespective of whether it is real-time or not, any state change that it causes must be propagated among the replicas through a replication algorithm, which introduces an additional source of latency. Furthermore, the recovery time, that is, the time the system needs to recover from a fault, is an additional source of latency for real-time operations. There are well-known replication styles that offer different trade-offs between state consistency and latency.

Our target systems have different traffic types with distinct deadline requirements that must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet networking) and software (e.g., Linux). This requires that the RT mechanisms leverage the available resources, through resource reservation, while providing different threading strategies that allow different trade-offs between latency and throughput.

To overcome the overhead introduced by the FT mechanisms, it must be possible to employ a replication algorithm that does not compromise the RT requirements. Replication algorithms that offer a higher degree of consistency introduce a higher level of latency [1, 2] that may be prohibitive for certain traffic types. On the other hand, certain replication algorithms exhibit lower resource consumption and latency at the expense of a longer recovery time, which may also be prohibitive.

Considering current state-of-the-art research, we see many opportunities to address these challenges. One is the use of a COTS operating system, which allows for a faster implementation time, and thus a smaller development cost, while offering the necessary infrastructure on which to build a new middleware system.

P2P networks can be used to provide a resilient infrastructure that mirrors the physical deployments of our target systems; furthermore, different P2P topologies offer different trade-offs between self-healing, resource consumption and latency in end-to-end operations. Moreover, by directly implementing FT on the P2P infrastructure we hope to lower resource usage and latency enough to allow the integration of RT. By using proven replication algorithms [1, 2] that offer well-known trade-offs regarding consistency, resource consumption and latency, we can focus on the actual problem of integrating real-time and fault-tolerance within a P2P infrastructure.

On the other hand, RT support can be achieved through the implementation of different threading strategies, through resource reservation (using Linux's Control Groups), and by avoiding traffic multiplexing through the use of different access points to handle different traffic priorities. While the use of Earliest Deadline First (EDF) scheduling would provide greater RT guarantees, this goal was not pursued due to the lack of maturity of the current EDF implementations in Linux (our reference COTS operating system). Because we are limited to priority-based scheduling and resource reservation, we can only partially support our goal of providing end-to-end guarantees; more specifically, we enhance our RT guarantees through the use of RT scheduling policies with over-provisioning to ensure that deadlines are met.
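As a minimal illustration of the two mechanisms just mentioned, the Linux-only sketch below gives a thread a fixed real-time priority and writes a CPU quota into a Control Groups directory. The cgroup mount point and group name are assumptions for the example (cgroup v1 'cpu' controller shown); creating the group and running with the required privileges (e.g., CAP_SYS_NICE for SCHED_FIFO) are left to the caller.

#include <fstream>
#include <string>
#include <sched.h>

// Give the calling thread a fixed real-time priority under SCHED_FIFO.
// On Linux, valid priorities range from 1 (lowest) to 99 (highest).
bool setRealtimePriority(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}

// Reserve a CPU share through the cgroup v1 'cpu' controller: the group may
// run for 'quota_us' microseconds in every 'period_us' window, e.g., 20000
// out of 100000 reserves 20% of one CPU for the group's threads.
bool reserveCpuShare(const std::string& group, long quota_us, long period_us) {
    const std::string base = "/sys/fs/cgroup/cpu/" + group + "/";  // assumed mount point
    std::ofstream period(base + "cpu.cfs_period_us");
    std::ofstream quota(base + "cpu.cfs_quota_us");
    if (!period || !quota) return false;
    period << period_us << std::flush;
    quota << quota_us << std::flush;
    return period.good() && quota.good();
}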
1.3 Problem Definition

The work presented in this thesis focuses on the integration of Real-Time (RT) and Fault-Tolerance (FT) in a scalable general purpose middleware system. This goal can only be achieved if the following premises hold: (a) the FT infrastructure cannot interfere with RT behavior, independently of the replication policy; (b) the network model must be able to scale; and (c) ultimately, the FT mechanisms need to be efficient and aware of the underlying infrastructure, i.e., the network model, operating system and physical environment.

Our problem definition is a direct consequence of the requirements of our target systems, and it can be summarized with the following question: "Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?"

In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms in a middleware is fundamental for its successful integration with soft real-time support. Our approach is novel in that it explores peer-to-peer networking as a means to implement generic, transparent, lightweight fault-tolerance support. We do this by directly embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of their scalable, decentralized and resilient nature. For example, peer-to-peer networks readily provide the functionality required to maintain and locate redundant copies of resources. Given their dynamic and adaptive nature, they are promising infrastructures for developing lightweight fault-tolerant and soft real-time middleware.

Despite these a priori advantages, mainstream generic peer-to-peer middleware systems for QoS computing are, to our knowledge, unavailable. Motivated by this state of affairs, by the limitations of the current infrastructure for the information system we are managing at EFACEC (based on CORBA technology) and, last but not least, by the comparative advantages of flexible peer-to-peer network architectures, we have designed and implemented a prototype service-oriented peer-to-peer middleware framework.

The networking layer relies on a modular infrastructure that can handle multiple peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided
at this level through the implementation of efficient and resilient services for, e.g., resource discovery, messaging and routing. The kernel of the middleware system (the runtime) is implemented on top of these overlays and uses the above-mentioned peer-to-peer functionalities to provide developers with APIs for the customization of QoS policies for services (e.g., bandwidth reservation, CPU/core reservation, scheduling strategy, number of replicas). This approach was inspired by that of TAO [3], which allows distinct strategies for the execution of tasks by threads to be defined.

1.4 Assumptions and Non-Goals

The distributed model used in this thesis is based on a partially asynchronous computing model, as defined in [2], extended with fault detectors.

The services and P2P plugin implemented in this thesis only support crash failures. We consider a crash failure [1] to be characterized as a complete shutdown of a computing instance in the event of a failure, ceasing to interact any further with the remaining entities of the distributed system.

Timing faults are handled differently by services and by the P2P plugin. In our service implementations a timing fault is logged (for analysis) with no other action being performed, whereas in the P2P layer we treat a timing fault as a crash failure, i.e., if the remote creation of a service exceeds its deadline, the peer is considered crashed. This method is also known as process controlled crash, or crash control, as defined in [4]. In this thesis, we adopted a more relaxed version: a peer wrongly suspected of having crashed does not get killed, nor does it commit suicide; instead it gets shunned, that is, the peer is expelled from the overlay and forced to rejoin it; more precisely, it must rebind using the membership service in the P2P layer.

The fault model used was motivated by the author's experience with several field deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife Light Rail solutions [5]. Due to the use of highly redundant hardware solutions, such as redundant power supplies and redundant 10-Gbit network ring links, network failures tend to be short. The most common cause of downtime is software bugs, which mostly result in a crashed computing node. While simultaneous failures can happen, they are considered rare events.

We also assume that the resource-reservation mechanisms are always available.
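The following sketch illustrates the shunning policy described above; all types and names are hypothetical, not Stheno's actual interfaces. A remote operation that misses its deadline causes the issuing side to treat the remote peer as crashed and expel it from the overlay; the expelled peer is not killed and may later rebind through the membership service.

#include <chrono>
#include <functional>

using Clock = std::chrono::steady_clock;

struct Peer {
    bool shunned = false;  // expelled from the overlay; must rebind to rejoin
};

// Run a remote operation (e.g., remote service creation) under a deadline.
// A timing fault at this layer is handled exactly like a crash failure.
template <typename Op>
bool invokeWithDeadline(Peer& peer, Op op, std::chrono::milliseconds deadline,
                        const std::function<void(Peer&)>& expelFromOverlay) {
    const auto start = Clock::now();
    const bool ok = op();
    if (!ok || Clock::now() - start > deadline) {
        peer.shunned = true;     // suspected peer is shunned, not killed
        expelFromOverlay(peer);  // it must rejoin via the membership service
        return false;
    }
    return true;
}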
In this thesis we do not address value faults and Byzantine faults, as they are not a requirement for our target systems. Furthermore, we do not provide a formal specification and verification of the system. While this would be beneficial to assess system correctness, we had to limit the scope of this thesis. Nevertheless, we provide an empirical evaluation of the system.

We also do not address hard real-time, because of the lack of mature support for EDF scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized implementation, but only a proof-of-concept to validate our approach. Testing the system in a production environment is left for future work.

1.5 Contributions

Before undertaking the task of building an entirely new middleware system from scratch, we explored current solutions, presented in Chapter 2, to see if any of them could support the requirements of our target system. As we did not find any suitable solution, we then assessed whether it was possible to extend an available solution to meet those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7] within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time, fault-tolerance and P2P requires fine-grained control of resources that is not possible with the use of "black-box" solutions; for example, it is impossible to have out-of-the-box support for resource reservation in JGroups.

Given these assessments, we have designed and implemented Stheno, which to the best of our knowledge is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems.

To this end, a complete architectural design is proposed that addresses all levels of the software stack, including kernel space, network, runtime and services, to achieve a seamless integration. The list of contributions includes: (a) a full specification of a user Application Programming Interface (API); (b) a pluggable P2P network infrastructure, aiming to better adjust to the target application; (c) support for configurable FT in the P2P layer, with the goal of providing lightweight FT mechanisms that fully enable RT behavior; and (d) integration of resource reservation at all levels of the runtime, enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.

Previous work [8, 9, 10] on resource reservation focused uniquely on CPU provisioning for real-time systems. In this thesis we present Euryale, a QoS network-oriented
framework that features resource reservation with support for a broader range of subsystems, including CPU, memory, I/O and network bandwidth, on a general purpose operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS daemon that handles the admission and management of QoS requests.

Current well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12] and Thread-per-Request [13], offer well-known trade-offs between latency and resource usage [3, 14]. However, they do not support resource reservation, namely CPU partitioning. To overcome this limitation, this thesis provides an additional contribution with the introduction of a novel design pattern (Chapter 4) that is able to integrate multi-core computing with resource reservation within a configurable framework that supports these well-known threading strategies. For example, when a client connects to a service it can ask, through the QoS real-time parameters, for the particular threading strategy that best meets its requirements.

We present a full implementation that covers all the aforementioned architectural features, including a complete overlay implementation, inspired by the P3 [15] topology, that seamlessly integrates RT and FT.

To evaluate our implementation and justify our claims, we present a complete evaluation of both mechanisms. The impact of the resource reservation mechanism is also evaluated, as well as a comparative evaluation of RT performance against state-of-the-art middleware systems. The experimental results show that Stheno meets and exceeds target system requirements for end-to-end latency and fail-over latency.

1.6 Thesis Outline

The focus of this thesis is on the design, implementation and evaluation of a scalable general purpose middleware that provides the seamless integration of RT and FT. The remainder of this thesis is organized as follows.

Chapter 2: Overview of Related Work. This chapter presents an overview of related middleware systems that exhibit support for RT, FT and P2P, the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, in order to avoid creating a new middleware solution from scratch.

Chapter 3: Architecture.
Chapter 3 describes the runtime architecture of the proposed middleware. We start by providing detailed insight into the architecture, covering all layers present in the runtime. Special attention is given to the presentation of the QoS and resource reservation infrastructure. This is followed by an overview of the programming model that describes the most important interfaces present in the runtime, as well as the interactions that occur between them. The chapter ends with a description of the fundamental runtime operations, namely: the creation of services with and without FT support, the deployment strategy, and client creation.

Chapter 4: Implementation. Chapter 4 describes the implementation of a prototype based on the aforementioned architecture, and is divided into four parts. In the first part, we present a complete implementation of a P2P overlay that is inspired by the P3 [15] topology, while providing some insight into the limitations of the current prototype. The second part of this chapter focuses on the implementation of three types of user services, namely Remote Procedure Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in Chapter 5. In the third part, we describe our support for multi-core computing, through the presentation of a novel design pattern, the Execution Model/Context. This design pattern is able to integrate resource reservation, especially CPU partitioning, with different well-known (and configurable) threading strategies. The fourth and final part of this chapter describes the most relevant parameters used in the bootstrap of the runtime.

Chapter 5: Evaluation. The experimental results are presented in this chapter. It starts by providing details of the physical setup used throughout the evaluation. It then describes the parameters used in the testbed suite, which is composed of the three services previously described in Chapter 4. We then focus on presenting the results of the benchmarks, including an assessment of the impact of FT on RT, and of the impact of the resource reservation infrastructure on overall performance. The chapter ends with a comparative evaluation against well-known middleware systems.

Chapter 6: Conclusions and Future Work. This last chapter presents the concluding remarks. It highlights the contributions of the proposed and implemented middleware, and provides directions for future work.
2 Overview of Related Work

–By failing to prepare, you are preparing to fail.
Benjamin Franklin

2.1 Overview

This chapter presents an overview of the state-of-the-art in related middleware systems. As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, and thus avoid the creation of a new middleware solution from the ground up. For that reason, we have focused on the intersecting domains, namely RT+FT, RT+P2P and FT+P2P, since the systems contained in these domains are closest to meeting the requirements of our target system.

From a historical perspective, the origins of modern middleware systems can be traced back to the 1980s, with the introduction of the concept of ubiquitous computing, in which computational resources are accessible and seen as ordinary commodities such as electricity or tap water [2]. Furthermore, the interaction between these resources and the users was governed by the client-server model [16] and a supporting protocol called RPC [17]. The client-server model is still the most prevalent paradigm in current distributed systems.

An important architecture for client-server systems was introduced with the Common Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not address real-time or fault-tolerance. Only recently were both the real-time and fault-tolerance specifications finalized, but they remained mutually exclusive. This means that a system supporting the real-time specification will not be able to support the fault-tolerance specification, and vice-versa.
[Figure 2.1: Middleware system classes: RT, FT and P2P and their intersections, with example systems for each class; Stheno sits at the RT+FT+P2P intersection.]

Nevertheless, seminal work has already addressed these limitations and offered systems supporting both features, namely TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as a Java alternative capable of providing a more flexible and easy-to-use environment.

In recent years, CORBA entered a steady decline [20] in favor of web-oriented platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The web-oriented platforms, such as the JBoss [24] application server, aim to integrate availability with scalability, but they remain unable to support real-time. Moreover, while partitioning offers a clean approach to improving scalability, it fails to support large-scale distributed systems [2]. Alternatively, P2P systems focused on providing logical organizations, i.e., meshes, that abstract the underlying physical deployment while providing a decentralized architecture for increased resiliency. These systems focused initially on resilient distributed storage solutions, such as Dynamo [25], but progressively evolved to support soft real-time systems, such as video streaming [26].

More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed message-passing infrastructure based on an asynchronous interaction model that is able to suppress the scaling issues present in RPC. A considerable number of implementations exist, including Tibco [28], Websphere MQ [29] and the Java Message Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere Application Server.

A substantial body of research has focused on the integration of real-time within
CORBA-based middleware, such as TAO [3] (which later addressed the integration of fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34] and OpenSplice [35], appeared as a way to overcome the current lack of support for real-time applications in SOA-based middleware systems.

The introduction of fault-tolerance in middleware systems also remains an active topic of research. CORBA-based middleware systems were fertile ground for testing fault-tolerance techniques in a general purpose platform, resulting in the creation of the CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports scalability and availability through partitioning. Each partition is supported by a group communication framework based on the virtual synchrony model, more specifically, the JGroups [7] group communication framework.

2.2 RT+FT Middleware Systems

This section overviews systems that provide simultaneous support for real-time and fault-tolerance. These systems are divided into special purpose solutions, designed for specific application domains, and CORBA-based solutions, aimed at general purpose computing.

2.2.1 Special Purpose RT+FT Systems

Special purpose real-time fault-tolerant systems introduced concepts and implementation strategies that are still relevant in current state-of-the-art middleware systems.

Armada

Armada [37] focused on providing middleware services and a communication infrastructure to support FT and RT semantics for distributed real-time systems. This was pursued in two ways, which we now describe.

The first contribution was the introduction of a communication infrastructure able to provide end-to-end QoS guarantees for both unicast and multicast primitives. This was supported by control signaling and a QoS-sensitive data transfer (as in the newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).
The network infrastructure used a reservation mechanism based on an EDF scheduling policy that was built on top of the Mach OS priority-based scheduling. The initial implementation was done at the user level but was subsequently migrated to the kernel level with the goal of reducing latency.

Many of the architectural decisions regarding RT support were based on the operating systems available at the time, mainly Mach OS. Despite the advantages of a micro-kernel approach, its application remains restricted by the underlying cost associated with message passing and context switching. Instead, a large body of research has since been carried out on monolithic kernels, especially the Linux OS, which are able to offer the advantages of the micro-kernel approach, through the introduction of kernel modules, combined with the speed of monolithic kernels.

The second contribution came in the form of a group communication infrastructure based on a ring topology that ensured the delivery of messages in a reliable and totally ordered fashion within a bounded time. It also had support for membership management that offered consistent views of the group through the detection of process and communication failures. These group communication mechanisms enabled the support for FT through the use of a passive replication scheme that allowed for some inconsistencies between the primary and the replicas, where the states of the replicas could lag behind the state of the primary up to a bounded time window.
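Armada's actual admission test is not detailed here, but the classic utilization bound for EDF on a single resource conveys the idea behind such a reservation mechanism: a new reservation needing C time units every T units is admitted only while total utilization stays at or below 1. The sketch below is a textbook formulation, not Armada's code.

#include <vector>

struct Reservation {
    double cost;    // worst-case execution or transmission time, C
    double period;  // period T (deadline assumed equal to the period)
};

// EDF schedulability on one resource: admit the candidate only if
// the sum of C_i / T_i over all reservations does not exceed 1.
bool admitUnderEdf(const std::vector<Reservation>& admitted,
                   const Reservation& candidate) {
    double utilization = candidate.cost / candidate.period;
    for (const Reservation& r : admitted)
        utilization += r.cost / r.period;
    return utilization <= 1.0;
}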
Mars

Mars [38] provided support for the analysis and deployment of synchronous hard real-time systems through a static off-line scheduler for the CPU and a Time Division Multiple Access (TDMA) bus. Mars is able to offer FT support through the use of active redundancy on the TDMA bus, i.e. sending multiple copies of the same message, and self-checking mechanisms. Deterministic communication is achieved through the use of a time-triggered protocol.

The project focused on RT process control, where all the intervening entities are known in advance. It therefore offers no support for the dynamic admission of new components, nor does it support on-the-fly fault recovery.

ROAFTS

The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for distributed RT applications, consisting of a network of Time-triggered Message-triggered Objects [41] (TMOs), whose execution is managed by a TMO support manager. The FT infrastructure consists of a set of specialized TMOs that include: (a) a generic fault server; and (b) a network surveillance [42] manager. Fault detection is assured by the network surveillance TMO and used by the generic fault server to change the FT policy with the goal of preserving RT semantics. The system assumes that, under highly dynamic environments, RT applications can live with weaker reliability assurances from the middleware.

Maruti

Maruti [43] aimed to provide a development framework and an infrastructure for the deployment of hard real-time applications within a reactive environment, focusing on real-time requirements on a single-processor system. The reactive model is able to offer runtime decisions on the admission of new processing requests without producing adverse effects on the scheduling of existing requests. Fault-tolerance is achieved by redundant computation. A configuration language allows the deployment of replicated modules and services.

Delta-4

Delta-4 [44] provided an in-depth characterization of fault assumptions, for both the host and the network. It also demonstrated various techniques for handling them, namely, passive and active replication for fail-silent hosts and byzantine agreement for fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance Architecture (XPA) [45], which aimed to provide real-time support to the Delta-4 framework through the introduction of the Leader/Follower replication model (better known as semi-active replication) for fail-silent hosts. This work also led to the extension of the communication system with additional communication primitives (the original work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN, and AtLeastTo.
2.2.2 CORBA-based RT+FT Systems

The support for RT and FT in general-purpose distributed platforms remains mostly restricted to CORBA. While some effort was made by Sun to introduce RT support for Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46, 47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant implementations are Sun's Java Real-Time System (JRTS) [48] and IBM's WebSphere Real-Time VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted to provide support for RT in a J2EE environment. Nevertheless, this support seems to be confined to the introduction of a deterministic garbage collector, through the use of the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage collection [51].

Previous work on the integration of RT and FT in CORBA-context systems can be categorized into three distinct approaches: (a) integration, where the base ORB is modified; (b) services, where systems rely on high-level services to provide FT (and, indirectly, RT); and (c) interception, where systems intercept client requests to provide transparent FT and RT.

Integration Approach

Past work on the integration of fault-tolerance in CORBA-like systems was done in Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of the CORBA-FT standard [55, 36], and it focused on enhancing the Object Management Architecture (OMA) to support transparent and non-transparent fault-tolerance capabilities. Instead of using message queues or transaction monitors [56], it relied on object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of the Ensemble [59] group communication toolkit, which was used by Electra [52] in the Quality of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an efficient, extensible and non-disruptive integration of the object layers with the low-level QoS system properties. The AQuA [54] system uses both QuO and Maestro on top of the Ensemble communication groups to provide a flexible and modular approach that is able to adapt to faults and to changes in the application requirements. Within its framework, a QuO runtime accepts availability requests from the application and relays them to a dependability manager, which is responsible for coordinating the requests from multiple QuO runtimes.

TAO+QuO

The work done in [61] focused on the integration of QoS mechanisms for both CPU and network resources, supporting both priority- and reservation-based QoS semantics, with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The priority-based approach was built on top of the RT-CORBA specification, and it defined a set of standard features in order to provide end-to-end predictability for operations within a fixed-priority context [62]. CPU priority-based resource management is left to the scheduler of the underlying Operating System (OS), whereas network priority-based management is achieved through the use of the DiffServ architecture [63], by setting the DSCP codepoint in the IP header of GIOP requests. Based on various factors, the QuO runtime can dynamically change this priority to adjust to environment changes.
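To make the DiffServ mechanism concrete: marking the traffic of a connection amounts to setting the DSCP bits of the IP TOS field on the underlying socket. The sketch below shows the general technique on a POSIX socket; it illustrates the approach rather than reproducing TAO or QuO code, and the chosen codepoint is an assumption.

```cpp
#include <netinet/in.h>   // IPPROTO_IP, IP_TOS
#include <netinet/ip.h>
#include <sys/socket.h>
#include <cstdio>

// Mark an already-created TCP/UDP socket with a DiffServ codepoint.
// The DSCP occupies the top 6 bits of the former IPv4 TOS byte.
bool set_dscp(int fd, int dscp /* 0..63 */) {
    int tos = dscp << 2;  // shift DSCP into the TOS byte, ECN bits left at 0
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) != 0) {
        std::perror("setsockopt(IP_TOS)");
        return false;
    }
    return true;
}

// Usage (assumed codepoint): mark a connection carrying high-priority
// requests with Expedited Forwarding (DSCP 46) before issuing invocations.
// int fd = socket(AF_INET, SOCK_STREAM, 0);
// set_dscp(fd, 46);
```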
Alternatively, the network reservation-based approach relies on the RSVP [64] signaling protocol to guarantee the desired network bandwidth between hosts. The QuO runtime monitors the RSVP connections and makes adjustments to overcome abnormal conditions; for example, in a video service it can drop frames to maintain stability. CPU reservation is made using the reservation mechanisms present in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation policies. This was done to preserve the end-to-end QoS semantics that are only available at a higher level of the middleware.

CIAO+QuO

CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on top of TAO [3] that aims to alleviate the complexity of integrating real-time features in DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems, of which TAO is an example, offer configurable policies and mechanisms for QoS, namely real-time, but lack a programming model capable of separating systemic aspects from application logic. Furthermore, QoS provisioning must be done in an end-to-end fashion, and thus has to be applied to several interacting components. It is difficult, or nearly impossible, to properly configure a component without taking into account the QoS semantics of the interacting entities. Developers using standard DOC middleware systems are thus prone to produce misconfigurations that cause overall system misbehavior. CIAO overcomes these limitations by applying a wide range of aspect-oriented development techniques that support the composition of real-time semantics without intertwining configuration concerns. The support for CIAO's CCM architecture was done in CORFU [66] and is described below.

Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67]. The integration of QuO's infrastructure into CIAO enhanced its limited static QoS provisioning into a total provisioning middleware that is also able to accommodate dynamic and adaptive QoS provisioning. For example, the setup of an RSVP [64] connection would otherwise require explicit configuration by the developer, defeating the purpose of CIAO. Nevertheless, while CIAO is able to compose QuO components, Qoskets [68], it does not provide a solution for component cross-cutting.
DynamicTAO

DynamicTAO [69] focused on providing a reflective middleware model that extends TAO to support on-the-fly dynamic reconfiguration of its component behavior and resource management through meta-interfaces. It allows the application to inspect the internal state/configuration and, if necessary, to reconfigure it in order to adapt to environment changes. Subsequently, it is possible to select networking protocols, encodings and security policies to improve overall system performance in the presence of unexpected events.

Service-based Approach

An alternative, high-level service approach to CORBA fault-tolerance was taken by the Distributed Object-Oriented Reliable Service (DOORS) [70], the Object Group Service (OGS) [71], and the Newtop Object Group Service [72]. DOORS focused on providing replica management, fault-detection and fault-recovery as a CORBA high-level service. It did not use group communication and it mainly focused on passive replication, but allowed the developer to select the desired level of reliability (number of replicas), the replication policy, the fault-detection mechanism, e.g. SNMP-enhanced fault-detection, and the recovery strategy. OGS improved over prior approaches by using a group communication protocol that imposes consensus semantics. Instead of adopting an integrated approach, group communication services are kept transparent to the ORB through request-level bridging. Newtop followed a similar approach to OGS but added support for network partitions, allowing newly formed sub-groups to continue to operate.

TAO

TAO [3] is a CORBA middleware with support for RT and FT that is compliant with the OMG's standards for CORBA-RT [73] and CORBA-FT [36]. The support for RT includes priority propagation, explicit binding, and RT thread pools. FT is supported through the use of a high-level service, the Replication Manager, which sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure, acting as a rendezvous for all the remaining components, more precisely: monitors that watch the status of the replicas, replica factories that allow the creation of new replicas, and fault notifiers that inform the manager of failed replicas. TAO's architecture is further detailed in Section 2.6.

FLARe and CORFU

FLARe [74] focuses on proactively adapting the replication group to underlying changes in resource availability. To minimize resource usage, it only supports passive replication [75]. Its implementation is based on TAO [3]. It adds four new components to the existing architecture: (a) a Replication Manager, a high-level service that decides on the strategy to be employed to address changes in resource availability and faults; (b) a client interceptor that redirects invocations to the active primary; (c) a redirection agent that receives updates from the Replication Manager and is used by the interceptor; and (d) a resource monitor that watches the load on nodes and periodically notifies the Replication Manager. In the presence of faulty conditions, such as the overload of a node, the Replication Manager adapts the replication group to the changing conditions by activating replicas on nodes with lower resource usage and, additionally, by changing the location of the primary to a more suitable placement.
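The placement decision at the core of this proactive strategy can be sketched compactly: given periodic load reports from the resource monitors, the manager picks the least-loaded node for a new replica (or a new primary). The C++ fragment below is a hedged illustration of the idea with invented names and an assumed load metric, not FLARe's actual algorithm.

```cpp
#include <algorithm>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical view kept by a replication manager: last reported CPU load
// per node, fed by periodic resource-monitor updates.
class LoadView {
    std::map<std::string, double> load_;  // node id -> load in [0, 1]
public:
    void report(const std::string& node, double load) { load_[node] = load; }

    // Pick the least-loaded node as the placement for a new replica
    // (or as the new primary when the current one is overloaded).
    std::string least_loaded() const {
        if (load_.empty()) throw std::runtime_error("no load reports yet");
        return std::min_element(load_.begin(), load_.end(),
                                [](const auto& a, const auto& b) {
                                    return a.second < b.second;
                                })->first;
    }
};

// Usage: replicas are (re)placed whenever a report pushes a node past a
// threshold (threshold value assumed).
// LoadView v; v.report("nodeA", 0.92); v.report("nodeB", 0.31);
// std::string target = v.least_loaded();  // "nodeB"
```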
CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides fail-stop behavior: when one component of a failover unit fails, all the remaining components are stopped, allowing for a clean switch to a new unit. This is achieved through a fault-mapping facility that allows the correspondence of an object failure to the respective plan(s), with the subsequent component shutdown.

DeCoRAM

The DeCoRAM system [77] aims to provide RT and FT properties through a resource-aware configuration, executed by a deployment infrastructure. The class of supported systems is confined to closed DRE systems, where the number of tasks and their respective execution and resource requirements are known a priori and remain invariant throughout the system's life-cycle. As the tasks and resources are static, it is possible to optimize the allocation of the replicas on the available nodes. The allocation algorithm is configurable, allowing a user to choose the best approach for a particular application domain. DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-Time, and Resource Awareness Reconciliation Intelligence) that addresses the optimization problem while satisfying both RT and FT system constraints. Because of the limited resources normally available on DRE systems, DeCoRAM only supports passive replication [75], thus avoiding the high overhead associated with active replication [78]. The allocation algorithm calculates the components' inter-dependencies and deploys the execution plan using the underlying middleware infrastructure, which is provided by FLARe [74].

Interception-based Approach

The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered multicast protocol. This approach relieved the developer from having to deal with low-level mechanisms for supporting fault-tolerance. In order to maintain compatibility with the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and fault notifier to developers. However, the main infrastructure components are located below the ORB for both efficiency and transparency purposes.
These components include logging-recovery mechanisms, replication mechanisms, and interceptors. The replication mechanisms provide support for warm and cold passive replication, as well as active replication. The interceptors capture the CORBA IIOP requests and replies (based on TCP/IP) and redirect them to the fault-tolerance infrastructure. The logging-recovery mechanisms are responsible for managing logging, checkpointing, and performing the recovery protocols.

MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way, enhancing distributed RT systems with (a) transparent, although tunable, FT that is (b) proactively dependable through (c) resource awareness, with (d) scalable and fast fault-detection and fault-recovery. It uses RT-CORBA, more specifically TAO, as a proof of concept. This work makes an important contribution by weighing fault-tolerance resource consumption against the provision of RT behavior. MEAD is detailed further in Section 2.6.

2.3 P2P+RT Middleware Systems

While most of the focus of P2P systems has been on the support of FT, there is a growing interest in using these systems for RT applications, namely in streaming and QoS support. This section provides an overview of P2P systems that support RT.

2.3.1 Streaming

Streaming, and especially Video on Demand (VoD), was a natural evolution of the first file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the Internet, it is now possible to offer high-quality multimedia streaming solutions to the end-user. These solutions focus on providing near soft real-time performance by splitting streams through the use of distributed P2P storage and redundant network channels.
PPTV

The work done in [26] provides the background for the analysis, design and behavior of VoD systems, focusing on the PPTV system [83]. An overview of the different replication strategies and their respective trade-offs is presented, namely Least Recently Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation based on the local cache completion and on the availability-to-demand ratio (ATD).

Each stream is divided into chunks. The size of these chunks has a direct influence on the efficiency of the streaming: smaller pieces facilitate replication, and thus overall system load-balancing, whereas bigger pieces decrease the resource overhead associated with piece management and the bandwidth consumed by protocol control. To allow for a more efficient piece selection, three algorithms are proposed: sequential, rarest-first and anchor-based. To ensure real-time behavior, the system offers different levels of aggressiveness, including: issuing simultaneous requests of the same type to neighboring peers; simultaneously sending different content requests to multiple peers; and requesting from a single peer (making a more conservative use of resources).
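As an illustration of one of these policies, the sketch below implements a bare-bones rarest-first selection: among the chunks still missing and still useful for playback, pick the one with the fewest known holders. It is a generic sketch of the policy named above, not PPTV code; the availability counts are assumed to come from neighbors' buffer maps.

```cpp
#include <climits>
#include <cstddef>
#include <optional>
#include <vector>

// availability[i] = number of neighboring peers advertising chunk i
// (e.g. learned from their buffer maps); have[i] = already downloaded.
std::optional<size_t> pick_rarest(const std::vector<int>& availability,
                                  const std::vector<bool>& have,
                                  size_t playback_pos) {
    std::optional<size_t> best;
    int best_count = INT_MAX;
    // Only chunks at or after the playback position are still useful.
    for (size_t i = playback_pos; i < availability.size(); ++i) {
        if (have[i] || availability[i] == 0) continue;
        if (availability[i] < best_count) {
            best_count = availability[i];
            best = i;  // rarest downloadable chunk seen so far
        }
    }
    return best;  // empty when nothing useful is currently downloadable
}
```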
Thicket

Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84]. The work used multiple trees to ensure efficient usage of resources while providing redundancy in the presence of node failures. In order to improve load-balancing across the nodes, the protocol tries to minimize the number of nodes that act as interior nodes in several trees, thus reducing the load produced by forwarding messages. The protocol also defines a reconfiguration algorithm for balancing load across neighboring nodes and a tree repair procedure to handle tree partitions. Results show that the protocol is able to quickly recover from a large number of simultaneous node failures and to balance the load across the existing nodes.

2.3.2 QoS-Aware P2P

Until recently, P2P systems have been focused on providing resiliency and throughput, thus not addressing the increasing need for QoS of latency-sensitive applications, such as VoD.

QRON

QRON [85] aimed to provide a general unified framework, in contrast to application-specific overlays. The overlay brokers (OBs), present in each autonomous system on the Internet, support QoS routing for overlay applications through resource negotiation and allocation, and topology discovery. The main goal of QRON is to find a path that satisfies the QoS requirements, while balancing the overlay traffic across the OBs and overlay links. For this it proposes two distinct algorithms, a "modified shortest distance path" (MSDP) and a "proportional bandwidth shortest path" (PBSP).

GlueQoS

GlueQoS [86] focused on the dynamic and symmetric QoS negotiation between the QoS features of two communicating processes. It provides a declarative language that allows the specification of the QoS feature set (and possible conflicts), and a runtime negotiation mechanism that finds a set of QoS features valid at both ends of the interacting components. Contrary to aspect-oriented programming [65], which only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that remains valid throughout the duration of the session between a client and a server.

2.4 P2P+FT Middleware Systems

Research on P2P systems has been largely dominated by the pursuit of fault-tolerance, such as in distributed storage, mainly due to the resilient and decentralized nature of P2P infrastructures.

2.4.1 Publish-subscribe

P2P publish-subscribe systems implement a messaging pattern in which the publishers (senders) do not have a predefined set of subscribers (receivers) for their messages. Instead, subscribers must first register their interests with the target publisher before starting to receive published messages. This decoupling between publishers and subscribers allows for better scalability and, ultimately, performance.

Scribe

Scribe [87] aimed to provide a large-scale event notification infrastructure, built on top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to support topics and subscriptions and to build multicast trees. Fault-tolerance is provided by the self-organizing capabilities of Pastry, through the adaptation to network failures and subsequent multicast tree repair. Event dissemination is best-effort and without any delivery order guarantees. Nevertheless, it is possible to enhance Scribe to support consistent ordering through the implementation of sequential time stamping at the root of the topic. To ensure strong consistency and tolerate topic root node failures, an implementation of a consensus algorithm such as Paxos [89] is needed across the set of replicas (of the topic root).
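The sequential time stamping mentioned above can be illustrated in a few lines: the topic root stamps every event with a monotonically increasing sequence number before multicasting it down the tree, so all subscribers can apply the same per-topic total order. A minimal sketch under assumed names follows; it is not Scribe's API.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical topic root: stamps events so subscribers can totally order
// them, even when the multicast tree delivers them along different paths.
struct Event {
    uint64_t seq;          // position in the topic's total order
    std::string payload;
};

class TopicRoot {
    uint64_t next_seq_ = 0;
public:
    // Called for every publish that reaches the root of the topic tree;
    // the stamped event is then multicast down the tree.
    Event stamp(std::string payload) {
        return Event{next_seq_++, std::move(payload)};
    }
};

// Subscribers deliver events strictly in sequence order, buffering
// out-of-order arrivals (buffering logic elided for brevity).
```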
Hermes

Hermes [90] focused on providing a distributed event-based middleware with an underlying P2P overlay for scalability and reliability. Inspired by work done on Distributed Hash Table (DHT) overlay routing [88, 91], it also has some notion of rendezvous similar to [81]. It bridges the gap between programming-language type semantics and low-level event primitives by introducing the concepts of event-type and event-attributes, which have some common ground with an Interface Description Language (IDL) in the RPC context. In order to improve performance, it is possible to attach a filter expression over the event attributes during the subscription process. Several algorithms are proposed for improving availability, but they all provide weak consistency properties.

2.4.2 Resource Computing

There is a growing interest in harvesting and managing the spare computing power of the increasing number of networked devices, both public and private, as reported in [92, 93, 94, 95]. Some relevant examples are:

BOINC

BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facilitate the harvesting of public computing resources by the scientific research community. BOINC implements a redundant computing mechanism to prevent malicious or erroneous computational results. Each project specifies the number of results that should be created for each "workunit", i.e. the basic unit of computation to be performed. When some number of the results are available, an application-specific function is called to evaluate the results and possibly choose a canonical result. If no consensus is achieved, or if the results simply fail, a new set of results is computed. This process repeats until a successful consensus is achieved or an application-defined timeout occurs.
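The redundant-computing check described above reduces to a small validation loop: collect the replicated results of a workunit, group equivalent ones, and accept a canonical result once enough of them agree. The sketch below is a generic rendering of that scheme with assumed names and a majority quorum; a real project would plug in its own application-specific equivalence function.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// One replicated computation result for a workunit; the string stands in
// for whatever application-specific payload a project returns.
using Result = std::string;

// Application-specific equivalence is reduced here to exact equality;
// projects would supply a fuzzier comparison (e.g. numeric tolerance).
std::optional<Result> canonical_result(const std::vector<Result>& results,
                                       size_t quorum) {
    std::map<Result, size_t> votes;
    for (const auto& r : results) {
        if (++votes[r] >= quorum) return r;  // enough agreement: accept
    }
    return std::nullopt;  // no consensus yet: schedule more replicas
}

// Usage: with 3 replicas and quorum 2, two matching results validate the
// workunit; otherwise the scheduler creates a new batch of results.
```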
P2P-MapReduce

Developed at Google, MapReduce [97] is a programming model that is able to parallelize the processing of large data sets in a distributed environment. It follows a master-slave model, where a master distributes the data set across a set of slaves, returning at the end the computational results (from the map or reduce tasks). MapReduce provides fault-tolerance for slave nodes by reassigning a failed job to an alternative active slave, but lacks support for master failures. P2P-MapReduce [98] provides fault-tolerance by resorting to two distinct P2P overlays, one containing the currently available masters in the system, and the other the active slaves. When a user submits a MapReduce job, it queries the master overlay for a list of the available masters (ordered by their workload). It then selects a master node and the number of replicas. After this, the master node notifies its replicas that they will participate in the current job. A master node is responsible for periodically synchronizing the state of the job over its replica set. In case of failure, a distributed procedure is executed to elect a new master across the active replicas. Finally, the master selects the set of slaves from the slave overlay, using a performance metric based on workload and CPU performance, and starts the computation.

2.4.3 Storage

Storage systems were one of the most prevalent applications of first-generation P2P systems. Evolving from early file-sharing systems, and with the help of DHT middleware, they have now become a common choice for large-scale storage systems in both industry and academia.

openDHT

Work done in [99] aimed to provide a lightweight framework for P2P storage using DHTs (as in [88, 91]) in a public environment. The key challenge was to handle mutually untrusting clients while guaranteeing fairness in the access to and allocation of storage. The work was able to provide fair access to the underlying storage capacity, under the assumption that storage capacity is free. Because of its intrinsically fair approach, the system is unable to provide any type of Service Level Agreement (SLA) to its clients, which reduces the domain of applications that can use it.

Dynamo

Recent research on data storage and distribution at Amazon [25] focuses on key-value approaches using P2P overlays, more precisely DHTs, to overcome the well-explored limitation of simultaneously providing high availability and strong consistency (through synchronous replication) [100, 101]. The approach taken was to use an optimistic replication scheme that relies on asynchronous replica synchronization (also known as passive replication). The consistency conflicts between different replicas, caused by network and server failures, are resolved at 'read time', as opposed to the more traditional 'write time' strategy, with this being done to maximize write availability in the system. Such conflicts are resolved by the services, allowing for a more efficient resolution (although the system offers a default 'last value holds' strategy to the services). Dynamo offers efficient key-value storage, while maximizing the availability of write operations. Nevertheless, the ring-based overlay hampers the scalability of the system and, depending on the partitioning strategy used, the membership process does not seem efficient.
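Read-time conflict resolution can be shown in a few lines: a read may return several divergent versions of a value, and a resolver collapses them before the value is handed back. The sketch below implements only the default 'last value holds' policy over assumed version metadata; Dynamo itself tracks causality with vector clocks and lets each service supply its own resolver.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// A stored version with assumed metadata: a wall-clock style timestamp
// stands in for Dynamo's richer vector-clock causality information.
struct Version {
    uint64_t timestamp;
    std::string value;
};

// Default 'last value holds' resolver: applied at read time, when the
// replicas returned more than one divergent version for a key.
std::string resolve_on_read(const std::vector<Version>& divergent) {
    auto latest = std::max_element(
        divergent.begin(), divergent.end(),
        [](const Version& a, const Version& b) {
            return a.timestamp < b.timestamp;
        });
    return latest->value;  // the winning version is handed to the client
}
```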
2.5 P2P+RT+FT Middleware Systems

These systems offer a natural evolution over previous RT+FT middleware systems. They aim to provide scalability and resilience through a P2P network infrastructure that offers lightweight FT mechanisms, allowing them to support soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose middleware that aimed to integrate FT into the P2P network layer while being able to provide RT support. The first implementation of this architecture, in Java, was done in DAEM [6, 104]. This work used a hierarchical tree-based P2P overlay based on P3 [15]. FT support was provided at all levels of the tree, resulting in a high availability rate, but the use of JGroups [7] for maintaining strong consistency, both for mesh and service data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a major impact on availability when they occurred near the root node, as they produced cascading failures. Initial support for RT was provided, but the high overhead of the replication infrastructure limited its applicability.

2.6 A Closer Look at TAO, MEAD and ICE

This section provides a closer look at the middleware systems that provided us with several strategies and insights used in the design and implementation of Stheno, our middleware solution that is able to support RT, FT and P2P.

All the referred systems, TAO, MEAD and ICE, share a service-oriented architecture with a client-server network model. In terms of RT, both TAO and MEAD support the RT-CORBA standard, while ICE only supports best-effort invocations. As for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid approach that combines both low- and high-level services.
2.6.1 TAO

TAO is a classical RPC middleware and therefore only supports the client-server network model. Name resolution is provided by a high-level service, representing a clear point of failure and a bottleneck.

RT Support. TAO supports version 1.0 of the RT-CORBA specification, with the most important features being: (a) priority propagation; (b) explicit binding; and (c) RT thread pools.

Priority propagation ensures that a request maintains its priority across a chain of invocations. A client issues a request to an Object A that, in turn, issues an invocation on another Object B. The request priority at Object A is then used to make the invocation on Object B. There are two types of propagation: server-declared priorities and client-propagated priorities. In the first type, the server dictates the priority that will be used when processing an incoming invocation. In the second, the priority of the invocation is encoded within the request, so the server processes the request at the priority specified by the client.

A source of unbounded priority inversion is the use of multiplexed communication channels. To overcome this, the RT-CORBA specification mandates that network channels be pre-established, avoiding the latency caused by their creation. This model allows two possible policies: (a) a private connection between the client and the server, or (b) a priority-banded connection that can be shared but limits the priority of the requests that can be made on it.

In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with the support of a reactor (an object that handles network event de-multiplexing), and is normally associated with an acceptor (an entity that handles incoming connections), a connection cache, and a memory pool. In classic CORBA a high-priority thread can be delayed by a low-priority one, leading to priority inversion. In an effort to avoid this unwanted side-effect, the RT-CORBA specification defines the concept of thread pool lanes.

All the threads belonging to a thread pool lane have the same priority, and thus only process invocations that carry that same priority (or a band that contains that priority). Because each lane has its own acceptor, memory pool and reactor, the risk of priority inversion is greatly minimized at the expense of a greater resource usage overhead.
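The lane concept can be sketched outside of CORBA in a few lines: one queue plus a set of worker threads per priority, so a request never waits behind work of a different priority. The sketch below is a simplified, generic rendition, not TAO's implementation; it omits the per-lane acceptor, reactor and memory pool, and elides the assignment of OS-level thread priorities.

```cpp
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One lane = one queue served only by threads of that lane's priority, so
// requests of different priorities never share a queue (no inversion
// inside the pool).
class Lane {
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::thread> workers_;
    bool stop_ = false;

public:
    explicit Lane(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~Lane() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(job)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
                if (stop_ && q_.empty()) return;
                job = std::move(q_.front());
                q_.pop();
            }
            job();  // runs on a thread dedicated to this lane's priority
        }
    }
};

// The pool dispatches each invocation to the lane matching its priority.
class LanedThreadPool {
    std::map<int, Lane> lanes_;
public:
    // e.g. {{10, 4}, {20, 2}}: 4 threads at priority 10, 2 at priority 20.
    explicit LanedThreadPool(const std::map<int, unsigned>& config) {
        for (const auto& [prio, n] : config) lanes_.try_emplace(prio, n);
    }
    void dispatch(int priority, std::function<void()> request) {
        lanes_.at(priority).submit(std::move(request));
    }
};
```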
FT Support. In an effort to combine RT and FT semantics, the proposed replication style, semi-active, was heavily based on Delta-4 [45]. This strategy avoids the latency associated with both warm and cold passive replication [105] and the high overhead and non-determinism of active replication, but represents an extension to the FT specification.

Figure 2.2: TAO's architectural layout (adapted from [3]).

Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved through a set of high-level services built on top of TAO. These services include a Fault Notifier, a Fault Detector and a Replication Manager.

The Replication Manager is the central component of the FT infrastructure. It acts as the central rendezvous for the remaining FT components, and it has the responsibility of managing the replication groups' life-cycle (creation/destruction) and performing group maintenance, that is, the election of a new primary, the removal of faulty replicas, and the updating of group information.

It is composed of three sub-components: (a) a Group Manager, which manages group membership operations (adding and removing elements), allows changing the primary of a given group (for passive replication only), and allows the manipulation and retrieval of group member localization; (b) a Property Manager, which allows the manipulation of replication properties, such as the replication style; and (c) a Generic Factory, the entry point for creating and destroying objects.
The Fault Detector is the most basic component of the FT infrastructure. Its role is to monitor components, processes and processing nodes, and to report eventual failures to the Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards them to the Replication Manager.

The FT bootstrapping sequence is as follows: (a) the Naming Service is started; (b) the Replication Manager is started; (c) followed by the Fault Notifier, which (d) finds the Replication Manager and registers itself with it. In response, (e) the Replication Manager connects as a consumer to the Fault Notifier. (f) Each node that is going to participate starts a Fault Detector Factory and a Replica Factory, which in turn register themselves with the Replication Manager. (g) A group creation request is made to the Replication Manager (by a foreign entity, referred to as the Object Group Creator), followed by a request for the list of available Fault Detector Factories and Replica Factories; (h) this is followed by a request to create an object group in the Generic Factory. (i) The Object Group Creator then bootstraps the desired number of replicas using the Replica Factory at each target node; each Replica Factory creates the actual replica and, at the same time, starts a Fault Detector at each site using the Fault Detector Factory. Each of these detectors finds the Replication Manager, retrieves the reference to the Fault Notifier, and connects to it as a supplier. (j) Each replica is added to the object group by the Object Group Creator using the Group Manager at the Replication Manager. (k) At this point, a client can start, retrieve the object reference from the Naming Service, and make an invocation on the group, which is then carried out by the primary of the replication group.

Proactive FT Support. An alternative approach was proposed by FLARe [74], which focuses on proactively adapting the replication group to the load present in the system. The replication style is limited to semi-active replication with state transfer, which is commonly referred to simply as passive replication.
Figure 2.3: FLARe's architectural layout (adapted from [74]).

Figure 2.3 shows the architectural overview of FLARe. This architecture adds three new components to TAO's FT infrastructure: (a) a client interceptor that redirects invocations to the proper server, as the initial reference could have been changed by the proactive strategy in response to a load change; (b) a redirection agent that receives the updates with these changes from the Replication Manager; and (c) a resource monitor that watches the load on a processing node and sends periodic updates to the Replication Manager.

In the presence of abnormal load fluctuations, the Replication Manager changes the replication group to adapt to the new conditions, by creating replicas on nodes with lower usage and, if required, by changing the primary to a more suitable replica.

TAO's fault-tolerance support relies on a centralized infrastructure, with its main component, the Replication Manager, representing a major obstacle to the system's scalability and resiliency. No mechanisms are provided to replicate this entity.

2.6.2 MEAD

MEAD focused on providing fault-tolerance support in a non-intrusive way, enhancing distributed RT systems with transparent, although tunable, FT that is proactively dependable through resource awareness, with scalable and fast fault-detection and fault-recovery. It uses RT-CORBA, more specifically TAO, as a proof of concept.

Transparent Proactive FT Support. MEAD's architecture contains three major components, namely the Proactive FT Manager, the MEAD Recovery Manager and the MEAD Interceptor.
The underlying communication is provided by Spread, a group communication framework that offers reliable, totally ordered multicast, used to guarantee consistency for both component and node membership.

The MEAD Interceptor provides the usual interception of system calls between the application and the underlying operating system. This approach offers a transparent and non-intrusive way to enhance the middleware with fault-tolerance.

Figure 2.4: MEAD's architectural layout (adapted from [14]).

Figure 2.4 shows the architectural overview of MEAD. The main component of the MEAD system is the Proactive FT Manager, which is embedded within the interceptors on both server and client. It has the responsibility of monitoring the resource usage at each server and initializing a proactive recovery scheme based on a two-step threshold. When the resource usage rises above the first threshold, the proactive manager sends a request to the MEAD Recovery Manager to launch a new replica. If the usage rises above the second threshold, the proactive manager starts migrating the replica's clients to the next non-faulty replica server.

The MEAD Recovery Manager has some similarities with the Replication Manager of CORBA-FT, as it also must launch new replicas in the presence of failures (node or server). In MEAD, the Recovery Manager does not follow a centralized architecture, as in TAO or FLARe, where all the components of the FT infrastructure are connected to the replication manager; instead, the components are connected by a reliable totally ordered group communication framework that establishes an implicit agreement at each communication round. These frameworks also provide the notion of a view, i.e. an instantaneous snapshot of the group membership.
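The two-step threshold scheme reduces to a simple decision function evaluated on every resource-usage sample. The sketch below is a hedged illustration with assumed threshold values and invented names, not MEAD code.

```cpp
#include <iostream>

// Assumed two-step thresholds on normalized resource usage [0, 1].
constexpr double kSpawnThreshold   = 0.70;  // launch a fresh replica
constexpr double kMigrateThreshold = 0.90;  // move clients off this replica

enum class Action { None, SpawnReplica, MigrateClients };

// Evaluated by the proactive manager on every usage sample of a server.
Action proactive_decision(double usage) {
    if (usage >= kMigrateThreshold) return Action::MigrateClients;
    if (usage >= kSpawnThreshold)   return Action::SpawnReplica;  // ask the recovery manager
    return Action::None;
}

int main() {
    // Crossing the first threshold requests a new replica; crossing the
    // second starts redirecting clients to the next non-faulty replica.
    std::cout << (proactive_decision(0.75) == Action::SpawnReplica) << '\n';
    std::cout << (proactive_decision(0.95) == Action::MigrateClients) << '\n';
}
```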