Nubilum: Resource Management System for Distributed Clouds


My PhD thesis on resource management in Distributed Cloud environments.



  1. Graduate Program in Computer Science. "Nubilum: Resource Management System for Distributed Clouds". By Glauco Estácio Gonçalves. Doctoral Thesis. Universidade Federal de Pernambuco. posgraduacao@cin.ufpe.br. www.cin.ufpe.br/~posgraduacao. Recife, March 2012.
  2. UNIVERSIDADE FEDERAL DE PERNAMBUCO, CENTRO DE INFORMÁTICA, Graduate Program in Computer Science. GLAUCO ESTÁCIO GONÇALVES, "Nubilum: Resource Management System for Distributed Clouds". This work was presented to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco as a partial requirement for the degree of Doctor in Computer Science. Advisor: Dr. Judith Kelner. Co-advisor: Dr. Djamel Sadok. Recife, March 2012.
  3. Doctoral thesis presented by Glauco Estácio Gonçalves to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco, under the title "Nubilum: Resource Management System for Distributed Clouds", advised by Prof. Judith Kelner and approved by an examination committee formed by the professors: Prof. Paulo Romero Martins Maciel (Centro de Informática / UFPE), Prof. Stênio Flávio de Lacerda Fernandes (Centro de Informática / UFPE), Prof. Kelvin Lopes Dias (Centro de Informática / UFPE), Prof. José Neuman de Souza (Departamento de Computação / UFC), and Profa. Rossana Maria de Castro Andrade (Departamento de Computação / UFC). Approved for printing. Recife, March 12, 2012. Prof. Nelson Souto Rosa, Coordinator of the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco.
  4. To my family: Danielle, João Lucas, and Catarina.
  5. Acknowledgments
      I would like to express my gratitude to God, cause of all things and of my own existence, and to the Blessed Virgin Mary, to whom I appealed many times in prayer and who always heard me.
      I would like to thank my advisor, Dr. Judith Kelner, and my co-advisor, Dr. Djamel Sadok, whose expertise and patience added considerably to my doctoral experience. Thanks for trusting in my capacity to conduct my doctorate at GPRT (Networks and Telecommunications Research Group).
      I am indebted to all the people at GPRT for their invaluable help with this work. A very special thanks goes out to Patrícia, Marcelo, and André Vítor, who gave valuable comments over the course of my PhD.
      I must also acknowledge my committee members, Dr. José Neuman, Dr. Otto Duarte, Dr. Rossana Andrade, Dr. Stênio Fernandes, Dr. Kelvin Lopes, and Dr. Paulo Maciel, for reviewing my proposal and dissertation and offering helpful comments to improve my work.
      I would like to thank my wife Danielle for her prayer, patience, and love, which gave me the necessary strength to finish this work. A special thanks to my children, João Lucas and Catarina; they are gifts of God that make life delightful.
      Finally, I would like to thank my parents, João and Fátima, and my sisters, Cynara and Karine, for their love. Their blessings have always been with me as I pursued my doctoral research.
  6. Abstract
      The current infrastructure of Cloud Computing providers is composed of networking and computational resources located in large datacenters supporting as many as hundreds of thousands of diverse pieces of IT equipment. In such a scenario, there are several management challenges related to energy, failure and operational management, and temperature control. Moreover, the geographical distance between resources and final users is a source of delay when accessing the services. An alternative to such challenges is the creation of Distributed Clouds (D-Clouds), with geographically distributed resources along a network infrastructure with broad coverage.
      Providing resources in such a distributed scenario is not a trivial task, since, beyond processing and storage resources, network resources must be taken into consideration, offering users a connectivity service for data transportation (also called Network as a Service, NaaS). Thereby, the allocation of resources must consider the virtualization of both servers and network devices. Furthermore, resource management must cover all steps, from the initial discovery of the adequate resource for attending developers' demands to its final delivery to the users.
      Considering those challenges in resource management in D-Clouds, this Thesis proposes Nubilum, a system for resource management in D-Clouds that considers geo-locality of resources and NaaS aspects. Through its processes and algorithms, Nubilum offers solutions for discovery, monitoring, control, and allocation of resources in D-Clouds in order to ensure the adequate functioning of the D-Cloud while meeting developers' requirements. Nubilum and its underlying technologies and building blocks are described, and its allocation algorithms are evaluated to verify their efficacy and efficiency.
      Keywords: cloud computing, resource management mechanisms, network virtualization.
  8. Contents
      Abstract
      Resumo
      Abbreviations and Acronyms
      1 Introduction
        1.1 Motivation
        1.2 Objectives
        1.3 Organization of the Thesis
      2 Cloud Computing
        2.1 What is Cloud Computing?
        2.2 Agents involved in Cloud Computing
        2.3 Classification of Cloud Providers
          2.3.1 Classification according to the intended audience
          2.3.2 Classification according to the service type
          2.3.3 Classification according to programmability
        2.4 Mediation System
        2.5 Groundwork Technologies
          2.5.1 Service-Oriented Computing
          2.5.2 Server Virtualization
          2.5.3 MapReduce Framework
          2.5.4 Datacenters
      3 Distributed Cloud Computing
        3.1 Definitions
        3.2 Research Challenges inherent to Resource Management
          3.2.1 Resource Modeling
          3.2.2 Resource Offering and Treatment
          3.2.3 Resource Discovery and Monitoring
          3.2.4 Resource Selection and Optimization
          3.2.5 Summary
      4 The Nubilum System
        4.1 Design Rationale
          4.1.1 Programmability
          4.1.2 Self-optimization
          4.1.3 Existing standards adoption
        4.2 Nubilum's conceptual view
          4.2.1 Decision plane
          4.2.2 Management plane
          4.2.3 Infrastructure plane
        4.3 Nubilum's functional components
          4.3.1 Allocator
          4.3.2 Manager
  9. Contents (continued)
          4.3.3 Worker
          4.3.4 Network Devices
          4.3.5 Storage System
        4.4 Processes
          4.4.1 Initialization processes
          4.4.2 Discovery and monitoring processes
          4.4.3 Resource allocation processes
        4.5 Related projects
      5 Control Plane
        5.1 The Cloud Modeling Language
          5.1.1 CloudML Schemas
          5.1.2 A CloudML usage example
          5.1.3 Comparison and discussion
        5.2 Communication interfaces and protocols
          5.2.1 REST Interfaces
          5.2.2 Network Virtualization with Openflow
        5.3 Control Plane Evaluation
      6 Resource Allocation Strategies
        6.1 Manager Positioning Problem
        6.2 Virtual Network Allocation
          6.2.1 Problem definition and modeling
          6.2.2 Allocating virtual nodes
          6.2.3 Allocating virtual links
          6.2.4 Evaluation
        6.3 Virtual Network Creation
          6.3.1 Minimum length Steiner tree algorithms
          6.3.2 Evaluation
        6.4 Discussion
      7 Conclusion
        7.1 Contributions
        7.2 Publications
        7.3 Future Work
      References
  10. List of Figures
      Figure 1  Agents in a typical Cloud Computing scenario (from [24])
      Figure 2  Classification of Cloud types (from [71])
      Figure 3  Components of an Archetypal Cloud Mediation System (adapted from [24])
      Figure 4  Comparison between (a) a current Cloud and (b) a D-Cloud
      Figure 5  ISP-based D-Cloud example
      Figure 6  Nubilum's planes and modules
      Figure 7  Functional components of Nubilum
      Figure 8  Schematic diagram of Allocator's modules and relationships with other components
      Figure 9  Schematic diagram of Manager's modules and relationships with other components
      Figure 10  Schematic diagram of Worker modules and relationships with the server system
      Figure 11  Link discovery process using LLDP and Openflow
      Figure 12  Sequence diagram of the Resource Request process for a developer
      Figure 13  Integration of different descriptions using CloudML
      Figure 14  Basic status type used in the composition of other types
      Figure 15  Type for reporting status of the virtual nodes
      Figure 16  XML Schema used to report the status of the physical node
      Figure 17  Type for reporting complete description of the physical nodes
      Figure 18  Type for reporting the specific parameters of any node
      Figure 19  Type for reporting information about the physical interface
      Figure 20  Type for reporting information about a virtual machine
      Figure 21  Type for reporting information about the whole infrastructure
      Figure 22  Type for reporting information about the physical infrastructure
      Figure 23  Type for reporting information about a physical link
      Figure 24  Type for reporting information about the virtual infrastructure
      Figure 25  Type describing the service offered by the provider
      Figure 26  Type describing the requirements that can be requested by a developer
      Figure 27  Example of a typical Service description XML
      Figure 28  Example of a Request XML
      Figure 29  Physical infrastructure description
      Figure 30  Virtual infrastructure description
      Figure 31  Communication protocols employed in Nubilum
      Figure 32  REST operation for the retrieval of service information
      Figure 33  REST operation for updating information of a service
      Figure 34  REST operation for requesting resources for a new application
      Figure 35  REST operation for changing resources of a previous request
      Figure 36  REST operation for releasing resources of an application
      Figure 37  REST operation for registering a new Worker
      Figure 38  REST operation to unregister a Worker
      Figure 39  REST operation for updating information of a Worker
      Figure 40  REST operation for retrieving a description of the D-Cloud infrastructure
      Figure 41  REST operation for updating the description of a D-Cloud infrastructure
      Figure 42  REST operation for the creation of a virtual node
      Figure 43  REST operation for updating a virtual node
      Figure 44  REST operation for removal of a virtual node
      Figure 45  REST operation for requesting the discovered physical topology
      Figure 46  REST operation for the creation of a virtual link
      Figure 47  REST operation for updating a virtual link
      Figure 48  REST operation for removal of a virtual link
  11. List of Figures (continued)
      Figure 49  Example of a typical rule for ARP forwarding
      Figure 50  Example of the typical rules created for virtual links: (a) direct, (b) reverse
      Figure 51  Example of a D-Cloud with ten workers and one Manager
      Figure 52  Algorithm for allocation of virtual nodes
      Figure 53  Example illustrating the minimax path
      Figure 54  Algorithm for allocation of virtual links
      Figure 55  The (a) old and (b) current network topologies of RNP used in simulations
      Figure 56  Results for the maximum node stress in the (a) old and (b) current RNP topology
      Figure 57  Results for the maximum link stress in the (a) old and (b) current RNP topology
      Figure 58  Results for the mean link stress in the (a) old and (b) current RNP topology
      Figure 59  Mean path length in the (a) old and (b) current RNP topology
      Figure 60  Example of creating a virtual network: (a) before the creation; (b) after the creation
      Figure 61  Search procedure used by the GHS algorithm
      Figure 62  Placement procedure used by the GHS algorithm
      Figure 63  Example of the placement procedure: (a) before and (b) after placement
      Figure 64  Percentage of optimal samples for GHS and STA in the old RNP topology
      Figure 65  Percentage of samples reaching relative error ≤ 5% in the old RNP topology
      Figure 66  Percentage of optimal samples for GHS and STA in the current RNP topology
      Figure 67  Percentage of samples reaching relative error ≤ 5% in the current RNP topology
  12. List of Tables
      Table I  Summary of the main aspects discussed
      Table II  MIME types used in the overall communications
      Table III  Models for the length of messages exchanged in the system, in bytes
      Table IV  Characteristics present in Nubilum's resource model
      Table V  Reduced set of characteristics considered by the proposed allocation algorithms
      Table VI  Factors and levels used in the MPA's evaluation
      Table VII  Factors and levels used in the GHS's evaluation
      Table VIII  Scientific papers produced
  13. Abbreviations and Acronyms
      CDN      Content Delivery Network
      CloudML  Cloud Modeling Language
      D-Cloud  Distributed Cloud
      DHCP     Dynamic Host Configuration Protocol
      GHS      Greedy Hub Selection
      HTTP     Hypertext Transfer Protocol
      IaaS     Infrastructure as a Service
      ISP      Internet Service Provider
      LLDP     Link Layer Discovery Protocol
      MPA      Minimax Path Algorithm
      MPP      Manager Positioning Problem
      NaaS     Network as a Service
      NV       Network Virtualization
      OA       Optimal Algorithm
      OCCI     Open Cloud Computing Interface
      PoP      Point of Presence
      REST     Representational State Transfer
      RP       Replica Placement
      RPA      Replica Placement Algorithm
      STA      Steiner Tree Approximation
      VM       Virtual Machine
      VN       Virtual Network
      XML      Extensible Markup Language
      ZAA      Zhu and Ammar Algorithm
  14. 1 Introduction
      "A linea incipere." (Erasmus)
      Nowadays, it is common to access content across the Internet with little reference to the underlying datacenter hosting infrastructure maintained by content providers. The technology used to provide such a level of locality transparency also offers a new model for the provisioning of computing services, known as Cloud Computing. This model is attractive as it allows resources to be provisioned according to users' requirements, leading to overall cost reduction. Cloud users can rent resources as they become necessary, in a much more scalable and elastic way. Moreover, such users can transfer operational risks to cloud providers. From the viewpoint of those providers, the model offers a way to better utilize their own infrastructure. Armbrust et al. [1] point out that this model benefits from a form of statistical multiplexing, since it allocates resources for several users concurrently on a demand basis. This statistical multiplexing of datacenters builds on several decades of research in areas such as distributed computing, Grid computing, web technologies, service computing, and virtualization.
      Current Cloud Computing providers mainly use large, consolidated datacenters to offer their services. However, the ever increasing need for over-provisioning to attend peak demands and to provide redundancy against failures, allied to expensive cooling needs, are important factors increasing the energy costs of centralized datacenters [62]. In current datacenters, the cooling technologies used for heat dissipation control account for as much as 50% of the total power consumption [38]. In addition to these aspects, it must be observed that the network between users and the Cloud is often an unreliable best-effort IP service, which can harm delay-constrained services and interactive applications.
      To deal with these problems, there have been some indications that small cooperative datacenters can be more attractive, since they offer a cheaper, lower-power alternative that reduces the infrastructure costs of centralized Clouds [12]. These small datacenters can be built in different geographical regions and connected by dedicated or public (provided by Internet Service Providers) networks, configuring a new type of Cloud, referred to as a Distributed Cloud.
  15. Distributed Clouds [20], or just D-Clouds, can exploit the possibility of (virtual) link creation and the potential of sharing resources across geographic boundaries to provide latency-based allocation of resources and to fully utilize this emerging distributed computing power. D-Clouds can reduce communication costs by simply provisioning storage, servers, and network resources close to end users.
      D-Clouds can be considered an additional step in the ongoing deployment of Cloud Computing: one that supports different requirements and leverages new opportunities for service providers. Users in a Distributed Cloud will be free to choose where to allocate their resources in order to attend to specific market niches, constraints on the jurisdiction of software and data, or quality-of-service aspects of their clients.
      1.1 Motivation
      Similarly to Cloud Computing, one of the most important design aspects of D-Clouds is the availability of "infinite" computing resources that may be used on demand. Cloud users see this "infinite" resource pool because the Cloud offers continuous monitoring and management of its resources and allocates resources in an elastic way. Nevertheless, providing on-demand computing instances and network resources in a distributed scenario is not a trivial task. Dynamic allocation of resources, and their possible reallocation, are essential characteristics for accommodating unpredictable demands and, ultimately, contributing to investment return.
      In the context of Clouds, the essential feature of any resource management system is to guarantee that both user and provider requirements are met satisfactorily. Particularly in D-Clouds, users may have network requirements, such as bandwidth and delay constraints, in addition to the common computational requirements, such as CPU, memory, and storage. Furthermore, other user requirements are relevant, including node locality, topology of nodes, jurisdiction, and application interaction.
      The development of solutions to cope with resource management problems remains a very important topic in the field of Cloud Computing. There are solutions focused on grid computing ([49], [70]) and on datacenters in current Cloud Computing scenarios ([4]). However, such strategies do not fit D-Clouds well, as they are heavily based on assumptions that do not hold in Distributed Cloud scenarios. For example, such solutions are designed for over-provisioned networks and commonly do not take into consideration the cost of resources' communication, an important aspect of D-Clouds that must be cautiously monitored and/or reserved in order to meet users' requirements.
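The communication-cost argument above can be made concrete with a toy sketch (illustrative only; this is not one of Nubilum's algorithms, and the worker names, capacities, and latencies are invented): a policy that ranks candidate sites by free CPU alone, as a centralized datacenter scheduler might, can pick a distant site, while a policy that also weighs latency to the requesting users, as a D-Cloud allocator must, picks a nearby one.

```python
# Toy D-Cloud placement: choose a worker for a VM request, comparing a
# CPU-only policy with a latency-aware one. All workers, capacities, and
# latencies below are invented for illustration.

workers = {
    # name: (free CPU cores, latency in ms from the requesting users)
    "recife":    (4,  10),
    "sao-paulo": (8,  45),
    "frankfurt": (16, 190),
}

def cpu_only(req_cores):
    """Classic datacenter policy: most free CPU wins; the network is ignored."""
    ok = {w: cpu for w, (cpu, _) in workers.items() if cpu >= req_cores}
    return max(ok, key=ok.get) if ok else None

def latency_aware(req_cores):
    """D-Cloud policy: among workers with enough CPU, minimize user latency."""
    ok = {w: lat for w, (cpu, lat) in workers.items() if cpu >= req_cores}
    return min(ok, key=ok.get) if ok else None

print(cpu_only(2))       # picks the largest site, far from the users
print(latency_aware(2))  # picks a nearby site that still fits the request
```

Nubilum's actual allocation strategies, which balance several such requirements at once, are the subject of Chapter 6.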
  16. 16. 3The design of a resource management system involves challenges other than the specificdesign of optimization algorithms for resource management. Since D-Clouds are composed ofcomputational and network devices with different architectures, software, and hardware capabilities,the first challenge is the development of a suitable resource model covering all this heterogeneity[20],. In addition, the next challenge is to describe how resources are offered, which is importantsince the requirements supported by the D-Cloud provider are defined in this step. The otherchallenges are related with the overall operation of the resource management system. When requestsarrive, the system should be aware of the current status of resources, in order to determine if thereare sufficient available resources in the D-Cloud that could satisfy the present request. In this way,the right mechanisms for resource discovery and monitoring should also be designed, allowing thesystem to be aware of the updated status of all its resources. Then, based on the current status andthe requirements of the request, the system may select and allocate resources to serve the presentrequest.Please note that the solution to those challenges involves the fine-grained coordination ofseveral distributed components and the orchestrated execution of the several subsystems composingthe resource management system. At a first glance, these subsystems can be organized into threeparts: one responsible for the direct negotiation of requirements with users; another responsible fordeciding what resources to allocate for given applications; and one last part responsible for theeffective enforcement of these decisions on the resources.Designing such system is a very interesting and challenging task, and it raises the followingresearch questions that will be investigated in this thesis:1. How Cloud users describe their requirements? 
In order to enable automatic negotiation between users and the D-Cloud, the Cloud must recognize a language or formalism for requirements description. Thus, the investigation of this topic must determine the proper characteristics of such a language. In addition, it must survey the existing approaches to this topic in the related computing areas.

2. How to represent the resources available in the Cloud? Correlated to the first question, the resource management system must also maintain an information model to represent all the resources in the Cloud, including their relationships (topology) and their current status.

3. How are users' applications mapped onto Cloud resources? This question concerns the very core of resource allocation, i.e., the algorithms, heuristics, and strategies used to decide on the set of resources that meets the applications' requirements while optimizing a utility function.
4. How to enforce the decisions made? The effective enforcement of the decisions involves the extension of communication protocols, or the development of new ones, in order to set up the state of the overall resources in the D-Cloud.

1.2 Objectives

The main objective of this Thesis is to propose an integrated solution to problems related to the management of resources in D-Clouds. Such a solution is presented as Nubilum, a resource management system that offers a self-managed approach to the challenges of discovery, control, monitoring, and allocation of resources in D-Clouds. Nubilum provides fine-grained orchestration of its components in order to allocate applications on a D-Cloud.

The specific goals of this Thesis are strictly related to the research questions presented in Section 1.1. They are:

• Elaborate an information model to describe D-Cloud resources and application requirements, such as computational restrictions, topology, geographic location, and other correlated aspects that can be employed to request resources directly from the D-Cloud;
• Explore and extend communication protocols for the provisioning and allocation of computational and communication resources;
• Develop algorithms, heuristics, and strategies to find suitable D-Cloud resources based on several different application requirements;
• Integrate the information model, the algorithms, and the communication protocols into a single solution.

1.3 Organization of the Thesis

This Thesis identifies the challenges involved in resource management on Distributed Cloud Computing and presents solutions for some of these challenges. The remainder of this document is organized as follows.

The general concepts that form the basis for all the other chapters are introduced in the second chapter.
Its main objective is to discuss Cloud Computing, exploring its definition and classifying the main approaches in this area.

The Distributed Cloud Computing concept and several important aspects of resource management in those scenarios are introduced in the third chapter. Moreover, this chapter makes a comparative analysis of related research areas and problems.
The fourth chapter introduces the first contribution of this Thesis: the Nubilum resource management system, which aggregates the several solutions proposed in this Thesis. Moreover, the chapter highlights the rationale behind Nubilum as well as its main modules and components.

The fifth chapter examines and evaluates the control plane of Nubilum. It describes the proposed Cloud Modeling Language and details the communication interfaces and protocols used for communication between Nubilum components.

The sixth chapter gives an overview of resource allocation problems in Distributed Clouds and makes a thorough examination of the specific problems related to Nubilum. Some particular problems are analyzed, and a set of algorithms is presented and evaluated.

The seventh chapter of this Thesis reviews the obtained evaluation results, summarizes the contributions, and sets the path to future works and open issues on D-Clouds.
2 Cloud Computing

“Definitio est declaratio essentiae rei.” (“A definition is a declaration of the essence of a thing.”)
Legal Proverb

In this chapter the main concepts of Cloud Computing will be presented. It begins with a discussion of the definition of Cloud Computing (Section 2.1) and the main agents involved in Cloud Computing (Section 2.2). Next, classifications of Cloud initiatives are offered in Section 2.3. An exemplary and simple architecture of a Cloud Mediation System is presented in Section 2.4, followed by a presentation in Section 2.5 of the main technologies acting behind the scenes of Cloud Computing initiatives.

2.1 What is Cloud Computing?

A definition of Cloud Computing is given by the National Institute of Standards and Technology (NIST) of the United States: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [45]. The definition says that on-demand dynamic reconfiguration (elasticity) is a key characteristic. Additionally, the definition highlights another Cloud Computing characteristic: it assumes that minimal management effort is required to reconfigure resources. In other words, the Cloud must offer self-service solutions that attend to requests on demand, excluding from the scope of Cloud Computing those initiatives that operate through the rental of computing resources on a weekly or monthly basis. Hence, it restricts Cloud Computing to systems that provide automatic mechanisms for resource rental in real time with minimal human intervention.

The NIST definition gives a satisfactory concept of Cloud Computing as a computing model. However, NIST does not cover the main object of Cloud Computing: the Cloud. Thus, in this Thesis, Cloud Computing is defined as the computing model that operates based on Clouds.
In turn, the Cloud is defined as a conceptual layer that operates above an infrastructure to provide elastic services in a timely manner.
This definition encompasses three main characteristics of Clouds. Firstly, it notes that a Cloud is primarily a concept, i.e., a Cloud is an abstraction over an infrastructure. Thus, it is independent of the employed technologies, and therefore one can accept different setups, like Amazon EC2 or Google App Engine, to be named Clouds. Moreover, the infrastructure is defined in a broad sense, since it can be composed of software, physical devices, and/or other Clouds. Secondly, all Clouds have the same purpose: to provide services. This means that a Cloud hides the complexity of the underlying infrastructure while exploring the potential of overlying services, acting as a middleware. In addition, providing a service implicitly involves some type of agreement that should be guaranteed by the Cloud. Such agreements can vary from pre-defined contracts to malleable agreements defining functional and non-functional requirements. Note that these services are qualified as elastic ones, which has the same meaning as the dynamic reconfiguration that appeared in the NIST definition. Last but not least, the Cloud must provide services as quickly as possible, such that the infrastructure resources are allocated and reallocated to attend to the users' needs.

2.2 Agents involved in Cloud Computing

Despite previous approaches ([64], [8], [72], and [68]), this Thesis focuses on only three distinct agents in Cloud Computing, as shown in Figure 1: clients, developers, and the provider. The first notable point is that the provider deals with two types of users, called developers and clients. Thus, clients are the customers of a service produced by a developer. Clients use services from developers, but such use generates demand to the provider that actually hosts the service, and therefore the client can also be considered a user of the Cloud.
It is important to highlight that in some scenarios (like scientific computing or batch processing) a developer may behave as a client to the Cloud, because it is the end-user of the applications. The text will use “users” when referring to both classes without distinction.

Figure 1 Agents in a typical Cloud Computing scenario (from [24])

Developers can be service providers, independent programmers, scientific institutions, and so on, i.e., all who build applications on the Cloud. They create and run their applications while keeping decisions related to maintenance and management of the infrastructure with the provider. Please note that, a priori, developers do not need to know about the technologies that make up the Cloud infrastructure, nor about the specific location of each item in the infrastructure.

Lastly, the term application is used to mean all types of services that can be developed on the Cloud. In addition, it is important to note that the types of applications supported by a Cloud depend exclusively on the goals of the Cloud as determined by the provider. Such a wide range of possible targets generates many different types of Cloud Providers, which are discussed in the next section.

2.3 Classification of Cloud Providers

Currently, there are several operational initiatives of Cloud Computing; however, despite all being called Clouds, they provide different types of services. For that reason, the academic community ([64], [8], [45], [72], and [71]) has classified these solutions in order to understand their relationships. Three complementary proposals for classification are presented as follows.

2.3.1 Classification according to the intended audience

This first simple taxonomy is suggested by NIST [45], which organizes providers according to the audience to which the Cloud is aimed. There are four classes in this classification: Private Clouds, Community Clouds, Public Clouds, and Hybrid Clouds.

The first three classes accommodate providers in a gradual opening of the intended audience coverage. The Private Cloud class encompasses those Clouds destined to be used solely by one organization, operating over its own datacenter or one leased from a third party for exclusive use. When the Cloud infrastructure is shared by a group of organizations with similar interests, it is classified as a Community Cloud. Furthermore, the Public Cloud class encompasses all initiatives intended to be used by the general public.
Finally, Hybrid Clouds are simply the composition of two or more Clouds pertaining to different classes (Private, Community, or Public).

2.3.2 Classification according to the service type

In [71], the authors offer a classification as represented in Figure 2. This taxonomy divides Clouds into five categories: Cloud Application, Cloud Software Environment, Cloud Software Infrastructure, Software Kernel, and Firmware/Hardware. The authors arranged the different types of Clouds in a stack, showing that Clouds at higher levels are created using services from the lower levels. This idea pertains to the definitions of Cloud Computing discussed previously in Sections 2.1 and 2.2: essentially, the Cloud provider does not need to be the owner of the infrastructure.
Figure 2 Classification of Cloud types (from [71])

The class at the top of the stack, also called Software-as-a-Service (SaaS), involves applications accessed through the Internet, including social networks, Webmail, and Office tools. Such services provide software to be used by the general public, whose main interest is to avoid tasks related to software management like installation and updating. From the point of view of the Cloud provider, SaaS can decrease the costs of software implementation when compared with traditional processes.

Similarly, the Cloud Software Environment class, also called Platform-as-a-Service (PaaS), encloses Clouds that offer programming environments for developers. Through well-defined APIs, developers can use software modules for access control, authentication, distributed processing, and so on, in order to produce their own applications in the Cloud. Moreover, developers can contract services for automatic scalability of their software, databases, and storage services.

In the middle of the stack there is the Cloud Software Infrastructure class of initiatives. This class encompasses solutions that provide virtual versions of infrastructure devices found in datacenters, like servers, databases, and links. Clouds in this class can be divided into three subclasses according to the type of resource they offer. Computational resources are grouped in the Infrastructure-as-a-Service (IaaS) subclass, which provides generic virtual machines that can be used in many different ways by the contracting developer. Services for massive data storage are grouped in the Data-as-a-Service (DaaS) subclass, whose main mission is to store users' data on remote servers, allowing those users to access their data from anywhere and at any time.
Finally, the third subclass, called Communications-as-a-Service (CaaS), is composed of solutions that offer virtual private links and routers through telecommunication infrastructures.

The last two classes do not offer Cloud services specifically, but they are included in the classification to show that providers offering Clouds in higher layers can have their own software and hardware infrastructure. The Software Kernel class includes all the software necessary to provide services to the other categories, like operating systems, hypervisors, cloud management middleware, programming APIs, and libraries. Finally, the Firmware/Hardware class covers all sale and rental services of physical servers and communication hardware.

2.3.3 Classification according to programmability

The five-class scheme presented above can classify and organize the current spectrum of Cloud Computing solutions, but such a model is limited because the number of classes and their relationships will need to be rearranged as new Cloud services emerge. Therefore, in this Thesis, a different classification model will be used, based on the programmability concept previously introduced by Endo et al. [19].

Borrowed from the realm of network virtualization [11], programmability is a concept related to the programming features a network element offers to developers, measuring how much freedom the developer has to manipulate resources and/or devices. This concept can be easily applied to the comparison of Cloud Computing solutions. More programmable Clouds offer environments where developers are free to choose programming paradigms, languages, and platforms. Less programmable Clouds restrict developers in some way: perhaps by forcing a set of programming languages or by providing support for only one application paradigm. Moreover, programmability directly affects the way developers manage their leased resources. From this point of view, providers of less programmable Clouds are responsible for managing their infrastructure transparently to developers. In turn, a more programmable Cloud leaves more of these tasks to developers, thus introducing management difficulties due to the more heterogeneous programming environment.

Thus, Cloud programmability can be defined as the level of sovereignty with which developers can manipulate services leased from a provider. Programmability is a relative concept, i.e., it is adopted to compare one Cloud with others.
Also, programmability is directly proportional to the heterogeneity in the infrastructure of the provider and inversely proportional to the amount of effort that developers must spend to manage leased resources.

To illustrate how this concept can be used, one can classify two current Clouds: Amazon EC2 and Google App Engine. Clearly, Amazon EC2 is more programmable, since in this Cloud developers can choose between different virtual machine classes, operating systems, and so on. After they lease one of these virtual machines, developers can configure it to work as they see fit: as a web server, as a content server, as a unit for batch processing, and so on. On the other hand, Google App Engine can be classified as a less programmable solution, because it allows developers to create Web applications that will be hosted by Google. This restricts developers to the Web paradigm and to some programming languages.
2.4 Mediation System

Figure 3 introduces an Archetypal Cloud Mediation System. This is a conceptual model that will be used as a reference for the discussion on resource management in this Thesis. The Archetypal Cloud Mediation System focuses on one principle: resource management as the main service of any Cloud Computing provider. Thus, other important services like authentication, accounting, and security are out of the scope of this conceptual system, and therefore these services are separated from the Mediation System in this archetypal view. Clients also do not factor into this view of the system, since resource management is mainly related to the allocation of developers' applications and meeting their requirements.

Figure 3 Components of an Archetypal Cloud Mediation System (adapted from [24])

The mediation system is responsible for the entire process of resource management in the Cloud. Such a process covers tasks that range from the automatic negotiation of developers' requirements to the execution of their applications. It has three main layers: negotiation, resource management, and resource control.

The negotiation layer deals with the interface between the Cloud and developers. In the case of Clouds selling infrastructure services, the interface can be a set of operations based on Web Services for control of the leased virtual machines. Alternately, in the case of PaaS services, this interface can be an API for software development in the Cloud. Moreover, the negotiation layer handles the process of contract establishment between the enterprises and the Cloud. Currently, this process is simple and the contracts tend to be restrictive. One can expect that, in the future, Clouds will offer more sophisticated avenues for user interaction through high-level abstractions and service-level policies.
The resource management layer is responsible for the optimal allocation of applications in order to obtain the maximum usage of resources. This function requires advanced strategies and heuristics to allocate resources that meet the contractual requirements established with the application developer. These may include service quality restrictions, jurisdiction restrictions, and elastic adaptation, among others.

Metaphorically, one can say that while the resource management layer acts as the “brain” of the Cloud, the resource control layer plays the role of its “limbs”. The resource control layer encompasses all functions needed to enforce the decisions generated by the upper layer. Beyond the tools used to effectively configure the Cloud resources, all communication protocols used by the Cloud are included in this layer.

2.5 Groundwork Technologies

Some of the main technologies used by current Cloud mediation systems (namely Service-Oriented Computing, virtualization, MapReduce, and datacenters) will now be discussed.

2.5.1 Service-Oriented Computing

Service-Oriented Computing defines a set of principles, architectural models, and technologies for the design and development of distributed applications. The recent development of software focusing on services gave rise to SOA (Service-Oriented Architecture), which can be defined as an architectural model “that supports the discovery, message exchange, and integration between loosely coupled services using industry standards” [37]. The common technology for the implementation of SOA principles is the Web Service, which defines a set of standards to implement services over the World Wide Web.

In Cloud Computing, SOA is the main paradigm for the development of functions at the several layers of the Cloud. Cloud providers publish APIs for their services on the web, allowing developers to use the Cloud and to automate several tasks related to the management of their applications.
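As a concrete illustration, the sketch below builds a REST-style request for launching a virtual machine. The endpoint path, parameter names, and payload schema are hypothetical, not those of any specific provider; the point is only that such management tasks reduce to simple HTTP calls that developers can script.

```python
import json

def build_launch_request(image_id, flavor, region):
    """Build a (method, path, body) triple for a hypothetical
    'launch VM' call; a real client would send it over HTTPS."""
    body = json.dumps({
        "server": {
            "imageRef": image_id,   # which VM template to boot
            "flavorRef": flavor,    # CPU/RAM class of the VM
            "region": region,       # where to place the VM
        }
    })
    return ("POST", "/v1/servers", body)

method, path, body = build_launch_request("img-42", "small", "recife")
```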
Such APIs can assume the form of WSDL documents or REST-based interfaces. Furthermore, providers can make available Software Development Kits (SDKs) and other toolkits for the manipulation of applications running on the Cloud.

2.5.2 Server Virtualization

Server virtualization is a technique that allows a computer system to be partitioned into multiple isolated execution environments, each offering a service similar to that of a single physical computer; these environments are called Virtual Machines (VMs). Each VM can be configured in an independent way, having its own operating system, applications, and network parameters. Commonly, such VMs are hosted on a physical server running a hypervisor, the software that effectively virtualizes the server and manages the VMs [54].

There are several hypervisor options that can be used for server virtualization. From the open-source community, one can cite Citrix's Xen¹ and the Kernel-based Virtual Machine (KVM)². From the realm of proprietary solutions, some examples are VMware ESX³ and Microsoft's Hyper-V⁴.

The main factor that boosted the adoption of server virtualization within Cloud Computing is that such technology offers good flexibility regarding the dynamic reallocation of workloads across servers. Such flexibility allows, for example, providers to execute maintenance on servers without stopping developers' applications (which are running on VMs), or to implement strategies for better resource usage through the migration of VMs. Furthermore, server virtualization is suited to the fast provisioning of new VMs through the use of templates, which enables providers to offer elastic services to application developers [43].

2.5.3 MapReduce Framework

MapReduce [15] is a programming framework developed by Google for distributed processing of large data sets across computing infrastructures. Inspired by the map and reduce primitives present in functional languages, its authors developed an entire framework for the automatic distribution of computations. In this framework, developers are responsible for writing map and reduce operations and for using them according to their needs, similarly to the functional paradigm. These map and reduce operations are executed by the MapReduce system, which transparently distributes computations across the computing infrastructure and treats all issues related to node communication, load balancing, and fault tolerance.
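The split into user-supplied map and reduce functions can be mimicked in a few lines of plain Python; the word-count sketch below follows the canonical MapReduce example. The real framework runs the same two phases, but partitioned across many machines, with shuffling, load balancing, and fault tolerance handled by the runtime.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

counts = reduce_phase(map_phase(["cloud cloud resource", "resource model"]))
# counts == {"cloud": 2, "resource": 2, "model": 1}
```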
For the distribution and synchronization of the data required by the application, the MapReduce system also requires the use of a specially tailored distributed file system called the Google File System (GFS) [23].

Despite being introduced by Google, there are some open-source implementations of the MapReduce system, like Hadoop [6] and TPlatform [55]. The former is popular open-source software used for running applications on large clusters built of commodity hardware. This software is used by large companies like Amazon, AOL, and IBM, as well as in different Web applications such as Facebook, Twitter, and Last.fm, among others. Basically, Hadoop is composed of two modules: a MapReduce environment for distributed computing, and a distributed file system called the Hadoop Distributed File System (HDFS). The latter is an academic initiative that provides a development platform for Web mining applications. Similarly to Hadoop and Google's MapReduce, the TPlatform has a MapReduce module and a distributed file system known as the Tianwang File System (TFS) [55].

The use of MapReduce solutions is common groundwork technology in PaaS Clouds because it offers a versatile sandbox for developers. Differently from IaaS Clouds, PaaS developers using a general-purpose language with MapReduce support do not need to be concerned with software configuration, software updating, and network configurations. All these tasks are the responsibility of the Cloud provider, which, in turn, benefits from the fact that such configurations will be standardized across the overall infrastructure.

2.5.4 Datacenters

Developers who host their applications on a Cloud wish to scale their leased resources, effectively increasing and decreasing their virtual infrastructure according to the demand of their clients. This is also the case for developers making use of their own private Clouds. Thus, independently of the class of Cloud under consideration, a robust and safe infrastructure is needed.

Whereas virtualization and MapReduce account for the software solution required to attend to this demand, the physical infrastructure of Clouds is based on datacenters, which are infrastructures composed of IT components providing processing capacity, storage, and network services for one or more organizations [66]. Currently, the size of a datacenter (in number of components) can vary from tens to tens of thousands of components, depending on the datacenter's mission. In addition, there are several different IT components in datacenters, including switches and routers, load balancers, storage devices, dedicated storage networks, and the main component of any datacenter: servers [27].

Cloud Computing datacenters provide the required power to attend to developers' demands in terms of processing, storage, and networking capacities.

¹ http://www.xen.org/products/cloudxen.html
² http://www.linux-kvm.org/page/Main_Page
³ http://www.vmware.com/
⁴ http://www.microsoft.com/hyper-v-server/en/us/default.aspx
A large datacenter running a virtualization solution allows for a finer-grained division of the hardware's power through the statistical multiplexing of developers' applications.
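The gain from statistical multiplexing can be seen with a toy workload: when applications peak at different times, the capacity needed to cover the aggregate demand is smaller than the sum of the individual peaks. The demand figures below are invented purely for illustration.

```python
# Hourly demand (in VMs) of three hypothetical applications
# whose peaks do not coincide in time.
app_a = [9, 2, 1, 2]
app_b = [1, 8, 2, 1]
app_c = [2, 1, 9, 2]

sum_of_peaks = max(app_a) + max(app_b) + max(app_c)            # 26 VMs
aggregate = [a + b + c for a, b, c in zip(app_a, app_b, app_c)]
peak_of_sum = max(aggregate)                                   # 12 VMs

# A shared datacenter sized for the aggregate peak needs far fewer
# servers than three silos each sized for its own application's peak.
assert peak_of_sum <= sum_of_peaks
```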
3 Distributed Cloud Computing

“Quae non prosunt singula, multa iuvant.” (“What is of no avail singly helps when taken together.”)
Ovid

This chapter discusses the main concepts of Distributed Cloud (D-Cloud) Computing. It begins with a discussion of their definition (Section 3.1), in an attempt to distinguish D-Clouds from current Clouds and highlight their main characteristics. Next, the main research challenges regarding resource management on D-Clouds are described in Section 3.2.

3.1 Definitions

Current Cloud Computing setups involve a huge amount of investment in the datacenter, which is the common underlying infrastructure of Clouds, as previously detailed in Section 2.5.4. This centralized infrastructure brings many well-known challenges, such as the need for resource over-provisioning and the high cost of heat dissipation and temperature control. In addition to concerns with infrastructure costs, one must observe that those datacenters are not necessarily close to their clients, i.e., the network between end-users and the Cloud is often a long best-effort IP connection, which means longer round-trip delays.

Considering such limitations, industry and academic researchers have presented indications that small datacenters can sometimes be more attractive, since they offer a cheaper and lower-power-consumption alternative while also reducing the infrastructure costs of centralized Clouds [12]. Moreover, Distributed Clouds, or just D-Clouds, as pointed out by Endo et al. in [20], can exploit the possibility of link creation and the potential of sharing resources across geographic boundaries to provide latency-based allocation of resources and ultimately fully utilize this distributed computing power. Thus, D-Clouds can reduce communication costs by simply provisioning data, servers, and links close to end-users.

Figure 4 illustrates how D-Clouds can reduce the cost of communication through the spread of computational power and the usage of latency-based allocation of applications.
In Figure 4(a) the client uses an application (App) running on the Cloud through the Internet, and is thus subject to the latency imposed by the best-effort network. In Figure 4(b), the client accesses the same App, but in this case the latency imposed by the network is reduced, due to the allocation of the App on a server in a small datacenter closer to the client than in the previous scenario.

Figure 4 Comparison between (a) a current Cloud and (b) a D-Cloud

Please note that Figure 4(b) intentionally does not specify the network connecting the infrastructure of the D-Cloud Provider. This network can be rented from different local ISPs (using the Internet for interconnection) or from an ISP with wide-area coverage. In addition, such an ISP could be the D-Cloud Provider itself. This may be the case as the D-Cloud paradigm introduces an organic change in the current Internet, where ISPs can start to play the role of D-Cloud providers. Thus, ISPs could offer their communication and computational resources to developers interested in deploying their applications in the specific markets covered by those ISPs.

This idea is illustrated by Figure 5, which shows a D-Cloud offered by a hypothetical Brazilian ISP. In this example, a developer deployed its application (App) on two servers in order to attend to requests from northern and southern clients. If the number of northeastern clients increases, the developer can deploy its App (represented by the dotted box) on one server close to the northeast region in order to improve its service quality. It is important to pay attention to the fact that the contribution of this Thesis falls within this last scenario, i.e., a scenario where the network and computational resources are all controlled by the same provider.
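The latency-based placement just described can be reduced to a simple rule: among the sites that still have capacity, choose the one with the smallest measured latency to the client. The sketch below encodes that rule; the site names and latency figures are illustrative, not measurements.

```python
def place_app(client_latency_ms, free_slots):
    """Pick the site with the lowest latency to the client among
    the sites that still have free capacity for the application."""
    candidates = [s for s in client_latency_ms if free_slots.get(s, 0) > 0]
    if not candidates:
        raise RuntimeError("no site has free capacity")
    return min(candidates, key=client_latency_ms.get)

# Measured latencies from one client to three hypothetical sites.
latency = {"north": 35, "southeast": 80, "south": 120}
slots = {"north": 0, "southeast": 4, "south": 2}  # "north" is full
site = place_app(latency, slots)
# -> "southeast": the lowest-latency site that still has room
```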
Figure 5 ISP-based D-Cloud example

D-Clouds share similar characteristics with current Cloud Computing, including essential offerings such as scalability, on-demand usage, and pay-as-you-go business plans. Furthermore, the agents already stated for current Clouds (please see Figure 1) are exactly the same in the context of D-Clouds. Finally, the many different classifications discussed in Section 2.3 can also be applied. Despite the similarity, one may highlight two peculiarities of D-Clouds: support for geo-locality and Network-as-a-Service (NaaS) provisioning ([2], [63], [17]).

The geographical diversity of resources potentially improves cost and performance and gives an advantage to several different applications, particularly those that do not require massive internal communication among large server pools. In this category, as pointed out by [12], one can emphasize, firstly, applications currently deployed in a distributed manner, like VoIP (Voice over IP) and online games; secondly, one can indicate applications that are good candidates for distributed implementation, like traffic filtering and e-mail distribution. In addition, there are other types of applications that use software or data with specific legal restrictions on jurisdiction, and specific applications whose public is restricted to one or more geographical areas, like the tracking of bus or subway routes, information about entertainment events, local news, etc. Support for geo-locality can be considered a step further in the deployment of Cloud Computing that leverages new opportunities for service providers. Thus, they will be free to choose where to allocate their resources in order to attend to specific niches, constraints on the jurisdiction of software and data, or quality-of-service aspects of end-users.

NaaS (or Communication-as-a-Service, CaaS, as cited in Section 2.3.2) allows service providers to manage network resources, instead of just computational ones.
The authors in [2] describe NaaS as a service offering transport network connectivity with a level of virtualization suitable to be invoked by service providers. In this way, D-Clouds are able to manage their network resources according to their convenience, offering better response times for hosted applications. NaaS is close to the Network Virtualization (NV) research area [31], where the main problem consists in choosing how to allocate a virtual network over a physical one, meeting requirements while minimizing the usage of physical resources. Although NV and D-Clouds are subject to similar problems and scenarios, there is an essential difference between the two. While NV commonly models its resources at the infrastructure level (requests are always virtual networks mapped onto graphs), a D-Cloud can be engineered to work with applications at a different abstraction level, exactly as occurs with the actual Cloud service types described in Section 2.3.2. This way, one may see Network Virtualization simply as a particular instance of the D-Cloud. Other insights about NV are given in Section 3.3.2.

Finally, it must be highlighted that the D-Cloud does not compete with the current Cloud Computing paradigm, since the D-Cloud merely fits a certain type of applications that have hard restrictions on geographical location, while the existing Clouds continue to be attractive for applications demanding massive computational resources, or simple applications with minor or no restrictions on geographical location. Thus, the current Cloud Computing providers are the first potential candidates to take advantage of the D-Cloud paradigm, since the current Clouds could hire D-Cloud resources on demand and move applications to certain geographical locations in order to meet specific developers' requirements.
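The core NV problem mentioned above, mapping a virtual network onto a physical substrate, can be sketched with a greedy heuristic: assign each virtual node to the physical node with the most spare CPU that can host it. This ignores link mapping entirely and is far simpler than the algorithms in the NV literature; it only illustrates the shape of the problem, with invented node names and capacities.

```python
def greedy_embed(virtual_cpu, physical_cpu):
    """Map each virtual node to the physical node with the most
    remaining CPU; return None if some node cannot be placed."""
    remaining = dict(physical_cpu)
    mapping = {}
    # Place the most demanding virtual nodes first.
    for vnode, need in sorted(virtual_cpu.items(), key=lambda kv: -kv[1]):
        host = max(remaining, key=remaining.get)
        if remaining[host] < need:
            return None  # request rejected: not enough capacity
        mapping[vnode] = host
        remaining[host] -= need
    return mapping

request = {"v1": 4, "v2": 2}          # virtual nodes and CPU demands
substrate = {"p1": 5, "p2": 3}        # physical nodes and free CPU
mapping = greedy_embed(request, substrate)   # v1 -> p1, v2 -> p2
```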
In addition to the current Clouds, the D-Clouds can also serve developers directly.

3.2 Research Challenges Inherent to Resource Management

D-Clouds face challenges similar to the ones presented in the context of current Cloud Computing. However, as stated in Chapter 1, the object of the present study is resource management in D-Clouds. Thus, this section gives special emphasis to the challenges for resource management in D-Clouds, focusing on four categories as presented in [20]: a) resource modeling; b) resource offering and treatment; c) resource discovery and monitoring; and d) resource selection.

3.2.1 Resource Modeling

The first challenge is the development of a suitable resource model, which is essential to all operations in the D-Cloud, including management and control. Optimization algorithms are also strongly dependent on the resource modeling scheme used.

In a D-Cloud environment, it is very important that resource modeling takes into account physical resources as well as virtual ones. On the one hand, the amount of detail for each resource should be treated carefully, since if resources are described in great detail there is a risk that the
resource optimization becomes hard and complex, since a computational optimization problem considering the several modeled aspects can become NP-hard. On the other hand, more detail gives more flexibility and leverages the usage of resources.

There are some alternatives for resource modeling in Clouds that could be applied to D-Clouds. One can cite, for example, the OpenStack software project [53], which is focused on producing an open-standard Cloud operating system. It defines a RESTful HTTP service that supports JSON and XML data formats and is used to request or exchange information about Cloud resources and action commands. OpenStack also offers ways to describe how to scale servers down or up (using pre-configured thresholds); it is extensible, allowing the seamless addition of new features; and it returns additional error messages in case of faults.

Another resource modeling alternative is the Virtual Resources and Interconnection Networks Description Language (VXDL) [39], whose main goal is to describe the resources that compose a virtual infrastructure, focusing on virtual grid applications. VXDL is able to describe the components of an infrastructure, their topology, and an execution chronogram. These three aspects compose the main parts of a VXDL document. The computational resource specification part describes resource parameters. Furthermore, some peculiarities of virtual Grids are also present, such as the allocation of virtual machines in the same hardware and location dependence. The specification of the virtual infrastructure can consider specific developers' requirements such as network topology and delay, bandwidth, and the direction of links. The execution chronogram specifies the period of resource utilization, allowing efficient scheduling, which is a clear concern for Grids rather than Cloud Computing. Another interesting point of VXDL is the possibility of describing resources individually or in groups, according to application needs.
VXDL lacks support for distinct service descriptions, since it is focused on grid applications only.

The proposal presented in [32], called VRD hereafter, describes resources in a network virtualization scenario where infrastructure providers describe their virtual resources and services prior to offering them. It takes into consideration the integration between the properties of virtual resources and their relationships. An interesting point in the proposal is its use of functional and non-functional attributes. Functional attributes are related to characteristics, properties, and functions of components. Non-functional attributes specify criteria and constraints, such as performance, capacity, and QoS. Among the functional properties that must be highlighted is the set of component types: PhysicalNode, VirtualNode, Link, and Interface. Such properties suggest a flexibility that can be used to represent routers or servers, in the case of nodes, and wired or wireless links, in the case of communication links and interfaces.
Another proposal, known as the Manifest language, was developed by Chapman et al. [9]. They proposed new meta-models to represent service requirements, constraints, and elasticity rules for software deployment in a Cloud. The building block of the framework is the OVF (Open Virtualization Format) standard, which was extended by Chapman et al. to realize the vision of D-Clouds by considering locality constraints. These two points are very interesting to our scenario. With regard to elasticity, it assumes a rule-based specification formed by three fields: a monitored condition related to the state of the service (such as workload), an operator (relational and logical ones are accepted), and an associated action to follow when the condition is met. The location constraints identify sites that should be favored or avoided when selecting a location for a service. Nevertheless, the Manifest language is focused on the software architecture. Hence, the language is not concerned with other aspects such as resources' status or network resources.

Cloud# is a language for modeling Clouds proposed by [16] to be used as a basis for Cloud providers and clients to establish trust. The model is used by developers to understand the behavior of Cloud services. The main goal of Cloud# is to describe how services are delivered, taking into consideration the interaction among physical and virtual resources. The main syntactic construct within Cloud# is the computation unit CUnit, which can model Cloud systems, virtual machines, or operating systems. A CUnit is represented as a tuple of six components modeling characteristics and behaviors.
This language gives developers a better understanding of the Cloud organization and of how their applications are dealt with.

3.2.2 Resource Offering and Treatment

Once the D-Cloud resources are modeled, the next challenge is to describe how resources are offered to developers, which is important since the requirements supported by the provider are defined in this step. This challenge will also define the interfaces of the D-Cloud. It differs from resource modeling since the modeling is independent of the way resources are offered to developers. For example, the provider could model each resource individually, like independent items on a fine-grained scale such as GHz of CPU or GB of memory, but could offer them as a coupled collection of those items or a bundle, such as the VM templates cited in Section 2.5.2.

Recall that, in addition to computational requirements (CPU and memory) and traditional network requirements, such as bandwidth and delay, new requirements are present in D-Cloud scenarios. The topology of the nodes is a first interesting requirement to be described. Developers should be able to set inter-node relationships and communication restrictions (e.g., downlink and uplink rates). This is illustrated in the scenario where servers – configured and managed by
developers – are distributed at different geographical localities while it is necessary for them to communicate with each other in a specific way.

Jurisdiction is related to where (geographically) applications and their data must be stored and handled. Due to restrictions such as copyright laws, D-Cloud users may want to limit the locations where their information will be stored (such as countries or continents). Another geographical constraint can be imposed as a maximum (or minimum) physical distance (or delay value) between nodes. Here, though developers do not know the actual topology of the nodes, they may merely establish some delay threshold value, for example.

Developers should also be able to describe scalability rules, which specify how and when the application will grow and consume more resources from the D-Cloud. Authors in [21] and [9] define a way of doing this, allowing the Cloud user to specify actions that should be taken, like deploying new VMs, based on thresholds of metrics monitored by the D-Cloud itself.

Additionally, resource offering is associated with interoperability. Current Cloud providers offer proprietary interfaces to access their services, which can lock users within their infrastructure, as the migration of applications cannot be easily made between providers [8]. It is hoped that Cloud providers recognize this problem and work together to offer a standardized API.

According to [61], Cloud interoperability faces two types of heterogeneity: vertical heterogeneity and horizontal heterogeneity. The first type is concerned with interoperability within a single Cloud and may be addressed by a common middleware throughout the entire infrastructure. The second, horizontal heterogeneity, is related to Clouds from different providers. Therefore, the key challenge is dealing with these differences.
In this case, a high level of granularity in the modeling may help to address the problem.

An important effort in the search for horizontal standardization comes from the Open Cloud Manifesto5, an initiative supported by hundreds of companies that aims to discuss a way to produce open standards for Cloud Computing. Its major doctrines are collaboration and coordination of efforts on standardization, adoption of open standards wherever appropriate, and the development of standards based on customer requirements. Participants of the Open Cloud Manifesto, through the Cloud Computing Use Case group, produced an interesting white paper [51] highlighting the requirements that need to be standardized in a Cloud environment to ensure interoperability in the most typical scenarios of interaction in Cloud Computing.

5 http://www.opencloudmanifesto.org/
Another group involved with Cloud standards is the Open Grid Forum6, which develops the specification of the Open Cloud Computing Interface (OCCI)7. The goal of OCCI is to provide an easily extendable RESTful interface for Cloud management. Originally, OCCI was designed for IaaS setups, but its current specification [46] was extended to offer a generic scheme for the management of different Cloud services.

3.2.3 Resource Discovery and Monitoring

When requests reach a D-Cloud, the system should be aware of the current status of its resources in order to determine whether there are available resources in the D-Cloud that could satisfy the requests. Accordingly, the right mechanisms for resource discovery and monitoring should be designed, allowing the system to be aware of the updated status of all its resources. Then, based on the current status and the requests' requirements, the system may select and allocate resources to serve these new requests.

Resource monitoring should be continuous and help in taking allocation and reallocation decisions as part of the overall resource usage optimization. A careful analysis should be done to find a good and acceptable trade-off between the amount of control overhead and the frequency of resource information updates.

Monitoring may be passive or active. It is considered passive when one or more entities collect information; the entity may continuously send polling messages to nodes asking for information or may do so on demand when necessary. On the other hand, monitoring is active when nodes are autonomous and may decide when to send state information asynchronously to some central entity. Naturally, D-Clouds may use both alternatives simultaneously to improve the monitoring solution.
In this case, it is necessary to synchronize updates in repositories to maintain the consistency and validity of state information.

Discovery and monitoring in a D-Cloud can be accompanied by the development of specific communication protocols. Such protocols act as a standard control plane in the Cloud, allowing interoperability between devices. It is expected that such protocols can control the different elements present in the D-Cloud, including servers, switches, routers, load balancers, and storage components. One possible method of coping with this challenge is to use smart communication nodes with an open programming interface for creating new services within the node. One example of this type of open node can be seen in the emerging OpenFlow-enabled switches [44].

6 http://www.gridforum.org/
7 http://occi-wg.org/about/specification/
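The passive/active distinction can be illustrated with a minimal Python sketch. The names, classes, and thresholds below are hypothetical, invented for illustration only: a central repository polls nodes (passive), while nodes push state asynchronously once a monitored metric crosses a threshold (active):

```python
# Hypothetical sketch of passive (polling) vs. active (push) monitoring;
# names and threshold values are illustrative, not from any system in the text.

class Node:
    def __init__(self, name, cpu_load=0.0, push_threshold=0.8):
        self.name = name
        self.cpu_load = cpu_load
        self.push_threshold = push_threshold

    def report(self):
        """Answer a poll (passive monitoring)."""
        return {"node": self.name, "cpu": self.cpu_load}

    def maybe_push(self, repository):
        """Actively push state only when a threshold is crossed,
        trading update freshness against control overhead."""
        if self.cpu_load >= self.push_threshold:
            repository.update(self.report())

class Repository:
    def __init__(self):
        self.state = {}

    def update(self, record):
        self.state[record["node"]] = record["cpu"]

    def poll_all(self, nodes):
        """Passive monitoring: the central entity asks every node."""
        for node in nodes:
            self.update(node.report())

repo = Repository()
nodes = [Node("n1", cpu_load=0.3), Node("n2", cpu_load=0.9)]

repo.poll_all(nodes)        # passive sweep: both nodes answer
nodes[0].cpu_load = 0.95
nodes[0].maybe_push(repo)   # active push: only n1 crossed its threshold
print(repo.state)
```

Combining both styles, as the text suggests, keeps the repository fresh between polling sweeps while bounding the control overhead, at the cost of having to keep the repository consistent across update sources.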
3.2.4 Resource Selection and Optimization

With information regarding Cloud resource availability at hand, a set of appropriate candidates may then be highlighted. Next, the resource selection process finds the configuration that fulfills all requirements and optimizes the usage of the infrastructure. Selecting solutions from a set of available ones is not a trivial task, due to the dynamicity, high algorithmic complexity, and all the different requirements that must be contemplated by the provider.

The problem of resource allocation is recurrent in computer science, and several computing areas have faced this type of problem since early operating systems. Particularly in the Cloud Computing field, due to the heterogeneous and time-variant environment in Clouds, resource allocation becomes a complex task, forcing the mediation system to respond with minimal turnaround time in order to maintain the developer's quality requirements. Also, balancing the resources' load and designing energy-efficient Clouds are major challenges in Cloud Computing. This last aspect is especially relevant as a result of the high demand for electricity to power and cool the servers hosted in datacenters [7].

In a Cloud, energy savings may be achieved through many different strategies. Server consolidation, for example, is a useful strategy for minimizing energy consumption while maintaining high usage of the servers' resources. This strategy saves energy by migrating VMs onto fewer servers and putting idle servers into a standby state. Developing automated solutions for server consolidation can be a very complex task, since these solutions can be mapped to bin-packing problems known to be NP-hard [72].

VM migration and cloning provide the technology to balance load over servers within a Cloud, provide fault tolerance against unpredictable errors, or reallocate applications before a programmed service interruption.
However, although this technology is present in major industry hypervisors (like VMware or Xen), there remain some open problems to be investigated. These include cloning a VM into multiple replicas on different hosts [40] and developing VM migration across wide-area networks [14]. Also, VM migration introduces a network problem since, after migration, VMs require adaptation of the link-layer forwarding. Some of the strategies for new datacenter architectures explained in [67] offer solutions to this problem.

The remodeling of datacenter architectures is another research field that tries to overcome limitations on scalability, stiffness of address spaces, and node congestion in Clouds. Authors in [67] surveyed this theme, highlighted the problems in the network topologies of state-of-the-art datacenters, and discussed literature solutions for these problems. One of these solutions is the D-Cloud, as
also pointed out by [72], which offers an energy-efficient alternative for constructing a Cloud and a solution adapted to time-critical services and interactive applications.

Considering specifically the challenges of resource allocation in D-Clouds, one can highlight correlated studies based on Replica Placement and Network Virtualization. The former is applied to Content Distribution Networks (CDNs) and tries to decide where and when content servers should be positioned in order to improve the system's performance. Such a problem is associated with the placement of applications in D-Clouds. The latter research field can be applied to D-Clouds considering that a virtual network is an application composed of servers, databases, and the network between them. Both research fields are described in the following sections.

Replica Placement

Replica Placement (RP) consists of a very broad class of problems. The main objective of this type of problem is to decide where, when, and by whom servers or their content should be positioned in order to improve CDN performance. The corresponding existing solutions to these problems are generally known as Replica Placement Algorithms (RPAs) [35].

The general RP problem is modeled as a physical topology (represented by a graph), a set of clients requesting services, and some servers to place on the graph (costs per server can be considered instead). Generally, there is a pre-established cost function to be optimized that reflects service-related aspects, such as the load of user requests, the distance from the server, etc. As pointed out by [35], an RPA groups these aspects into two different components: the problem definition, which consists of a cost function to be minimized under some constraints, and a heuristic, which is used to search for near-optimal solutions in a feasible time frame, since the defined problems are usually NP-complete.

Several different variants of this general problem have already been studied.
According to [57], however, they fall into two classes: facility location and minimum K-median. In the facility location problem, the main goal is to minimize the total cost of the graph through the placement of a number of servers, which have an associated cost. The minimum K-median problem, in turn, is similar but assumes the existence of a pre-defined number K of servers. More details on the modeling and comparison between different variants of the RP problem are provided by [35].

Different versions of this problem can be mapped onto resource allocation problems in D-Clouds. A very simple mapping can be defined considering an IaaS service where virtual machines can be allocated in a geo-distributed infrastructure. In such a mapping, the topology corresponds to the physical infrastructure elements of the D-Cloud, the VMs requested by developers can be treated as servers, and the number of clients accessing each server would be their load.
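To make the K-median mapping concrete, the sketch below places K VMs greedily on a toy topology: at each step it opens the candidate site whose addition most reduces the total load-weighted client-to-server distance. This is only an illustrative instance of the greedy heuristic family studied in [57]; the distance matrix and client loads are invented data:

```python
# Hypothetical greedy K-median placement sketch; the distance matrix and
# client loads are toy data, not measurements from the text.

def total_cost(dist, loads, open_sites):
    """Sum over clients of load * distance to the nearest open server."""
    return sum(load * min(dist[c][s] for s in open_sites)
               for c, load in enumerate(loads))

def greedy_k_median(dist, loads, k):
    """Open k sites one at a time, each time picking the site whose
    addition yields the lowest total cost (greedy heuristic)."""
    open_sites = []
    candidates = set(range(len(dist[0])))
    for _ in range(k):
        best = min(candidates,
                   key=lambda s: total_cost(dist, loads, open_sites + [s]))
        open_sites.append(best)
        candidates.remove(best)
    return open_sites

# dist[c][s]: distance (e.g., delay) from client region c to candidate site s
dist = [[1, 4, 7],
        [5, 2, 6],
        [8, 3, 1]]
loads = [10, 20, 5]   # request load per client region
placement = greedy_k_median(dist, loads, k=2)
print(placement, total_cost(dist, loads, placement))
```

Each greedy step costs O(|candidates| * clients), which is what makes such heuristics attractive when the exact K-median problem is intractable.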
Qiu et al. [57] proposed three different algorithms to solve the K-median problem in a CDN scenario: a tree-based algorithm, a greedy algorithm, and a hot-spot algorithm. The tree-based solution assumes that the underlying graph is a tree, which is divided into several small trees, placing one server in each small tree. The greedy algorithm places servers one at a time so as to obtain a better solution at each step until all servers are allocated. Finally, the hot-spot solution attempts to place servers in the vicinity of the clients with the greatest demand. The results showed that the greedy algorithm for replica placement could provide CDNs with performance that is close to optimal.

These solutions can be mapped onto D-Clouds by considering the simple scenario of VM allocation on a geo-distributed infrastructure with the restriction that each developer has a fixed number of servers to attend to their clients. In this case, the problem can be straightforwardly reduced to the K-median problem and the three proposed solutions can be applied. Basically, one could treat each developer as a different CDN and optimize each one independently, still considering the limited capacity of the physical resources caused by the allocations of other developers.

Presti et al. [56] treat an RP variant considering a trade-off between the load of requests per content and the number of replica additions and removals. Their solution considers that each server in the physical topology decides autonomously, based on thresholds, when to clone overloaded contents or to remove underutilized ones. Such decisions also encompass the minimization of the distance between clients and the respective accessed replica. A similar problem is investigated in [50], but considering constraints on the QoS perceived by the client. The authors propose an offline mathematical formulation and an online version that uses a greedy heuristic.
The results show that the heuristic achieves good results with little computational time.

The main focus of these solutions is to provide scalability to the CDN according to the load caused by client requests. Thus, despite working only with the placement of content replicas, such solutions can also be applied to D-Clouds with some simple modifications. Considering replicas as allocated VMs, one can apply the threshold-based solution proposed in [56] to the simple scenario of VM scalability on a geo-distributed infrastructure.

Network Virtualization

The main problem of NV is the allocation of virtual networks over a physical network [10], [3]. Analogously, the D-Cloud's main goal is to allocate application requests on physical resources according to some constraints while attempting to obtain a clever mapping between the virtual and physical resources. Therefore, problems in D-Clouds can be formulated as NV problems, especially in scenarios considering IaaS-level services.
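Before surveying specific proposals, a minimal sketch may clarify what a virtual network embedding looks like in the two-stage style common in this literature: first map virtual nodes onto physical nodes with enough residual CPU, then map virtual links onto physical paths with enough residual bandwidth. The topology, capacities, and demands below are invented for illustration, and the sketch assumes there are at least as many physical nodes as virtual ones:

```python
# Hypothetical two-stage virtual network embedding sketch:
#  1) map each virtual node to the unused physical node with most residual CPU;
#  2) map each virtual link to a shortest physical path with enough bandwidth.
# All topology data below is invented for illustration.
from heapq import heappush, heappop

def shortest_path(links, bw, src, dst, demand):
    """Dijkstra (hop count) over links whose residual bandwidth >= demand."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heappop(heap)
        if u == dst:
            path, node = [], dst
            while node != src:
                path.append((prev[node], node))
                node = prev[node]
            return list(reversed(path))
        for v in links.get(u, []):
            edge = tuple(sorted((u, v)))
            if bw[edge] >= demand and d + 1 < dist.get(v, float("inf")):
                dist[v], prev[v] = d + 1, u
                heappush(heap, (d + 1, v))
    return None  # no feasible path

def embed(cpu, links, bw, vnodes, vlinks):
    mapping = {}
    for vnode, demand in vnodes.items():       # stage 1: node mapping
        # pick the not-yet-used physical node with most residual CPU
        host = max((n for n in cpu if n not in mapping.values()), key=cpu.get)
        if cpu[host] < demand:
            return None                        # reject the request
        cpu[host] -= demand
        mapping[vnode] = host
    for (a, b), demand in vlinks.items():      # stage 2: link mapping
        path = shortest_path(links, bw, mapping[a], mapping[b], demand)
        if path is None:
            return None
        for u, v in path:
            bw[tuple(sorted((u, v)))] -= demand
    return mapping

cpu = {"A": 8, "B": 4, "C": 6}                 # residual CPU per physical node
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
bw = {("A", "B"): 10, ("B", "C"): 10}          # residual bandwidth per link
mapping = embed(cpu, links, bw,
                vnodes={"v1": 2, "v2": 3},
                vlinks={("v1", "v2"): 5})
print(mapping)
```

Real proposals differ mainly in how each stage is ranked and coordinated (revenue, stress, residual capacity), but they share this structure of constrained mapping over residual resources.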
Several instances of the NV-based resource allocation problem can be reduced to NP-hard problems [48]. Even the version where one knows beforehand all the virtual network requests that will arrive in the system is NP-hard. The basic solution strategy is thus to restrict the problem space, making it easier to deal with, and also to consider the use of simple heuristic-based algorithms to achieve fast results.

Given a model based on graphs to represent both physical and virtual servers, switches, and links [10], an algorithm that allocates virtual networks should consider the constraints of the problem (CPU, memory, location, or bandwidth limits) and an objective function based on the algorithm's goals. In [31], the authors describe some possible objective functions to be optimized, like the ones related to maximizing the revenue of the service provider, minimizing link and node stress, etc. They also survey heuristic techniques used when allocating virtual networks, dividing them into two types: static and dynamic. The dynamic type permits reallocating over time by adding more resources to already allocated virtual networks in order to obtain better performance. The static type means that once a virtual network is allocated it will hardly ever change its setup.

To exemplify the type of problem studied in NV, one can discuss the one studied by Chowdhury et al. [10]. Its authors propose an objective function related to the cost and revenue of the provider, constrained by capacity and geo-location restrictions. They reduce the problem to a mixed integer program and then relax the integer constraints, deriving two different algorithms for the solution's approximation. Furthermore, the paper also describes a load balancing algorithm, in which the original objective function is customized in order to avoid using nodes and links with low residual capacity.
This approach results in allocation on less loaded components and an increase in the revenue and acceptance ratio of the substrate network.

Such types of problems and solutions can be applied to D-Clouds. One example could be the allocation of interactive servers with jurisdiction restrictions. In this scenario, the provider must allocate applications (which can be mapped onto virtual networks) whose nodes are linked and that must be close to a certain geographical place according to a maximum tolerated delay. Thus, a provider could apply the proposed algorithms with minor adjustments.

In the paper of Razzaq and Rathore [58], the virtual network embedding algorithm is divided into two steps: node mapping and link mapping. In the node mapping step, nodes with the highest resource demand are allocated first. The link mapping step is based on an edge-disjoint k-shortest-path algorithm, selecting the shortest path that can fulfill the virtual link bandwidth
requirement. In [42], a backtracking algorithm for the allocation of virtual networks onto substrate networks, based on the graph isomorphism problem, is proposed. The modeling considers multiple capacity constraints.

Zhu and Ammar [74] proposed a set of four algorithms with the goal of balancing the load on the physical links and nodes, but their algorithms do not consider capacity aspects. Their algorithms perform the initial allocation and make adaptive optimizations to obtain better allocations. The key idea of the algorithms is to allocate virtual nodes considering the load of the node and the load of the neighboring links of that node; thus one can say that they perform the allocation in a coordinated way. For virtual link allocation, the algorithm tries to select paths with few stressed links in the network. For more details about the algorithms, see [74].

Considering the objectives of NV and RP problems, one may note that NV problems are a general form of the RP problem: RP problems try to allocate virtual servers, whereas NV considers the allocation of virtual servers and virtual links. Both categories of problems can be applied to D-Clouds. Particularly, RP and NV problems may be mapped onto two different classes of D-Clouds: less controllable D-Clouds and more controllable ones, respectively. The RP problems are suitable for scenarios where the allocation of servers is more critical than that of links. In turn, the NV problems are especially adapted to situations where the provider is an ISP that has full control over the whole infrastructure, including the communication infrastructure.

3.2.5 Summary

The D-Cloud domain brings several engineering and research challenges that were discussed in this section and whose main aspects are summarized in Table I. Such challenges are only starting to receive attention from the research community.
Particularly, the system, models, languages, and algorithms presented in the next chapters cope with some of these challenges.

Table I: Summary of the main aspects discussed

Resource Modeling:
- Heterogeneity of resources
- Physical and virtual resources must be considered
- Complexity vs. flexibility

Resource Offering and Treatment:
- Describe the resources offered to developers
- Describe the supported requirements
- New requirements: topology, jurisdiction, scalability

Resource Discovery and Monitoring:
- Monitoring must be continuous
- Control overhead vs. updated information

Resource Selection and Optimization:
- Find resources to fulfill developers' requirements
- Optimize usage of the D-Cloud infrastructure
- Complex problems solved by approximation algorithms
4 The Nubilum System

"Expulsa nube, serenus fit saepe dies."
Popular proverb

Section 2.4 introduced an archetypal Cloud mediation system focusing specifically on the resource management process, which ranges from the automatic negotiation of developers' requirements to the execution of their applications. Further, this system was divided into three layers: negotiation, resource management, and resource control. Keeping in mind this simple archetypal mediation system, this chapter presents Nubilum, a resource management system that offers a self-managed solution to the challenges resulting from the discovery, monitoring, control, and allocation of resources in D-Clouds. This system appeared previously in [25] under the name D-CRAS (Distributed Cloud Resource Allocation System).

Section 4.1 presents some decisions taken to guide the overall design and implementation of Nubilum. Section 4.2 presents a conceptual view of Nubilum's architecture, highlighting its main modules. The functional components of Nubilum are detailed in Section 4.3. Section 4.4 presents the main processes performed by Nubilum. Section 4.5 closes this chapter by summarizing the contributions of the system and comparing them with correlated resource management systems.

4.1 Design Rationale

As stated previously in Section 1.2, the objective of this Thesis is to develop a self-manageable system for resource management on D-Clouds. Before the development of the system and its corresponding architecture, some design decisions that will guide the development must be delineated and justified.

4.1.1 Programmability

The first aspect to be defined is the abstraction level at which Nubilum will act. Given that D-Cloud concerns can be mapped onto previous approaches in the Replica Placement and Network Virtualization research areas (see Section 3.2.4), a straightforward approach would be to consider a D-Cloud working at the same abstraction level.
Therefore, knowing that proposals in both areas commonly work at the IaaS level, i.e., providing virtualized infrastructures, Nubilum would naturally also operate at the IaaS level.
Nubilum offers a Network Virtualization service: applications can be treated as virtual networks and the provider's infrastructure is the physical network. In this way, the allocation problem is a virtual network assignment problem, and previous solutions from the NV area can be applied. Note that such an approach does not exclude previous Replica Placement solutions, because that area can be viewed as a particular case of Network Virtualization.

4.1.2 Self-optimization

As defined in Section 2.1, the Cloud must provide services in a timely manner, i.e., resources required by users must be configured as quickly as possible. In other words, to meet this restriction, Nubilum must operate as much as possible without human intervention, which is the very definition of self-management from Autonomic Computing [69].

Its operation involves maintenance and adjustment of the D-Cloud resources in the face of changing application demands and innocent or malicious failures. Thus, Nubilum must provide solutions to cope with the four aspects leveraged by Autonomic Computing: self-configuration, self-healing, self-optimization, and self-protection. Particularly, this Thesis focuses on investigating self-optimization – and, to some extent, self-configuration – in D-Clouds. The other two aspects are considered out of the scope of this proposal.

According to [69], self-optimization of a system involves letting its elements "continually seek ways to improve their operation, identifying and seizing opportunities to make themselves more efficient in performance or cost".
Such a definition fits the aim of Nubilum very well: it must ensure automatic monitoring and control of resources to guarantee the optimal functioning of the Cloud while meeting developers' requirements.

4.1.3 Existing Standards Adoption

The Open Cloud Manifesto, an industry initiative that aims to discuss a way to produce open standards for Cloud Computing, states that Cloud providers "must use and adopt existing standards wherever appropriate" [51]. The Manifesto argues that several efforts and investments have been made by the IT industry in standardization, so it seems more productive and economical to use such standards when appropriate. Following this same line, Nubilum adopts industry standards whenever possible. Such adoption also extends to open processes and software tools.

4.2 Nubilum's Conceptual View

As shown in Figure 6, the conceptual view of Nubilum's architecture is composed of three planes: a Decision plane, a Management plane, and an Infrastructure plane. Starting from the bottom, the lower plane nestles all modules responsible for the appropriate virtualization of each resource in the
