24 27

186
-1

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
186
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

24 27

  1. 1. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 6, August 2012 Gleaming the Cloud in the Virtualization layer Janani.V.D, Hari Baabu.V, Sajeev ram.A M.E. Assistant Professor Computer Science and Engineering Computer Science and Engineering Vels University Vels University Abstract-- Cloud Computing is a powerful and physical machine. Similarly IaaS providers such asflexible software environment, which delegates the Amazon’s Elastic Computing Cloud (EC2) [2] arematerial’s management and in which users pay as they not allowed to look inside the running VMgo. Rather than operating their own data centers, today instances [7] because of tenants’ businesscloud users run their applications on the remote cloudinfrastructures that are owned and managed by cloud confidentiality and privacy.providers. Cloud users create virtual machineinstances to run their specific application logic In a local data center, operators usuallywithout knowing the underlying physical need the context information from bothinfrastructure. On the other side, cloud provider applications and infrastructures for effectivemanage and operate their cloud infrastructures problem determination. In commercial cloudwithout knowing their customers’ applications. environments, the decoupled ownership ofDue to the disjoint ownership of applications and applications and infrastructures introduces newinfrastructures, if a problem occurs, there is no challenges in system management. When usersvisibility for either cloud users or providers tounderstand the whole context of the incident encounter problems in their VMs, they have littleand solve it quickly. To this end, we propose visibility on the infrastructure layer for root causesoftware solution Cloud Vision, to provide some determination except trial-and-error troubleshootingvisibility through the middle virtualization layer for [9]. It is not even clear who is responsible for fixingboth cloud users and providers to address their such problems. To support end users, IaaSproblems quickly. Cloud Vision automatically tracks providers often run an online support forum (e.g.,each VM instance’s configuration status and Amazon EC2 online forum [3]) where users canmaintains their life-cycle configuration record in report their problems and seek solutions. Cloudconfiguration management database(CMDB). operators often try to solve those problems byWhen the user reports a problem, our algorithmautomatically analyze CMDB to probabilistic manually investigating the system settings of users’determine the root cause and invoke a recovery VMs. Since operators do not know tenants’process by interacting with the cloud user specific applications, it is even hard for them to solve problems only based on users’ problem Index Terms— Cloud Vision; Configuration reports.management; Virtualization layer; VM;Troubleshooting; Most of VM problems [ 7 ] in a cloud infrastructure (except those problems caused by I. INTRODUCTION applications running inside VM instances) can be Cloud computing provides a correlated to their configuration events. Examplesrevolutionary new computing paradigm for of such configuration events include incorrectdeploying enterprise applications and Internet update of Access Control List (ACL) in hypervisors,services. There are many evolving cloud computing blocking of access ports inside running VMmodels, including Infrastructure-as-a-Service (IaaS), instances, connection of virtual devices toPlatform-as-a-Service(PaaS), and Software- as-a- unsupported OSes, migration of instances to a slowService(SaaS), which provide IT infrastructures, machine, incorrect allocation of memory size,platforms, solution stacks, and software connection loss with underlying block storage andapplications from the cloud respectively [1]. Cloud so on. For example, on average 28 problems wereinfrastructures often employ a virtualization layer reported daily to Amazon EC2 online forum in[8] to ensure resource isolation and abstraction. July 2010, complaining about VM unavailability,They are designed to provide restricted visibility to VM hang, Elastic Block Storage (EBS) [2] [7]both users and IaaS providers. Users of crash or loose connectivity, performance problemVirtualMachine (VM) instances [7] are blocked and so on. Problems can occur at the time whenfrom looking down into the infrastructure layer an instance is first created from its source imagesince multiple tenants might share the same or during running time if its configurations are 24 All Rights Reserved © 2012 IJARCSEE
  2. 2. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 6, August 2012changed by some events. models require cloud users to report their problems to operators, who manually investigate the We make two intuitive observations problems to provide solutions. The premiumabout cloud environments. First, multiple VM support model guarantees a bounded probleminstances created from the same image source are resolution time with service charges while thelikely to have similar characteristics in best effort model is free but might take 8 to 10configurations. Second, some configuration days to solve user problems [1]. Hence, ourattributes have higher cardinality than others and objective is to provide automated solutions forhence they are less sensitive to configuration cloud providers to address users’ problem quicklychanges. Based on those we present a solution and also reduce the cost of manual operations. Our(called CloudVision) for cloud providers to experimental results show that CloudVision canautomate reasoning and troubleshooting [9] on user effectively reduce the solution time to seconds.reported problems under the restricted visibility.CloudVision runs in the infrastructure layer and III. CLOUD VISION SYSTEMtracks each VM instance as a standard object ARCHITECTURErunning in that infrastructure. In summary, we havemade the following contributions: The system architecture of CloudVision is shown in Figure 1 and it consists of three majorRepresentation: CloudVision develops a components: Monitoring Agent, Cloud CMDB andmonitoring mechanism at the infrastructure layer to Analysis Engine.automatically capture the event history of allrunning VM instances in cloud environments [7].Such event data is further structured and stored in aConfiguration Management Database (CMDB).Problem Reasoning: When users report aproblem, our solution fetches the relatedinformation from the CMDB and furtheridentifies a list of suspect events. We then applystatistical analysis to rank these events based ontheir sensitivity to configuration changes.Interactive Troubleshooting: Following the listof ranked events, CloudVision takes predefined Figure 1: CloudVision Architectureactions to check and solve problems. For eachspecific event, operators define the predicates tocheck whether this event is the root cause as well Monitoring Agent: The Monitoring Agentas the actions to remediate the problem monitors and tracks the configuration attributes of VM instances and infrastructures continuously. II. MOTIVATION We assume that all PM clocks are loosely synchronized. The agent sits in the hypervisor layer so that it can be rolled out through the cloud In a typical cloud environment, infrastructures as a standard management component.Operators of the cloud provider monitor their Section IV-A discusses the details of datainfrastructure stacks from the hardware layer such collection.as servers and network devices and up to thehypervisor layer. While the cloud provider ensuresa highly available cloud infrastructure, it typically Cloud CMDB: CMDB Manager structuresdoes not guarantee the availability of individual VM and stores configuration data sent by theinstances since it is not even aware of what is monitoring agent into a database calledrunning inside those VM instances. Therefore, cloud Configuration Management Database (CMDB).users will occasionally encounter various problems CMDB is often used for asset and configurationwith their VM instances [7]. management in large data centers. In this paper, we use a CMDB to track the life-cycle Cloud providers offer several options to configuration information of VM instances. It isaddress cloud users’ problems including a free best a relational database and each row includes aeffort service with no Service Level Agreement configuration change along with its timestamp and(SLA) guarantee and a premium commercial grade health status (sick or healthy). Initially all rows aresupport with SLA guarantees [1]. Both support marked as healthy. If a configuration change is found to be the cause of a reported problem, the 25 All Rights Reserved © 2012 IJARCSEE
  3. 3. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 6, August 2012row containing this change is marked as sick in When a problem is reported, the analysis enginethe database during the troubleshooting step [9]. analyzes the related CMDB records of the reportedIn Section IV-B, we discuss how CMDB is VM instance [ 7 ] to determine the suspect eventsdesigned and populated in cloud infrastructures. and then tries to check the root. cause and solve the problem. The whole process is divided into three steps: Identifying Suspect Events, Event RankingAnalysis Engine: After users encounter a and finally Interactive Troubleshooting.problem with their VM instances [7], they reportthe problem to CloudVision via a web interface. 1) Identifying Suspect Events: when a problem ofThe Analysis Engine fetches the data related to a VM instance occurs, it is most likely causedthe reported VM from the CMDB and compares its by the recent events of that VM instance. As thecurrent configuration set with its historical CMDB maintains the complete event history ofrecords to determine a list of suspect events. Then VM instances, the analysis engine can determinethe engine employs statistical analysis to rank the the list of suspect events from the CMDB. Eachlist of suspect events and uses predefined predicates event is associated with a change in theand actions to check these events and resolve the configuration attribute called suspect attribute.problem. Not all problems can be automatically CloudVision identifies the changes of the reportedresolved by CloudVision since some problems are instance by comparing its current configuration tobeyond the management responsibility of cloud the latest known stable configuration. Anyproviders. However, the suspected events are configuration that persists longer than a specificreported to end users with an interactive duration of time is considered as a stabletroubleshooting process [9] and its design details configuration. Unstable configurations may arisegiven in Section IV-C. when users try to fix problems by trial and error so that every configuration change only lasts for a IV. CLOUD VISION DESIGN A N D very short period of time. IM P LEME NTATION 2) Event Ranking: After determining the set ofA. DataCollection suspect events, CloudVision ranks them based onCloud Vision runs an event-based monitoring how likely an event could be the problem rootagent at the hypervisor layer of each physical cause. As the set of suspect events can be large, wemachine. The agent collects system configuration reduce the problem resolution time by checking thedata shell scripts. The shell scripts use virsh events in their ranked order. An event is highlymanagement interface library to collect values of suspected if it is very rare to occur. To measurecon- figuration attributes of running VM instances this, we define a sensitivity metric and rank theand it also collects IP addresses of running guests set of suspect events according to their sensitivity,by looking into the system. Collected information i.e., the sensitivity of an attribute to change.are represented using an XML format. We define two types of event sensitivity: localConfiguration data is collected periodically but it is sensitivity and global sensitivity. The localnot sent to the CMDB Manager unless there is a sensitivity defines the probability of an event’schange in the configuration data compared to the occurrence using the event history of theprevious one, which indicates new events. Hence instances originated from the same image source.the data collection is event-based to reduce Meantime the global sensitivity defines thenetwork overhead. probability of an event’s occurrence using the event history of all VM instances regardless of their imageB. Populate CMDB sources.The central CMDB Manager node runs programthat collects event-based configuration data from 3) Troubleshooting: After determining andmonitoring agents. It parses collected XML ranking suspect events, Cloud Vision performsmessages and uses a relational database to structure troubleshooting actions, which include two majorand store all configuration data along with the steps: Check and Solve. For each suspect event,reported timestamp. Each row in the database the check step checks whether it is the true rootindicates at least a single change in the set of cause (or culprit event) of the problem. The solveconfiguration attributes and it is also associated step tries to solve the problem automatically ifwith a status tag that defines whether the the solution is within the managementconfiguration is sick or healthy. Initially all entries responsibility of the cloud provider, otherwiseare set to be healthy and a configuration is changed Cloud Vision recommends the solutions to theto be sick only if it is proved to be faulty during the cloud users. To identify the root cause, Cloudtroubleshooting process. Vision uses a series of predefined predicates to check each suspect event, starting from the top ofC. CMDB Analysis the ranked suspect events. The suspect events are 26 All Rights Reserved © 2012 IJARCSEE
  4. 4. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 6, August 2012first classified into one or more problem classes. fingerprinting. There also exist some literature onActions are defined for each individual problem configuration man- agement. Whitaker et al. [6]class and also include a series of predicates to developed a tool named Chrouns to automaticallycheck the existence of the corresponding problem. search for a configuration state change indicating a failure. It stores local system config- urations at each checkpoint along time. When a problem occurs, Chronus takes user-provided probes to identify the working and non-working configuration state. This approach is quite similar to our check and solve step. VI. CONCLUSION In this paper, we present Cloud Vision, a novel solution for automated problem troubleshooting in cloud environments. Cloud Vision monitors and tracks the configuration attributes of VM instances [ 7 ] and infrastructure, and uses the historical event records to determine the root cause of problematic VM instances, and further provides a check-and-solve troubleshooting process to resolve user reported problems automatically. In our future work, we plan to Figure 2: Check-and-solve flow chart monitor more configuration attributes from cloud infrastructure and design new predicates and actionsThe check-and-solve troubleshooting process is to cover more problem classes.done interactively with users. After each solvestep, users are asked to provide feedbacks. If the REFERENCESproblem is solved, the troubleshooting processends; otherwise the check-and-solve steps [1] T. Benson, S. Sahu, A. Akella, and A.continue for the next suspected event. This Shaikh, “A first look at problems in the cloud,” inprocess is very similar to the troubleshooting Proc. of HotCloud, 2010.support for desktop printing or networking in [2] Amazon EC2, “http://aws.amazon.com/ec2/.”Windows OS. Sometimes the check-and-solve [3] Amazon EC2 Forum,steps need to shutdown or restart a instances to [4] M. Goldszmidt, A. Fox, I. Cohen, J.solve the problem, which has to get user’s Symons, S. Zhang, and T. Kelly, “Capturing,confirmation. If some users want to run their indexing, clustering, and retrieving system history,”instances without any interruption, Cloud Vision in Proc. of SOSP’05, 2005.sends the solution back to the users for their [5] P. Bodik, M. Goldszmidt, A. Fox,own decision. After a problem is reported, D.B.Woodard, and H.Andersen, “Fingerprinting theCloud Vision automatically invokes problem datacenter: automated clas- sification ofreasoning and uses the same web interface to performance crises,” in Proc. of EuroSys, 2010.interact with users and acquire their [6] A. Whitaker, R. S. Cox, and S. D.Gribble,confirmation and permission for actions. Figure 2 “Configuration debugging as search: Finding theshows the complete flow chart of our check-and- needle in the haystack,” in Proc. of OSDI, 2004.solve troubleshooting algorithm [9]. [7] Chunyi Peng, Minkyong Kim ; Zhe Zhang ; Hui Lei , “VDN: Virtual machine image distribution network for cloud data centers” 2012 V. RELATED WORK [8] HyogunYoon , Hanku Lee, “An Intelligence Virtualization Rule Based on Multi-layer to Support Much work has applied statistical learning Social-Media Cloud Service”,2011methods to detect and diagnose problems in large- [9] Shufen Zhang ; Xuebin Chen ; Xiuzhen Huo,scale computer systems. Cohen et al., [4] used a “Cloud Computing Research and Developmenttree-augmented naive (TAN) bayesian network to Trend”, 2010learn the probabilistic relationship between SLA [10] Jackson, K.R. Ramakrishnan, L. ; Muriki, K. ;violations and system resource usages. They used Canon, S. ; Cholia, S. ; Shalf, J. ; Wasserman, H.J.this learned bayesian network to identify ;Wright, N.J. , ” Performance Analysis of Highperformance bottlenecks. Bodik et al. [5] Performance Computing Applications on theproposed a “Fingerprint” approach, which Amazon Web Services Cloud ”, 2010identifies the problem with performance 27 All Rights Reserved © 2012 IJARCSEE

×