XRM is an event-based resource management framework for XCP that uses feedback control theory. It implements a control loop that monitors applications and resources, models them against performance goals, and takes automated actions such as migrating VMs or adjusting resource shares. This addresses the challenges of managing shared cloud infrastructures, where applications have different requirements and resources need to be allocated optimally. Preliminary results show that XRM, using algorithms like bin packing, can consolidate VMs onto fewer hosts than other approaches. Feedback is sought from the Xen community on incorporating XRM into XCP and making it more openly available.
11. Challenge #2: Resource Management Spans Multiple Layers
[Diagram: layer stack of Services, PaaS, IaaS, and Hardware, with resource management at each layer]
How do we pass information between the layers so that they don't make conflicting decisions?
25. High Application Performance
An RM that can automatically re-arrange resources for multiple applications/VMs on multiple physical machines, providing optimal resource utilization and application performance
We are building the (ultimate) RM system
XRM = first incarnation on XCP!
26. Outline
Motivation
Challenges in RM
XRM: Feedback Control Based Design
XRM Implementation and Preliminary Results
Summary and Feedback
27. How to achieve the automation?
"Almost any system that is considered automatic has some element of feedback control" - Hellerstein et al.
XRM = a feedback control system
28. RM in multiple layers
Services: send high-level service requests to the PaaS RM
PaaS RM: does app modeling and may request changes; sends slice requests/slice changes to the IaaS RM
IaaS RM: knows only about VMs and hardware resources; runs an automated control loop over the hardware
XRM = the IaaS RM
29. XRM's feedback control loop
Monitor: collects stats (e.g., network stats) from XCP
Model: can model applications, VMs, and underlying resources against performance goals
Control: computes control parameters
Action: change resource shares, power off machines, migrate VMs (the loop is sketched below)
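To make the loop concrete, here is a minimal Python sketch of a repeating monitor-model-control-act cycle. All of the names (collect, estimate, goal_error, decide, apply, POLL_INTERVAL) are hypothetical placeholders for illustration, not XRM's actual interfaces:

```python
import time

POLL_INTERVAL = 10  # seconds between control iterations (assumed value)

def control_loop(monitor, model, controller, actuator):
    """One possible shape of a feedback control loop:
    monitor -> model -> control -> action, repeated forever."""
    while True:
        stats = monitor.collect()           # e.g., CPU and network stats from XCP
        state = model.estimate(stats)       # model apps, VMs, and resources
        error = model.goal_error(state)     # distance from performance goals
        actions = controller.decide(error)  # compute control parameters
        for action in actions:              # change shares, migrate, power off
            actuator.apply(action)
        time.sleep(POLL_INTERVAL)
```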
30. Current incarnation
XCP monitoring module: receives out-of-band stat updates from XCP nodes, backed by an RRD database
Stats analysis module: applies thresholds and rules, passes filtered stats and analysis data onward (sketched below)
Core algorithm module: draws on an algorithm bank and decides which action to take
Wrapper: issues low-level commands/XAPI commands to the XCP master node, plus OpenFlow for the network
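As an illustration of how the stats analysis module could feed the core algorithm module, the sketch below filters raw per-VM stats against threshold rules and emits events. The rule table, metric names, threshold values, and event names are all invented for the example:

```python
# Hypothetical threshold rules: metric name -> (limit, event to emit)
RULES = {
    "cpu_util":  (0.90, "HighCPUUtilization"),
    "pkt_drops": (100,  "PacketDrops"),
}

def analyze(stats):
    """Turn raw per-VM stats into (event, vm, value) tuples that a
    core algorithm module could dispatch on."""
    events = []
    for vm, metrics in stats.items():
        for metric, value in metrics.items():
            rule = RULES.get(metric)
            if rule and value > rule[0]:
                events.append((rule[1], vm, value))
    return events

# Example: analyze({"vm1": {"cpu_util": 0.95}})
#          -> [("HighCPUUtilization", "vm1", 0.95)]
```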
31. XRM is an event-based framework
Many algorithms can be developed and plugged in
The algorithms register for specific events:
High CPU utilization
Packet drops
PowerOff
PowerOn
…
Different algorithms may take different actions
A common abstraction for ALL algorithms (a minimal registry sketch follows this slide)
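A minimal sketch of what such a common event abstraction might look like in Python; the register/dispatch API shown here is an assumption for illustration, not XRM's actual plugin interface:

```python
from collections import defaultdict

_handlers = defaultdict(list)  # event name -> list of algorithm callbacks

def register(event, handler):
    """Algorithms plug in by registering for the events they care about."""
    _handlers[event].append(handler)

def dispatch(event, **kwargs):
    """Fire an event; every registered algorithm gets a chance to act."""
    for handler in _handlers[event]:
        handler(**kwargs)

# An algorithm might register like this (hypothetical handler):
def on_high_cpu(vm=None, value=None):
    print("considering migration of %s (cpu=%.2f)" % (vm, value))

register("HighCPUUtilization", on_high_cpu)
dispatch("HighCPUUtilization", vm="vm1", value=0.95)
```

Registering a handler is all an algorithm needs to do to participate; the core module simply dispatches events as the stats analysis produces them.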
32. What algorithms can you implement?
AutoControl: automated control of multiple virtualized resources [PadalaEurosys09]
Models the application and sets VM shares based on application goals
[Diagram: per-application App Controllers turn goals into resource shares, which per-node Node Controllers enforce]
[PadalaEurosys09] Pradeep Padala, Xiaoyun Zhu, Mustafa Uysal, et al. Automated Control of Multiple Virtualized Resources. In Proceedings of EuroSys 2009.
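AutoControl itself uses a model-based adaptive controller (see the paper). As a deliberately simplified illustration only, here is a proportional controller that nudges a VM's resource share toward a throughput goal; the gain and clamping values are arbitrary, and this is not the paper's actual algorithm:

```python
def adjust_share(share, measured, target, gain=0.5,
                 min_share=0.05, max_share=1.0):
    """Proportional step: if the app is below its throughput goal,
    grant a larger share; if it overshoots, reclaim some share.
    `measured` and `target` are normalized performance values."""
    error = (target - measured) / target     # relative goal error
    new_share = share * (1.0 + gain * error)
    return max(min_share, min(max_share, new_share))

# e.g., an app at 80% of its throughput goal holding a 0.4 share:
# adjust_share(0.4, measured=0.8, target=1.0) -> 0.44
```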
33. Outline
Motivation
Challenges in RM
XRM: Feedback Control Based Design
XRM Implementation and Preliminary Results
Summary and Feedback
34. XRM features
Interface to upper layers
Auto-* features
External control
Pluggable algorithms
Extensibility
35. XRM Implementation
Implemented on XCP 0.1.1; written in Python
Pluggable algorithms have to be written in Python
Currently implements four algorithms (a first-fit sketch follows this slide):
Bin packing
Bin packing + live migration
Random host
Round-robin
We have also implemented a simulator (run 1 million VMs on 100,000 nodes!)
Can capture data during a "real" run
Run multiple algorithms on the exact same trace
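A minimal first-fit-decreasing sketch of bin-packing placement, assuming each slice request is a single CPU fraction and every host has unit capacity; XRM's real algorithms also handle live migration and the other placement policies listed above:

```python
def bin_pack(requests, num_hosts, capacity=1.0):
    """First-fit decreasing: sort slice requests by size, place each on
    the first host with room. Returns {host_index: [request, ...]}."""
    load = [0.0] * num_hosts
    placement = {h: [] for h in range(num_hosts)}
    for req in sorted(requests, reverse=True):
        for h in range(num_hosts):
            if load[h] + req <= capacity:
                load[h] += req
                placement[h].append(req)
                break
        else:
            raise RuntimeError("no host can fit request %.2f" % req)
    return placement

# e.g., bin_pack([0.5, 0.3, 0.4, 0.2, 0.6], num_hosts=5)
# packs all five requests onto just two hosts.
```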
36. XRM Evaluation
5 hosts, 4 cores each
Random utilizations
Random slice requests
Three algorithms compared:
Bin packing
Round-robin
Random host
Slicing algorithms were evaluated in previous work: AutoControl [PadalaEurosys09]
37. Comparing three algorithms
[Chart: host utilization over time intervals for each algorithm]
Round-robin: uses all five hosts, wasting energy
Random host: uses <= five hosts, still wasting energy
Bin packing: uses <= three hosts!
42. AutoControl experiments
Load increased on half of the VMs, chosen randomly
[Chart annotations: VMs whose load is unchanged need no control action; AutoControl can readjust allocations for the loaded VMs]
43. SLO (performance goal) violations
[Heatmap: SLO violations per application over time, comparing default Xen with AutoControl; the scale runs from good (at target) to bad (above target)]
44. Summary
Resource management in cloud infrastructures is complex:
Multiple layers of RM
Complex primitives
Complex decisions
We are developing feedback-control-theory-based RM
XRM is event-based, pluggable, and extensible
Complex algorithms like AutoControl can be developed
Research in advanced algorithms is in progress
45. Summary of our experiences with XCP 0.1.1
We are trying to build a research cloud based on XCP
Besides XRM, we are adding Fault Tolerance and a web-based GUI to XCP
Having to install a special distribution is difficult
Why not ship XCP as a set of packages in RHEL or other distributions?
You are breaking toolstacks developed at various companies
XCP docs are the same as the Citrix XenServer docs
Some of the features don't work or are not supported
Better documentation of the API is needed
The XCP GUI needs to improve
Bugs in OpenXenCenter
47. We want feedback from the Xen community
Comments on the XRM architecture
Should we incorporate XRM into XCP? (OCaml)
Are you interested in an open-source XRM?
Does the community want to be involved?
Questions? ppadala@docomolabs-usa.com
Editor's Notes
Good afternoon everyone. My name is Pradeep Padala, from DOCOMO USA Labs. Today, I am going to talk about a resource management framework we are building at DOCOMO USA Labs.
Let us start by looking at a typical scenario in shared infrastructure. Here we have two applications, web search and data analytics, sharing a common infrastructure. This is a pretty common scenario at companies like Yahoo and Google.
These applications, however, have very different requirements. For example, the search app wants very fast searches, while the data mining app wants to read large amounts of data in bulk. If we translate this into system requirements, the search app expects low response time, while the data analytics app expects high throughput. These companies might be ready to pay a good amount of money to achieve their requirements, but the data center owner might prioritize depending on the pay. Note that the incentive may not be "real money" but other forms of utility. For example, for a company like Yahoo, search has more priority than data analytics. So, we want to achieve a certain differentiation.
How are these applications hosted currently? Currently, these applications are hosted by partitioning the resources. For example, here we see three applications that are hosted on four physical nodes. However, physical partitioning wastes resources, as some applications may not fully utilize them. The data center sprawl also makes it difficult to manage. The solution is to create a virtual data center where multiple applications are hosted together on physical nodes using virtualization. There are many benefits to consolidation, including improved utilization, reduced maintenance, and lower costs.
The first challenge is that developers don't want to manage resources directly. Let's see a simplified example of how a developer would write a scalable service. Developers start with provisioning VMs and begin running applications. Then, they have to monitor the applications, and if an application's goal is not met, they have to do some magic. They first have to figure out the reason, which itself can be complex. Once the reason is found, they can scale up, scale out, etc. Finally, if they want to reduce costs, they consolidate.
The third challenge is the variety of scaling primitives that are available and the difficulty in combining them.
That brings me to our approach, AutoControl. So, how do we automatically allocate resources? Our approach basically follows from a key insight that is pretty much summed up in the quote. <read the slide>
Finally, we come to running AutoControl in a prototype data center. We have 16 servers … read the slide.
This slide visually shows the SLO violations in different nodes.