XS Oracle 2009 Just Run It

504 views
458 views

Published on

Yoshio Turner: Just RunIt

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
504
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

XS Oracle 2009 Just Run It

  1. 1. JustRunIt: Experiment-Based Management with Xen Wei Zheng1 Ricardo Bianchini1 Yoshio Turner2 J. Renato Santos2 G. Janakiraman3 1Rutgers University 2HP Labs 3Skytap Xen Summit 2009
  2. 2. Data Center Management • Challenging – Variety of tasks: Resource allocation, software/hardware upgrades, application and OS configuration… – Complex: Affect performance, availability, energy consumption – Getting worse: larger scale data centers hosting more diverse and dynamic workloads • Automation might save us – Using analytical models: insight into system behavior, fast exploration of large parameter spaces • Modeling has some important drawbacks – Expensive to develop, often relies on simplifying assumptions, may require re-calibration and re-validation as systems evolve
  3. 3. Our Approach: Experiment-Based Management • Experiments may be a better approach than modeling for many management tasks – Low cost: time and energy consumed by a few machines – No simplifying assumptions – No need for calibration/validation • Use of VMs in production facilitates experimentation JustRunIt – Infrastructure for experiment-based management of virtualized data centers hosting multiple services
  4. 4. JustRunIt Architecture Coordinates Experimenter Checker Experiment results Experiment results Coordinates X X X X X Interpolator Driver Heuristics Coordinate Range X I I X T T Time Limit I I I I X X I X T Management Entity
  5. 5. Experimenter • Step 1: Clone subset of VM production system to a sandbox
  6. 6. Experimenter • Step 1: Clone subset of VM production system to a sandbox – VM cloning: Modify Xen live migration to resume original VM VM instead of destroying it – Storage cloning: LVM copy-on- write snapshot for sandbox VM
  7. 7. Experimenter • Step 1: Clone subset of VM production system to a sandbox – VM cloning: Modify Xen live migration to resume original VM VM instead of destroying it – Storage cloning: LVM copy-on- write snapshot for sandbox VM – L2/L3 network address translation: implemented in driver domain netback driver to prevent network address conflict
  8. 8. Experimenter • Step 1: Clone subset of production system to a sandbox VM – VM cloning: Modify Xen live migration to resume original VM instead of destroying it – Storage cloning: LVM copy-on- VM write snapshot for sandbox VM – L2/L3 network address translation: implemented in driver domain netback driver to prevent network address conflict – Apply configuration changes : e.g., resource allocation – Resume execution
  9. 9. A1 A2 W1 D1 W2 D2 A1 A2 S_A2 W1 D1 W2 D3 A2 A3 S_W2 W2 D2 W3 D3 A2 A3 W1 D1 S_D2 W2 D2 A1 W3 D3 A2 Replies A3 Requests Assess performance and energy for different S_W2 S_A2 S_D2 configurations
  10. 10. Experimenter Step 2: Workload replay using network proxies • In-Proxy Tier-N VM Out-Proxy Sandbox VM • Proxies filter requests/replies from the sandbox VM • Emulates the timing and functional behavior of preceding and following service tiers – Application protocol level requests/replies (e.g. HTTP) • In-proxy replays with fixed delay – Delay needed if sandbox is faster than production system
  11. 11. Proxies In-Proxy Tier-N VM Out-Proxy Sandbox VM Req0 t0 Resp0 t0’ Resp0 ts0’ Req0 t00 Resp0 t00’ Req1 t1 Resp1 t1’ Req1 t01 Resp1 t01’ Resp1 ts1’ Resp2 ts2’ Req2 t2 Resp2 t2’ Req2 t02 Resp2 t02’ Resp3 ts3’ Req3 t3 Resp3 t3’ Req3 t03 Resp3 t03’ Respn tn’ Respn tsn’ Reqn tn Reqn t0n Respn t0n’ Online & Sandbox Throughput Mean response time Online & Sandbox
  12. 12. Experiment Driver CPU • Fill in results matrix Freq X X X within a time limit X X X X X X Corners • (min,min) CPU allocation Midpoints (recursive) • Heuristics • − Cancel experiments if gain for a resource addition falls below a threshold − Cancel experiments for tiers that do not produce the largest gains from a resource addition
  13. 13. Evaluation Overview • Implementation – Xen (< 50 lines python, <250 lines of C in netback) – Proxies for HTTP, mod_jk, MySQL (800-1500 LoC each) • Based on Tinyproxy codebase • Two services (auction and bookstore) • Case studies – Resource management: CPU – Evaluation of hardware upgrade • Proxy and Replay Overhead – Throughput: No impact – Response time: < 5 ms impact
  14. 14. Case Study 1: Resource Management • Goal: satisfy < 50ms response time using minimum resource • JustRunIt determines performance as function of resource allocation • Management entity assigns resources using a bin packing algorithm (simulated annealing) – Minimize resource usage and VM migrations • Compare experiment-based with a highly accurate model
  15. 15. Case Study 1 Results 60 50 Response Time (ms) 40 Service 0 RT JustRunIt Service 1 RT 30 Service 2 RT Service 3 RT 20 10 0 1 2 3 4 5 6 7 8 9 Time (minutes) 60 50 Response Time (ms) Highly 40 Service 0 RT accurate Service 1 RT 30 Service 2 RT modeling Service 3 RT 20 10 0 1 2 3 4 5 6 7 8 9 Time (minutes)
  16. 16. Case Study 2: Hardware Upgrades • Sandbox on upgraded hardware • Bin packing finds the number of new machines to accommodate the workload • Example scenario: two services, each app server uses 90% of one CPU core on old hardware – New machine requires 72% – Example too small to show any consolidation benefits of upgrading
  17. 17. Conclusions • JustRunIt infrastructure can support multiple (automated) management tasks – Answers “what-if” questions realistically and transparently using actual experiments • Possible directions include: – Tier interactions – Hypothetical workload mix – Validate administrator actions – Tradeoff between accuracy and experiment cost (resource usage and experiment time required)
  18. 18. Related Work • Modeling, simulation of data centers – Development and validation cost, simulation slow-down • Scaled down emulation – DieCast [NSDI08] uses VMs and time dilation for emulation (JustRunIt targets subset of mgmt tasks, native execution, and minimal additional hardware) • Sandboxing for managing data centers – Prior work from Rutgers (validating operator actions OSDI04 USENIX06, entire data center experiments EUROSYS07) and file server verification (Tan et al USENIX 05) – Very different infrastructure for JustRunIt (extrapolation, verification, application-transparency using VM cloning – practicality)

×