Improving Scalability of Xen: The 3,000 Domains Experiment


Machines are getting more powerful these days, and more and more VMs will run on a single host. This work started off with a simple goal: to run 3,000 domains on a single host and address any scalability issues encountered along the way. I will start with Xen internals, then state the problems and their solutions. Several improvements were made, spanning the hypervisor and the Linux kernel as well as user-space components such as the console backend and the xenstore backend. The main improvement to the hypervisor and the Dom0 Linux kernel is the new event channel infrastructure, which enables Dom0 to handle many more events simultaneously; the original implementation allows only 1024 for 32-bit domains and 4096 for 64-bit domains.
The target audience is cloud developers, kernel developers, and anyone interested in Xen scalability and internals. Readers need general knowledge of Linux; knowledge of Xen is not required but is nice to have.

Published in: Technology


  1. Improving Scalability of Xen: the 3,000 domains experiment
     Wei Liu <>
  2. Xen: the gears of the cloud
     ● large user base: estimated more than 10 million individual users
     ● powers the largest clouds in production
     ● not just for servers
  3. Xen: Open Source
     ● GPLv2 with DCO (like Linux)
     ● diverse contributor community
     (source: Mike Day)
  4. Xen architecture: PV guests
     [diagram: Xen runs on the hardware; Dom0 holds the HW drivers and PV
     backends; each DomU runs PV frontends]
  5. Xen architecture: PV protocol
     [diagram: a shared ring between backend and frontend with request
     producer/consumer and response producer/consumer indices; an event
     channel is used for notification]
  6. Xen architecture: driver domains
     [diagram: Dom0 runs the toolstack; a network driver domain runs the
     network driver and NetBack; a disk driver domain runs the disk driver
     and BlockBack; the DomU runs NetFront and BlockFront]
  7. Xen architecture: HVM guests
     [diagram: Dom0 holds the HW drivers and PV backends; IO emulation for
     each HVM DomU is provided by QEMU, running either in Dom0 or in a
     stub domain]
  8. Xen architecture: PVHVM guests
     [diagram: Dom0 holds the HW drivers and PV backends; each PVHVM DomU
     runs PV frontends]
  9. Xen scalability: current status
     Xen 4.2:
     ● up to 5TB host memory (64-bit)
     ● up to 4095 host CPUs (64-bit)
     ● up to 512 VCPUs for a PV VM
     ● up to 256 VCPUs for an HVM VM
     ● event channels:
       ○ 1024 for 32-bit domains
       ○ 4096 for 64-bit domains
  10. Xen scalability: current status
      Typical PV / PVHVM DomU:
      ● 256MB to 240GB of RAM
      ● 1 to 16 virtual CPUs
      ● at least 4 inter-domain event channels:
        ○ xenstore
        ○ console
        ○ virtual network interface (vif)
        ○ virtual block device (vbd)
  11. Xen scalability: current status
      ● from a backend domain's (Dom0 / driver domain) PoV:
        ○ IPIs, PIRQs and VIRQs scale with the number of CPUs and
          devices; a typical Dom0 uses 20 to ~200
        ○ yielding fewer than 1024 guests for a 64-bit backend domain,
          and even fewer for a 32-bit backend domain
      ● 1K still sounds like a lot, right?
        ○ enough for normal use cases
        ○ not ideal for OpenMirage (OCaml on Xen) and other similar
          projects
  12. Start of the story
      ● effort to run 1,000 DomUs (modified Mini-OS) on a single host
      ● want more? How about 3,000 DomUs?
        ○ definitely hits the event channel limit
        ○ toolstack limits
        ○ backend limits
        ○ open-ended question: is it practical to do so?
  13. Toolstack limit
      xenconsoled and cxenstored both use select(2)
      ● xenconsoled: not very critical and can be restarted
      ● cxenstored: critical to Xen and cannot be shut down without
        losing information
      ● oxenstored: uses libev, so there is no problem
      Fix: switch from select(2) to poll(2); implement poll(2) for
      Mini-OS
  14. Event channel limit
      Identified as a key feature for the 4.3 release. Two designs have
      come up so far:
      ● 3-level event channel ABI
      ● FIFO event channel ABI
  15. 3-level ABI
      Motivation: aimed for the 4.3 timeframe
      ● an extension of the default 2-level ABI, hence the name
      ● started in Dec 2012
      ● V5 draft posted Mar 2013
      ● almost ready
  16. Default (2-level) ABI
      [diagram: a per-cpu selector word, each set bit pointing into a
      shared pending bitmap; a per-vcpu upcall pending flag]
  17. 3-level ABI
      [diagram: a per-cpu first-level selector, a per-cpu second-level
      selector, and a shared pending bitmap; a per-vcpu upcall pending
      flag]
  18. 3-level ABI
      Number of event channels:
      ● 32K for 32-bit guests
      ● 256K for 64-bit guests
      Memory footprint:
      ● 2 bits per event (pending and mask)
      ● 2 / 16 pages for 32 / 64-bit guests
      ● NR_VCPUS pages for the control structure
      Limited to Dom0 and driver domains
  19. 3-level ABI
      ● Pros
        ○ general concepts and race conditions are fairly well
          understood and tested
        ○ envisioned for Dom0 and driver domains only, so a small
          memory footprint
      ● Cons
        ○ lack of priority (inherited from the 2-level design)
  20. FIFO ABI
      Motivation: designed from the ground up, with extra features
      ● design posted in Feb 2013
      ● first prototype posted in Mar 2013
      ● under development, close at hand
  21. FIFO ABI
      [diagram: a per-cpu control structure with a selector for picking
      the event queue; a shared array of 32-bit event words; an empty
      queue and a non-empty queue linked through the LINK field of each
      event word]
  22. FIFO ABI
      Number of event channels:
      ● 128K (2^17) by design
      Memory footprint:
      ● one 32-bit word per event
      ● up to 128 pages per guest
      ● NR_VCPUS pages for the control structure
      Use the toolstack to limit the maximum number of event channels a
      DomU can have
  23. FIFO ABI
      ● Pros
        ○ event priority
        ○ FIFO ordering
      ● Cons
        ○ relatively large memory footprint
  24. Community decision
      ● the scalability issue is not as urgent as we thought
        ○ only OpenMirage expressed interest in extra event channels
      ● delayed until the 4.4 release
        ○ better to maintain one more ABI than two
        ○ measure both and take one
      ● leaves time to test both designs
        ○ event handling is complex by nature
  25. Back to the story: the 3,000 DomUs experiment
  26. 3,000 Mini-OS (demo)
      Hardware spec:
      ● 2 sockets, 4 cores, 16 threads
      ● 24GB RAM
      Software config:
      ● Dom0: 16 VCPUs, 4GB RAM
      ● Mini-OS: 1 VCPU, 4MB RAM, 2 event channels
  27. 3,000 Linux (Hydra)
      Monster hardware spec:
      ● 8 sockets, 80 cores, 160 threads
      ● 512GB RAM
      Software config:
      ● Dom0: 4 VCPUs (pinned), 32GB RAM
      ● DomU: 1 VCPU, 64MB RAM
      ● DomU: 3 event channels (2 + 1 vif)
  28. Observation
      Domain creation time:
      ● below 500 domains: acceptable
      ● beyond 800 domains: slow
      ● took hours to create 3,000 DomUs
  29. Observation
      Backend bottlenecks:
      ● network bridge limit in Linux
      ● PV backend driver buffer starvation
      ● I/O speed not acceptable
      ● Linux with 4GB RAM can only allocate ~45K event channels due to
        memory limitations
  30. Observation
      CPU starvation:
      ● density too high: 1 PCPU vs ~20 VCPUs
      ● backend domain starvation
      ● should dedicate PCPUs to critical service domains
  31. Summary
      Thousands of domains: doable, but not very practical at the moment
      ● hypervisor and toolstack
        ○ speed up creation
      ● hardware bottlenecks
        ○ VCPU density
        ○ network / disk I/O
      ● Linux PV backend drivers
        ○ buffer size
        ○ processing model
  32. Beyond?
      A possible practical way to run thousands of domains: offload
      services to dedicated domains and trust the Xen scheduler.
      Disaggregation
  33. Happy hacking and have fun!
      Q&A
  34. Acknowledgement
      Pictures used in slides: thumbsup: