Improving Scalability of Xen: The 3,000 Domains Experiment

Machines are getting more powerful these days, and more and more VMs will run on a single host. This work started off with a simple goal: to run 3,000 domains on a single host and address any scalability issues that came up along the way. I will start with Xen internals, then state the problems and their solutions. Several improvements were made, from the hypervisor and the Linux kernel down to user-space components such as the console backend and the xenstore backend. The main improvement in the hypervisor and the Dom0 Linux kernel is the new event channel infrastructure, which enables Dom0 to handle many more events simultaneously; the original implementation allows only 1024 (32-bit) and 4096 (64-bit) event channels respectively.
The target audience is cloud developers, kernel developers, and anyone interested in Xen scalability and internals. General knowledge of Linux is needed; knowledge of Xen is not required but is nice to have.

Video of the talk: http://www.youtube.com/watch?v=ZJPCqC2pDSg

Presentation Transcript

  • Improving Scalability of Xen: the 3,000 Domains Experiment
    Wei Liu <wei.liu2@citrix.com>
  • Xen: the gears of the cloud
    ● large user base: estimated at more than 10 million individual users
    ● powers the largest clouds in production
    ● not just for servers
  • Xen: Open Source
    ● GPLv2 with DCO (like Linux)
    ● diverse contributor community
    source: Mike Day, http://code.ncultra.org
  • Xen architecture: PV guests
    [diagram: Xen runs on the hardware; Dom0 holds the HW drivers and PV backends; each DomU runs PV frontends]
  • Xen architecture: PV protocol
    [diagram: shared ring between backend and frontend, with request producer/consumer and response producer/consumer indices; an event channel is used for notification]
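    To make the ring protocol concrete, here is a minimal sketch of a shared ring with an event-channel kick. The structure, field names, ring size, and the notify callback are simplified illustrations; the real interface is the macro-generated one in xen/include/public/io/ring.h.

    /* Minimal sketch of the PV ring protocol (simplified; the real shared
     * ring is macro-generated from xen/include/public/io/ring.h). */
    #include <stdint.h>

    struct request  { uint64_t id; uint64_t op;    };  /* illustrative payload */
    struct response { uint64_t id; int64_t status; };  /* illustrative payload */

    #define RING_SIZE 32                     /* must be a power of two */

    struct shared_ring {
        uint32_t req_prod;                   /* written by frontend */
        uint32_t rsp_prod;                   /* written by backend  */
        struct request  req[RING_SIZE];
        struct response rsp[RING_SIZE];
    };

    /* Frontend: publish one request, then notify the backend over the
     * event channel bound to this ring.  'notify' stands in for the real
     * event-channel helper (e.g. notify_remote_via_evtchn in Linux). */
    static void frontend_send(struct shared_ring *r, const struct request *rq,
                              void (*notify)(int port), int evtchn)
    {
        r->req[r->req_prod % RING_SIZE] = *rq;
        __sync_synchronize();                /* request body before index */
        r->req_prod++;                       /* publish to the backend    */
        notify(evtchn);                      /* "kick" via event channel  */
    }

    /* Backend: consume new requests, produce responses, then notify. */
    static void backend_poll(struct shared_ring *r, uint32_t *req_cons,
                             void (*notify)(int port), int evtchn)
    {
        while (*req_cons != r->req_prod) {
            struct request  rq = r->req[(*req_cons)++ % RING_SIZE];
            struct response rs = { .id = rq.id, .status = 0 };
            r->rsp[r->rsp_prod % RING_SIZE] = rs;
            __sync_synchronize();            /* response body before index */
            r->rsp_prod++;
        }
        notify(evtchn);                      /* tell frontend responses are ready */
    }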
  • Xen architecture: driver domains
    [diagram: Dom0 runs the toolstack; a disk driver domain runs the disk driver and BlockBack; a network driver domain runs the network driver and NetBack; the DomU runs NetFront and BlockFront]
  • Xen architecture: HVM guests
    [diagram: Dom0 holds the HW drivers and PV backends; QEMU provides IO emulation for each HVM DomU, either in Dom0 or in a stub domain; HVM DomUs can also use PV frontends]
  • Xen architecture: PVHVM guests
    [diagram: Dom0 holds the HW drivers and PV backends; PVHVM DomUs use PV frontends]
  • Xen scalability: current status
    Xen 4.2:
    ● up to 5TB host memory (64-bit)
    ● up to 4095 host CPUs (64-bit)
    ● up to 512 VCPUs per PV VM
    ● up to 256 VCPUs per HVM VM
    ● event channels:
      ○ 1024 for 32-bit domains
      ○ 4096 for 64-bit domains
  • Xen scalability: current status
    Typical PV / PVHVM DomU:
    ● 256MB to 240GB of RAM
    ● 1 to 16 virtual CPUs
    ● at least 4 inter-domain event channels:
      ○ xenstore
      ○ console
      ○ virtual network interface (vif)
      ○ virtual block device (vbd)
  • Xen scalability: current status
    ● from a backend domain's (Dom0 / driver domain) point of view:
      ○ IPIs, PIRQs, VIRQs: related to the number of CPUs and devices; a typical Dom0 uses 20 to ~200
      ○ yielding fewer than 1024 guests supported by a 64-bit backend domain, and even fewer by a 32-bit backend domain
    ● 1K still sounds like a lot, right?
      ○ enough for normal use cases
      ○ not ideal for OpenMirage (OCaml on Xen) and other similar projects
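    A rough budget shows where the "fewer than 1024" figure comes from: with the 64-bit limit of 4096 event channels, reserving ~200 for Dom0's own IPIs/PIRQs/VIRQs and spending the 4 inter-domain channels per guest listed above leaves roughly (4096 - 200) / 4 ≈ 970 guests.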
  • Start of the story
    ● effort to run 1,000 DomUs (modified Mini-OS) on a single host *
    ● want more? how about 3,000 DomUs?
      ○ definitely hits the event channel limit
      ○ toolstack limits
      ○ backend limits
      ○ open-ended question: is it practical to do so?
    * http://lists.xen.org/archives/html/xen-users/2012-12/msg00069.html
  • Toolstack limit
    xenconsoled and cxenstored both use select(2)
    ● xenconsoled: not very critical, can simply be restarted
    ● cxenstored: critical to Xen; cannot be shut down without losing information
    ● oxenstored: uses libev, so there is no problem
    Fix: switch from select(2) to poll(2), and implement poll(2) for Mini-OS
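    The limiting factor is that select(2)'s fd_set is capped at FD_SETSIZE (normally 1024) descriptors, so a daemon holding one connection per guest stalls around a thousand domains, while poll(2) takes an arbitrarily long array. The sketch below illustrates the poll(2) style of loop; it is not the actual xenconsoled/cxenstored patch, and handle_fd is a hypothetical per-connection handler.

    /* Illustrative event loop: poll(2) scales past FD_SETSIZE, which caps
     * a select(2)-based daemon at roughly 1024 file descriptors. */
    #include <poll.h>
    #include <stdlib.h>

    /* One pollfd per connected guest; nfds can grow well beyond 1024. */
    int wait_for_events(const int *fds, size_t nfds)
    {
        struct pollfd *pfds = calloc(nfds, sizeof(*pfds));
        if (!pfds)
            return -1;

        for (size_t i = 0; i < nfds; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;          /* wake up when readable */
        }

        int ready = poll(pfds, nfds, -1);     /* block until activity  */

        for (size_t i = 0; ready > 0 && i < nfds; i++) {
            if (pfds[i].revents & POLLIN) {
                /* handle_fd(fds[i]);   hypothetical per-connection handler */
            }
        }

        free(pfds);
        return ready;
    }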
  • Event channel limit
    Identified as a key feature for the 4.3 release. Two designs have come up so far:
    ● 3-level event channel ABI
    ● FIFO event channel ABI
  • 3-level ABI
    Motivation: aimed at the 4.3 timeframe
    ● an extension of the default 2-level ABI, hence the name
    ● started in Dec 2012
    ● V5 draft posted Mar 2013
    ● almost ready
  • Default (2-level) ABI
    [diagram: a per-CPU selector word, each bit of which flags one word of the shared pending bitmap, plus a per-VCPU upcall pending flag]
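    To make the two-level structure concrete, here is a simplified sketch of how a guest scans it on an upcall. The field names follow the shared_info ABI, but the code itself is illustrative, not the actual Linux event channel driver.

    /* Simplified 2-level event channel scan.  One selector bit covers one
     * word of the shared pending bitmap, so a 64-bit guest gets
     * 64 * 64 = 4096 event channels (32 * 32 = 1024 on 32-bit). */
    #include <stdint.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    struct vcpu_info_2l {
        unsigned long evtchn_pending_sel;            /* per-VCPU selector word */
    };

    struct shared_info_2l {
        unsigned long evtchn_pending[BITS_PER_LONG]; /* shared pending bitmap  */
        unsigned long evtchn_mask[BITS_PER_LONG];    /* shared mask bitmap     */
    };

    static void scan_pending(struct vcpu_info_2l *v, struct shared_info_2l *s,
                             void (*handle)(unsigned port))
    {
        /* Atomically grab and clear the selector word. */
        unsigned long sel = __atomic_exchange_n(&v->evtchn_pending_sel, 0,
                                                __ATOMIC_ACQUIRE);
        while (sel) {
            unsigned i = __builtin_ctzl(sel);        /* next set selector bit  */
            sel &= sel - 1;

            unsigned long pend = s->evtchn_pending[i] & ~s->evtchn_mask[i];
            while (pend) {
                unsigned bit = __builtin_ctzl(pend);
                pend &= pend - 1;
                handle(i * BITS_PER_LONG + bit);     /* event channel number   */
            }
        }
    }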
  • 3-level ABI
    [diagram: a per-CPU first-level selector, each bit of which flags a word of the per-CPU second-level selector, which in turn flags words of the shared pending bitmap, plus an upcall pending flag]
  • 3-level ABI
    Number of event channels:
    ● 32K for 32-bit guests
    ● 256K for 64-bit guests
    Memory footprint:
    ● 2 bits per event (pending and mask)
    ● 2 / 16 pages for 32 / 64-bit guests
    ● NR_VCPUS pages for control structures
    Limited to Dom0 and driver domains
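    As a quick check of those footprint figures: 32K events × 2 bits = 8 KB, i.e. 2 pages of 4 KB, and 256K events × 2 bits = 64 KB, i.e. 16 pages.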
  • 3-level ABI
    ● Pros
      ○ general concepts and race conditions are fairly well understood and tested
      ○ envisioned for Dom0 and driver domains only, so the memory footprint stays small
    ● Cons
      ○ lack of priority (inherited from the 2-level design)
  • FIFO ABI
    Motivation: designed from the ground up, with extra features
    ● design posted in Feb 2013
    ● first prototype posted in Mar 2013
    ● under development, close at hand
  • FIFO ABI
    [diagram: per-CPU control structure with a selector for picking an event queue; a shared event array of 32-bit event words; empty and non-empty queues chained through the LINK field of each word]
  • FIFO ABI
    Number of event channels:
    ● 128K (2^17) by design
    Memory footprint:
    ● one 32-bit word per event
    ● up to 128 pages per guest
    ● NR_VCPUS pages for control structures
    Use the toolstack to limit the maximum number of event channels a DomU can have
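    A rough sketch of the per-event word in the FIFO design: the low bits hold the LINK to the next event in a queue and the top bits hold flags. The bit positions below approximate the proposal under review at the time and are illustrative, not a copy of the final headers; a 17-bit LINK field is what caps the design at 2^17 = 128K events, and 128K 32-bit words = 512 KB = 128 pages of 4 KB.

    /* Illustrative FIFO ABI event word and queue walk (bit positions
     * approximate the proposed design, not the final Xen headers). */
    #include <stdint.h>

    #define EVTCHN_FIFO_LINK_BITS 17                 /* 2^17 = 128K events  */
    #define EVTCHN_FIFO_LINK_MASK ((1u << EVTCHN_FIFO_LINK_BITS) - 1)

    #define EVTCHN_FIFO_PENDING   (1u << 31)         /* event is pending    */
    #define EVTCHN_FIFO_MASKED    (1u << 30)         /* delivery suppressed */
    #define EVTCHN_FIFO_LINKED    (1u << 29)         /* already on a queue  */

    typedef uint32_t event_word_t;                   /* one word per event  */

    /* Dequeue the head of a per-priority queue: 'head' holds the event
     * number at the front; the LINK field of its word points to the next. */
    static unsigned fifo_dequeue(event_word_t *array, uint32_t *head)
    {
        unsigned port = *head;
        event_word_t w = array[port];

        *head = w & EVTCHN_FIFO_LINK_MASK;           /* follow the link     */
        array[port] = w & ~(EVTCHN_FIFO_LINKED | EVTCHN_FIFO_PENDING);
        return port;                                 /* deliver this event  */
    }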
  • FIFO ABI
    ● Pros
      ○ event priority
      ○ FIFO ordering
    ● Cons
      ○ relatively large memory footprint
  • Community decision
    ● the scalability issue is not as urgent as we thought
      ○ only OpenMirage has expressed interest in extra event channels
    ● delayed until the 4.4 release
      ○ better to maintain one more ABI than two
      ○ measure both and take one
    ● leaves time to test both designs
      ○ event handling is complex by nature
  • Back to the story: the 3,000 DomUs experiment
  • 3,000 Mini-OS DEMO
    Hardware spec:
    ● 2 sockets, 4 cores, 16 threads
    ● 24GB RAM
    Software config:
    ● Dom0: 16 VCPUs
    ● Dom0: 4GB RAM
    ● Mini-OS: 1 VCPU
    ● Mini-OS: 4MB RAM
    ● Mini-OS: 2 event channels
  • 3,000 Linux
    Hydra monster hardware spec:
    ● 8 sockets, 80 cores, 160 threads
    ● 512GB RAM
    Software config:
    ● Dom0: 4 VCPUs (pinned)
    ● Dom0: 32GB RAM
    ● DomU: 1 VCPU
    ● DomU: 64MB RAM
    ● DomU: 3 event channels (2 + 1 VIF)
  • Observation
    Domain creation time:
    ● up to ~500 domains: acceptable
    ● beyond ~800 domains: slow
    ● took hours to create all 3,000 DomUs
  • Observation
    Backend bottlenecks:
    ● network bridge limit in Linux
    ● PV backend driver buffer starvation
    ● I/O speed not acceptable
    ● Linux with 4GB RAM can only allocate ~45k event channels due to memory limitations
  • Observation
    CPU starvation:
    ● density too high: 1 PCPU vs ~20 VCPUs
    ● backend domain starvation
    ● should dedicate PCPUs to critical service domains
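    For example, Dom0 can be given dedicated physical CPUs with the dom0_max_vcpus= and dom0_vcpus_pin Xen boot options, or with xl vcpu-pin at runtime; this is mentioned as a general option rather than the exact setup used in the experiment.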
  • Summary
    Thousands of domains: doable, but not very practical at the moment
    ● hypervisor and toolstack
      ○ speed up creation
    ● hardware bottlenecks
      ○ VCPU density
      ○ network / disk I/O
    ● Linux PV backend drivers
      ○ buffer size
      ○ processing model
  • Beyond?
    A possible practical way to run thousands of domains: offload services to dedicated domains and trust the Xen scheduler.
    Disaggregation
  • Happy hacking and have fun! Q&A
  • Acknowledgement
    Pictures used in slides:
    ● thumbs up: http://primary3.tv/blog/uncategorized/cal-state-university-northridge-thumbs-up/
    ● hydra: http://www.pantheon.org/areas/gallery/mythology/europe/greek_people/hydra.html