Windsor: Domain 0 Disaggregation for XenServer and XCP
In a traditional Xen configuration, domain 0 performs a large number of different functions: running the toolstack(s), hosting the backends for network and disk I/O, running the QEMU device model instances, driving the physical devices in the system, handling guest console/framebuffer I/O, and miscellaneous monitoring and management tasks. Concentrating all these functions in one domain produces a complex environment that is susceptible to shared fate on the failure of any one function, exhibits complex interactions between functions (including resource contention) that make performance difficult to predict, and has limited flexibility (such as requiring the same kernel for all device drivers).
""Domain 0 disaggregation"" has been discussed for some time as a way to break out domain 0's functions into separate domains. Doing this enables each domain to be tailored to its function such as using a different kernel or operating system to drive different physical devices. Splitting functions into separate domains removes some of the unintentional interactions such as in-domain resource contention and reduces the system impact of the failure of a single function such as a device driver crash.
Although domain 0 disaggregation is not new, it is seldom used in practice, and much of its use has focused on providing enhanced security. Citrix XenServer will be moving towards a disaggregated domain 0 in order to provide better security, scalability, performance, reliability, supportability and flexibility. This talk will describe XenServer's “Windsor” architecture and explain how it will provide these benefits to customers and users. We will present an overview of the architecture and some early experimental measurements showing the benefits.
Windsor
Domain 0 disaggregation for XenServer
James Bulpin, Director of Technology, XenServer, Citrix
August 28, 2012
Introduction
• XenServer/Xen Cloud Platform (XCP): a distribution of Xen, a domain 0 and everything else needed to create a platform for virtualisation
• A platform for server virtualisation
• A platform for virtual desktops (e.g. XenDesktop)
• A platform for the cloud (e.g. OpenStack and CloudStack)
• A platform for virtualised networking (e.g. NetScaler)
• All use cases are tending towards much higher numbers of guest VMs per host
• Current architecture works now but will hit bottlenecks as servers get bigger and more powerful
• Want a flexible, modular solution to scalability
Overview
• XenServer architecture evolution – a brief history
• Limitations of the current architecture
• Windsor: 3rd generation XenServer architecture
 ᵒ Using domain 0 disaggregation for scalability and performance

This presentation is about the internal technology of XenServer/XCP. It is not a statement about the future feature set or capabilities of any particular XenServer release.
XenServer Architecture – a brief history (1)
• First generation architecture in Burbank – XenEnterprise 3.0.0
• Single host virtualisation: no resource pools, no XenMotion
• Based on open-source xend/xm toolstack
• Basic C host agent driving xend
• XenAdmin Java management application
• Initially used a small, read-only dom0
 ᵒ Moved to a writable environment in 3.1
XenServer Architecture – a brief history (2)
• Second generation architecture in Rio – XenServer 4.0.0
• Current releases, including XenServer 6.1 coming soon, based on this
• All current versions of XCP based on this
• Sophisticated OCaml xapi toolstack implementing the XenAPI
 ᵒ Uses only the lowest-level part of the open-source Xen toolstack (libxc)
• Clustered architecture for resource pools, XenMotion and master mobility
• Domain 0 evolved from 1st generation
 ᵒ Fairly full Linux distribution based on CentOS
 ᵒ Writable environment using RPMs for package management
XenServer 2nd generation host architecture
[Diagram: domain 0 hosts the xapi toolstack (XenAPI, XML/RPC and HTTP interfaces, pool database, streaming services, VM consoles, DVS Controller), xenstore, qemu-dm, SM interface scripts, the tapdisk/blktap storage stack and ovs-vswitchd, with netback/blkback kernel backends driving the physical network and storage devices. Guest VMs communicate with the backends via shared memory pages over the Xen hypervisor.]
Limitations of the current architecture
• With future larger, more powerful servers, domain 0 will become a bottleneck
 ᵒ Will limit scalability and performance
• Lack of failure containment
• Hard to reason about dom0 security
• Non-optimal use of modern NUMA architectures
• Limited third party extensibility
Goals for the new architecture
• Improved scalability on single XenServer host
• Remove performance bottlenecks
• Cloud scale horizontal scaling of XenServer hosts
• Better isolation for multi-tenancy
• Increased availability and quality of service
• Higher Trusted Computing Base measurement coverage
Design principles and overview
• Scale-out, not scale-up – exploit parallelism and locality
• Disaggregate Domain-0 functionality to other domains
• Encapsulate and simplify
• Clean APIs between components
• Flexibility
• Design for scalability and security
The approach
• Can we improve performance and scalability with a bigger domain 0? – Yes
• Can we tune domain 0 to better utilise hardware resources? – Yes
• But... this makes domain 0 a bigger and more complex environment
 ᵒ Easy to change one thing and have a negative effect elsewhere
 ᵒ Multiple individuals and organisations working in the same complex environment
 ᵒ Still don’t get containment of failures
• Instead, disaggregate domain 0 functionality into multiple system domains
 ᵒ Can be thought of as “virtualising dom0” – after all, we tell users that it’s better to have one workload per VM and use a hypervisor to run multiple VMs per server
 ᵒ Helps disentangle complex behaviour
 ᵒ Easier to provision suitable resources, allows for clear parallelism
 ᵒ Contains failures
Domain 0 disaggregation
• Moving virtualisation functions out of domain 0 into other domains
 ᵒ Driver domains contain backend drivers (netback, blkback) and physical device drivers
  • Use PCI pass-through to give access to the physical device to be virtualised
 ᵒ System domains to provide shared services such as logging
 ᵒ Domains for running qemu(s) – per-VM stub domains or shared qemu domains
 ᵒ Monitoring, management and interface domains
• Has been around for a while, mostly with a security focus
 ᵒ Improving Xen Security through Disaggregation (Murray et al., VEE ’08) [http://www.cl.cam.ac.uk/~dgm36/publications/2008-murray2008improving.pdf]
 ᵒ Breaking up is hard to do: Xoar (Colp et al., SOSP ’11) [http://www.cs.ubc.ca/~andy/papers/xoar-sosp-final.pdf]
 ᵒ Qubes (driver domains) [http://qubes-os.org]
 ᵒ Citrix XenClient XT (network driver domains) [http://www.citrix.com/xenclient]
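For illustration, a driver domain of the kind described above could be set up with the open-source xl toolstack roughly as follows. This is a sketch, not XenServer's actual mechanism: the PCI address, domain name and kernel path are hypothetical, and exact option support (e.g. `driver_domain`) varies by Xen version – see the xl.cfg documentation.

```
# On the host, first make the NIC assignable for PCI pass-through:
#   xl pci-assignable-add 0000:04:00.0

# net-driver-domain.cfg – configuration sketch for a network driver domain
name          = "net-driver-domain"
kernel        = "/boot/vmlinuz-driverdom"   # kernel chosen for its NIC drivers
memory        = 512
vcpus         = 2
driver_domain = 1                           # tell libxl this domain hosts backends
pci           = [ '0000:04:00.0' ]          # pass the physical NIC through
```

The driver domain then runs the physical NIC driver plus netback (and, say, a vswitch), while dom0 no longer touches that device.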
Targets for disaggregation
• Storage driver domains – storage stack (e.g. VHD), ring backends, device drivers
• Network driver domains – network stack (e.g. OVS), ring backends, device drivers
• xenstored domain – busy, central control database for low level functionality
• xapi management domain
• qemu domain (per node, per tenant or per-VM) – emulated BIOS etc.
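Once driver domains exist, a guest needs a way to name one as the backend for a virtual device. With the xl toolstack this can be expressed via the `backend=` key of the disk and vif specifications; the domain names and target path below are hypothetical, and XenServer's own xapi/xenopsd stack would drive the equivalent plumbing through its own APIs rather than xl config files:

```
# Guest config fragment: virtual devices served by driver domains, not dom0
disk = [ 'format=raw, vdev=xvda, backendtype=phy, backend=storage-dd, target=/dev/vg0/guest1' ]
vif  = [ 'bridge=xenbr0, backend=net-dd' ]
```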
Benefits
• Better VM density
• Better use of scale-out hardware (NUMA, many cores, balanced I/O)
• Improved stability
• Improved availability (fault isolation)
• Opportunity for secure boot etc.
• More extensible, future value-add opportunities
[Diagram: Windsor host architecture. Domain 0 is reduced to a domain manager (xenopsd, libxl, healthd); xapi runs in its own management domain; a logging domain runs syslogd; qemu domains host the device models; network driver domains (networkd, vswitch) and NFS/iSCSI and local storage driver domains (storaged, tapdisk, blktap3) own their physical NICs, RAID controllers etc. via PCI pass-through or SR-IOV VFs. Domains communicate via dbus over v4v; user VMs connect their frontends (NF, BF) to backends (NB, gntdev) in the driver domains over the Xen hypervisor.]
Replication to scale-out on big servers
• Scalability: more VMs, more system domains
• Isolation for cloud environments
[Diagram: each host runs a single dom0 plus XAPI, console, ISV, logging and xenstored domains, and multiple replicated sets of qemu, NIC and storage driver domains, each set serving its own group of user VMs; shown across two XenServer hosts.]
NUMA affinity and memory
[Diagram: balanced I/O placement on a two-socket NUMA server connected by QPI. Each socket has its own cores, DIMMs, PCIe I/O hub and NICs; a network driver domain and the user VMs it serves are kept on the same NUMA node so that all I/O processing, DMA etc. stays local to that node, with redundant failover to the other node.]
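The placement sketched in the diagram above can be approximated with vCPU affinity in an xl configuration; the node numbers are hypothetical and the `node:` CPU-list syntax depends on the xl version in use:

```
# Keep a driver domain (and likewise the VMs it serves) on NUMA node 0
cpus = "node:0"      # restrict vcpus to the pcpus of node 0; Xen's NUMA
                     # placement then allocates memory from that node

# Affinity can also be adjusted at runtime, e.g.:
#   xl vcpu-pin net-driver-domain all node:0
```

Keeping the backend, the device's DMA traffic and the guest's memory on the same socket avoids cross-QPI traffic.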
Flexibility – value-add storage appliances (today)
[Diagram: a storage appliance VM applies its "clever stuff" to a local SR and exposes block storage to dom0 via NFS/iSCSI over local networking; a user VM's block accesses therefore traverse several rings (frontends and backends in the appliance, dom0 and the user VM).]
Flexibility – value-add storage appliances (Windsor)
[Diagram: the appliance's "clever stuff" runs with blktap3 in a storage driver domain with direct access to local disks/SSDs. The user VM's block frontend connects over a local blk ring, faster than NFS/iSCSI via dom0; a network link between cluster nodes (PV VIF or PCI pass-through NIC/SR-IOV VF) provides remote access.]
Summary
• Next generation XenServer architecture built to scale-out using domain 0 disaggregation techniques
 ᵒ Scale with the workload and server size
 ᵒ Expect to significantly enhance scalability and aggregate performance
• Well defined APIs between components will allow better extensibility
• Avoids complexity of single, large, multi-function domain 0
 ᵒ Easier to reason about
 ᵒ Easier to maintain and debug
 ᵒ Containment of failures