  • 1. Stub Domain, Device Model Domain, and PV-GRUB Kwon-yong Lee Distributed Computing & Communication Lab. (URL: Dept. of Computer Science Sogang University Seoul, Korea Tel : +82-2-3273-8783 Email :
  • 2. Domain0 Disaggregation
    • Big Dom0 Problems
      • Running a lot of Xen components
        • Physical device drivers
        • Domain manager
        • Domain builder
        • ioemu device models
        • PyGRUB
      • Security issues
        • Most of the components run as root.
      • Scalability issues
        • The hypervisor cannot schedule them appropriately itself.
    • Goal
      • Move the components to separate domains
      • Helper domains
        • Driver domain, Builder domain, Device model domains, etc.
  • 3. PyGRUB
    • Acts as a “PV bootloader”
    • Allows booting from a kernel that resides within the DomU disk or partition image
    • Needs to be root to access guest disk
      • Security issues
    • Can’t network boot
    • Re-implements GRUB
    [Figure: Xen Hypervisor; Dom0 (xend, Linux, PyGRUB); PV Domain (menu.lst, vmlinuz, initrd)]
  • 4. Mini-OS
    • A sample PV guest for the Xen hypervisor
      • Very simple
      • Relies completely on the hypervisor to access the machine
        • Uses the Xen network, block, and console frontend/backend mechanism
      • Supports only
        • Non-preemptive threads
        • One virtual memory address space (no user space)
        • Single CPU (mono-VCPU)
  • 5. Mini-OS
    • Xen 3.3
      • Mini-OS has been extended to run the newlib C library and the lwIP stack, providing a basic POSIX environment that includes TCP/IP networking.
      • xen-3.3.1/extras/mini-os/
    • Note: being tested at Cisco for IOS
  • 6. xen-3.3.1/extras/mini-os/README
    Minimal OS
    ----------
    This shows some of the stuff that any guest OS will have to set up. This includes:
      * installing a virtual exception table
      * handling virtual exceptions
      * handling asynchronous events
      * enabling/disabling async events
      * parsing start_info struct at start-of-day
      * registering virtual interrupt handlers (for timer interrupts)
      * a simple page and memory allocator
      * minimal libc support
      * minimal Copy-on-Write support
      * network, block, framebuffer support
      * transparent access to FileSystem exports (see tools/fs-back)
    - to build it just type make.
    - to build it with TCP/IP support, download LWIP 1.3 source code and type
        make LWIPDIR=/path/to/lwip/source
    - to build it with much better libc support, see the stubdom/ directory
    - to start it do the following in domain0 (assuming xend is running)
        # xm create domain_config
    This starts the kernel and prints out a bunch of stuff and then once every second the system time.
    If you have setup a disk in the config file (e.g. disk = [ 'file:/tmp/foo,hda,r' ] ), it will loop reading it. If that disk is writable (e.g. disk = [ 'file:/tmp/foo,hda,w' ] ), it will write data patterns and re-read them.
    If you have setup a network in the config file (e.g. vif = [''] ), it will print incoming packets.
    If you have setup a VFB in the config file (e.g. vfb = ['type=sdl'] ), it will show a mouse with which you can draw color squares.
    If you have compiled it with TCP/IP support, it will run a daytime server on TCP port 13.
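The config fragments quoted in the README can be collected into a single domain config file. xm config files use Python syntax, so the sketch below is plain Python. The disk, vif, and vfb values are the README's own examples; the kernel path, domain name, and memory size are assumptions added for illustration.

```python
# Hypothetical domain_config for the Mini-OS example (xm config files are Python).
# kernel, name, and memory are illustrative assumptions, not taken from the README.
kernel = "mini-os.gz"   # assumed path to the built Mini-OS image
name = "mini-os-demo"   # assumed domain name
memory = 32             # assumed memory size in MiB

# The values below are the examples quoted in the README.
disk = ['file:/tmp/foo,hda,w']  # writable disk: Mini-OS writes data patterns and re-reads them
vif = ['']                      # network interface with defaults: Mini-OS prints incoming packets
vfb = ['type=sdl']              # virtual framebuffer: mouse-driven drawing of color squares
```

With xend running, such a file would be started from domain0 with `xm create domain_config`, as the README describes.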
  • 7. POSIX Environment on top of Mini-OS
    [Figure: software stack. Xen Hypervisor; Mini-OS (Sched, MM, and Console, Network, Block, FS, FB frontends); newlib, lwIP, and additional code (getpid, sig, mmap, …); Application]
  • 8. POSIX Environment on top of Mini-OS
    • lwIP (lightweight IP)
      • Provides a lightweight TCP/IP stack
        • Just connects to the network frontend of Mini-OS
      • Widely used open source TCP/IP stack designed for embedded systems
      • Reduces resource usage while still providing a full-scale TCP implementation
    • Compare: uIP
      • TCP/IP stack for 8-bit microcontrollers
  • 9. POSIX Environment on top of Mini-OS
    • newlib
      • Provides the standard C library functions
      • Or GNU libc
    • Others
      • getpid and similar functions return a fixed value, e.g. 1.
        • Mini-OS has no notion of a Unix process
      • The sig functions can be empty (no-ops).
        • Mini-OS has no signals either
      • mmap is only implemented for one case.
        • Anonymous memory
  • 10. POSIX Environment on top of Mini-OS
    • Disk frontend
    • FrameBuffer frontend
    • FileSystem frontend (to access part of the Dom0 FS)
      • Through the FileSystem frontend/backend mechanism
        • Imported from JavaGuest
          • By using a very simple virtualized kernel, the JavaGuest project avoids all the complicated semantics of a full-featured kernel, and hence permits far easier certification of the semantics of the JVM.
    • More advanced MM
      • Read-only memory
      • CoW for zeroed pages
  • 11. POSIX Environment on top of Mini-OS
    • Running a Mini-OS example
      • A timestamp is printed once per second
      • xm create -c domain_config
      • To disconnect from the domain's console, press ‘Ctrl+]’
    • Cross-compilation environment
      • binutils, gcc, newlib, lwip
      • Example: ‘Hello World!’
        • xen-3.3.1/stubdom/c/
  • 12. Old HVM Device Model (< Xen 3.3)
    • Modified version of qemu, ioemu
      • To provide HVM domains with virtual hardware
      • Used to run in Dom0 as a root process, since it needs direct access to disks and the tap network interface
      • Problems
        • Security
          • The qemu code base was not particularly designed to be safe
        • Efficiency
          • When an HVM guest performs an I/O operation, the hypervisor hands control to Dom0, which may not schedule the ioemu process immediately, leading to uneven performance.
  • 13. Old HVM Device Model
    • Has to wait for Dom0 Linux to schedule qemu
    • Consumes Dom0 CPU time
    [Figure: Xen Hypervisor; HVM Domain (IN/OUT); Dom0 (qemu on Linux)]
  • 14. Xen 3.3.1 (compared to 3.2)
    • Power management (P & C states) in the hypervisor
    • HVM emulation domains (qemu-on-minios) for better scalability, performance and security
    • PVGrub: boot PV kernels using real GRUB inside the PV domain
    • Better PV performance: domain lock removed from pagetable-update paths
    • Shadow3: optimizations to make this the best shadow pagetable algorithm yet, making HVM performance better than ever
    • Hardware Assisted Paging enhancements: 2MB page support for better TLB locality
    • CPUID feature leveling: allows safe domain migration across systems with different CPU models
    • PVSCSI drivers for SCSI access direct into PV guests
    • HVM frame-buffer optimizations: scan for frame-buffer updates more efficiently
    • Device pass-through enhancements
    • Full x86 real-mode emulation for HVM guests on Intel VT: supports a much wider range of legacy guest OSes
    • New qemu merge with upstream development
    • Many other changes in both x86 and IA64 ports
  • 15. HVM Device Model Domain (Xen 3.3 Feature)
    • In Xen 3.3, ioemu can be run in a Stub Domain.
      • Dedicated Device Model Domain for each HVM domain
      • Device Model Domain
        • Processes the I/O requests of the HVM guest
        • Uses the regular PV interface to actually perform disk and network I/O
  • 16. Stub Domain
    • Helper domains for HVM guests
      • Because the emulated devices are processes in Dom0, their execution time is accounted to Dom0.
        • An HVM guest performing a lot of I/O can cause Dom0 to use an inordinate amount of CPU time, preventing other guests from getting their fair share of the CPU.
      • Each HVM guest would have its own stub domain, responsible for its I/O.
        • Small stub domains run nothing other than the device emulators.
      • Based on Mini-OS
      • xen-3.3.1/stubdom/
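A minimal sketch of how an HVM guest config might select a stub-domain device model instead of the Dom0 qemu process. xm config files use Python syntax. The `device_model` path is believed to match what the README under xen-3.3.1/stubdom/ describes, but every value here should be treated as an assumption and checked against the README of the installed version; the guest name and memory size are hypothetical.

```python
# Hypothetical HVM guest config using a stub-domain device model (xm config files are Python).
# All paths are assumptions that depend on the Xen version and install prefix.
builder = 'hvm'                                 # build an HVM (not PV) guest
kernel = '/usr/lib/xen/boot/hvmloader'          # assumed location of the HVM firmware loader
device_model = '/usr/lib/xen/bin/stubdom-dm'    # assumed script that starts qemu inside a stub domain
name = 'hvmconfig'                              # hypothetical guest name
memory = 512                                    # hypothetical memory size in MiB
```

The point of the sketch is only that the choice between the old and new device model is made per guest, which matches the slide's "each HVM guest would have its own stub domain".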
  • 17. Stub Domain
    • Tricky scheduling
      • The current schedulers in Xen are based on the assumption that virtual machines are, for the most part, independent.
        • If domain 2 is under-scheduled, this doesn’t have a negative effect on domain 3.
      • With HVM and stub domain pairs,
        • The HVM guest is likely to be performance-limited by the amount of time allocated to the stub domain.
        • If the stub domain is under-scheduled, the HVM domain sits idle waiting for I/O.
      • Potential solutions
        • Doors
        • Scheduler domains
  • 18. Stub Domain
    • Doors
      • From the Spring operating system and later Solaris
      • IPC mechanism
        • Allows a process to delegate the rest of its scheduling quantum to another process
        • The stub domain would run whenever the pair needed to be scheduled.
        • It would then perform pending I/O emulation and issue a “delegate” scheduler operation (instead of “yield”) to the HVM guest, which would then run for the remainder of the quantum.
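The delegate idea can be illustrated with a toy model of one scheduling quantum handed to an (HVM guest, stub domain) pair. This is a sketch under stated assumptions, not Xen code: the function name, the microsecond units, and the rule that a guest with unfinished I/O stays blocked are all hypothetical simplifications.

```python
def run_pair_quantum(quantum_us, pending_io_costs_us):
    """Model one quantum given to an (HVM guest, stub domain) pair.

    The stub domain runs first and emulates pending I/O requests in order,
    as long as each one fits in the time left. If all requests are done,
    it delegates the rest of the quantum to the guest. In this toy model
    a guest with I/O still pending is assumed to be blocked and gets nothing.

    Returns (stub_time_us, guest_time_us, io_left).
    """
    remaining = quantum_us
    stub_time = 0
    io_left = list(pending_io_costs_us)
    # Stub domain phase: emulate requests while they fit in the quantum.
    while io_left and io_left[0] <= remaining:
        cost = io_left.pop(0)
        stub_time += cost
        remaining -= cost
    # Delegate phase: hand the remainder to the guest only if it is unblocked.
    guest_time = remaining if not io_left else 0
    return stub_time, guest_time, io_left
```

For example, `run_pair_quantum(1000, [100, 200])` spends 300 µs in the stub domain and delegates the remaining 700 µs to the guest, whereas with a plain yield the guest would have to wait for its own next turn.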
  • 19. Stub Domain
    • Scheduler domains
      • Proposed by IBM based on work in the Nemesis Exokernel
      • Similar conceptually to the N:M threading model
        • The hypervisor’s scheduler would schedule this domain, and it would be responsible for dividing time amongst the others in the group.
        • In this way, the scheduler domain fulfills the same role as the user-space component of an N:M threading library.
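A toy sketch of the two-level arrangement: the hypervisor divides a period among groups, and each group's scheduler domain divides its slice among the members. Weighted proportional sharing is an assumption chosen for illustration; the slides do not say which policy a scheduler domain would use, and all names below are hypothetical.

```python
def proportional_share(total_us, weights):
    """Split total_us among the keys of weights in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total_us * w // total_weight for name, w in weights.items()}


def two_level_schedule(period_us, group_weights, member_weights):
    """Hypervisor level: split the period among groups.
    Scheduler-domain level: split each group's slice among its members."""
    group_slices = proportional_share(period_us, group_weights)
    return {
        group: proportional_share(group_slices[group], member_weights[group])
        for group in group_weights
    }
```

For instance, `two_level_schedule(1000, {"pair": 1, "pv": 1}, {"pair": {"hvm": 3, "stubdom": 1}, "pv": {"pv": 1}})` gives the pair 500 µs, split 375/125 between the HVM guest and its stub domain, without the hypervisor needing to know about that inner split.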
  • 20. HVM Device Model Domain
    • Almost unmodified qemu
    • Relieves Dom0
    • Provides better CPU usage accounting
    • More efficient
      • Lets the hypervisor schedule it directly
      • Runs on a more lightweight OS
    • A lot safer
    [Figure: Xen Hypervisor; HVM Domain (IN/OUT); stubdom (qemu on Mini-OS); Dom0 (Linux), reached via PV]
  • 21. HVM Device Model Domain
    • Performance
      • inb: latency of I/O port accesses
        • The round trip time between the application in the HVM domain and the virtual device emulation part of qemu
  • 22. HVM Device Model Domain
      • Disk performance
    [Figure: disk performance chart; axis: CPU %]
  • 23. HVM Device Model Domain
      • Network performance
        • e1000
  • 24. HVM Device Model Domain
      • Network performance
        • Dual-core
  • 25. PV-GRUB
    • PyGRUB used to act as a “PV bootloader”
    • PV-GRUB
      • Real GRUB source code recompiled against Mini-OS
      • Runs inside the PV domain that will host the PV guest
      • Boots inside the PV domain
      • Detects the PV disks and network interfaces of the domain
      • Uses them to access the PV guest’s menu.lst
      • Uses the regular PV console to show the GRUB menu
      • Uses the PV interface to load the kernel image from the guest disk image
    • More secure than PyGRUB
      • Uses only the resources that the PV guest will use
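A minimal sketch of a PV guest config that boots through PV-GRUB rather than PyGRUB. xm config files use Python syntax. Both paths and the disk image are assumptions: the PV-GRUB image name and location vary with the Xen version, architecture, and install prefix, and the menu.lst location depends on the guest's disk layout.

```python
# Hypothetical PV guest config booting via PV-GRUB (xm config files are Python).
# PV-GRUB is loaded as the domain's "kernel"; it then reads menu.lst from the
# guest's own disk and loads the real kernel from there.
kernel = "/usr/lib/xen/boot/pv-grub.gz"    # assumed PV-GRUB image path; name varies by version
extra = "(hd0,0)/boot/grub/menu.lst"       # assumed GRUB-style path to menu.lst on the guest disk
disk = ['file:/path/to/guest.img,hda,w']   # hypothetical guest disk image
vif = ['']                                 # network interface, also usable for network boot
name = "pv-guest"                          # hypothetical domain name
memory = 256                               # hypothetical memory size in MiB
```

Because PV-GRUB runs inside the guest's own domain, it sees only the disk and vif entries listed here, which is the security argument made on the slide.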
  • 26. PV-GRUB
    • Start
  • 27. PV-GRUB
    • Loading
  • 28. PV-GRUB
    • Loaded
    • kexec (kernel execution)
      • Allows “live” booting of a new kernel over the currently running one
  • 29. PV-GRUB
  • 30. PV-GRUB
    • Executes upstream GRUB
      • Replaces native drivers with Mini-OS drivers
      • Adds a PV-kexec implementation
    • Uses only the target PV guest’s resources
    • Improves security
    • Provides network boot
  • 31. Reference
    • Samuel Thibault, Citrix/Xensource, “Stub Domains: A Step Towards Dom0 Disaggregation”
    • Samuel Thibault and Tim Deegan, “Improving Performance by Embedding HPC Applications in Lightweight Xen Domains”, HPCVIRT’08, Oct. 2008.
    • “The Definitive Guide to the Xen Hypervisor”
      • Xen 3.3 Features: Stub Domains
      • Xen 3.3 Features: HVM Device Model Domain
      • Xen 3.3 Features: PV-GRUB
  • 32. HVM Configuration
    • Para-virtualization
      • Hypercall
    • HVM (hardware virtual machine)
      • Hardware support is needed to trap privileged instructions.
      • Trap-and-emulate approach
      • Processor flag
        • vmx : virtual machine extensions – Intel CPU
        • svm : secure virtual machine – AMD CPU
      • In Intel’s VT architecture
        • Uses VM exit and VM entry operations, which are costly
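Whether a Linux host advertises these processor flags can be checked by scanning the `flags` lines of /proc/cpuinfo. The helper below is a small sketch; the function name is hypothetical, and it only reports what the kernel exposes, which may differ from what the hardware supports (for example when the feature is disabled in firmware).

```python
def hvm_flags(cpuinfo_text):
    """Return the HVM-related flags ('vmx', 'svm') found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only "flags : ..." lines carry the CPU feature list.
        if sep and key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found
```

On a Linux machine it could be called as `hvm_flags(open("/proc/cpuinfo").read())`; an empty set means neither flag is advertised.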