The 9th International Conference for Young Computer Scientists




                                 DVMM: a Distributed VMM for Supporting
                                    Single System Image on Clusters

                                  Jinbing Peng     Xiang Long       Limin Xiao
                       School of Computer Science & Engineering, Beihang University, Beijing
                   pengjinbing@les.buaa.edu.cn long@les.buaa.edu.cn xiaolm@buaa.edu.cn



                              Abstract

     Providing a single system image (SSI) on clusters has
long been a hot topic in parallel computer architecture
research, since SSI makes clusters easier to program and
administer. Currently, most SSI studies focus on the
middleware level of clusters, which leads to problems such
as poor transparency and low performance. This paper
presents a novel solution that provides SSI on clusters
using a distributed virtual machine monitor (DVMM)
built on hardware-assisted virtualization technologies.
The DVMM consists of symmetrical, cooperative VMMs
distributed across the nodes. Through their cooperation,
the VMMs virtualize the distributed hardware resources
to support SSI on a cluster. Thus, the DVMM can run an
unmodified legacy operating system (OS) transparently
on a cluster. Compared with related work, our solution
offers good transparency, high performance and easy
implementation.

Keywords: SSI, virtualization, hardware-assisted
virtualization, VMM, DVMM.

1. Introduction

     Parallel computer architecture has developed in two
directions: the shared memory architecture, represented
by the SMP (Symmetric Multiprocessor), and the
distributed memory architecture, represented by the COW
(Cluster of Workstations) [1]. The shared memory
architecture supports the shared memory programming
model and offers good programmability, but it scales
poorly because of constraints such as the bandwidth of the
shared memory. The distributed memory architecture uses
the message passing programming model and is harder to
program, but it scales well because it uses loosely coupled
interconnects.
     Since the advantages of the two architectures are
complementary, it is natural to try to combine them. One
way to do so is to implement the image of a shared
memory architecture on top of distributed memory
hardware; both DSM and SSI on clusters are typical
examples of this approach.
     This paper presents a novel solution that provides SSI
on clusters using a DVMM with hardware-assisted
virtualization technologies. The rest of the paper is
organized as follows. Section 2 describes the background
of SSI and virtualization technologies and surveys related
work. Section 3 describes the implementation of the
DVMM. Section 4 compares the DVMM with existing
solutions. Finally, Section 5 concludes.

2. Background

2.1. Single system image

     SSI means that all the distributed resources are
organized into a single uniform unit for users, so that users
are unaware of the distributed nature of the resources. SSI
comprises attributes such as a single memory space, a
single process space and a single I/O space [2].
     The SSI of a cluster can be implemented at the
hardware level, the underware level, the middleware level
or the application level. There are currently few solutions
at the hardware level; they include Enterprise X-
Architecture [3], cc-NUMA [4] and DSM [5]. These
solutions rely on special chips or hardware, so they are
costly and narrowly applicable. Solutions at the underware
level are also rare; representative ones are MOSIX [6] and
Sun Solaris-MC [2]. SSI implemented at this level is
highly transparent to users but difficult to build. Current
solutions at this level can only



978-0-7695-3398-8/08 $25.00 © 2008 IEEE                           183
DOI 10.1109/ICYCS.2008.190
implement part of the SSI attributes. There are many
solutions at the middleware level; typical work includes
distributed shared memory systems such as IVY [7],
parallel and distributed file systems such as Lustre [8],
resource management and load scheduling systems such
as LSF [9], and parallel programming environments such
as MPI and PVM [10]. SSI implemented at this level has
poor transparency. Solutions at the application level are
rare; the representative one is LVS [11].
     In summary, the SSI of a cluster may be implemented
at the application level, the middleware level, the
underware level or the hardware level. From the top down,
the difficulty of implementing SSI increases, but so does
the transparency offered to users. Currently most studies
focus on the middleware level, which leads to problems
such as poor transparency. There are few solutions at the
application, underware or hardware levels, and those that
exist have their own drawbacks: hardware-level solutions
are costly, underware-level solutions cannot implement
the full set of SSI attributes, and application-level
solutions have poor flexibility.

2.2. Virtualization

     Virtualization means that computation is performed
on a virtual base instead of the real base. By means of
virtualization techniques, a virtual platform can be
constructed between the hardware and the OS, creating
multiple domains on one hardware platform; the domains
are isolated from each other, and each domain can run its
own OS and applications [12].
     Virtualization techniques can be classified into full
virtualization, para-virtualization, pre-virtualization and
hardware-assisted virtualization [13][14]. Hardware-
assisted virtualization is the most advanced of these. VT-x
[15] is a hardware-assisted virtualization technology for
the IA-32 architecture. VT-x adds a new operating mode,
called VMX (Virtual Machine Extensions), to the
processor and defines two VMX transitions, VM entry
and VM exit. It also adds a VMCS (Virtual-Machine
Control Structure) and ten new instructions for controlling
VMs. With the support of VT-x, the design of a VMM can
be simplified, and full virtualization can be implemented
without binary translation.

2.3. Related work

     The essence of virtualization is to separate software
from hardware by abstracting the physical resources. The
goal of SSI is to hide the distributed hardware
environment of the cluster. Thus, SSI can be implemented
by virtualization.

2.3.1. Virtual Multiprocessor. Virtual Multiprocessor
[16] implements an 8-way shared memory virtual
machine on a cluster of 8 PCs. The VMMs run in user
space with the support of a host OS. Para-virtualization is
applied to the guest OS. The shared memory space is
supported by DSM, the virtual processors are emulated by
dedicated processes, and I/O virtualization is implemented
through cooperation between the VMMs and a dedicated
I/O server. The disadvantages of this system are that
running the VMMs at the application level leads to low
performance and weak flexibility, and that para-
virtualization requires modifying the guest OS; moreover,
only the devices in the I/O server can be used, which
limits its applicability and complicates its implementation.
Furthermore, Virtual Multiprocessor cannot provide SSI
on an SMP cluster.

2.3.2. vNUMA. vNUMA [17] implements a 2-way
NUMA virtual machine on a cluster of two workstations,
each with an IA-64 processor. The VMMs run directly on
the hardware without host OS support. Pre-virtualization
is used to modify the guest OS, which is forced to run in
ring 1. The shared memory is supported by DSM. One
node is the master node, from which the system boots.
The disadvantages of vNUMA are that pre-virtualization
requires modifying the guest OS and that demoting the
privilege level of the guest OS can cause privilege
confusion. vNUMA also cannot provide SSI on an SMP
cluster.

3. Design and implementation of DVMM

3.1. Overview

     The goal of the DVMM is to hide the distributed
hardware attributes, provide SSI on an SMP cluster, and
allow a single OS to run transparently on the cluster.
Three essential problems must therefore be solved.
Firstly, the distributed hardware configurations of the
cluster must be detected and merged into global
information. Secondly, the global hardware resources
must be virtualized and presented to the OS. Thirdly, the
OS must be able to manage, schedule and utilize the
global resources just as on a single SMP machine.
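As a concrete illustration of the VM-exit/VM-entry control transfers that VT-x provides (section 2.2), the dispatch step can be sketched as a toy simulation in Python. The exit-reason strings, handler names and data structures below are invented for illustration; real VT-x reports numeric exit reasons through the VMCS.

```python
# Toy simulation of a VM-exit dispatch loop in a VT-x style VMM.
# All names and exit-reason strings are invented for illustration;
# real VT-x uses numeric exit reasons read from the VMCS.

HANDLERS = {}

def exit_handler(reason):
    """Register a handler for one VM-exit reason."""
    def register(fn):
        HANDLERS[reason] = fn
        return fn
    return register

@exit_handler("io_instruction")
def handle_io(exit_info):
    # A real VMM would emulate the port access or forward it
    # to the node that owns the device (section 3.2.2).
    return "emulated I/O on port %#x" % exit_info["port"]

@exit_handler("page_fault")
def handle_page_fault(exit_info):
    # A real VMM would consult the shadow page table / DSM layer.
    return "resolved guest page %#x" % exit_info["gpa"]

def dispatch_vm_exit(reason, exit_info):
    """Analyze the exit reason and invoke the matching handler,
    then (conceptually) VM-enter back into the guest."""
    handler = HANDLERS.get(reason)
    if handler is None:
        raise RuntimeError("unhandled VM exit: " + reason)
    return handler(exit_info)
```

On real hardware the loop around `dispatch_vm_exit` would be a VMLAUNCH/VMRESUME cycle; here it only models the "analyze the reason, invoke the module, return to the guest" flow.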




     To provide SSI on a cluster, a new layer named the
DVMM is added between the OS and the cluster
hardware. The DVMM consists of symmetrical,
cooperative VMMs distributed across the cluster. A single
OS supporting the cc-NUMA architecture runs on the
DVMM. Using hardware-assisted virtualization
technologies, the DVMM detects and merges the physical
resources of the cluster to form global information,
virtualizes all the physical resources, and presents the
virtual resources to the OS. The OS schedules and runs
processes and manages and allocates the virtual resources;
these actions are transparent to the DVMM. The DVMM
intercepts the OS's resource accesses and handles them on
its behalf, for example by mapping virtual resources to
physical resources and manipulating the physical
resources. In this way the OS sees, and can manage and
utilize, all the resources of the cluster, while the
distributed nature of the hardware is hidden and the whole
cluster is presented to the OS as a cc-NUMA virtual
machine.

3.2. Strategies

     Providing SSI on a cluster poses the problems of
detecting, presenting and utilizing the resources of the
cluster. Our strategies for solving them are: detect the
physical resources of each node while the VMMs start up,
and integrate them through communication among the
VMMs; virtualize the physical resources and report them
to the OS using hardware-assisted virtualization; and
manage and utilize the physical resources of the cluster
through cooperation between the OS and the DVMM.
The details are as follows.

3.2.1. Resource detection and merger. The BIOS is
emulated and extended into the eBIOS (Extended Basic
Input/Output System). After the eBIOS acquires the
information about the physical resources of its own node,
it communicates with the other nodes to collect the
information about the physical resources of the whole
cluster and merges it into global physical resource
information. Based on this information, the DVMM
reserves some resources and virtualizes the rest. The
DVMM then organizes the virtual resources: it builds the
various resource mapping tables, implements the
mappings from virtual resources to physical resources and
from physical resources to nodes, and creates the global
virtual resource information table. The OS boots on the
virtual resources; its BIOS calls are captured, and the
information about the global virtual resources is reported
to the OS, so that the OS becomes aware of them.

3.2.2. Resource virtualization. Resource virtualization
includes ISA virtualization, interrupt mechanism
virtualization, memory virtualization and I/O device
virtualization. Unlike existing virtualization techniques,
the technique in this paper virtualizes resources across the
nodes.
     The IA-32 ISA is virtualized through VT-x, using
techniques similar to those in the HVM of Xen [18]. The
interrupt mechanism is virtualized as follows. The
DVMM emulates the interrupt controllers in software and
intercepts the OS's accesses to them: if the target interrupt
controller is on the native node, the DVMM manipulates
it to reflect the guest's operation; if the target interrupt
controller is on a remote node, the DVMM sends the
access request to the target node, where the target VMM
manipulates the virtual interrupt controller accordingly.
The DVMM also catches hardware interrupts, and the
contents of the virtual interrupt controller are modified by
the native VMM or by a remote VMM, depending on the
node where the interrupted object resides, so that the
interrupt can be delivered to the OS. The distributed
memory resources are virtualized by combining Shadow
Page Tables (SPT) with software DSM: the memory
resources of the cluster are merged into a distributed
shared memory with the software DSM, and the
distributed shared memory is then virtualized with the
SPT. I/O operations are intercepted by VT-x: if an I/O
operation is to be processed on the native node, the native
VMM executes the intercepted instruction and returns the
results to the OS; if it is to be done on a remote node, the
I/O instruction is sent to the target VMM for execution
and the results are sent back to the native VMM, and then
to the OS.

3.2.3. Resource management and utilization. The OS
manages and utilizes the virtual resources, and the
DVMM manages and utilizes the physical resources. The
OS interacts with the DVMM through VM entry and VM
exit [15]. Based on the virtual resources, the OS schedules
and runs processes and manages and allocates the virtual
resources independently; this is transparent to the
DVMM. When the OS executes a sensitive instruction or
a trap or interrupt occurs, control is switched to the
DVMM by a VM exit. The DVMM handles the event
according to the exit reason, for example by allocating or
manipulating physical devices. After the DVMM handles
the event that triggered the VM exit, the results and the
control are returned



to the OS through a VM entry. Through these interactions
between the OS and the DVMM, the management and
utilization of the global physical resources are
accomplished.

3.3. Design and implementation

3.3.1. System architecture. The system architecture is
shown in figure 1. From the bottom up, the system
consists of the hardware level, the DVMM level and the
OS level. The hardware level comprises SMP nodes
interconnected by gigabit Ethernet, whose CPUs support
VT-x. The DVMM level contains the symmetrical,
cooperative VMMs distributed on the nodes, which
communicate through dedicated communication
software. The OS can be any OS that supports cc-NUMA.
The key element in implementing this system is
constructing the DVMM.

            Figure 1. System architecture

3.3.2. DVMM structure. The DVMM is composed of
the VMMs distributed on the nodes and runs on the bare
machines. The functions of each VMM are to detect,
integrate and virtualize the physical resources, to report
the virtual resources to the OS, and to cooperate across
the nodes. The structure of the DVMM is shown in
figure 2.

             Figure 2. DVMM structure

     The initialization module loads and runs the VMM.
The eBIOS module detects and integrates the resource
information of the cluster and reports it to the OS. The
ISA virtualization module virtualizes the IA-32 ISA and
cooperates with the interrupt virtualization module so that
the OS can manage and schedule the virtual computing
resources. The I/O virtualization module virtualizes the
global I/O resources. The interrupt virtualization module
virtualizes the interrupt control mechanism and notifies
the OS of interrupt events. The MMU virtualization
module virtualizes the memory resources and ensures that
the OS runs correctly in the virtual physical address space.
The DSM module implements a distributed shared
memory transparently. The communication module
provides the communication service for the cooperative
VMMs.

3.3.3. DVMM mechanism. The DVMM mechanism is
shown in figure 3.

              Figure 3. DVMM mechanism

     The ISA virtualization module is both the entry point
and the exit point of the DVMM. It may invoke every
other module of the VMM except the communication
module, and vice versa. When a VM exit occurs, this
module analyzes the exit reason and invokes the
appropriate module to handle it; when a module
completes its duties, it invokes this module to return to
the guest system. The communication module is the basis
of the cooperation among the VMMs. It may invoke every
other module of the VMM except the ISA virtualization
module, and vice versa. The eBIOS module is used only
during the initialization of the DVMM and the boot of the
OS. First, the eBIOS module invokes the interrupt
virtualization module, the I/O virtualization module and
the communication module to detect and build the
resource information of the whole system. Second, when
the ISA virtualization module captures the calls to the
BIOS while the OS boots, the eBIOS module returns the
information about the virtual resources of the whole
system to the OS. The I/O




virtualization module receives instructions from the ISA
virtualization module; depending on the node where the
I/O request should be performed, it either executes the I/O
instruction or invokes the communication module to send
the I/O request to the target node. When the I/O
virtualization module receives an I/O request from a
remote node, it manipulates the native I/O device and
sends the result back to the source node. The interrupt
virtualization module is invoked by the ISA virtualization
module to emulate the OS's operations on the virtual
interrupt controller; it also converts external interrupt
vectors into virtual interrupt vectors and injects virtual
interrupts into the OS. When the ISA virtualization
module captures a sensitive instruction or a trap related to
the MMU, it invokes the MMU virtualization module to
handle it. When the MMU virtualization module finds
that the requested page is not on the native node, it
invokes the DSM module to get the page. Invoked by the
MMU virtualization module, the DSM module requests
the page from the remote node; invoked by the
communication module, it serves such a request and sends
the result to the remote node.
     Through the cooperation among the modules of the
DVMM, built on resource virtualization, the SSI of the
SMP cluster is implemented.

4. Discussion

     There are many existing solutions for providing SSI
on clusters; a few are based on virtualization techniques
and the others are not. To highlight the features of our
solution, we compare it with the existing solutions as
follows.

        Table 1. Comparison among DVMM, Virtual Multiprocessor and vNUMA

                  Level        Technique            Difficulty  Transparence  Symmetry  Performance  SMP support  ISA
  Virtual         Application  Para-virtualization  High        Poor          No        Low          No           IA-32
  Multiprocessor  level
  vNUMA           Underware    Pre-virtualization   High        Good          No        Moderate     No           IA-64
                  level
  DVMM            Underware    Hardware-assisted    Moderate    Good          Yes       High         Yes          IA-32
                  level        virtualization
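As a concrete illustration of the native-versus-remote handling that section 3.2.2 describes for interrupt-controller and I/O accesses, the routing decision can be sketched as a toy simulation in Python. The node IDs and the resource-to-node table below are invented for illustration.

```python
# Toy model of the DVMM routing rule from section 3.2.2: an access to a
# resource owned by the native node is handled locally; an access to a
# resource owned by a remote node is forwarded to that node's VMM.
# The resource-to-node table is invented for illustration.

RESOURCE_NODE = {
    "ioapic0": 0,   # interrupt controller on node 0
    "ioapic1": 1,   # interrupt controller on node 1
    "disk0":   1,   # disk attached to node 1
}

def route_access(native_node, resource, operation):
    """Return how the DVMM would handle one guest access."""
    owner = RESOURCE_NODE[resource]
    if owner == native_node:
        return ("local", "execute %s on %s" % (operation, resource))
    return ("remote",
            "forward %s for %s to node %d" % (operation, resource, owner))
```

The same rule applies to both interrupt-controller writes and port I/O in the paper's design; only the module that performs the local action differs.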
     As Table 1 shows, the DVMM has advantages over
Virtual Multiprocessor and vNUMA. Firstly, the DVMM
can implement SSI on SMP clusters, while Virtual
Multiprocessor and vNUMA cannot. Secondly, the
DVMM uses hardware-assisted virtualization to
implement full virtualization and needs no modification
of the guest OS, so its design and implementation are only
moderately difficult; Virtual Multiprocessor and vNUMA
adopt para-virtualization and pre-virtualization
respectively, both of which require modifying the guest
OS, so they are difficult to implement and limited in
applicability. Thirdly, the DVMM is implemented with
hardware assistance and runs on bare metal, so it has high
performance, while Virtual Multiprocessor and vNUMA
are implemented purely in software and thus perform
worse; Virtual Multiprocessor in particular is
implemented at the application level and must pass
through several software layers, lowering its performance
further. Finally, the nodes of the DVMM are fully
symmetrical, while the nodes of Virtual Multiprocessor
and vNUMA are not: one of them is a master node.
Besides, the DVMM is implemented at the underware
level while Virtual Multiprocessor is implemented at the
application level, so the DVMM is more transparent; and
because IA-32 is more widely used than IA-64, the
DVMM is more widely applicable and more useful than
vNUMA.
     Compared with the existing solutions surveyed in
section 2.1, the DVMM also has advantages. Firstly, the
DVMM does not require special hardware, so it is cheaper
and more widely applicable than the solutions at the
hardware level. Secondly, the DVMM can implement the
full set of SSI attributes, while the solutions at the
underware level implement only part of them, so the
DVMM is more useful. Thirdly, the DVMM offers better
transparency and higher performance than the solutions at
the middleware level. Finally, the DVMM offers better
flexibility and higher performance than the solutions at
the application level.

5. Conclusions and future work

     The DVMM implements the SSI of clusters at the
underware level based on hardware-assisted virtualization
technologies, so it can support an unmodified legacy OS
running transparently on a cluster. Compared with the
existing solutions for implementing the SSI of clusters,
the DVMM has clear advantages.




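The page-location check at the heart of the memory virtualization described above, serving a page locally when present and otherwise fetching it from the node that holds it via the DSM module, can be sketched as a toy simulation in Python. The page numbers and per-node memories below are invented for illustration.

```python
# Toy simulation of the DSM page-fetch path (sections 3.2.2 and 3.3.3):
# if the page is on the native node it is used directly; otherwise the
# DSM module requests a copy from the node that holds it.
# Page numbers and node memories are invented for illustration.

class Node:
    def __init__(self, node_id, pages):
        self.node_id = node_id
        self.pages = dict(pages)   # page number -> contents

    def access(self, page, cluster):
        """Return the page contents, fetching from a remote node if needed."""
        if page in self.pages:
            return ("local", self.pages[page])
        for other in cluster:
            if other is not self and page in other.pages:
                # Cache a local copy, as a software DSM would.
                self.pages[page] = other.pages[page]
                return ("remote:%d" % other.node_id, self.pages[page])
        raise KeyError("page %d not present on any node" % page)
```

A real DSM would also track ownership and invalidate cached copies on writes to keep the memory consistent; this sketch models only the read path.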
There are still further improvements to be made:                      Canada [OL]. http://www.LinuxVirtualServer.org/.
firstly, using the most advanced VT-d [19] and                         [12] James E.Smith, Ravi Nair. Virtual Machines: Versatile
EPT(Extended Page Tables) [20] techniques to reduce                         Platforms for Systems and Processes. ELSEVIER,
the implementing difficulty and adopting the processor                      2006.
                                                                       [13] VMware.        Understanding      Full    Virtualization,
consistency model instead of the sequential                                 Paravirtualization, and Hardware Assist. 2007.
consistency model for higher performance; secondly,                         [OL].http://www.vmware.com/files/pdf/VMware_para
detecting the physical resources dynamically to support                     virtualization.pdf
the dynamic change of the number of the nodes;                         [14] Joshua, LeVasseur, et al. Pre-Virtualization: Slashing
thirdly, adding the functions of resource management                        the          Cost          of        Virtualization[OL].
and load schedule to the DVMM for supporting                                http://l4ka.org/publications/2005/previrtualization-
multiple guest OS running transparently and separately                      techreport.pdf www.l4ka.org. 2005.
on a cluster.                                                          [15] Intel. Intel® 64 and IA-32 Architectures Software
                                                                            Developer’s Manual. Vol. 3:System Programming
                                                                            Guide. 2007.
Acknowledgment                                                         [16] Kenji Kaneda, Yoshihiro Oyama, and Akinori
     Yonezawa. A Virtual Machine Monitor for Providing a
     Single System Image (in Japanese). In Proceedings of
     the 17th IPSJ Computer System Symposium (ComSys
     ’05), pages 3–12, November 2005.
[17] M. Chapman and G. Heiser. Implementing transparent
     shared memory on clusters using virtual machines. In
     USENIX Annual Technical Conference, Anaheim, CA,
     USA, April 2005.
[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris,
     A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen
     and the Art of Virtualization. In Proceedings of the
     19th ACM SOSP, pages 164–177, October 2003.
[19] Intel. Intel® Virtualization Technology for Directed I/O [OL].
     http://www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm
[20] G. Neiger. Intel Virtualization Technology: Hardware
     Support for Efficient Processor Virtualization. Intel
     Technology Journal, 10(3), 2006.

This work is supported by the Hi-tech Research and
Development Program of China (863 Program, No. 2006AA01Z108).

References

[1]  D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer
     Architecture: A Hardware/Software Approach. China
     Machine Press, 1999.
[2]  R. Buyya, T. Cortes, and H. Jin. Single System Image (SSI).
     The International Journal of High Performance Computing
     Applications, 15(2):124–135, Summer 2001.
[3]  IBM Enterprise X-Architecture Technology [OL].
     http://www.unitech-ie.com/ole/doc_library/xArchitecture%20technology%202.pdf
[4]  B. C. Brock, G. D. Carpenter, et al. Experience with
     building a commodity Intel-based ccNUMA system.
     IBM J. Res. & Dev., 45(2), March 2001.
[5]  A. Itzkovitz and A. Schuster. Distributed Shared
     Memory: Bridging the Granularity Gap. In Proceedings
     of the 1st Workshop on Software Distributed Shared
     Memory, 1999.
[6]  L. Amar, A. Barak, and A. Shiloh. The MOSIX Direct
     File System Access Method for Supporting Scalable
     Cluster File Systems. Cluster Computing, 7(2), pp.
     141–150, 2004.
[7]  K. Li and P. Hudak. Memory Coherence in Shared
     Virtual Memory Systems. ACM Transactions on
     Computer Systems (TOCS), Vol. 7, pp. 321–359, 1989.
[8]  Sun Microsystems. Lustre File System [OL].
     http://www.sun.com/software/products/lustre/
[9]  Platform Computing. LSF Reference [OL].
     http://support.sas.com/rnd/scalability/platform/lsf_ref_6.0.pdf
[10] A. Geist and V. Sunderam. PVM: A framework for
     parallel distributed computing. Concurrency: Practice
     and Experience, 1990 [OL]. http://www.epm.ornl.gov/pvm/
[11] W. Zhang. Linux virtual servers for scalable network
     services. Ottawa Linux Symposium 2000, Canada [OL].
     http://www.LinuxVirtualServer.org/




1999 and the Art of Virtualization. In Proceedings of the [2] Rajkumar Buyya, Toni Cortes, Hai Jin. Single System 19th ACM SOSP, pages 164.177, October 2003. Image (SSI). The International Journal of High [19] Intel. Intel® Virtualization Technology for Directed I/O Performance Computing Applications, Volume 15, No. [OL]. 2, Summer 2001, pp. 124-135 http://www.intel.com/technology/itj/2006/v10i3/2-io/7- [3] IBM Enterprise X-Architecture Technology [OL]. conclusion.htm. http://www.unitech- [20] Gil Neiger. Intel Virtualization Technology: Hardware ie.com/ole/doc_library/xArchitecture%20technology% Support for Efficient Processor Virtualization. Intel 202.pdf Technology Journal, Vol. 10, Issue 3, 2006. [4] B. C. Brock, G. D. Carpenter, et al. Experience with building a commodity Intel-based ccNUMA system. IBM J. Res. & Dev. Vol. 45 No. 2 March 2001 [5] Ayal, Itzkovitz and Assaf, Schuster. Distributed Shared Memory: Bridging the Granularity Gap. 1999. In Proceedings of the 1st Workshop on Software Distributed Shared Memory. [6] L. Amar, A. Barak, and A. Shiloh, The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems. Cluster Comput-ing, 7(2), pp. 141-150, 2004. [7] Li, Kai and PAUL, HUDAK. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems (TOCS) . 1989, Vol. 7, ISSN:0734-2071, pp. 321-359. [8] SUN Corp. Lustre File System [OL]. http://www.sun.com/software/products/lustre/ [9] Platform Corp. LSF Reference [OL]. http://support.sas.com/rnd/scalability/platform/lsf_ref_ 6.0.pdf [10] Geist, A., and Sunderam, V. 1990.PVM: A framework for parallel distributed computing. Journal of Concurrency: Practice and Experience [OL]. http://www.epm.ornl.gov/pvm/. [11] Zhang, W. 2000. Linux virtual servers for scalable network services.Ottawa Linux Symposium 2000, 188