Slide 1: SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing
H. Andrés Lagar-Cavilla, Joseph A. Whitney, Adin Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, M. Satyanarayanan
University of Toronto and Carnegie Mellon University
http://sysweb.cs.toronto.edu/snowflock
Presented by 游清權, Distributed System Lab
Slide 2: Outline
- Introduction
- VM Fork
- Design Rationale
- SnowFlock Implementation
- Application Evaluation
- Conclusion and Future Directions
Slide 3: Introduction
- VM technology is widely adopted as an enabler of cloud computing
- Benefits:
  - Security
  - Performance isolation
  - Ease of management
  - Flexibility (user-customized environments)
  - Use a variable number of physical machines and VM instances depending on the needs of the problem
    - e.g., a task may need only a single CPU during some phases of execution
Slide 4: Introduction
- Introduces VM fork
  - Simplifies development and deployment of cloud applications
  - Allows rapid (< 1 second) instantiation of stateful computing elements in a cloud environment
- VM fork is similar to process fork
  - Child VMs receive a copy of all of the state generated by the parent VM prior to forking
  - But it differs in three fundamental ways
Slide 5: Introduction
- The VM fork primitive allows the forked copies to be instantiated on a set of different physical machines
  - Enables the task to take advantage of large compute clusters
  - Previous work [Vrable 2005] is limited to cloning VMs within the same host
- The primitive is parallel, enabling the creation of multiple child VMs with a single call
- VM fork replicates all of the processes and threads of the originating VM
  - Enables effective replication of multiple cooperating processes
  - e.g., a customized LAMP (Linux/Apache/MySQL/PHP) stack
Slide 6: Introduction
- Enables the trivial implementation of several useful and well-known patterns that are based on stateful replication
- Pseudocode for four of these is illustrated in Figure 1:
  - Sandboxing of untrusted code
  - Instantiating new worker nodes to handle increased load (e.g., due to flash crowds)
  - Enabling parallel computation
  - Opportunistically utilizing unused cycles with short tasks
Slide 7: Introduction
- SnowFlock
  - Provides swift parallel stateful VM cloning with little runtime overhead and frugal consumption of cloud I/O resources
- Takes advantage of several key techniques
  - First: SnowFlock utilizes lazy state replication to minimize the amount of state propagated to the child VMs
    - Extremely fast instantiation of clones by initially copying only the minimal necessary VM data, and transmitting only the fraction of the parent's state that clones actually need
Slide 8: Introduction
- Key techniques (continued)
  - Second: a set of avoidance heuristics eliminates substantial superfluous memory transfers for the common case of clones allocating new private state
  - Finally: child VMs execute very similar code paths and access common data structures, so SnowFlock uses a multicast distribution technique for VM state that provides scalability and prefetching
Slide 9: Introduction
- Evaluated SnowFlock by focusing on a demanding instance of Figure 1(b): interactive parallel computation
- Conducted experiments with applications from:
  - Bioinformatics
  - Quantitative finance
  - Rendering
  - Parallel compilation
- These can be deployed as Internet services
- On 128 processors, SnowFlock achieves speedups coming within 7% or better of optimal execution
Slide 10: VM Fork
- Advantages
  - Forked VMs execute independently on different physical hosts
  - Isolation and the ease of software development associated with VMs
  - Greatly reduces the performance overhead of creating a collection of identical VMs on a number of physical machines
- Each forked VM proceeds with an identical view of the system
  - Save for a unique identifier (vmid), which lets a VM tell whether it is the parent or a child
- Each forked VM has its own independent copy of the OS and virtual disk
  - State updates are not propagated between VMs
Slide 11: VM Fork
- Forked VMs are transient entities
  - Their memory image and virtual disk are discarded once they exit
- Any application-specific state or values they generate must be explicitly communicated to the parent VM
  - e.g., by message passing or via a distributed file system
- Conflicts may arise if multiple processes within the same VM simultaneously invoke VM forking
- VM fork is expected to be used in VMs that have been carefully customized to run a single application (such as serving a web page)
Slide 12: VM Fork
- The semantics of VM fork: integration with a dedicated, isolated virtual network connecting child VMs with their parent
  - Each child is configured with a new IP address based on its vmid, and it is placed on the same virtual subnet
  - Child VMs cannot communicate with hosts outside this virtual network
Slide 13: VM Fork
- The user must be conscious of the IP reconfiguration semantics:
  1. Network shares must be (re)mounted after cloning
  2. A NAT layer is provided to allow the clones to connect to certain external IP addresses
    - The NAT performs firewalling and throttling
    - It only allows external inbound connections to the parent VM
    - Useful to implement a web-based frontend, or to allow access to a dataset provided by another party
Slide 14: Design Rationale
- Plotting the cost of suspending and resuming a 1 GB VM to an increasing number of hosts over NFS (see Section 5 for details on the testbed) shows a direct relationship between the I/O involved and fork latency, with latency growing to the order of hundreds of seconds
Slide 15: Design Rationale
- Method 1: implement VM fork using existing VM suspend/resume
  - The wholesale copying of a VM to multiple hosts is far too taxing
  - Decreases overall system scalability by clogging the network with gigabytes of data
  - Contention caused by the simultaneous requests of all children turns the source host into a hot spot
- Live migration [Clark 2005, VMotion], a popular mechanism for consolidating VMs in clouds [Steinder 2007, Wood 2007], uses the same algorithm plus extra rounds of copying, and thus takes even longer to replicate VMs
Slide 16: Design Rationale
- Method 2: address VM fork latency using our multicast library
  - Multicast delivers state simultaneously to all hosts
  - Overhead is still in the range of minutes, although the total amount of VM state pushed over the network is substantially reduced
- The fast VM fork implementation is instead based on the following insights:
  - Start executing the child VM on a remote site by initially replicating only minimal state
  - Children will typically access only a fraction of the parent's original memory image
  - It is common for children to allocate memory after forking
  - Children often execute similar code and access common data structures
Slide 17: Design Rationale
- VM Descriptors
  - A lightweight mechanism that instantiates a new forked VM with only the critical metadata needed to start execution on a remote site
- Memory-On-Demand
  - A mechanism whereby clones lazily fetch portions of VM state over the network as it is accessed
- Experience
  - It is possible to start a child VM by shipping only 0.1% of the parent's state
  - Children require only a fraction of the parent's original memory image
  - They read portions of a remote dataset or allocate local storage
  - These optimizations reduce communication from 1 GB to about 40 MB (roughly 4%) for typical application footprints
Slide 18: Design Rationale
- Memory on-demand: a non-intrusive approach that reduces state transfer without altering the behavior of the guest OS
- Another non-intrusive approach: copy-on-write, used by Potemkin [Vrable 2005] (same host only)
  - Potemkin does NOT provide runtime stateful cloning, since all new VMs are copies of a frozen template
- Multicast replies to memory page requests
  - Exploits the high correlation across memory accesses of the children (insight iv)
  - Prevents the parent from becoming a hot spot
- Multicast provides scalability and prefetching
- Children operate independently and individually
  - A child waiting for a page does not prevent others from making progress
Slide 19: SnowFlock Implementation
- SnowFlock is an open-source project built on the Xen 3.0.3 VMM
- Xen
  - A hypervisor running at the highest processor privilege level, controlling the execution of domains (VMs); the domain kernels are paravirtualized
- SnowFlock
  - Modifications to the Xen VMM plus daemons running in domain0
  - The daemons form a distributed system that controls the life cycle of VMs (cloning and deallocation)
  - Policy decisions such as resource accounting and the allocation of VMs to physical hosts are deferred to suitable cluster management software via a plug-in architecture
  - Lazy state replication with avoidance heuristics minimizes state transfer
Slide 20: SnowFlock Implementation
- Four mechanisms are used to fork a VM:
  1. The parent VM is temporarily suspended to produce a VM descriptor
    - A small file containing VM metadata and guest kernel memory-management data
    - Distributed to other physical hosts to spawn new VMs in sub-second time
  2. A memory-on-demand mechanism lazily fetches additional VM memory state
  3. Avoidance heuristics reduce the amount of memory that needs to be fetched on demand
  4. The multicast distribution system mcdist delivers VM state simultaneously and efficiently, providing implicit prefetching
Slide 21: Implementation - 1. API
- VM fork in SnowFlock consists of two stages
- Stage 1: sf_request_ticket (reservation for the desired number of clones)
  - To optimize for common use cases on SMP hardware, cloned VMs span multiple hosts and the processes within each VM span the underlying physical cores
  - Due to user quotas, current load, and other policies, the cluster management system may allocate fewer VMs than requested
- Stage 2: fork the VM across the hosts with the sf_clone call
  - A child VM finishes its part of the computation and then calls sf_exit
  - A parent VM can wait for its children to terminate with sf_join, or force their termination with sf_kill
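The call names above come straight from the deck, but their signatures are not shown; the sketch below invents minimal ones and stubs them out purely to illustrate the two-stage flow. Treat it as a hedged sketch, not the real SnowFlock API.

    /* Hypothetical sketch of the two-stage SnowFlock fork API.
     * Call names come from the deck; signatures and stubs are assumed. */
    #include <stdio.h>

    typedef struct { int allocated; } sf_ticket_t;   /* assumed ticket shape */

    /* Stubbed stand-ins so the sketch compiles; real calls talk to SnowFlock. */
    static sf_ticket_t sf_request_ticket(int nclones) {
        sf_ticket_t t = { nclones };     /* may be fewer under quotas/load */
        return t;
    }
    static int  sf_clone(sf_ticket_t t) { (void)t; return 0; } /* 0 = parent */
    static void sf_join(sf_ticket_t t)  { (void)t; }
    static void sf_exit(void)           { }

    static void do_slice(int id, int total) {
        printf("clone %d of %d working\n", id, total);
    }

    int main(void) {
        sf_ticket_t ticket = sf_request_ticket(32);  /* stage 1: reserve */
        int id = sf_clone(ticket);                   /* stage 2: fork */
        if (id != 0) {                               /* child VM path */
            do_slice(id, ticket.allocated);
            sf_exit();                               /* child state is discarded */
        } else {
            sf_join(ticket);                         /* parent waits for children */
        }
        return 0;
    }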
Slide 22: Implementation - 1. API
- The API is simple and flexible, requiring little modification of existing code bases
- Integration with the widely used Message Passing Interface (MPI) library allows unmodified parallel applications to use SnowFlock's capabilities
Slide 23: Implementation - 2. VM Descriptors
- A condensed VM image enables swift VM replication to a separate physical host
- Cloning starts by spawning a thread in the VM kernel that quiesces its I/O devices
  - Deactivates all but one of the virtual processors (VCPUs)
  - Issues a hypercall suspending the VM's execution
- When the hypercall succeeds, the suspended VM memory is mapped to populate the descriptor
- The descriptor contains:
  - Metadata describing the VM and its virtual devices
  - A few memory pages shared between the VM and the Xen hypervisor
  - The registers of the main VCPU
  - The Global Descriptor Tables (GDT) used by the x86 segmentation hardware for memory protection
  - The page tables of the VM
Slide 24: Implementation - 2. VM Descriptors
- The page tables make up the bulk of a VM descriptor
  - Each process in the VM needs a small number of additional page tables
  - The cumulative size of a VM descriptor is thus loosely dependent on the number of processes the VM is executing
- Entries in a page table are "canonicalized" before saving
  - Translated from references to host-specific pages to frame numbers within the VM's private contiguous physical space ("machine" and "physical" addresses in Xen parlance, respectively)
- A few other values included in the descriptor, e.g. the cr3 register of the saved VCPU, are also canonicalized
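A toy illustration of what "canonicalizing" a page-table entry could look like, assuming a simple machine-to-physical (MFN to PFN) lookup; the entry layout, masks, and helper names are assumptions for illustration, not Xen's actual code.

    /* Toy model of page-table-entry canonicalization: host-specific machine
     * frame numbers (MFNs) are rewritten as VM-relative physical frame
     * numbers (PFNs) so the descriptor is meaningful on any host.
     * Layout, widths, and table names are assumed for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define FRAME_MASK (~((uint64_t)0xFFF))     /* assumed: low 12 bits are flags */

    /* assumed machine-to-physical map for this VM */
    static uint64_t m2p(uint64_t mfn, const uint64_t *m2p_table) {
        return m2p_table[mfn];
    }

    /* Rewrite one PTE: keep the flag bits, swap the frame number. */
    static uint64_t canonicalize_pte(uint64_t pte, const uint64_t *m2p_table) {
        uint64_t flags = pte & 0xFFF;
        uint64_t mfn   = (pte & FRAME_MASK) >> PAGE_SHIFT;
        uint64_t pfn   = m2p(mfn, m2p_table);
        return (pfn << PAGE_SHIFT) | flags;
    }

    int main(void) {
        uint64_t m2p_table[16] = { [5] = 2 };       /* MFN 5 is the VM's PFN 2 */
        uint64_t pte = (5ULL << PAGE_SHIFT) | 0x63; /* present/dirty/accessed bits */
        printf("canonicalized PTE: 0x%llx\n",
               (unsigned long long)canonicalize_pte(pte, m2p_table));
        return 0;
    }

On the clone, the inverse translation (PFN back to a new host's MFN) is applied when the descriptor is loaded, as the next slide describes.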
Slide 25: Implementation - 2. VM Descriptors
- The descriptor is multicast to multiple physical hosts using mcdist (Section 4.5)
- The metadata is used to allocate a VM with the appropriate virtual devices and memory footprint
- All state saved in the descriptor is loaded:
  - Pages shared with Xen
  - Segment descriptors
  - Page tables
  - VCPU registers
- Physical addresses in page table entries are translated to use the new mapping between VM-specific physical addresses and host machine addresses
- The VM replica resumes execution, enables the extra VCPUs, and reconnects its virtual I/O devices to the new frontends
Slide 26: Implementation - 2. VM Descriptors
- Evaluation: time spent replicating a single-processor VM with 1 GB of RAM to n clones on n physical hosts
Slide 27: Implementation - 2. VM Descriptors
- The VM descriptor for these experiments was 1051 ± 7 KB
- The time to create a descriptor is "Save Time" (our code) plus "Xend Save" (recycled, unmodified Xen code)
- "Starting Clones" is the time spent distributing the order to spawn a clone to each host
- Clone creation on each host is composed of "Fetch Descriptor" (waiting for the descriptor to arrive), "Restore Time" (our code), and "Xend Restore" (recycled Xen code)
- Overall, VM replication is a fast operation (600 to 800 milliseconds)
- Replication time is largely independent of the number of clones created
Slide 28: Implementation - 3. Memory-On-Demand
- SnowFlock's memory-on-demand subsystem: memtap
  - After being instantiated from a descriptor, a clone will find that it is missing state needed to proceed
  - Memtap handles this by lazily populating the clone VM's memory with state fetched from the parent (an immutable copy of the parent VM's memory)
- Memtap = hypervisor logic + a userspace domain0 process associated with the clone VM
  1. The clone accesses a missing page
  2. The hypervisor pauses that VCPU
  3. The hypervisor notifies the memtap process
  4. Memtap fetches the page contents from the parent
  5. Memtap notifies the hypervisor, and the VCPU may be unpaused
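A minimal single-threaded model of the fetch path listed above: check a present bitmap, pause, fetch from the parent, unpause. The page store, bitmap, and fetch helper are invented stand-ins, not memtap's real interfaces.

    /* Toy model of the memory-on-demand fetch path: pause the faulting VCPU,
     * ask a memtap-like helper to fetch the page from the parent, then unpause.
     * All structures and helpers here are illustrative stand-ins. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096
    #define NPAGES    16

    static uint8_t clone_memory[NPAGES][PAGE_SIZE];
    static int     page_present[NPAGES];     /* bitmap: fetched yet? */
    static int     vcpu_paused;

    /* Stand-in for fetching a page's contents from the parent VM. */
    static void fetch_from_parent(int pfn, uint8_t *dst) {
        memset(dst, 0xAB, PAGE_SIZE);        /* pretend this came over the network */
        printf("memtap: fetched page %d from parent\n", pfn);
    }

    /* Invoked on the first access to a page that has not been fetched. */
    static void handle_missing_page(int pfn) {
        vcpu_paused = 1;                     /* hypervisor pauses the VCPU */
        fetch_from_parent(pfn, clone_memory[pfn]);
        page_present[pfn] = 1;
        vcpu_paused = 0;                     /* VCPU may be unpaused */
    }

    static uint8_t read_byte(int pfn, int offset) {
        if (!page_present[pfn])
            handle_missing_page(pfn);
        return clone_memory[pfn][offset];
    }

    int main(void) {
        printf("first byte of page 3: 0x%02x\n", read_byte(3, 0));
        return 0;
    }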
Slide 29: Implementation - 3. Memory-On-Demand
- To let the hypervisor trap accesses to pages that have not yet been fetched, SnowFlock uses Xen shadow page tables
  - The pointer in the x86 cr3 register is replaced with a pointer to an initially empty shadow page table
  - The shadow page table is filled on demand from the real page table as faults on empty entries occur
- On the first access to a page that has not yet been fetched, the hypervisor notifies memtap
- Fetches are also triggered by accesses from domain0 to the VM's memory for the purpose of virtual device DMA
Slide 30: Implementation - 3. Memory-On-Demand
- On the parent VM, memtap implements copy-on-write to preserve an immutable copy of the memory
- Shadow page tables are used in "log-dirty" mode
  - All parent VM memory write attempts are trapped by disabling the writable bit in the shadow page table
  - The hypervisor duplicates the page and patches the mapping of the memtap server process to point to the duplicate
  - The parent VM is then allowed to continue execution
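A toy sketch of the parent-side copy-on-write idea: the first trapped write after cloning duplicates the page so memtap keeps serving the pre-clone contents. All names and structures are illustrative stand-ins, not Xen's log-dirty machinery.

    /* Toy model of parent-side copy-on-write: before the parent's first write
     * to a page after cloning, duplicate it so memtap keeps serving the
     * pre-clone contents. Structures and names are illustrative stand-ins. */
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096
    #define NPAGES    16

    static uint8_t  parent_memory[NPAGES][PAGE_SIZE];
    static uint8_t *memtap_view[NPAGES];     /* what memtap serves to clones */
    static int      write_trapped[NPAGES];   /* 1 = writable bit cleared */

    static void init_cow(void) {
        for (int i = 0; i < NPAGES; i++) {
            memtap_view[i]   = parent_memory[i];  /* share until first write */
            write_trapped[i] = 1;                 /* "log-dirty": trap writes */
        }
    }

    /* Called on a trapped write; duplicates the page for memtap's view. */
    static void cow_write_fault(int pfn) {
        uint8_t *copy = malloc(PAGE_SIZE);
        memcpy(copy, parent_memory[pfn], PAGE_SIZE);
        memtap_view[pfn]   = copy;           /* memtap now points at the snapshot */
        write_trapped[pfn] = 0;              /* parent may write freely from now on */
    }

    static void parent_write(int pfn, int offset, uint8_t value) {
        if (write_trapped[pfn])
            cow_write_fault(pfn);
        parent_memory[pfn][offset] = value;
    }

    int main(void) {
        parent_memory[2][0] = 0x11;          /* contents at clone time */
        init_cow();                          /* clone: freeze memtap's view */
        parent_write(2, 0, 0x7F);            /* parent write triggers duplication */
        printf("parent sees 0x%02x, memtap serves 0x%02x\n",
               parent_memory[2][0], memtap_view[2][0]);
        return 0;
    }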
Slide 31: Implementation - 3. Memory-On-Demand: Evaluation
- To understand the overhead involved, a microbenchmark performs ten thousand page fetches over multiple runs (Figure 4(a)), splitting a page fetch operation into six components:
  - "Page Fault": hardware page fault overheads caused by using shadow page tables
  - "Xen": Xen hypervisor shadow page table logic
  - "HV Logic": hypervisor logic for memory-on-demand
  - "Dom0 Switch": context switch to the domain0 memtap process
  - "Memtap Logic": memtap internals, mapping the faulting VM page
  - "Network": software (libc and the Linux kernel TCP stack) and hardware overheads
Slide 32: Implementation - 4. Avoidance Heuristics
- Fetching pages from the parent still incurs an overhead that may prove excessive for many workloads
- The VM kernel is augmented with two fetch-avoidance heuristics
  - They bypass a large number of unnecessary memory fetches while retaining correctness
- First heuristic
  - Optimizes the general case in which a clone VM allocates new state
  - Intercepts pages selected by the kernel's page allocator
  - The kernel page allocator is invoked when more memory is needed
  - The recipient of the selected pages does not care about the pages' previous contents, so there is no need to fetch them (see the paper, page 6, right column)
Slide 33: Implementation - 4. Avoidance Heuristics
- Second heuristic
  - Addresses the case where a virtual I/O device writes to guest memory
  - Consider block I/O: the target page is typically a kernel buffer that is being recycled and whose previous contents do not need to be preserved
  - Again, there is no need to fetch this page
- The fetch-avoidance heuristics are implemented by mapping the memtap bitmap into the guest kernel's address space
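A sketch of how the first heuristic might look from inside a guest page allocator, assuming the memtap bitmap is mapped into the kernel's address space as the slide states; the bitmap layout and helper names are assumptions.

    /* Illustrative sketch of the first avoidance heuristic: when the guest
     * allocator hands out a page whose previous contents are irrelevant,
     * mark it present in the shared memtap bitmap so it is never fetched.
     * Bitmap layout and helper names are assumptions for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPAGES 4096

    /* Assumed: memtap's bitmap, mapped read/write into the guest kernel. */
    static uint8_t memtap_present_bitmap[NPAGES / 8];

    static void mark_present(unsigned long pfn) {
        memtap_present_bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
    }

    static int is_present(unsigned long pfn) {
        return memtap_present_bitmap[pfn / 8] & (1u << (pfn % 8));
    }

    /* Hook in the page allocator: the caller will overwrite the page anyway,
     * so skip the on-demand fetch entirely. */
    static unsigned long alloc_fresh_page(unsigned long candidate_pfn) {
        if (!is_present(candidate_pfn))
            mark_present(candidate_pfn);   /* heuristic: avoid the fetch */
        return candidate_pfn;
    }

    int main(void) {
        unsigned long pfn = alloc_fresh_page(123);
        printf("page %lu present without fetch: %d\n", pfn, is_present(pfn) != 0);
        return 0;
    }

The second heuristic would set the same bit on the pages targeted by virtual I/O writes, for the same reason.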
Slide 34: Implementation - 4. Avoidance Heuristics: Evaluation
- The heuristics result in substantial benefits in both runtime and data transfer
- With the heuristics, state transmissions to clones are reduced to 40 MB, a tiny fraction (3.5%) of the VM's footprint
Slide 35: Implementation - 5. Multicast Distribution
- Mcdist is a multicast distribution system that efficiently provides data to all cloned VMs simultaneously
- It serves two goals that point-to-point distribution cannot:
  - First: data needed by clones is often prefetched; when a single clone requests a page, the response also reaches all other clones
  - Second: network load is greatly reduced by sending a piece of data to all VM clones with a single operation
Slide 36: Implementation - 5. Multicast Distribution
- The mcdist server design is minimalistic
  - Only switch programming and flow control logic
  - Reliability is ensured with a timeout mechanism
- IP multicast sends data to multiple hosts simultaneously
  - Supported by most off-the-shelf commercial Ethernet hardware
- IP multicast hardware is capable of scaling to thousands of hosts and multicast groups, automatically relaying multicast frames across multiple hops
Slide 37: Implementation - 5. Multicast Distribution
- Mcdist clients are the memtap processes
  - They receive pages asynchronously and unpredictably, in response to requests by fellow VM clones
  - Memtap clients batch received pages until a threshold is hit or a page that has been explicitly requested arrives
- A single hypercall is then invoked to map the pages in a batch
- A threshold of 1024 pages has proven to work well in practice
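A small sketch of the batching rule above: buffer pages as they arrive and map the whole batch with one call when an explicitly requested page shows up or the threshold is reached. Only the 1024-page threshold comes from the deck; the structures and the map call are stand-ins.

    /* Sketch of memtap-style page batching: buffer asynchronously received
     * pages and map them with one call when the batch fills or an explicitly
     * requested page shows up. Structures and the map call are stand-ins. */
    #include <stdio.h>

    #define BATCH_THRESHOLD 1024      /* value quoted in the deck */

    static unsigned long batch[BATCH_THRESHOLD];
    static int batch_len;

    /* Stand-in for the single hypercall that maps a whole batch. */
    static void map_batch(const unsigned long *pfns, int n) {
        printf("mapping %d pages in one call (first pfn %lu)\n", n, pfns[0]);
    }

    static void flush_batch(void) {
        if (batch_len > 0) {
            map_batch(batch, batch_len);
            batch_len = 0;
        }
    }

    static void on_page_received(unsigned long pfn, int explicitly_requested) {
        batch[batch_len++] = pfn;
        if (explicitly_requested || batch_len == BATCH_THRESHOLD)
            flush_batch();
    }

    int main(void) {
        for (unsigned long pfn = 0; pfn < 10; pfn++)
            on_page_received(pfn, 0);          /* prefetched pages accumulate */
        on_page_received(42, 1);               /* requested page forces a flush */
        return 0;
    }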
Slide 38: Implementation - 5. Multicast Distribution
- To maximize total goodput, the server uses flow control logic to limit its sending rate
  - Server and clients estimate their send and receive rates
  - Clients provide explicit feedback
  - The server increases its rate limit linearly; when loss is detected, the server scales its rate limit back
- Another server flow control mechanism is lockstep detection
  - When multiple requests arrive for the same page, duplicate requests are ignored
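A sketch of the linear-increase / back-off pacing and lockstep duplicate suppression described above; the concrete constants and the back-off factor are assumptions, since the slides do not specify them.

    /* Sketch of mcdist-style server pacing: increase the rate limit linearly,
     * scale it back when clients report loss, and drop duplicate (lockstep)
     * requests for a page already being served. Constants are assumptions. */
    #include <stdio.h>

    #define NPAGES       4096
    #define RATE_STEP    100       /* assumed linear increment, pages/s */
    #define RATE_BACKOFF 0.5       /* assumed multiplicative back-off */
    #define RATE_FLOOR   100

    static double rate_limit = 1000;          /* current pages/s budget */
    static unsigned char in_flight[NPAGES];   /* lockstep detection bitmap */

    static void on_client_feedback(int loss_detected) {
        if (loss_detected) {
            rate_limit *= RATE_BACKOFF;       /* scale back on loss */
            if (rate_limit < RATE_FLOOR)
                rate_limit = RATE_FLOOR;
        } else {
            rate_limit += RATE_STEP;          /* linear increase otherwise */
        }
    }

    /* Returns 1 if the request should be served, 0 if it is a duplicate. */
    static int on_page_request(unsigned long pfn) {
        if (in_flight[pfn])
            return 0;                         /* lockstep duplicate: ignore */
        in_flight[pfn] = 1;
        return 1;
    }

    static void on_page_sent(unsigned long pfn) { in_flight[pfn] = 0; }

    int main(void) {
        printf("serve page 7? %d\n", on_page_request(7));
        printf("serve page 7 again? %d\n", on_page_request(7));  /* duplicate */
        on_page_sent(7);
        on_client_feedback(1);
        printf("rate after loss: %.0f pages/s\n", rate_limit);
        return 0;
    }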
Slide 39: Implementation - 5. Multicast Distribution: Evaluation
- Results obtained with SHRiMP show that multicast distribution's lockstep avoidance works effectively
  - Lockstep-executing VMs issue simultaneous requests that are satisfied by a single response from the server
  - Hence the difference between the "Requests" and "Served" bars in the multicast experiments
Slide 40: Implementation - 5. Multicast Distribution (figure-only slide)
Slide 41: Implementation - 5. Multicast Distribution
- Figure 4(c) shows the benefit of mcdist for a case where an important portion of memory state is needed after cloning (the avoidance heuristics cannot help)
- Experiment (NCBI BLAST)
  - Executes queries against a 256 MB portion of the NCBI genome database that the parent caches into memory before cloning
- Speedup results compare SnowFlock with unicast versus multicast, and an idealized zero-cost fork configuration
  - In the zero-cost case, VMs have been previously allocated, with no cloning or state-fetching overhead
Slide 42: Implementation - 6. Virtual I/O Devices: Virtual Disk
- Implemented with a blocktap [Warfield 2005] driver
  - Multiple views of the virtual disk are supported by a hierarchy of copy-on-write (COW) slices located at the site where the parent VM runs
- Each fork operation adds a new COW slice, rendering the previous state of the disk immutable
- Children access a sparse local version of the disk, with content fetched on demand from the disk server
- The virtual disk exploits the same optimizations as the memory subsystem
  - Unnecessary fetches during writes are avoided using heuristics
  - The original disk state is provided to all clients simultaneously via multicast
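A toy model of the copy-on-write slice hierarchy: each fork freezes the current slice, and reads fall through to the newest slice that holds a block. It glosses over the clones' local sparse disks and the disk server; structures and names are illustrative stand-ins for the blocktap-based implementation.

    /* Toy model of a COW slice hierarchy for the virtual disk: every fork
     * starts a new writable slice; reads walk from the newest slice down to
     * the base image. All structures here are illustrative stand-ins. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS   64
    #define MAXSLICES 8
    #define BLOCK_SZ  16

    typedef struct {
        char data[NBLOCKS][BLOCK_SZ];
        unsigned char present[NBLOCKS];   /* block written in this slice? */
    } slice_t;

    static slice_t slices[MAXSLICES];
    static int     top = 0;               /* index of the current writable slice */

    static void fork_disk(void) { top++; }   /* freeze current state, new COW slice */

    static void disk_write(int block, const char *buf) {
        memcpy(slices[top].data[block], buf, BLOCK_SZ);
        slices[top].present[block] = 1;
    }

    static const char *disk_read(int block) {
        for (int s = top; s >= 0; s--)        /* newest slice wins */
            if (slices[s].present[block])
                return slices[s].data[block];
        return "";                            /* unwritten block: zero/base image */
    }

    int main(void) {
        disk_write(0, "parent state!!!");
        fork_disk();                          /* parent's view is now immutable */
        disk_write(0, "child scratch..");
        printf("block 0 after fork: %.*s\n", BLOCK_SZ, disk_read(0));
        return 0;
    }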
Slide 43: Implementation - 6. Virtual I/O Devices: Virtual Disk
- The virtual disk is used as the base root partition for the VMs
- For data-intensive tasks, the envisioned approach is to serve data volumes to the clones through network file systems such as NFS, or through suitable big-data filesystems such as Hadoop or Lustre [Braam 2002]
- Most work done by clones is processor intensive
  - Writes do not result in fetches
  - The little remaining disk activity mostly hits kernel caches
- The virtual disk largely exceeds the demands of many realistic tasks and did not cause any noticeable overhead in the experiments (Section 5)
Slide 44: Implementation - 6. Virtual I/O Devices: Network Isolation
- A mechanism is employed to isolate the virtual network (preventing interference and eavesdropping)
- Isolation is performed at the level of Ethernet packets, the primitive exposed by Xen virtual network devices
- Before being sent, the source MAC addresses of packets sent by a SnowFlock VM are rewritten as a special address which is a function of both the parent and child identifiers
  - Simple filtering rules are used by all hosts to ensure that no packets delivered to a VM come from VMs that are not its parent or a sibling
- Before a packet is delivered, the destination MAC address is rewritten to be as expected, rendering the entire process transparent
- A small number of special rewriting rules are required for protocols with payloads containing MAC addresses, such as ARP
- Filtering and rewriting impose an imperceptible overhead while maintaining full IP compatibility
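A sketch of the rewrite-and-filter idea on Ethernet frames. The deck only says the special source address is a function of the parent and child identifiers; the specific encoding below (a locally administered prefix plus the two ids) is an assumption for illustration.

    /* Sketch of SnowFlock-style network isolation on Ethernet frames:
     * outgoing source MACs are rewritten to encode (parent id, child id),
     * and incoming frames are dropped unless they come from the same
     * parent's family. The address encoding below is an assumption. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct {
        uint8_t dst[6];
        uint8_t src[6];
        /* payload omitted in this sketch */
    } eth_frame_t;

    /* Assumed encoding: locally administered prefix, then parent/child ids. */
    static void encode_mac(uint8_t mac[6], uint16_t parent, uint16_t child) {
        mac[0] = 0x02; mac[1] = 0x5F;             /* arbitrary local prefix */
        mac[2] = (uint8_t)(parent >> 8); mac[3] = (uint8_t)parent;
        mac[4] = (uint8_t)(child  >> 8); mac[5] = (uint8_t)child;
    }

    /* On egress: stamp the frame with the sender's (parent, child) identity. */
    static void rewrite_source(eth_frame_t *f, uint16_t parent, uint16_t child) {
        encode_mac(f->src, parent, child);
    }

    /* On ingress: accept only frames whose source belongs to the same parent. */
    static int allow_delivery(const eth_frame_t *f, uint16_t my_parent) {
        uint16_t src_parent = (uint16_t)((f->src[2] << 8) | f->src[3]);
        return f->src[0] == 0x02 && f->src[1] == 0x5F && src_parent == my_parent;
    }

    int main(void) {
        eth_frame_t f;
        memset(&f, 0, sizeof f);
        rewrite_source(&f, /*parent=*/7, /*child=*/3);
        printf("delivered to a sibling of parent 7? %d\n", allow_delivery(&f, 7));
        printf("delivered to a VM of parent 9?     %d\n", allow_delivery(&f, 9));
        return 0;
    }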
Slide 45: Application Evaluation
- The evaluation focuses on a particularly demanding scenario
  - The ability to deliver interactive parallel computation, in which a VM forks multiple workers to participate in a short-lived, computationally intensive parallel job
- Scenario
  - Users interact with a web frontend and submit queries
  - A parallel algorithm is run on a compute cluster
- Cluster of 32 Dell PowerEdge 1950 blade servers
Slide 46: Application Evaluation
- Each host:
  - 4 GB of RAM
  - 4 Intel Xeon 3.2 GHz cores
  - Broadcom NetXtreme II BCM5708 gigabit NIC
- All machines ran the SnowFlock prototype (Xen 3.0.3)
- Paravirtualized Linux 2.6.16.29 in both guest and host
- All machines were connected to two daisy-chained Dell PowerConnect 5324 gigabit switches
Slide 47: Applications
- Three typical applications from bioinformatics, plus three applications from graphics rendering, parallel compilation, and financial services
- Each is driven by a workflow shell script that clones the VM and launches the application
- NCBI BLAST: a computational tool used by biologists
- SHRiMP: a tool for aligning large collections of very short DNA sequences
- ClustalW: multiple alignment of a collection of protein or DNA sequences
- QuantLib: a toolkit widely used in quantitative finance
- Aqsis (RenderMan): used in film and television visual effects [Pixar]
- distcc: parallel compilation
Slide 48: Results
- 32 4-core SMP VMs on 32 physical hosts
- The experiments aim to answer the following questions:
  - How does SnowFlock compare to other methods for instantiating VMs?
  - How close does SnowFlock come to achieving optimal application speedup?
  - How scalable is SnowFlock?
Slide 49: Results - Comparison
- SHRiMP on 128 processors under three configurations:
  - SnowFlock with all of its mechanisms
  - Xen's standard suspend/resume using NFS
  - Xen's suspend/resume with multicast used to distribute the suspended VM image
Slide 50: Results - Application Performance (figure-only slide)
Slide 51: Results - Application Performance
- Compares SnowFlock to an optimal "zero-cost fork" baseline
- Baseline
  - 128 threads to measure overhead
  - One thread to measure speedup
- Zero-cost fork
  - VMs previously allocated, with no cloning or state-fetching overhead, sitting idle
  - Overly optimistic and not representative of cloud computing environments
- Zero-cost VMs are vanilla Xen 3.0.3 domains, configured identically to SnowFlock VMs
Slide 52: Results - Application Performance
- SnowFlock performs extremely well
- Reduces execution time from hours to tens of seconds for all the benchmarks
- Speedups are very close to the zero-cost optimal; SnowFlock comes within 7% of the optimal runtime
- The overhead of VM replication and on-demand state fetching is small
- ClustalW shows the best results: less than 2 seconds of overhead for a 25-second task
Slide 53: Scale and Agility
- Addresses SnowFlock's capability to support multiple concurrent forking VMs
  - Launch four VMs that each fork 32 uniprocessor VMs
- After completing a parallel task, each parent VM joins and terminates its children, then launches another parallel task, repeating five times
- Each parent VM runs a different application
  - An "adversarial allocation" is employed, in which each task uses 32 processors, one per physical host
  - 128 SnowFlock VMs are active at most times
  - Each physical host needs to fetch state from four parent VMs
Slide 54: Scale and Agility
- SnowFlock is capable of withstanding the increased demands of multiple concurrent forking VMs
- The authors believe that optimizing mcdist will further reduce these already consistent running times
- SnowFlock performs a 32-host parallel computation of 40 seconds or less with five seconds or less of overhead
Slide 55: Conclusion and Future Directions
- Introduced VM fork and SnowFlock, its Xen-based implementation
- VM fork
  - Instantiates dozens of VMs on different hosts in sub-second time, with low runtime overhead and frugal use of cloud I/O resources
- SnowFlock
  - Drastically reduces cloning time by copying only the critical state, and fetches the VM's memory image efficiently on demand
- Simple modifications to the guest kernel reduce network traffic by eliminating the transfer of pages that will be overwritten
- Multicast exploits the locality of memory accesses across cloned VMs at low cost
Slide 56: Conclusion and Future Directions
- SnowFlock is an active open-source project
  - Plans involve adapting SnowFlock to big-data applications
- Studying the interactions of VM fork with data-parallel APIs is fertile research ground
- SnowFlock's objective is performance rather than reliability
  - Memory-on-demand provides performance but introduces a dependency on a single source of VM state
  - An open question is how to push VM state to clones in the background without sacrificing that performance
- A further wish: wide-area VM migration