Optimizing Dirty-Page Tracking Mechanism for Virtualization Based Fault Tolerance

Vijayakumar M M, Nikhil Pujari, Sireesh Bolla
Stony Brook University

Overview

Transactional fault-tolerant systems typically involve a Master and a Slave server whose memory states are synchronized at regular intervals. Disk and network outputs are not released until this synchronization is complete. When the master fails, the slave takes over, and the failover is transparent to external clients. This synchronization process is closely related to live migration of virtual machines: in live migration, after the first stage of pre-copying, in which the entire memory is copied to the target machine, the subsequent stages copy only the pages dirtied in each interval.

Xen employs the following mechanism for dirty-page tracking. For HVM domains, pages that have read-write access in the guest page tables are marked read-only in the shadow page tables. An actual write is therefore trapped, and the page is marked dirty in a page-dirty bitmap. A page fault occurs in the following cases:

1) The guest tries to modify its page table.
2) A GPT (Guest Page Table) entry is present, but the corresponding SPT (Shadow Page Table) entry is not present.
3) The GPT entry is RW-enabled, but the SPT entry is read-only.

When a page fault happens, the page fault handler sh_page_fault() is invoked. This handler executes a guest page table walk to find the faulted entry in the GPT. Depending on the kind of fault (RW-RO or P-NP), the corresponding Accessed or Dirty bits are set in the GPTE. After that, sh_propagate is invoked from one of its wrappers, chosen according to the level of the page table to which the GPTE belongs (e.g. L1, L2, or L3). The sh_propagate function is the heart of the shadow code: it computes the SPTEs from the corresponding GPTEs and sets them in the SPT. When dirty-page tracking is enabled, the new SPTEs set by sh_propagate are write-protected.
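The trap-and-mark flow can be condensed into the following sketch. This is a self-contained illustration, not Xen's actual shadow code; PTE_RW, spt, dirty_bitmap, and the two functions are hypothetical stand-ins for the corresponding shadow-code machinery.

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_RW  (1u << 1)          /* writable bit in a PTE */
    #define PAGES   4096u

    typedef uint32_t spte_t;

    static spte_t  spt[PAGES];              /* one-level shadow table, simplified */
    static uint8_t dirty_bitmap[PAGES / 8]; /* one bit per guest page */

    static void mark_dirty(unsigned long pfn)
    {
        dirty_bitmap[pfn / 8] |= 1u << (pfn % 8);
    }

    /* sh_propagate, reduced to its essence: compute an SPTE from a GPTE.
     * With dirty-page tracking enabled, the writable bit is stripped so
     * that the first write to the page traps into the hypervisor. */
    static spte_t propagate(spte_t gpte, bool log_dirty)
    {
        spte_t spte = gpte;
        if (log_dirty)
            spte &= ~PTE_RW;
        return spte;
    }

    /* RW-RO fault path: record the page in the bitmap and restore write
     * access so subsequent writes in this epoch do not trap again. */
    static void handle_write_fault(unsigned long pfn)
    {
        mark_dirty(pfn);
        spt[pfn] |= PTE_RW;
    }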
Scope for optimization

Observations on page-dirtying patterns during live migration of VMs have shown that page dirtying is often clustered: if a page is dirtied, it is disproportionately likely that its close neighbours will be dirtied in the same epoch (http://www.cl.cam.ac.uk/research/srg/netos/papers/2005-migration-nsdi-pre.pdf). This locality can be exploited by forming groups of pages on which the optimization is applied.

Changes in the shadowing mechanism from Xen 3.1 to Xen 3.3

In Xen 3.1 the following method was employed for shadowing:

• Remove write access to any guest pagetable.
• When a guest attempts to write to the guest pagetable, mark it out-of-sync, add the page to the out-of-sync list, and give write permission.
• On the next page fault or cr3 write, take each page from the out-of-sync list and:
  o resync the page: look for changes to the guest pagetable and propagate those entries into the shadow pagetable;
  o remove write permission and clear the out-of-sync bit.

This method did not work well for MS Windows HVM domains because of their demand-paging technique. In Shadow-2 (Xen 3.2), the OOS mechanism was removed and each write to the guest page table was emulated. But this removed the batching effect of guest page table updates: each time the guest OS modifies its page table, it takes a page fault, and the write is emulated by the hypervisor. Windows writes transition values into its page table entries when pages are being mapped out or mapped in, so the process looks like:

• Page fault
• Emulated write of the transition PTE
• Emulated write of the real PTE
Each emulated write involves a VMENTER/VMEXIT and about 8000 cycles of emulation inside the hypervisor. Shadow-3 (Xen 3.3) brings back the OOS mechanism to improve on this, with a modification: only the L1 pagetables are allowed to go out-of-sync, and all other page table writes are emulated. This improves HVM shadow performance significantly.

This affects our optimization as follows. Attempts to implement the optimization on Shadow-2 did not yield significant improvements because page-table-write faults dominated the total number of page faults by a great margin: of the total page faults per epoch, approximately 98% were for GPT writes and only 2% were data-page write faults. In Xen 3.3, since the L1 pagetables are allowed to go out of sync and are resynced only on the next page fault, the percentage of GPT faults has dropped, which makes our optimization viable. Still, the percentage of GPT faults and GPTE-present/SPTE-not-present (P-NP) faults is significant compared to data-page faults. This is primarily because each time a process is scheduled out, its SPT is destroyed, and the SPT is reconstructed whenever the process is rescheduled; the GPT pages are marked read-only to construct the SPT and to track changes attempted by the guest OS. If we made epochs independent of context switches, i.e. multiple epochs between two context switches, we could contrast only the GPT-modification faults against data-page faults, largely eliminating the SPT-construction faults from our measurements and better demonstrating the efficacy of our optimization.
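The Shadow-3 out-of-sync behaviour described above might be sketched as follows. All names here (l1_table, oos_list, the three helpers) are illustrative stand-ins; the real logic lives in Xen's shadow code and handles many more cases.

    #include <stddef.h>

    #define OOS_MAX 16
    typedef struct l1_table l1_table;            /* opaque stand-in for a guest L1 pagetable page */

    static l1_table *oos_list[OOS_MAX];          /* pages currently out of sync */
    static size_t    oos_count;

    static void give_write_access(l1_table *t)   { (void)t; /* set RW in the SPTE mapping t   */ }
    static void remove_write_access(l1_table *t) { (void)t; /* clear RW again                 */ }
    static void resync_shadow(l1_table *t)       { (void)t; /* re-propagate GPTEs into the SPT */ }

    /* Guest write to an L1 pagetable: instead of emulating the write
     * (a VMEXIT plus ~8000 cycles each time), mark the page out of sync
     * and let the guest batch its updates. */
    static void on_l1_pagetable_write(l1_table *gl1)
    {
        if (oos_count < OOS_MAX) {
            oos_list[oos_count++] = gl1;
            give_write_access(gl1);
        }
        /* if the list is full, fall back to emulating the write */
    }

    /* On the next page fault or cr3 write: fold the batched guest updates
     * back into the shadow tables and write-protect the pages again. */
    static void resync_all(void)
    {
        while (oos_count > 0) {
            l1_table *t = oos_list[--oos_count];
            resync_shadow(t);
            remove_write_access(t);
        }
    }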
Mechanism of optimization

We partition the memory pages into logical groups of contiguous pages based on their virtual addresses. When an RW-RO fault occurs on an attempted write to a page, we set the write-permission bit for all the pages in that group, thereby granting write privileges to the whole group. We also maintain a separate table in which we record the dirtied group. At the end of the epoch we scan our group bitmap to determine which groups were dirtied, then scan the individual shadow page table entries (SPTEs) of the pages belonging to those groups, checking the dirty bits of the PTEs to determine which pages were actually written to. (A sketch of the whole mechanism follows the Epoch time section below.)

Our approach strikes a balance between a linear scan of the page table and incurring a page fault per written page: scanning the whole page table at the end of each epoch is too expensive. Our aim is to explore whether the dirty-page tracking process can be optimized by finding an optimal group size.

This mechanism introduces a complication in propagating the dirty bits to the GPT. Since we enable write access to all pages in a group, writes to those pages are no longer trapped, so their dirty bits would not be propagated. To ensure correctness, we set the dirty bits in all the corresponding GPTEs. A better way to manage this would be to set only the dirty bits of those GPTEs whose corresponding SPTEs have been dirtied, while we scan the dirty groups at the end of the epoch.

Epoch time

An epoch is the time interval during which dirty pages are tracked; the dirtied pages are shipped at the end of the interval. All entries in the SPT are marked write-protected again after the epoch ends, before the start of a new epoch. For the purposes of our project, to collect page fault statistics and to demonstrate the effects of our optimization, we chose the context-switch interval as the epoch time. We aggregate our counters and implement the group-scanning loop in the sh_update_cr3 function in the shadow code, which executes whenever a context switch happens inside the guest.
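A minimal sketch of the mechanism, assuming a single-level shadow table for brevity; GROUP_SIZE, group_dirty, and page_dirty are our illustrative names, and end_of_epoch stands in for the loop we hook into sh_update_cr3.

    #include <stdint.h>
    #include <string.h>

    #define PTE_RW      (1u << 1)
    #define PTE_DIRTY   (1u << 6)             /* x86 hardware dirty bit */
    #define PAGES       4096u
    #define GROUP_SIZE  32u                   /* pages per group; 32 proved optimal in our tests */
    #define GROUPS      (PAGES / GROUP_SIZE)

    static uint32_t spt[PAGES];
    static uint8_t  group_dirty[GROUPS];      /* groups touched this epoch */
    static uint8_t  page_dirty[PAGES / 8];    /* final per-page dirty bitmap */

    /* RW-RO fault on page pfn: open up the whole group, since its
     * neighbours are disproportionately likely to be written soon. */
    static void on_write_fault(unsigned long pfn)
    {
        unsigned long g = pfn / GROUP_SIZE;
        group_dirty[g] = 1;
        for (unsigned long p = g * GROUP_SIZE; p < (g + 1) * GROUP_SIZE; p++)
            spt[p] |= PTE_RW;
    }

    /* End of epoch: scan only the dirtied groups, read the dirty bits to
     * find the pages actually written, then re-protect everything. */
    static void end_of_epoch(void)
    {
        for (unsigned long g = 0; g < GROUPS; g++) {
            if (!group_dirty[g])
                continue;
            for (unsigned long p = g * GROUP_SIZE; p < (g + 1) * GROUP_SIZE; p++) {
                if (spt[p] & PTE_DIRTY)
                    page_dirty[p / 8] |= 1u << (p % 8);
                spt[p] &= ~(PTE_RW | PTE_DIRTY);
            }
        }
        memset(group_dirty, 0, sizeof(group_dirty));
    }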
Placement of counters

We place counters at the following places in the page-fault handling code:

1) sh_page_fault entry: to measure the total number of page faults.
2) sh_propagate, line no. 616: at this point the code checks whether the page fault was a write fault (an RW-RO fault); if it is, and if the target MFN (i.e. the SPTE) is valid, Xen marks the page as dirty. We place a counter here to count the number of RW-RO faults.
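In outline, the instrumentation amounts to two increments; the hook functions below are illustrative stand-ins for the two code locations named above, with the counters aggregated in sh_update_cr3 at each epoch boundary.

    static unsigned long total_page_faults;   /* 1) bumped at sh_page_fault entry */
    static unsigned long rw_ro_faults;        /* 2) bumped at sh_propagate's dirty-marking point */

    /* illustrative hooks, not Xen's code: */
    static void hook_sh_page_fault_entry(void)
    {
        total_page_faults++;
    }

    static void hook_sh_propagate_dirty(int is_write_fault, int target_mfn_valid)
    {
        if (is_write_fault && target_mfn_valid)
            rw_ro_faults++;
    }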
Test Results

Shadow-2 vs. Shadow-3

The graph below demonstrates the improvement, mentioned above, of Shadow-3 over Shadow-2 in terms of the reduction in page-table-write page faults. The percentage of data-page faults rises from 2% under Shadow-2 to 80% under Shadow-3.

[Graph: page fault breakdown, Shadow-2 vs. Shadow-3]

Page Fault Clustering pattern

The graph below shows how page faults are clustered in the virtual address space. To determine the range of optimization and the pattern of page dirtying, we conducted our measurements as follows. We divided the address space into groups of 256 pages, then checked how many groups were dirtied and how many pages in each group were dirtied. We found an overwhelming percentage of groups with up to 10 pages dirtied; comparatively very small percentages of groups had an additional 10, 20, and so on (up to an additional 40) pages dirtied. From this we estimated that increasing the group size would only increase the time spent in the scanning loops, while the number of page faults would not decrease proportionally, since the number of additional dirty pages found within each group does not grow proportionally. The locality of reference does not improve much beyond 32 pages.

[Graph: page-fault clustering across 256-page groups. No. of Epochs = 1000]
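The measurement behind this graph can be sketched as a histogram pass over the per-epoch dirty bitmap. Names are again illustrative, and the bucket width of 10 matches the granularity quoted above.

    #include <stdint.h>

    #define PAGES       4096u
    #define MEAS_GROUP  256u

    static uint8_t  page_dirty[PAGES / 8];       /* per-page dirty bitmap for one epoch */
    static unsigned hist[MEAS_GROUP / 10 + 1];   /* buckets of 10: 1-10 dirty, 11-20, ... */

    static int page_is_dirty(unsigned long p)
    {
        return page_dirty[p / 8] & (1u << (p % 8));
    }

    /* For each 256-page group, count how many of its pages were dirtied
     * and bucket the group by that count. */
    static void cluster_histogram(void)
    {
        for (unsigned long g = 0; g < PAGES / MEAS_GROUP; g++) {
            unsigned n = 0;
            for (unsigned long p = g * MEAS_GROUP; p < (g + 1) * MEAS_GROUP; p++)
                n += page_is_dirty(p) ? 1u : 0u;
            if (n > 0)                           /* only count groups that were touched */
                hist[(n - 1) / 10]++;
        }
    }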
We verified this estimate by implementing the optimization for group sizes ranging from 16 to 256 (doubling each time) and measuring the decrease in the number of page faults.

Optimal Group Size

Two workloads, HTTPerf and TPC-C, were run with the proposed optimization implemented for group sizes of 16, 32, 64, 128, and 256. We measured the number of page faults and the extra time introduced by scanning the entries within the groups at the end of the epoch. These are presented below in separate graphs that depict average values over multiple iterations of the tests.

[Graph: Workload: HTTPerf, No. of Epochs = 1000]
[Graph: Workload: HTTPerf, No. of Epochs = 1000]

[Graph: Workload: TPC-C, No. of Epochs = 1000]
[Graph: Workload: TPC-C, No. of Epochs = 1000]

The workloads are described below; an example HTTPerf invocation follows the descriptions.

• HTTPerf workload
HTTPerf is a tool for measuring web server performance. We used its request-only facility to drive the server at a connection rate of 50 connections per second for a total of 900 seconds.

• TPC-C workload
The TPC-C workload emulates the OLTP activities of an order-entry environment in which a population of users executes queries against a central database. The benchmark tool is Benchmark Factory, with MySQL as the database backend. The workload has 5 transaction types, and the default mixture of these 5 types is used. The duration of the workload is 1500 seconds.
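For concreteness, an HTTPerf run of this shape would look roughly like the following invocation, where <host> is a placeholder for the server under test and 50 connections/s sustained for 900 s gives 45,000 connections:

    httperf --server <host> --port 80 --rate 50 --num-conns 45000 --num-calls 1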
Conclusion

With a group size of 256, page faults are reduced by 35% and 32% for HTTPerf and TPC-C respectively, but the time spent scanning the entries is highest for this group size. As the group size decreases, page faults increase only gradually: group size 32 still yields 30% and 25% reductions. Only for a group size of 16 do we see a noticeably larger jump in page faults for both cases (TPC-C and HTTPerf), with reductions of only 22% and 16%. The graphs also show that as the group size increases, the scanning time grows disproportionately while page faults decrease only slightly; this is because most groups have fewer than 10 (or 20) dirtied pages, which confirmed our initial estimates. Comparing the two trends, number of page faults and scanning time, we conclude that among the group sizes tested (16 to 256), 32 is the optimal size: it gives reductions of 30% and 25% for HTTPerf and TPC-C respectively, with the best tradeoff against the cost of scanning the entries at the end of the epoch.

REFERENCES

[1] HP Labs. HTTPerf. http://www.hpl.hp.com/research/linux/httperf/, 2008.
[2] Transaction Processing Performance Council. TPC Benchmark C. http://www.tpc.org/tpcc/, 1997.
[3] Quest Software. Benchmark Factory for Databases. http://www.quest.com/benchmark-factory/, 1999.
[4] M. Lu and Tzi-cker Chiueh. Fast Memory State Synchronization for Virtualization-based Fault Tolerance. In Proceedings of the 39th IEEE Conference on Dependable Systems and Networks (DSN 2009), 2009.
[5] XenWiki.