Optimizing Dirty Page Tracking Mechanism for Virtualization-Based Fault Tolerance
Vijayakumar M M, Nikhil Pujari, Sireesh Bolla
Stony Brook University
Transactional fault-tolerant systems typically involve a Master and a Slave server whose
memory states are synchronized at regular intervals. Disk and network outputs are not released
until this synchronization is complete. When the Master fails, the Slave takes over, and the
failover is transparent to external clients. This synchronization process is closely related to live
migration of virtual machines: after the first pre-copy stage, in which the entire memory is
copied to the target machine, the subsequent stages copy only the dirtied pages at regular
intervals.
Xen employs the following mechanism for dirty page tracking. For HVM domains, pages
that have read-write access in the guest page tables are marked as read-only in the shadow page
tables. When an actual write happens, it is therefore trapped, and the page is marked as dirty in a
page dirty bitmap.
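The baseline scheme can be sketched as follows. This is a minimal illustration, not Xen's actual code; the identifiers (spt, dirty_bitmap, PTE_RW) are our own stand-ins for the real structures.

```c
#include <stdint.h>
#include <string.h>

#define NPAGES  1024
#define PTE_RW  0x2u

static uint32_t spt[NPAGES];               /* shadow PTEs (flag bits only) */
static uint8_t  dirty_bitmap[NPAGES / 8];  /* one dirty bit per guest page */

/* Enable tracking: strip write access from every shadow entry so the
 * first write to each page traps into the hypervisor. */
static void log_dirty_enable(void)
{
    memset(dirty_bitmap, 0, sizeof dirty_bitmap);
    for (int i = 0; i < NPAGES; i++)
        spt[i] &= ~PTE_RW;
}

/* Write-fault path: record the page in the bitmap, then restore write
 * access so later writes to the same page do not fault again. */
static void handle_write_fault(unsigned long pfn)
{
    dirty_bitmap[pfn / 8] |= 1u << (pfn % 8);
    spt[pfn] |= PTE_RW;
}
```

The key property is that exactly one fault is paid per page dirtied per tracking interval.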
A page fault occurs in the following cases:
1) the guest tries to modify its page table
2) the GPT (Guest Page Table) entry is present, but the corresponding SPT (Shadow Page Table)
entry is not present
3) the GPT entry is RW-enabled, but the SPT entry is read-only
When a page fault happens, the page fault handler sh_page_fault() is invoked. This handler
executes a guest page table walk to find the faulting entry in the GPT. Depending on the kind
of fault (RW-RO or P-NP), the corresponding Accessed or Dirty bits are set in the GPTE. After
that, sh_propagate is invoked from one of its wrappers, chosen according to the page table level
to which the GPTE belongs (e.g. L1, L2 or L3). The sh_propagate function is the heart of the
shadow code: it computes the SPTEs from the corresponding GPTEs and sets them in the SPT.
When dirty page tracking is enabled, the new SPTEs set by sh_propagate are write-protected.
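The fault-handling flow above can be condensed into the following sketch. It is a simplified stand-in for sh_page_fault()/sh_propagate(), not the real Xen implementation; the bit masks and the log_dirty flag are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_P         0x1u
#define PTE_RW        0x2u
#define GPTE_ACCESSED 0x20u
#define GPTE_DIRTY    0x40u

static bool log_dirty = true;

/* Recompute a shadow entry from its guest entry. With dirty tracking
 * on, the shadow copy is always installed write-protected. */
static uint32_t sh_propagate(uint32_t gpte)
{
    uint32_t spte = gpte;
    if (log_dirty)
        spte &= ~PTE_RW;
    return spte;
}

/* On a fault: after the guest walk finds the GPTE, update its A/D bits
 * according to the fault kind, then propagate into the shadow table. */
static uint32_t shadow_fault(uint32_t *gpte, bool write_fault)
{
    *gpte |= GPTE_ACCESSED;
    if (write_fault)
        *gpte |= GPTE_DIRTY;     /* RW-RO fault: the page is dirtied   */
    return sh_propagate(*gpte);  /* P-NP fault: fill the missing SPTE  */
}
```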
Scope for optimization
Observations of page dirtying patterns during live migration of VMs have shown that
page dirtying is often clustered: if a page is dirtied, it is disproportionately likely that its close
neighbours will be dirtied in the same epoch.
This locality can be exploited by forming groups of pages on which the optimization can be
applied.
Changes in shadowing mechanism from Xen 3.1 to Xen 3.3
In Xen 3.1 the following method was employed for shadowing:
• Remove write access to any guest pagetable.
• When a guest attempts to write to the guest pagetable, mark it out-of-sync, add the
page to the out-of-sync list and give write permission.
• On the next page fault or cr3 write, take each page from the out-of-sync list and
o resync the page: look for changes to the guest pagetable, propagate those
entries into the shadow pagetable
o remove write permission, and clear the out-of-sync bit.
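The steps above can be sketched as follows. The list and helpers are illustrative stand-ins, not Xen's real data structures.

```c
#include <stdbool.h>

#define MAX_OOS 64
#define NPAGES  1024

static unsigned long oos_list[MAX_OOS];
static int oos_count;
static bool page_writable[NPAGES];  /* stand-in for the RW bit in the SPTE */

/* Guest writes to one of its page tables: mark it out-of-sync and give
 * it write permission so further updates to it are batched, not trapped. */
static void mark_out_of_sync(unsigned long gfn)
{
    oos_list[oos_count++] = gfn;
    page_writable[gfn] = true;
}

/* On the next page fault or CR3 write: re-propagate each out-of-sync
 * page into the shadow tables, then re-protect it and clear the list. */
static void resync_all(void)
{
    for (int i = 0; i < oos_count; i++) {
        /* resync_page(oos_list[i]): diff the guest pagetable against the
         * shadow and propagate any changed entries (elided here). */
        page_writable[oos_list[i]] = false;
    }
    oos_count = 0;
}
```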
This method did not work well for MS Windows HVM domains because of Windows' demand-paging behavior.
In Shadow-2 (Xen 3.2), the OOS mechanism was removed and each write to the guest page
table was emulated. But this removed the batching effect of guest page table updates: each time
the guest OS modifies its page table, a page fault results and the write is emulated by the
hypervisor. Windows writes transition values into its page table entries when pages are being
mapped out or mapped in, so the process looks like:
• Page fault
• Emulated write transition PTE
• Emulated write real PTE
Each emulated write involves a VMENTER/VMEXIT and about 8000 cycles of emulation inside
the hypervisor.
Shadow-3 (Xen 3.3) brings back the OOS mechanism to improve on this, but with a few
modifications: only the L1 page tables are allowed to go out-of-sync, and all other page table
writes are emulated. This improves HVM shadow performance significantly.
This affects our optimization as follows. Attempts to implement this optimization on Shadow-2
did not yield significant improvements because page-table-write faults dominated the total
number of page faults by a great margin: approximately 98% of the page faults per epoch were
for GPT writes and only 2% were data page write faults. In Xen 3.3, since the L1 page tables are
allowed to go out of sync and are resynced only on the next page fault, the percentage of GPT
faults has dropped, which has made our optimization worthwhile.
Still, the percentage of GPT faults and GPTE-present/SPTE-not-present (P-NP type) faults is
significant compared to the data page faults. This is primarily because each time a process is
scheduled out, the SPT for that process is destroyed, and it is reconstructed whenever the
process is rescheduled. The GPT pages are marked read-only to construct the SPT and to track
changes the guest OS attempts to make to them.
If we make epochs independent of context switches, i.e. multiple epochs between two context
switches, we could contrast only the GPT modification faults against data page faults, largely
eliminating the SPT construction faults from our measurements and better demonstrating the
efficacy of our optimization.
Mechanism of optimization
We partition the memory pages into logical groups of contiguous pages based on their
virtual addresses. When an RW-RO type fault occurs on an attempted write to a page, we set the
write permission bit for all pages in that group, thereby granting write privileges to the whole
group. We also maintain a separate table in which we record the dirtied group. At the end of the
epoch we scan our group bitmap to determine which groups were dirtied. Then we scan the
individual page table entries (SPTEs) of the pages belonging to those groups, checking the dirty
bits of the PTEs to determine which pages were actually written to.
Our approach is to strike a balance between a linear scan of the page table and incurring a page
fault per page being written. Scanning the whole page table after the end of each epoch is too
expensive. Our aim is to explore if we could optimize the dirty page tracking process by finding
an optimal group size.
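The group-granularity fault handler can be sketched as follows. The names are ours, not Xen's; the group size and page count are example values.

```c
#include <stdint.h>

#define NPAGES     1024
#define GROUP_SIZE 32
#define NGROUPS    (NPAGES / GROUP_SIZE)
#define PTE_RW     0x2u

static uint32_t spt[NPAGES];          /* shadow PTEs (flag bits only)     */
static uint8_t  group_dirty[NGROUPS]; /* the separate dirtied-group table */

/* RW-RO fault on page pfn: grant write access to every page in the
 * containing group, so its neighbours dirty without faulting, and
 * record the group as dirtied for the end-of-epoch scan. */
static void group_write_fault(unsigned long pfn)
{
    unsigned long g = pfn / GROUP_SIZE;
    for (unsigned long p = g * GROUP_SIZE; p < (g + 1) * GROUP_SIZE; p++)
        spt[p] |= PTE_RW;
    group_dirty[g] = 1;
}
```

Under clustered dirtying, one fault now covers up to GROUP_SIZE pages, at the cost of scanning the group's entries later.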
This mechanism gives rise to a complication in propagating the dirty bits to the GPT.
Since we enable write access to all pages in the group, writes to those pages are no longer
trapped, so the dirty bits won't be propagated. To ensure correctness, we set the dirty bits in all
the corresponding GPTEs. A better approach would be to set the dirty bits of only those GPTEs
whose corresponding SPTEs have been dirtied, while we scan the dirty groups at the end of the
epoch.
An epoch is defined as a time interval during which the dirty pages are tracked; the pages
are shipped at the end of that interval. After the epoch ends, and before a new epoch starts, all
entries in the SPT are again marked as write-protected.
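The end-of-epoch pass can be sketched as follows: for each dirtied group, the hardware Dirty bit of every SPTE in it identifies the pages actually written, and everything is then re-protected for the next epoch. Identifiers and bit layouts are illustrative, not Xen's.

```c
#include <stdint.h>

#define NPAGES     1024
#define GROUP_SIZE 32
#define NGROUPS    (NPAGES / GROUP_SIZE)
#define PTE_RW     0x2u
#define PTE_DIRTY  0x40u

static uint32_t spt[NPAGES];
static uint8_t  group_dirty[NGROUPS];
static uint8_t  page_dirty[NPAGES];   /* pages to ship this epoch */

static void end_of_epoch(void)
{
    /* Scan only the groups recorded as dirtied during the epoch. */
    for (unsigned long g = 0; g < NGROUPS; g++) {
        if (!group_dirty[g])
            continue;
        for (unsigned long p = g * GROUP_SIZE; p < (g + 1) * GROUP_SIZE; p++)
            if (spt[p] & PTE_DIRTY)
                page_dirty[p] = 1;    /* actually written: ship it */
        group_dirty[g] = 0;
    }
    /* Re-protect all entries and clear Dirty bits for the next epoch. */
    for (unsigned long p = 0; p < NPAGES; p++)
        spt[p] &= ~(PTE_RW | PTE_DIRTY);
}
```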
For purposes of our project, to collect page fault statistics and to demonstrate the effects of our
optimization, we have chosen the context switch time interval as the epoch time. We aggregate
our counters and implement the group scanning loop in the sh_update_cr3 function in the
shadow code, which is executed whenever a context switch happens inside the guest.
Placement of counters
We place counters at the following places in the page fault handling code:
1) sh_page_fault entry: to measure the total number of page faults
2) sh_propagate, line no. 616: at this point the code checks whether the page fault was
a write fault (RW-RO type). If it was, and if the target MFN, i.e. the SPTE, is
valid, then Xen marks the page as dirty. We place a counter here to count the
number of RW-RO faults.
The graph below demonstrates the previously mentioned improvement of Shadow-3 over
Shadow-2 in the reduction of page-table-write page faults: the percentage of data page faults
rises from 2% in Shadow-2 to 80% in Shadow-3.
Page Fault Clustering pattern
The graph below shows how page faults are clustered in the virtual address space. To
determine the range of optimization and the pattern of page dirtying, we conducted our
measurements as follows.
We divided the address space into groups of 256 pages, then measured how many groups were
dirtied and how many pages in each group were dirtied. We found that an overwhelming
percentage of groups had up to 10 pages dirtied. Comparatively very small percentages of
groups had an additional 10, 20, and so on (up to an additional 40) pages dirtied.
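The measurement above amounts to the following histogram pass over one epoch's per-page dirty flags (a sketch with our own names, assuming a 65536-page address space):

```c
#include <stdint.h>

#define NPAGES  (1 << 16)
#define MGROUP  256                 /* measurement group size, as above */
#define NMG     (NPAGES / MGROUP)

static uint8_t  dirty[NPAGES];      /* per-page dirty flags for one epoch  */
static unsigned hist[MGROUP + 1];   /* hist[k]: groups with k dirty pages  */

/* Count the dirty pages in each 256-page group and bucket the groups
 * by that count; groups with no dirty pages are ignored. */
static void build_histogram(void)
{
    for (unsigned g = 0; g < NMG; g++) {
        unsigned k = 0;
        for (unsigned p = g * MGROUP; p < (g + 1) * MGROUP; p++)
            k += dirty[p];
        if (k)
            hist[k]++;
    }
}
```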
From this we estimated that increasing the group size would only increase the time spent in the
scanning loops, while the number of page faults would not decrease proportionally, since the
number of additional dirty pages found within each group does not grow proportionally. The
locality of reference does not improve much beyond 32 pages.
No. of Epochs = 1000
We verified this by implementing the optimization for group sizes ranging from 16 to
256 (doubling each time) and measuring the decrease in the number of page faults.
Optimal Group Size
Two workloads, HTTPerf and TPC-C, were run with the proposed optimization implemented for
group sizes of 16, 32, 64, 128 and 256. We measured the number of page faults and the extra
time introduced by scanning the entries within the groups at the end of the epoch. These are
presented below in separate graphs that depict average values over multiple iterations of tests.
Workload: HTTPerf, No. of Epochs = 1000
Workload: HTTPerf, No. of Epochs = 1000
Workload: TPC-C, No. of Epochs = 1000
Workload: TPC-C, No. of Epochs = 1000
Below is a description of the workloads.
• HTTPerf workload
HTTPerf is a tool for measuring web server performance. We used its request-only
facility to pump the server with a connection rate of 50 connections per second for a total of 900
seconds.
• TPC-C workload
The TPC-C workload emulates the OLTP activities of an order-entry environment where
a population of users execute queries against a central database. The benchmark tool is
Benchmark Factory, using MySQL as the database backend. In this workload there are 5
transaction types, and the default mixture of these 5 transaction types is used. The duration
of the workload is 1500 seconds.
It is seen that with a group size of 256 the page faults reduce by 35% and 32% respectively for
HTTPerf and TPC-C, but the time spent scanning the entries is highest for this group size. As
the group size is decreased, the page faults increase very gradually, with group size 32 yielding
30% and 25% reductions in page faults. For a group size of 16 we see a slightly larger increase
in page faults in both cases (TPC-C and HTTPerf), with page fault reductions of only 22% and
16%.
From the graphs we also see that as the group size increases, the scanning time grows
disproportionately while the page faults decrease only insignificantly. This is because most
groups have fewer than 10 or 20 page faults, which confirms our initial estimates. Therefore,
comparing the two trends, number of page faults and scanning time, we conclude that among
the group sizes we tested (16 to 256), 32 is the optimal size: it gives reductions of 30% and 25%
for HTTPerf and TPC-C respectively, with the best tradeoff against scanning the entries at the
end of the epoch.
References
• HP Labs. HTTPerf. http://www.hpl.hp.com/research/linux/httperf/, 2008.
• Transaction Processing Performance Council. TPC Benchmark C. http://www.tpc.org/tpcc/, 1997.
• Quest Software. Benchmark Factory for Databases. http://www.quest.com/benchmark-factory/.
• M. Lu and Tzi-cker Chiueh. Fast Memory State Synchronization for Virtualization-based
Fault Tolerance. In Proceedings of the 39th IEEE Conference on Dependable Systems and
Networks (DSN 2009), 2009.