Vancouver, February 2009




Memory management in (x86) Xen




 Tim Deegan
Xen’s memory services
• Memory management
  • Allocating memory to guests, scrubbing free memory
  • Tracking memory usage with reference counts and types
  → Heap allocators and the frametable.
• Virtual memory
  • Protecting guests from each other
  • Enforcing typing rules, e.g. read-only areas
  • Providing translation services between address spaces
  → MMU hypercalls, shadow pagetables, hardware-assisted paging




               © 2008 Citrix Systems, Inc. — All rights reserved   2
Terminology
• Virtual address/Physical address/Machine address
• Frame vs. Page
• PFN: physical frame number
  • Guest’s abstraction for tracking/allocating RAM
  • Usually fairly contiguous
• GFN: guest frame number
  • Guest’s idea of what hardware addresses are
  • Used in guest pagetables
• MFN: machine frame number
  • Actual hardware addresses
Basic memory management
• Buddy allocator hands out frames
• Each guest has a max number of frames
• Frame-table records for each frame:
  • Owner, if any
  • Linked list of other frames owned by this guest
  • Reference count (must be zero to free the frame)
  • Type, and a refcount for the type (must be zero to change type)
  • TLB-flush-avoidance timestamp
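
A minimal sketch of the refcount and type rules in the frame-table record above. The class, field and function names are hypothetical and chosen for illustration; they are not Xen's own definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameInfo:
    """Hypothetical per-frame record; field names are illustrative only."""
    owner: Optional[int] = None    # owning domain id, if any
    count: int = 0                 # general reference count
    type: Optional[str] = None     # e.g. "writable" or "pagetable"
    type_count: int = 0            # references that rely on the current type


def get_type(frame: FrameInfo, wanted: str) -> bool:
    """Take a type reference; the type may change only when type_count is zero."""
    if frame.type != wanted:
        if frame.type_count != 0:
            return False           # still in use under the old type
        frame.type = wanted
    frame.type_count += 1
    return True


def put_type(frame: FrameInfo) -> None:
    """Drop one type reference."""
    assert frame.type_count > 0
    frame.type_count -= 1


def can_free(frame: FrameInfo) -> bool:
    """A frame can be freed only when nothing references it."""
    return frame.count == 0
```

A frame typed as a pagetable cannot become writable until every pagetable-type reference has been dropped, which is what later slides rely on for isolation.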




PV pagetables, a.k.a. direct paging
• PFN → MFN table managed by the guest
• Shared MFN → PFN table provided by Xen
• GFN == MFN, so pagetables can be used directly
 by the hardware
• Xen checks the contents of the guest pagetables
 before allowing the hardware to see them.
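
A toy model of the two tables, assuming 4 KiB frames and flag bits in the low 12 bits of a PTE. The table contents and helper names are made up for illustration.

```python
# The guest keeps its own PFN -> MFN table; Xen publishes a shared
# MFN -> PFN table. All frame numbers here are invented.
p2m = {0: 0x1A3, 1: 0x07F, 2: 0x2C0}             # guest-managed
m2p = {mfn: pfn for pfn, mfn in p2m.items()}     # Xen-provided

PAGE_SHIFT = 12


def make_pte(pfn: int, flags: int) -> int:
    """Build a PTE the hardware can use directly: it holds an MFN."""
    return (p2m[pfn] << PAGE_SHIFT) | flags


def pte_to_pfn(pte: int) -> int:
    """Read a PTE back in the guest's own terms via the MFN -> PFN table."""
    return m2p[pte >> PAGE_SHIFT]
```

The guest translates on every pagetable write and read, so the hardware never needs a second layer of translation.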




Enforcing isolation
• Guest pagetables must have a pagetable type
• Xen checks that page contents obey the typing
 rules before allowing them to take on PT type
• Typing rules:
  • No mapping other guests’ frames
  • No read-write mappings of frames with PT type
• Modifying an already-typed PT needs a call to Xen
 to check the modification obeys the rules.
   (Or trap-and-emulate assistance from Xen.)
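
The two typing rules can be sketched as a check over a candidate pagetable. This is a simplified model: frames are a dictionary, a pagetable is a list of (target frame, writable) pairs, and grant mappings are ignored. All names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical frame records: mfn -> (owner domid, current type)
Frames = Dict[int, Tuple[int, str]]
# One pagetable entry: (target mfn, writable)
Entry = Tuple[int, bool]


def entry_ok(frames: Frames, domid: int, entry: Entry) -> bool:
    """Check a single entry against the two typing rules."""
    mfn, writable = entry
    owner, ftype = frames[mfn]
    if owner != domid:
        return False               # rule 1: no mapping other guests' frames
    if writable and ftype == "pagetable":
        return False               # rule 2: no read-write mappings of PT frames
    return True


def validate_pagetable(frames: Frames, domid: int, entries: List[Entry]) -> bool:
    """Every entry must pass before the frame may take on pagetable type."""
    return all(entry_ok(frames, domid, e) for e in entries)
```

The same per-entry check is what a later modification of an already-typed pagetable has to pass.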


Grant Tables
• Guest-supplied ACLs allowing other guests to map
 their frames
• Mapper makes a hypercall with a domid, an
 opaque index, and the address of a PTE
• Xen checks that entry in the mappee’s grant table
 and if it’s OK, modifies the PTE
• Needs explicit unmap hypercall when finished
• Also available: grant-copy, where Xen memcpy()s
 from/to a granted frame instead of mapping it.
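
A rough sketch of the check behind a grant-map request. The entry layout, the global table and the function name are assumptions for illustration, not the real grant-table ABI.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GrantEntry:
    allowed_domid: int   # which domain may map the frame
    frame: int           # the frame being granted
    readonly: bool       # whether only read-only mappings are allowed


# Hypothetical per-domain grant tables: granter domid -> {index: entry}
grant_tables: Dict[int, Dict[int, GrantEntry]] = {}


def grant_map(mapper: int, granter: int, ref: int, want_write: bool) -> Optional[int]:
    """Return the PTE contents Xen would install, or None if the check fails."""
    entry = grant_tables.get(granter, {}).get(ref)
    if entry is None or entry.allowed_domid != mapper:
        return None                # no such grant, or not granted to this mapper
    if want_write and entry.readonly:
        return None                # grant only permits read-only mappings
    flags = 0x1 | (0x2 if want_write else 0)   # present, optionally writable
    return (entry.frame << 12) | flags
```

Because Xen writes the PTE itself, the mapper never gets to choose which frame ends up in it.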

HVM pagetables
• PFN → MFN table managed by Xen
• GFN == PFN so need another layer of translation
• Guest won’t cooperate in enforcing access control
• Two options:
  • Xen builds shadow copies of guest pagetables
    with the extra translations and controls added; or
  • Hardware support for using a second set of
    pagetables containing extra translations and
    controls
Shadow pagetables
• Keep Xen-maintained copies of guest frames that
 we think are being used as pagetables
• Guest never sees the shadows so we can add any
 translations and restrictions we like
• 13 different kinds of shadows depending on what
 kind of pagetable we think it is: a single frame can
 have up to 10 shadows at once
• Also have three kinds of shadows for faking out
 superpages (2MB of contiguous PFNs does not
 mean 2MB of contiguous MFNs)
Shadow pagetables: building
• Start with an empty top-level shadow of the PFN in
 CR3
• On pagefault, shadow the entries in the PT walk,
 making new shadows at each level if necessary.
• Each shadow entry is the guest entry with the GFN
 replaced by an MFN (of the next-level shadow or of
 guest memory) and extra access restrictions:
  • Pages that have shadows are mapped read-only.
  • Extra restrictions can be specified in the PFN → MFN table.
  • We can restrict write access to guest’s frames for tracking
    page-dirtying during live migration.
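
A minimal sketch of building one shadow entry, covering only leaf entries that point at guest memory. For higher levels the MFN would be that of the next-level shadow instead. The flag layout (low 12 bits) and all names are assumptions.

```python
PRESENT, RW = 0x1, 0x2
FLAG_MASK = 0xFFF


def shadow_entry(guest_pte: int, p2m: dict, shadowed_gfns: set,
                 log_dirty: bool = False) -> int:
    """Guest entry with the GFN replaced by an MFN, plus extra restrictions."""
    if not guest_pte & PRESENT:
        return 0                   # nothing to shadow
    gfn = guest_pte >> 12
    flags = guest_pte & FLAG_MASK
    if gfn not in p2m:
        return 0                   # no backing frame: leave shadow not-present
    if gfn in shadowed_gfns or log_dirty:
        flags &= ~RW               # shadowed or dirty-tracked: force read-only
    return (p2m[gfn] << 12) | flags
```

Clearing the write bit is what makes the next guest write fault into Xen, either to keep shadows in step or to record the page as dirty.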

Shadow pagetables: maintenance
• Shadowed pages are always kept read-only.
• When the guest writes to a shadowed frame, Xen’s
 pagefault handler must:
  • Emulate the current instruction to figure out what’s being written;
  • Write the new value into the guest pagetable; and
  • Update the equivalent parts of all shadows of the frame.
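
The last two steps of the handler can be sketched as below. Instruction emulation is left out, pagetables are modelled as plain lists, and a single `translate` callable stands in for the per-shadow-type translation. All of these are simplifying assumptions.

```python
def emulate_pt_write(guest_pt: list, shadows: list, index: int,
                     new_pte: int, translate) -> None:
    """Apply a trapped guest write to the guest pagetable and every shadow.

    guest_pt  -- the guest's own pagetable frame, as a list of entries
    shadows   -- all shadow copies of that frame (each a list of entries)
    translate -- callable turning a guest entry into a shadow entry
    """
    guest_pt[index] = new_pte                  # the guest sees its own value
    for shadow in shadows:
        shadow[index] = translate(new_pte)     # shadows get the translated value
```

Every write pays for the trap, the emulation and one update per shadow, which is why the next slide worries about un-shadowing frames.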




Shadow pagetables: tearing back down
• Shadowing a frame is expensive
  • Thousands of cycles for trap and emulation of every write.
• Easy to tell when a page becomes a PT; harder to
 tell when it stops:
  • Reference count based on higher-level shadows and CR3 contents,
    but hard to know when a PFN’s been used in CR3 for the last time
  • Guess based on odd-looking page contents
  • Guess based on memory access patterns
  • Get PV drivers to give us hints
  • Recycle under memory pressure by approximating LRU



Optimizations
• Tagged TLBs (AMD’s ASID; Intel’s VPID) allow us
 to avoid a TLB flush on every VMEXIT/VMENTER
  • In theory can do even better now that Win2k8 supports context
   switching without TLB flushing.

• Shadowing not-present entries with invalid entries
 lets us fast-track “real” pagefaults back to the guest
• Out-of-sync shadows: let the guest write directly to
 the lowest level of pagetables and sync up the
 shadows whenever a hardware TLB would re-read
 (TLB flush, page faults, higher-level writes)

Hardware-assisted paging
• Xen supplies a second set of pagetables describing
 the PFN → MFN translation and extra restrictions
• CPU takes a pointer to this as well as a
 (PFN-space) CR3 value from the guest
• MMU hardware applies the composition of the two
 translations and the intersection of the access
 rights
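
A toy model of what the MMU computes, with each set of pagetables flattened to a single dictionary lookup and only a write permission tracked. Names and table shapes are illustrative.

```python
def walk(table: dict, frame: int):
    """Look up one frame in a flat toy table: frame -> (target frame, writable)."""
    return table.get(frame)


def translate(guest_pt: dict, nested_pt: dict, vfn: int):
    """Compose the guest (virtual -> PFN) and nested (PFN -> MFN) translations."""
    g = walk(guest_pt, vfn)
    if g is None:
        return None                # guest pagefault
    pfn, guest_writable = g
    n = walk(nested_pt, pfn)
    if n is None:
        return None                # nested fault, handled by Xen
    mfn, nested_writable = n
    # Access rights are the intersection of what both tables allow.
    return mfn, (guest_writable and nested_writable)
```

Xen can therefore write-protect a frame in the second table without touching, or even reading, the guest's own pagetables.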




Hardware-assisted paging: performance
• Pro: avoids expensive trap + emulate on writes to PTs,
 and extra logic on pagefault path
• Con: TLB fill can now take more than 20 memory accesses!
• Con: CPU’s TLB is much smaller than the set of
 shadows we can maintain
• AMD’s RVI gives +10% performance over shadows
 on some workloads, -10% on others; Intel’s EPT
 seems more consistently better than shadowing
• Performance depends heavily on using superpage
 mappings in the second pagetable
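
The cost of a TLB fill can be worked out from the depth of the two sets of pagetables. The sketch below assumes a worst-case walk with no caching of intermediate translations. With four guest levels over four nested levels, reaching the last guest entry costs 20 accesses, and translating the final frame brings the total to 24.

```python
def two_d_walk_accesses(guest_levels: int, nested_levels: int) -> int:
    """Worst-case memory accesses to fill one TLB entry with nested paging."""
    # Each guest entry lives at a PFN-space address: translate it through
    # the nested tables, then read the entry itself.
    per_guest_read = nested_levels + 1
    # The frame named by the last guest entry needs one more nested walk.
    final_translation = nested_levels
    return guest_levels * per_guest_read + final_translation
```

Mapping guest memory with 2MB superpages in the second pagetable removes one nested level from every step, which is one reason superpage mappings matter so much here.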
Fin



