Nakajima numa-final
Nakajima numa-final Presentation Transcript

  • 1. Xen Guest NUMA: General Enabling Part. 29 April 2010. Jun Nakajima, Dexuan Cui, and Nitin Kamble
  • 2. Legal Disclaimer  INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.  Intel may make changes to specifications and product descriptions at any time, without notice.  All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.  Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.  Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.  *Other names and brands may be claimed as the property of others.  Copyright © 2010 Intel Corporation. Xen Summit NA 2010
  • 3. Xen Guest NUMA Project • Working with the Xen Community: − Andre Przywara − Dulloor Rao − You are welcome to join us • Generic guest NUMA support for both PV and HVM − The major difference is basically the ACPI tables − NUMA-specific enlightenments are applicable to both
  • 4. Agenda • NUMA machines • Importance of NUMA Awareness • Motivation for NUMA Guests • What is required to support an effective NUMA guest? • Getting host info and resource allocation • Guest configuration • Current Status and Next Steps
  • 5. NUMA Machines (diagram: four Xeon® 7500 sockets with per-node memory, memory buffers, and I/O hubs; cores grouped into nodes) *: A socket/package can contain multiple nodes
  • 6. NUMA Machines (cont.) (diagrams: 2-socket, 2+2 (4S), 4+4 (8S), 2+2+2+2 (8S), 4S (32 DIMMs), and 4S (64 DIMMs) configurations, showing CPU sockets, I/O hubs, interconnects, and memory) *Other names and brands may be claimed as the property of others. Copyright © 2010, Intel Corporation.
  • 7. Importance of NUMA Awareness (data from Andre Przywara <>)

    lmbench's rd benchmark (normalized to native Linux = 100):

        guests     numa=off                numa=on                 avg increase
                   min    avg    max       min    avg    max
          1               78.0                    102.3
          7        37.4   45.6   62.0      90.6   102.3  110.9       124.4%
         15        21.0   25.8   31.7      41.7   48.7   54.1         88.2%
         23        13.4   17.5   23.2      25.0   28.0   30.1         60.2%

    kernel compile in tmpfs, 1 VCPU, 2 GB RAM, average of elapsed time:

        guests     numa=off    numa=on     increase
          1        480.610     464.320       3.4%
          7        482.109     461.721       4.2%
         15        515.297     477.669       7.3%
         23        548.427     495.180       9.7%

    again with 2 VCPUs and make -j2:

          1        264.580     261.690       1.1%
          7        279.763     258.907       7.7%
         15        330.385     272.762      17.4%
         23        463.510     390.547      15.7%   (46 VCPUs on 32 pCPUs)

    *: 4-socket AMD Magny-Cours machine with 8 nodes, 48 cores, and 96 GB RAM.
  • 8. Motivation • More NUMA machines in the market • Run very large guests efficiently on NUMA machines for performance reasons − More memory, VCPUs, and I/O spanning multiple nodes − More performance and throughput • Allow existing OSes and apps to run in virtualization with NUMA enabled (or disabled) − Populate guest ACPI SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table) − NUMA libraries • NUMA-specific optimizations/enlightenments
  • 9. Achieving NUMA Performance • Which processors (i.e. cores) are connected directly to which blocks of memory? − SRAT (Static Resource Affinity Table) or PV • How far apart are the processors from their associated memory banks? − SLIT (System Locality Information Table) or PV • Virtualization-Specific Requirements − Bind VCPUs to nodes − Construct guest SRAT and SLIT • Need to reflect hardware attributes • Predictable and repeatable − Use a fixed guest configuration
  • 10. Constructing SRAT and SLIT for Guests • Get platform info from the host using the host NUMA API (in upstream) − XEN_SYSCTL_topologyinfo • # of cores per node/socket − XEN_SYSCTL_numainfo • Equivalent to SRAT and SLIT • Allocate memory from nodes based on the memory allocation strategy in the config file − CONFINE, SPLIT, STRIPE (next page) − # of nodes
  • 11. Guest NUMA Config Options • Number of nodes means “# of nodes from which memory is allocated” − Not necessarily visible to the guest • max_guest_nodes=<N> − Specify the desirable number of nodes. Number of system nodes by default. • min_guest_nodes=<N> − Specify the minimum number of nodes. Memory is allocated from nodes ( >= min_guest_nodes). Creation of the guest fails if allocation does not meet it. 1 by default. • The number of nodes matters for SPLIT and STRIPE (next page) • Create the guest in a deterministic way by setting min_guest_nodes = max_guest_nodes
  • 12. Guest NUMA Config Options (cont.) Memory Allocation Strategy: • CONFINE: Allocate the entire domain memory from a single node. Fail if it does not work. − No need to tell the guest about NUMA at all. • SPLIT: Allocate domain memory from nodes by splitting it equally across the nodes. Fail if it does not work. − Populate the NUMA topology and propagate it to the guest (includes PV querying via hypercall). If the guest is paravirtualized and does not know about NUMA (missing ELF hint), fail. • STRIPE: Interleave domain memory across nodes. − No need to tell the guest about NUMA at all. • AUTOMATIC: Try the three strategies one after another (order: CONFINE, SPLIT, STRIPE)
  • 13. Considerations on Live Migration • The number of nodes needs to be the same • The memory allocation strategy needs to be inherited for live migration − CONFINE and STRIPE are not really NUMA guests − SPLIT will also be used at live-migration time • If the target machine has similar NUMA characteristics, it’s possible to do live migration while retaining NUMA performance.
  • 14. Current Status and Next Steps • Current Status − Host NUMA API is in upstream − Rebasing the patches to submit − Re-measuring performance − Merging patches from Dulloor and Andre • Next Steps − Performance analysis with different workloads • Scheduling − I/O NUMA • DMA across nodes with direct device assignment − Live Migration • Anyone?