
XPDS16: High-Performance Virtualization for HPC Cloud on Xen - Jun Nakajima & Tianyu Lan, Intel Corp.

We have been working to get Xen up and running on self-boot Intel® Xeon Phi™ processors to build HPC clouds. We see several challenges because of the unique (but not unusual for HPC) hardware technologies and performance requirements. For example, such hardware technologies include 1) >256 CPUs, 2) MCDRAM (high-bandwidth memory), and 3) an integrated fabric (i.e. Intel® Omni-Path). Unlike the “coprocessor” model, supporting self-boot with >256 CPUs has various implications for Xen, including scheduling and scalability. We need to allow user applications to use MCDRAM directly to perform optimally. We also need to enable the VM to use the integrated HPC fabric through direct I/O assignment.

In addition, we run only a single VM on each node to meet the high-performance requirements of HPC clouds. This non-shared model allowed us to optimize Xen further. In this talk, we share our design and lessons learned, and discuss the options we considered to achieve high-performance virtualization for HPC.


  1. High-Performance Virtualization for HPC Cloud on Xen (Jun Nakajima, Tianyu Lan)
  2. Legal Disclaimer • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. • Intel may make changes to specifications and product descriptions at any time, without notice. • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. • Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. • *Other names and brands may be claimed as the property of others. • Copyright © 2016 Intel Corporation.
  3. Agenda • Intel® Xeon Phi™ processor • HPC Cloud usage • Challenges for Xen • Achieving high performance • Call for action
  4. The world is going parallel – stick with sequential code and you will fall behind. Cores / threads per core / vector width / peak memory bandwidth:
     • Intel® Xeon® Processor E5-2600 v3 Product Family (formerly codenamed Haswell): 18 / 2 / 256-bit / 68 GB/s
     • Intel® Xeon® Processor E5-2600 v4 Product Family (codenamed Broadwell): 22 / 2 / 256-bit / 77 GB/s
     • Skylake: 28 / 2 / 512-bit / 128 GB/s
     • Intel® Xeon Phi™ x100 Product Family (formerly codenamed Knights Corner): 61 / 4 / 512-bit / 352 GB/s
     • Intel® Xeon Phi™ x200 Product Family (codenamed Knights Landing): 72 / 4 / 512-bit (x2) / >500 GB/s
  5. Intel® Xeon Phi™ Processor • Intel’s first bootable host processor specifically designed for HPC • Binary compatible with Intel® Xeon® processors • Integration of memory on package: innovative memory architecture for high bandwidth and high capacity • Integration of Omni-Path Fabric on package
  6. Intel® Xeon Phi™ Product Family roadmap:
     • Knights Corner, Intel® Xeon Phi™ x100 Product Family (available today): 22 nm process, coprocessor only, >1 TF DP peak, up to 61 cores, up to 16GB GDDR5
     • Knights Landing, Intel® Xeon Phi™ x200 Product Family (launched): 14 nm process, host processor & coprocessor, >3 TF DP peak¹, up to 72 cores, up to 16GB HBM, up to 384GB DDR4², ~460 GB/s STREAM, integrated fabric²
     • Knights Hill, 3rd generation (future, in planning): 10 nm process, integrated fabric (2nd generation)
     *Results will vary. This simplified test is the result of the distillation of the more in-depth programming guide found here: https://software.intel.com/sites/default/files/article/383067/is-xeon-phi-right-for-me.pdf. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. ¹ Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency, and floating-point operations per cycle (FLOPS = cores × clock frequency × floating-point operations per cycle). ² Host processor only.
  7. Hardware Overview
     • Chip: up to 36 tiles interconnected by a mesh
     • Tile: 2 cores + 2 VPUs per core + 1MB shared L2
     • Core: 4 hyper-threads per core
     • ISA: binary compatible with Intel Xeon processors, plus the AVX-512 extension
     • Memory: up to 16GB on-package MCDRAM + up to 6 channels of DDR4-2400 (up to 384GB)
     • IO: 36 lanes of PCIe Gen3 + 4 lanes of DMI to the chipset
     • Node: 1-socket only
     (Die diagram: tiles on the mesh, with the IMC (integrated memory controller) driving DDR4, the EDC (embedded DRAM controller) driving MCDRAM, and the IIO (integrated I/O controller) providing the 36 PCIe Gen3 lanes (x16, x16, x4) and the DMI2 link to the PCH.)
  8. MCDRAM memory modes
     • Cache mode: hardware automatically manages the 16GB MCDRAM as a “memory-side cache” (64B cache lines, direct-mapped) between the CPU and up to 384 GB of external DDR memory.
     • Flat mode: the application manually manages how it uses the integrated on-package memory and the external DDR for peak performance; the 8GB/16GB MCDRAM and the DRAM occupy distinct ranges of the physical address space.
     • Hybrid mode: joins the benefits of both cache and flat modes by segmenting the integrated on-package memory (e.g. 8 or 4 GB of MCDRAM as cache, 8 or 12 GB as flat memory).
  9. MCDRAM (flat mode)
     • The platform appears as 2 NUMA nodes: the CPUs and DDR on NUMA node 0, the MCDRAM on NUMA node 1
     • Memory is allocated in DDR by default, which keeps low-bandwidth data out of MCDRAM
     • Applications explicitly allocate important data in MCDRAM (see the sketch after this slide)
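To make the last bullet concrete, here is a minimal sketch of how an application can place a bandwidth-critical buffer in MCDRAM under flat mode. It is not part of the slides; it assumes the memkind library's hbwmalloc interface is available, and it falls back to ordinary DDR otherwise.

```c
/* Sketch: allocate a bandwidth-critical buffer in MCDRAM (flat mode) using
 * the memkind hbwmalloc API, falling back to regular DDR when no
 * high-bandwidth memory node is visible. Build with: cc demo.c -lmemkind */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1UL << 26;                         /* 64 Mi doubles = 512 MiB */
    int have_hbw = (hbw_check_available() == 0);  /* 0 means MCDRAM present  */
    double *buf = have_hbw ? hbw_malloc(n * sizeof(double))
                           : malloc(n * sizeof(double));

    if (!buf)
        return 1;
    printf("buffer placed in %s\n", have_hbw ? "MCDRAM" : "DDR");

    for (size_t i = 0; i < n; i++)                /* touch pages so they are placed */
        buf[i] = (double)i;

    if (have_hbw)
        hbw_free(buf);
    else
        free(buf);
    return 0;
}
```

Alternatively, an unmodified application can be bound to the MCDRAM node with numactl (e.g. numactl --membind=1), assuming node 1 is the MCDRAM node as in the layout above.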
  10. Agenda • Intel® Xeon Phi™ processor • HPC Cloud usage • Challenges for Xen • Achieving high performance • Call for action
  11. HPC Cloud usage • A single VM on each machine • Expose most host CPUs to the VM • More than 255 VCPUs in the VM • Expose MCDRAM to the VM • Pass the Omni-Path Fabric through to the VM
  12. Agenda • Intel® Xeon Phi™ processor • HPC Cloud usage • Challenges for Xen • Achieving high performance • Call for action
  13. Challenges for Xen • Support >255 VCPUs • Virtual IOMMU support • Scalability • Scalability issue in the tasklet subsystem
  14. Support >255 VCPUs
     • An HVM guest currently supports up to 128 VCPUs
     • x2APIC mode is required for >255 VCPUs
     • Linux disables x2APIC mode when there is no IR (interrupt remapping)
     • Xen has no virtual IOMMU support today
     • So: >255 VCPUs => x2APIC => IR => virtual IOMMU
     • Enable DMA translation first: the Linux IOMMU driver cannot work without DMA translation
  15. Virtual IOMMU (architecture diagram): the VM's Linux kernel IOMMU driver discovers the virtual IOMMU through an ACPI DMAR table provided by hvmloader; a dummy Xen-vIOMMU device model in QEMU (Dom0) cooperates with the virtual IOMMU in the hypervisor via Xenstore and hypercalls.
  16. Virtual IOMMU, DMA translation (diagram): the guest IOMMU driver maps IOVA -> GPA for DMA to its virtual PCI device; the dummy Xen-vIOMMU in QEMU and the virtual IOMMU in the hypervisor shadow each IOVA -> target GPA mapping into an IOVA -> HPA entry in the physical IOMMU, so DMA from the assigned physical PCI device reaches the right host memory (a sketch of this composition follows this slide).
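The shadowing in this diagram is a composition of two translations. The following sketch is not from the talk and uses invented helper names (p2m_gpa_to_hpa and physical_iommu_map are placeholders, not Xen APIs); it only illustrates how a vIOMMU could turn a guest IOVA -> GPA mapping into a shadow IOVA -> HPA entry.

```c
/* Conceptual sketch (not Xen code): a virtual IOMMU shadows the guest's
 * IOVA -> GPA mappings into IOVA -> HPA mappings for the physical IOMMU by
 * composing them with the hypervisor's GPA -> HPA (p2m/EPT) translation.
 * All functions and constants here are hypothetical stand-ins. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t iova_t, gpa_t, hpa_t;

/* Hypothetical stand-in for the p2m/EPT lookup: GPA -> HPA. */
static hpa_t p2m_gpa_to_hpa(gpa_t gpa)
{
    /* Pretend guest RAM is contiguous host memory starting at 4 GiB. */
    return 0x100000000ULL + gpa;
}

/* Hypothetical stand-in for programming the physical IOMMU page table of the
 * assigned device (identified by its PCI bus/device/function). */
static void physical_iommu_map(uint16_t bdf, iova_t iova, hpa_t hpa)
{
    printf("IOMMU[%04x]: IOVA 0x%llx -> HPA 0x%llx\n",
           (unsigned)bdf, (unsigned long long)iova, (unsigned long long)hpa);
}

/* Invoked when the emulated vIOMMU observes the guest driver mapping an
 * IOVA to a GPA for DMA: compose with the p2m and install the shadow entry. */
static void viommu_shadow_map(uint16_t bdf, iova_t iova, gpa_t gpa)
{
    hpa_t hpa = p2m_gpa_to_hpa(gpa);    /* GPA -> HPA from the p2m          */
    physical_iommu_map(bdf, iova, hpa); /* shadow entry is IOVA -> HPA      */
}

int main(void)
{
    /* Guest maps IOVA 0x2000 to guest physical page 0x7f000 for the device
     * at BDF 00:03.0 (all values hypothetical). */
    viommu_shadow_map(0x0018, 0x2000, 0x7f000);
    return 0;
}
```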
  17. Virtual IOMMU, interrupt remapping (diagram): the guest device driver and IOMMU driver program the IR table through the virtual IOMMU; interrupts from the assigned physical PCI device pass through IRQ remapping and the host IRQ subsystem, and are injected into the VM as virtual IRQs via the vIOAPIC/vMSI and vLAPIC.
  18. Challenges for Xen • Support >255 VCPUs • Virtual IOMMU support • Scalability • Scalability issue in the tasklet subsystem
  19. Scalability issue in the tasklet subsystem
     • Tasklet work lists are per-CPU data structures
     • A single global spinlock, tasklet_lock, protects all of these lists
     • tasklet_lock becomes a hot spot when running a heavy workload in the VM
     • Acquiring the global lock takes ~180k TSC cycles on average (an I/O VM exit costs ~150k TSC cycles)
     • Fix: change tasklet_lock to a per-CPU lock (see the sketch after this slide)
     Benchmark results (relative to host = 100): Stream / DGEMM / SGEMM scores of 63 / 50 / 50 for the original VM and 87 / 85 / 86 for the optimized VM.
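For illustration, the locking change amounts to pairing each per-CPU work list with its own lock, so a CPU scheduling its own tasklets no longer contends with every other CPU on tasklet_lock. The sketch below is not the actual Xen patch; it is a minimal user-space analogue using pthread spinlocks and an invented tasklet_list structure.

```c
/* Illustration only (not Xen code): protect each CPU's tasklet list with its
 * own lock instead of one global lock shared by all CPUs.
 * Build with: cc tasklet.c -pthread */
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 288   /* arbitrary upper bound for this sketch */

struct tasklet {
    struct tasklet *next;
    void (*func)(void *data);
    void *data;
};

/* Before: one global lock serializes every CPU's list operations and becomes
 * a hot spot under load. */
static pthread_spinlock_t tasklet_lock;

/* After: a per-CPU list paired with a per-CPU lock; on the fast path a CPU
 * only touches its own lock, so there is no cross-CPU contention. */
struct tasklet_list {
    pthread_spinlock_t lock;
    struct tasklet *head;
};
static struct tasklet_list percpu_tasklets[NR_CPUS];

static void tasklet_schedule_on(unsigned int cpu, struct tasklet *t)
{
    struct tasklet_list *list = &percpu_tasklets[cpu];

    pthread_spin_lock(&list->lock);     /* per-CPU lock, not tasklet_lock */
    t->next = list->head;
    list->head = t;
    pthread_spin_unlock(&list->lock);
}

int main(void)
{
    pthread_spin_init(&tasklet_lock, PTHREAD_PROCESS_PRIVATE);
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        pthread_spin_init(&percpu_tasklets[cpu].lock, PTHREAD_PROCESS_PRIVATE);

    static struct tasklet t = { .next = NULL, .func = NULL, .data = NULL };
    tasklet_schedule_on(0, &t);         /* queue a dummy tasklet on CPU 0 */
    return 0;
}
```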
  20. Agenda • Intel® Xeon Phi™ processor • HPC Cloud usage • Challenges for Xen • Achieving high performance • Call for action
  21. Achieving high performance • Expose key compute resources to the VM: • CPU topology • MCDRAM • Reduce timer interrupts
  22. VM CPU topology (diagram: each guest core and its 4 hyper-threads are pinned one-to-one onto the matching native core, guest core 0 through core 63, while Dom 0 runs on the remaining cores)
     • HPC software assigns work according to the CPU topology
     • Balance the workload among the physical cores (a thread-pinning sketch follows this slide)
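The pinning shown in this diagram is what lets HPC runtimes inside the guest keep their usual topology-aware placement. As a minimal illustration (not from the talk), a guest-side worker can bind itself to a specific virtual CPU with the standard Linux affinity call; with 1:1 VCPU pinning this also fixes the physical core.

```c
/* Sketch: pin the calling thread to one logical CPU, the way an HPC runtime
 * places one worker per core. Linux-specific; build with: cc pin.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}

int main(void)
{
    if (pin_to_cpu(0) != 0) {           /* e.g. bind the first worker to CPU 0 */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}
```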
  23. Expose MCDRAM to VM
     • Create vNUMA nodes that mirror the host's NUMA topology
     • Keep the MCDRAM vNUMA node at a far NUMA distance from the CPU vNUMA node, as on the host
     (Diagram: host NUMA 0 = CPU + DDR, host NUMA 1 = MCDRAM; VM vNUMA 0 = VCPU + RAM, VM vNUMA 1 = RAM backed by MCDRAM. A guest-side discovery sketch follows this slide.)
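Inside such a guest, software can discover the MCDRAM node the same way it does on bare metal: look for the CPU-less vNUMA node with the larger distance. The sketch below is an illustration only and assumes libnuma is available in the guest; it is not something shown in the talk.

```c
/* Sketch (assumes libnuma): find a CPU-less NUMA node that is "far" from
 * node 0, which in this vNUMA layout corresponds to the MCDRAM node.
 * Build with: cc mcdram_probe.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support visible in this guest\n");
        return 1;
    }

    int max = numa_max_node();
    for (int node = 0; node <= max; node++) {
        struct bitmask *cpus = numa_allocate_cpumask();
        numa_node_to_cpus(node, cpus);            /* which CPUs live here?  */

        int has_cpus = 0;
        for (unsigned int i = 0; i < cpus->size; i++)
            if (numa_bitmask_isbitset(cpus, i))
                has_cpus = 1;

        printf("node %d: distance to node 0 = %d, %s\n",
               node, numa_distance(node, 0),
               has_cpus ? "has CPUs" : "no CPUs (MCDRAM candidate)");
        numa_free_cpumask(cpus);
    }
    return 0;
}
```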
  24. Reduce timer interrupts
     • Local APIC timer interrupts cause frequent VM exits (~26,000 exits/s) while running benchmarks
     • Reduce timer interrupts by setting timer_slop to 10ms (a coalescing sketch follows this slide)
     • Side effect: lower timer resolution
     Benchmark results (relative to host = 100): Stream / DGEMM / SGEMM scores of 63 / 50 / 50 for the original VM, 87 / 85 / 86 with the tasklet fix, and 99 / 98 / 97 with the timer slop change.
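For reference, a rough sketch of what a larger timer_slop buys: the hypervisor may fire a timer up to the slop value after its deadline, so deadlines that fall inside an already-armed window can share one interrupt instead of each causing a VM exit. This is a conceptual illustration only, not the actual Xen timer code, and the helper names are invented.

```c
/* Conceptual sketch of timer-slop coalescing (not the actual Xen code): a
 * timer is allowed to fire up to `slop` nanoseconds after its deadline, so a
 * new deadline that falls inside the window of the already-armed interrupt
 * does not need to reprogram the hardware timer. A real implementation also
 * keeps a queue of pending deadlines; that part is omitted here. */
#include <stdint.h>
#include <stdio.h>

static const uint64_t slop_ns = 10ULL * 1000 * 1000;  /* 10 ms, as on the slide */
static uint64_t armed_deadline_ns;                    /* 0 = nothing armed      */

/* Hypothetical stand-in for programming the one-shot hardware timer. */
static void program_hw_timer(uint64_t deadline_ns)
{
    armed_deadline_ns = deadline_ns;
    printf("hw timer armed for %llu ns\n", (unsigned long long)deadline_ns);
}

static void set_timer(uint64_t deadline_ns)
{
    /* Deadline fits in [armed, armed + slop]: the already-armed interrupt is
     * close enough, so the two timers share one interrupt. */
    if (armed_deadline_ns &&
        deadline_ns >= armed_deadline_ns &&
        deadline_ns <= armed_deadline_ns + slop_ns) {
        printf("deadline %llu ns coalesced with armed interrupt\n",
               (unsigned long long)deadline_ns);
        return;
    }
    program_hw_timer(deadline_ns);
}

int main(void)
{
    set_timer(1000000000ULL);   /* t = 1.000 s: arms the hardware timer      */
    set_timer(1004000000ULL);   /* t = 1.004 s: within 10 ms slop, coalesced */
    return 0;
}
```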
  25. Reduce timer interrupts (next steps) • Hypervisor: no scheduler is needed when running only a single VM • Guest: make the guest Linux tickless
  26. Agenda • Intel® Xeon Phi™ processor • HPC Cloud usage • Challenges for Xen • Achieving high performance • Call for action
  27. Call for action • We were able to achieve high-performance HPC on Xen • Changes required in Xen: › increase the number of VCPUs (128 => 255) › virtual IOMMU support
  28. Q & A
