
Heterogeneous Computing : The Future of Systems


Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, Google and Microsoft heterogeneous system usage

Published in: Technology

  1. 1. IBM Confidential. Heterogeneous Computing: The Future of Systems. Anand Haridass, Senior Technical Staff Member, IBM Cognitive Systems; IBM Academy of Technology; NITK (KREC), Batch of ’95 (E&C). NITK-IBM Computer Systems Research Group (NCSRG) Seminar, Sep/18/2017
  2. 2. 2 Agenda: System Overview • Technology Trends – End of Dennard Scaling • Vertical Integration – OpenPOWER • “Feeding the Engine” – Memory / Storage • Need for High Performance Bus – OpenCAPI • GPU Attach – NVLink • Accelerator Examples
  3. 3. 3 Von Neumann Architecture • First published by John von Neumann in 1945. • Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs. • Stored-program computer concept: instruction data and program data are stored in the same memory. • Most servers & PCs produced today use this design.
  4. 4. 4 Typical 2 Socket Systems [2017] (diagram: two CPUs, each with Memory, an Accelerator, and IO / Storage / NW)
  5. 5. 5 Processor Technology Trends: Moore’s Law Alive & Kicking. Moore’s Law (1965; revised to a two-year cadence in 1975): “Number of transistors in a dense integrated circuit doubles approximately every two years”
  6. 6. 6 Dennard Scaling Limits
     Dennard scaling: as transistors get smaller their power density stays constant, so that power use stays in proportion with area; both voltage and current scale (downward) with length.
     Transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%.
     • Voltage scaling for high-performance designs is limited
       • By leakage issues: can’t reduce threshold voltages; need steeper sub-threshold slopes
       • By variability, esp. VT variability; need to minimize random dopant fluctuations
       • By gate oxide thickness; some relief from high-K materials
     • Limited voltage scaling + decreasing feature sizes lead to increasing electric fields
       • New device structures needed (FinFETs)
       • Reliability challenges (devices and wires)
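
The per-generation arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch assuming ideal constant-field scaling, where capacitance and voltage both scale with the linear factor and switching energy goes as C·V²; the function name and dictionary keys are illustrative and do not come from the slides.

```python
def dennard_generation(s=0.7):
    """Relative changes for one technology generation under ideal Dennard scaling.

    s is the linear scaling factor (0.7x means dimensions shrink by 30%).
    """
    area = s * s                         # two dimensions shrink -> ~0.49x
    delay = s                            # gate delay scales with length
    frequency = 1 / delay                # ~1.43x
    voltage = s                          # keeps the electric field constant
    capacitance = s                      # capacitance scales with length
    energy = capacitance * voltage ** 2  # C * V^2 -> ~0.34x (about 65% less)
    power = energy * frequency           # ~0.49x (about 50% less)
    return {
        "area": area,
        "frequency": frequency,
        "energy": energy,
        "power": power,
        "power_density": power / area,   # stays ~1.0: the Dennard result
    }


gen = dennard_generation()
for name, value in gen.items():
    print(f"{name:>13}: {value:.2f}x")
```

Power and area shrink by the same factor, which is why power density stays constant as long as voltage can keep scaling; the bullets above list why it no longer can.
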
  7. 7. 7 CMOS Power - Performance Scaling. Where this curve is flat, can only improve chip frequency by: a) pushing core/chip to higher power density (air cooling limits); b) design power efficiency improvements (low-hanging fruit all gone). (Plot: relative performance metric at constant power density vs. feature pitch in microns; annotation: “When scaling was good…”)
  8. 8. 8 Processor Technology Trends: ‘Affordable’ air-cooled limit ~120-190 W; Dennard scaling limiting from 2002-04
  9. 9. 9 Processor Technology Trends: processor frequency peaks at ~6 GHz and settles between 2-4 GHz
  10. 10. 10 Processor Technology Trends Strongly Correlated
  11. 11. 11 Processor Technology Trends: Multi-Cores (& threads); parallel programming needed to leverage them
  12. 12. 12 Processors, Semiconductor Technology: Industry Trends, Challenges & Opportunities. End customer doesn't care about frequency / ST performance & other ‘processor’ metrics; Cost/Performance is the metric. Microprocessors alone no longer drive sufficient Cost/Performance improvements
  13. 13. 13 System stack innovations are required to drive Cost/Performance
  14. 14. 14 OpenPOWER Foundation
  15. 15. 15 Materials Innovations - Increased Complexity & Cost. GlobalFoundries projects that a computer chip manufacturing plant in NY would cost $14.7 billion to build
  16. 16. 16 “Feeding the Engine” Challenge: “Data Access” performance (bandwidth & latency) & cost (power) still very challenging. Some techniques to hide latency/bw/pwr: Caches • Locality optimization • Out-of-order execution • Multithreading • Pre-fetching • “Fat” pipes / Memory Buffers. (Figure: latency spectrum between Memory (ns) and Storage, with Storage Class Memory at 100 – 1000 ns. Source: SNIA)
  17. 17. 17 Memory / Storage
      (Figure: access latency in uP cycles (@ 4GHz), log scale from 2^1 to 2^23, covering L1/L2 (SRAM), L3/L4, DRAM, Storage Class of Memory, Flash and HDD; “Memory Calls” (Load/Store) vs “I/O Calls” (Read/Writes). Source: H. Hunter, IBM)
      NVMe - Non-Volatile Memory express (PCIe)
      • Standardized high performance interface for PCI Express SSD. Available today in three different form factors: PCIe Add-in Card, SFF 2.5” and M.2
      • PCIe Gen3 (today) x8 ~8GB/s [x4 ~4GB/s, x2 ~2GB/s] vs SAS 12Gb/s [1.5GB/s /port]
      • PCIe Gen4 (2018) x8 ~16GB/s [x4 ~8GB/s, x2 ~4GB/s] vs SAS 24Gb/s [3GB/s /port]
      NVMe over fabrics (low latency RDMA access): <10us including switches
      CAPI based Flash (today): x16 (16GB/s) at faster access latencies (more on this later)
      HBM (High Bandwidth Memory)
      • 3D stacked DRAM from AMD/Hynix/Samsung
      • HBM2: 256GB/sec, ~4GB/package (8 DRAM TSV stacked)
      • 1024 bits x 2GT/s
      • HBM3: 512GB/sec, ~2020 time frame
      NVDIMM
      • Persistent memory solution on DDR interface
      • Combines DRAM, NAND Flash and power source
      • Delivers DRAM R/W perf with the persistence & reliability of NAND
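
The headline numbers on this slide follow from simple width-times-rate arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the PCIe per-lane rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding are standard PCIe figures assumed here rather than taken from the slide.

```python
import math


def cycles(latency_ns, clock_ghz=4.0):
    """Convert an access latency in nanoseconds to processor cycles."""
    return latency_ns * clock_ghz


def hbm_bandwidth_gbs(bus_bits=1024, rate_gts=2.0):
    """Peak bandwidth in GB/s of a wide parallel interface such as HBM2."""
    return bus_bits * rate_gts / 8


def pcie_bandwidth_gbs(lanes, rate_gts, encoding=128 / 130):
    """Approximate one-direction PCIe bandwidth in GB/s after line encoding."""
    return lanes * rate_gts * encoding / 8


print(hbm_bandwidth_gbs())                   # 1024 bits x 2 GT/s
print(round(pcie_bandwidth_gbs(8, 8.0), 2))  # Gen3 x8, the slide's "~8GB/s"
print(round(pcie_bandwidth_gbs(8, 16.0), 2)) # Gen4 x8, the slide's "~16GB/s"

# Storage Class Memory at 100-1000 ns, expressed in cycles at 4 GHz
for ns in (100, 1000):
    c = cycles(ns)
    print(f"{ns} ns = {c:.0f} cycles (about 2^{math.log2(c):.1f})")
```

At 4 GHz, Storage Class Memory lands between roughly 2^9 and 2^12 cycles, which matches its position between DRAM and Flash on the latency chart.
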
  18. 18. 18 The Contenders (Source: SNIA)
  19. 19. 19 Hardware Acceleration
      • Function offload: greater concurrency & utilization
      • Power efficiency (performance/watt)
      • Workloads: Encryption-decryption / Compression-decompression / Encoding-decoding / Network Controllers / Math Libraries / DB queries / Search; Deep Learning (arms race!) for training & inferencing
      Types of Accelerators
      • General Purpose GPU / Many Integrated Core (MIC): Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
      • Field Programmable Gate Array (FPGA): Xilinx, Altera (now Intel)
      • Purpose Built / Custom ASICs: Google’s TPU
      • Intelligent Network Controllers: Cavium ARM-accelerated NIC, Mellanox NIC+FPGA, Microsoft FPGA-only network adapter
      Traditionally (“IO” limited): sequential instructions on processor / parallel compute offloaded to accelerator. Penalty for “IO” operations is heavy
  20. 20. 20 Need for High Performance Next Generation Bus/Interconnect
      • HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth
      • HPC & Deep learning require more bandwidth between accelerators and memory
      • PCI Express has limitations (coherence / bandwidth / protocol overhead)
      Desired Attributes
      • Low Latency / High Bandwidth / Coherence
      • Emergence of complex storage & memory solutions (BW & latency & heterogeneity)
      • Growing demand for network performance (BW & latency)
      • Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
      • Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in
      • Volume pricing advantages & broad software ecosystem growth and adoption
      Vendor specific variants: Intel Omni-Path Architecture, Nvidia NVLink, AMD HyperTransport
      Open standards evolving: Cache Coherent Interconnect for Accelerators (CCIX), Gen-Z, Open Coherent Accelerator Processor Interface (OpenCAPI)
  21. 21. 21 Coherent Accelerator Processor Interface (CAPI) - 2014
      (diagram: Power Processor with CAPP, attached over PCIe to an FPGA carrying the IBM Supplied POWER Service Layer and accelerator Functions 0..n)
      Virtual Addressing
      • Removes the requirement for pinning system memory for PCIe transfers
      • Eliminates the copying of data into and out of the pinned DMA buffers
      • Eliminates the operating system call overhead to pin memory for DMA
      • Accelerator can work with same addresses that the processors use
      • Pointers can be de-referenced same as the host application. Example: enables the ability to traverse data structures
      Coherent Caching of Data
      • Enables an accelerator to cache data structures
      • Enables cache-to-cache transfers between accelerator and processor
      • Enables the accelerator to participate in “Locks” as a normal thread
      Elimination of Device Driver
      • Direct communication with Application
      • No requirement to call an OS device driver or Hypervisor function for mainline processing
      Enables Accelerator Features not possible with PCIe
      • Enables efficient Hybrid Applications: applications partially implemented in the accelerator and partially on the host CPU
      • Visibility to full system memory
      • Simpler programming model for Application Modules
      Coherent Accelerator Processor Proxy (CAPP)
      – Proxy for FPGA Accelerator on PowerBus
      – Integrated into Processor
      – Programmable (Table Driven) Protocol for CAPI
      – Shadow Cache Directory for Accelerator: up to 1MB Cache Tags (line based); larger block based Cache
      POWER Service Layer (PSL)
      – Implemented in FPGA Technology
      – Provides Address Translation for Accelerator, compatible with POWER Architecture
      – Provides Cache for Accelerator
      – Facilities for downloading Accelerator Functions
  22. 22. 22 How CAPI Works (Coherent Accelerator Processor Interface (CAPI) - 2014)
      (diagram: an algorithm split between a POWER8 Processor and a CAPI Developer Kit Card attached over PCIe)
      • Acceleration Portion: Data or Compute Intensive, Storage or External I/O
      • Application Portion: Data Set-up, Control
      • Sharing the same memory space; Accelerator is a peer to POWER8 Core
      Accelerator is a Full Peer to Processor
      • Accelerator Function(s) use an unmodified Effective address
      • Full access to Real address space
      • Utilize Processor’s Page Tables directly
      • Page Faults handled by System Software
      • Multiple Functions can exist in a single Accelerator
  23. 23. 23 IO Attached Accelerator (diagram: Memory Subsystem with virtual addressing, six POWER8 cores, an App, a Device Driver (DD) Storage Area, and an FPGA on PCIe; Variables / Input Data / Output Data appear in the App, in the device driver storage area, and on the FPGA). An application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation. 3 versions of the data (not coherent). 1000s of instructions in the device driver.
  24. 24. 24 CAPI Coherency (diagram: Memory Subsystem with virtual addressing, six POWER8 cores, an App, and an FPGA with PSL on PCIe; a single set of Variables / Input Data / Output Data). With CAPI, the FPGA shares memory with the cores. 1 coherent version of the data. No device driver call/instructions.
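
The difference between the two diagrams above can be mimicked with a toy Python model. This is only an analogy built on plain Python lists, not CAPI or device-driver code: all function names are hypothetical, and "copying" here stands in for staging data through pinned DMA buffers.

```python
import copy


def io_attached_offload(app_data, kernel):
    """Toy device-driver path: the data is staged twice before the accelerator sees it."""
    driver_buffer = copy.deepcopy(app_data)       # copy into the driver's storage area
    device_buffer = copy.deepcopy(driver_buffer)  # transfer to the FPGA over PCIe
    kernel(device_buffer)                         # accelerator works on its private copy
    versions = [app_data, driver_buffer, device_buffer]
    app_data[:] = device_buffer                   # copy the result back to the application
    return len(versions)                          # number of (non-coherent) versions


def coherent_offload(app_data, kernel):
    """Toy coherent path: the accelerator operates directly on the application's data."""
    kernel(app_data)
    return 1                                      # a single coherent version


def double_in_place(buf):
    """Stand-in for an accelerator function."""
    for i, value in enumerate(buf):
        buf[i] = value * 2


data = [1, 2, 3]
print(io_attached_offload(data, double_in_place), data)
print(coherent_offload(data, double_in_place), data)
```

Both paths produce the same result; the coherent path simply skips the staging copies and the copy-back, which is where the slide's "1000s of instructions in the device driver" go.
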
  25. 25. 25 CAPI vs. I/O Device Driver: Data Prep
      Typical I/O Model Flow: DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Ret. From DD Completion
      • Instruction counts listed: 300, 10,000, 3,000, 1,000, 1,000 Instructions
      • Time: 7.9µs + 4.9µs; total ~13µs for data prep
      Flow with a Coherent Model: Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
      • Instruction counts listed: 400, 100 Instructions
      • Time: 0.3µs + 0.06µs; total 0.36µs
      Acceleration time is application dependent, but equal in both flows
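
Taking the slide's figures at face value, the overhead gap can be totalled as below. This is a small arithmetic sketch using only the numbers quoted on the slide; the variable names are illustrative, and acceleration time is excluded because the slide states it is the same in both flows.

```python
# Instruction counts and timings as listed on the slide
io_instructions = [300, 10_000, 3_000, 1_000, 1_000]
coherent_instructions = [400, 100]

io_time_us = 7.9 + 4.9          # before + after acceleration
coherent_time_us = 0.3 + 0.06

instruction_ratio = sum(io_instructions) / sum(coherent_instructions)
time_ratio = io_time_us / coherent_time_us

print(f"I/O model:      {sum(io_instructions)} instructions, {io_time_us:.1f} us")
print(f"Coherent model: {sum(coherent_instructions)} instructions, {coherent_time_us:.2f} us")
print(f"Overhead reduction: {instruction_ratio:.1f}x instructions, {time_ratio:.1f}x time")
```

The 12.8 µs sum is what the slide rounds to "~13µs"; on these figures the coherent model cuts data-prep overhead by roughly 30x or more.
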
  26. 26. 26 IBM Accelerated GZIP Compression: An FPGA-based low-latency GZIP Compressor & Decompressor with single-thread throughput of ~2GB/s and a compression rate significantly better than low-CPU overhead compressors like snappy.
  27. 27. 27 CAPI Attached Flash
  28. 28. 28
  29. 29. 29 CAPI Acceleration
      (diagrams: each paradigm shows a Processor Chip, an accelerator (Acc) and Data connected over DLx/TLx)
      Paradigms shown: Egress Transform • Bi-Directional Transform • Memory Transform • Ingress Transform • Needle-In-A-Haystack Engine (labels: Haystack Data, Needles)
      Examples listed:
      • Basic work offload
      • Encryption, Compression, Erasure prior to network or storage
      • NoSQL such as Neo4J with Graph Node Traversals, etc
      • Machine or Deep Learning potentially using OpenCAPI attached memory
      • Database searches, joins, intersections, merges
      • Video Analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), Video Encoding (H.265), etc
      OpenCAPI WINS due to bandwidth to/from accelerators, best of breed latency, and flexibility of an Open architecture
  30. 30. 30 NVIDIA GPU (Source: Nvidia)
      • NVLink 1: 4 links, 20 GBps per link raw bandwidth each direction, ~160GBps total net NVLink bandwidth
      • NVLink 2: 6 links, 25GBps per link raw bandwidth each direction, ~300GBps total net NVLink bandwidth
      Volta GV100
      • 15 TFLOPS FP32
      • 16GB HBM2 – 900 GB/s
      • 300W TDP
      • 50 GFLOPS/W (FP32)
      • 12nm process
      • 300GB/s NVLink 2
      • Tensor Core....
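
The NVLink totals and the efficiency figure on this slide are products of the per-link numbers. The snippet below is a quick check using only values from the slide; the function name is hypothetical.

```python
def nvlink_total_gbs(links, gbs_per_link_each_direction):
    """Aggregate NVLink bandwidth in GB/s, counting both directions."""
    return links * gbs_per_link_each_direction * 2


nvlink1 = nvlink_total_gbs(links=4, gbs_per_link_each_direction=20)
nvlink2 = nvlink_total_gbs(links=6, gbs_per_link_each_direction=25)

# Volta GV100: 15 TFLOPS FP32 at 300 W TDP
gflops_per_watt = 15 * 1000 / 300

print(nvlink1, nvlink2, gflops_per_watt)
```
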
  31. 31. 31 “Minsky” S822LC for HPC
      • Tight coupling: strong CPU: strong GPU performance
      • Equalizing access to memory - for all kinds of programming
      • Closer programming to the CPU paradigm
      (diagram, OpenPOWER P8’ design: two P8’ processors, each with DDR4 at 115GB/s and NVLink at 80GB/s to its Tesla P100 GPUs. For x86 servers: GPUs attached over PCIe at 32GBps; PCIe bottleneck, no NVLink between CPU & GPU)
      2.7X faster query response time on “Minsky”. 87% of the total speedup (2.35x of 2.7x improvement) is due to the NVLink interface from CPU:GPU
      • Profiling result based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated 1 simultaneous query stream each with 0 think time.
      • Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04.
      • Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 512GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
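
The 87% figure and the link-bandwidth gap on this slide can be reproduced directly. This is a minimal check using the slide's own numbers; reading "80GB/s" as the CPU-to-GPU NVLink bandwidth and "32GBps" as the x86 PCIe bandwidth is an interpretation of the flattened diagram.

```python
total_speedup = 2.7      # query response time improvement on "Minsky"
nvlink_speedup = 2.35    # portion attributed to the NVLink interface

nvlink_share = nvlink_speedup / total_speedup
print(f"NVLink share of the speedup: {nvlink_share:.0%}")

nvlink_gbs = 80          # CPU:GPU NVLink bandwidth in the diagram
pcie_gbs = 32            # PCIe bandwidth in the x86 diagram
print(f"NVLink vs PCIe bandwidth: {nvlink_gbs / pcie_gbs:.1f}x")
```
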
  32. 32. 32 Custom ASICs. Reducing flexibility: CPU > GPU > FPGA > ASIC. Increasing efficiency: CPU < GPU < FPGA < ASIC. Source: William Dally, Nvidia
  33. 33. 33 Google TPU 1.0 [Jouppi et al., ISCA 2017] Relative performance/Watt (TDP) of GPU server (blue) and TPU server (red) to CPU server, and TPU server to GPU server (orange). TPU’ is an improved TPU that uses GDDR5 memory. The green bar shows its ratio to the CPU server, and the lavender bar shows its relation to the GPU server. Total includes host server power, but incremental doesn’t. GM and WM are the geometric and weighted means.
  34. 34. 34 Google TPU performance: stars are for the TPU, triangles are for the K80, circles are for Haswell. [Jouppi et al., ISCA 2017]
  35. 35. 35 Microsoft Azure FPGA Usage [M. Russinovich, MSBuild 2017]: FPGA for SDN Offload • FPGA for Bing
  36. 36. 36 Hardware Micro-services A hardware-only self-contained service that can be distributed and accessed from across the datacenter compute fabric
  37. 37. 37 Ease of Consumption: Compiler optimization • Math libraries optimization • Native support for CUDA / OpenMP / OpenCL ... • Native support for frameworks, e.g. for Deep Learning (Torch/TensorFlow/Caffe …)
  38. 38. 38 POWER9 (SO) – Premier Accelerator Platform
      (diagram: POWER9 cores and on-chip accelerators on the On-Chip Interconnect, with Memory (DDR4), I/O (PCIe Gen4), CAPI, SMP, NVLink and OpenCAPI interfaces; link rates labeled 16Gb/s and 25Gb/s; attached devices: PCIe Device, IBM / Partner Device, NVIDIA GPU)
      • 2 Socket SMP: 256 GB/s
      • OpenCAPI and/or NVLink 2.0: 200-300 GB/s
      • 3 x16 PCIe G4: 192 GB/s
      • CAPI 2.0 Links: 128 GB/s (uses up to 2 x16 ports)
      • 8 DDR4 ports @ 2667 MT/s
      • 512k L2 / SMT8 Core + 120MB L3 NUCA Cache
      Bandwidths shown are bi-directional
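
Several of the POWER9 figures above can be reproduced from lane counts and signaling rates. The sketch below uses raw signaling rates with no encoding overhead; the helper names are hypothetical, the 32 and 48 lane counts for the 25Gb/s links are assumptions chosen because they reproduce the slide's 200-300 GB/s range, and the DDR4 estimate assumes a 64-bit (8-byte) data path per port, which the slide does not state.

```python
def link_bandwidth_gbs(lanes, gbit_per_lane, bidirectional=True):
    """Raw bandwidth in GB/s of a group of serial lanes."""
    directions = 2 if bidirectional else 1
    return lanes * gbit_per_lane * directions / 8


def ddr4_bandwidth_gbs(ports, mega_transfers_per_s, bytes_per_transfer=8):
    """Peak DDR4 bandwidth in GB/s (assumes 8 bytes of data per transfer per port)."""
    return ports * mega_transfers_per_s * bytes_per_transfer / 1000


pcie_gen4 = link_bandwidth_gbs(lanes=3 * 16, gbit_per_lane=16)  # 3 x16 PCIe G4
capi2 = link_bandwidth_gbs(lanes=2 * 16, gbit_per_lane=16)      # up to 2 x16 ports
link25_low = link_bandwidth_gbs(lanes=32, gbit_per_lane=25)     # assumed lane count
link25_high = link_bandwidth_gbs(lanes=48, gbit_per_lane=25)    # assumed lane count
ddr4 = ddr4_bandwidth_gbs(ports=8, mega_transfers_per_s=2667)

print(pcie_gen4, capi2, link25_low, link25_high, round(ddr4, 1))
```

The PCIe and CAPI 2.0 results match the slide's 192 GB/s and 128 GB/s, which confirms that the quoted bandwidths are bi-directional raw rates.
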
  39. 39. 39 Newell POWER9 System - 6 GPU / 2CAPI
  40. 40. 40 BACKUP
  41. 41. 41 Source: SNIA / Flash Summit
  42. 42. 42 When to Use FPGAs
      • Transistor Efficiency & Extreme Parallelism: bit-level operations; variable-precision floating point
      • Power-Performance Advantage: >2x compared to Multicore (MIC) or GPGPU; unused LUTs are powered off
      • Technology Scaling better than CPU/GPU: FPGAs are not frequency or power limited yet; 3D has great potential
      • Dynamic reconfiguration: flexibility for application tuning at run-time vs. compile-time
      • Additional advantages when FPGAs are network connected ... allows network as well as compute specialization
      When to Use GPGPUs
      • Extreme FLOPS & Parallelism: double-precision floating point leadership; hundreds of GPGPU cores
      • Programming Ease & Software Group Interest: CUDA & extensive libraries; OpenCL; IBM Java (coming soon)
      • Bandwidth Advantage on Power: start w/ PCIe gen3 x16 and then move to NVLink
      • Leverage existing GPGPU eco-system and development base: lots of existing use-cases to build on; heavy HPC investment in GPGPU
  43. 43. 43 CCIX Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
  44. 44. 44 Gen-Z Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
  45. 45. Use Cases – A truly heterogeneous architecture built upon OpenCAPI. OpenCAPI 3.0, OpenCAPI 3.1. OpenCAPI specifications are downloadable from the website: Register, then Download
  46. 46. OpenCAPI Advantages for Memory • Open standard interface enables attaching a wide range of devices • OpenCAPI protocol was architected to minimize latency; especially advantageous for classic DRAM memory • Extreme bandwidth beyond classical DDR memory interface • Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory) • Ability to handle a memory buffer to decouple raw memory and host interfaces to optimize power, cost and performance • Common physical interface between non-memory and memory devices
  47. 47. 47 OpenCAPI Key Attributes • Architecture agnostic bus – Applicable with any system/microprocessor architecture • Coherency - Attached devices operate natively within application’s user space and coherently with host uP • High performance interface design with no ‘overhead’ and optimized for high bandwidth and low latency • Point to point construct optimized within a system • Allows attached device to fully participate in application without kernel involvement/overhead • 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device • Supports a wide range of use cases and access semantics • Hardware accelerators • High-performance I/O devices • Advanced memories and Classic memory • Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.) • Reduced complexity of design implementation • Wanted to make this easy for the accelerator, memory and system design teams • Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify attached devices and facilitate interoperability across multiple CPU architectures
  48. 48. Virtual Addressing and Benefits. An OpenCAPI device operates in the virtual address spaces of the applications that it supports • Eliminates kernel and device driver software overhead • Allows device to operate on application memory without kernel-level data copies/pinned pages • Simplifies programming effort to integrate accelerators into applications • Improves accelerator performance. The Virtual-to-Physical Address Translation occurs in the host CPU • Reduces design complexity of OpenCAPI-attached devices • Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures • Security - Since the OpenCAPI device never has access to a physical address, this eliminates the possibility of a defective or malicious device accessing memory locations belonging to the kernel or other applications that it is not authorized to access
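
The security argument on this slide (the device only ever presents virtual addresses, and the host performs translation) can be illustrated with a toy model. This is a conceptual sketch only: the class, its methods and the 4 KiB page size are hypothetical and do not represent the OpenCAPI protocol or any real API.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class HostTranslator:
    """Toy host-side translation: devices see only virtual addresses."""

    def __init__(self):
        self.page_tables = {}  # context id -> {virtual page: physical page}
        self.memory = {}       # physical address -> stored value

    def map_page(self, context, virtual_page, physical_page):
        """Host software grants a context access to one physical page."""
        self.page_tables.setdefault(context, {})[virtual_page] = physical_page

    def _translate(self, context, virtual_address):
        virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
        table = self.page_tables.get(context, {})
        if virtual_page not in table:
            # An unmapped page is rejected by the host before memory is touched
            raise PermissionError(f"context {context!r}: page {virtual_page} not mapped")
        return table[virtual_page] * PAGE_SIZE + offset

    def device_write(self, context, virtual_address, value):
        self.memory[self._translate(context, virtual_address)] = value

    def device_read(self, context, virtual_address):
        return self.memory.get(self._translate(context, virtual_address), 0)


host = HostTranslator()
host.map_page("app", virtual_page=0, physical_page=7)
host.device_write("app", 16, 42)   # device uses the application's virtual address
print(host.device_read("app", 16))

try:
    host.device_read("app", 5 * PAGE_SIZE)  # page the application never mapped
except PermissionError as err:
    print("rejected:", err)
```

Because the device never handles a physical address, it has no way to name memory outside the mappings the host has set up for its application context.
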