Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

6 open capi_meetup_in_japan_final


Published on

Document provided by Haman Yu (IBM Taiwan) / Hank Chang (IBM Taiwan)

Published in: Technology
  • Be the first to comment

6 open capi_meetup_in_japan_final

  1. 1. TM OpenCAPI Overview Open Coherent Accelerator Processor Interface Haman Yu / Hank Chang IBM OpenPOWER Technical Enablement
  2. 2. OpenCAPI Topics ØIndustry Background ØWhere/How OpenCAPI Technology is used ØTechnology Overview and Advantages ØHeterogeneous Computing ØSNAP Framework for CAPI/OpenCAPI Computation DataAccess
  3. 3. Industry Background that Defined OpenCAPI §Growing computational demand due to emerging workloads (e.g., AI, cognitive, etc.) §Moore’s Law not being supported by traditional silicon scaling §Driving increased dependence on Hardware Acceleration for performance • Hyperscale Datacenters and HPC need much higher network bandwidth • 100 Gb/s -> 200 Gb/s -> 400Gb/s are emerging • Deep learning and HPC require more bandwidth between accelerators and memory • Emerging memory/storage technologies are driving need for bandwidth with low latency § Hardware accelerators are defining the attributes of a high performance bus • Growing demand for network performance and network offload • Introduction of device coherency requirements (IBM’s introduction in 2013) • Emergence of complex storage and memory solutions • Various form factors with no one able to address everything (e.g., GPUs, FPGAs, ASICs, etc.) Computation DataAccess …all Relevant to Modern Data Centers
  4. 4. Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI OpenCAPI3.0 OpenCAPI3.1
  5. 5. 8 and 16Gbps PHY Protocols Supported • PCIe Gen3 x16 and PCIe Gen4 x8 • CAPI 2.0 on PCIe Gen4 PCIeGen4 P9 25Gbs 25Gbps PHY Protocols Supported • OpenCAPI 3.0 • NVLink 2.0 Silicon Die Various packages (scale-out, scale-up) POWER9 IO Leading the Industry • PCIe Gen4 • CAPI 2.0 • NVLink 2.0 • OpenCAPI 3.0 POWER9
  6. 6. Acceleration Paradigms with Great Performance Examples: Encryption, Compression, Erasure prior to delivering data to the network or storage ProcessorChip Egress Transform Acc DataDLx/TLx ProcessorChip Acc Data Bi-Directional Transform Acc DLx/TLx Examples: NoSQL such as Neo4J with Graph Node Traversals, etc Examples: Machine or Deep Learning such as Natural Language processing, sentiment analysis or other Actionable Intelligence using OpenCAPI attached memory ProcessorChip Acc DataDLx/TLx Memory Transform Example: Basic work offload Ingress Transform ProcessorChip Acc DataDLx/TLx Examples: Video Analytics, Network Security, Deep Packet Inspection, Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading etc Needle-in-a-haystack Engine ProcessorChip Acc Examples: Database searches, joins, intersections, merges Only the Needles are sent to the processor Needle-In-A-Haystack Engine DLx/TLx Needles Large Haystack Of Data OpenCAPI is ideal for acceleration due to Bandwidth to/from accelerators, best of breed latency, and flexibility of an Open architecture
  7. 7. Comparison of Memory Paradigms Emerging Storage Class Memory ProcessorChip DLx/TLx SCM Data Storage Class Memories have the potential to be the next disruptive technology….. Examples include ReRAM, MRAM, Z-NAND…… All are racing to become the defacto OpenCAPI 3.1 Architecture Ultra Low Latency ASIC buffer chip adding +5ns on top of native DDR direct connect!! Main Memory ProcessorChip DLx/TLx DDR4/5 Example: Basic DDR attach Data Storage Class Memory tiered with traditional DDR Memory all built upon OpenCAPI 3.1 & 3.0 architecture. Still have the ability to use Load/Store Semantics Tiered Memory ProcessorChip DLx/TLx DDR4/5 DLx/TLx SCM Data Data §Common physical interface between non-memory and memory devices §OpenCAPI protocol was architected to minimize latency; excellent for classic DRAM memory §Extreme bandwidth beyond classical DDR memory interface §Agnostic interface will handle evolving memory technologies in the future (e.g., compute-in-mem) §Ability to handle a memory buffer to decouple raw memory and host interface to optimize power, cost, perf
  8. 8. CAPI and OpenCAPI Performance CAPI 1.0 PCIE Gen3 x8 Measured BW @8Gb/s CAPI 2.0 PCIE Gen4 x8 Measured BW @16Gb/s OpenCAPI 3.0 25 Gb/s x8 Measured BW @25Gb/s 128B DMA Read 3.81 GB/s 12.57 GB/s 22.1 GB/s 128B DMA Write 4.16 GB/s 11.85 GB/s 21.6 GB/s 256B DMA Read N/A 13.94 GB/s 22.1 GB/s 256B DMA Write N/A 14.04 GB/s 22.0 GB/s POWER8 Introduced in 2013 POWER9 Second Generation POWER9 Open Architecture with a Clean Slate Focused on Bandwidth and Latency POWER8CAPI1.0 POWER9CAPI2.0 andOpenCAPI3.0 Xilinx KU60/VU3P FPGA
  9. 9. Latency Ping-Pong Test § Simple workload created to simulate communication between system and attached FPGA § Bus traffic recorded with protocol analyzer and PowerBus traces § Response times and statistics calculated TL, DL, PHY 1. 2. 3. 4. Host Code Copy 512B from cache to FPGA Poll on incoming 128B cache injection Reset poll location Repeat TLx, DLx, PHYx 1. 2. 3. 4. FPGA Code Poll on 512B received from host Reset poll location DMA write 128B for cache injection Repeat OpenCAPI Link PCIe Stack 1. 2. 3. 4. Host Code Copy 512B from cache to FPGA Poll on incoming 128B cache injection Reset poll location Repeat FPGA Code 1. Poll on 512B received from host 2. Reset poll location 3. DMA write 128B for cache injection 4. Repeat * HIPrefers to hardened IP PCIeLink Altera PCIe HIP*
  10. 10. Latency Test Results ‡ ǁ § ǁ 378ns Total Latency est. <555ns Total Latency 737ns Total Latency 776ns Total Latency
  11. 11. OpenCAPI Enabled FPGA Cards Mellanox Innova2AcceleratorCard Alpha Data 9v3AcceleratorCard Typicaleyediagram at 25Gb/s usingthese cards
  12. 12. OpenCAPI Topics ØIndustry Background ØWhere/How OpenCAPI Technology is used ØTechnology Overview and Advantages ØHeterogeneous Computing ØSNAP Framework for CAPI/OpenCAPI Computation DataAccess
  13. 13. OpenCAPI: Heterogeneous Computing ØWhy OpenCAPI could let the specific workloads run faster? à FPGA (Various High Bandwidth I/O, great at deep Parallel & Pipeline designs) ØHow OpenCAPI could let the software/application run faster? à Shared Coherent Memory (with Virtual Address, Low Latency & Low Overhead Design) Computation DataAccess Single Processor CPU Distributed computing CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Heterogeneous Computing CPU GPU ASIC FPGA
  14. 14. FPGAs: What they are good at l FPGA: Field Programmable Gate Array l Configurable I/O and High-Speed Serial links l Integrated Hard IP (Multiply/Add, SRAM, PLL, PCIe, Ethernet, DRAM Controller,...etc. ) l Custom logic, Complex Special Instructions l Bit/Matrix Manipulation, Image Processing, graphs, Neural Networks…etc. l Great at deep Parallel & Pipeline Designs (for workloads) Processing Engine Processing Engine Processing Engine Processing Engine Processing Engine Processing Engine Processing Engine parallelism pipeline & Hash + + RAM Instruction complexity =
  15. 15. FPGAs: Different type of High Bandwidth I/O Cards Alpha Data 9V3 (Networking) Mellanox Innova 2 (Networking [CX-5]) Nallatech 250S+ (Storage / NVMe SSDs)
  16. 16. Accelerated OpenCAPI Device OpenCAPI: Key Attributes for Acceleration 16 TL/DL 25Gb I/O Any OpenCAPI Enabled Processor Accelerated Function TLx/DLx 1. Architecture agnostic bus – Applicable with any system/microprocessor architecture 2. Optimized for High Bandwidth and Low Latency - 25Gb/s Links (with SlimSAS connector) - Removes PCIe layering and use brand new TL/DL--TLx/DLx thinner protocol instead (latency optimized) - High performance industry standard interface design with zero ‘overhead’ 3. Coherency & Virtual addressing - Attached devices operate natively within application’s user space and coherently with host microprocessor - Enables low overhead with no Kernel, hypervisor or firmware involvement - Shared coherent data structures and pointers (put the data/mem closer to the Proc/FPGA) - It’s all traditional thread level programming with CPU coherent device memory 4. Supports a wide range of use cases - Architected for both Classic Memory and emerging Storage Class Memory Caches Application § Storage/Compute/Network etc § ASIC/FPGA/FFSA § FPGA, SOC, GPU Accelerator § Load/Store or Block Access Standard System Memory Device Memory Advanced SCM Solutions BufferedSystemMemory OpenCAPIMemoryBuffers
  17. 17. OpenCAPI: Virtual Addressing and Benefits §An OpenCAPI device operates in the Virtual Address spaces of the applications that it supports •Eliminates kernel and device driver software overhead •Allows device to operate on application memory without kernel-level data copies/pinned pages •Simplifies programming effort to integrate accelerators into applications (SNAP) •Improves accelerator performance §The Virtual-to-Physical Address Translation occurs in the host CPU (no PSL logic needed) •Reduces design complexity of OpenCAPI-attached devices •Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures •Security - Since the OpenCAPI device never has access to a physical address, this eliminates the possibility of a defective or malicious device accessing memory locations belonging to the kernel or other applications that it is not authorized to access
  18. 18. OpenCAPI: Shared coherent data structures and pointers No Kernel/Device Driver involved (-- put the data/mem closer to the Proc/FPGA) AFU: Attached Functional Unit Typical I/O Model with Device Driver OpenCAPI: Virtual Addressing and Benefits (deeper)
  19. 19. Ø The OpenCAPI Transaction Layer specifies the control and response packets between a host and an endpoint OpenCAPI device: TL and TLX Ø On the Host side the Transaction Layer converts: • Host specific protocol requests into transaction layer defined commands • TLx commands into host specific protocol requests. • Responses Ø On the Endpoint OpenCAPI device side, the Transaction Layer converts: • AFU protocol requests into transaction layer commands • TL commands into AFU protocol requests. • Responses Ø The OpenCAPI Data Link Layer supports a 25Gbps serial data rate per lane connecting a processor to an FPGA or an ASIC that contains an endpoint accelerator or device: DL and DLX • The basic configuration supports 8 lanes running at 25.78125 GHz for a 25 GB/s data rate. Noted that: TL/DL/PHY (Host Side) è IBM P9 HW & FW both Ready now; TLx/DLx/PHYx (Device) è I/O Vendors also have the reference design Ready Host bus protocol layer TL TL Frame/Parser DL PHY PHYX DLX TLX Frame/Parser TLX AFU protocol layer AFU HostProcessorOpenCAPIDevice OpenCAPI: Protocol Stack (much thinner than traditional PCIe) Host bus interface OpenCAPI packets DL packet (format) DL packet Serial link DLX packet DLX packet (format) AFU packets AFU protocol stack interface Host bus protocol stack interface The full TL/DL specification can be obtained by simply going to and registering under the technical à specifications pull down menu.
  20. 20. FPGA • Cusomter application and accelerator • Operation system enablement • Little Endian Linux • Reference Kernel Driver (ocxl) • Reference User Library (libocxl) • Hardware and reference designs to enable coherent acceleration Core Processor OS App (software) Memory (Coherent) AFU TLx DLx 25G ocxl libocxl Ø OCSE (OpenCAPI Simulation Environment) models the red outlined area Ø OCSE enables AFU and Application co- simulation only when the reference libocxl and reference TLx/DLx are used. Ø OCSE dependencies ØFixed reference TLx/AFU interface ØFixed reference libocxl user API Ø Will be contributed to the OpenCAPI consortium Ø Development Progress: 90% Cable Memory (Coherent) 25G DL TL OpenCAPI: OCSE (OpenCAPI Simulation Environment)
  21. 21. OpenCAPI: Two Factors of Low Latency 1. No Kernel/Device Driver process for mapping the mem address or no need to move data back & forth between user-space, kernel, and devices • 2. Thinner layers of protocol à Compare to PCIe Typical I/O Model Flow with Device Driver: DD Call Copy or Pin Source Data MMIO Notify Accelerator Acceleration Poll / Interrupt Completion Copy or Unpin Result Data Ret. From DD Completion 300 Instructions 10,000 Instructions 3,000 Instructions 1,000 Instructions 1,000 Instructions 7.9µs 4.9µs Flow with a Coherent Model (CAPI): Shared Mem. Notify Shared Mem. Completion 400 Instructions 100 Instructions 0.3µs 0.06µs Acceleration TL TL Frame/Parser DL PHY PHYX DLX TLX Frame/Parser TLX OpenCAPI links Faster Memory Access & Easier Programing Faster & more Effective Protocol Total 0.36µs Total ~13µs for data prep
  22. 22. OpenCAPI Topics ØIndustry Background ØWhere/How OpenCAPI Technology is used ØTechnology Overview and Advantages ØHeterogeneous Computing ØSNAP Framework for CAPI/OpenCAPI Computation DataAccess
  23. 23. SNAP Framework Concept for CAPI/OpenCAPI Action X Action Y Action Z CAPI/ OpenCAPI SNAP Vivado HLS CAPI/ OpenCAPI FPGA becomes a peer of the CPU è Action directly accesses host memory SNAP Manage server threads and actions Manage access to I/Os (AXI to memory/network…) è Action easily accesses resources FPGA Gives on-demand compute capabilities Gives direct I/O access (AXI to storage/network…) è Action directly accesses external resources Vivado HLS Compile Action written in C/C++ code Optimize code to get performance è Develop Action code efficiently + + + =Best way to offload/accelerate a C/C++ code with: - Minimum change in code - Quick porting - Better performance than CPU FPGA Storage, Networking, Analytics Programming framework
  24. 24. CAPI Development without SNAP Process C Slave Context libcxl cxl SNAP library Job Queue Process B Slave Context libcxl cxl SNAP library Job Queue Process A Slave Context Libcxl/libocxl cxl/ocxl HDK: CAPI PSL CAPI Huge development effort Performance focused, full cache line control Programming based on libcxl/libocxl & the VHDL and Verilog code Software Program Hardware Logic Application on Host Acceleration on FPGA
  25. 25. SNAP: Focus on the Additional Acceleration Values 25 Process C Slave Context libcxl cxl SNAP library Job Queue Process B Slave Context libcxl cxl SNAP library Job Queue Process A Slave Context libcxl/libocxl cxl/ocxl SNAP library Job Queue HDK: CAPI PSL CAPI Software Program DRAM on-card Network (TBD) NVMe on-card AXI AXI Host DMA Control MMIO Job Manager Job Queue AXI PSL/AXI bridge AXI lite Quick and easy development Use High Level Synthesis tool to compile C/C++ to RTL, or directly use RTL Programming based on SNAP library function call and AXI interface AXI is an industry standard for on-chip interconnection ( Action1 VHDL Action3 Go, … Action2 C/C++ Hardware Action Application on Host Acceleration on FPGA
  26. 26. Summary: SNAP Framework for POWER9 • SNAP will be supported on POWER9 ü Abstraction layer on CAPI, the SNAP actions will be portable to CAPI 2.0 & OpenCAPI ü Minimum Code Changes & Quick Porting Ø SNAP hides the differences (use the Standard AXI Interfaces to the I/Os… DRAM/NVMe/Ethernet…etc) Ø Support higher level languages (Vivado HLS helps you covert C/C++ to VHDL/Verilog) Ø SNAP for OpenCAPI development progress now is about 70% Ø OpenCAPI Simulation Environment (OCSE) development progress is almost 90% • All Open Source!! à ü Driven by OpenPOWER Foundation Accelerator Workgroup ü Cross-company Collaboration and Contributions Power8 à Power9 CAPI1.0 à CAPI2.0/OpenCAPI Action1 VHDL Action3 … Action2 C/C++ Software Program AXI interfaceslibsnap APIs Your current actions
  27. 27. Table of Enablement Deliveries 27 Item Availability OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) Today Xilinx Vivado Project Build with Memcopy Exerciser Today Device Discovery and Configuration Specification and RTL Today AFU Interface Specification Today Reference Card Design Enablement Specification Today 25Gbps PHY Signal Specification Today 25Gbps PHY Mechanical Specification Today OpenCAPI Simulation Environment (OCSE) Tech Preview Today Memcopy and Memory Home Agent Exercisers AFP Exerciser Today Today Reference Driver Available Today # Join us! # OpenCAPI Consortium:
  28. 28. Join us! Any Questions?
  29. 29. Backup
  30. 30. Backup: Overall Application Porting Preparations Keys (right person to do the right things) Software profiling Software/hardware function partition Understand the data location, data size, data dependency Parallelism estimation Consider I/O bandwidth limitation Decide SNAP mode and FPGA card API parameters: prevent from using the interface between main application and hardware action. Decide what Algorithm(s) you want to accelerate. Validate Physical Card choice Define your API Parameters Decide on SNAP Mode. To Execution phase Start Plan
  31. 31. Backup: Advantage Summary Traditional IO Attached FPGA CAPI Attached FPGA Benefit Device Driver w/ system calls to move data from application memory to IO memory and initiate data transfer “Hardware” Device Driver – hardware handles address translation, no system call needed. Less latency to initiate data transfer from host memory to FPGA Offload CPU Limited to PCIe Gen3 Bandwidth and hardware latency Gen4 x8 and OpenCAPI provide higher bandwidth and lower latency Better roadmap for future performance enhancements Separate memory address domains True shared memory with processor. Processor and FPGA use same virtual address. Programming framework: CAPI-SNAP. Open source at • Easier programmability • Enables pointer chasing, linked lists • No pinning of pages • FPGA access to all of system memory • Scatter / Gather capability removes need to prepare data in sequential blocks Virtualization and multi-process handled by complex OS support Added security and multi-process capability CAPI supports multi-process access in hardware with security to prevent cross process access
  32. 32. Backup: Comparison of IBM CAPI Implementations & Roadmap Feature CAPI 1.0 CAPI 2.0 OpenCAPI 3.0 OpenCAPI 3.1 OpenCAPI 4.0 Processor Generation POWER8 POWER9 POWER9 Power9 Follow-On Power9 Follow-On CAPI Logic Placement FPGA/ASIC FPGA/ASIC NA DL/TL on Host DLx/TLx on endpoint FPGA/ASIC NA DL/TL on Host DLx/TLx on Memory Buffer NA DL/TL on Host DLx/TLx on endpoint FPGA/ASIC Interface Lanes per Instance Lane bit rate PCIe Gen3 x8/x16 8 Gb/s PCIe Gen4 2 x (Dual x8) 16 Gb/s Direct 25G x8 x4 fail down 25 Gb/s Direct 25G x8 x4 fail down 25 Gb/s Direct 25G x8 x4 fail down 25 Gb/s Address Translation on CPU No Yes Yes Yes Yes Native DMA from Endpoint Accelerator No Yes Yes NA Yes Home Agent Memory on OpenCAPI Endpoint with Load/Store Access No No Yes NA Yes Native Atomic Ops to Host Processor Memory from Accelerator No Yes Yes NA Yes Host Memory Caching Function on Accelerator Real Address Cache in PSL Real Address Cache in PSL No NA Effective Address Cache in Accelerator Remove PCIe layers to reduce latency significantly