Industry Collaboration and Innovation
1
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
2
Industry Background
§ Data rich but insight poor
§ ½ of the world’s data has been generated in the last 2 years alone!
§ Only 2% of the world’s data has actually been analyzed into
actionable intelligence
§ Data is growing at an alarming rate with social media, sensor
data, camera data and the like
§ Diverse data types have become a real challenge
§ Tough problems need to be addressed beyond CPUs and GPUs
§ FPGAs and other specialized HW should be considered for
those classes of problems where CPUs and GPUs fall short
§ Data prep feeding a GPU is a prime example
3
Processor Architecture Bottleneck
4
• CPU core counts have increased, but memory and IO subsystems have not kept up
• Direct-attach DRAM buses provide limited capacity and performance scaling
• Processors are experiencing increasing latency per core
• Modern AI workloads and exponential data growth require lower latency and greater BW
[Chart: memory BW per core, I/O BW per core, and memory latency per core trends]
3 typical cases when you should consider using external accelerators:
• Boost a function
• Offload your CPU
• Free network resources
2 options:
GPU: thousands of tiny CPUs using high parallelization
è compute intensive applications
FPGA: logic + IOs are customized exactly for the application's needs
è very low and predictable latency applications
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
7
Why take a look at OpenCAPI?
§ IBM has been leading the charge in acceleration since CAPI 1.0 was introduced in Power8 in 2013
§ OpenCAPI is the 3rd generation of acceleration that is now architecture agnostic and is not tied to the Power
ISA
§ It takes multiple generations to get a new acceleration offering that meets all the technical requirements
that were initially defined
§ OpenCAPI has at least a 4 year time to market advantage over the rest of the industry
§ Any accelerator development using the existing OpenCAPI acceleration technologies today will have direct
applicability as the industry and markets mature
§ Our open source framework supporting OpenCAPI 3.0 (OC-Accel) is available in a public GitHub repository
today!
8
OpenCAPI Key Attributes
9
[Diagram: any OpenCAPI-enabled processor (TL/DL, 25Gb I/O) linked to an accelerated OpenCAPI device (TLx/DLx, accelerated function)]
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for High Bandwidth and Low Latency
3. High performance 25Gbps PHY design
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit
6. Wide range of Use Cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both Classic Memory and emerging Advanced Storage Class Memory
9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA)
[Diagram: accelerated device with caches running the application]
§ Storage/Compute/Network etc.
§ ASIC/FPGA/FFSA
§ FPGA, SoC, GPU accelerator
§ Load/Store or Block Access
Memory options: Standard System Memory, Advanced SCM Solutions, Buffered System Memory (OpenCAPI Memory Buffers), Device Memory
OpenCAPI in Power 9
OpenCAPI attach capabilities are broken into
two subclasses
• Compute (AFUc) for function acceleration
using a more traditional IO model with
DMAs mastered by the device
• Memory (AFUm) for attaching various
memory technologies using Loads / Stores
mastered by the host
OpenCAPI 3.0 – 25 Gbps (P9)
• AFUC1, AFUM1
Available today with our Power9 offerings and
at least 4 years before Intel’s first
implementation of CXL
10
[Diagram: CPU with memory controller and system memory; OpenCAPI host interface linked to the OpenCAPI device interface; the device carries AFUC1, AFUM1, Cnfg, and device memory]
OpenCAPI in Power 10
OpenCAPI attach capabilities are broken into
two subclasses
• Compute (AFUc) for function acceleration
using a more traditional IO model with
DMAs mastered by the device
• Memory (AFUm) for attaching various
memory technologies using Loads / Stores
mastered by the host
Power 9: OpenCAPI 3.0 @ 25 Gbps
• AFUC1, AFUM1
Power 10: OpenCAPI 3.1 @ 25.6 Gbps
• OpenCAPI Memory Interface (OMI)
Power 10: OpenCAPI 4.0 @ 25 Gbps
• Posted Writes
• AFUC2 - EA Cache
IBM Confidential 11
[Diagram: CPU with OMI-attached system memory (OCMB x4); OpenCAPI host interface linked to the OpenCAPI device interface; the device carries an AFU with EA cache (AFUC2), Cnfg, and device memory]
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
12
Acceleration Paradigms with Great Performance
13
Examples: Encryption, Compression, Erasure prior to
delivering data to the network or storage
Examples: NoSQL such as Neo4J with Graph Node Traversals, etc
Examples: Machine or Deep Learning such as Natural Language processing,
sentiment analysis or other Actionable Intelligence using OpenCAPI attached memory
Examples: Database searches, joins, intersections, merges
Only the Needles are sent to the processor
Examples: Video Analytics, Network Security, Deep Packet Inspection,
Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading, etc
OpenCAPI WINS due to bandwidth to/from
accelerators, best-of-breed latency, and the flexibility
of an open architecture
Main Transform
Processor Chip DLx/TLx
Acc
Example: Basic work offload
Data
Egress Transform
Processor Chip DLx/TLx
Acc
Data
Needle-In-A-Haystack
Processor Chip DLx/TLx
Engine
Acc
Needles
Large
Haystack
Of Data
Ingress Transform
Processor Chip DLx/TLx
Acc
Data
Bi-Directional Transform
Processor Chip DLx/TLx
Acc
Acc
Data
Comparison of Memory Paradigms
Emerging Storage Class Memory
Processor Chip OpenCAPI SCM Data
OpenCAPI 3.1 Architecture (OMI)
Ultra-low-latency ASIC memory buffer chip adding ~5 ns on
top of native DDR direct connect!
Microchip is our first partner to offer an OMI based memory
buffer
Storage Class Memory tiered with traditional DDR Memory all
built upon OpenCAPI 3.1 & 3.0 architecture.
Still have the ability to use Load/Store Semantics
Storage Class Memories have the potential to be the next
disruptive technology…..
Examples include ReRAM, MRAM, Z-NAND……
All are racing to become the de facto main memory
Main Memory
Processor Chip Memory
Buffer
DDR4/5
Example: Basic DDR attach
Data
Tiered Memory
Processor Chip
Memory
Buffer
DDR4/5
OpenCAPI
SCM
Data
Data
Tier 1 Memory
Tier 2 Memory
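The tiering math above can be sketched with a simple expected-latency model. Every number below is an assumption for illustration, except the ~5 ns OMI buffer adder quoted on this slide:

```python
# Illustrative model of average load latency in a tiered memory
# (DRAM tier 1 in front of an OpenCAPI-attached SCM tier 2).
# All latencies here are assumed example values, except the ~5 ns
# OMI buffer adder quoted on the slide.

def average_latency_ns(t_tier1_ns, t_tier2_ns, tier1_hit_rate):
    """Expected latency when a fraction of accesses hit the fast tier."""
    return tier1_hit_rate * t_tier1_ns + (1 - tier1_hit_rate) * t_tier2_ns

native_ddr_ns = 80                 # assumed direct-attach DDR load latency
omi_ddr_ns = native_ddr_ns + 5     # slide: OMI buffer adds ~5 ns
scm_ns = 1000                      # assumed SCM media latency

# With 90% of accesses served from the DDR tier, the average stays
# close to DRAM latency even though the capacity tier is much slower:
print(average_latency_ns(omi_ddr_ns, scm_ns, 0.9))   # about 176.5 ns
```

The point of the sketch: as long as the DRAM tier absorbs most accesses, the slower SCM tier mainly contributes capacity, not latency.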
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
15
Power9 Systems with OpenCAPI
16
• System Details
• 2 Socket 2U
• Up to 44 cores
• Up to 4 TB memory (32 DDR4 DIMMs)
• 4 Gen4 PCIe Slots, CAPI2.0 Enabled
• 6 Gen3 PCIe Slots
• Up to 24 SFF / 12 LFF Drives
• 4 x8 25 Gbps Ports
• Up to 4 cabled OpenCAPI
Adapters*
IBM Offered IC922
IBM AC922
Air Cooled
17
Power9 Systems with OpenCAPI
System Details
• 2 – Socket 2U
• Up to 40 cores
• Up to 2TB memory (16 DDR4 Dimms)
• 4 Gen4 PCIe Slots, 3 CAPI2.0 Enabled
• 2 2.5” SFF Drive Bays
• 4 OpenPOWER Mezzanine Sockets
• Up to 4 NVLink V100 GPUs
• Up to 4 socketed OpenCAPI Adapters*
• Up to 1 cabled OpenCAPI Cards w/ SlimSAS
adapter*
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
18
OpenCAPI Adapters
19
Mellanox Innova2
Network + FPGA
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• 2 25Gb SFP Cages
• x8 25Gb/s OpenCAPI Support
Network Acceleration (NFV, Packet
Classification), Security Acceleration
Available in HDK
Nallatech 250-SoC
Multipurpose Converged
Network / Storage
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen3 x16, CAPI2
• 4 x8 Oculink Ports support NVMe,
Network, or OpenCAPI
• 2 100Gb QSFP28 Cages
NVMEoF Target, High BW Storage
Server
Orderable thru IBM with an
RPQ
Available in OC-Accel and HDK
AlphaData ADM-9H7
Large FPGA with 8GB HBM
• Xilinx US+ VU37P FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI Ports (support
up to 50 GB/s)
• 4 100Gb QSFP28 Cages
ML/DL, Inference, System Modeling, HPC
To be deployed in Nimbix Cloud
Available in OC-Accel and HDK
AlphaData ADM-9H3
Medium FPGA with 8GB HBM
• Xilinx Virtex US+ VU33P-3 FPGA +
HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI Ports
• 1 2x100Gb QSFP28-DD Cage
ML/DL, Inference, System Modeling,
HPC
Available in OC-Accel and HDK
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
20
OpenCAPI Hybrid Memory Subsystem (HMS)
IBM Confidential 21
• Hybrid Memory Subsystem using Low Latency NAND and
DRAM
– Exclusive partnerships with Samsung for Z-NAND media
components and with Molex/Bittware for card
design/builds
– Z-NAND Flash for capacity and persistence
– DRAM used for caching to lower average latency
• Goals and Capabilities
– SCM on OpenCAPI using Load/Store memory semantics
– Competitive latency and bandwidth at reduced cost for
systems with high capacity memory requirements
– 1.5TB and 3TB card options
• Target Applications
– Primary: cost reduction on SAP HANA On-Line Analytical
Processing (OLAP) workloads with predominantly sequential,
mostly read-only access
– Additionally: genomics, MinIO, Apache Arrow
BonoM2 card
exploded view
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers and Microchip
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
22
Serial memory (OMI):
• Fewer pins = more channels
• Media controller moves out of the SoC
• Memory innovation decoupled from the SoC
• Asynchronous
• Model is extensible to persistent media
Parallel memory (DDR):
• 288 pins per channel = few channels
• SoC silicon changes for each media
• Synchronous
[Diagram: CPU driving a parallel memory bus to an RDIMM, with the media controller on the SoC, vs. CPU driving a serial memory bus to a DDIMM, with the media controller on the buffer; DRAM behind each]
Single interface provides for multiple media types
Media Independence
24
What is the SMC 1000 8x25G
• OMI Interface
• 1x8, 1x4 support
• OIF-28G-MR
• Up to 25.6 Gbps link rate
• Dynamic low power modes
• DDR4 Memory Interface
• x72 bit DDR4-3200, 2933, or 2666 MT/s memory support
• Supports up to 4 ranks
• Supports up to 16 GBit memory devices
• 3D stacked memory support
• Persistent Memory Support
• Support for NVDIMM-N modules
• Intelligent Firmware
• Open Source Firmware
• On-board processor provides DDR/OMI initialization, and
in-band temperature and error monitoring
• ChipLink GUI
• Security and Data Protection
• Hardware root-of-trust, secure boot, and secure update
• Single symbol correction/double symbol detection ECC
• Memory scrub with auto correction on errors
• Small Package and Low-Power
• Power optimized
• 17 mm x 17 mm package
• Peripherals Support
• Support for SPI, I²C, GPIO, UART and JTAG/EJTAG
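The SMC 1000's ECC operates on multi-bit symbols across a 72-bit DDR4 word; as a deliberately simplified, bit-level illustration of the same detect-and-correct principle, here is a single-error-correcting Hamming code over one byte (all names and sizes are illustrative, not the chip's actual scheme):

```python
# Simplified single-error-correcting Hamming code over one byte.
# The real device uses symbol-based ECC on 72-bit words; this bit-level
# model only illustrates how a syndrome locates and corrects an error.

DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two slots

def hamming_encode(byte):
    """Encode 8 data bits into a 12-bit codeword (parity at 1, 2, 4, 8)."""
    assert 0 <= byte < 256
    word = {}
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (byte >> i) & 1
    # Choose parity bits so the XOR of all set-bit positions becomes zero.
    syndrome = 0
    for pos, bit in word.items():
        if bit:
            syndrome ^= pos
    for p in (1, 2, 4, 8):
        word[p] = 1 if (syndrome & p) else 0
    return word

def hamming_correct(word):
    """Correct a single flipped bit (if any) and return the data byte."""
    syndrome = 0
    for pos, bit in word.items():
        if bit:
            syndrome ^= pos
    if syndrome:                    # a non-zero syndrome names the bad position
        word[syndrome] ^= 1
    byte = 0
    for i, pos in enumerate(DATA_POSITIONS):
        byte |= word[pos] << i
    return byte

# A memory scrubber would periodically run the decode path and write the
# corrected word back, which is the "memory scrub with auto correction" idea.
codeword = hamming_encode(0xA7)
codeword[7] ^= 1                    # inject a single-bit error
assert hamming_correct(codeword) == 0xA7
```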
DDIMMs are available today!
• The SMC 1000 8x25G will be available in standards-based DDIMMs in 1U and 2U form factors
• DDIMMs are provided by SMART Modular, Samsung Electronics, and Micron
[Photo: 1U DDIMM format, 72b DDR4-3200, 85 mm; traditional RDIMMs (133 mm) have a substantially larger footprint and routing requirements]
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
28
Coding
RTL? C/C++?
SNAP? OC-Accel?
FPGA development: Choice1 (traditional way)
Develop your code:
• Software side using:
• libcxl APIs
• FPGA side using:
• PSL interface (CAPI1.0)
• PSL/BSP interface (CAPI2.0)
• TLx/DLx (OpenCAPI)
[Diagram: Processes A, B, and C, each with a slave context, calling libcxl through the cxl driver down to the FPGA HDK (PSL, BSP, TLx/DLx)]
Big development effort
Targets extreme performance with full control
Programming based on libcxl and the PSL interface
Application on Host | Acceleration on FPGA
Software Program | Hardware Logic
CAPI or OpenCAPI
FPGA development: Choice2 (Recommended for all users)
SNAP/OC-Accel is an environment that makes it easy for programmers to
create FPGA accelerators and integrate them into their applications:
• Security based on IBM POWER's technology.
• Portable
• from CAPI1.0, to CAPI2.0 and to OpenCAPI
• from any Xilinx FPGA card based to another
• Open-source (once a driver is available, everyone can make use of it)
CAPI è https://github.com/open-power/snap
OpenCAPI3.0 è https://github.com/OpenCAPI/oc-accel
The CAPI/OpenCAPI – SNAP/OC-Accel concept
Action X
Action Y
Action Z
CAPI
SNAP
OC-Accel
Vivado
HLS
CAPI FPGA becomes a peer of the CPU
è Action directly accesses host memory
SNAP
Manage server threads and actions
Manage access to IOs (memory, network)
è Action easily accesses resources
FPGA
Gives on-demand compute capabilities
Gives direct IOs access (storage, network)
è Action directly accesses external resources
Vivado
HLS
Compile Action written in C/C++ code
Optimize code to get performance
è Action code can be ported efficiently
+
+
+
= Offload/accelerate C/C++ code with:
- Quick porting
- Minimum change in code
- Better performance than CPU
FPGA
or any memory managed by the host
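The "action directly accesses host memory" idea above can be modeled in plain software. This is not the real SNAP/OC-Accel API, just an illustrative sketch of the zero-copy flow: the host prepares data, the action works on the same coherent buffer, and the host reads the result back without intermediate copies:

```python
# Software-only model of the SNAP/OC-Accel flow. The "hardware action"
# here is an ordinary function; in the real framework it would be C/C++
# synthesized by Vivado HLS and reached over AXI. Names are illustrative.

def memcopy_action(host_memory, src, dst, size):
    """Model of a memcopy action: coherent access to host memory, no copies
    into a separate device buffer."""
    host_memory[dst:dst + size] = host_memory[src:src + size]

host_memory = bytearray(64)              # stands in for coherent host memory
host_memory[0:5] = b"hello"              # host writes the source data
memcopy_action(host_memory, 0, 32, 5)    # "offload" the job to the action
assert host_memory[32:37] == b"hello"    # host sees the result coherently
```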
SNAP/OC-Accel framework
[Diagram: Processes A, B, and C, each with a slave context, SNAP library, and job queue, calling libcxl through the cxl driver]
Application on Host | Acceleration on FPGA
[Diagram: the software program reaches the FPGA through CAPI or OpenCAPI and the HDK (PSL, BSP, TLx/DLx); a PSL/AXI bridge with host DMA, MMIO control, job manager, and job queue connects over AXI and AXI-Lite to the hardware action (C/C++ or RTL) and to on-card DRAM, NVMe, and network (TBD)]
Quick and easy development
Use a High-Level Synthesis tool to convert C/C++ to RTL, or directly use RTL
Programming based on the SNAP/OC-Accel library and the AXI interface
AXI is an industry standard for on-chip interconnection (https://www.arm.com/products/system-ip/amba-specifications)
2 different working modes
The Fixed-Action Mode (PARALLEL MODE): the FPGA action executes a job and returns after completion.
The Job-Queue Mode (SERIAL MODE): the FPGA action is designed to permanently run; a data-streaming approach with data-in and data-out queues.
[Diagram: software program (C/C++ function) on the host paired with a hardware action on the FPGA card for each mode]
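The two modes can be sketched in software, with the job-queue mode modeled as a permanently running worker draining a data-in queue. This is only an illustrative model of the control flow, not the framework's API:

```python
# Sketch of the two SNAP/OC-Accel working modes.
from queue import Queue
from threading import Thread

def fixed_action(job):
    """Fixed-Action Mode: execute one job, return after completion."""
    return job * 2

def streaming_action(data_in, data_out):
    """Job-Queue Mode: run permanently, streaming results out, until a
    None sentinel arrives on the data-in queue."""
    while True:
        item = data_in.get()
        if item is None:
            break
        data_out.put(item * 2)

data_in, data_out = Queue(), Queue()
worker = Thread(target=streaming_action, args=(data_in, data_out))
worker.start()
for x in (1, 2, 3):                 # host streams jobs in
    data_in.put(x)
data_in.put(None)                   # tell the action to stop
worker.join()
assert [data_out.get() for _ in range(3)] == [2, 4, 6]
assert fixed_action(21) == 42
```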
Why is CAPI simpler and faster? Because of memory coherency
Place computing closer to data
No multiple data copies
[Diagram: host cores (Core1, Core2, Core3) share coherent host memory with the CAPI-attached FPGA, which behaves like a « Core4 »; actions (Verilog or C/C++) reach on-card DRAM, NVMe, and network (TBD) over AXI and AXI-Lite, with CAPI config/status]
From a CPU-centric architecture ... to a ... server-memory-centric architecture
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
36
Nimbix Cloud
37
Test the complete path for only $0.36 per hour
($3/h of deployment)
Benefits of Nimbix cloud vs other clouds:
• Much cheaper development price (development on a standard x86 with no FPGA)
• You can bring your compiled and tested design back to your premises at no additional cost
Nimbix Cloud
38
Alpha-Data KU3, Bittware N250S, Power8 servers
Available today:
SNAP for CAPI1.0
Available soon:
Power9 servers
Alpha-Data 9H7 with OpenCAPI link
Xilinx U200 with CAPI2.0 link
SNAP for CAPI2.0
OC-Accel for OpenCAPI3.0
3 good reasons why to try:
• No experience needed (just an hour or two)
• For everyone (just know C/C++)
• No investment to evaluate (just 36¢ per hour)
3 steps how to implement:
• Isolate your function
• Simulate your function
• Execute your function
3 steps to discover and adopt (~1.5 hours):
• Log in to JARVICE
• Experience the flow
• Boost YOUR function
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
42
OpenCAPI Open Source Contributions on GitHub
▪ 3 reference designs:
▪ An OMI Host Side FPGA Reference RTL (non-encrypted Verilog and
VHDL)
▪ Named Fire
▪ An OMI Device Side FPGA Reference RTL (non-encrypted Verilog and
VHDL)
▪ Named Ice
▪ OpenCAPI 3.0 Device Side FPGA Reference RTL (non-encrypted
Verilog)
▪ OpenCAPI Simulation Engine
▪ OC-Accel
▪ OpenCAPI FPGA Developer Framework
▪ LibOCXL
▪ User Level API Library
§ https://github.com/OpenCAPI
§ Apache 2.0 licensing model
Rev 7
20180613
Open Source on GitHub - TLx and DLx Reference Designs in an FPGA
§ Open Verilog – free to enhance, improve or leverage pieces of reference design for
your own accelerator development
§ Designed for 64B packet flow running at 400MHz
§ Xilinx Vivado 2017.1 TLX and DLX Statistics on VU3P Device
44
VU3P Resources | CLB FlipFlops | LUT as Logic | LUT Memory | Block Ram Tile
DLx | 9392/788160 (1.19%) | 19026/394080 (4.82%) | 0/197280 (0%) | 7.5/720 (1.0%)
TLx | 13806/788160 (1.75%) | 8463/394080 (2.14%) | 2156/197280 (1.09%) | 0/720 (0%)
Total | 23108/788160 (2.94%) | 27849/394080 (6.98%) | 2156/197280 (1.09%) | 7.5/720 (1.0%)
§ Because power efficiency and size do matter
OpenCAPI Device
• Customer application and accelerator
• Operating system enablement
• Little Endian Linux
• Reference Kernel Driver (ocxl)
• Reference User Library (libocxl)
• Hardware and reference designs to enable coherent acceleration
[Diagram: processor core running the OS and app (software) with ocxl and libocxl, coherent memory on both sides, connected over the 25G cable (TL/DL on the host, TLx/DLx on the device) to the accelerated function(s)]
Ø OCSE (OpenCAPI Simulation Environment) models the red outlined area
Ø OCSE enables AFU and application co-simulation when the reference libocxl and reference TLx/DLx are used
Ø Will be put out in the public GitHub environment as well
Open Source on GitHub - OpenCAPI Simulation Environment
45
Open Source on GitHub - Exerciser Examples
§ MemCopy
• The MemCopy example is a data mover from source address -> destination address
using Virtual Addressing and includes these features
• Configuration and MMIO Register Space
• acTag Table used for Bus/Device/Function and Process ID identification
• 512 processes/contexts and 32 engines supporting up to 2K transfers using 64B,
128B, or 256B operations
§ Memory Home Agent
• The Memory Home Agent example implements memory off the endpoint OpenCAPI
accelerator to act as a coherent extension to the host processor memory
• The Memory Home Agent example includes these features
• Configuration and MMIO Register Space
• Individual and pipelined operation for memory loads and stores
• Interrupts, with error details reported to software through MMIO registers
• Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space
§ Open Examples – free to enhance, improve or leverage pieces of exerciser
examples for your own accelerator development 46
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
47
OpenCAPI Workgroups and Specification Status
48
Item | Availability
TL Architecture Workgroup
TL 3.0 Architecture Specification | Approved by Board of Directors. Released publicly on opencapi.org
TL 3.1 Architecture Specification | Approved by Board of Directors. Released publicly on opencapi.org
TL 4.0 Architecture Specification | 30-day review period in Board of Directors just completed
PHY Signaling Workgroup
25 Gbps PHY Signaling Specification | Approved by Board of Directors. Available to OpenCAPI members
32 Gbps PHY Signaling Specification | Voted out of Workgroup. Next step is approval from the TSC
PHY Electro-Mechanical Workgroup
25 Gbps PHY Electro-Mechanical Specification | Approved by Board of Directors. Available to OpenCAPI members
32 Gbps PHY Electro-Mechanical Specification | Draft form in Workgroup
OpenCAPI Workgroups and Specification Status
49
Item | Availability
DL Architecture Workgroup
DL 3.0/3.1/4.0 Architecture Specification | Draft review. Finished 30-day review in WG
Enablement Workgroup
Device Discovery and Configuration Specification | Approved by WG. Available to OpenCAPI members
TLX 3.0 Reference Design Supplement | Posted to GitHub
Memory Home Agent Supplement | Approved by WG. Available to OpenCAPI members
Compliance Workgroup
OpenCAPI Ready | Approved by Board of Directors. Available to OpenCAPI members
OpenCAPI Compliant | Approved by Board of Directors. Available to OpenCAPI members
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
50
CAPI and OpenCAPI Performance
51
| CAPI 1.0 (PCIe Gen3 x8) Measured BW @8Gb/s | CAPI 2.0 (PCIe Gen4 x8) Measured BW @16Gb/s | OpenCAPI 3.0 (25 Gb/s x8) Measured BW @25Gb/s
128B DMA Read | 3.81 GB/s | 12.57 GB/s | 22.1 GB/s
128B DMA Write | 4.16 GB/s | 11.85 GB/s | 21.6 GB/s
256B DMA Read | N/A | 13.94 GB/s | 22.1 GB/s
256B DMA Write | N/A | 14.04 GB/s | 22.0 GB/s
POWER8: CAPI 1.0, introduced in 2013
POWER9: second generation, CAPI 2.0 and OpenCAPI 3.0 (an open architecture with a clean slate focused on bandwidth and latency)
Measured with Xilinx FPGAs
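A quick sanity check of the measured numbers against raw link rate: the OpenCAPI 3.0 column is 8 lanes at 25 Gbps, i.e. 25 GB/s raw, so 22.1 GB/s corresponds to roughly 88% link efficiency. A small worked calculation:

```python
# Back-of-envelope check of measured DMA bandwidth against raw link rate.
# OpenCAPI 3.0 here is 8 lanes x 25 Gbps = 200 Gbps = 25 GB/s raw.

lanes = 8
gbps_per_lane = 25
raw_gbytes = lanes * gbps_per_lane / 8     # 25.0 GB/s raw for the x8 link

measured_128b_read = 22.1                  # from the table above, GB/s
efficiency = measured_128b_read / raw_gbytes
print(round(efficiency, 3))                # roughly 0.88 of raw
```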
Latency Test Results
OpenCAPI Link: P9 OpenCAPI (3.9GHz core, 2.4GHz nest) to Xilinx FPGA VU3P
• 298ns‡ host side (TL, DL, PHY), 2ns jitter + TLx, DLx, PHYx (80nsǁ) = 378ns† total latency
PCIe G4 Link: P9 PCIe Gen4 to Xilinx FPGA VU3P
• est. <337ns PCIe stack + Xilinx PCIe HIP (218ns¶) = est. <555ns§ total latency
PCIe G3 Link: P9 PCIe Gen3 (3.9GHz core, 2.4GHz nest) to Altera FPGA Stratix V
• 337ns PCIe stack, 7ns jitter + Altera PCIe HIP (400ns¶) = 737ns§ total latency
PCIe G3 Link: Kaby Lake PCIe Gen3* to Altera FPGA Stratix V
• 376ns PCIe stack, 31ns jitter + Altera PCIe HIP (400ns¶) = 776ns§ total latency
* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz Turbo Boost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor provided HIP latency
ǁ Derived from simulation
¶ Vendor provided latency statistic
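Per the footnotes, each "Total Latency" is simply the host-side time plus the FPGA-side stack time (simulated or vendor-quoted). The arithmetic behind the four totals:

```python
# The total-latency figures are sums of host-side time and FPGA-side
# stack time, per the footnotes above.

def total_latency(host_ns, fpga_stack_ns):
    """Total link latency = host-side time + FPGA-side stack time."""
    return host_ns + fpga_stack_ns

assert total_latency(298, 80) == 378    # OpenCAPI: P9 + TLx/DLx/PHYx
assert total_latency(337, 218) == 555   # PCIe Gen4 est.: host + Xilinx HIP
assert total_latency(337, 400) == 737   # P9 PCIe Gen3: host + Altera HIP
assert total_latency(376, 400) == 776   # Kaby Lake Gen3: host + Altera HIP
```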
RACE TO ZERO LATENCY
BECAUSE JITTER MATTERS
52
OpenCAPI Topics
Ø Industry Background
Ø Technology Overview
Ø Possible Areas of Interest & Use Cases
Ø OpenCAPI Based Servers
Ø OpenCAPI & CAPI2 Adapters
Ø IBM Offered Solutions (HMS)
Ø OMI Enabled Buffers
Ø OpenCAPI & CAPI Programming Frameworks
Ø OpenCAPI & CAPI Cloud Environments
Ø Open Source Contributions
Ø Specification Status
Ø Performance Metrics
Ø OpenCAPI Consortium
53
Membership Status
OpenCAPI Protocol
Welcoming new members in all areas of the ecosystem
[Member logos by category: Systems and Software, SW, Research & Academic, Products and Services, Deployment, SOC, Accelerator Solutions, State Key Lab of High-end Server & Storage Technology]
Membership Entitlement Details
Strategic level - $25K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Vote on new Board Members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials
Observing level - $5K
• Final Specifications and enablement
• License for Product development
Contributor level - $15K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Submit proposals
Academic and Non-Profit level - Free
• Final Specifications and enablement
• Workgroup participation and voting
55
OpenCAPI Consortium Next Steps
JOIN TODAY!
www.opencapi.org
56

OpenCAPI Technology Ecosystem

  • 1.
  • 2.
    OpenCAPI Topics Ø IndustryBackground Ø Technology Overview Ø Possible Areas of Interest & Use Cases Ø OpenCAPI Based Servers Ø OpenCAPI & CAPI2 Adapters Ø IBM Offered Solutions (HMS) Ø OMI Enabled Buffers Ø OpenCAPI & CAPI Programming Frameworks Ø OpenCAPI & CAPI Cloud Environments Ø Open Source Contributions Ø Specification Status Ø Performance Metrics Ø OpenCAPI Consortium 2
  • 3.
    Industry Background § Datarich but insight poor § ½ of the world’s data has been generated in the last 2 years alone! § Only 2% of the world’s data has actually been analyzed into actionable intelligence § Data is growing at an ever alarming rate with social media, sensor data, camera data and the like § Diverse data types has been become a real challenge § Tough problems need to be addressed beyond CPUs and GPUs § FPGAs and other specialized HW should be considered for those class of problems where CPUs and GPUs fall short § Data prep feeding a GPU is a prime example 3
  • 4.
    Processor Architecture Bottleneck 4 •CPU Core count have increased but memory and IO subsystems have not kept up • Direct attach DRAM buses provide limited quantity and performance scaling • Processor devices experiencing increasing latency per core • Modern AI workloads and exponential data growth require lower latency and greater BW Memory BW / Core I/O BW / Core Mem Latency / Core
  • 5.
    3 typical cases when youshould consider Boost a function Offload your CPU using external accelerators Free network resources 100101010100011001100 110010010010010101010 001100110011001001001 001010101000110011001 100100100100101010100 011001100110010010010 010101010001100110011 001001001001010101000 110011001100100100100 101010100011001100110 010010010010101010001 100110011001001001001 0101010001100 CPU CPU CPU CPU CPU DATA
  • 6.
    GPU Thousands of tinyCPU using high parallelization è compute intensive application Logic + IOs are customized exactly for the application's needs. è Very low and predictable latency applications 2 options FPG A
  • 7.
    OpenCAPI Topics Ø IndustryBackground Ø Technology Overview Ø Possible Areas of Interest & Use Cases Ø OpenCAPI Based Servers Ø OpenCAPI & CAPI2 Adapters Ø IBM Offered Solutions (HMS) Ø OMI Enabled Buffers Ø OpenCAPI & CAPI Programming Frameworks Ø OpenCAPI & CAPI Cloud Environments Ø Open Source Contributions Ø Specification Status Ø Performance Metrics Ø OpenCAPI Consortium 7
  • 8.
    Why take alook at OpenCAPI? § IBM has been leading the charge in acceleration since CAPI 1.0 was introduced in Power8 in 2013 § OpenCAPI is the 3rd generation of acceleration that is now architecture agnostic and is not tied to the Power ISA § It takes multiple generations to get a new acceleration offering that meets all the technical requirements that were initially defined § OpenCAPI has at least a 4 year time to market advantage over the rest of the industry § Any accelerator development using the existing OpenCAPI acceleration technologies today will have direct applicability as the industry and markets mature § Our open source framework supporting OpenCAPI 3.0 (OC-Accel) is available in a public github environment today! 8
  • 9.
    Accelerated OpenCAPI Device OpenCAPI KeyAttributes 9 TL/DL 25Gb I/O Any OpenCAPI Enabled Processor U Accelerated Function TLx/DLx 1. Architecture agnostic bus – Applicable with any system/microprocessor architecture 2. Optimized for High Bandwidth and Low Latency 3. High performance 25Gbps PHY design 4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor 5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit 6. Wide range of Use Cases and access semantics 7. CPU coherent device memory (Home Agent Memory) 8. Architected for both Classic Memory and emerging Advanced Storage Class Memory 9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA) Caches Application § Storage/Compute/Network etc § ASIC/FPGA/FFSA § FPGA, SOC, GPU Accelerator § Load/Store or Block Access Standard System Memory Advanced SCM Solutions BufferedSystemMemory OpenCAPIMemoryBuffers Device Memory
  • 10.
    OpenCAPI in Power9 OpenCAPI attach capabilities are broken into two subclasses • Compute (AFUc) for function acceleration using a more traditional IO model with DMAs mastered by the device • Memory (AFUm) for attaching various memory technologies using Loads / Stores mastered by the host OpenCAPI 3.0 – 25 Gbps (P9) • AFUC1, AFUM1 Available today with our Power9 offerings and at least 4 years before Intel’s first implementation of CXL 10 OpenCAPI Hostinterface Memory Controller System Memory CPU Device Memory OpenCAPI Device OpenCAPI Deviceinterface AFUC1 Cnfg AFUM1
  • 11.
    OpenCAPI in Power10 OpenCAPI attach capabilities are broken into two subclasses • Compute (AFUc) for function acceleration using a more traditional IO model with DMAs mastered by the device • Memory (AFUm) for attaching various memory technologies using Loads / Stores mastered by the host Power 9: OpenCAPI 3.0 @ 25 Gbps • AFUC1, AFUM1 Power 10: OpenCAPI 3.1 @ 25.6 Gbps • OpenCAPI Memory Interface (OMI) Power 10: OpenCAPI 4.0 @ 25 Gbps • Posted Writes • AFUC2 - EA Cache IBM Confidential 11 OpenCAPI Hostinterface OMI System Memory CPU Device Memory OpenCAPI Device OpenCAPI Deviceinterface AFUC2Cnfg EA Cache AFU OCMBOCMBOCMBOCMB
  • 12.
    OpenCAPI Topics Ø IndustryBackground Ø Technology Overview Ø Possible Areas of Interest & Use Cases Ø OpenCAPI Based Servers Ø OpenCAPI & CAPI2 Adapters Ø IBM Offered Solutions (HMS) Ø OMI Enabled Buffers Ø OpenCAPI & CAPI Programming Frameworks Ø OpenCAPI & CAPI Cloud Environments Ø Open Source Contributions Ø Specification Status Ø Performance Metrics Ø OpenCAPI Consortium 12
Acceleration Paradigms with Great Performance
Paradigm diagrams (each: Processor Chip ↔ DLx/TLx ↔ Accelerator):
• Main Transform – basic work offload
• Egress Transform
• Ingress Transform
• Bi-Directional Transform
• Needle-in-a-Haystack – a large haystack of data sits with the accelerator engine; only the needles are sent to the processor
Example workloads:
• Encryption, compression, erasure coding prior to delivering data to the network or storage
• NoSQL such as Neo4j with graph node traversals
• Machine or deep learning such as natural language processing, sentiment analysis, or other actionable intelligence using OpenCAPI-attached memory
• Database searches, joins, intersections, merges
• Video analytics, network security, deep packet inspection, data-plane acceleration, video encoding (H.265), high-frequency trading, etc.
OpenCAPI WINS due to bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture
Comparison of Memory Paradigms
• OpenCAPI 3.1 architecture (OMI): ultra-low-latency ASIC memory buffer chip adding ~5 ns on top of native DDR direct connect! Microchip is our first partner to offer an OMI-based memory buffer.
• Emerging storage class memory: SCM tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 & 3.0 architecture, while retaining load/store semantics.
• Storage class memories have the potential to be the next disruptive technology. Examples include ReRAM, MRAM, Z-NAND; all are racing to become the de facto main memory.
Diagrams: basic DDR attach (Processor Chip → Memory Buffer → DDR4/5); tiered memory (tier-1 DDR4/5 via memory buffer, tier-2 SCM via OpenCAPI)
Power9 Systems with OpenCAPI – IBM Offered IC922
System details:
• 2-socket 2U
• Up to 44 cores
• Up to 4 TB memory (32 DDR4 DIMMs)
• 4 Gen4 PCIe slots, CAPI 2.0 enabled
• 6 Gen3 PCIe slots
• Up to 24 SFF / 12 LFF drives
• 4 x8 25 Gbps ports
• Up to 4 cabled OpenCAPI adapters*
Power9 Systems with OpenCAPI – IBM AC922 (air cooled)
System details:
• 2-socket 2U
• Up to 40 cores
• Up to 2 TB memory (16 DDR4 DIMMs)
• 4 Gen4 PCIe slots, 3 CAPI 2.0 enabled
• 2 2.5" SFF drive bays
• 4 OpenPOWER mezzanine sockets
• Up to 4 NVLink V100 GPUs
• Up to 4 socketed OpenCAPI adapters*
• Up to 1 cabled OpenCAPI card w/ SlimSAS adapter*
OpenCAPI Adapters
Mellanox Innova-2 – network + FPGA:
• Xilinx US+ KU15P FPGA, Mellanox CX5 NIC
• 16 GB DDR4
• 2 25Gb SFP cages
• x8 25 Gb/s OpenCAPI support
• Network acceleration (NFV, packet classification), security acceleration
• Available in HDK
Nallatech 250-SoC – multipurpose converged network/storage:
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 (ARM)
• PCIe Gen3 x16, CAPI2
• 4 x8 OCuLink ports supporting NVMe, network, or OpenCAPI
• 2 100Gb QSFP28 cages
• NVMe-oF target, high-bandwidth storage server
• Orderable through IBM with an RPQ; available in OC-Accel and HDK
AlphaData ADM-9H7 – large FPGA with 8 GB HBM:
• Xilinx US+ VU37P FPGA + 8 GB high-bandwidth memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI ports (supporting up to 50 GB/s)
• 4 100Gb QSFP28 cages
• ML/DL, inference, system modeling, HPC
• To be deployed in the Nimbix Cloud; available in OC-Accel and HDK
AlphaData ADM-9H3 – medium FPGA with 8 GB HBM:
• Xilinx Virtex US+ VU33P-3 FPGA + 8 GB high-bandwidth memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI port
• 1 2x100Gb QSFP28-DD cage
• ML/DL, inference, system modeling, HPC
• Available in OC-Accel and HDK
OpenCAPI Hybrid Memory Subsystem (HMS)
• Hybrid memory subsystem using low-latency NAND and DRAM
  – Exclusive partnerships with Samsung for Z-NAND media components and with Molex/Bittware for card design/builds
  – Z-NAND flash for capacity and persistence
  – DRAM used for caching to lower average latency
• Goals and capabilities
  – SCM on OpenCAPI using load/store memory semantics
  – Competitive latency and bandwidth at reduced cost for systems with high-capacity memory requirements
  – 1.5 TB and 3 TB card options
• Target applications
  – Primary: cost reduction on SAP HANA OLAP workloads – database On-Line Analytical Processing (OLAP) with predominantly sequential, mostly read-only processing
  – Additionally: genomics, MinIO, Apache Arrow
Figure: BonoM2 card exploded view
Media Independence – a single interface provides for multiple media types
Parallel memory bus (CPU → RDIMM, media controller in the SoC):
• 288 pins per channel = few channels
• SoC silicon changes for each media type
• Synchronous
Serial memory bus (CPU → DDIMM, media controller on the DIMM):
• Fewer pins = more channels
• Media controller moves out of the SoC
• Memory innovation decoupled from the SoC
• Asynchronous
• Model is extensible to persistent media
What is the SMC 1000 8x25G?
• OMI interface
  – 1x8, 1x4 support; OIF-28G-MR
  – Up to 25.6 Gbps link rate
  – Dynamic low-power modes
• DDR4 memory interface
  – x72-bit DDR4-3200, 2933, or 2666 MT/s memory support
  – Supports up to 4 ranks and up to 16 Gbit memory devices
  – 3D stacked memory support
• Persistent memory support
  – Support for NVDIMM-N modules
• Intelligent firmware
  – Open-source firmware
  – On-board processor provides DDR/OMI initialization and in-band temperature and error monitoring
  – ChipLink GUI
• Security and data protection
  – Hardware root-of-trust, secure boot, and secure update
  – Single-symbol-correction / double-symbol-detection ECC
  – Memory scrub with auto-correction on errors
• Small package and low power
  – Power optimized; 17 mm x 17 mm package
• Peripherals support
  – SPI, I²C, GPIO, UART, and JTAG/EJTAG
DDIMMs are available today!
• The SMC 1000 8x25G will be available in standards-based DDIMMs in 1U and 2U form factors
• DDIMMs are provided by SMART Modular, Samsung Electronics, and Micron
• 1U DDIMM format: 72b DDR4-3200 (dimensions shown: 85 mm, 133 mm)
• Traditional RDIMMs have a substantially larger footprint and routing requirements
FPGA development: Choice 1 (traditional way)
Develop your code:
• Software side using libcxl APIs
• FPGA side using:
  – PSL interface (CAPI 1.0)
  – PSL/BSP interface (CAPI 2.0)
  – TLx/DLx (OpenCAPI)
Programming is based on libcxl and the PSL interface (HDK: PSL – BSP – TLx/DLx). A big development effort, but it targets extreme performance with full control.
Diagram: application processes on the host (each with a slave context, libcxl, cxl, and job queue) driving hardware logic on the FPGA over CAPI or OpenCAPI.
FPGA development: Choice 2 (recommended for all users)
SNAP/OC-Accel is an environment that makes it easy for programmers to create FPGA accelerators and integrate them into their applications:
• Secure – based on IBM POWER technology
• Portable – from CAPI 1.0 to CAPI 2.0 to OpenCAPI, and from any Xilinx-based FPGA card to another
• Open source – once a driver is available, everyone can make use of it
CAPI è https://github.com/open-power/snap
OpenCAPI 3.0 è https://github.com/OpenCAPI/oc-accel
The CAPI/OpenCAPI – SNAP/OC-Accel concept
CAPI/OpenCAPI + SNAP/OC-Accel + FPGA + Vivado HLS = offload/accelerate C/C++ code with quick porting, minimal code change, and better performance than the CPU.
• CAPI/OpenCAPI: the FPGA becomes a peer of the CPU è an action (X, Y, Z, …) directly accesses host memory, or any memory managed by the host
• SNAP/OC-Accel: manages server threads and actions, and access to IOs (memory, network) è an action easily accesses resources
• FPGA: gives on-demand compute capability and direct IO access (storage, network) è an action directly accesses external resources
• Vivado HLS: compiles actions written in C/C++ and optimizes the code for performance è action code can be ported efficiently
SNAP/OC-Accel framework
• Quick and easy development
• Use a high-level synthesis tool to convert C/C++ to RTL, or use RTL directly
• Programming is based on the SNAP/OC-Accel library and the AXI interface
• AXI is an industry standard for on-chip interconnection (https://www.arm.com/products/system-ip/amba-specifications)
Diagram: host processes (each with a slave context, libcxl, cxl, SNAP library, and job queue) talk over CAPI or OpenCAPI (HDK: PSL – BSP – TLx/DLx) to a PSL/AXI bridge on the FPGA; the bridge exposes host DMA control, MMIO, and a job manager/queue over AXI and AXI-Lite to the C/C++ or RTL hardware action, which also reaches on-card DRAM, NVMe, and network (TBD).
2 different working modes
• The Fixed-Action Mode (parallel): the FPGA action executes a job and returns after completion
• The Job-Queue Mode (serial): the FPGA action is designed to run permanently – a data-streaming approach with data-in and data-out queues
Both modes pair a software program (a C/C++ function) with a hardware action.
Why is CAPI simpler and faster?
• Because of memory coherency: computing is placed closer to the data, with no multiple data copies
• Actions on the FPGA card (Verilog or C/C++) access host memory directly over AXI, with CAPI config/status over AXI-Lite, plus on-card DRAM, NVMe, and network (TBD)
• The accelerator behaves like an additional core ("Core4"-like alongside Core1–Core3)
• From a CPU-centric architecture to a server-memory-centric architecture
Nimbix Cloud
Test the complete path for only $0.36 per hour for development ($3/h for deployment)
Benefits of the Nimbix cloud vs other clouds:
• Much cheaper development price (development on a standard x86 with no FPGA)
• You can bring your compiled and tested design back on premises at no additional cost
Nimbix Cloud
Available today: Power8 servers; Alpha-Data KU3; Bittware N250S; SNAP for CAPI 1.0
Available soon: Power9 servers; Alpha-Data 9H7 with OpenCAPI link; Xilinx U200 with CAPI 2.0 link; SNAP for CAPI 2.0; OC-Accel for OpenCAPI 3.0
3 good reasons why to try
• No experience needed – just an hour or two
• For everyone – just know C/C++
• No investment to evaluate – just 36¢ per hour
3 steps how to implement
• Isolate your function
• Simulate your function
• Execute your function
3 steps to discover and adopt (~1.5 hours)
• Log in to JARVICE
• Experience the flow
• Boost YOUR function
OpenCAPI Open Source Contributions on GitHub
▪ 3 reference designs:
  ▪ An OMI host-side FPGA reference RTL (non-encrypted Verilog and VHDL), named Fire
  ▪ An OMI device-side FPGA reference RTL (non-encrypted Verilog and VHDL), named Ice
  ▪ OpenCAPI 3.0 device-side FPGA reference RTL (non-encrypted Verilog)
▪ OpenCAPI Simulation Engine
▪ OC-Accel – the OpenCAPI FPGA developer framework
▪ LibOCXL – user-level API library
§ https://github.com/OpenCAPI
§ Apache 2.0 licensing model
Open Source on GitHub - TLx and DLx Reference Designs in an FPGA
§ Open Verilog – free to enhance, improve, or leverage pieces of the reference design for your own accelerator development
§ Designed for 64B packet flow running at 400 MHz
§ Xilinx Vivado 2017.1
§ Because power efficiency and size do matter
TLx and DLx statistics on a VU3P device:
  Resource | CLB flip-flops        | LUTs as logic        | LUT memory          | Block RAM tiles
  DLx      | 9392/788160 (1.19%)   | 19026/394080 (4.82%) | 0/197280 (0%)       | 7.5/720 (1.0%)
  TLx      | 13806/788160 (1.75%)  | 8463/394080 (2.14%)  | 2156/197280 (1.09%) | 0/720 (0%)
  Total    | 23108/788160 (2.94%)  | 27849/394080 (6.98%) | 2156/197280 (1.09%) | 7.5/720 (1.0%)
Open Source on GitHub - OpenCAPI Simulation Environment (OCSE)
OpenCAPI device side:
• Customer application and accelerator
• Operating system enablement: little-endian Linux, reference kernel driver (ocxl), reference user library (libocxl)
• Hardware and reference designs to enable coherent acceleration
Ø OCSE models the device-side path: application è libocxl è ocxl è 25G TL/DL è cable è TLx/DLx è accelerated function(s), with coherent memory on both the core processor and device sides
Ø OCSE enables AFU and application co-simulation when the reference libocxl and reference TLx/DLx are used
Ø Will be put out in the public GitHub environment as well
Open Source on GitHub - Exerciser Examples
§ MemCopy
  • The MemCopy example is a data mover from source address to destination address using virtual addressing, and includes these features:
  • Configuration and MMIO register space
  • acTag table used for bus/device/function and process ID identification
  • 512 processes/contexts and 32 engines, supporting up to 2K transfers using 64B, 128B, or 256B operations
§ Memory Home Agent
  • The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension of the host processor memory, and includes these features:
  • Configuration and MMIO register space
  • Individual and pipelined operation for memory loads and stores
  • Interrupts, with error details reported to software through MMIO registers
  • Sparse address mapping feature to extend 1 MB of real space across 4 TB of address space
§ Open examples – free to enhance, improve, or leverage pieces of the exerciser examples for your own accelerator development
OpenCAPI Workgroups and Specification Status
TL Architecture Workgroup:
• TL 3.0 Architecture Specification – approved by the Board of Directors; released publicly on opencapi.org
• TL 3.1 Architecture Specification – approved by the Board of Directors; released publicly on opencapi.org
• TL 4.0 Architecture Specification – 30-day review period in the Board of Directors just completed
PHY Signaling Workgroup:
• 25 Gbps PHY Signaling Specification – approved by the Board of Directors; available to OpenCAPI members
• 32 Gbps PHY Signaling Specification – voted out of the workgroup; next step is approval from the TSC
PHY Electro-Mechanical Workgroup:
• 25 Gbps PHY Electro-Mechanical Specification – approved by the Board of Directors; available to OpenCAPI members
• 32 Gbps PHY Electro-Mechanical Specification – draft form in the workgroup
OpenCAPI Workgroups and Specification Status (continued)
DL Architecture Workgroup:
• DL 3.0/3.1/4.0 Architecture Specification – draft review; finished 30-day review in the workgroup
Enablement Workgroup:
• Device Discovery and Configuration Specification – approved by the workgroup; available to OpenCAPI members
• TLX 3.0 Reference Design Supplement – posted to GitHub
• Memory Home Agent Supplement – approved by the workgroup; available to OpenCAPI members
Compliance Workgroup:
• OpenCAPI Ready – approved by the Board of Directors; available to OpenCAPI members
• OpenCAPI Compliant – approved by the Board of Directors; available to OpenCAPI members
CAPI and OpenCAPI Performance (measured with Xilinx FPGAs)
                  CAPI 1.0               CAPI 2.0                OpenCAPI 3.0
                  PCIe Gen3 x8 @ 8 Gb/s  PCIe Gen4 x8 @ 16 Gb/s  25 Gb/s x8
  128B DMA Read   3.81 GB/s              12.57 GB/s              22.1 GB/s
  128B DMA Write  4.16 GB/s              11.85 GB/s              21.6 GB/s
  256B DMA Read   N/A                    13.94 GB/s              22.1 GB/s
  256B DMA Write  N/A                    14.04 GB/s              22.0 GB/s
• POWER8 (introduced in 2013): CAPI 1.0
• POWER9: CAPI 2.0 (second generation) and OpenCAPI 3.0 – an open architecture built from a clean slate, focused on bandwidth and latency
Latency Test Results – RACE TO ZERO LATENCY, BECAUSE JITTER MATTERS
• OpenCAPI link: P9 OpenCAPI (3.9 GHz core, 2.4 GHz nest; TL, DL, PHY) to Xilinx VU3P FPGA (TLx, DLx, PHYx) – 298 ns‡ link (2 ns jitter) + 80 nsǁ TLx/DLx/PHYx = 378 ns† total latency
• PCIe Gen4 link: P9 PCIe Gen4 to Xilinx VU3P FPGA – est. <337 ns PCIe stack + 218 ns¶ Xilinx PCIe HIP = est. <555 ns§ total latency
• PCIe Gen3 link: P9 PCIe Gen3 (3.9 GHz core, 2.4 GHz nest) to Altera Stratix V FPGA – 337 ns (7 ns jitter) PCIe stack + 400 ns¶ Altera PCIe HIP = 737 ns§ total latency
• PCIe Gen3 link: Kaby Lake* PCIe Gen3 to Altera Stratix V FPGA – 376 ns (31 ns jitter) PCIe stack + 400 ns¶ Altera PCIe HIP = 776 ns§ total latency
* Intel Core i7 7700 quad-core 3.6 GHz (4.2 GHz Turbo Boost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor-provided HIP latency
ǁ Derived from simulation
¶ Vendor-provided latency statistic
Membership Status
Welcoming new members in all areas of the ecosystem around the OpenCAPI protocol:
• Systems and software
• Research & academic
• Products and services
• Deployment
• SoC
• Accelerator solutions
(Member logos include the State Key Lab of High-end Server & Storage Technology)
Membership Entitlement Details
Strategic level – $25K
• Draft and final specifications and enablement
• License for product development
• Workgroup participation and voting
• TSC participation
• Vote on new Board members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials
Contributor level – $15K
• Draft and final specifications and enablement
• License for product development
• Workgroup participation and voting
• TSC participation
• Submit proposals
Observing level – $5K
• Final specifications and enablement
• License for product development
Academic and Non-Profit level – free
• Final specifications and enablement
• Workgroup participation and voting
OpenCAPI Consortium Next Steps
JOIN TODAY! www.opencapi.org