Introduction to Acceleration with OpenCAPI
SCFE 2020 - March 24th 2020 - A.CASTELLANE
What Do You Need?
2
Current Computing Landscape
3
CPU technology advances have slowed the historical cost/performance improvements seen
over the last several decades => New CPU chips alone cannot handle current challenges!

Overburdened CPUs
+ Slow/Complex Algorithms & Functions (Computation)
+ Network & Data Access Rates (Data Access)
= Current Technology and Processing Overload!

Bad news: it's only going to get worse!
Next Set Of Challenges Is Here!
4
• Exponential Data Growth
• Compute-Intensive Algorithms
• Diverse Data Structures & Types
• Decreasing Time To Results: Hours .. Minutes .. Seconds .. Real Time
Compute
 AI, Machine / Deep Learning
 Video Processing
 Database / Big Data Analytics
Storage
 Scale-out Storage
 Petabytes of new data
 Intelligent / Compute SSDs
Networking
 Network Security
 Low-latency Networking
 Open vSwitch offload
 Software Defined Networking Acceleration
Next Challenges Affect All Computing Fields
5
Bank / Finance
• Risk analysis / Faster trading: Monte Carlo libraries
• Credit card fraud detection
• Blockchain acceleration
Video / Analytics
• Smart video surveillance from multiple video feeds
• 3D video stream from multi-angle video streams
• Image search / Object tracking / Scene recreation
• Multi-jpeg compression
Machine Learning / Deep learning
• Machine learning inference
• Accelerate frequently used ML / DL algorithms
Algorithm acceleration
• Compression on network path or storage
• Encryption on the fly to various memory types
• String match
But, what if you could have the best of both worlds!
Options: Software or Hardware?
6
• Software:
• Advantages:
• More rapid development leading to faster time to market
• Lower non-recurring engineering costs. Software can be reused easily.
• Heightened portability
• Ease of updating features or patching bugs
• Disadvantages:
• Slower run time
• Hardware
• Advantages:
• Much faster execution of functions
• Reduced power consumption
• Lower latency
• Increased parallelism and bandwidth
• Better utilization of area and functional components available on an integrated circuit (IC)
• Disadvantages:
• Lower ability to update designs once etched onto silicon
• Difficult to share Verilog/VHDL source code between different hardware platforms
• Higher costs of functional verification
• Longer development process and time to market
So, what’s the solution?
7
Hardware Acceleration
The use of computer hardware specially designed to perform functions more
efficiently than is possible in software alone running on a general-purpose CPU.

Two Options
• GPU: thousands of tiny CPUs using high parallelization → compute-intensive applications
• FPGA (Field Programmable Gate Array): logic + IOs are customized exactly for the
application's needs → very low and predictable latency applications
The Better Choice?
8
Due to the inherent logic and IO flexibility, speed, and
predictably low latency, FPGAs have a clear advantage.
FPGA Acceleration
FPGA = Field Programmable Gate Array
Historically programmed using Verilog/VHDL → compiled → mapped to FPGA HW logic
What is an FPGA?
9
• A re-programmable computer chip with lots of configurable logic
elements based on Lookup-Tables (LUT)
• Programmable switch matrix routing
• Configurable I/O and high-speed serial links
• Advantages in flexibility, speed, and low latency due to:
• Limited instruction set
• High parallelism
• Deep pipelines
• Integrated hard IP (Multiply/Add, SRAM, PLL, PCIe, Ethernet, DRAM, ...)
[Logical view: programmable logic elements routed through a programmable switch matrix]
FPGA Example (Bittware 250-SOC)
10
Bittware 250-SoC
Multipurpose Converged Network / Storage
• Xilinx Zynq UltraScale+ FPGA ZU19EG (with 64-bit Arm Cortex-A53 cores)
• Two 4GB DDR4 banks (for FPGA and ARM)
• PCIe Gen3 x16 / Gen4 x8 → CAPI2
• Up to 4 x8 OCuLink ports supporting NVMe, 100GbE and OpenCAPI
• 2x 100GbE QSFP28 cages
• Half Height - Half Length format
Basics of HW Acceleration
11
Standard CPU Setup (No Acceleration)
The CPU manages all data, memory access, functions, and flows: the Application and its
Function both run on the CPU, with everything passing through host memory. With
increased data, computing, storage, and network challenges, the result is:
• Overburdened CPU
• Slow functions
• Congested memory and output card access
Basics of HW Acceleration
12
Standard CPU Setup (No Acceleration)
→ CPU manages all data, memory access, functions, and flows
• Overburdened CPU
• Slow functions
• Congested memory and output card access
(Diagram: Application and Function on the CPU, attached to host memory)
HW Acceleration with FPGA
13
Classic Acceleration with FPGA (vs. the standard CPU setup)
• Faster functions on FPGA: only the function itself is offloaded from the CPU's burden
• CPU still handles FPGA memory access and data copying
• No data coherency
• FPGA historically programmed using Verilog/VHDL
(Diagram: the Function moves from the CPU to the FPGA card; the Application stays on
the CPU with host memory.)
HW Acceleration with FPGA
14
Standard CPU Setup (No Acceleration) vs. Classic Acceleration with FPGA
→ CPU manages all data, memory access, functions, and flows (overburdened CPU, slow
functions, congested memory and output card access)
→ CPU is used to manage FPGA memory access
• No data coherency (host memory copied to FPGA)
• FPGA historically programmed using Verilog/VHDL
• CPU still handles all memory and data access
Addressing Classic FPGA Acceleration Issues
15
• What is OpenCAPI?
• Open Coherent Accelerator Processor
Interface
• OpenCAPI is an open interface
architecture that allows any
microprocessor to attach to:
• Coherent user-level accelerators and
I/O devices
• Advanced memories accessible via
read/write or user-level direct
memory access (DMA) semantics
• Agnostic to processor architecture
• What is OC-Accel?
• OpenCAPI Acceleration Framework to
program FPGAs using C/C++ instead of
Verilog or VHDL
OpenCAPI 3.0
OC 3.1
OpenCAPI specifications are downloadable from www.opencapi.org
HW Acceleration with FPGA + OpenCAPI
16
Classic Acceleration with FPGA
→ CPU is used to manage FPGA memory access
• No data coherency (host memory copied to FPGA)
• FPGA historically programmed using Verilog/VHDL
• CPU still handles all memory and data access

Acceleration with FPGA + OpenCAPI
→ OpenCAPI IO interface on the FPGA accesses host memory directly
→ The function accesses only the host memory data it needs
→ Data coherency (data does not need to be copied to the FPGA)
→ Address translation (@function = @application)
→ FPGA programmed in C/C++ using the OC-Accel framework
FPGAs + OpenCAPI + OC-Accel Address All Issues
17
The original software/hardware trade-offs:
• Software advantages: more rapid development; lower non-recurring engineering costs;
heightened portability; ease of updating features or patching bugs.
Disadvantage: slower run time.
• Hardware advantages: much faster execution of functions; reduced power consumption;
lower latency; increased parallelism and bandwidth; better IC area and function
utilization. Disadvantages: lower ability to update design hardware; difficult to
share source code between FPGAs; higher costs of functional verification; longer
development process and time to market.

With FPGAs + OpenCAPI + OC-Accel:
• Hardware advantages are kept: the FPGA replaces the CPU for the function, is
function-specific, fast (with OpenCAPI direct memory access), and runs parallel,
function-only logic.
• Hardware disadvantages disappear: the FPGA is easily reconfigured with C/C++
updates; the C/C++ code is easily recompiled for different FPGAs, can be simulated
and debugged, and is easier to write and upload.
• Software advantages are kept: application engineers write C/C++ functions
(OC-Accel); the code is reusable and portable; the FPGA is reconfigurable with
C/C++ updates.
• Software's disadvantage disappears: the function executes faster on the FPGA while
relieving the CPU.
Ex: Monte-Carlo (FPGA Accelerated)
18
Monte Carlo analysis is a risk management technique used in the financial and
insurance industries to conduct quantitative analysis of risks.

By using CAPI with an FPGA, the C/C++ code was reduced 40x on the application side,
and 33% of memory and CPU was freed up (versus a non-CAPI FPGA).

Running 1 million iterations, results were at least 50x faster with CAPI and FPGA
technology on a POWER server.
Ex: PostgreSQL regex Matching Accelerated
19
PostgreSQL + OpenCAPI shows compelling “regex” performance increase by leveraging the bandwidth and virtual
addressing of OpenCAPI technology. In fact, accelerating the SQL with OpenCAPI-regex can be 4x to 10x faster than the
best PostgreSQL built-in functions (CPU multi-threads enabled).
PostgreSQL is a powerful, open source object-relational database system. SQL (Structured Query Language)
is used to communicate with a database.
Command example:
SELECT * FROM table WHERE pkt ~ pattern;
Basically: search the db for all pkt that match pattern.

Actual example single-search run times:
• CPU parallel Seq Scan: ~698ms
• Custom Scan (PFCAPI): ~161ms
Ex: Ultra Fast Data Acquisition (X-Ray Crystallography)
20
Goal: real-time mapping of biological structure by examining molecule scatter plots
of a protein crystal struck by x-rays.

GPU + PCIe Configuration (Today) — 9GBps over PCIe:
1. Data acquisition: digital camera sensors, 4 MPixels @ 1.1kHz (raw data)
2. Raw data to real image conversion (GPU): protein molecule mapped to a real image
3. Decimate / sort images
4. Data compression (compressed data out)
Ex: Ultra Fast Data Acquisition (X-Ray Crystallography)
21
Goal: real-time mapping of biological structure by examining molecule scatter plots
of a protein crystal struck by x-rays.

FPGA w/ OpenCAPI (Goal) — 22GBps over OpenCAPI 3.0:
1. Data acquisition: digital camera sensors, 10 MPixels @ 2.2kHz (22GBps raw data)
2. Raw data to real image conversion: dual FPGAs in parallel (unfiltered image →
filtered image)
3. Decimate / sort images (GPU or FPGA or both)
4. Data compression on the host with the NX-gzip embedded HW accelerator
(compressed data out)

OpenCAPI breaks the 9GBps PCIe bottleneck!
Ex: Pull Quote
22
The benefit of using POWER interfaces, i.e., NVLink and OpenCAPI, is
not only bandwidth, but these interfaces allow also for coherent
memory access. FPGA board connected via OpenCAPI or GPGPU
connected via NVLink sees host (CPU) virtual memory space exactly like
the process running on the CPU, reducing the burden of writing
reliable and secure applications. Memory coherency can be also
available for PCIe FPGA accelerators installed in POWER9 servers via
OpenCAPI predecessor, the Coherent Accelerator Processor Interface
(CAPI). IBM also provides optimized software to benefit from the
architecture, including the CAPI Storage, Network, and Analytics
Programming (SNAP) framework that simplifies the integration of
FPGA designs with POWER9, as well as optimized ML and data analysis
routines for GPGPUs or FPGAs.
Structural Dynamics 7, 014305 (2020); https://doi.org/10.1063/1.5143480
Ex: Memory Coherency
23
Scenario: 2MB of data, scattered in host memory, is processed in an FPGA.

« Classic » PCIe FPGA card:
1. Gathering the data (SW memcopy of the scattered blocks)
2. One transaction moving a big amount of data (2MB) to the FPGA

« CAPI-enabled » FPGA card:
1. One transaction of 8kB (the AddrSet) from host memory to the FPGA
2. 1024 transactions of 2kB from host memory to the FPGA: the function directly
reads the required data at random addresses.

Results: the CAPI-enabled method was 2-3x faster than the classic method.
Ease of FPGA Programming (OC-Accel)
24
Benefits:
• Faster Time To Market: port a function to an FPGA in days, not months
• No Obsolescence: simply recompile unchanged C/C++ code for a different FPGA
• No Link Constraint: moving from a CAPI (over PCIe) link to OpenCAPI is just a matter
of recompiling - no code change
• No Specific Hardware Skills Needed: the C/C++ coder can focus on functionality, as
all the resources are managed by the framework
• Open-Source Framework: the code can be modified and improved by any user
Example:
• Note: SNAP is the predecessor to OC-Accel; the overall flow and performance are
equivalent.
• A customer ported and optimized SHA3 C code within 10 days using the SNAP framework,
versus 4 months in VHDL without SNAP.
Development Plans:
• OC-Accel with OpenCAPI today, OC-Accel with other emerging standards like CXL tomorrow!
FPGAs + OpenCAPI + OC-Accel Has It All
25
Very high bandwidth
Faster development and time
to market with OC-Accel
Spreading the CAPI Love (OC-Accel)
26
Developers Aren’t Where We Need Them
Software stack (top to bottom): Scripting · Interpreted App (Python / Rails / Java) ·
Non-Interpreted App (C++ / Java JRE) · Procedural App (C / C++) · High Level OS
(C / C++) · Firmware · HW API (C, ASM) · Kernel (C, AS) · HDL
Chart content courtesy of Aaron Sullivan @Rackspace
Spreading the CAPI Love (OC-Accel)
27
Developers Where We Need Them
Same stack - Scripting · Interpreted App (Python / Rails / Java) · Non-Interpreted App
(C++ / Java JRE) · Procedural App (C / C++) · High Level OS (C / C++) · Kernel (C, AS)
· HW API (C, ASM) · Firmware · HDL - but with new "Soft-Hardware" abstraction layers
that let applications at the top of the stack drive FPGA logic.
Chart content courtesy of Aaron Sullivan @Rackspace
- Know more about accelerators?
- See a live demonstration?
- Do a benchmark?
- Get answers to your questions?
Contact us
alexandre.castellane@fr.ibm.com
bruno.mesnet@fr.ibm.com
fabrice_moyen@fr.ibm.com
luyong@cn.ibm.com
shgoupf@cn.ibm.com
28
29
Thank You!

SCFE 2020 OpenCAPI presentation as part of OpenPOWER Tutorial
