AI Inference Acceleration with components all in Open Hardware: OpenCAPI and NVDLA
Deep Learning Inference Engine for CAPI/OpenCAPI
October 27, 2019
IBM China System Lab
Peng Fei GOU (shgoupf@cn.ibm.com)
2
Motivation
Path to hardware acceleration for AI
✓ Deep learning inference acceleration is in demand everywhere, from edge to cloud
✓ POWER9 needs a hardware acceleration solution for AI
OpenCAPI-NVDLA: demonstration on P9 heterogeneous computing platform
✓ Aligns with the Open Hardware strategy
✓ Fast and simple acceleration deployment on servers with FPGA and OpenCAPI
NVDLA: OPEN SOURCE inference engine from NVIDIA
✓ NVIDIA Deep Learning Accelerator
✓ High quality: production-level open source RTL
✓ Flexibility: configurable architecture to fulfill different business needs
3
Open Hardware Ecosystem
NVDLA
✓ Open hardware design
✓ Part of NVIDIA's Xavier SoC
✓ Open source compiler
✓ SiFive + NVDLA collaboration
✓ Tens of startups are starting to leverage NVDLA
✓ Active community
OpenPOWER
✓ Open ISA
✓ Open reference design
✓ Encourages more open innovation in hardware
✓ Rich ecosystem and partners, from software and system hardware to chips
4
Hardware Backends for AI Acceleration
[Figure: applications, high-level APIs, and management tools (TensorFlow, Keras, PyTorch, etc.) run on top of two hardware worlds.]
World of X86: Intel x86 CPU with NVIDIA GPU, PCIe FPGA, Google TPU, and PCIe ASIC (NPU)
World of OpenPOWER: IBM POWER CPU with PCIe ASIC (NPU), NVIDIA GPU, OpenCAPI FPGA, and PCIe FPGA
[Figure: OpenCAPI-NVDLA system block diagram]
Host (POWER9): inference application, user mode driver, host memory
Link: OpenCAPI 25G, through TLX/DLX
FPGA: OC-ACCEL connected to NVDLA-hw through AXI Lite, AXI, and interrupt; on-chip RAM attached through AXI
NVDLA-hw blocks: Memory Interface, Control Interface, Conv Buffer, Conv Core, Activation, Pooling, LRN, Reshape, Bridge DMA
OC/AXI Bridge mode blocks: TLx, DLx, CFG, snap, data bridge, mmio
NVDLA open source inference engine, adapted to FPGA and OpenCAPI
POWER9 server with CAPI or OpenCAPI: Mihawk, Inspur 5280/5290, etc.
FPGA card provided by vendors
CAPI user mode drivers
Applications: image recognition, etc.
6
Multi Engine Structure
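[Figure: multi-engine block diagram. A Job Manager holding a Job Queue (Job 0 … Job M) drives engines NVDLA 0, NVDLA 1, …, NVDLA N over a Local Configuration Bus. Each engine is a master (M) on an AXI Interconnect, which connects to the AXI/PSL Bridge or AXI/TLX Bridge.]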
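As a rough illustration of the multi-engine structure, the sketch below models a job manager that hands queued jobs to whichever NVDLA engine is idle. The names (`JobManager`, `dispatch`, `complete`) and the first-idle scheduling policy are assumptions made for illustration; the slide does not say how the hardware job manager arbitrates.

```python
from collections import deque


class JobManager:
    """Toy model: one job queue feeding N engines (hypothetical policy)."""

    def __init__(self, num_engines):
        self.queue = deque()                # Job 0 ... Job M
        self.busy = [None] * num_engines    # job running on NVDLA 0 ... N

    def submit(self, job):
        self.queue.append(job)

    def dispatch(self):
        """Assign queued jobs to idle engines; return (engine, job) pairs."""
        started = []
        for engine, running in enumerate(self.busy):
            if running is None and self.queue:
                job = self.queue.popleft()
                self.busy[engine] = job
                started.append((engine, job))
        return started

    def complete(self, engine):
        """Mark an engine idle again, e.g. on its completion interrupt."""
        job, self.busy[engine] = self.busy[engine], None
        return job


mgr = JobManager(num_engines=2)
for name in ("job0", "job1", "job2"):
    mgr.submit(name)
first = mgr.dispatch()     # both engines pick up a job
mgr.complete(0)            # engine 0 finishes
second = mgr.dispatch()    # engine 0 picks up the remaining job
```

With two engines and three jobs, the third job waits in the queue until one engine reports completion.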
Software Stack
Flow: DL training framework -> (model) -> parser, compiler, optimizer -> (loadable) -> user-mode driver -> (ioctl()) -> kernel-mode driver -> (register writes) -> CAPI NVDLA hardware
DL training framework: publicly available (Caffe, etc.)
Parser, compiler, optimizer: open sourced by NVIDIA around September 2019. They transform a trained network into NVDLA loadables, run offline, and do not have to run on POWER. NVIDIA's parser, compiler, and optimizer are enough to support early-stage evaluation.
User-mode and kernel-mode drivers: open sourced by NVIDIA and changed to CAPI user mode. The kernel-mode driver was changed and eliminated to adapt to CAPI mode.
Hardware: CAPI NVDLA hardware with OpenCAPI
Applications: image recognition, etc. Applications and workloads run with the OpenCAPI-NVDLA user-mode driver on POWER9 platforms.
[Figure: example network with Conv 1 to Conv 5, three MAXPooling layers, and FC 1 to FC 3]
8
Driver Changes for OpenCAPI
Original call chain: user mode driver -> ioctl() -> NVDLA DRM driver -> NVDLA firmware -> hardware

ioctl() request               NVDLA DRM driver         Function called
DRM_IOCTL_NVDLA_GEM_CREATE    nvdla_gem_create()       drm_gem_handle_create()
DRM_IOCTL_NVDLA_GEM_MMAP      nvdla_gem_map_offset()   drm_gem_create_mmap_offset()
DRM_IOCTL_NVDLA_DESTROY       nvdla_gem_destroy()      drm_gem_dumb_destroy()
DRM_IOCTL_NVDLA_SUBMIT        nvdla_submit()           nvdla_task_submit()

NVDLA firmware: dla_submit_operation(), *_reg_read(), *_reg_write()
Engines: DLA_OP_BDMA, DLA_OP_CONV, DLA_OP_SDP, DLA_OP_PDP, DLA_OP_CDP, DLA_OP_RUBIK

Changes for OpenCAPI
IOCTL removed
DRM and GEM dependencies removed
Firmware changed to user mode calls (direct function calls)

Easy Memory Management
All kernel mode DRM/GEM code removed
Use user mode malloc() to manage memory
No IOCTL Calls
IOCTL calls from UMD to KMD changed to user level function calls
Firmware Works in User Mode
No Linux Kernel Dependency
No dependency on DRM/GEM drivers
No dependency on Linux kernel versions
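To make the change concrete, the sketch below models the reworked call path in Python: buffers come from ordinary user-space allocation instead of GEM objects, and task submission is a plain function call into the firmware layer instead of an ioctl() into a kernel driver. All class and method names are hypothetical stand-ins, not the actual functions in the CAPI port of the NVDLA software, and register writes are recorded in a list rather than sent to hardware.

```python
class UserModeFirmware:
    """Stand-in for the NVDLA firmware layer running in user space."""

    def __init__(self):
        self.reg_writes = []    # (address, value) log instead of real MMIO

    def reg_write(self, addr, value):
        self.reg_writes.append((addr, value))

    def submit_operation(self, op_name, buffer):
        # Hypothetical: program a single register with the buffer length.
        self.reg_write(0x0, len(buffer))
        return op_name


class UserModeDriver:
    """UMD that calls the firmware directly: no ioctl(), no DRM/GEM."""

    def __init__(self, firmware):
        self.firmware = firmware
        self.buffers = {}
        self.next_handle = 1

    def alloc(self, size):
        # Plain allocation in place of the GEM create and mmap requests.
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = bytearray(size)
        return handle

    def free(self, handle):
        # Plain release in place of the GEM destroy request.
        del self.buffers[handle]

    def submit(self, op_name, handle):
        # Direct function call in place of the submit ioctl.
        return self.firmware.submit_operation(op_name, self.buffers[handle])


fw = UserModeFirmware()
umd = UserModeDriver(fw)
handle = umd.alloc(4096)
done = umd.submit("DLA_OP_CONV", handle)
umd.free(handle)
```

The point of the structure is that nothing in the path crosses into the kernel, which is what removes the dependency on DRM/GEM and on specific kernel versions.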
9
Functional Validation
Running in IBM Austin Lab
Mihawk (POWER9) + AlphaData AD9H7 FPGA card
Large config (2048 MACs) running at 200 MHz
Functional tests PASSED
AlexNet running with real image inferencing
Results not 100% accurate due to model inaccuracy
[Photos: Mihawk server; AlphaData AD9H7 card]
10
Performance Evaluation and Projection
Hardware             No. MACs  Clock    FPGA   I/O bandwidth  FC batch size  AlexNet perf (frames/second)
Current performance  2048      200 MHz  VU37P  1 GB/s         1              10.417
Projected            2048      250 MHz  VU37P  20 GB/s        16             741

Current performance
AlexNet: 10.42 frames/second
Performance is still being tuned; it is expected to improve once issues in the compiler are resolved.
Projected performance
AlexNet: 741 frames/second
ResNet50: 113 frames/second
Projected performance is calculated with the analytical model from NVDLA:
https://github.com/nvdla/hw/tree/master/perf
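The figures on this slide imply a few simple ratios. The sketch below computes the projected-to-current AlexNet speedup and a peak throughput estimate for 2048 MACs. It assumes one multiply-accumulate counts as two operations per cycle, a common convention that the slide does not state, and it does not reproduce the NVDLA analytical model itself.

```python
MACS = 2048
OPS_PER_MAC = 2    # assumption: multiply and add counted as separate ops


def peak_tops(macs, clock_mhz):
    """Peak throughput in tera-operations per second."""
    return macs * OPS_PER_MAC * clock_mhz * 1e6 / 1e12


current_fps = 10.417     # measured: 200 MHz, FC batch size 1
projected_fps = 741.0    # projected: 250 MHz, FC batch size 16

speedup = projected_fps / current_fps
print(f"peak at 200 MHz: {peak_tops(MACS, 200):.3f} TOPS")
print(f"peak at 250 MHz: {peak_tops(MACS, 250):.3f} TOPS")
print(f"projected / current AlexNet: {speedup:.1f}x")
```

Under that assumption the 250 MHz configuration peaks at roughly 1 TOPS, and the projection is about 71 times the currently measured frame rate.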
11
FPGA Implementation Result
Configuration: nv_large (2048 MACs)
Resource  Utilization  Available  Utilization %
LUT       616202       1303680    47.72
CLB Reg   448178       2607360    17.19
CLB       122266       162960     75.03
BRAM      408          2016       20.24
DSP       251          9024       2.78
IO        1            676        0.15
BUFG      37           1800       2.06
Card type: AlphaData AD9H7
FPGA: Xilinx Virtex UltraScale+ XCVU37P-2E-FSVH2892
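The utilization percentages can be cross-checked from the raw counts. The sketch below recomputes them and picks out the most constrained resource. Recomputing reproduces the listed percentages except for the LUT row, which comes out at about 47.27% rather than the listed 47.72%; the slide gives no way to tell which of the two LUT figures is mistyped.

```python
# (used, available) per resource, copied from the utilization table
resources = {
    "LUT": (616202, 1303680),
    "CLB Reg": (448178, 2607360),
    "CLB": (122266, 162960),
    "BRAM": (408, 2016),
    "DSP": (251, 9024),
    "IO": (1, 676),
    "BUFG": (37, 1800),
}


def utilization_pct(used, available):
    """Utilization as a percentage, rounded to two decimals."""
    return round(100.0 * used / available, 2)


pct = {name: utilization_pct(u, a) for name, (u, a) in resources.items()}
bottleneck = max(pct, key=pct.get)

for name, value in pct.items():
    print(f"{name:8s} {value:6.2f}%")
print("most constrained:", bottleneck)
```

CLB occupancy is the tightest resource for nv_large on this device, which is relevant to the plan for larger configurations.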
12
Summary
Why NVDLA on OpenCAPI
✓ Open hardware collaboration
✓ Inference engines will foreseeably be a hot topic in servers, data centers, and clouds
✓ NVIDIA is serious about open source DLA; the quality of DLA is production level
✓ We don't want to reinvent the wheel
What's next?
✓ Larger configurations (4096 MACs and/or FP16 support)
✓ Parser and compiler adaptation
✓ Performance tuning and real workload adaptation (key to business)
Open Source
✓ Important for cultivating the open hardware ecosystem
13
Pointers to Materials
Modified CAPI/SNAP framework for NVDLA
✓ https://github.com/shgoupf/snap/tree/nvdla
✓ On public GitHub
Modified NVDLA software for CAPI
✓ https://github.com/shgoupf/nvdla-sw/tree/capi
✓ On public GitHub
Modified NVDLA IP, including RTL and Unit Testbench
✓ https://github.ibm.com/shgoupf/nvdla-capi
✓ On IBM enterprise GitHub
14
References
Hot Chips 30
✓ http://www.hotchips.org/
Xilinx xfDNN (CHaiDNN)
✓ https://github.com/Xilinx/CHaiDNN
SNAP
✓ https://github.com/open-power/snap
NVDLA
✓ http://nvdla.org/
Original NVDLA Hardware
✓ http://github.com/nvdla/hw
Original NVDLA Software
✓ http://github.com/nvdla/sw
Original NVDLA Virtual Platform
✓ http://github.com/nvdla/vp
Community Contributed NVDLA Compiler Source
✓ https://github.com/icubecorp/nvdla_compiler
Thanks
and More Details in Following Slides
16
Quick Facts
What is NVDLA
✓ NVIDIA Deep Learning Accelerator
✓ Open source, production-level RTL
✓ Hardware configurable
✓ Accelerates Convolutional Neural Networks
What is OpenCAPI-NVDLA
✓ Brings NVDLA to OpenCAPI on FPGA
✓ Explores the possibility of AI acceleration on CAPI/OpenCAPI
✓ Aligns with POWER's heterogeneous computing strategy
Current Development Status
✓ NVDLA hardware ported to OpenCAPI
✓ NVDLA software (drivers) ported to CAPI
✓ Hardware running 2048 MACs at 200 MHz, with AlexNet
✓ Running on Mihawk + AlphaData AD9H7
Potential Use Case
✓ AI acceleration solution on POWER9
✓ Cloud image recognition service
✓ Face recognition for large-scale video surveillance servers
✓ FPGA-based AI acceleration on cloud
Performance
✓ ~1 TOPS at INT8
✓ Current perf: 10.42 FPS for AlexNet
✓ Projected perf: 813.49 FPS for AlexNet
Other Highlights
✓ Production-level unit verification environment with full regression enabled
✓ Open source compilers
✓ Larger hardware development in progress
17
NVDLA Changes for FPGA Implementation
[Figure: FPGA Implementation Timing Closure. Bar chart of WNS (ps) for NV_SMALL and NV_LARGE after each step, on an axis from -2500 to 500. Data labels as extracted: INIT -2260; +use DSP -2000; +disable clock gating 25; +disable clock gating -1900; +add pipeline in MAC -400; +add pipeline in SDP -130; +set max fanout -14.]
Methods Used
INIT: the initial NVDLA RTL from GitHub
+Use DSP: replace all NVDLA MAC operators with Xilinx DSP IP
+Disable clock gating: disable the clock gating intended for ASIC design
+Add pipelines in MAC: add pipelines in the MAC for FPGA-oriented design; RTL changes verified with unit testbench
+Add pipelines in SDP: add pipelines in the SDP for FPGA-oriented design; RTL changes verified with unit testbench
+Set max fanout: set max fanout for critical registers in NVDLA
18
NVDLA Changes for SNAP Integration
Config Register Definition
Address  Width  Access  Description
0x400    32     RW      [31:9] RO: reserved
                        [8] RW: selects SNAP or NVDLA registers (0 = SNAP, 1 = NVDLA)
                        [7:0] RW: extension of the NVDLA address, used with paddr[9:2]

Indirect Register Accessing
[Figure: Config Register bit [8] selects whether Paddr[9:0] addresses the SNAP Action Registers (0) or the NVDLA Registers (1). The NVDLA register address is {Config_reg[7:0], paddr[9:2]}.]

[Figure: SNAP connects to NVDLA-hw through AXI Lite, AXI4, and interrupt. Control Path Adapter: AXI4-Lite to APB bridge, then APB to CSB bridge, next to the SNAP Action Regs. Data Path Adapter: data bus width converter, 512 bit to 256 bit.]

Control Path Adaption
Add AXI-to-APB bridge and APB-to-CSB bridge
Indirect register accessing (NVDLA reg space larger than SNAP action reg space)
Interrupt enablement
Data Path Adaption
AXI4 data bus width converter from NVDLA (256 bit dbus) to SNAP (512 bit dbus)
Other changes to facilitate AXI4 bus signals
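The indirect register access scheme can be sketched in a few lines. The code below splits a 16-bit NVDLA word address into the value written to the config register at 0x400 (selector bit 8 set, upper address bits in [7:0]) and the paddr whose bits [9:2] carry the lower address bits, following {Config_reg[7:0], paddr[9:2]}. The helper names, the `mmio_write` callback, and the use of paddr directly as the MMIO offset are assumptions for illustration, not the actual driver code.

```python
CONFIG_REG_ADDR = 0x400    # config register address from the table
SELECT_NVDLA = 1 << 8      # bit 8: 0 = SNAP registers, 1 = NVDLA registers


def split_nvdla_addr(word_addr):
    """Split a 16-bit NVDLA word address into (config value, paddr)."""
    if not 0 <= word_addr < (1 << 16):
        raise ValueError("NVDLA word address must fit in 16 bits")
    ext = (word_addr >> 8) & 0xFF      # goes into Config_reg[7:0]
    paddr = (word_addr & 0xFF) << 2    # goes onto paddr[9:2]
    return SELECT_NVDLA | ext, paddr


def join_nvdla_addr(config_value, paddr):
    """Rebuild {Config_reg[7:0], paddr[9:2]} as the bridge would."""
    return ((config_value & 0xFF) << 8) | ((paddr >> 2) & 0xFF)


def nvdla_reg_write(mmio_write, word_addr, value):
    """Write one NVDLA register through the indirect window.

    mmio_write(addr, value) is a caller-supplied function (hypothetical).
    """
    config_value, paddr = split_nvdla_addr(word_addr)
    mmio_write(CONFIG_REG_ADDR, config_value)    # select NVDLA + upper bits
    mmio_write(paddr, value)                     # access through the window


log = []
nvdla_reg_write(lambda addr, val: log.append((addr, val)), 0x1234, 0xDEADBEEF)
```

Each NVDLA access therefore costs two MMIO writes unless consecutive accesses share the same upper address bits, in which case a driver could skip rewriting the config register.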
19
Unit Sim Test Plan and Testbench
Unit-sim
Testbench
Trace Generator: generates traces
Trace Player: drives traces to the DUT, checks that the DUT behaves correctly, and collects coverage
Test Plan
Test Level: Level 0, Level 1, … Level 10, Level 20 …
Associating Tests: a method named add_test associates tests with the test plan
Testcase
Direct (Trace) Tests: pdp_8x8x32_1x1_int8_0 …
Python Tests: nvdla_reg_accessing …
Random (UVM) Tests: cc_in_width_ctest …
Unit Level Simulation Environment
§ All RTL changes protected
§ Added AXI-lite adapters and scoreboards
§ Added checkers to SNAP action registers
§ Simulator changed from VCS to XCELIUM
§ Simulating with Xilinx FPGA IPs
§ Regression running on Jenkins server
§ Production-level verification environment
[Figure: Simulation Environment Components]
20
Business Trends
NAME                  USAGE                     COMPANY    DATE  FEATURE
NVDLA                 Inferencing               Nvidia     2017  Free open source inference engine
Zynq UltraScale+      Training and inferencing  Xilinx     2015  HBM, CCIX, framework, int8
Xilinx DNN Processor  Inferencing               Xilinx     2018  On-server inference solution
ARM ML Processor      Inferencing               ARM        2018  Flexible architecture, scaling from edge to cloud
Brainwave             Training and inferencing  Microsoft  2017  Deployed on Microsoft cloud
Cambricon 1M          Inferencing               Cambricon  2018  Specialized AI ISA
Ascend 910            Training and inferencing  Huawei     2018  Huawei's new architecture for AI

Facebook, Ali (Ali-NPU), and Baidu (XPU) are racing toward hardware acceleration for AI
Aliyun, Tencent Cloud, Baidu Cloud, and Huawei Cloud have started to provide FPGA cloud services
Highlighted Trends
Software + hardware full stack solutions
Internet giants are starting on AI chips
Hardware providers optimize their AI-related libraries and try to get into the software domain
FPGAs have been widely deployed on public clouds
Focus on energy-performance, I/O efficiency, and system optimization

More Related Content

What's hot

Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsGanesan Narayanasamy
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlapinside-BigData.com
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCinside-BigData.com
 
BKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing HadoopBKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing HadoopLinaro
 
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology InnovationsLenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovationsinside-BigData.com
 
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016OpenPOWER Solutions overview session from IBM TechU Rome - April 2016
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016Mandie Quartly
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSinside-BigData.com
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
Host Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment ModelsHost Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment ModelsNetronome
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 

What's hot (20)

Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Nvidia at SEMICon, Munich
Nvidia at SEMICon, MunichNvidia at SEMICon, Munich
Nvidia at SEMICon, Munich
 
dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlap
 
An Update on Arm HPC
An Update on Arm HPCAn Update on Arm HPC
An Update on Arm HPC
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
BKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing HadoopBKK16-400B ODPI - Standardizing Hadoop
BKK16-400B ODPI - Standardizing Hadoop
 
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology InnovationsLenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
 
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016OpenPOWER Solutions overview session from IBM TechU Rome - April 2016
OpenPOWER Solutions overview session from IBM TechU Rome - April 2016
 
Arm in HPC
Arm in HPCArm in HPC
Arm in HPC
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Host Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment ModelsHost Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment Models
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 

Similar to AI Inference Acceleration with Open Hardware Components

PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, TrustedNVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, TrustedJeremy Eder
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Nelson Calero
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
22by7 and DellEMC Tech Day July 20 2017 - Power Edge
22by7 and DellEMC Tech Day July 20 2017 - Power Edge22by7 and DellEMC Tech Day July 20 2017 - Power Edge
22by7 and DellEMC Tech Day July 20 2017 - Power EdgeSashikris
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a boxYutaka Kawai
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyAMD Developer Central
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCGanesan Narayanasamy
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...AMD Developer Central
 
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120Linaro
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...Jim St. Leger
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDKLagopus SDN/OpenFlow switch
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaJim St. Leger
 
PowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAPowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAAlexander Grudanov
 
An Update on the European Processor Initiative
An Update on the European Processor InitiativeAn Update on the European Processor Initiative
An Update on the European Processor Initiativeinside-BigData.com
 

Similar to AI Inference Acceleration with Open Hardware Components (20)

PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, TrustedNVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
 
FPGA MeetUp
FPGA MeetUpFPGA MeetUp
FPGA MeetUp
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
22by7 and DellEMC Tech Day July 20 2017 - Power Edge
22by7 and DellEMC Tech Day July 20 2017 - Power Edge22by7 and DellEMC Tech Day July 20 2017 - Power Edge
22by7 and DellEMC Tech Day July 20 2017 - Power Edge
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRC
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
InAccel FPGA resource manager
InAccel FPGA resource managerInAccel FPGA resource manager
InAccel FPGA resource manager
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
OpenCV acceleration battle:OpenCL on Firefly-RK3288(MALI-T764) vs. FPGA on Ze...
OpenCV acceleration battle:OpenCL on Firefly-RK3288(MALI-T764) vs. FPGA on Ze...OpenCV acceleration battle:OpenCL on Firefly-RK3288(MALI-T764) vs. FPGA on Ze...
OpenCV acceleration battle:OpenCL on Firefly-RK3288(MALI-T764) vs. FPGA on Ze...
 
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120
Linux-wpan: IEEE 802.15.4 and 6LoWPAN in the Linux Kernel - BUD17-120
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
 
Make Accelerator Pluggable for Container Engine
Make Accelerator Pluggable for Container EngineMake Accelerator Pluggable for Container Engine
Make Accelerator Pluggable for Container Engine
 
PowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAPowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDA
 
An Update on the European Processor Initiative
An Update on the European Processor InitiativeAn Update on the European Processor Initiative
An Update on the European Processor Initiative
 

More from Yutaka Kawai

05 high density openpower dual-socket p9 system design example
05 high density openpower dual-socket p9 system design example05 high density openpower dual-socket p9 system design example
05 high density openpower dual-socket p9 system design exampleYutaka Kawai
 
04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbersYutaka Kawai
 
03 desktop on an open powersystem
03 desktop on an open powersystem03 desktop on an open powersystem
03 desktop on an open powersystemYutaka Kawai
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms finalYutaka Kawai
 
0 foundation update__final - Mendy Furmanek
0 foundation update__final - Mendy Furmanek0 foundation update__final - Mendy Furmanek
0 foundation update__final - Mendy FurmanekYutaka Kawai
 
10th meetup20191209b
10th meetup20191209b10th meetup20191209b
10th meetup20191209bYutaka Kawai
 
Light talk kioxia_20191023r2
Light talk kioxia_20191023r2Light talk kioxia_20191023r2
Light talk kioxia_20191023r2Yutaka Kawai
 
Open power ae_jd_20191223_v1
Open power ae_jd_20191223_v1Open power ae_jd_20191223_v1
Open power ae_jd_20191223_v1Yutaka Kawai
 
Open power keynote- openisa
Open power  keynote- openisa Open power  keynote- openisa
Open power keynote- openisa Yutaka Kawai
 
Open power topics20191023
Open power topics20191023Open power topics20191023
Open power topics20191023Yutaka Kawai
 
9th meetup20191023
9th meetup201910239th meetup20191023
9th meetup20191023Yutaka Kawai
 
Ibm open power_meetup_xilinx_lighting_talk_rev1.0
Ibm open power_meetup_xilinx_lighting_talk_rev1.0Ibm open power_meetup_xilinx_lighting_talk_rev1.0
Ibm open power_meetup_xilinx_lighting_talk_rev1.0Yutaka Kawai
 
Nec exp ether071719
Nec exp ether071719Nec exp ether071719
Nec exp ether071719Yutaka Kawai
 
July japan meetup latest
July japan meetup latestJuly japan meetup latest
July japan meetup latestYutaka Kawai
 
8th meetup20190717
8th meetup201907178th meetup20190717
8th meetup20190717Yutaka Kawai
 
2018 capi contest introduction japan-v2b
2018 capi contest introduction japan-v2b2018 capi contest introduction japan-v2b
2018 capi contest introduction japan-v2bYutaka Kawai
 
2019 may 20th japan summit
2019 may 20th japan summit2019 may 20th japan summit
2019 may 20th japan summitYutaka Kawai
 
7th meetup20190415
7th meetup201904157th meetup20190415
7th meetup20190415Yutaka Kawai
 

More from Yutaka Kawai (20)

05 high density openpower dual-socket p9 system design example
05 high density openpower dual-socket p9 system design example05 high density openpower dual-socket p9 system design example
05 high density openpower dual-socket p9 system design example
 
04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers
 
03 desktop on an open powersystem
03 desktop on an open powersystem03 desktop on an open powersystem
03 desktop on an open powersystem
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms final
 
0 foundation update__final - Mendy Furmanek
0 foundation update__final - Mendy Furmanek0 foundation update__final - Mendy Furmanek
0 foundation update__final - Mendy Furmanek
 
10th meetup20191209b
10th meetup20191209b10th meetup20191209b
10th meetup20191209b
 
Light talk kioxia_20191023r2
Light talk kioxia_20191023r2Light talk kioxia_20191023r2
Light talk kioxia_20191023r2
 
Open power ae_jd_20191223_v1
Open power ae_jd_20191223_v1Open power ae_jd_20191223_v1
Open power ae_jd_20191223_v1
 
Open power keynote- openisa
Open power  keynote- openisa Open power  keynote- openisa
Open power keynote- openisa
 
Open power topics20191023
Open power topics20191023Open power topics20191023
Open power topics20191023
 
9th meetup20191023
9th meetup201910239th meetup20191023
9th meetup20191023
 
Ibm open power_meetup_xilinx_lighting_talk_rev1.0
Ibm open power_meetup_xilinx_lighting_talk_rev1.0Ibm open power_meetup_xilinx_lighting_talk_rev1.0
Ibm open power_meetup_xilinx_lighting_talk_rev1.0
 
Ai vision u200
Ai vision u200Ai vision u200
Ai vision u200
 
Nec exp ether071719
Nec exp ether071719Nec exp ether071719
Nec exp ether071719
 
July japan meetup latest
July japan meetup latestJuly japan meetup latest
July japan meetup latest
 
8th meetup20190717
8th meetup201907178th meetup20190717
8th meetup20190717
 
2018 capi contest introduction japan-v2b
2018 capi contest introduction japan-v2b2018 capi contest introduction japan-v2b
2018 capi contest introduction japan-v2b
 
OCP48V Solution
OCP48V SolutionOCP48V Solution
OCP48V Solution
 
2019 may 20th japan summit
2019 may 20th japan summit2019 may 20th japan summit
2019 may 20th japan summit
 
7th meetup20190415
7th meetup201904157th meetup20190415
7th meetup20190415
 


AI Inference Acceleration with Open Hardware Components

  • 1. AI Inference Acceleration with components all in Open Hardware: OpenCAPI and NVDLA Deep Learning Inference Engine for CAPI/OpenCAPI October 27, 2019 IBM 中国系统实验室 IBM China System Lab Peng Fei GOU (shgoupf@cn.ibm.com)
  • 2. Motivation
      Path to hardware acceleration for AI
        - Deep learning inference acceleration is in demand everywhere, from edge to cloud
        - POWER9 needs a hardware acceleration solution for AI
      OpenCAPI-NVDLA: demonstration on the P9 heterogeneous computing platform
        - Aligns with the Open Hardware strategy
        - Fast and simple acceleration deployment on servers with FPGA and OpenCAPI
      NVDLA: open source inference engine from NVIDIA
        - NVIDIA Deep Learning Accelerator
        - High quality: production-level open source RTL
        - Flexibility: configurable architecture to meet different business needs
  • 3. Open Hardware Ecosystem
      NVDLA
        - Open hardware design
        - Part of NVIDIA's Xavier SoC
        - Open source compiler
        - SiFive + NVDLA collaboration
        - Dozens of startups are starting to leverage NVDLA
        - Active community
      OpenPOWER
        - Open ISA
        - Open reference design
        - Encourages more open innovation in hardware
        - Rich ecosystem of partners, from software and system hardware to chips
  • 4. Hardware Backend: AI Acceleration (diagram)
      Top layer: Applications, High-level APIs, Management Tools, etc.
      Frameworks: TensorFlow/Keras/PyTorch/etc.
      World of X86: Intel X86 CPU, NV GPU, PCIe FPGA, Google TPU, PCIe ASIC (NPU)
      World of OpenPOWER: IBM POWER CPU, PCIe ASIC (NPU), NV GPU, OpenCAPI FPGA, PCIe FPGA
  • 5. (Architecture diagram) NVDLA open source inference engine, adapted to FPGA and OpenCAPI
      POWER9 host: Inference Application, User mode Driver, Host memory
      Link: OpenCAPI 25G
      FPGA: OpenCAPI TLx/DLx/CFG, OC-ACCEL (snap) with data bridge and mmio, OC/AXI Bridge mode, AXI Lite (Control Interface), AXI (Memory Interface), interrupt, AXI OnChip RAM
      NVDLA-hw blocks: Conv Buffer, Conv Core, Activation, Pooling, LRN, Reshape, Bridge DMA
      Captions:
        - POWER9 server with CAPI or OpenCAPI: Mihawk, Inspur 5280/5290, etc.
        - FPGA card provided by vendors
        - CAPI user mode drivers
        - Applications: image recognition, etc.
  • 6. Multi Engine Structure (diagram)
      Job Manager with a Job Queue (Job 0 … Job M)
      Local Configuration Bus
      NVDLA 0, NVDLA 1, … NVDLA N, each with an AXI master (M) port
      AXI Interconnect
      AXI/PSL Bridge or AXI/TLX Bridge
  • 7. Software Stack
      Flow: DL training framework -> Model -> Parser / Compiler / Optimizer -> Loadable -> User-mode Driver -> ioctl() -> Kernel-mode Driver -> Reg Writes -> CAPI NVDLA Hardware
      Notes on the diagram:
        - DL training framework: publicly available, Caffe, etc.
        - NVIDIA open sourced in around Sep, 2019
        - Applications: image recognition, etc.
        - NVIDIA open sourced the user/kernel mode drivers; changed to CAPI user mode
        - Hardware with OpenCAPI
        - Parser, Compiler, and Optimizer transform a trained network into NVDLA loadables; they run offline, not necessarily on POWER
        - NVIDIA's Parser, Compiler, and Optimizer are enough to support early-stage evaluation
      Example network: Conv 1 to Conv 5, three MAXPooling layers, FC 1 to FC 3
      Applications/workloads run with the OpenCAPI-NVDLA user-mode driver on POWER9 platforms
      The kernel-mode driver was changed and eliminated to adapt to CAPI mode
  • 8. Driver Changes for OpenCAPI
      Original call chain (diagram):
        - ioctl(): DRM_IOCTL_NVDLA_GEM_CREATE, DRM_IOCTL_NVDLA_GEM_MMAP, DRM_IOCTL_NVDLA_DESTROY, DRM_IOCTL_NVDLA_SUBMIT
        - NVDLA DRM Driver: nvdla_gem_create(), drm_gem_handle_create(), nvdla_gem_map_offset(), drm_gem_create_mmap_offset(), nvdla_gem_destroy(), drm_gem_dumb_destroy(), nvdla_submit(), nvdla_task_submit()
        - NVDLA Firmware: dla_submit_operation(), *_reg_read(), *_reg_write()
        - Engines: DLA_OP_BDMA, DLA_OP_CONV, DLA_OP_SDP, DLA_OP_PDP, DLA_OP_CDP, DLA_OP_RUBIK
        - Hardware
      Changes for the user mode driver: IOCTL removed; DRM and GEM dependencies removed; firmware changed to user mode calls
      Easy Memory Management
        - All kernel mode DRM/GEM code removed
        - User mode malloc() manages memory
      No IOCTL calls
        - IOCTL calls from UMD to KMD changed to user-level function calls
      Firmware Works in User Mode
        - Changed to direct function calls
      No Linux Kernel Dependency
        - No dependency on DRM/GEM drivers
        - No dependency on Linux kernel versions
  • 9. Functional Validation
      Mihawk running in the IBM Austin Lab
        - Mihawk (POWER9) + AlphaData AD9H7 FPGA card
        - Large config (2048 MACs) running @ 200MHz
        - Functional tests PASSED
        - AlexNet running with real image inferencing
        - Results not 100% accurate due to model inaccuracy
  • 10. Performance Evaluation and Projection
      Hardware            | No. MACs | Clock  | FPGA  | I/O bandwidth | FC Batch Size | AlexNet Perf (frames/second)
      Current Performance | 2048     | 200MHz | VU37P | 1 GB/s        | 1             | 10.417
      Projected           | 2048     | 250MHz | VU37P | 20 GB/s       | 16            | 741
      Current performance
        - AlexNet: 10.42 frames/second
        - Performance is still being tuned; better performance is expected once the compiler issues are resolved and the compiler is tuned
      Projected performance
        - AlexNet: 741 frames/second
        - ResNet50: 113 frames/second
        - Calculated with the analytical model from NVDLA: https://github.com/nvdla/hw/tree/master/perf
  • 11. FPGA Implementation Result
      nv_large (2048 MACs)
      Resource | Utilization | Available | Utilization %
      LUT      | 616202      | 1303680   | 47.72
      CLB Reg  | 448178      | 2607360   | 17.19
      CLB      | 122266      | 162960    | 75.03
      BRAM     | 408         | 2016      | 20.24
      DSP      | 251         | 9024      | 2.78
      IO       | 1           | 676       | 0.15
      BUFG     | 37          | 1800      | 2.06
      Card type: AlphaData AD9H7, Xilinx Virtex UltraScale+ XCVU37P-2E-FSVH2892
  • 12. Summary
      Why NVDLA on OpenCAPI
        - Open hardware collaboration
        - Inference engines are set to remain a hot topic in servers, data centers, and clouds
        - NVIDIA is serious about the open source DLA, and its quality is production level
        - We don't want to reinvent the wheel
      What's next?
        - Larger configurations (4096 MACs and/or FP16 support)
        - Parser and compiler adaptation
        - Performance tuning and real workload adaptation (key to business)
      Open Source
        - Important for cultivating the open hardware ecosystem
  • 13. Pointers to Materials
      Modified CAPI/SNAP framework for NVDLA
        - https://github.com/shgoupf/snap/tree/nvdla
        - On public GitHub
      Modified NVDLA software for CAPI
        - https://github.com/shgoupf/nvdla-sw/tree/capi
        - On public GitHub
      Modified NVDLA IP, including RTL and unit testbench
        - https://github.ibm.com/shgoupf/nvdla-capi
        - On IBM enterprise GitHub
  • 14. References
        - Hotchips 30: http://www.hotchips.org/
        - Xilinx xfDNN (CHaiDNN): https://github.com/Xilinx/CHaiDNN
        - SNAP: https://github.com/open-power/snap
        - NVDLA: http://nvdla.org/
        - Original NVDLA Hardware: http://github.com/nvdla/hw
        - Original NVDLA Software: http://github.com/nvdla/sw
        - Original NVDLA Virtual Platform: http://github.com/nvdla/vp
        - Community Contributed NVDLA Compiler Source: https://github.com/icubecorp/nvdla_compiler
  • 15. Thanks and More Details in Following Slides
  • 16. Quick Facts
      What is NVDLA
        - NVIDIA Deep Learning Accelerator
        - Open source, production-level RTL
        - Hardware configurable
        - Accelerates Convolutional Neural Networks
      What is OpenCAPI-NVDLA
        - Brings NVDLA to OpenCAPI on FPGA
        - Explores the possibility of AI acceleration on CAPI/OpenCAPI
        - Aligns with POWER's heterogeneous computing strategy
      Current Development Status
        - NVDLA hardware ported to OpenCAPI
        - NVDLA software (drivers) ported to CAPI
        - Hardware running with 2048 MACs @ 200MHz, with AlexNet
        - Running on Mihawk + AlphaData AD9H7
      Potential Use Cases
        - AI acceleration solution on POWER9
        - Cloud image recognition service
        - Face recognition for large-scale video surveillance servers
        - FPGA-based AI acceleration in the cloud
      Performance
        - ~1 TOPS @ INT8
        - Current perf: 10.42 FPS for AlexNet
        - Projected perf: 813.49 FPS for AlexNet
      Other Highlights
        - Production-level unit verification environment with full regression enabled
        - Open source compilers
        - Larger hardware development in progress
  • 17. NVDLA Changes for FPGA Implementation
      Chart: FPGA Implementation Timing Closure, WNS (ps) per step, series NV_SMALL and NV_LARGE
      Data labels, in the order they appear on the chart:
        - INIT: -2260
        - +use DSP: -2000
        - +disable clock gating: 25
        - +disable clock gating: -1900
        - +add pipeline in MAC: -400
        - +add pipeline in SDP: -130
        - +set max fanout: -14
      Methods Used
        - INIT: the initial NVDLA RTL from GitHub
        - +Use DSP: replace all NVDLA MAC operators with Xilinx DSP IP
        - +Disable clock gating: disable the clock gating intended for the ASIC design
        - +Add pipelines in MAC: add pipelines in MAC for an FPGA-oriented design; RTL changes verified with the unit testbench
        - +Add pipelines in SDP: add pipelines in SDP for an FPGA-oriented design; RTL changes verified with the unit testbench
        - +Set max fanout: set max fanout for critical registers in NVDLA
  • 18. NVDLA Changes for SNAP Integration
      Config Register Definition
      Address | Width | WR | Description
      0x400   | 32    | RW | [31:9] RO: Reserved
              |       |    | [8] RW: Selector of SNAP register and NVDLA register, 0 is SNAP, 1 is NVDLA
              |       |    | [7:0] RW: Extension of NVDLA address, use with paddr[9:2]
      Indirect Register Accessing (diagram)
        - Config Register [8] = 0: paddr[9:0] addresses the SNAP Action Registers
        - Config Register [8] = 1: {Config_reg[7:0], paddr[9:2]} addresses the NVDLA Registers
      Block diagram: SNAP (AXI Lite, AXI4, interrupt, SNAP Action Regs) connects to NVDLA-hw through two adapters
        - Control Path Adapter: AXI4-lite to APB bridge, APB to CSB bridge
        - Data Path Adapter: data bus width converter, 512bit to 256bit
      Control Path Adaption
        - Add AXI-to-APB bridge and APB-to-CSB bridge
        - Indirect register accessing (NVDLA reg space is larger than SNAP action reg space)
        - Interrupt enablement
      Data Path Adaption
        - AXI4 data bus width converter from NVDLA (256 bit dbus) to SNAP (512 bit dbus)
        - Other changes to facilitate AXI4 bus signals
  • 19. Unit Sim Test Plan and Testbench
      Simulation Environment Components
        Unit-sim Testbench
          - Trace Generator: generates traces
          - Trace Player: drives traces to the DUT, checks correct DUT behavior, and collects coverage
        Test Plan
          - Test Level: Level 0, Level 1, … Level 10, Level 20, …
          - Associating Tests: a method named add_test associates tests with the test plan
        Testcase
          - Direct (Trace) Tests: pdp_8x8x32_1x1_int8_0…
          - Python Tests: nvdla_reg_accessing…
          - Random (UVM) Tests: cc_in_width_ctest…
      Unit Level Simulation Environment
        - All RTL changes protected
        - Added AXI-lite adapters and scoreboards
        - Added checkers to SNAP action registers
        - Simulator changed from VCS to XCELIUM
        - Simulating with Xilinx FPGA IPs
        - Regression running on Jenkins server
        - Production-level verification environment
  • 20. Business Trends
      NAME                 | USAGE                    | COMPANY   | DATE | FEATURE
      NVDLA                | Inferencing              | Nvidia    | 2017 | Free open source inference engine
      Zynq UltraScale+     | Training and Inferencing | Xilinx    | 2015 | HBM, CCIX, framework, int8
      Xilinx DNN Processor | Inferencing              | Xilinx    | 2018 | On-server inference solution
      ARM ML Processor     | Inferencing              | ARM       | 2018 | Flexible architecture, scaling from edge to cloud
      Brainwave            | Training and Inferencing | Microsoft | 2017 | Deployed on Microsoft cloud
      Cambricon 1M         | Inferencing              | Cambricon | 2018 | Specialized AI ISA
      Ascend 910           | Training and Inferencing | Huawei    | 2018 | Huawei's new architecture for AI
      Facebook, Ali (Ali-NPU), and Baidu (XPU) are racing toward hardware acceleration for AI
      Aliyun, Tencent Cloud, Baidu Cloud, and Huawei Cloud have started to provide FPGA cloud services
      Highlighted Trends
        - Software + hardware full stack solutions
        - Internet giants are starting on AI chips
        - Hardware providers optimize their AI-related libraries and try to move into the software domain
        - FPGAs have been widely deployed on public clouds
        - Focus on energy-performance, IO efficiency, and system optimization
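
The indirect register access on slide 18 can be illustrated with a small address-splitting sketch. It assumes the driver starts from a 4-byte-aligned NVDLA byte address and that paddr[1:0] are always zero; the function names and the constant names are hypothetical and do not come from the SNAP or NVDLA sources. Only the bit layout ([8] selector, [7:0] extension, paddr[9:2]) is taken from the slide.

```python
CONFIG_REG_ADDR = 0x400      # config register address from slide 18
SELECT_NVDLA = 1 << 8        # bit [8] = 1 selects the NVDLA register space


def split_nvdla_address(byte_addr: int) -> tuple[int, int]:
    """Return (config register value, paddr) for an NVDLA byte address.

    Assumption: the 16-bit word index {Config_reg[7:0], paddr[9:2]} is the
    byte address divided by 4.
    """
    if byte_addr % 4 != 0:
        raise ValueError("address must be 4-byte aligned")
    word = byte_addr >> 2
    if word >= 1 << 16:
        raise ValueError("address exceeds the 16-bit word index")
    config_value = SELECT_NVDLA | (word >> 8)   # upper 8 bits go to [7:0]
    paddr = (word & 0xFF) << 2                  # lower 8 bits go to paddr[9:2]
    return config_value, paddr


def join_nvdla_address(config_value: int, paddr: int) -> int:
    """Inverse of split_nvdla_address: rebuild the NVDLA byte address."""
    word = ((config_value & 0xFF) << 8) | ((paddr >> 2) & 0xFF)
    return word << 2


config_value, paddr = split_nvdla_address(0x5008)
print(hex(config_value), hex(paddr))
```

A driver following this scheme would first write `config_value` to `CONFIG_REG_ADDR` and then read or write `paddr` through the normal action register window.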
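
The "~1 TOPS @ INT8" figure on slide 16 follows from the MAC count and clock on slide 10 by simple arithmetic. The sketch below assumes the common convention that one MAC counts as two operations (a multiply and an add); the slides do not state this, and the function name is hypothetical.

```python
def peak_tops(num_macs: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in tera-operations per second.

    ops_per_mac=2 is an assumed convention (multiply + accumulate).
    """
    return num_macs * clock_hz * ops_per_mac / 1e12


current = peak_tops(2048, 200e6)     # configuration measured on slide 10
projected = peak_tops(2048, 250e6)   # configuration projected on slide 10
print(f"current: {current:.4f} TOPS, projected: {projected:.4f} TOPS")
```

Under that assumption the 250MHz configuration lands just above 1 TOPS, and the 200MHz configuration a little below it. This is a peak number only; the measured 10.42 frames/second also depends on I/O bandwidth, batch size, and compiler quality.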
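
The multi-engine structure on slide 6 shows a job manager feeding a job queue to several NVDLA engines. The slides do not describe the scheduling policy, so the following is only a sketch under an assumed policy: jobs leave the queue in FIFO order and each goes to whichever engine becomes free first. The function name and the cycle counts are hypothetical.

```python
import heapq
from collections import deque


def dispatch(job_cycles: list[int], num_engines: int) -> list[tuple[int, int, int]]:
    """Assign FIFO jobs to engines; return (job id, engine id, start cycle)."""
    queue = deque(enumerate(job_cycles))
    # Min-heap of (cycle at which the engine is free, engine id).
    free_at = [(0, engine) for engine in range(num_engines)]
    heapq.heapify(free_at)
    schedule = []
    while queue:
        job_id, cycles = queue.popleft()
        start, engine = heapq.heappop(free_at)   # earliest-free engine
        schedule.append((job_id, engine, start))
        heapq.heappush(free_at, (start + cycles, engine))
    return schedule


for job_id, engine, start in dispatch([5, 3, 2, 4], num_engines=2):
    print(f"job {job_id} -> NVDLA {engine} at cycle {start}")
```

With two engines, the third job starts on engine 1 as soon as the shorter second job finishes, which is the benefit a shared queue has over statically binding jobs to engines.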