
AI Inference Acceleration with Components All in Open Hardware: OpenCAPI and NVDLA


This deck was presented by Peng Fei GOU (IBM China) at the OpenPOWER Summit EU 2019.


  1. AI Inference Acceleration with Components All in Open Hardware: OpenCAPI and NVDLA
     Deep Learning Inference Engine for CAPI/OpenCAPI
     October 27, 2019
     Peng Fei GOU, IBM China System Lab
  2. Motivation
     Path to hardware acceleration for AI
     • Deep learning inference acceleration is hot everywhere, from edge to cloud
     • POWER9 needs a hardware acceleration solution for AI
     OpenCAPI-NVDLA: a demonstration on the POWER9 heterogeneous computing platform
     • Aligns with the Open Hardware strategy
     • Fast and simple acceleration deployment on servers with FPGA and OpenCAPI
     NVDLA: open-source inference engine from NVIDIA
     • NVIDIA Deep Learning Accelerator
     • High quality: production-level open-source RTL
     • Flexibility: configurable architecture to fit different business needs
  3. Open Hardware Ecosystem
     NVDLA
     • Open hardware design
     • Part of NVIDIA's Xavier SoC
     • Open-source compiler
     • SiFive + NVDLA collaboration
     • Dozens of startups are starting to leverage NVDLA
     • Active community
     OpenPOWER
     • Open ISA
     • Open reference designs
     • Encourages more open innovation in hardware
     • Rich ecosystem and partners, from software and system hardware to chips
  4. Hardware Backends for AI Acceleration
     Both worlds run the same stack on top: TensorFlow/Keras/PyTorch, high-level APIs, management tools, etc.
     • World of x86: Intel x86 CPU, NVIDIA GPU, PCIe FPGA, Google TPU, PCIe ASIC (NPU)
     • World of OpenPOWER: IBM POWER CPU, NVIDIA GPU, PCIe FPGA, OpenCAPI FPGA, PCIe ASIC (NPU)
  5. OpenCAPI-NVDLA Architecture
     • NVDLA open-source inference engine (conv buffer, conv core, activation, pooling, LRN, reshape, bridge DMA, on-chip RAM, memory and control interfaces), adapted to FPGA and OpenCAPI
     • OC-ACCEL bridge between OpenCAPI 25G (TLx/DLx) and NVDLA's AXI and AXI-Lite interfaces, with MMIO configuration, a snap data bridge and interrupt support
     • Host: POWER9 server with CAPI or OpenCAPI (Mihawk, Inspur 5280/5290, etc.); the inference application and CAPI user-mode driver access host memory directly
     • Applications: image recognition, etc.
     • FPGA cards provided by vendors
  6. Multi-Engine Structure
     • A Job Manager with a job queue (Job 0 … Job M) feeds multiple NVDLA instances (NVDLA 0, NVDLA 1, … NVDLA N)
     • The engines sit on an AXI interconnect behind an AXI/PSL or AXI/TLX bridge
     • A local configuration bus configures each engine
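The slide does not say how the Job Manager schedules jobs across engines; a minimal C sketch of one plausible policy (round-robin dispatch, with hypothetical names, since the real Job Manager is hardware) is:

```c
#define NUM_ENGINES 4   /* e.g. four NVDLA instances behind the interconnect */

/* Hypothetical job descriptor; the real queue holds hardware job entries. */
typedef struct { int id; } job_t;

static int next_engine = 0;

/* Pick the engine for the next job in the queue (round-robin).
 * Returns the index of the NVDLA instance that should run the job. */
int dispatch(const job_t *job) {
    (void)job;  /* a real scheduler might inspect the job's size or priority */
    int engine = next_engine;
    next_engine = (next_engine + 1) % NUM_ENGINES;
    return engine;
}
```

This is only an illustration of the fan-out structure; the deck leaves the actual scheduling policy unspecified.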
  7. Software Stack
     • DL training framework (publicly available: Caffe, etc.) produces a trained model, e.g. an AlexNet-style network (Conv 1-5, FC 1-3, max pooling)
     • Parser, compiler and optimizer (open-sourced by NVIDIA around September 2019) transform the trained network into NVDLA loadables; this runs offline, not necessarily on POWER
     • NVIDIA's parser, compiler and optimizer are enough to support early-stage evaluation
     • The user-mode driver consumes loadables and drives the hardware with register writes over CAPI/OpenCAPI
     • The kernel-mode driver was changed and eliminated to adapt to CAPI mode; the original path went through ioctl()
     • Applications (image recognition, etc.) run with the OpenCAPI-NVDLA user-mode driver on POWER9 platforms
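The user-mode flow above can be sketched in C. This is an illustration, not the ported driver's actual API: `mmio_write` is a stub standing in for the CAPI user-mode MMIO accessor, and `submit_layer` replays the register writes that a compiled loadable specifies for one layer, as a plain function call rather than an ioctl():

```c
#include <stdint.h>
#include <stddef.h>

/* Stub for the CAPI user-mode MMIO accessor (assumption: the driver
 * maps the NVDLA register window into user space). */
static uint32_t last_reg, last_val;
static void mmio_write(uint32_t reg, uint32_t val) {
    last_reg = reg;
    last_val = val;
}

/* One register write taken from a compiled loadable. */
typedef struct {
    uint32_t reg;   /* NVDLA register offset */
    uint32_t val;   /* configuration value   */
} reg_write_t;

/* Replay the register writes that configure one network layer.
 * With the kernel-mode driver eliminated, this is a direct call
 * chain in user space, not an ioctl() into a DRM driver. */
static void submit_layer(const reg_write_t *writes, size_t n) {
    for (size_t i = 0; i < n; i++)
        mmio_write(writes[i].reg, writes[i].val);
}
```

The key design point the deck makes is that the loadable format lets the compiler run offline (on any machine), while only this thin replay step must run on the POWER9 host.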
  8. Driver Changes for OpenCAPI
     Original path: the application issues ioctl() calls (DRM_IOCTL_NVDLA_GEM_CREATE, DRM_IOCTL_NVDLA_GEM_MMAP, DRM_IOCTL_NVDLA_DESTROY, DRM_IOCTL_NVDLA_SUBMIT) into the NVDLA DRM driver (nvdla_gem_create() / drm_gem_handle_create(), nvdla_gem_map_offset() / drm_gem_create_mmap_offset(), nvdla_gem_destroy() / drm_gem_dumb_destroy(), nvdla_submit() / nvdla_task_submit()), which invokes the NVDLA firmware (dla_submit_operation(), *_reg_read(), *_reg_write()) to drive the hardware engines (DLA_OP_BDMA, DLA_OP_CONV, DLA_OP_SDP, DLA_OP_PDP, DLA_OP_CDP, DLA_OP_RUBIK).
     Changes made:
     • User-mode driver: IOCTLs removed; DRM and GEM dependencies removed; firmware changed to user-mode calls
     • Easy memory management: all kernel-mode DRM/GEM code removed; memory managed with user-mode malloc()
     • No IOCTL calls: IOCTLs from UMD to KMD replaced with user-level function calls
     • Firmware works in user mode: changed to direct function calls
     • No Linux kernel dependency: no dependency on DRM/GEM drivers or on Linux kernel versions
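A minimal sketch of what "use user-mode malloc() to manage memories" means in practice. The names and struct are illustrative, not the ported driver's real API; the point is that CAPI's coherent shared memory lets an ordinary heap allocation replace the DRM/GEM buffer-object machinery behind DRM_IOCTL_NVDLA_GEM_CREATE:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for a GEM buffer object: under CAPI the
 * accelerator shares the process address space coherently, so a
 * plain malloc()'d buffer is visible to both CPU and NVDLA. */
typedef struct {
    void  *cpu_addr;
    size_t size;
} nvdla_mem_t;

/* Replaces the GEM_CREATE ioctl path; returns 0 on success. */
int nvdla_mem_alloc(nvdla_mem_t *m, size_t size) {
    m->cpu_addr = malloc(size);
    if (m->cpu_addr == NULL)
        return -1;
    m->size = size;
    memset(m->cpu_addr, 0, size);  /* GEM buffers start zeroed */
    return 0;
}

/* Replaces the GEM destroy ioctl path. */
void nvdla_mem_free(nvdla_mem_t *m) {
    free(m->cpu_addr);
    m->cpu_addr = NULL;
    m->size = 0;
}
```

No mmap-offset bookkeeping or handle tables are needed, which is why the slide can claim the whole kernel-mode DRM/GEM layer was removable.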
  9. Functional Validation
     • Mihawk (POWER9) + AlphaData AD9H7 FPGA card, running in the IBM Austin lab
     • Large configuration (2048 MACs) running at 200 MHz
     • Functional tests PASSED
     • AlexNet running with real-image inferencing
     • Results not 100% accurate, due to model inaccuracy
  10. Performance Evaluation and Projection
      Scenario   | MACs | Clock   | FPGA  | I/O bandwidth | FC batch size | AlexNet perf (frames/s)
      Current    | 2048 | 200 MHz | VU37P | 1 GB/s        | 1             | 10.417
      Projected  | 2048 | 250 MHz | VU37P | 20 GB/s       | 16            | 741
      • Current performance: AlexNet at 10.42 frames/s; tuning is in progress, and better performance is expected once issues in the compiler are resolved
      • Projected performance: AlexNet 741 frames/s, ResNet-50 113 frames/s
      • Projections calculated with the analytical model from NVDLA
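The gap between the measured and projected numbers is consistent with the configuration being I/O-bound at 1 GB/s rather than compute-bound. The peak-compute arithmetic (MACs × 2 ops per MAC × clock) can be checked with a short C helper; this reproduces the "~1 TOPS @ INT8" figure quoted later in the deck for the 250 MHz target:

```c
/* Peak INT8 throughput in TOPS: each MAC performs a multiply and an
 * add (2 ops) per cycle.  clock_ghz is the engine clock in GHz. */
double peak_tops(int macs, double clock_ghz) {
    return macs * 2.0 * clock_ghz / 1000.0;  /* GOPS -> TOPS */
}
/* peak_tops(2048, 0.200) = 0.8192 TOPS  (current, 200 MHz)   */
/* peak_tops(2048, 0.250) = 1.024  TOPS  (projected, 250 MHz) */
```

This is the roofline-style upper bound only; the realized AlexNet frame rate also depends on I/O bandwidth and FC batch size, which is exactly what the projection changes.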
  11. FPGA Implementation Result
      Card: AlphaData AD9H7, Xilinx Virtex UltraScale+ XCVU37P-2E-FSVH2892
      Resource utilization for nv_large (2048 MACs):
      Resource | Utilization | Available  | Utilization %
      LUT      | 616,202     | 1,303,680  | 47.72
      CLB Reg  | 448,178     | 2,607,360  | 17.19
      CLB      | 122,266     | 162,960    | 75.03
      BRAM     | 408         | 2,016      | 20.24
      DSP      | 251         | 9,024      | 2.78
      IO       | 1           | 676        | 0.15
      BUFG     | 37          | 1,800      | 2.06
  12. Summary
      Why NVDLA on OpenCAPI
      • Open hardware collaboration
      • Inference engines are a foreseeable hot topic in servers, data centers and clouds
      • NVIDIA is serious about open-sourcing the DLA, and its quality is production level
      • We don't want to reinvent the wheel
      What's next?
      • Larger configurations (4096 MACs and/or FP16 support)
      • Parser and compiler adaptation
      • Performance tuning and real-workload adaptation (key to business)
      Open source
      • Important to cultivate the open hardware ecosystem
  13. Pointers to Materials
      • Modified CAPI/SNAP framework for NVDLA (on public GitHub)
      • Modified NVDLA software for CAPI (on public GitHub)
      • Modified NVDLA IP, including RTL and unit testbench (on IBM enterprise GitHub)
  14. References
      • Hot Chips 30
      • Xilinx xfDNN (CHaiDNN)
      • SNAP
      • NVDLA
      • Original NVDLA hardware
      • Original NVDLA software
      • Original NVDLA virtual platform
      • Community-contributed NVDLA compiler source
  15. Thanks, and More Details in the Following Slides
  16. Quick Facts
      What is NVDLA
      • NVIDIA Deep Learning Accelerator
      • Open source, production-level RTL
      • Hardware configurable
      • Accelerates convolutional neural networks
      What is OpenCAPI-NVDLA
      • Brings NVDLA to OpenCAPI on FPGA
      • Explores the possibility of AI acceleration on CAPI/OpenCAPI
      • Aligns with POWER's heterogeneous computing strategy
      Current development status
      • NVDLA hardware ported to OpenCAPI
      • NVDLA software (drivers) ported to CAPI
      • Hardware running AlexNet with 2048 MACs at 200 MHz
      • Running on Mihawk + AlphaData AD9H7
      Potential use cases
      • AI acceleration solution on POWER9
      • Cloud image recognition services
      • Face recognition for large-scale video surveillance servers
      • FPGA-based AI acceleration in the cloud
      Performance
      • ~1 TOPS @ INT8
      • Current: 10.42 FPS for AlexNet
      • Projected: 813.49 FPS for AlexNet
      Other highlights
      • Production-level unit verification environment with full regression enabled
      • Open-source compilers
      • Larger hardware configurations in development
  17. NVDLA Changes for FPGA Implementation
      [Chart: FPGA implementation timing closure, WNS (ps) for NV_SMALL and NV_LARGE after each step: INIT -2260; +use DSP -2000; +disable clock gating 25 / -1900; +add pipeline in MAC -400; +add pipeline in SDP -130; +set max fanout -14]
      Methods used:
      • INIT: the initial NVDLA RTL from GitHub
      • Use DSP: replace all NVDLA MAC operators with Xilinx DSP IP
      • Disable clock gating: disable the clock gating intended for the ASIC design
      • Add pipelines in MAC: pipeline the MAC for an FPGA-oriented design; RTL changes verified with the unit testbench
      • Add pipelines in SDP: pipeline the SDP for an FPGA-oriented design; RTL changes verified with the unit testbench
      • Set max fanout: set max fanout constraints on critical registers in NVDLA
  18. NVDLA Changes for SNAP Integration
      Control-path adaptation
      • Added an AXI4-Lite-to-APB bridge and an APB-to-CSB bridge
      • Indirect register accessing (the NVDLA register space is larger than the SNAP action register space)
      • Interrupt enablement
      Data-path adaptation
      • AXI4 data-bus width converter between NVDLA (256-bit) and SNAP (512-bit)
      • Other changes to facilitate AXI4 bus signals
      Config register definition (indirect register accessing):
      • Address 0x400, 32-bit
      • [31:9] RO: reserved
      • [8] RW: selects SNAP (0) or NVDLA (1) register space
      • [7:0] RW: extension of the NVDLA address, used together with paddr[9:2]
      • NVDLA register address = {config_reg[7:0], paddr[9:2]}
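The indirect addressing scheme above can be expressed as two small C helpers (a sketch of the address math only; the function names are illustrative): the full NVDLA register address is formed by concatenating the 8-bit extension in the config register at 0x400 with bits [9:2] of the SNAP action-register address, and bit [8] of the config register selects which register space the access hits.

```c
#include <stdint.h>

/* NVDLA register word address = {config_reg[7:0], paddr[9:2]},
 * as defined by the config register at 0x400. */
uint32_t nvdla_reg_addr(uint32_t config_reg, uint32_t paddr) {
    uint32_t ext = config_reg & 0xFFu;     /* config_reg[7:0] */
    uint32_t low = (paddr >> 2) & 0xFFu;   /* paddr[9:2]      */
    return (ext << 8) | low;
}

/* Bit [8] of the config register: 0 selects the SNAP action
 * registers, 1 selects the NVDLA register space. */
int is_nvdla_access(uint32_t config_reg) {
    return (int)((config_reg >> 8) & 1u);
}
```

The extension bits are what allow the small SNAP action-register window to reach the much larger NVDLA register space: software first writes the extension (and selector) into 0x400, then performs the access at the low address.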
  19. Unit-Sim Test Plan and Testbench
      Unit-sim testbench
      • Trace generator: generates traces
      • Trace player: drives traces to the DUT, checks DUT behavior and collects coverage
      Test plan
      • Test levels: Level 0, Level 1, … Level 10, Level 20, …
      • Associating tests: a method named add_test associates tests with the test plan
      Test cases
      • Direct (trace) tests: pdp_8x8x32_1x1_int8_0, …
      • Python tests: nvdla_reg_accessing, …
      • Random (UVM) tests: cc_in_width_ctest, …
      Unit-level simulation environment
      • All RTL changes protected
      • Added AXI-Lite adapters and scoreboards
      • Added checkers for the SNAP action registers
      • Simulator changed from VCS to Xcelium
      • Simulating with Xilinx FPGA IPs
      • Regression running on a Jenkins server
      • Production-level verification environment
  20. Business Trends
      Name                  | Usage                    | Company   | Date | Feature
      NVDLA                 | Inferencing              | NVIDIA    | 2017 | Free open-source inference engine
      Zynq UltraScale+      | Training and inferencing | Xilinx    | 2015 | HBM, CCIX, frameworks, INT8
      Xilinx DNN Processor  | Inferencing              | Xilinx    | 2018 | On-server inference solution
      ARM ML Processor      | Inferencing              | ARM       | 2018 | Flexible architecture, scaling from edge to cloud
      Brainwave             | Training and inferencing | Microsoft | 2017 | Deployed on Microsoft's cloud
      Cambricon 1M          | Inferencing              | Cambricon | 2018 | Specialized AI ISA
      Ascend 910            | Training and inferencing | Huawei    | 2018 | Huawei's new architecture for AI
      Facebook, Alibaba (Ali-NPU) and Baidu (XPU) are racing toward hardware acceleration for AI. Aliyun, Tencent Cloud, Baidu Cloud and Huawei Cloud have started to provide FPGA cloud services.
      Highlighted trends
      • Software + hardware full-stack solutions
      • Internet giants starting on AI chips
      • Hardware providers optimizing their AI libraries and pushing into software's domain
      • FPGAs widely deployed on public clouds
      • Focus on energy efficiency, I/O efficiency and system optimization