A Lightweight
DNN Inference Processor
design, system, tools, and applications
羅賢君 Shien-Chun Luo Oct. 2018
工業技術研究院 Industrial Technology Research Institute (ITRI)
資訊與通訊研究所 Information and Communication Research Lab (ICL)
Roofline Model
- Key to Designing a DNN Inference Engine
1. More parallel PEs with high utilization
▪ Efficient parallel PE structure, interconnect
▪ Proper memory hierarchy
2. Increase data supply
▪ High bandwidth data access
▪ Reduce data movement or compress data
3. Improve energy efficiency
▪ Adapt resources to the model
▪ Low-power design techniques
[Figure: roofline plot - Performance (operations) vs. Operational Intensity (operations/byte); designs on the sloped part of the roof are memory-bandwidth bound, designs on the flat part are computation bound; the numbered arrows 1-3 correspond to the three design directions above. A minimal sketch of this compute-vs-bandwidth check follows.]
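A minimal sketch of the roofline check. The peak-throughput and DRAM-bandwidth defaults and the example layers below are illustrative assumptions, not measurements of this design:

```python
def roofline(ops, dram_bytes, peak_gops=50.0, dram_gbps=1.0):
    """Classify a layer as compute- or bandwidth-bound (assumed accelerator numbers).

    ops        : total operations of the layer (one MAC counts as 2 ops)
    dram_bytes : bytes moved to/from DRAM for this layer
    peak_gops  : accelerator peak throughput in GOP/s (assumption)
    dram_gbps  : sustained DRAM bandwidth in GB/s (assumption)
    """
    intensity = ops / dram_bytes                         # operations per byte
    ridge = peak_gops / dram_gbps                        # ridge point of the roofline
    attainable = min(peak_gops, intensity * dram_gbps)   # GOP/s predicted by the roofline
    bound = "computation-bound" if intensity >= ridge else "bandwidth-bound"
    return intensity, attainable, bound

# A CONV-heavy layer: 463 MOP while moving only 1.5 MB -> computation-bound
print(roofline(463e6, 1.5e6))
# An FC layer: 37 MOP while streaming 25 MB of weights -> bandwidth-bound
print(roofline(37e6, 25e6))
```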
Segment & Position
ARM’s Project Trillium
• Performance of > 4.6 TOP/s
• Efficiency of > 3 TOPs/W (7nm process)
• On-chip SRAM size up to 1MB
Our target DNN acceleration solution
• Performance of 50 GOP/s ~ 200 GOP/s
• Efficiency of about 1 TOPs/W (65nm process)
• On-chip SRAM size ≤ 256KB
Figure source: ARM Project Trillium
We Started from the NVIDIA Open-Source
Deep Learning Accelerator (DLA)
What ITRI has done
1. A bug-fixed HW version, fully compatible with NVDLA (can use NVIDIA's tools)
2. A model translation tool – compiles a DNN model into DLA configuration files
3. An adaptive quantization flow – converts FP weights to HW-specific 8-bit precision
4. End-to-end verification – we show object detection (YOLO) in this presentation
HW Overview
Features
1. Variable HW resource
2. Suited for 3D convolution
3. Buffer data reuse
4. Hetero-layer fusion
5. Ping-pong CFG registers
1. Variable HW resources: PE count, buffer size
• Search for an efficient resource configuration per model
• Adapts performance & power consumption
2. Suited for 3D convolution (see the loop-nest sketch after this list)
• Relaxed data dependency; the input feature cube is shared
• Output-pixel-first ordering: inputs are shared and partial-sum storage is avoided
• Supports any kernel size (n x m) with the same data flow
• Close to 100% PE utilization
3. Buffer data reuse
• Reuses inputs or weights in the next layer
• Benefits large-layer partitioning and batching
4. Hetero-layer fusion
• Fuses the popular layer stack [ CONV – BN – PReLU – Pooling ]
• Greatly reduces DRAM access
5. Ping-pong CFG registers
• Configures layers N and N+1 simultaneously
• Hides configuration time during layer changes
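A minimal loop-nest sketch of the output-pixel-first ordering named in feature 2, written as sequential Python for clarity. The real HW is a parallel MAC array; the exact tiling and traversal here are illustrative assumptions, not the RTL dataflow. The point is that each output pixel is fully accumulated over the kernel window and all input channels before it is written, so no partial sums are ever stored.

```python
import numpy as np

def conv3d_output_first(ifm, kernels, stride=1):
    """Output-pixel-first 3D convolution sketch (no padding).

    ifm     : input feature map, shape (H, W, C)
    kernels : weights, shape (K, n, m, C) -> K output channels, n x m window
    Returns an output cube of shape (H_out, W_out, K).
    """
    H, W, C = ifm.shape
    K, n, m, _ = kernels.shape
    H_out = (H - n) // stride + 1
    W_out = (W - m) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for y in range(H_out):
        for x in range(W_out):
            # All K kernels share the same input window ("share IN"), and each
            # output pixel is finished before moving on, so no partial sums
            # need to be spilled to the buffer or DRAM.
            window = ifm[y*stride:y*stride+n, x*stride:x*stride+m, :]
            for k in range(K):
                out[y, x, k] = np.sum(window * kernels[k])
    return out

# Any kernel size (n x m) uses the same data flow:
ifm = np.random.rand(8, 8, 3)
print(conv3d_output_first(ifm, np.random.rand(4, 3, 3, 3)).shape)  # (6, 6, 4)
print(conv3d_output_first(ifm, np.random.rand(4, 1, 1, 3)).shape)  # (8, 8, 4)
```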
DLA Features - Overview
[Figure: 3D CONV example - input feature cubes (width × height × channels) convolved with kernels to produce an output cube; stride 1, no padding; channel-first vs. plane-first traversal]
DLA Features - Why Configurable Resources Are Important
AlexNet (~0.73 GOP, 61M weights)
• Huge fully connected weights
• DRAM speed dominates
• Computation power cannot help
GoogleNet (~3.2 GOP, 7M weights)
• Small filter size (1x1)
• Benefits from parallelism in CNN operations
• Computation power dominates
• DRAM speed cannot help
ResNet50 (~7.8 GOP, 25M weights)
• Large CNN operations, large weights
• Residual connections directly add two data cubes → DRAM speed dominates
• Computation power and DRAM speed are equally important
[Figure: performance gradient of the three models] (a back-of-envelope operational-intensity estimate follows below)
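To connect these three profiles back to the roofline model, a rough operational-intensity estimate already separates them. This is a back-of-envelope sketch that uses only the numbers quoted above and assumes 8-bit weights, ignoring feature-map traffic:

```python
# Rough operational intensity: total OPs / weight bytes
# (8-bit weights assumed, feature-map traffic ignored).
models = {
    "AlexNet":   (0.73e9, 61e6),   # ~0.73 GOP, 61M weights
    "GoogleNet": (3.2e9,  7e6),    # ~3.2 GOP,   7M weights
    "ResNet50":  (7.8e9, 25e6),    # ~7.8 GOP,  25M weights
}
for name, (ops, weight_bytes) in models.items():
    print(f"{name:10s}: {ops / weight_bytes:6.0f} OPs per weight byte")
# AlexNet   ~ 12 OPs/byte -> DRAM-bandwidth dominated (FC weights)
# GoogleNet ~457 OPs/byte -> computation dominated
# ResNet50  ~312 OPs/byte -> both resources matter
```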
Original NVDLA Framework, DEV Flow
Input: Caffe Prototxt + Caffe Model (weights)
Compiler (binary version, offline):
• Parser → Compiler (optimization), guided by the HW SPEC and layer IDs
• Intermediate: Wisdom DIR (layer details)
• Outputs: Loadable file (HW CONFIGs, layers' CONFIGs) + Formatted Weights
API and Driver (online):
• User Mode Driver (UMD): allocates addresses; function call for layer-by-layer inference
• Kernel Mode Driver (KMD): translates each layer into HW binary CFGs; handles IRQs
Hardware (online):
• Flow Controller (MCU or CPU): loads the HW binary CONFIGs; handles IRQs
• DLA HW
ITRI DLA-Lite Simplified Flow - Overview
HW Architecture: a host system (ARM-based, x86, …) programs the accelerator over a GPIF interface; the accelerator subsystem contains an MCU, DMA, DRAM, optional NVM, and the DNN Accelerator core.
DEV Tools: the DNN model passes through translate / format tools that perform HW resource allocation, produce quantized re-trained weights, and give a performance estimation.
1. Find an efficient setup of HW resources (a toy sketch of this search follows below)
2. Set up the system address allocation
3. Generate "translated" inference commands
4. Generate "formatted" model parameters
→ Inference command package (to be compiled for the MCU)
→ Inference weight package
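A toy sketch of what step 1 (finding an efficient HW setup) might look like. The cycle model, layer numbers, and configuration list are all invented for illustration; this is not ITRI's actual tool:

```python
def layer_cycles(ops, dram_bytes, macs, dram_bytes_per_cycle):
    """Crude per-layer cycle model: max of compute time and DRAM transfer time."""
    compute = ops / (2 * macs)                  # each MAC delivers 2 OPs per cycle
    memory = dram_bytes / dram_bytes_per_cycle
    return max(compute, memory)

def pick_config(layers, fps_target, clock_hz, configs):
    """Return the smallest HW config that meets the FPS target, or None.

    configs: list of (macs, sram_kb, dram_bytes_per_cycle), sorted small to large.
    In the real tool the SRAM size also changes the DRAM traffic (buffer reuse);
    here it is only carried along for reporting.
    """
    cycle_budget = clock_hz / fps_target
    for macs, sram_kb, dram_bpc in configs:
        total = sum(layer_cycles(ops, dbytes, macs, dram_bpc) for ops, dbytes in layers)
        if total <= cycle_budget:
            return macs, sram_kb
    return None

# Invented example layers: (total OPs, DRAM bytes moved per frame)
layers = [(193e6, 0.7e6), (472e6, 0.9e6), (467e6, 0.8e6), (37e6, 27e6)]
configs = [(32, 64, 2), (64, 128, 4), (128, 256, 4)]   # (MACs, SRAM KB, DRAM bytes/cycle)
print(pick_config(layers, fps_target=30, clock_hz=400e6, configs=configs))
```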
ITRI DLA-Lite Simplified Flow – DEV tools
Command path: Caffe Model + Prototxt → Model Parser → Layer Fusion → Layer Partition → Check Layer Sequence / Check HW Buffer Size → DLA CFG Commands (via the DLA CFG translator and memory allocator) → MCU compiler → MCU Instructions
Weight path: DNN model parameters → HW-aware Quantize Insertion (TF) → Accuracy Retrain (TF) → Parameter Partition → Weight Format Writer → Formatted Quantized Weights (a minimal sketch of the quantization step follows at the end of this slide)
Two binary packages:
1. compiled MCU instructions (similar to the input.txn file in the NVDLA v1 testbench)
2. formatted weights
• Before inference, initialize the two packages into memory; then load images and activate the MCU and DLA
• API example: "YOLO" or "RESNET-50" as a single function call, with no breakdown into sub-tasks
• Easy for predefined DNNs; can be updated by vendors in the future
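The "HW-aware quantize insertion" step can be illustrated with a small sketch. The actual ITRI flow inserts fake-quantization nodes in TensorFlow and retrains for accuracy; the snippet below only shows the basic idea of mapping FP32 weights to 8-bit values with a per-tensor scale. The symmetric scheme is an assumption for illustration, not necessarily the exact format the HW uses:

```python
import numpy as np

def quantize_symmetric_int8(w_fp32):
    """Map FP32 weights to int8 with one per-tensor scale (assumed scheme)."""
    scale = np.max(np.abs(w_fp32)) / 127.0              # full range maps to [-127, 127]
    w_q = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q, scale):
    """Recover an FP32 approximation, e.g. for accuracy checks before retraining."""
    return w_q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 16).astype(np.float32)    # a CONV weight tensor
w_q, s = quantize_symmetric_int8(w)
err = np.abs(dequantize(w_q, s) - w).max()
print(f"scale={s:.6f}, max abs quantization error={err:.6f}")
```

In the real flow the quantized weights are then re-packed by the weight-format writer into the layout the DLA expects.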
Popular NN Computer Vision Tasks
“You Only Look Once“ (YOLO)
Object detection (OD) application is verified and demonstrated
Figure source: Arthur Ouaknine's Medium blog
Object Detection Inference (1/2)
-- Layer Fusion
Tiny YOLO v1 (39 DNN layers) maps to a HW inference queue of 9 hybrid layers:
• Layers 1–5 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 1
• Layers 6–10 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 2
• Layers 11–15 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 3
• Layers 16–20 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 4
• Layers 21–25 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 5
• Layers 26–30 [CONV–BN–Scale–ReLU–Pool] → Hybrid Layer 6
• Layers 31–34 [CONV–BN–Scale–ReLU] → Hybrid Layer 7
• Layers 35–38 [CONV–BN–Scale–ReLU] → Hybrid Layer 8
• Layer 39 [FC] → FC9
→ A hybrid layer supports the [CONV–BN–Scale–PReLU–Pool] 5-layer combination
• Originally (8-bit data), the minimal feature-map DRAM access = 27.7MB
• With [CONV–BN–Scale–PReLU–Pool] fusion, total feature-map DRAM access = 6.2MB
→ Why reducing DRAM access matters (weights = 27MB):
• Originally, @30 FPS, DRAM BW = 1.64 GB/s
• After fusion, @30 FPS, DRAM BW = 996 MB/s (see the worked calculation below)
→ HW: 64 cores, 128KB SRAM
* The detection layer is done by the CPU
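As a quick check, the two bandwidth figures follow directly from the per-frame feature-map and weight traffic quoted above:

```python
# DRAM bandwidth at 30 FPS = (feature-map MB + weight MB per frame) * 30
weights_mb = 27.0
for label, fmap_mb in [("without fusion", 27.7), ("with fusion", 6.2)]:
    bw_mb_s = (fmap_mb + weights_mb) * 30
    print(f"{label:14s}: {bw_mb_s:7.1f} MB/s")
# without fusion: 1641.0 MB/s  (~1.64 GB/s)
# with fusion   :  996.0 MB/s
```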
Object Detection Inference (2/2)
-- RTL Results

Conv. layer | Input Data Dimension | RTL Cycles | OPs   | OPs/cycle | UTIL
Hybrid1     | 448x448x3            | 5.80M      | 193M  | 33        | 26.0%
Hybrid2     | 224x224x16           | 4.25M      | 472M  | 111       | 86.8%
Hybrid3     | 112x112x32           | 3.94M      | 467M  | 119       | 92.7%
Hybrid4     | 56x56x64             | 3.82M      | 465M  | 122       | 95.1%
Hybrid5     | 28x28x128            | 3.71M      | 464M  | 125       | 97.6%
Hybrid6     | 14x14x256            | 3.69M      | 463M  | 126       | 98.1%
Hybrid7     | 7x7x512              | 3.66M      | 463M  | 126       | 98.7%
Hybrid8     | 7x7x1024             | 3.52M      | 231M  | 66        | 51.3%
FC9         | 12540                | 14.19M     | 37M   | 2.6       | 2.0%
Summary     |                      | 46.57M     | 3.25G | 70        |
Note: MAC (CONV+FC) total OPs = 3.18G
Total weights = 27M
→ Uses 64 cores, 128KB SRAM
→ Peak performance = 128 OPs/cycle
→ Result analysis (see the arithmetic check below)
• Utilization of 86%~98% in the CNN layers
• DRAM BW and SRAM size affect hybrid layers 1 and 8
• FC is heavily DRAM-BW dominated
→ Some layers have more detailed partitions (generated by the DEV tool)
[Figure: RTL verification environment - Caffe-format weights pass through a format translator and weight generator; a config file and hex stimulus drive the DLA RTL and a DRAM model through VPI]
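The UTIL column is simply the measured OPs/cycle divided by the 128 OPs/cycle peak (64 MACs × 2 OPs per MAC); a quick check against three rows of the table:

```python
peak_ops_per_cycle = 64 * 2          # 64 MACs, each = 1 multiply + 1 add per cycle
for layer, ops, cycles in [("Hybrid2", 472e6, 4.25e6),
                           ("Hybrid8", 231e6, 3.52e6),
                           ("FC9",      37e6, 14.19e6)]:
    opc = ops / cycles
    print(f"{layer}: {opc:6.1f} OPs/cycle -> {opc / peak_ops_per_cycle:5.1%} utilization")
# Hybrid2: ~111 OPs/cycle -> ~86.8%
# Hybrid8:  ~66 OPs/cycle -> ~51.3%
# FC9    : ~2.6 OPs/cycle -> ~2.0%
```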
DLA Product Prototypes (1/2)
• FPGA-based standalone product
• The CFG file is packed into a C function and compiled for the ARM
• Runs a predefined DNN inference
• DNN CFGs & models can be updated by vendors
Example 1 --- as a standalone ID Camera
[Block diagram: an ARM CPU (processing system) and the DLA on the FPGA share an AXI bus to the DRAM controller; USB and HDMI attach to the CPU; DRAM holds the OS memory space, DLA input data, model weights, and activations]
DLA Product Prototypes (2/2)

Example 2 --- as a Plug and Play Stick
• Similar to the Movidius / Gyrfalcon sticks
• Executes whole-model inference, rather than only the convolution function
• USB accelerating stick + SDK
• Helps legacy equipment add DNN acceleration
[Photos: USB-attached DLA on FPGA; USB DLA in ASIC on a development board]

Example 3 --- as a SoC IP
• DNN accelerator IP: conventional IP business + DEV tool chains
• Host application call sequence: Video Capture( ) → DNN_CALL( ) → Data Fusion( ) → Decision( )
[Block diagram: the DLA with its private memory and an MCU sit on an AXI bus alongside the main CPU; USB, HDMI, APB peripherals, a DMA, and the DRAM controller complete the SoC]
USB acceleration system & ASIC Design
[Block diagram: the host SDK + API talks over USB to a GPIF bridge (parallel bus); on chip, the GPIF data controller, a RISC-V core with cache, the DLA (64 MAC), the DRAM controller/interface, and APB peripherals share an AXI bus; external DRAM attaches to the DRAM controller]
DLA-Lite System SPEC
• 400MHz core, 100MHz board
• 64 CONV MACs, 128KB CONV SRAM
• 50 GOPs peak CNN performance (quick check below)
• Target power consumption: 50mW
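The 50 GOPs peak figure follows from the MAC count and core clock, counting each MAC as two operations per cycle:

```python
macs = 64
clock_hz = 400e6                         # 400MHz core clock
peak_gops = macs * 2 * clock_hz / 1e9    # each MAC = 1 multiply + 1 add per cycle
print(f"{peak_gops:.1f} GOPs peak")      # 51.2 GOPs, quoted as ~50 GOPs
```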
ASIC Preliminary Info (floorplan view)
• TSMC 65nm
• Die size: 3,200 x 3,200 μm²
• Core: 2,500 x 2,500 μm²
[Floorplan: two 64KB CONV buffers, two 32-MAC arrays, a BN/PReLU/Pool processor, accumulator (ACC), CONV DMA, CONV sequencer, data IO control, RISC-V core, PLL, and the AXI DMA interface]
THANK YOU!
QUESTIONS AND COMMENTS?
technical contact : scluo@itri.org.tw , yhchu@itri.org.tw
business contact : victor.wang@itri.org.tw