SlideShare a Scribd company logo
1 of 31
Download to read offline
High-Throughput Convolutional Neural Network
on an FPGA by Customized JPEG Compression
Hiroki Nakahara
Tokyo Institute of Technology, JP
Zhiqiang Que Wayne Luk
Imperial College London, UK
Outline
• Background
• JPEG compression for a high-speed inference
• CNN model for an FPGA implementation
• Channel shift and point-wise decomposition
• Quantization strategy
• Channel shuffle
• Fully-pipelined CNN architecture
• Experimental results
• Conclusion
2
Convolutional Neural Networks (CNNs)
• High accuracy and many applications
• Image recognitions, NLPs, data mining [1]
• FPGAs on cloud services
• Amazon AWS, Microsoft Azure, etc.
3
[1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng,
“UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery
and data mining (KDD), 2019, pp.3132–3142.
Problems
• Power consumption
• Performance bottleneck (Data-transfer)
• e.g., AWS F1 provides overall read/write at 6.5GB/s
from host CPU to FPGA [1]
4
Host
PC
Interconnect
PCIe
CNN
Kernel
.jpg
RAW(RGB) Img.
Accelerator Card
[1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for
Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture
w/ light-weight CNN
5
Host
PC
Interconnect
FPGA
PCIe
CNN
Kernel
Interconnect
FPGA
PCIe
CNN
Kernel
Decoder
.jpg
.jpg
Host
PC
RAW(RGB) Img.
(a) Conventional (b) Proposed
Low-quality Img.
Dog?
Cat?
6Image Source: https://www.kaggle.com/c/dogs-vs-cats/data
7
Labradoodle?
Fried Chicken?
Source: https://bit.ly/2zveHGT
8
Labradoodle?
Fried Chicken?
Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture
w/ light-weight CNN
9
Host
PC
Interconnect
FPGA
PCIe
CNN
Kernel
Interconnect
FPGA
PCIe
CNN
Kernel
Decoder
.jpg
.jpg
Host
PC
RAW(RGB) Img.
(a) Conventional (b) Proposed
Low-quality Img.
Customized JPEG
for a High-speed Inference
10
JPEG Coding
11
Pre-
processing
DCT Quant.
Huffman
Encoding
Quant.
Table
Huffman
Coding
Table
Post-
processing
IDCT
Reverse
Quant.
Huffman
Decoding
Encoding
Decoding
CompressedImageData
RGB
Image
Picture
Matrix
DCT
Matrix
JPEG
Header
Proposed JPEG Coding
with a CNN Accelerator
12
Quant.
Huffman
Encoding
Fully
Pipelining
CNN
IDCT
Reverse
Quant.
Huffman
Decoding
Host PC
ImageStreamData
JPEG
Image
.jpg
Extreme
Quant. Value q
Quant.
Table
RAM
PCIe
Huffman
Decoding
& Reverse
Quant.
RAMRAM
Detection
Result
FPGA
Ping-pong
Buffer
Huffman
Coding Table
Huffman
Coding Table
Huffman Decoding
and Reverse Quantization Unit
13
0
1
2
3
4
2
2
2
3
4
Shift Register
Shift
Value
Quantized
Value
Quant.
Value q
Run-length
Decoder
00**
01**
10**
110*
1110
Image Data Stream
...
...
Priority
Encoder
Buffer RAM
Zig-zag writing
ADR
WDATA
Zig-zag
pattern ROM
• Decompose the 2D-IDCT with 16 1D-DCTs
14
2D-IDCT
AP-922
Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine
Transform Using the Streaming SIMD Extensions and MMX Instructions,”
https://www.cs.cmu.edu/ barbic/cs-740/ap922.pdf
2D-IDCT Unit
15
..
Controller
Operation
Units
Reg. 1D-IDCT Unit
RAM RAM RAM
• Two 1D-IDCT units
• Use half precision (16 bits)
CNN model for an FPGA
Implementation
16
Overview
1. Decomposing k×k convolution by
channel shift [1] and point-wise (1×1) convolution
2. Binary (1-bit) weight quantization [2]
3. Channel split and shuffle [3]
17
[3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices,” CVPR, 2018.
[1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholamine- jad, J. Gonzalez,
and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,”
CVPR, 2018, pp. 9127-9135.
[2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural
Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
Building Blocks
18
#channel x2
(a) Plain block
(b) Down-sampling block
Channel
Split
Shift PWConv Shift PWConv
Concat
&
Shuffle
Channel
Split
Shift
PWConv
(s=2)
Shift PWConv
Concat
&
Shuffle
Shift
PWConv
(s=2)
#channel/2
Our CNN
Model
19
Layer Output
size
Kernel size Stride #Output
channel
Image 224 3
PWConv 224 1 2 24
Norm 224 1 1 24
Shift 224 1 1 24
Pool 112 3 1 24
PWConv 112 2 2 24
Norm 112 1 1 24
ReLU 112 1 1 24
Shift 112 3 1 24
Pool 56 2 2 24
Stage 2 28 116
(4 repeats)
Stage 3 14 232
(8 repeats)
Stage 4 7 464
(16 repeats)
GAP 1 7 1 464
PWConv 1 1 1 1000
• Training-aware
quantization
• w: binary, a: 8-bit
• 2.54 M params,
0.616 GMACs
Fully-pipelined
CNN Architecture
20
Dataflow for a Residual Stage
of a Plain Block
• Double buffers for branch-flow
• Xilinx #pragma HLS dataflow
21
Layer
Unit
F.map Buffer
...
...
...
...
Layer
Unit
...
...
Shuffle
...
2D Convolutional Unit
22
...
...
AdderTree
BN Act
W.mem
...
...
...
...
c
n
p
c
n×p
Convolution Unit
Pooling Units
23
x00 x01 x02 x03 x04
x10 x11 x12 x13 x14
x20 x21 x22 x23 x24
x30 x31 x32 x33 x34
x40 x41 x42 x43 x44
x11 x10 x04 x03 x02 x01 x00
Write
Ctrl.
Logic
F. Map Mem. (n=5, k=2)
Shift Register
Max
Selector
+F. Map Mem.
Register
Reset
Write
Ctrl.
Logic
1
𝑛!
Controller
Max. Pooling
Unit
Global Ave.
Pooling
Unit
Experimental Results
24
Compression Ratio vs. Accuracy
25
162.2
124.6
82.1
53.5
34.9
11.5
59.61
66.64
70.8 71.1 71.2 71.2
50
55
60
65
70
75
0.0
50.0
100.0
150.0
200.0
q=1 q=2 q=3 q=4 q=5 Standard
speed-up acc
ImageNetTop-1Accuracy[%]
DataTransferSpeed-UpRatio
(Baseline:RGBImageTransfer)
JPEG Quantization Bit
• ImageNet2012 (224x224 pixel image) classification task
• PyTorch 1.4.0 + modified libjpeg library
Only decreases 0.3 point of accuracy and
achieves 82.1 times speed-up
Implementation Results
Module #LUTs #FFs #DSPs 18Kb BRAMs #URAMs
JPEG Decoder 11,675 6,646 34 2 0
Huffman Decoder 6,794 2,378 0 0 0
2D-IDCT 4,881 4,278 34 2 0
Pipelined-CNN 263,120 266,784 2,336 2,744 0
Total 274,795 273,440 2,370 2,746 16
(Ratio) (23.2%) (11.5%) (34.6%) (63.5%) (1.6%)
26
• Xilinx Inc. Virtex UltraScale+ FPGA
VCU1525 acceleration development kit
• Xilinx Inc. SDAccel 2018.2
• Operates 300MHz@75Watt
• System performance: 3321.25 FPS
• JPEG trans-decode: 81,120 FPS (c.f. conv. RGB transfer: 1242.8 FPS)
• JPEG decoder part of the LUT was only 4.2% of total system resource
Comparison with
Other FPGA Implementations
27
Method AlexNet1 FINN-R2 Synetgy3 MobNetV24 CouldDNN5 Ours
FPGA Stratix V Zynq
ZU3EG
Zynq
ZU3EG
Zynq ZU9EG Virtex US+
XCVU9P
Virtex US+
XCVU9P
FPS 864.7 200.0 96.5 809.8 123.1 3321.2
Top-1 Acc. 42.90% 50.30% 68.30% 68.1% --- 70.8%
Top-5 Acc. 66.80% --- 88.12% --- --- 90.1%
Precision
(W/Act)
16/16 1/2 4/4 8/8 16/16 1/8
Freq.(MHz) 150 220 250 333 214 300
Power (W) 26.2 10.2 5.5 --- 49.25 75.0
1 S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “Fp-bnn: Binarized neural network on FPGA,” Neurocomputing, 275:10721086, 2018.
2 M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast
exploration of quantized neural networks,” 2018.
3 Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy:
Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019.
4 D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th
International Conference on Field Programmable Logic and Ap- plications (FPL), 2019, pp.136-143.
5 Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019,
pp.73–82.
Comparison with CPU and GPU
Platform CPU GPU FPGA
Device Xeon E5-2690 Tesla V100 Virtex US+ XCVU9P
Clock Freq. 2.6 GHz 1.53 GHz 0.3 GHz
Memory 32GB DDR4 16GB HBM2 9.49 MB BRAM
Throughput (FPS) 24.0 350.0 3321.25
Power (W) 95 295 75
Efficiency (FPS/W) 0.25 1.18 44.28
28
• Ubuntu 18.04 LTS with PyTorch 1.4.0
• 128 Batch with INT8 quantization (for CPU and GPU)
Note: CPU and GPU did not use our JPEG compression scheme
Conclusion
29
Conclusion
• Customized JPEG compression for a high-speed inference
• 82.1x speed-up, 0.3-point accuracy drop
• CNN model for a fully-pipelined implementation
• Channel shift and point-wise decomposition
• Binary weight quantization
• Channel split-shuffle operation
• Fully-pipelined CNN architecture
• Achieved 3,321 FPS@75W
• Speed-up: 138.4x CPU, 9.5x GPU
• Energy efficiency: 177.1x CPU, 37.5x GPU
• Future works
• Custom compression & Other DL applications
30
Thank you
Hiroki Nakahara (Tokyo Tech, JP)
nakahara@ict.e.titech.ac.jp
31

More Related Content

What's hot

画像認識の初歩、SIFT,SURF特徴量
画像認識の初歩、SIFT,SURF特徴量画像認識の初歩、SIFT,SURF特徴量
画像認識の初歩、SIFT,SURF特徴量
takaya imai
 
レベルを上げて物理で殴れ、Fuzzing入門 #pyfes
レベルを上げて物理で殴れ、Fuzzing入門 #pyfesレベルを上げて物理で殴れ、Fuzzing入門 #pyfes
レベルを上げて物理で殴れ、Fuzzing入門 #pyfes
Tokoroten Nakayama
 

What's hot (20)

Fortranが拓く世界、VSCodeが架ける橋
Fortranが拓く世界、VSCodeが架ける橋Fortranが拓く世界、VSCodeが架ける橋
Fortranが拓く世界、VSCodeが架ける橋
 
第7回WBAシンポジウム:予測符号化モデルとしての 深層予測学習とロボット知能化
第7回WBAシンポジウム:予測符号化モデルとしての 深層予測学習とロボット知能化第7回WBAシンポジウム:予測符号化モデルとしての 深層予測学習とロボット知能化
第7回WBAシンポジウム:予測符号化モデルとしての 深層予測学習とロボット知能化
 
SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attentio...
SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attentio...SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attentio...
SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attentio...
 
物体検出の歴史(R-CNNからSSD・YOLOまで)
物体検出の歴史(R-CNNからSSD・YOLOまで)物体検出の歴史(R-CNNからSSD・YOLOまで)
物体検出の歴史(R-CNNからSSD・YOLOまで)
 
日曜数学会_ガロア体上の符号とQRコード_Kuma
日曜数学会_ガロア体上の符号とQRコード_Kuma日曜数学会_ガロア体上の符号とQRコード_Kuma
日曜数学会_ガロア体上の符号とQRコード_Kuma
 
Rによるデータサイエンス:12章「時系列」
Rによるデータサイエンス:12章「時系列」Rによるデータサイエンス:12章「時系列」
Rによるデータサイエンス:12章「時系列」
 
画像認識の初歩、SIFT,SURF特徴量
画像認識の初歩、SIFT,SURF特徴量画像認識の初歩、SIFT,SURF特徴量
画像認識の初歩、SIFT,SURF特徴量
 
モデル高速化百選
モデル高速化百選モデル高速化百選
モデル高速化百選
 
Python入門 : 4日間コース社内トレーニング
Python入門 : 4日間コース社内トレーニングPython入門 : 4日間コース社内トレーニング
Python入門 : 4日間コース社内トレーニング
 
Introduction to YOLO detection model
Introduction to YOLO detection modelIntroduction to YOLO detection model
Introduction to YOLO detection model
 
EEG analysis (nonlinear)
EEG analysis (nonlinear)EEG analysis (nonlinear)
EEG analysis (nonlinear)
 
Rの高速化
Rの高速化Rの高速化
Rの高速化
 
チュートリアル:細胞画像を使った初めてのディープラーニング
チュートリアル:細胞画像を使った初めてのディープラーニングチュートリアル:細胞画像を使った初めてのディープラーニング
チュートリアル:細胞画像を使った初めてのディープラーニング
 
分散学習のあれこれ~データパラレルからモデルパラレルまで~
分散学習のあれこれ~データパラレルからモデルパラレルまで~分散学習のあれこれ~データパラレルからモデルパラレルまで~
分散学習のあれこれ~データパラレルからモデルパラレルまで~
 
点群SegmentationのためのTransformerサーベイ
点群SegmentationのためのTransformerサーベイ点群SegmentationのためのTransformerサーベイ
点群SegmentationのためのTransformerサーベイ
 
Azure kinect DKハンズオン
Azure kinect DKハンズオンAzure kinect DKハンズオン
Azure kinect DKハンズオン
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
 
レベルを上げて物理で殴れ、Fuzzing入門 #pyfes
レベルを上げて物理で殴れ、Fuzzing入門 #pyfesレベルを上げて物理で殴れ、Fuzzing入門 #pyfes
レベルを上げて物理で殴れ、Fuzzing入門 #pyfes
 
ros_whillとROS2対応(ROS勉強会第28回LT大会)
ros_whillとROS2対応(ROS勉強会第28回LT大会)ros_whillとROS2対応(ROS勉強会第28回LT大会)
ros_whillとROS2対応(ROS勉強会第28回LT大会)
 
完全自動運転実現のための信頼度付き自己位置推定の提案
完全自動運転実現のための信頼度付き自己位置推定の提案完全自動運転実現のための信頼度付き自己位置推定の提案
完全自動運転実現のための信頼度付き自己位置推定の提案
 

Similar to FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
Mohsen Jafarzadeh
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
butest
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
LEGATO project
 
Application Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-ChipApplication Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-Chip
zhao fu
 

Similar to FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression (20)

Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_final
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
 
EIS_REVIEW_1.pptx
EIS_REVIEW_1.pptxEIS_REVIEW_1.pptx
EIS_REVIEW_1.pptx
 
Lifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileLifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobile
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
 
Online opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningOnline opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learning
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter Overview
 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
 
Application Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-ChipApplication Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-Chip
 
An35225228
An35225228An35225228
An35225228
 
B.tech_project_ppt.pptx
B.tech_project_ppt.pptxB.tech_project_ppt.pptx
B.tech_project_ppt.pptx
 

More from Hiroki Nakahara

More from Hiroki Nakahara (20)

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROS
 
SBRA2018講演資料
SBRA2018講演資料SBRA2018講演資料
SBRA2018講演資料
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライド
 
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
 
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural NetworkISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
 
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
 
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGA
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)
 
Naist2015 dec ver1
Naist2015 dec ver1Naist2015 dec ver1
Naist2015 dec ver1
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGA
 
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
 

Recently uploaded

Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 

FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

  • 1. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression Hiroki Nakahara Tokyo Institute of Technology, JP Zhiqiang Que Wayne Luk Imperial College London, UK
  • 2. Outline • Background • JPEG compression for a high-speed inference • CNN model for an FPGA implementation • Channel shift and point-wise decomposition • Quantization strategy • Channel shuffle • Fully-pipelined CNN architecture • Experimental results • Conclusion 2
  • 3. Convolutional Neural Networks (CNNs) • High accuracy and many applications • Image recognitions, NLPs, data mining [1] • FPGAs on cloud services • Amazon AWS, Microsoft Azure, etc. 3 [1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng, “UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery and data mining (KDD), 2019, pp.3132–3142.
  • 4. Problems • Power consumption • Performance bottleneck (Data-transfer) • e.g., AWS F1 provides overall read/write at 6.5GB/s from host CPU to FPGA [1] 4 Host PC Interconnect PCIe CNN Kernel .jpg RAW(RGB) Img. Accelerator Card [1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 5. Our Contributions • Customized JPEG for a high-speed data transfer → Compression ratio (Speed-up) vs. accuracy • Fully pipelined inference architecture w/ light-weight CNN 5 Host PC Interconnect FPGA PCIe CNN Kernel Interconnect FPGA PCIe CNN Kernel Decoder .jpg .jpg Host PC RAW(RGB) Img. (a) Conventional (b) Proposed Low-quality Img.
  • 9. Our Contributions • Customized JPEG for a high-speed data transfer → Compression ratio (Speed-up) vs. accuracy • Fully pipelined inference architecture w/ light-weight CNN 9 Host PC Interconnect FPGA PCIe CNN Kernel Interconnect FPGA PCIe CNN Kernel Decoder .jpg .jpg Host PC RAW(RGB) Img. (a) Conventional (b) Proposed Low-quality Img.
  • 10. Customized JPEG for a High-speed Inference 10
  • 12. Proposed JPEG Coding with a CNN Accelerator 12 Quant. Huffman Encoding Fully Pipelining CNN IDCT Reverse Quant. Huffman Decoding Host PC ImageStreamData JPEG Image .jpg Extreme Quant. Value q Quant. Table RAM PCIe Huffman Decoding & Reverse Quant. RAMRAM Detection Result FPGA Ping-pong Buffer Huffman Coding Table Huffman Coding Table
  • 13. Huffman Decoding and Reverse Quantization Unit 13 0 1 2 3 4 2 2 2 3 4 Shift Register Shift Value Quantized Value Quant. Value q Run-length Decoder 00** 01** 10** 110* 1110 Image Data Stream ... ... Priority Encoder Buffer RAM Zig-zag writing ADR WDATA Zig-zag pattern ROM
  • 14. • Decompose the 2D-IDCT with 16 1D-DCTs 14 2D-IDCT AP-922 Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,” https://www.cs.cmu.edu/ barbic/cs-740/ap922.pdf
  • 15. 2D-IDCT Unit 15 .. Controller Operation Units Reg. 1D-IDCT Unit RAM RAM RAM • Two 1D-IDCT units • Use half precision (16 bits)
  • 16. CNN model for an FPGA Implementation 16
  • 17. Overview 1. Decomposing k×k convolution by channel shift [1] and point-wise (1×1) convolution 2. Binary (1-bit) weight quantization [2] 3. Channel split and shuffle [3] 17 [3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2018. [1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholamine- jad, J. Gonzalez, and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,” CVPR, 2018, pp. 9127-9135. [2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
  • 18. Building Blocks 18 #channel x2 (a) Plain block (b) Down-sampling block Channel Split Shift PWConv Shift PWConv Concat & Shuffle Channel Split Shift PWConv (s=2) Shift PWConv Concat & Shuffle Shift PWConv (s=2) #channel/2
  • 19. Our CNN Model 19 Layer Output size Kernel size Stride #Output channel Image 224 3 PWConv 224 1 2 24 Norm 224 1 1 24 Shift 224 1 1 24 Pool 112 3 1 24 PWConv 112 2 2 24 Norm 112 1 1 24 ReLU 112 1 1 24 Shift 112 3 1 24 Pool 56 2 2 24 Stage 2 28 116 (4 repeats) Stage 3 14 232 (8 repeats) Stage 4 7 464 (16 repeats) GAP 1 7 1 464 PWConv 1 1 1 1000 • Training-aware quantization • w: binary, a: 8-bit • 2.54 M params, 0.616 GMACs
  • 21. Dataflow for a Residual Stage of a Plain Block • Double buffers for branch-flow • Xilinx #pragma HLS dataflow 21 Layer Unit F.map Buffer ... ... ... ... Layer Unit ... ... Shuffle ...
  • 22. 2D Convolutional Unit 22 ... ... AdderTree BN Act W.mem ... ... ... ... c n p c n×p Convolution Unit
  • 23. Pooling Units 23 x00 x01 x02 x03 x04 x10 x11 x12 x13 x14 x20 x21 x22 x23 x24 x30 x31 x32 x33 x34 x40 x41 x42 x43 x44 x11 x10 x04 x03 x02 x01 x00 Write Ctrl. Logic F. Map Mem. (n=5, k=2) Shift Register Max Selector +F. Map Mem. Register Reset Write Ctrl. Logic 1 𝑛! Controller Max. Pooling Unit Global Ave. Pooling Unit
  • 25. Compression Ratio vs. Accuracy 25 162.2 124.6 82.1 53.5 34.9 11.5 59.61 66.64 70.8 71.1 71.2 71.2 50 55 60 65 70 75 0.0 50.0 100.0 150.0 200.0 q=1 q=2 q=3 q=4 q=5 Standard speed-up acc ImageNetTop-1Accuracy[%] DataTransferSpeed-UpRatio (Baseline:RGBImageTransfer) JPEG Quantization Bit • ImageNet2012 (224x224 pixel image) classification task • PyTorch 1.4.0 + modified libjpeg library Only decreases 0.3 point of accuracy and achieves 82.1 times speed-up
  • 26. Implementation Results Module #LUTs #FFs #DSPs 18Kb BRAMs #URAMs JPEG Decoder 11,675 6,646 34 2 0 Huffman Decoder 6,794 2,378 0 0 0 2D-IDCT 4,881 4,278 34 2 0 Pipelined-CNN 263,120 266,784 2,336 2,744 0 Total 274,795 273,440 2,370 2,746 16 (Ratio) (23.2%) (11.5%) (34.6%) (63.5%) (1.6%) 26 • Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit • Xilinx Inc. SDAccel 2018.2 • Operates 300MHz@75Watt • System performance: 3321.25 FPS • JPEG trans-decode: 81,120 FPS (c.f. conv. RGB transfer: 1242.8 FPS) • JPEG decoder part of the LUT was only 4.2% of total system resource
  • 27. Comparison with Other FPGA Implementations 27 Method AlexNet1 FINN-R2 Synetgy3 MobNetV24 CouldDNN5 Ours FPGA Stratix V Zynq ZU3EG Zynq ZU3EG Zynq ZU9EG Virtex US+ XCVU9P Virtex US+ XCVU9P FPS 864.7 200.0 96.5 809.8 123.1 3321.2 Top-1 Acc. 42.90% 50.30% 68.30% 68.1% --- 70.8% Top-5 Acc. 66.80% --- 88.12% --- --- 90.1% Precision (W/Act) 16/16 1/2 4/4 8/8 16/16 1/8 Freq.(MHz) 150 220 250 333 214 300 Power (W) 26.2 10.2 5.5 --- 49.25 75.0 1 S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “Fp-bnn: Binarized neural network on FPGA,” Neurocomputing, 275:10721086, 2018. 2 M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018. 3 Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019. 4 D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Ap- plications (FPL), 2019, pp.136-143. 5 Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 28. Comparison with CPU and GPU Platform CPU GPU FPGA Device Xeon E5-2690 Tesla V100 Virtex US+ XCVU9P Clock Freq. 2.6 GHz 1.53 GHz 0.3 GHz Memory 32GB DDR4 16GB HBM2 9.49 MB BRAM Throughput (FPS) 24.0 350.0 3321.25 Power (W) 95 295 75 Efficiency (FPS/W) 0.25 1.18 44.28 28 • Ubuntu 18.04 LTS with PyTorch 1.4.0 • 128 Batch with INT8 quantization (for CPU and GPU) Note: CPU and GPU did not use our JPEG compression scheme
  • 30. Conclusion • Customized JPEG compression for a high-speed inference • 82.1x speed-up, 0.3-point accuracy drop • CNN model for a fully-pipelined implementation • Channel shift and point-wise decomposition • Binary weight quantization • Channel split-shuffle operation • Fully-pipelined CNN architecture • Achieved 3,321 FPS@75W • Speed-up: 138.4x CPU, 9.5x GPU • Energy efficiency: 177.1x CPU, 37.5x GPU • Future works • Custom compression & Other DL applications 30
  • 31. Thank you Hiroki Nakahara (Tokyo Tech, JP) nakahara@ict.e.titech.ac.jp 31