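(Public version) Reconf研 2017: GUINNESS

GUINNESS: A Deep Learning Development Environment for FPGAs
Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Masayuki Shimoda, Shimpei Sato
Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology
Reconf研 (Reconfigurable Systems workshop), September 2017
@ Dwango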

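Outline
• Research background
• Convolutional Neural Network (CNN)
• Optimization techniques for binarized CNNs
• GUINNESS: a deep learning development environment dedicated to FPGAs
• Experimental results
• Summary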

Specifications expected for automotive use

Cloud | Edge
Many classes (1000s) | Few classes (<10)
Large workloads | Frame rates (15-30 FPS)
High efficiency (Performance/W) | Low cost & low power (1W-5W)
Server form factor | Custom form factor

Source: J. Freeman (Intel), “FPGA Acceleration in the era of high level design”, 2017

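Thermal throttling
• Suppresses excessive heat generated under high load
• Improves safety and protects the device from heat damage
• TM2: lowers the supply voltage and clock frequency; TM1: intermittent execution
• Built into CPUs, GPUs, SSDs, and similar devices
• Causes performance degradation and, in some cases, device shutdown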

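Deep learning on embedded (edge) devices
• Problems with the cloud
  • Network latency
  • Privacy
  • Security
• Training is done online; the device is assumed to perform inference only
• Points to consider
  • Computing power
  • Battery
  • Cooling fan
  • Battery life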


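Artificial Neuron (AN)
[Figure: inputs x0 = 1, x1, x2, ..., xN are multiplied by the weights w0 (bias), w1, w2, ..., wN and summed (+) into u; the activation f(u) produces y]
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function (Sigmoid, ReLU, etc.)
y: Output signal

$$ y = f(u), \qquad u = \sum_{i=0}^{N} w_i x_i $$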
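The neuron definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from GUINNESS; the function names and the input and weight values are made up for the example.

```python
import math

def neuron(x, w, f):
    """Return y = f(u) with u = sum_{i=0}^{N} w_i * x_i.

    x[0] must be 1 so that w[0] acts as the bias.
    """
    assert len(x) == len(w) and x[0] == 1
    u = sum(wi * xi for wi, xi in zip(w, x))  # internal state
    return f(u)

def relu(u):
    return max(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Illustrative values only: x0 = 1 followed by three inputs, w0 is the bias.
x = [1, 0.5, -1.0, 2.0]
w = [0.1, 0.4, 0.3, -0.2]

y_relu = neuron(x, w, relu)        # u is negative here, so ReLU clips it to 0
y_sigmoid = neuron(x, w, sigmoid)
```

Swapping `f` changes only the activation; the weighted sum is the part that the later slides replace with XNOR gates and an adder tree.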

Deep Neural Network
[Figure: a deep neural network whose output classes are happy, sad, mad, and curious. Source: imotionsglobal.com]

Convolution operation
[Figure, built up over four steps: two 4x4 input feature maps are each multiplied element-wise by their own 3x3 kernel (K=3 in this example) and all products are summed (+). Sliding the window over the inputs produces the outputs 5, 3, 6, and 4 in turn.]

Input feature map 1:
1 0 1 1
1 1 1 0
0 1 0 0
1 1 0 1

Input feature map 2:
1 1 1 0
0 1 1 0
0 0 0 0
1 0 1 1

Kernel 1:
x1 x0 x1
x0 x1 x0
x0 x0 x1

Kernel 2:
x0 x0 x1
x1 x0 x1
x1 x1 x1
Convolution as performed in a CNN
[Figure: the same two 4x4 input feature maps and 3x3 kernels as in the preceding convolution example, with the completed 2x2 output]
Output:
5 3
6 4
• Extends the AN to two dimensions
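The worked example on these slides can be reproduced with a short pure-Python sketch. It assumes no padding and a stride of 1, and sums the per-map results as the "+" in the figure indicates; the function name is illustrative.

```python
def conv2d_multi(inputs, kernels):
    """Sum of per-map sliding-window products (no padding, stride 1)."""
    k = len(kernels[0])
    size = len(inputs[0]) - k + 1
    out = [[0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            out[r][c] = sum(
                fmap[r + i][c + j] * kern[i][j]
                for fmap, kern in zip(inputs, kernels)
                for i in range(k)
                for j in range(k)
            )
    return out

# The two 4x4 input feature maps from the slides.
inputs = [
    [[1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]],
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [1, 0, 1, 1]],
]
# The two 3x3 kernels (K = 3) from the slides.
kernels = [
    [[1, 0, 1], [0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [1, 0, 1], [1, 1, 1]],
]

result = conv2d_multi(inputs, kernels)
print(result)  # [[5, 3], [6, 4]], matching the slide
```

Each output element is one artificial neuron whose inputs are the 2 x 3 x 3 = 18 values under the window.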

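Binarized neural network
• Multiplication of binary (-1/+1) values
• The multiplier is replaced by an XNOR gate

Multiplication with -1/+1 values:
x1 x2 Y
-1 -1 +1
-1 +1 -1
+1 -1 -1
+1 +1 +1

XNOR with 0/1 encoding:
x1 x2 Y
0 0 1
0 1 0
1 0 0
1 1 1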
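The equivalence between the two truth tables can be checked mechanically. The sketch below assumes the encoding -1 → 0 and +1 → 1 shown in the tables; the helper names are illustrative. It also shows how a ±1 dot product reduces to counting XNOR outputs.

```python
def encode(v):
    """Map -1 to 0 and +1 to 1."""
    return (v + 1) // 2

def xnor(a, b):
    """XNOR of two bits."""
    return 1 - (a ^ b)

# Exhaustive check: multiplying in {-1, +1} equals XNOR in {0, 1}.
equivalent = all(
    encode(x1 * x2) == xnor(encode(x1), encode(x2))
    for x1 in (-1, 1)
    for x2 in (-1, 1)
)

def binarized_dot(x_bits, w_bits):
    """Dot product of two +/-1 vectors given in 0/1 encoding.

    Each XNOR output of 1 contributes +1 and each 0 contributes -1,
    so the sum is 2 * (number of ones) - n.
    """
    ones = sum(xnor(x, w) for x, w in zip(x_bits, w_bits))
    return 2 * ones - len(x_bits)

# [+1, -1, +1, +1] . [+1, +1, -1, +1] = 1 - 1 - 1 + 1 = 0
example = binarized_dot([1, 0, 1, 1], [1, 1, 0, 1])
```

In hardware the `sum` becomes an adder tree (or population count) over the XNOR outputs.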

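Effect of the binarized CNN
[Figure: binarized neuron. Inputs x1, x2, ..., xn are weighted by w1, w2, ..., wn and summed with w0 (bias) into Y; the sign function fsgn(Y) produces the output z]
Replacing low-precision (4 to 8 bit) values with binary values → compresses memory bandwidth
Replacing multipliers with XNOR gates → reduces circuit area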

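Reducing memory size → better power efficiency
• Power ∝ distance between the memory and the arithmetic units
→ If the data can be stored in the FPGA's on-chip memory, power efficiency improves
J. Emer et al., “Tutorial on Hardware Architectures for Deep Neural Networks,” MICRO-49, 2016.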

Countering the loss of recognition accuracy
• Introduce batch normalization (BatchNormalization)
[Figure: classification error (%) versus number of epochs (1 to 200) for (a) a float32-precision CNN and (b) a binarized CNN, comparing naive binarization with the proposed method]
About 6% error (using VGG-16)
H. Nakahara et al., “A memory-based binarized convolutional deep neural network,” FPT2016, pp. 285-288, 2016.

Normalization for Binarized DNN
• Normalizes the result of MAC operations
• Batch normalization is necessary for the binarized CNN to improve its accuracy
[Figure: Batch Norm block, annotated with the terms mean, variance, scaling, and shift]
[Figure: error rate [%] versus epoch (1 to 200), without BN and with BN]
H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, "A Memory-Based Realization of a Binarized Deep Convolutional Neural Network," The International Conference on Field-Programmable Technology (FPT 2016), pp.273-76, 2016.

Circuit with batch normalization
• Batch normalization is implemented with fixed-point adders and multipliers
[Figure: XNOR gates → adder tree → batch normalization → sign bit]

Realizing batch normalization as a bias
• The output of batch normalization is the input to the sign function
  → The constant factor can be ignored
• The input to batch normalization is an integer value
  → Convert to integer
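The two observations on this slide can be illustrated with a small sketch. It assumes the standard batch normalization form (scaling γ, shift β, mean μ, variance σ²) followed by a sign function with sign(0) = +1, and it assumes γ > 0 so that dropping the constant factor does not flip the sign. The parameter values are made up, and this is not the code GUINNESS generates.

```python
import math

def sign(v):
    """Sign function with sign(0) = +1 (assumed convention)."""
    return 1 if v >= 0 else -1

def bn_sign(u, mean, var, gamma, beta, eps=1e-5):
    """Reference: batch normalization followed by the sign function."""
    return sign(gamma * (u - mean) / math.sqrt(var + eps) + beta)

def fold_bn_to_int_bias(mean, var, gamma, beta, eps=1e-5):
    """Fold batch normalization into one integer bias.

    Dividing by the positive factor gamma / sqrt(var + eps) leaves the sign
    unchanged, so sign(BN(u)) = sign(u + b) with
    b = beta * sqrt(var + eps) / gamma - mean.
    Because u is an integer, u + b >= 0 holds exactly when u + floor(b) >= 0.
    """
    assert gamma > 0, "this sketch only covers a positive scaling factor"
    return math.floor(beta * math.sqrt(var + eps) / gamma - mean)

def bias_sign(u, int_bias):
    """Hardware-friendly form: integer add, then take the sign bit."""
    return sign(u + int_bias)

# Illustrative parameters; u ranges over possible integer adder-tree outputs.
params = dict(mean=3.7, var=4.0, gamma=0.8, beta=-0.5)
int_bias = fold_bn_to_int_bias(**params)
matches = all(
    bn_sign(u, **params) == bias_sign(u, int_bias) for u in range(-50, 51)
)
```

With this folding, the fixed-point multipliers of the batch normalization stage are no longer needed at inference time; only an integer adder remains.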

Circuit equivalent to batch normalization
[Figure: circuit equivalent to the Batch Norm block]

Dedicated circuit for the binarized CNN
• Customized operation: 1-bit multiply-accumulate
• Dedicated pipeline
[Figure: a binarized feature map (L=5, K=3) with elements x00 to x44 is streamed into a shift register of 2L+K bits, shown holding x22 x21 x20 x14 x13 x12 x11 x10 x04 x03 x02 x01 x00. Nine bits are fed to the binarized MACs (EXNORs + adder tree) together with the binarized weight memory; the integer bias memory output is added (+) and the sign bit is written out. A counter and write control logic complete the circuit.]
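How a shift register of 2L+K bits can supply each 3x3 window is sketched below as a software model. It assumes raster-order streaming of a square L x L map with the newest pixel at index 0; the function name and the tap formula are illustrative, not taken from the GUINNESS sources.

```python
from collections import deque

def stream_windows(fmap, k=3):
    """Yield (row, col, window, register snapshot) for each k x k window.

    The pixel that arrived d cycles ago sits at register index d, so the
    window element (i, j) is tapped at index (k-1-i) * L + (k-1-j).
    """
    L = len(fmap)
    depth = (k - 1) * L + k          # equals 2L + K when K = 3
    reg = deque(maxlen=depth)        # oldest pixel drops off the right end
    for r in range(L):
        for c in range(L):
            reg.appendleft(fmap[r][c])
            if r >= k - 1 and c >= k - 1:   # register holds a full window
                window = [
                    [reg[(k - 1 - i) * L + (k - 1 - j)] for j in range(k)]
                    for i in range(k)
                ]
                yield r - k + 1, c - k + 1, window, list(reg)

# The 5x5 feature map from the slide, using its element labels.
fmap = [[f"x{r}{c}" for c in range(5)] for r in range(5)]
results = list(stream_windows(fmap))

first_row, first_col, first_window, first_snapshot = results[0]
print(" ".join(first_snapshot))
# x22 x21 x20 x14 x13 x12 x11 x10 x04 x03 x02 x01 x00
```

The design point is that each pixel is read from memory once; the nine taps are fixed wires, so one window is available per clock cycle once the register has filled.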

Bottleneck
• Convolutional layer
→ #MAC operations
• Fully connected layer
→ Weight memory
J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” ISFPGA2016.

Replacing Internal FC Layers with a Binarized Average Pooling Layer
[Figure, before: input image → repeats of CONV + max pooling (feature maps) → max pooling and flatten → fully connected layers → “car”]
[Figure, after: input image → repeats of CONV + max pooling (feature maps) → binarized average pooling → fully connected layer → “car”]

Distribution of trained weights
[Figure: histogram of the number of weights (0 to 70000) versus weight value (-1 to 1); about 50% of the weights binarize to “-1” and about 50% to “+1”]
• Binarized weights are balanced
• Binarized internal values are balanced
[Figure: baseline network: input image → repeats of CONV + max pooling (feature maps) → max pooling and flatten → fully connected layers → “car”]

[Figure: the weight histogram from the preceding slide, repeated: “-1” (50%), “+1” (50%)]
• Binarized weights are balanced
• Binarized internal values are balanced
• The outputs are also balanced
→ A 1's count operation suffices for binarized internal values

XNOR of input x and weight w:
x w Y
0 0 1
0 1 0
1 0 0
1 1 1

Examples (2x2 window → output):
0 0 / 0 1 → 0
1 0 / 0 1 → 1
1 1 / 0 1 → 1

• Training: binarized average pooling
• Inference (FPGA): 1's counter
[Figure: binarized neuron with inputs x1 ... xn, weights w1 ... wn, bias w0, sum Y, and output z = fsgn(Y)]
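The three examples on this slide are consistent with a majority vote in which a tie yields 1. A minimal sketch follows, assuming the 0/1 encoding of -1/+1 and sign(0) = +1; both functions are illustrative.

```python
def binarized_avg_pool_reference(bits):
    """Training-time view: average the +/-1 values, then take the sign."""
    values = [2 * b - 1 for b in bits]      # 0 -> -1, 1 -> +1
    mean = sum(values) / len(values)
    return 1 if mean >= 0 else 0            # sign(0) = +1 is an assumption

def binarized_avg_pool_ones_count(bits):
    """Inference-time view: count the ones and compare with half the window."""
    return 1 if 2 * sum(bits) >= len(bits) else 0

# The three 2x2 windows from the slide, flattened row by row.
windows = [[0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 0, 1]]
outputs = [binarized_avg_pool_ones_count(w) for w in windows]
print(outputs)  # [0, 1, 1]
```

The 1's counter needs no division or signed arithmetic, which is why it suits the FPGA side while training keeps the averaging formulation.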

Model size comparison

Layer | Baseline: Dim. | In F maps | Out F maps | Weight [bits] | Proposed: Dim. | In F maps | Out F maps | Weight [bits]
Iconv | 32x32 | 3 | 64 | 1.7K | 32x32 | 3 | 64 | 1.7K
Bconv | 32x32 | 64 | 64 | 36.8K | 32x32 | 64 | 64 | 36.8K
Max Pool | 16x16 | 64 | 64 | | 16x16 | 64 | 64 |
Bconv | 16x16 | 64 | 128 | 73.7K | 16x16 | 64 | 128 | 73.7K
Bconv | 16x16 | 128 | 128 | 147.4K | 16x16 | 128 | 128 | 147.4K
Max Pool | 8x8 | 128 | 128 | | 8x8 | 128 | 128 |
Bconv | 8x8 | 128 | 256 | 294.9K | 8x8 | 128 | 256 | 294.9K
Bconv | 8x8 | 256 | 256 | 589.8K | 8x8 | 256 | 256 | 589.8K
Max Pool | 4x4 | 256 | 256 | | 4x4 | 256 | 256 |
BFC | 1x1 | 4096 | 4096 | 16.7M | (Binarized Average Pooling)
BFC | 1x1 | 4096 | 4096 | 16.7M |
BFC | 1x1 | 4096 | 10 | 40.9K | 1x1 | 256 | 10 | 2.5K
(fc total) | | | | (33.6M) | | | | (2.5K)
Total | | | | 34.7M | | | | 1.5M
Error Rate | 18.6% | | | | 18.2% | | |
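The per-layer weight sizes in the table can be reproduced with simple arithmetic. The sketch below assumes 3x3 kernels and one bit per weight, which is not stated on this slide but matches the listed figures; the helper names are illustrative.

```python
def conv_weight_bits(in_maps, out_maps, k=3):
    """Weights of a convolutional layer at 1 bit each (k x k kernels assumed)."""
    return in_maps * out_maps * k * k

def fc_weight_bits(in_units, out_units):
    """Weights of a fully connected layer at 1 bit each."""
    return in_units * out_units

# (input feature maps, output feature maps) for Iconv and the five Bconv layers.
conv_layers = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]
conv_bits = [conv_weight_bits(i, o) for i, o in conv_layers]

# Baseline: three binarized FC layers. Proposed: a single 256 -> 10 FC layer,
# because binarized average pooling reduces each 4x4 map to one value.
baseline_fc_bits = (
    fc_weight_bits(4096, 4096) + fc_weight_bits(4096, 4096) + fc_weight_bits(4096, 10)
)
proposed_fc_bits = fc_weight_bits(256, 10)

print(conv_bits)         # [1728, 36864, 73728, 147456, 294912, 589824]
print(baseline_fc_bits)  # 33595392, i.e. about 33.6M
print(proposed_fc_bits)  # 2560, i.e. about 2.5K
```

The comparison makes the motivation concrete: the convolutional layers are identical in both designs, and nearly all of the baseline's weight memory sits in the fully connected layers.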


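What is GUINNESS?
• Short for "A GUI based neural network synthesizer"
• Trains on images prepared by the user and generates a bitstream for an FPGA inference circuit
• Training and circuit synthesis require only GUI operations, so both hardware engineers and algorithm engineers can easily deploy deep learning on an FPGA
• No code needs to be written at all
Tokyo Tech. Nakahara Lab.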

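GUINNESS (current version)
[Figure: GUI screenshot with the following callouts]
• CNN parameters (depth, width) and layer types can be specified (recommended parameters can be loaded)
• The user's own training data can be used
• Training can be saved and resumed
• Training parameters are preset
• Specifying the target FPGA board is all it takes to generate the bitstream automatically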

GUINNESS Tool Flow
[Figure: tool flow, operated by the GUI]
• Inputs: Image Data, Label Data, and CNN Spec. (file types shown: .pkl, .txt, .py), generated from images
• Training by Chainer (trained by GPU) → .model (Binarized CNN Weight)
• Model to Text → Binarized Weight (.txt)
• Chainer to C++ → PL code (.cpp) and PS code (.cpp)
• SDSoC: gcc → .elf (Exe. data, for the PS); HLS → .bit (Bit stream, for the PL)
• Target: Zynq FPGA (PS, PL, BRAM)

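Recommended environment for GUINNESS
• Recommended computing environment
  • GPU (GTX1070 or better) + CUDA 8.0
  • Multi-GPU and cloud environments can also be supported (contact us)
  • Pre-trained CNNs can be loaded (contact us)
  • 16 GB of memory or more
  • Only Ubuntu 14.04 LTS is supported
  • Xilinx SDSoC 2016.3, 2016.4
  • Chainer 1.17.0 to 1.21.0
• Supported FPGA boards (more to be added; contact us for custom designs)
  • Digilent Zybo, Zedboard
  • Xilinx ZC702, ZCU102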


Comparison with existing FPGA implementations

Implementation (Year) | Zhao et al. [1] (2017) | FINN [2] (2017) | Ours
FPGA Board (FPGA) | Zedboard (XC7Z020) | PYNQ board (XC7Z020) | Zedboard (XC7Z020)
Clock (MHz) | 143 | 166 | 143
#LUTs | 46900 | 42833 | 14509
#18Kb BRAMs | 94 | 270 | 32
#DSP Blocks | 3 | 32 | 1
Test Error | 12.27% | 19.90% | 18.20%
Time [msec] (FPS) | 5.94 (168) | 2.24 (445) | 2.37 (420)
Power [W] | 4.7 | 2.5 | 2.3
FPS/Watt | 35.7 | 178.0 | 182.6
FPS/LUT | 35.8x10^-4 | 103.9x10^-4 | 289.4x10^-4
FPS/BRAM | 1.8 | 1.6 | 13.1

[1] R. Zhao et al., “Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs,” ISFPGA, 2017.
[2] Y. Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference,” ISFPGA, 2017.
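The three efficiency rows of the comparison are ratios of the raw figures listed above them. A short sketch of that arithmetic, using only numbers from the table (the dictionary layout and function name are illustrative):

```python
# FPS, power, and resource counts copied from the comparison table.
designs = {
    "Zhao et al.": {"fps": 168, "power_w": 4.7, "luts": 46900, "brams": 94},
    "FINN": {"fps": 445, "power_w": 2.5, "luts": 42833, "brams": 270},
    "Ours": {"fps": 420, "power_w": 2.3, "luts": 14509, "brams": 32},
}

def efficiency(d):
    """Return throughput normalized by power, LUT count, and BRAM count."""
    return {
        "fps_per_watt": d["fps"] / d["power_w"],
        "fps_per_lut": d["fps"] / d["luts"],
        "fps_per_bram": d["fps"] / d["brams"],
    }

metrics = {name: efficiency(d) for name, d in designs.items()}
for name, m in metrics.items():
    print(f"{name}: {m['fps_per_watt']:.1f} FPS/W, "
          f"{m['fps_per_lut'] * 1e4:.1f}e-4 FPS/LUT, "
          f"{m['fps_per_bram']:.1f} FPS/BRAM")

best_per_watt = max(metrics, key=lambda n: metrics[n]["fps_per_watt"])
```

Normalizing by LUTs and BRAMs matters on a small device such as the XC7Z020, where the resources left over determine what else can share the FPGA.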

Comparison with Embedded Platforms (VGG11 Forwarding)

Platform | CPU | GPU | FPGA
Device | ARM Cortex-A57 | Maxwell GPU | Zynq7020
Clock Freq. | 1.9 GHz | 998 MHz | 143.78 MHz
Memory | 16 GB eMMC Flash | 4 GB LPDDR4 | 4.9 Mb BRAM
Time [msec] (FPS) | 4210.0 (0.23) | 27.23 (36.7) | 2.37 (421.9)
Power [W] | 7 | 17 | 2.3
Efficiency [FPS/W] | 0.032 | 2.2 | 182.6
Design Time [Hours] | 72 | 72 | 75

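Summary
• Developed an integrated development environment for deep learning
  • An environment specialized for binarized CNNs
  • Optimization techniques dedicated to inference
  • Training method
• Compared with embedded CPUs and GPUs
  • Faster and more power efficient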

https://github.com/HirokiNakahara/GUINNESS

A Docker image is available
• GUINNESS can be run on Azure and similar services!
