LUT-Network
~ FOR REAL-TIME COMPUTING~
REVISION 2
Ryuji Fuchikami
渕上 竜司
• This document is an update of the slides presented at “fpgax, February 2, 2019”
• https://www.slideshare.net/ryuz88/lut-network-fpgx201902
• This is the English translation version.
History of LUT-Network publishing
• BinaryBrain Version 1 (August 1, 2018 ~)
  • I named it “LUT-Network”
  • Flat programming
  • Binary-LUT model (SIMD, AVX2)
  • Brute-force learning model
  • Binary modulation model
• BinaryBrain Version 2 (September 2, 2018 ~)
  • Layer-model programming
  • CNN support
  • Verilog-RTL export
  • Back-propagation learning model added
  • Sparse-Affine model
  • micro-MLP model
• BinaryBrain Version 3 (March 19, 2019 ~)
  • Data objects with GPU (CUDA) support
  • Stochastic-LUT model added
  • Regression sample added
https://github.com/ryuz/BinaryBrain
What is Real-Time Computing?
• Technology that matches the computer to real-world dynamics.
• Computing in the living space.
(Figure: examples of computing in the living space: digital mirror, video conference, remote controller, exploration robot, care robot, AR glasses, autonomous control; the human and the things of the real world are in the loop, in contrast to offline HPC.)
YouTube movie : https://www.youtube.com/watch?v=wGRhw9bbiik&t=2s
Real-Time Binary-DNN architecture for FPGA
• von Neumann architecture: input device → memory → processor → output device.
  Data enters memory first; high throughput, but long latency; best effort (variable fps).
• Dataflow programming for real-time: input device → processor → output device.
  Memory is used only to refer to past data; hard real-time and low latency.
I invented the LUT-Network for real-time processing.
LUT-Network Overview
• Conventional DNN
  1. Construct with perceptron nodes.
  2. Do training.
  3. Get the perceptron weight parameters.
• LUT-Network
  1. Construct with LUT nodes.
  2. Do training.
  3. Get the table parameters.
(Figure: a perceptron node with inputs x1 … xn, weights w1 … wn, threshold θ, and output y.)
LUT-Network Performance
xc7z020clg400-1
• Very few resources, very low delay, real-time recognition
• MNIST MLP classification: 318,877fps
• 1ms delay, 1000fps throughput
Design Flow for LUT-Network

【Conventional】
Network design → Learning (e.g. TensorFlow) → network parameters → Convert to C++ → C++ source code → High-level synthesis (e.g. Vivado HLS) → RTL (behavioral) → Synthesis (e.g. Vivado) → Complete (many LUTs, 100~200MHz)

【LUT-Network】
Network circuit design → Learning (BinaryBrain) → network (FPGA circuit) → RTL (net-list) → Synthesis (e.g. Vivado) → Complete (few LUTs, 300~400MHz)
Features of LUT-Network
• Binary network for prediction on edge devices.
• Classification and regression.
• High density and high speed (300~400MHz).
• Circuit size is determined prior to learning.
• It is possible to keep a real-time warranty.

                     Conventional DNN         LUT-Network
Recognition rate     Decided when learning    Best effort
System performance   Best effort              Decided when learning (real-time warranty)
How do you learn the LUT? (three ideas)
1. Brute-force learning
  • Directly optimize the LUT tables to minimize the loss function on the training data.
  • MLP (multi-layer perceptron) only; it cannot be applied to CNNs.
  • Learning a large network is difficult.
  • It does not use gradients for learning
    (so it may be resistant to “Adversarial Examples”).
2. Learning with the micro-MLP model
  • Applies the BDNN method.
  • Forward: binary; backward: FP32.
  • Low-speed learning on GPU, and high-speed prediction on FPGA.
3. Learning with the Stochastic-LUT model
  • Forward: FP32; backward: FP32.
  • High-speed, high-accuracy learning on GPU, and high-speed prediction on FPGA.
Brute force learning
1. Initialize the LUT with random numbers.
2. Fix the output to 0 and to 1 in turn, and pass all the training data through the network.
3. Keep the sum of the loss function for each LUT input value, and update each table entry in the direction that reduces the loss (see the example table and the sketch below).
input   frequency   loss with 0   loss with 1   new table value
 0      37932       47813.7       48233.9       0
 1      39482       50001.3       49692.9       1
 2      37028       44698.9       44845.7       0
 3      40640       49257.1       49331.0       0
 4      27156       33998.4       33891.0       1
 5      23930       29538.6       29495.2       1
 6      29002       35197.3       35451.4       0
 7      27786       33390.9       33466.9       0
 8      43532       52741.1       52993.5       0
 9      41628       49985.9       50388.5       0
 10     49176       56521.4       56026.1       1
 11     46542       54215.4       54284.9       0
 ・・・・
 59     34268       41152.9       41215.8       0
 60     22872       28852.4       29000.0       0
 61     17930       22068.9       22112.9       0
 62     24156       28213.2       28227.1       0
 63     24194       28367.0       28450.4       0
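Below is a minimal C++ sketch of the table-update step, assuming the per-input loss sums have already been accumulated as in the table above. The type and function names are illustrative, not the BinaryBrain API; the rule is simply to pick, for each table entry, the output value whose accumulated loss is smaller.

#include <array>
#include <cstddef>

// Accumulated loss for one LUT table entry, with the output forced to 0 and to 1.
struct LutLossSum {
    double loss_with_0;
    double loss_with_1;
};

// A 6-input LUT has 64 table entries. Choose, for each entry, the output value
// that produced the smaller total loss over the training data.
std::array<bool, 64> UpdateLutTable(const std::array<LutLossSum, 64>& sums)
{
    std::array<bool, 64> table{};
    for (std::size_t i = 0; i < sums.size(); ++i) {
        // e.g. input 1: loss_with_0 = 50001.3 > loss_with_1 = 49692.9, so table[1] = 1
        table[i] = (sums[i].loss_with_1 < sums[i].loss_with_0);
    }
    return table;
}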
Micro-MLP learning
(Figure: comparison of three node structures, each followed by BatchNormalization and Binary-Activation.)
• Dense-Affine (fully connected), synthesized to deep logic: low speed (100MHz~200MHz), middle performance; a single node cannot learn XOR.
• Sparse-Affine (my 1st idea), mapped to LUTs (the LUT includes the BN): simple logic, high speed (300MHz~400MHz), but low performance; a single node cannot learn XOR.
• micro-MLP stack (my 2nd idea), mapped to LUTs (the LUT includes a hidden layer): simple logic, high speed (300MHz~400MHz), high performance; it can learn XOR. This unit is the “micro-MLP”.
Binary activation layer for micro-MLP
• forward: Sign()
    y = 1 if x ≥ 0, otherwise 0
• backward: hard-tanh()
    g(x) = 1 if |x| ≤ 1, otherwise 0
Same as the BinaryConnect method.
Batch Normalization uses the conventional implementation.
(Figure: each micro-MLP unit, BatchNormalization → Binary-Activation, is mapped to a LUT.)
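A minimal C++ sketch of this forward/backward pair (a straight-through estimator on scalar values; an illustration, not the BinaryBrain implementation):

#include <cmath>

// Forward: Sign() binarization, y = 1 if x >= 0, otherwise 0.
inline float BinaryActivationForward(float x)
{
    return (x >= 0.0f) ? 1.0f : 0.0f;
}

// Backward: hard-tanh style straight-through estimator.
// The incoming gradient dy is passed through only where |x| <= 1.
inline float BinaryActivationBackward(float x, float dy)
{
    return (std::fabs(x) <= 1.0f) ? dy : 0.0f;
}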
Stochastic-LUT model learning
(Figure: computation graph of the 2-input Stochastic-LUT example below; the terms (1 - x0), (1 - x1), x0, x1 are multiplied with the table values W0…W3 and summed into y.)
e.g.) 2-input LUT model: x0, x1 are the input stochastic variables and W0…W3 are the table values.
Probability that W0 is selected : (1 - x1) * (1 - x0)
Probability that W1 is selected : (1 - x1) * x0
Probability that W2 is selected : x1 * (1 - x0)
Probability that W3 is selected : x1 * x0
y = W0 * (1 - x1) * (1 - x0)
  + W1 * (1 - x1) * x0
  + W2 * x1 * (1 - x0)
  + W3 * x1 * x0
Because this calculation tree is differentiable, back-propagation can be applied.
The formula for a 6-input LUT is larger, but it is calculated in the same way.
With the Stochastic-LUT model, learning is much faster and more accurate than with the micro-MLP model.
No Batch Normalization is needed.
No activation is needed.
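A minimal C++ sketch of this 2-input Stochastic-LUT, written directly from the formula above (plain C++ rather than the BinaryBrain API; the 6-input version used in practice follows the same pattern with 64 table values):

#include <array>

struct StochasticLut2 {
    std::array<float, 4> W;  // table values W0..W3, learned as real numbers

    // Forward: expected LUT output when x0 and x1 are probabilities in [0, 1].
    float Forward(float x0, float x1) const
    {
        return W[0] * (1 - x1) * (1 - x0)
             + W[1] * (1 - x1) * x0
             + W[2] * x1 * (1 - x0)
             + W[3] * x1 * x0;
    }

    // Backward: gradients with respect to the inputs and the table values,
    // scaled by the incoming gradient dy.
    void Backward(float x0, float x1, float dy,
                  float& dx0, float& dx1, std::array<float, 4>& dW) const
    {
        dW[0] = (1 - x1) * (1 - x0) * dy;
        dW[1] = (1 - x1) * x0 * dy;
        dW[2] = x1 * (1 - x0) * dy;
        dW[3] = x1 * x0 * dy;
        dx0 = ((W[1] - W[0]) * (1 - x1) + (W[3] - W[2]) * x1) * dy;
        dx1 = ((W[2] - W[0]) * (1 - x0) + (W[3] - W[1]) * x0) * dy;
    }
};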
Stochastic-LUT model
(Figure: an n-input LUT node, input[n-1:0] → output, with the table holding probability values as an n-dimensional continuum.)
Benchmark against other binary networks

Network                    Matrix   Weight        Activation    Convolution / Deep net   Prediction on FPGA (1 node →)
Binary Connect             Dense    Binary        Real (FP32)   OK                       many adders
Binarized Neural Network   Dense    Binary        Binary        OK                       many XNOR
XNOR-Net                   Dense    Binary        Binary        OK                       many XNOR
LUT-Network                Sparse   Real (FP32)   none          OK                       1 LUT

Learning runs on CPUs/GPUs and prediction on FPGA; the ratings range from good to excellent, with the LUT-Network's one-node-to-one-LUT mapping rated excellent for FPGA prediction.
Demonstration 1 [MNIST MLP 1000fps]
(Block diagram: a Raspberry Pi Camera V2 (Sony IMX219) connects over MIPI-CSI and SERDES (FIN1216) via an original adapter board to a ZYBO Z7-20. Inside the PL (Jelly): MIPI-CSI RX, the DNN (LUT-Net) RTL generated offline by BinaryBrain on a PC, DMA to DDR3 SDRAM, and OLED (UG-9664HDDAG01) output; the PS (Linux) handles I2C and control, with a debug view sent to a PC X-Window over Ethernet. Camera input is 640x132 at 1000fps and the pipeline runs at 1000fps.)
YouTube movie: https://www.youtube.com/watch?v=NJa77PZlQMI
MNIST MLP 1000fps
LUT: 1182
input:784
layer0: 256
layer1: 256
layer2: 128
layer3: 128
layer4: 128
layer5: 128
layer6: 128
layer7: 30
Utilization of the DNN part and total utilization (including camera/OLED control).
250MHz / (28x28) = 318,877fps
Demonstration 2 [MNIST CNN 1000fps]
(Block diagram: the same system as Demonstration 1: Raspberry Pi Camera V2 (Sony IMX219) over MIPI-CSI/SERDES (FIN1216), the DNN (LUT-Net) in the PL (Jelly) of a ZYBO Z7-20, DMA to DDR3 SDRAM, OLED UG-9664HDDAG01 output, PS (Linux) control, and a PC X-Window debug view over Ethernet, plus an OSD (frame-memory) block; 640x132 at 1000fps, offline learning on a PC with BinaryBrain.)
YouTube movie : https://www.youtube.com/watch?v=aYuYrYxztBU
MNIST CNN (DNN part)
CNV3x3 → CNV3x3 → MaxPool → CNV3x3 → CNV3x3 → MaxPool → Affine → Affine
// sub-networks for convolution(3x3)
bb::NeuralNetSparseMicroMlp<6, 16> sub0_smm0(1 * 3 * 3, 192);
bb::NeuralNetSparseMicroMlp<6, 16> sub0_smm1(192, 32);
bb::NeuralNetGroup<> sub0_net;
sub0_net.AddLayer(&sub0_smm0);
sub0_net.AddLayer(&sub0_smm1);

// sub-networks for convolution(3x3)
bb::NeuralNetSparseMicroMlp<6, 16> sub1_smm0(32 * 3 * 3, 192);
bb::NeuralNetSparseMicroMlp<6, 16> sub1_smm1(192, 32);
bb::NeuralNetGroup<> sub1_net;
sub1_net.AddLayer(&sub1_smm0);
sub1_net.AddLayer(&sub1_smm1);

// sub-networks for convolution(3x3)
bb::NeuralNetSparseMicroMlp<6, 16> sub3_smm0(32 * 3 * 3, 192);
bb::NeuralNetSparseMicroMlp<6, 16> sub3_smm1(192, 32);
bb::NeuralNetGroup<> sub3_net;
sub3_net.AddLayer(&sub3_smm0);
sub3_net.AddLayer(&sub3_smm1);

// sub-networks for convolution(3x3)
bb::NeuralNetSparseMicroMlp<6, 16> sub4_smm0(32 * 3 * 3, 192);
bb::NeuralNetSparseMicroMlp<6, 16> sub4_smm1(192, 32);
bb::NeuralNetGroup<> sub4_net;
sub4_net.AddLayer(&sub4_smm0);
sub4_net.AddLayer(&sub4_smm1);

// main-networks
bb::NeuralNetRealToBinary<float> input_real2bin(28 * 28, 28 * 28);
bb::NeuralNetLoweringConvolution<> layer0_conv(&sub0_net, 1, 28, 28, 32, 3, 3);
bb::NeuralNetLoweringConvolution<> layer1_conv(&sub1_net, 32, 26, 26, 32, 3, 3);
bb::NeuralNetMaxPooling<> layer2_maxpol(32, 24, 24, 2, 2);
bb::NeuralNetLoweringConvolution<> layer3_conv(&sub3_net, 32, 12, 12, 32, 3, 3);
bb::NeuralNetLoweringConvolution<> layer4_conv(&sub4_net, 32, 10, 10, 32, 3, 3);
bb::NeuralNetMaxPooling<> layer5_maxpol(32, 8, 8, 2, 2);
bb::NeuralNetSparseMicroMlp<6, 16> layer6_smm(32 * 4 * 4, 480);
bb::NeuralNetSparseMicroMlp<6, 16> layer7_smm(480, 80);
bb::NeuralNetBinaryToReal<float> output_bin2real(80, 10);
xc7z020clg400-1
MNIST CNN (system total)
Includes camera and OLED control.
Result of RTL simulation.
MNIST CNN Learning log [micro-MLP]
fitting start : MnistCnnBin
initial test_accuracy : 0.1518
[save] MnistCnnBin_net_1.json
[load] MnistCnnBin_net.json
fitting start : MnistCnnBin
[initial] test_accuracy : 0.6778 train_accuracy : 0.6694
695.31s epoch[ 2] test_accuracy : 0.7661 train_accuracy : 0.7473
1464.13s epoch[ 3] test_accuracy : 0.8042 train_accuracy : 0.7914
2206.67s epoch[ 4] test_accuracy : 0.8445 train_accuracy : 0.8213
2913.12s epoch[ 5] test_accuracy : 0.8511 train_accuracy : 0.8460
3621.61s epoch[ 6] test_accuracy : 0.8755 train_accuracy : 0.8616
4325.83s epoch[ 7] test_accuracy : 0.8713 train_accuracy : 0.8730
5022.86s epoch[ 8] test_accuracy : 0.9086 train_accuracy : 0.8863
5724.22s epoch[ 9] test_accuracy : 0.9126 train_accuracy : 0.8930
6436.04s epoch[ 10] test_accuracy : 0.9213 train_accuracy : 0.8986
7128.01s epoch[ 11] test_accuracy : 0.9115 train_accuracy : 0.9034
7814.35s epoch[ 12] test_accuracy : 0.9078 train_accuracy : 0.9061
8531.97s epoch[ 13] test_accuracy : 0.9089 train_accuracy : 0.9082
9229.73s epoch[ 14] test_accuracy : 0.9276 train_accuracy : 0.9098
9950.20s epoch[ 15] test_accuracy : 0.9161 train_accuracy : 0.9105
10663.83s epoch[ 16] test_accuracy : 0.9243 train_accuracy : 0.9146
11337.86s epoch[ 17] test_accuracy : 0.9280 train_accuracy : 0.9121
fitting end
micro-MLP model on BinaryBrain version 2
MNIST CNN Learning log [Stochastic-LUT]
fitting start : MnistStochasticLut6Cnn
72.35s epoch[ 1] test accuracy : 0.9508 train accuracy : 0.9529
153.70s epoch[ 2] test accuracy : 0.9581 train accuracy : 0.9638
235.33s epoch[ 3] test accuracy : 0.9615 train accuracy : 0.9676
316.71s epoch[ 4] test accuracy : 0.9647 train accuracy : 0.9701
398.33s epoch[ 5] test accuracy : 0.9642 train accuracy : 0.9718
479.71s epoch[ 6] test accuracy : 0.9676 train accuracy : 0.9731
・
・
・
2111.04s epoch[ 26] test accuracy : 0.9699 train accuracy : 0.9786
2192.82s epoch[ 27] test accuracy : 0.9701 train accuracy : 0.9788
2274.26s epoch[ 28] test accuracy : 0.9699 train accuracy : 0.9789
2355.97s epoch[ 29] test accuracy : 0.9699 train accuracy : 0.9789
2437.39s epoch[ 30] test accuracy : 0.9696 train accuracy : 0.9791
2519.13s epoch[ 31] test accuracy : 0.9698 train accuracy : 0.9793
2600.71s epoch[ 32] test accuracy : 0.9695 train accuracy : 0.9792
fitting end
parameter copy to LUT-Network
lut_accuracy : 0.9641
export : verilog/MnistStochasticLut6Cnn.v
Stochastic-LUT model on BinaryBrain version 3
Linear Regression [Stochastic-LUT]
(diabetes data from scikit-learn)
fitting start : DiabetesRegressionStochasticLut6
[initial] test MSE : 0.0571 train MSE : 0.0581
0.97s epoch[ 1] test MSE : 0.0307 train MSE : 0.0344
1.42s epoch[ 2] test MSE : 0.0209 train MSE : 0.0284
1.87s epoch[ 3] test MSE : 0.0162 train MSE : 0.0270
2.32s epoch[ 4] test MSE : 0.0160 train MSE : 0.0261
・
・
・
27.11s epoch[ 59] test MSE : 0.0146 train MSE : 0.0245
27.55s epoch[ 60] test MSE : 0.0195 train MSE : 0.0256
27.99s epoch[ 61] test MSE : 0.0145 train MSE : 0.0231
28.43s epoch[ 62] test MSE : 0.0133 train MSE : 0.0232
28.87s epoch[ 63] test MSE : 0.0940 train MSE : 0.0903
29.30s epoch[ 64] test MSE : 0.0146 train MSE : 0.0233
fitting end
parameter copy to LUT-Network
LUT-Network accuracy : 0.0340518
export : DiabetesRegressionBinaryLut.v
Stochastic-LUT model on BinaryBrain version 3
Resource estimate (per node)

Columns: Learning = operator and cycle count on a 1-core CPU; Prediction = operator and cycle count on a 1-core CPU (single weight-calculation instructions), plus resources on FPGA (Xilinx 7-Series) and ASIC, each for a multi-cycle and a pipelined implementation.

Affine (Float)
  Learning (CPU): multiplier + adder, 0.25 cycle
  Prediction (CPU): multiplier + adder, 0.125 cycle (8-parallel FMA)
  FPGA: [MUL] DSP:2, LUT:133; [ADD] LUT:413 (multi-cycle); multi-cycle value × node count (pipeline)
  ASIC: over 10k gates (multi-cycle); over 10M gates (pipeline)

Affine (INT16)
  Learning (CPU): multiplier + adder, 0.125 cycle
  Prediction (CPU): multiplier + adder, 0.0625 cycle (16-parallel)
  FPGA: [MAC] DSP:1 (multi-cycle); multi-cycle value × node count (pipeline)
  ASIC: 0.5k~1k gates (multi-cycle); over 1M gates (pipeline)

Binary Connect
  Learning (CPU): multiplier + adder, 0.25 cycle
  Prediction (CPU): adder + adder, 0.125 cycle (8-parallel)
  FPGA: [MAC] DSP:1 (multi-cycle); multi-cycle value × node count (pipeline)
  ASIC: 100~200 gates (multi-cycle); multi-cycle value × node count (pipeline)

BNN / XNOR-Net
  Learning (CPU): multiplier + adder, 0.25 cycle
  Prediction (CPU): XNOR + popcnt, 0.0039 + 0.0156 cycle (256-parallel)
  FPGA: LUT:6~12 (multi-cycle); LUT:400~10000, depending on the number of connections (pipeline)
  ASIC: 20~60 gates (multi-cycle); multi-cycle value × node count (pipeline)

6-LUT-Net
  Learning (CPU): multiplier + adder, 23.8 cycles
  Prediction (CPU): LUT, 1.16 cycles = (6 input loads + 1 table load) / 6 (256-parallel)
  FPGA: LUT:1 (over spec, multi-cycle); LUT:1 (fit, pipeline)
  ASIC: 10~30 gates (over spec, multi-cycle); 10~30 gates (pipeline)

2-LUT-Net
  Learning (CPU): multiplier + adder, 1.37 cycles
  Prediction (CPU): logic gate, 1.5 cycles = (2 input loads + 1 table load) / 2
  FPGA: LUT:1 (over spec, multi-cycle); LUT:1 (over spec, pipeline)
  ASIC: 1 gate (over spec, multi-cycle); 1 gate (fit, pipeline)
Oversampling and binary modulation
• Oversampling combined with modulation of the quantization:
  • PWM (pulse-width modulation)
  • delta-sigma modulation
  • random dither, etc.
• For example, high-speed camera images already contain noise.
• An LPF (low-pass filter) removes the noise and dequantizes the signal.
• This makes regression analysis possible.
• e.g.) the LPF can be constructed from IIR/FIR/Kalman filters.
(Block diagram: modulation / quantization, driven by random noise or a local oscillator, feeds the binary DNN, whose output goes through an LPF.)
Human senses also include an LPF.
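A minimal C++ sketch of the random-dither idea for a single value in [0, 1]: the value is binarized against a random threshold on every oversampled frame (the modulation), and a simple averaging LPF recovers an approximation of the original value. This illustrates the principle only; it is not the BinaryBrain pipeline.

#include <cstdio>
#include <random>
#include <vector>

int main()
{
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dither(0.0f, 1.0f);

    const float signal = 0.3f;      // original analog value in [0, 1]
    const int oversample = 1000;    // e.g. 1000 frames of a slowly changing scene

    // Binary modulation: compare the signal against random dither on each sample.
    std::vector<int> bits(oversample);
    for (int i = 0; i < oversample; ++i) {
        bits[i] = (signal > dither(rng)) ? 1 : 0;
    }

    // LPF (here a plain average) removes the noise and dequantizes the bit stream.
    float sum = 0.0f;
    for (int b : bits) sum += static_cast<float>(b);
    const float recovered = sum / static_cast<float>(oversample);

    std::printf("original = %.3f  recovered = %.3f\n", signal, recovered);
    return 0;
}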
Architecture proposal for real-time
(Block diagram: Video-In → DNN → Video-Out, with ME/MC (motion estimation / compensation) and a frame memory forming a feedback path; similar to an IIR filter.)
Next approach
• Improving the sparse-connection rules
  • Currently the connections are selected at random, but real data has locality, as CNNs exploit.
  • One method is to choose each connection destination probabilistically by node distance, e.g. with a Gaussian function (see the sketch below).
  • I also want to build stacked connections in a pyramid structure.
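A minimal C++ sketch of the distance-based connection rule mentioned above, assuming nodes are laid out on a one-dimensional index axis: each of a node's 6 LUT inputs is drawn from the previous layer with probability weighted by a Gaussian of the index distance. The layout, the sigma value, and the sampling scheme are illustrative assumptions, not an existing BinaryBrain feature.

#include <array>
#include <cmath>
#include <random>
#include <vector>

// Pick 6 input indices for one LUT node, preferring nearby nodes of the
// previous layer according to a Gaussian of the index distance.
std::array<int, 6> PickConnections(int node_index, int prev_layer_size,
                                   double sigma, std::mt19937& rng)
{
    std::vector<double> weight(prev_layer_size);
    for (int i = 0; i < prev_layer_size; ++i) {
        const double d = static_cast<double>(i - node_index);
        weight[i] = std::exp(-(d * d) / (2.0 * sigma * sigma));
    }
    std::discrete_distribution<int> dist(weight.begin(), weight.end());

    std::array<int, 6> picks{};
    for (int k = 0; k < 6; ++k) {
        picks[k] = dist(rng);  // duplicates are possible in this simple sketch
    }
    return picks;
}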
References
• BinaryConnect: Training Deep Neural Networks with binary weights during propagations
https://arxiv.org/pdf/1511.00363.pdf
• Binarized Neural Networks
https://arxiv.org/abs/1602.02505
• Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations
Constrained to +1 or -1
https://arxiv.org/abs/1602.02830
• XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
https://arxiv.org/abs/1603.05279
• Xilinx UltraScale Architecture Configurable Logic Block User Guide
https://japan.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf
My Profile
• Open source programmer (hobbyist)
• Born in 1976; I live in Fukuoka City, Japan
• 1998~ publish HOS (Real-Time OS [uITRON])
• https://ja.osdn.net/projects/hos/
(ARM,H8,SH,MIPS,x86,Z80,AM,V850,MicroBlaze, etc.)
• 2008~ publish Jelly (Soft-core CPU for FPGA)
• https://github.com/ryuz/jelly
• http://ryuz.my.coocan.jp/jelly/toppage.html
• 2018~ publish LUT-Network
• https://github.com/ryuz/BinaryBrain
• Real-Time AR-glasses project (my current hobby)
• Real-Time glasses (camera [IMX219] & OLED 1000fps)
https://www.youtube.com/watch?v=wGRhw9bbiik
• Real-Time GPU (no frame buffer architecture)
https://www.youtube.com/watch?v=vl-lhSOOlSk
• Real-Time DNN (LUT-Network)
https://www.youtube.com/watch?v=aYuYrYxztBU
Contact to me
• Ryuji Fuchikami (渕上 竜司)
• e-mail : ryuji.fuchikami@nifty.com
• Web-Site : http://ryuz.my.coocan.jp/
• Blog : http://ryuz.txt-nifty.com/
• GitHub : https://github.com/ryuz/
• Twitter : https://twitter.com/Ryuz88
• Facebook : https://www.facebook.com/ryuji.fuchikami
• YouTube : https://www.youtube.com/user/nekoneko1024