2. Interaction Lab., Seoul National University of Science and Technology
■Degree of error in gaze estimation
Geometric-based
• 0.5° ~ 3°
Appearance-based
• 3° ~ 12°
Feedback
3. Interaction Lab., Seoul National University of Science and Technology
■GazeTR-Hybrid and GazeTR-Conv
Feedback
4. Interaction Lab., Seoul National University of Science and Technology
Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution
Hyungjun Kim, Jihoon Park, Changhun Lee, Jae-Joon Kim
POSTECH, Korea
Computer Vision and Pattern Recognition (CVPR) 2021
Presenter: Jae-Yeop Jeong
5. Interaction Lab., Seoul National University of Science and Technology
■Intro
■Method
■Conclusion
Agenda
7. Interaction Lab., Seoul National University of Science and Technology
■Lightweight Network
Network quantization
Network pruning
Efficient architecture design
Intro(1/6)
8. Interaction Lab., Seoul National University of Science and Technology
■Network quantization
The process of reducing the numerical representation and internal composition of a neural network
Converts floating-point values to integer or fixed-point formats (a minimal sketch follows)
• Lighter model
• Reduced computation
• Efficient hardware use
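A minimal sketch of the general idea, using symmetric 8-bit uniform quantization (this specific scheme is an assumption for illustration, not the paper's method):

import torch

def quantize_int8(x: torch.Tensor):
    # One scale for the whole tensor: map the largest magnitude to 127.
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Approximate reconstruction; the gap to the original is the quantization error.
    return q.float() * scale

x = torch.randn(5)
q, scale = quantize_int8(x)
print(x)
print(dequantize(q, scale))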
Intro(2/6)
9. Interaction Lab., Seoul National University of Science and Technology
■Binary Neural Network
32-bit floating-point weights can be represented by 1-bit (±1) weights
Multiply-accumulate operations reduce to XNOR-and-popcount logic
Weight quantization
Activation quantization
Intro(3/6)
Weight binarization example (32-bit weights mapped to ±1 by the sign function):
[-1.21  0.88 -0.11]        [-1  1 -1]
[ 0.17  1.72 -0.70]  --->  [ 1  1 -1]
[ 0.98 -0.57  0.63]        [ 1 -1  1]
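A minimal sketch of the XNOR-and-popcount trick mentioned above: encoding +1 as bit 1 and -1 as bit 0, the dot product of two ±1 vectors of length n becomes 2·popcount(XNOR(a, b)) − n.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    # XNOR marks the positions where the two ±1 vectors agree.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    # Each agreement contributes +1 and each disagreement -1 to the dot product.
    return 2 * bin(xnor).count("1") - n

# a = [+1, -1, +1], b = [+1, +1, -1]  ->  (+1) + (-1) + (-1) = -1
print(binary_dot(0b101, 0b110, 3))  # -1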
10. Interaction Lab., Seoul National University of Science and Technology
■Accuracy of BNNs
Due to their lightweight nature, BNNs are garnering interest as a promising solution for DNN computing on edge devices
BNNs still suffer from accuracy degradation caused by the aggressive quantization
• Lower accuracy than the full-precision model
Intro(4/6)
11. Interaction Lab., Seoul National University of Science and Technology
■Problems of BNNs
Gradient mismatch problem caused by the non-differentiable binary activation function
• Gradients cannot propagate through the quantization layer in the back-propagation process
• The Straight-Through Estimator (STE) approximates the backward pass instead (see the sketch below)
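A minimal sketch of a sign activation with an STE backward pass (a common formulation; the exact variant used in a given BNN may differ):

import torch

class BinarySign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Forward: hard ±1 binarization.
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient through unchanged, clipped to |x| <= 1
        # so that saturated inputs receive no gradient.
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(4, requires_grad=True)
BinarySign.apply(x).sum().backward()
print(x.grad)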
Intro(5/6)
12. Interaction Lab., Seoul National University of Science and Technology
■Techniques to improve the accuracy of BNNs
Straight-Through Estimator (STE)
Parameterized Clipping Activation function (PACT)
• Focused on multi-bit networks (2-bit, 4-bit, ...); see the sketch after this list
Managing the activation distribution
• Binary activations take only two values (+1 or -1)
• Prior work focused on making the binary activation more balanced
Additional activation functions
• Inserted between the binary convolution layer and the following BN layer
• ReLU, PReLU
• Increases the non-linearity of the model
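A minimal sketch of PACT, which clips activations to [0, α] with a learnable clipping level α (the initialization value 6.0 is an assumption):

import torch
from torch import nn

class PACT(nn.Module):
    def __init__(self, alpha_init: float = 6.0):
        super().__init__()
        # Learnable upper clipping bound.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # Clip below at 0 with relu(), above at alpha with minimum();
        # minimum() lets gradients flow into alpha from the saturated region.
        return torch.minimum(torch.relu(x), self.alpha)

print(PACT()(torch.tensor([-1.0, 3.0, 9.0])))  # tensor([0., 3., 6.], ...)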
Intro(6/6)
14. Interaction Lab., Seoul National University of Science and Technology
■Overview
Activation functions in conventional full-precision models, to motivate the use of an unbalanced activation distribution
How to make the distribution of the binary activation unbalanced using threshold shifting in BNNs
Experimental results on the ImageNet dataset
Ineffectiveness of training the threshold of the binary activation in BNNs
Effect of additional activation functions
Method(1/25)
15. Interaction Lab., Seoul National University of Science and Technology
■Motivation for using an unbalanced activation distribution
ReLU is credited with solving the gradient vanishing problem
• However, the gradient vanishing problem is largely diminished by the BN layers in recent models
There might be other reasons for the success of the ReLU function
Method(2/25)
16. Interaction Lab., Seoul National University of Science and Technology
■Making the activation distribution unbalanced using hardtanh
Compare the performance of the hardtanh function and ReLU6
• ReLU6 is a slight variant of the ReLU function whose positive outputs are clipped at 6
Threshold shifting (see the sketch below)
• Shifting along the x-axis, shifting along the y-axis, and increasing the output range
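A minimal sketch of the three hardtanh modifications (the parameter names are assumptions for illustration):

import torch
import torch.nn.functional as F

def modified_hardtanh(x, x_shift=0.0, y_shift=0.0, scale=1.0):
    # x_shift moves the transition region along the x-axis,
    # y_shift moves the output along the y-axis,
    # scale widens the output range beyond [-1, 1].
    return scale * F.hardtanh(x - x_shift) + y_shift

x = torch.linspace(-2, 2, 5)
print(modified_hardtanh(x, x_shift=0.5))  # right-shifted: more negative outputs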
Method(3/25)
16
17. Interaction Lab., Seoul National University of Science and Technology
■Training setup
The VGG-small model was trained on the CIFAR-10 dataset with different activation functions
VGG-small
• Four convolution layers with 64, 64, 128, and 128 output channels
• Three dense layers with 512 units
• A BN layer is placed between each convolution layer and the activation layer
• A max pooling layer follows each of the latter three convolution layers
The model was trained under the same conditions with 10 different random seeds (a model sketch follows)
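A rough sketch of the VGG-small topology as described on this slide (kernel sizes, the flattened dimension, and the 10-way head are assumptions):

import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # Conv -> BN -> activation, with an optional trailing max pool.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.Hardtanh()]  # swapped for ReLU6 etc. across experiments
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

vgg_small = nn.Sequential(
    *conv_block(3, 64, pool=False),
    *conv_block(64, 64, pool=True),
    *conv_block(64, 128, pool=True),
    *conv_block(128, 128, pool=True),   # 32x32 input -> 4x4 after three pools
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 512), nn.Hardtanh(),
    nn.Linear(512, 512), nn.Hardtanh(),
    nn.Linear(512, 10),                 # CIFAR-10 classes
)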
Method(4/25)
18. Interaction Lab., Seoul National University of Science and Technology
■Accuracy gap between hardtanh and ReLU6(1/2)
The red dotted line in the figure denotes the test accuracy of the ReLU6 model
Method(5/25)
19. Interaction Lab., Seoul National University of Science and Technology
■Accuracy gap between hardtanh and ReLU6(2/2)
When the hardtanh function is shifted to the right along the x-axis
• The number of negative outputs increases and the output distribution becomes positively skewed
The higher performance of the ReLU activation function
• May come from the unbalanced distribution of activation outputs
• When the sign function is used, the effect of the activation distribution becomes even more pronounced in BNNs
Method(6/25)
20. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(1/4)
Previous multi-bit quantization methods
• Use ReLU-based quantization, which outputs an unbalanced activation distribution
Previous BNN quantization methods
• Use the sign function as the binary activation, so the distribution of the binary activation is balanced
• The ratio of +1 to -1 is close to 1:1
Method(7/25)
21. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(2/4)
Shifting the activation function along the x-axis helps improve the accuracy of a model
The poor performance is sparked by the balanced activation distribution of the sign function
Shift the sign function along the x-axis (see the sketch below)
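A minimal sketch of the shifted binary activation, sign(x − th), where a fixed threshold th skews the ±1 distribution (here th is a hand-set hyperparameter):

import torch

def shifted_sign(x, th=0.1):
    # th > 0 pushes more outputs to -1, making the distribution unbalanced.
    return torch.where(x >= th, torch.ones_like(x), -torch.ones_like(x))

x = torch.randn(10000)
print((shifted_sign(x) == 1).float().mean())  # below 0.5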
Method(8/25)
22. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(3/4)
VGG-small
• The Max pooling layer makes the distribution positively skewed
Method(9/25)
23. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(4/4)
Effect of max pooling
The max pooling layer is the only layer that introduces asymmetry into the model
• It explains the accuracy difference between shifting the threshold in the positive direction and in the negative direction
• Verified by replacing max pooling with average pooling
Method(10/25)
24. Interaction Lab., Seoul National University of Science and Technology
■Effect of other conditions
Datasets
Models
Initialization methods
Optimizers
Method(11/25)
25. Interaction Lab., Seoul National University of Science and Technology
■Dataset
MNIST
• Binary VGG-small model on the MNIST dataset
• 30 different random seeds; the averaged results are presented
Method(12/25)
26. Interaction Lab., Seoul National University of Science and Technology
■Model architecture
Two additional model architectures
2-layer MLP and LeNet-5
• The 2-layer MLP does not have a max pooling layer in it
Method(13/25)
27. Interaction Lab., Seoul National University of Science and Technology
■Initialization method(1/2)
Models were trained from scratch using Xavier normal initialization
Pretrained full-precision (32-bit floating-point) models are often used to initialize BNN models
Hardtanh + ReLU
Full-precision VGG-small model trained on the CIFAR-10 dataset
Binary VGG-small model initialized with the full-precision pretrained model
Method(14/25)
27
28. Interaction Lab., Seoul National University of Science and Technology
■Initialization method(2/2)
Method(15/25)
29. Interaction Lab., Seoul National University of Science and Technology
■Optimizer
The Adam optimizer is mostly used in BNN training
Replace Adam with SGD
Method(16/25)
30. Interaction Lab., Seoul National University of Science and Technology
■Experiments on ImageNet
Finding the shift amount (see the sketch below)
• Top-1 train accuracy at the first epoch is used as a fast proxy, together with Top-1 validation accuracy
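A hypothetical sketch of such a search: train one epoch per candidate shift and keep the one with the best first-epoch Top-1 train accuracy (build_model and train_one_epoch are assumed helpers, not from the paper):

# build_model(shift) and train_one_epoch(model) are hypothetical helpers:
# the former builds a BNN whose binary activation threshold is shifted by
# `shift`; the latter trains one epoch and returns Top-1 train accuracy.
candidates = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
results = {}
for shift in candidates:
    model = build_model(shift)
    results[shift] = train_one_epoch(model)
best_shift = max(results, key=results.get)
print("selected shift:", best_shift)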
Method(17/25)
31. Interaction Lab., Seoul National University of Science and Technology
■Results on ImageNet
Method(18/25)
32. Interaction Lab., Seoul National University of Science and Technology
■Effect of training the threshold
Training the threshold has only a limited effect on the performance of BNNs, for two reasons:
• Batch normalization bias
• Limited learning capability
Method(19/25)
33. Interaction Lab., Seoul National University of Science and Technology
■Batch normalization bias
A trainable threshold is already covered by the bias value in the BN layer
𝑋 is the input to the BN layer and 𝑌 is the output of the binary activation layer
𝜇, 𝜎, 𝛾, and 𝛽 are the mean, std, weight, and bias of the BN layer, and 𝑡ℎ is the activation threshold
𝛽 and 𝑡ℎ are both initialized to zero in the beginning
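The relation these symbols describe, reconstructed in LaTeX from the definitions above (the paper's exact notation may differ): the threshold folds into the BN bias, so training it adds nothing new.

\[
Y = \operatorname{sign}\!\left(\gamma \frac{X - \mu}{\sigma} + \beta - th\right)
  = \operatorname{sign}\!\left(\gamma \frac{X - \mu}{\sigma} + \beta'\right),
\qquad \beta' = \beta - th
\]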
Method(20/25)
34. Interaction Lab., Seoul National University of Science and Technology
■Limited learning capability
The trainable threshold method
• Mainly provides an initialization point for the threshold
Method(21/25)
35. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(1/4)
Use an additional activation function after the convolution layer in BNNs (see the sketch below)
• BNNs usually lack non-linearity in the model
• The additional activation function is also related to the unbalanced activation distribution
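A minimal sketch of the placement (nn.Conv2d stands in for a hypothetical 1-bit convolution layer; a real BNN binarizes the weights and activations):

import torch.nn as nn

def binary_conv_block(in_ch, out_ch, extra_act=True):
    # Binary conv -> (extra activation) -> BN, as described on this slide.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)]
    if extra_act:
        layers.append(nn.PReLU(out_ch))  # adds non-linearity and skews the distribution
    layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

print(binary_conv_block(64, 64))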
Method(22/25)
36. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(2/4)
Effect of additional PReLU layers after the binary convolution layers in ResNet-20
Method(23/25)
37. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(3/4)
There is accuracy degradation when PReLU is combined with threshold shifting
• The PReLU layers already play a role in distorting the activation distribution
• Further shifting the threshold of the binary activation function is excessive
Method(24/25)
38. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(4/4)
Replace PReLU with LeakyReLU
When the slope is fixed to 1 (see the check below)
• LeakyReLU becomes the identity function
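A quick check that LeakyReLU with slope 1 is the identity, so any remaining gain from PReLU must come from its learned slope:

import torch
import torch.nn as nn

x = torch.randn(8)
# negative_slope=1.0 multiplies negative inputs by 1, i.e. leaves them unchanged.
assert torch.equal(nn.LeakyReLU(negative_slope=1.0)(x), x)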
Method(25/25)
39. Interaction Lab., Seoul National University of Science and Technology
■Analyzed the impact of the activation distribution on the accuracy of BNN models
■Demonstrated that the accuracy of previous BNN models can be further improved by making the binary activation distribution unbalanced
Conclusion