2. Interaction Lab., Seoul National University of Science and Technology
■Degree of error in gaze estimation
Geometric-based
• 0.5° ~ 3°
Appearance-based
• 3° ~ 12°
Feedback
3. Interaction Lab., Seoul National University of Science and Technology
■GazeTR-Hybrid and GazeTR-Conv
Feedback
4. Interaction Lab., Seoul National University of Science and Technology
Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution
Hyungjun Kim, Jihoon Park, Changhun Lee, Jae-Joon Kim
POSTECH, Korea
Computer Vision and Pattern Recognition (CVPR) 2021
Presenter: Jae-Yeop Jeong
5. Interaction Lab., Seoul National University of Science and Technology
■Intro
■Method
■Conclusion
Agenda
7. Interaction Lab., Seoul National University of Science and Technology
■Lightweight Network
Network quantization
Network pruning
Efficient architecture design
Intro(1/6)
8. Interaction Lab., Seoul National University of Science and Technology
■Network quantization
The process of reducing the numerical representation and internal composition of a neural network
Converts floating-point values to integer or fixed-point formats (a minimal sketch follows)
• Lighter model
• Reduced computation
• Efficient hardware use
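A minimal sketch of the general idea, using symmetric 8-bit uniform quantization (this specific scheme is an assumption for illustration, not the paper's method):

import torch

def quantize_int8(x: torch.Tensor):
    # One scale for the whole tensor: map the largest magnitude to 127.
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Approximate reconstruction; the gap to the original is the quantization error.
    return q.float() * scale

x = torch.randn(5)
q, scale = quantize_int8(x)
print(x)
print(dequantize(q, scale))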
Intro(2/6)
9. Interaction Lab., Seoul National University of Science and Technology
■Binary Neural Network
32-bit floating-point weights can be represented by 1-bit (±1) weights
Multiply-accumulate operations reduce to XNOR-and-popcount logic
Weight quantization
Activation quantization
Intro(3/6)
Weight binarization example (32-bit weights mapped to ±1 by the sign function):
[-1.21  0.88 -0.11]        [-1  1 -1]
[ 0.17  1.72 -0.70]  --->  [ 1  1 -1]
[ 0.98 -0.57  0.63]        [ 1 -1  1]
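A minimal sketch of the XNOR-and-popcount trick mentioned above: encoding +1 as bit 1 and -1 as bit 0, the dot product of two ±1 vectors of length n becomes 2·popcount(XNOR(a, b)) − n.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    # XNOR marks the positions where the two ±1 vectors agree.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    # Each agreement contributes +1 and each disagreement -1 to the dot product.
    return 2 * bin(xnor).count("1") - n

# a = [+1, -1, +1], b = [+1, +1, -1]  ->  (+1) + (-1) + (-1) = -1
print(binary_dot(0b101, 0b110, 3))  # -1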
10. Interaction Lab., Seoul National University of Science and Technology
■Accuracy of BNNs
Due to their lightweight nature, BNNs are garnering interest as a promising solution for DNN computing on edge devices
BNNs still suffer from accuracy degradation caused by the aggressive quantization
• Lower accuracy than the full-precision model
Intro(4/6)
11. Interaction Lab., Seoul National University of Science and Technology
■Problems of BNNs
Gradient mismatch problem caused by the non-differentiable binary activation function
• Gradients cannot propagate through the quantization layer in the back-propagation process
• The Straight-Through Estimator (STE) approximates the backward pass instead (see the sketch below)
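A minimal sketch of a sign activation with an STE backward pass (a common formulation; the exact variant used in a given BNN may differ):

import torch

class BinarySign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Forward: hard ±1 binarization.
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient through unchanged, clipped to |x| <= 1
        # so that saturated inputs receive no gradient.
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(4, requires_grad=True)
BinarySign.apply(x).sum().backward()
print(x.grad)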
Intro(5/6)
12. Interaction Lab., Seoul National University of Science and Technology
■Techniques to improve the accuracy of BNNs
Straight-Through Estimator (STE)
Parameterized Clipping Activation function (PACT)
• Focused on multi-bit networks (2-bit, 4-bit, ...); see the sketch after this list
Managing the activation distribution
• Binary activations take only two values (+1 or -1)
• Prior work focused on making the binary activation more balanced
Additional activation functions
• Inserted between the binary convolution layer and the following BN layer
• ReLU, PReLU
• Increases the non-linearity of the model
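A minimal sketch of PACT, which clips activations to [0, α] with a learnable clipping level α (the initialization value 6.0 is an assumption):

import torch
from torch import nn

class PACT(nn.Module):
    def __init__(self, alpha_init: float = 6.0):
        super().__init__()
        # Learnable upper clipping bound.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # Clip below at 0 with relu(), above at alpha with minimum();
        # minimum() lets gradients flow into alpha from the saturated region.
        return torch.minimum(torch.relu(x), self.alpha)

print(PACT()(torch.tensor([-1.0, 3.0, 9.0])))  # tensor([0., 3., 6.], ...)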
Intro(6/6)
14. Interaction Lab., Seoul National University of Science and Technology
■Overview
Activation functions in conventional full-precision models, to motivate the use of an unbalanced activation distribution
How to make the distribution of the binary activation unbalanced using threshold shifting in BNNs
Experimental results on the ImageNet dataset
Ineffectiveness of training the threshold of the binary activation in BNNs
Effect of additional activation functions
Method(1/25)
15. Interaction Lab., Seoul National University of Science and Technology
■Motivation for using an unbalanced activation distribution
ReLU is credited with solving the gradient vanishing problem
• However, the gradient vanishing problem is largely diminished by the BN layers in recent models
There might be other reasons for the success of the ReLU function
Method(2/25)
16. Interaction Lab., Seoul National University of Science and Technology
■Making the activation distribution unbalanced using hardtanh
Compare the performance of the hardtanh function and ReLU6
• ReLU6 is a slight variant of the ReLU function whose positive outputs are clipped at 6
Threshold shifting (see the sketch below)
• Shifting along the x-axis, shifting along the y-axis, and increasing the output range
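A minimal sketch of the three hardtanh modifications (the parameter names are assumptions for illustration):

import torch
import torch.nn.functional as F

def modified_hardtanh(x, x_shift=0.0, y_shift=0.0, scale=1.0):
    # x_shift moves the transition region along the x-axis,
    # y_shift moves the output along the y-axis,
    # scale widens the output range beyond [-1, 1].
    return scale * F.hardtanh(x - x_shift) + y_shift

x = torch.linspace(-2, 2, 5)
print(modified_hardtanh(x, x_shift=0.5))  # right-shifted: more negative outputs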
Method(3/25)
16
17. Interaction Lab., Seoul National University of Science and Technology
■Training setup
The VGG-small model was trained on the CIFAR-10 dataset with different activation functions
VGG-small
• Four convolution layers with 64, 64, 128, and 128 output channels
• Three dense layers with 512 units
• A BN layer is placed between each convolution layer and the activation layer
• A max pooling layer follows each of the latter three convolution layers
The model was trained under the same conditions with 10 different random seeds (a model sketch follows)
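A rough sketch of the VGG-small topology as described on this slide (kernel sizes, the flattened dimension, and the 10-way head are assumptions):

import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # Conv -> BN -> activation, with an optional trailing max pool.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.Hardtanh()]  # swapped for ReLU6 etc. across experiments
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

vgg_small = nn.Sequential(
    *conv_block(3, 64, pool=False),
    *conv_block(64, 64, pool=True),
    *conv_block(64, 128, pool=True),
    *conv_block(128, 128, pool=True),   # 32x32 input -> 4x4 after three pools
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 512), nn.Hardtanh(),
    nn.Linear(512, 512), nn.Hardtanh(),
    nn.Linear(512, 10),                 # CIFAR-10 classes
)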
Method(4/25)
18. Interaction Lab., Seoul National University of Science and Technology
■Accuracy gap between hardtanh and ReLU6(1/2)
The red dotted line in the figure denotes the test accuracy of the ReLU6 model
Method(5/25)
19. Interaction Lab., Seoul National University of Science and Technology
■Accuracy gap between hardtanh and ReLU6(2/2)
When the hardtanh function is shifted to the right along the x-axis
• The number of negative outputs increases and the output distribution becomes positively skewed
The higher performance of the ReLU activation function
• May come from the unbalanced distribution of activation outputs
• When the sign function is used, the effect of the activation distribution becomes even more pronounced in BNNs
Method(6/25)
20. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(1/4)
Previous multi-bit quantization methods
• Use ReLU-based quantization, which outputs an unbalanced activation distribution
Previous BNN quantization methods
• Use the sign function as the binary activation, so the distribution of the binary activation is balanced
• The ratio of +1 to -1 is close to 1:1
Method(7/25)
21. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(2/4)
Shifting the activation function along the x-axis helps improve the accuracy of a model
The poor performance is sparked by the balanced activation distribution of the sign function
Shift the sign function along the x-axis (see the sketch below)
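A minimal sketch of the shifted binary activation, sign(x − th), where a fixed threshold th skews the ±1 distribution (here th is a hand-set hyperparameter):

import torch

def shifted_sign(x, th=0.1):
    # th > 0 pushes more outputs to -1, making the distribution unbalanced.
    return torch.where(x >= th, torch.ones_like(x), -torch.ones_like(x))

x = torch.randn(10000)
print((shifted_sign(x) == 1).float().mean())  # below 0.5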
Method(8/25)
22. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(3/4)
VGG-small
• The Max pooling layer makes the distribution positively skewed
Method(9/25)
23. Interaction Lab., Seoul National University of Science and Technology
■BNNs with unbalanced activation distribution(4/4)
Effect of max pooling
The max pooling layer is the only layer that introduces asymmetry into the model
• It explains the accuracy difference between shifting the threshold in the positive direction and in the negative direction
• Verified by replacing max pooling with average pooling
Method(10/25)
24. Interaction Lab., Seoul National University of Science and Technology
■Effect of other conditions
Datasets
Models
Initialization methods
Optimizers
Method(11/25)
25. Interaction Lab., Seoul National University of Science and Technology
■Dataset
MNIST
• Binary VGG-small model on the MNIST dataset
• 30 different random seeds; the averaged results are presented
Method(12/25)
26. Interaction Lab., Seoul National University of Science and Technology
■Model architecture
Two additional model architectures
2-layer MLP and LeNet-5
• The 2-layer MLP does not have a max pooling layer in it
Method(13/25)
27. Interaction Lab., Seoul National University of Science and Technology
■Initialization method(1/2)
Models were trained from scratch using Xavier normal initialization
Pretrained full-precision (32-bit floating-point) models are often used to initialize BNN models
Hardtanh + ReLU
Full-precision VGG-small model trained on the CIFAR-10 dataset
Binary VGG-small model initialized with the full-precision pretrained model
Method(14/25)
27
28. Interaction Lab., Seoul National University of Science and Technology
■Initialization method(2/2)
Method(15/25)
29. Interaction Lab., Seoul National University of Science and Technology
■Optimizer
The Adam optimizer is mostly used in BNN training
Replace Adam with SGD
Method(16/25)
30. Interaction Lab., Seoul National University of Science and Technology
■Experiments on ImageNet
Finding the shift amount (see the sketch below)
• Top-1 train accuracy at the first epoch is used as a fast proxy, together with Top-1 validation accuracy
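A hypothetical sketch of such a search: train one epoch per candidate shift and keep the one with the best first-epoch Top-1 train accuracy (build_model and train_one_epoch are assumed helpers, not from the paper):

# build_model(shift) and train_one_epoch(model) are hypothetical helpers:
# the former builds a BNN whose binary activation threshold is shifted by
# `shift`; the latter trains one epoch and returns Top-1 train accuracy.
candidates = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
results = {}
for shift in candidates:
    model = build_model(shift)
    results[shift] = train_one_epoch(model)
best_shift = max(results, key=results.get)
print("selected shift:", best_shift)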
Method(17/25)
31. Interaction Lab., Seoul National University of Science and Technology
■Results on ImageNet
Method(18/25)
32. Interaction Lab., Seoul National University of Science and Technology
■Effect of training the threshold
Training the threshold has only a limited effect on the performance of BNNs, for two reasons:
• Batch normalization bias
• Limited learning capability
Method(19/25)
33. Interaction Lab., Seoul National University of Science and Technology
■Batch normalization bias
A trainable threshold is already covered by the bias value in the BN layer
𝑋 is the input to the BN layer and 𝑌 is the output of the binary activation layer
𝜇, 𝜎, 𝛾, and 𝛽 are the mean, std, weight, and bias of the BN layer, and 𝑡ℎ is the activation threshold
𝛽 and 𝑡ℎ are both initialized to zero in the beginning
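The relation these symbols describe, reconstructed in LaTeX from the definitions above (the paper's exact notation may differ): the threshold folds into the BN bias, so training it adds nothing new.

\[
Y = \operatorname{sign}\!\left(\gamma \frac{X - \mu}{\sigma} + \beta - th\right)
  = \operatorname{sign}\!\left(\gamma \frac{X - \mu}{\sigma} + \beta'\right),
\qquad \beta' = \beta - th
\]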
Method(20/25)
34. Interaction Lab., Seoul National University of Science and Technology
■Limited learning capability
The trainable threshold method
• Mainly provides an initialization point for the threshold
Method(21/25)
35. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(1/4)
Use an additional activation function after the convolution layer in BNNs (see the sketch below)
• BNNs usually lack non-linearity in the model
• The additional activation function is also related to the unbalanced activation distribution
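A minimal sketch of the placement (nn.Conv2d stands in for a hypothetical 1-bit convolution layer; a real BNN binarizes the weights and activations):

import torch.nn as nn

def binary_conv_block(in_ch, out_ch, extra_act=True):
    # Binary conv -> (extra activation) -> BN, as described on this slide.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)]
    if extra_act:
        layers.append(nn.PReLU(out_ch))  # adds non-linearity and skews the distribution
    layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

print(binary_conv_block(64, 64))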
Method(22/25)
36. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(2/4)
Effect of additional PReLU layers after the binary convolution layers in ResNet-20
Method(23/25)
37. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(3/4)
There is accuracy degradation when PReLU is combined with threshold shifting
• The PReLU layers already play a role in distorting the activation distribution
• Further shifting the threshold of the binary activation function is excessive
Method(24/25)
38. Interaction Lab., Seoul National University of Science and Technology
■Effect of additional activation function(4/4)
Replace PReLU with LeakyReLU
When the slope is fixed to 1 (see the check below)
• LeakyReLU becomes the identity function
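A quick check that LeakyReLU with slope 1 is the identity, so any remaining gain from PReLU must come from its learned slope:

import torch
import torch.nn as nn

x = torch.randn(8)
# negative_slope=1.0 multiplies negative inputs by 1, i.e. leaves them unchanged.
assert torch.equal(nn.LeakyReLU(negative_slope=1.0)(x), x)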
Method(25/25)
39. Interaction Lab., Seoul National University of Science and Technology
■Analyzed the impact of the activation distribution on the accuracy of BNN models
■Demonstrated that the accuracy of previous BNN models can be further improved by making the binary activation distribution unbalanced
Conclusion