Network Recasting: A Universal Method
for Network Architecture Transformation
Speaker: Joonsang Yu
Speaker
2
Joonsang Yu
Education
• Dept. of EE at POSTECH (B.S)
• Dept. of ECE at SNU (Ph.D. student)
• Advisor: Prof. Kiyoung Choi
Research Interests
• Hardware-friendly DL optimization
• Efficient hardware for DL
Publication
• Hardware: ICCD, DAC, ISOCC
• Machine learning: AAAI
Outline
3
• Related Works
• Network Recasting
• Training Methods
• Experiments
• Conclusion
Related Works
Related Works
5
Hardware architecture
Intel Skylake architecture [1]
NVIDIA Turing architecture [2]
• Traditional computer architectures are not efficient for DNNs.
• NVIDIA introduced Tensor Cores to accelerate DNNs.
Related Works
6
DL accelerator
DianNao architecture [3] ZeNa architecture [4]
• To accelerate neural networks, several dedicated accelerators have also been introduced.
• DNNs consist of simple operations (multiply-accumulate, MAC), so they are easy to accelerate.
• In addition, conditional memory access is also possible thanks to pruning.
Related Works
7
Network architecture
Big-Little architecture [5]
ShuffleNet v2 architecture [6]
• Many network architectures have been introduced to improve performance.
• In addition, much research also focuses on lightweight and computation-efficient CNN architectures.
Related Works
8
Compression (pruning)
Example of pruning method: ThiNet [7]
• Pruning-based network compression methods have been introduced.
• After training, we can remove weak weights or filters.
Related Works
9
Compression (distillation)
Knowledge distillation [8]
• By distilling knowledge from a cumbersome model, a small network can
achieve higher accuracy than with conventional training.
Related Works
10
Compression (distillation)
Deep mutual learning [9]
• By distilling knowledge from a cumbersome model, a small network can
achieve higher accuracy than with conventional training.
Related Works
11
Compression (distillation)
A gift from knowledge distillation [10]
• In addition, knowledge distillation enables reducing the network depth and
transforming the network architecture.
Network Recasting
Network Recasting
13
Network Recasting
• We transform pretrained blocks (source) into new blocks (target).
• The transformation is done by training the target block to generate output
activations (feature maps) similar to those of the source block (a code sketch follows below).
• After training, the source block can be replaced with the target block.
Basic concept of network recasting.
Teacher network Student network
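A minimal PyTorch sketch of this block-wise training (the layer shapes, learning rate, and stand-in inputs below are illustrative assumptions, not the exact configuration used in the paper):

    import torch
    import torch.nn as nn

    # Stand-in blocks: source_block comes from the pretrained teacher (kept frozen),
    # target_block is the new block that will replace it after training.
    source_block = nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
    target_block = nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())

    optimizer = torch.optim.Adam(target_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()

    inputs = [torch.randn(8, 64, 32, 32) for _ in range(10)]  # stand-in block inputs
    for x in inputs:
        with torch.no_grad():
            a_src = source_block(x)          # source (teacher) output activation
        a_tgt = target_block(x)              # target (student) output activation
        loss = mse(a_tgt, a_src)             # match the output feature maps
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # After training, target_block replaces source_block in the network.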
Network Recasting
14
Network Recasting
• Select an arbitrary block and recast it.
Network Recasting
15
Network Recasting
• Train the target block to generate the output activations of the source block.
Network Recasting
16
Network Recasting
• Replace the source block with the trained target block.
Network Recasting
17
Network Recasting
• We can use the network after recasting.
“Dog”
Image via wikipedia
Network Recasting
18
Network Recasting
• The source block can be recast into any kind of block.
Network Recasting
19
Network Recasting
Source Target
Network Recasting
20
Network Recasting
Source Target
Network Recasting
21
Network Recasting
Source Target
Network Recasting
22
Network Recasting
• We can recast an arbitrary source block into an arbitrary target block.
Source Target
Network Recasting
23
Network Recasting
Teacher Student Mixed-architecture network
• We can recast all of the blocks or only some of them.
DenseNet ConvNet
Training Methods
24
Mixed-architecture network
• When we recast the network partially, we obtain a mixed-architecture network.
• A mixed-architecture network combines the advantages of its constituent blocks.
Mixed-architecture network
Image via wikipedia
Bottom Top
Training Methods
Training Methods
26
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block.
256-d 64-d
Training Methods
27
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block.
Dimension mismatch!
256-d 64-d
Training Methods
28
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block
(see the sketch below).
Dimension mismatch!
256-d 64-d
256-d 256-d
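A sketch of this joint training in PyTorch; the channel widths follow the 256-d/64-d example above, and all layer definitions are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Teacher side: the 256-d source block and its (frozen) next block.
    src_block  = nn.Conv2d(64, 256, 3, padding=1)    # source block: 64-d in, 256-d out
    next_block = nn.Conv2d(256, 256, 3, padding=1)   # teacher's next block

    # Student side: the new target block outputs only 64 channels, so its output
    # cannot be compared directly with the 256-d source output (dimension mismatch).
    tgt_block  = nn.Conv2d(64, 64, 3, padding=1)     # target block: 64-d out
    tgt_next   = nn.Conv2d(64, 256, 3, padding=1)    # next block rebuilt for 64-d input

    params    = list(tgt_block.parameters()) + list(tgt_next.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    mse       = nn.MSELoss()

    for x in [torch.randn(8, 64, 32, 32) for _ in range(10)]:  # stand-in inputs
        with torch.no_grad():
            a_ref = next_block(src_block(x))   # teacher activation after the next block (256-d)
        a_stu = tgt_next(tgt_block(x))         # student activation after the next block (256-d)
        loss = mse(a_stu, a_ref)               # dimensions now match
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()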
Training Methods
29
Sequential Recasting
• To recast the entire network, we recast the blocks sequentially.
Teacher Student
Training Methods
32
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Training Methods
33
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Gradient path
is very long
Training Methods
34
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Very short
gradient path
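The whole sequential procedure can be summarized as a loop over blocks. The sketch below is simplified (blocks are assumed to have matching input/output widths, so the dimension-mismatch handling with the next block is omitted); it shows how each step trains only one target block, keeping the gradient path short:

    import copy
    import torch
    import torch.nn as nn

    def recast_sequentially(teacher_blocks, target_blocks, inputs, lr=1e-3):
        """Recast blocks front to back (minimal sketch, not the full implementation)."""
        student = [copy.deepcopy(b) for b in teacher_blocks]   # student starts as the teacher
        mse = nn.MSELoss()
        for i, tgt in enumerate(target_blocks):
            opt = torch.optim.Adam(tgt.parameters(), lr=lr)
            front = nn.Sequential(*student[:i])                # already-recast front part
            for x in inputs:
                with torch.no_grad():
                    a_in  = front(x)                           # input to block i
                    a_ref = teacher_blocks[i](a_in)            # source block's output activation
                loss = mse(tgt(a_in), a_ref)                   # gradient flows through block i only
                opt.zero_grad()
                loss.backward()
                opt.step()
            student[i] = tgt                                   # replace before recasting block i+1
        return nn.Sequential(*student)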
Training Methods
37
Fine-tuning
• After finishing sequential recasting, we use the knowledge distillation
approach to fine-tune the student network.
• We train the student network with the logits of the teacher network and the
ground-truth labels (a loss sketch follows below).
MSE loss for the logits
Cross-entropy loss between the given label and the softmax output
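A sketch of this fine-tuning loss; the weighting factor alpha is an assumed hyperparameter, not taken from the slides:

    import torch.nn as nn

    mse = nn.MSELoss()
    ce  = nn.CrossEntropyLoss()      # applies log-softmax internally

    def finetune_loss(student_logits, teacher_logits, labels, alpha=1.0):
        distill = mse(student_logits, teacher_logits.detach())  # MSE loss for the logits
        hard    = ce(student_logits, labels)                     # cross-entropy with the given label
        return hard + alpha * distill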
Experiments
Experiments
39
Experiments
• Filter reduction
• Vanishing gradient problem
• Actual speed-up on GPU
Experiments
40
Experiments
✓ Filter reduction
• Vanishing gradient problem
• Actual speed-up on GPU
Experiments
41
Filter reduction (Compression)
• Recast a given source block into a smaller target block of the same type (sketched below).
• Network recasting automatically removes redundant filters while reconstructing
the output activations of the source block.
Source Target
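For filter reduction, the target block simply has the same structure as the source block but fewer filters; a sketch (the widths are illustrative, not taken from the experiments):

    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                             nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    source_block = conv_block(64, 256)   # pretrained width
    target_block = conv_block(64, 128)   # same block type, half the filters
    # target_block is then trained block-wise (together with the next block,
    # to resolve the 256-d vs. 128-d mismatch) as described earlier.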
Experiments
42
Visualization of Filter Reduction
• We recast the first block of AlexNet to visualize the filter reduction.
• Our method removes redundant filters without any similarity or
effectiveness criterion.
Visualization of filters in the first layer of AlexNet
Experiments
43
Experiments
• Filter reduction
✓ Vanishing gradient problem
• Actual speed-up on GPU
Experiments
44
Vanishing gradient
• We compare network recasting with knowledge distillation and standard backpropagation.
KD & Backprop Network recasting
Gradient
path
Gradient
path
Experiments
45
Vanishing gradient
• Our method achieves much higher accuracy despite using a deep plain network.
Method       Type   # layers   C10+    C100+
ResNet-56
  Baseline    -       56       7.02    30.89
  Recasting   Conv    29       6.75    32.14
  KD          Conv    29       9.43    33.22
  Backprop    Conv    29      10.61    37.85
Recasting results on the CIFAR datasets (C10+/C100+: test error, %).
Experiments
46
Experiments
• Filter reduction
• Vanishing gradient problem
✓ Actual speed-up on GPU
Experiments
47
Activation load
• Generally, 1x1 convolutions are used to reduce the number of multiplications and parameters.
• However, 1x1 convolutions actually increase activation loads from main
memory, and thus inference time (see the arithmetic example below).
Comparison of # multiplications. Comparison of inference time.
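The effect can be seen with back-of-the-envelope arithmetic; the feature-map size and channel widths below are illustrative assumptions, not numbers from our measurements:

    # Bottleneck block (1x1 -> 3x3 -> 1x1) vs. a single 3x3 convolution
    # on an assumed 56x56x256 feature map.
    H = W = 56
    C, mid = 256, 64

    # Multiplication counts (per output pixel: k*k*C_in multiplies per output channel).
    mults_bottleneck = H * W * (C * mid + 3 * 3 * mid * mid + mid * C)  # ~0.22G
    mults_plain_3x3  = H * W * (3 * 3 * C * C)                          # ~1.85G

    # Activation elements produced by the block (every intermediate map is written to memory).
    acts_bottleneck = H * W * (mid + mid + C)    # ~1.20M elements
    acts_plain_3x3  = H * W * C                  # ~0.80M elements

    print(mults_bottleneck, mults_plain_3x3)  # far fewer multiplications with 1x1 convolutions...
    print(acts_bottleneck, acts_plain_3x3)    # ...but more activation traffic per block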
Experiments
48
Activation load
• To reduce the activation load, we recast the source block into a different block type.
• By transforming the network architecture, we can reduce the inference time.
Smaller
activation
Source Target
Experiments
49
Recasting Results on ILSVRC2012
Recasting results (batch size 64, NVIDIA Titan X (Pascal)).
Method            Type          Top-1     Act/batch          Time/batch
ResNet-50
  Baseline        Bottle        23.85       740.48M (1.0x)   107.17ms (1.0x)
  Recasting(C)    Conv          30.74       161.92M (4.6x)    37.21ms (2.9x)
  Recasting(C+R)  Conv+Bottle   25.00       236.16M (3.1x)    49.97ms (2.1x)
DenseNet-121
  Baseline        Dense         25.57     1,057.28M (1.0x)   111.31ms (1.0x)
  Recasting(R)    Basic         26.42       340.48M (3.1x)    81.17ms (1.4x)
  Recasting(R+D)  Basic+Bottle  24.87       585.60M (1.8x)    88.94ms (1.3x)
Basic: basic residual block; Bottle: bottleneck residual block.
Experiments
53
Previous works
• Many previous works use weight/filter pruning to reduce the number of multiplications and parameters.
• The network architecture is not changed, so many 1x1 convolutions still exist.
• Thus, activation loads remain large.
Limitation of weight/filter pruning.
Experiments
54
Comparison with Previous Works
• Compared with previous works, network recasting achieved the lowest error
rate and the highest actual speed-up.
Comparison with previous works. (batch size is 64, NVIDIA Titan X (pascal))
Conclusion
Conclusion
58
• Network recasting enables transforming a network into a different
architecture type.
• Sequential training of the student network gives better results by
alleviating the vanishing gradient problem.
• Network recasting can remove redundant filters and also
accelerate inference effectively.
✓ We achieved up to a 2.1x inference-time reduction on ResNet-50.
✓ We also achieved up to a 3.2x reduction on VGG-16.
Question
59
Thank you!
If you have any questions, please contact me at
shorm21@dal.snu.ac.kr
Reference
60
• [1] https://wccftech.com/idf15-intel-skylake-analysis-cpu-gpu-microarchitecture-ddr4-memory-impact/3/
• [2] https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/
• [3] Chen, Tianshi, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, 2014.
• [4] Kim, Dongyoung, et al. ZeNA: Zero-aware neural network accelerator. IEEE Design & Test 35.1 (2018): 39-46.
• [5] Chen, Chun-Fu, et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition. In ICLR, 2019.
• [6] Ma, Ningning, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
• [7] Luo, J.-H., et al. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
• [8] Hinton, G., et al. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2014.
• [9] Zhang, Ying, et al. Deep mutual learning. In CVPR, 2018.
• [10] Yim, Junho, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
Appendix
Parameter & Activation load
Appendix
Inference time
Appendix
63
Block Training
Block training method.
Dimension mismatch!
256-d 64-d
256-d 256-d
A: activation
W_T, W_S: parameters of the teacher and the student
N: number of elements in the activation
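The loss equation on this slide was rendered as an image; a plausible reconstruction from the variable definitions above (a hedged guess, not copied from the paper) is a mean-squared error over output activations:

    \mathcal{L}(W_S) \;=\; \frac{1}{N} \,\bigl\lVert A(W_T) - A(W_S) \bigr\rVert_2^2

where A(W_T) and A(W_S) denote the output activations produced with the teacher's and the student's parameters, respectively (taken after the next block during block training, so that the dimensions match).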
Appendix
64
Experimental Setup
• Network recasting was implemented in the PyTorch framework.
• We adopted batch normalization for all networks.
• We used the Xavier initializer in all experiments.
• We used SGD with Nesterov momentum to train the teacher network and used the
Adam optimizer for the network recasting.
• We used the pre-trained ResNet-50, DenseNet-121, and VGG-16 models available from
torchvision.
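A minimal sketch of this setup in PyTorch; the hyperparameter values (learning rates, momentum) are assumptions, and only the optimizer and initializer choices come from the slide:

    import torch
    import torch.nn as nn
    import torchvision

    teacher = torchvision.models.resnet50(pretrained=True)   # pre-trained teacher from torchvision

    def xavier_init(m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)                 # Xavier initializer for new blocks

    # SGD with Nesterov momentum for training a teacher network from scratch.
    sgd = torch.optim.SGD(teacher.parameters(), lr=0.1, momentum=0.9, nesterov=True)

    # Adam for training target blocks during recasting, e.g.:
    # target_block.apply(xavier_init)
    # adam = torch.optim.Adam(target_block.parameters(), lr=1e-3)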
Appendix
Mixed-architecture Network
Appendix
66
Recasting Results on CIFAR