Network Recasting: A Universal Method
for Network Architecture Transformation
Speaker: Joonsang Yu
Speaker
2
Joonsang Yu
Education
• Dept. of EE at POSTECH (B.S)
• Dept. of ECE at SNU (Ph.D. student)
• Advisor: Prof. Kiyoung Choi
Research Interests
• Hardware-friendly DL optimization
• Efficient hardware for DL
Publication
• Hardware: ICCD, DAC, ISOCC
• Machine learning: AAAI
Outline
3
• Related Works
• Network Recasting
• Training Methods
• Experiments
• Conclusion
Related Works
Related Works
5
Hardware architecture
Intel Skylake architecture [1]
NVIDIA Turing architecture [2]
• Traditional computer architectures are not efficient for DNNs.
• NVIDIA introduced Tensor Cores to accelerate DNNs.
Related Works
6
DL accelerator
DianNao architecture [3] ZeNa architecture [4]
• To accelerate neural networks, several dedicated accelerators have also been introduced.
• DNNs consist of simple operations (multiply-accumulate, MAC), so they are easy to accelerate.
• In addition, conditional memory access is also possible thanks to pruning.
Related Works
7
Network architecture
Big-Little architecture [5]
ShuffleNet v2 architecture [6]
• Many network architectures have been introduced to improve performance.
• In addition, much research also focuses on lightweight and computation-efficient CNN architectures.
Related Works
8
Compression (pruning)
Example of pruning method: ThiNet [7]
• Pruning-based network compression methods have been introduced.
• After training, we can remove weak weights or filters.
Related Works
9
Compression (distillation)
Knowledge distillation [8]
• By distilling knowledge from a cumbersome model, a small network can
achieve higher accuracy than with conventional training.
Related Works
10
Compression (distillation)
Deep mutual learning [9]
• By distilling knowledge from a cumbersome model, a small network can
achieve higher accuracy than with conventional training.
Related Works
11
Compression (distillation)
A gift from knowledge distillation [10]
• In addition, knowledge distillation enables reducing the network depth and
transforming the network architecture.
Network Recasting
Network Recasting
13
Network Recasting
• We transform pretrained blocks (source) into new blocks (target).
• The transformation is done by training the target block to generate output
activations (feature maps) similar to those of the source block (a code sketch follows below).
• After training, the source block can be replaced with the target block.
Basic concept of network recasting.
Teacher network Student network
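A minimal PyTorch sketch of this block-wise training (the layer shapes, learning rate, and stand-in inputs below are illustrative assumptions, not the exact configuration used in the paper):

    import torch
    import torch.nn as nn

    # Stand-in blocks: source_block comes from the pretrained teacher (kept frozen),
    # target_block is the new block that will replace it after training.
    source_block = nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
    target_block = nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())

    optimizer = torch.optim.Adam(target_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()

    inputs = [torch.randn(8, 64, 32, 32) for _ in range(10)]  # stand-in block inputs
    for x in inputs:
        with torch.no_grad():
            a_src = source_block(x)          # source (teacher) output activation
        a_tgt = target_block(x)              # target (student) output activation
        loss = mse(a_tgt, a_src)             # match the output feature maps
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # After training, target_block replaces source_block in the network.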
Network Recasting
14
Network Recasting
• Select an arbitrary block and recast it.
Network Recasting
15
Network Recasting
• Train the target block to generate the output activations of the source block.
Network Recasting
16
Network Recasting
• Replace the source block with the trained target block.
Network Recasting
17
Network Recasting
• We can use the network after recasting.
“Dog”
Image via wikipedia
Network Recasting
18
Network Recasting
• The source block can be recast into any kind of block.
Network Recasting
19
Network Recasting
Source Target
Network Recasting
20
Network Recasting
Source Target
Network Recasting
21
Network Recasting
Source Target
Network Recasting
22
Network Recasting
• We can recast an arbitrary source block into an arbitrary target block.
Source Target
Network Recasting
23
Network Recasting
Teacher Student Mixed-architecture network
• We can recast all of the blocks or only some of them.
DenseNet ConvNet
Training Methods
24
Mixed-architecture network
• When we recast the network partially, we obtain a mixed-architecture network.
• A mixed-architecture network combines the advantages of its constituent blocks.
Mixed-architecture network
Image via wikipedia
Bottom Top
Training Methods
Training Methods
26
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block.
256-d 64-d
Training Methods
27
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block.
Dimension mismatch!
256-d 64-d
Training Methods
28
Block Training
• To avoid the dimension mismatch problem, we train the target block together
with the next block, approximating the output activations of the next block
(see the sketch below).
Dimension mismatch!
256-d 64-d
256-d 256-d
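A sketch of this joint training in PyTorch; the channel widths follow the 256-d/64-d example above, and all layer definitions are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Teacher side: the 256-d source block and its (frozen) next block.
    src_block  = nn.Conv2d(64, 256, 3, padding=1)    # source block: 64-d in, 256-d out
    next_block = nn.Conv2d(256, 256, 3, padding=1)   # teacher's next block

    # Student side: the new target block outputs only 64 channels, so its output
    # cannot be compared directly with the 256-d source output (dimension mismatch).
    tgt_block  = nn.Conv2d(64, 64, 3, padding=1)     # target block: 64-d out
    tgt_next   = nn.Conv2d(64, 256, 3, padding=1)    # next block rebuilt for 64-d input

    params    = list(tgt_block.parameters()) + list(tgt_next.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    mse       = nn.MSELoss()

    for x in [torch.randn(8, 64, 32, 32) for _ in range(10)]:  # stand-in inputs
        with torch.no_grad():
            a_ref = next_block(src_block(x))   # teacher activation after the next block (256-d)
        a_stu = tgt_next(tgt_block(x))         # student activation after the next block (256-d)
        loss = mse(a_stu, a_ref)               # dimensions now match
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()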
Training Methods
29
Sequential Recasting
• To recast the entire network, we recast the blocks sequentially.
Teacher Student
Training Methods
32
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Training Methods
33
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Gradient path
is very long
Training Methods
34
Sequential Recasting
• Sequential recasting can alleviate the vanishing gradient problem.
Teacher Student
Very short
gradient path
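The whole sequential procedure can be summarized as a loop over blocks. The sketch below is simplified (blocks are assumed to have matching input/output widths, so the dimension-mismatch handling with the next block is omitted); it shows how each step trains only one target block, keeping the gradient path short:

    import copy
    import torch
    import torch.nn as nn

    def recast_sequentially(teacher_blocks, target_blocks, inputs, lr=1e-3):
        """Recast blocks front to back (minimal sketch, not the full implementation)."""
        student = [copy.deepcopy(b) for b in teacher_blocks]   # student starts as the teacher
        mse = nn.MSELoss()
        for i, tgt in enumerate(target_blocks):
            opt = torch.optim.Adam(tgt.parameters(), lr=lr)
            front = nn.Sequential(*student[:i])                # already-recast front part
            for x in inputs:
                with torch.no_grad():
                    a_in  = front(x)                           # input to block i
                    a_ref = teacher_blocks[i](a_in)            # source block's output activation
                loss = mse(tgt(a_in), a_ref)                   # gradient flows through block i only
                opt.zero_grad()
                loss.backward()
                opt.step()
            student[i] = tgt                                   # replace before recasting block i+1
        return nn.Sequential(*student)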
Training Methods
37
Fine-tuning
• After finishing sequential recasting, we use the knowledge distillation
approach to fine-tune the student network.
• We train the student network with the logits of the teacher network and the
ground-truth labels (a loss sketch follows below).
MSE loss for the logits
Cross-entropy loss between the given label and the softmax output
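A sketch of this fine-tuning loss; the weighting factor alpha is an assumed hyperparameter, not taken from the slides:

    import torch.nn as nn

    mse = nn.MSELoss()
    ce  = nn.CrossEntropyLoss()      # applies log-softmax internally

    def finetune_loss(student_logits, teacher_logits, labels, alpha=1.0):
        distill = mse(student_logits, teacher_logits.detach())  # MSE loss for the logits
        hard    = ce(student_logits, labels)                     # cross-entropy with the given label
        return hard + alpha * distill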
Experiments
Experiments
39
Experiments
• Filter reduction
• Vanishing gradient problem
• Actual speed-up on GPU
Experiments
40
Experiments
✓ Filter reduction
• Vanishing gradient problem
• Actual speed-up on GPU
Experiments
41
Filter reduction (Compression)
• Recast a given source block into a smaller target block of the same type (sketched below).
• Network recasting automatically removes redundant filters while reconstructing
the output activations of the source block.
Source Target
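For filter reduction, the target block simply has the same structure as the source block but fewer filters; a sketch (the widths are illustrative, not taken from the experiments):

    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                             nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    source_block = conv_block(64, 256)   # pretrained width
    target_block = conv_block(64, 128)   # same block type, half the filters
    # target_block is then trained block-wise (together with the next block,
    # to resolve the 256-d vs. 128-d mismatch) as described earlier.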
Experiments
42
Visualization of Filter Reduction
• We recast the first block of AlexNet to visualize the filter reduction.
• Our method removes redundant filters without any similarity or
effectiveness criterion.
Visualization of filters in the first layer of AlexNet
Experiments
43
Experiments
• Filter reduction
✓ Vanishing gradient problem
• Actual speed-up on GPU
Experiments
44
Vanishing gradient
• We compare network recasting with knowledge distillation and standard backpropagation.
KD & Backprop Network recasting
Gradient
path
Gradient
path
Experiments
45
Vanishing gradient
• Our method achieves much higher accuracy despite using a deep plain network.
Method       Type   # layers   C10+    C100+
ResNet-56
  Baseline    -       56       7.02    30.89
  Recasting   Conv    29       6.75    32.14
  KD          Conv    29       9.43    33.22
  Backprop    Conv    29      10.61    37.85
Recasting results on the CIFAR datasets (C10+/C100+: test error, %).
Experiments
46
Experiments
• Filter reduction
• Vanishing gradient problem
✓ Actual speed-up on GPU
Experiments
47
Activation load
• Generally, 1x1 convolutions are used to reduce the number of multiplications and parameters.
• However, 1x1 convolutions actually increase activation loads from main
memory, and thus inference time (see the arithmetic example below).
Comparison of # multiplications. Comparison of inference time.
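The effect can be seen with back-of-the-envelope arithmetic; the feature-map size and channel widths below are illustrative assumptions, not numbers from our measurements:

    # Bottleneck block (1x1 -> 3x3 -> 1x1) vs. a single 3x3 convolution
    # on an assumed 56x56x256 feature map.
    H = W = 56
    C, mid = 256, 64

    # Multiplication counts (per output pixel: k*k*C_in multiplies per output channel).
    mults_bottleneck = H * W * (C * mid + 3 * 3 * mid * mid + mid * C)  # ~0.22G
    mults_plain_3x3  = H * W * (3 * 3 * C * C)                          # ~1.85G

    # Activation elements produced by the block (every intermediate map is written to memory).
    acts_bottleneck = H * W * (mid + mid + C)    # ~1.20M elements
    acts_plain_3x3  = H * W * C                  # ~0.80M elements

    print(mults_bottleneck, mults_plain_3x3)  # far fewer multiplications with 1x1 convolutions...
    print(acts_bottleneck, acts_plain_3x3)    # ...but more activation traffic per block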
Experiments
48
Activation load
• To reduce the activation load, we recast the source block into a different block type.
• By transforming the network architecture, we can reduce the inference time.
Smaller
activation
Source Target
Experiments
49
Recasting Results on ILSVRC2012
Recasting results (batch size 64, NVIDIA Titan X (Pascal)).
Method            Type          Top-1     Act/batch          Time/batch
ResNet-50
  Baseline        Bottle        23.85       740.48M (1.0x)   107.17ms (1.0x)
  Recasting(C)    Conv          30.74       161.92M (4.6x)    37.21ms (2.9x)
  Recasting(C+R)  Conv+Bottle   25.00       236.16M (3.1x)    49.97ms (2.1x)
DenseNet-121
  Baseline        Dense         25.57     1,057.28M (1.0x)   111.31ms (1.0x)
  Recasting(R)    Basic         26.42       340.48M (3.1x)    81.17ms (1.4x)
  Recasting(R+D)  Basic+Bottle  24.87       585.60M (1.8x)    88.94ms (1.3x)
Basic: basic residual block; Bottle: bottleneck residual block.
Experiments
53
Previous works
• Many previous works use weight/filter pruning to reduce the number of multiplications and parameters.
• The network architecture is not changed, so many 1x1 convolutions still exist.
• Thus, activation loads remain large.
Limitation of weight/filter pruning.
Experiments
54
Comparison with Previous Works
• Compared with previous works, network recasting achieved the lowest error
rate and the highest actual speed-up.
Comparison with previous works. (batch size is 64, NVIDIA Titan X (pascal))
Conclusion
Conclusion
58
• Network recasting enables transforming a network into a different
architecture type.
• Sequential training of the student network gives better results by
alleviating the vanishing gradient problem.
• Network recasting can remove redundant filters and also
accelerate inference effectively.
✓ We achieved up to a 2.1x inference-time reduction on ResNet-50.
✓ We also achieved up to a 3.2x reduction on VGG-16.
Question
59
Thank you!
If you have any questions, please contact me at
shorm21@dal.snu.ac.kr
Reference
60
• [1] https://wccftech.com/idf15-intel-skylake-analysis-cpu-gpu-microarchitecture-ddr4-memory-impact/3/
• [2] https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/
• [3] Chen, Tianshi, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, 2014.
• [4] Kim, Dongyoung, et al. ZeNA: Zero-aware neural network accelerator. IEEE Design & Test 35.1 (2018): 39-46.
• [5] Chen, Chun-Fu, et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition. In ICLR, 2019.
• [6] Ma, Ningning, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
• [7] Luo, J.-H., et al. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
• [8] Hinton, G., et al. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2014.
• [9] Zhang, Ying, et al. Deep mutual learning. In CVPR, 2018.
• [10] Yim, Junho, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
Appendix
Parameter & Activation load
Appendix
Inference time
Appendix
63
Block Training
Block training method.
Dimension mismatch!
256-d 64-d
256-d 256-d
A: activation
W_T, W_S: parameters of the teacher and the student
N: number of elements in the activation
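The loss equation on this slide was rendered as an image; a plausible reconstruction from the variable definitions above (a hedged guess, not copied from the paper) is a mean-squared error over output activations:

    \mathcal{L}(W_S) \;=\; \frac{1}{N} \,\bigl\lVert A(W_T) - A(W_S) \bigr\rVert_2^2

where A(W_T) and A(W_S) denote the output activations produced with the teacher's and the student's parameters, respectively (taken after the next block during block training, so that the dimensions match).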
Appendix
64
Experimental Setup
• Network recasting was implemented in the PyTorch framework.
• We adopted batch normalization for all networks.
• We used the Xavier initializer in all experiments.
• We used SGD with Nesterov momentum to train the teacher network and used the
Adam optimizer for the network recasting.
• We used the pre-trained ResNet-50, DenseNet-121, and VGG-16 models available from
torchvision.
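A minimal sketch of this setup in PyTorch; the hyperparameter values (learning rates, momentum) are assumptions, and only the optimizer and initializer choices come from the slide:

    import torch
    import torch.nn as nn
    import torchvision

    teacher = torchvision.models.resnet50(pretrained=True)   # pre-trained teacher from torchvision

    def xavier_init(m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)                 # Xavier initializer for new blocks

    # SGD with Nesterov momentum for training a teacher network from scratch.
    sgd = torch.optim.SGD(teacher.parameters(), lr=0.1, momentum=0.9, nesterov=True)

    # Adam for training target blocks during recasting, e.g.:
    # target_block.apply(xavier_init)
    # adam = torch.optim.Adam(target_block.parameters(), lr=1e-3)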
Appendix
Mixed-architecture Network
Appendix
66
Recasting Results on CIFAR