Neural network pruning with residual connections and limited-data review [cdm]
1. Neural Network Pruning with Residual-Connections and Limited-Data
Dong Min Choi
Yonsei University Severance Hospital CCIDS
2. Introduction
• Filter-level pruning
- an effective method to accelerate inference
- Problems
1) How to prune residual connections?
2) How to prune with limited data? (directly pruning on the small dataset is worse than fine-tuning)
S. Han et al. Learning both Weights and Connections for Efficient Neural Networks. arXiv:1506.02626
3. Introduction
• Pruning residual connections
- Most methods only prune filters inside the residual branch, leaving the number of output channels unchanged
- The pruned block then becomes an hourglass (the middle layers are squeezed while the block's output stays wide)
- Therefore, pruning channels both inside and outside the residual branch is preferred
(the block keeps a bottleneck-like, "opened wallet" shape; see the sketch below)
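To make the hourglass/wallet contrast concrete, here is a minimal PyTorch sketch (not from the slides; the channel counts are invented purely for illustration):

# "Hourglass" pruning shrinks only the channels inside the residual branch;
# "wallet" pruning also shrinks the block's output channels, which means the
# identity/shortcut path has to be pruned consistently as well.
import torch.nn as nn

def bottleneck(c_in, c_mid, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(),
        nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(),
        nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
    )

original  = bottleneck(256, 64, 256)   # unpruned ResNet-style bottleneck block
hourglass = bottleneck(256, 32, 256)   # inner channels reduced, output unchanged
wallet    = bottleneck(192, 48, 192)   # inner AND outer channels reduced
# In the wallet case the shortcut tensor must also carry 192 channels, so every
# block sharing that shortcut is pruned together - hence the larger pruning space.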
4. Introduction
• Pruning residual connections
- Advantages of the wallet structure compared with the hourglass
1) more accurate, thanks to a larger pruning space
2) faster, even with the same number of FLOPs
3) saves more storage, because more weights are pruned
5. Introduction
• Pruning with limited data
- Method 1 : Fine-tuning
- Method 2 : directly prune the model without the large dataset
⇨ Method 2 usually has a significantly lower accuracy than Method 1
6. Introduction
CURL (Compression Using Residual-connections and Limited-data)
• Pruning Residual Connection
- Prune not only channels inside the residual branch, but also channels of its output activation maps (both the identity branch and the residual branch)
- The resulting wallet-shaped structure shows more advantages
• Pruning with limited data
- Combining data augmentation and knowledge distillation
- A label refinement strategy
7. Method
1. Pruning Residual-Connections
• Most previous studies only focus on reducing channels inside the residual block
• To prune the residual block, a new criterion that can evaluate multiple filters simultaneously should be designed (removing one output channel of the block touches the corresponding filters in several layers at once)
8. Method
1. Pruning Residual-Connections
• Idea : (1) remove the channels one by one and (2) calculate the information loss
- (1) : set γ and β of the BN layers to 0 for the target channel
- (2) : randomly select 256 images from the training dataset and compare the two prediction probabilities (before vs. after zeroing the channel) using KL-divergence
Let's reduce the output channels!
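As a rough illustration of this scoring step, here is a minimal PyTorch sketch (assuming hypothetical model, bn and images objects; this is not the authors' implementation):

import torch
import torch.nn.functional as F

@torch.no_grad()
def channel_importance(model, bn, images):
    """Score each channel of one BatchNorm layer by the KL-divergence between
    the model's predictions before and after zeroing that channel.
    model, bn and images (the ~256 sampled training images) are assumed to
    exist; this only sketches the idea described on the slide."""
    model.eval()
    base = F.log_softmax(model(images), dim=1)           # reference predictions
    scores = []
    for c in range(bn.weight.numel()):
        g, b = bn.weight[c].clone(), bn.bias[c].clone()
        bn.weight[c] = 0.0                               # "remove" channel c by
        bn.bias[c] = 0.0                                 # zeroing gamma and beta
        removed = F.softmax(model(images), dim=1)
        scores.append(F.kl_div(base, removed, reduction="batchmean").item())
        bn.weight[c], bn.bias[c] = g, b                  # restore the channel
    return scores                                        # higher = more important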
11. Method
1. Pruning Residual-Connections
• Idea : (1) remove the channels one by one and (2) calculate the information loss
• Repeat this step for every output channel (256 times in the illustrated example), resulting in one importance score per channel
• For channels inside the residual block, only one filter needs to be erased at each step
• The top-k filters in this ranking (the least important ones) are removed, leading to a pruned small model
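A small, self-contained sketch of the ranking-and-removal step (the importance scores and the 25% pruning ratio below are made-up placeholders):

import numpy as np

def select_channels_to_prune(scores, ratio=0.25):
    """Given per-channel importance scores (e.g. the KL values computed as
    above), return the indices of the least important channels to remove.
    The pruning ratio is an arbitrary placeholder, not CURL's setting."""
    scores = np.asarray(scores)
    k = int(ratio * len(scores))
    return np.argsort(scores)[:k]          # smallest score = least information loss

# Example with 8 fake scores: the two channels with the lowest scores are pruned.
print(select_channels_to_prune([0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2]))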
12. Method
2. Prune with Limited Data
• The pruned small model is then fine-tuned on the target dataset
• Fine-tuning
- Data augmentation
- Knowledge distillation
13. Method
2. Prune with Limited Data
• Data Augmentation
Motivation : most discriminative information often lies in local image patches rather than in the global image
⇨ The small dataset is therefore expanded by cropping local patches from the training images (see the sketch below)
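A hedged sketch of that expansion idea using standard torchvision transforms (the crop scale and the number of copies per image are assumptions, not the paper's settings):

from PIL import Image
from torchvision import transforms

# Take local patches instead of the whole image, so each small-dataset image
# yields several training samples focused on local, discriminative regions.
crop_patch = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 0.8)),
    transforms.RandomHorizontalFlip(),
])

def expand_dataset(image_paths, copies=4):
    """Create several cropped variants of every image in the small dataset."""
    expanded = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        expanded.extend(crop_patch(img) for _ in range(copies))
    return expanded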
14. Method
2. Prune with Limited Data
• Label Refinement
- Fine-tuning & Knowledge Distillation (KD)
- Problem : because the teacher model has not seen the new (augmented) data, its output logits may be noisy
- Solution : update the noisy logits during training via SGD
- Two Steps
1) Fine-tuning on the original small dataset with KD plus mixup
2) Fine-tuning on the expanded dataset with label refinement
15. Method
2. Prune with Limited Data
• Label Refinement
Step 1) Knowledge Distillation with Mixup
- A new input is generated via mixup (formula below)
- Knowledge distillation is applied with this new input (formula below)
- With these two techniques, the pruned small model can converge to a good local minimum
https://blog.airlab.re.kr/2019/11/mixup
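For reference, the standard formulations these two bullets refer to (the exact loss weighting used by CURL is not shown on the slide, so the second line is only the generic form):

% mixup: blend two training inputs, with the mixing weight drawn from a Beta distribution
\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)
% generic knowledge-distillation loss on the mixed input: the pruned student S
% mimics the soft predictions of the unpruned teacher T via KL-divergence
\mathcal{L}_{\mathrm{KD}} = \mathrm{KL}\big( p_T(\hat{x}) \,\|\, p_S(\hat{x}) \big)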
16. Method
2. Prune with Limited Data
• Label Refinement
Step 2) Knowledge Distillation with Label Refinement
- Fine-tune the small model on the expanded dataset and update the logits to remove label noise
- The soft-target of each image is first extracted from the teacher and stored
- The stored soft-targets are then updated via SGD during training (see the sketch below)
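A hedged PyTorch sketch of this refinement loop (assuming teacher, student and an images tensor holding the expanded small dataset already exist; learning rate, batch size and epoch count are placeholders, not CURL's values):

import torch
import torch.nn.functional as F

# Step A: extract and store the teacher's (possibly noisy) logits once.
with torch.no_grad():
    stored_logits = teacher(images).clone()
stored_logits.requires_grad_(True)                  # make the soft labels learnable

opt = torch.optim.SGD(list(student.parameters()) + [stored_logits], lr=1e-3)

# Step B: fine-tune the pruned student against the stored soft labels while SGD
# simultaneously updates (refines) those labels, removing part of the label noise.
for epoch in range(100):                            # placeholder epoch count
    for idx in torch.randperm(len(images)).split(32):
        log_p = F.log_softmax(student(images[idx]), dim=1)
        q = F.softmax(stored_logits[idx], dim=1)    # current refined soft target
        # KL(q || p) written out so gradients reach both the student and the labels
        loss = (q * (q.clamp_min(1e-8).log() - log_p)).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()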
17. Experiments
1. Pruning ResNet50 on ImageNet
* Actual inference speed test (NVIDIA Tesla M40 GPU, mini-batch of 256 images)
- Hourglass (AutoPruner) : 0.21s
- Wallet (CURL, MACs : 1.39G, #Param : 7.83M) : 0.19s