Neural network pruning with residual connections and limited-data review [cdm]
1. Neural Network Pruning with Residual-Connections and Limited-Data
Dong Min Choi
Yonsei University Severance Hospital CCIDS
2. Introduction
• Filter-level pruning
- an effective method to accelerate inference
- Problems
1) How to prune residual connections?
2) How to prune with limited data? (directly pruning on the small dataset is worse than fine-tuning)
S. Han et al. Learning both Weights and Connections for Efficient Neural Networks. arXiv:1506.02626
3. Introduction
• Pruning residual connections
- Most methods only prune filters inside the residual branch, leaving the number of output channels unchanged
- The pruned block then becomes an hourglass (the middle layers are squeezed while the block's output stays wide)
- Therefore, pruning channels both inside and outside the residual branch is preferred
(the block keeps a bottleneck-like, "opened wallet" shape; see the sketch below)
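To make the hourglass/wallet contrast concrete, here is a minimal PyTorch sketch (not from the slides; the channel counts are invented purely for illustration):

# "Hourglass" pruning shrinks only the channels inside the residual branch;
# "wallet" pruning also shrinks the block's output channels, which means the
# identity/shortcut path has to be pruned consistently as well.
import torch.nn as nn

def bottleneck(c_in, c_mid, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(),
        nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(),
        nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
    )

original  = bottleneck(256, 64, 256)   # unpruned ResNet-style bottleneck block
hourglass = bottleneck(256, 32, 256)   # inner channels reduced, output unchanged
wallet    = bottleneck(192, 48, 192)   # inner AND outer channels reduced
# In the wallet case the shortcut tensor must also carry 192 channels, so every
# block sharing that shortcut is pruned together - hence the larger pruning space.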
4. Introduction
• Pruning residual connections
- Advantages of the wallet structure compared with the hourglass
1) more accurate, thanks to a larger pruning space
2) faster, even with the same number of FLOPs
3) saves more storage, because more weights are pruned
5. Introduction
• Pruning with limited data
- Method 1 : Fine-tuning
- Method 2 : directly prune the model without the large dataset
⇨ Method 2 usually has a significantly lower accuracy than Method 1
6. Introduction
CURL (Compression Using Residual-connections and Limited-data)
• Pruning Residual Connection
- Prune not only channels inside the residual branch, but also channels of its output activation maps (both the identity branch and the residual branch)
- The resulting wallet-shaped structure shows more advantages
• Pruning with limited data
- Combining data augmentation and knowledge distillation
- A label refinement strategy
7. Method
1. Pruning Residual-Connections
• Most previous studies only focus on reducing channels inside the residual block
• To prune the residual block, a new criterion that can evaluate multiple filters simultaneously should be designed (removing one output channel of the block touches the corresponding filters in several layers at once)
8. Method
1. Pruning Residual-Connections
• Idea : (1) remove the channels one by one and (2) calculate the information loss
- (1) : set γ and β of the BN layers to 0 for the target channel
- (2) : randomly select 256 images from the training dataset and compare the two prediction probabilities (before vs. after zeroing the channel) using KL-divergence
Let's reduce the output channels!
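As a rough illustration of this scoring step, here is a minimal PyTorch sketch (assuming hypothetical model, bn and images objects; this is not the authors' implementation):

import torch
import torch.nn.functional as F

@torch.no_grad()
def channel_importance(model, bn, images):
    """Score each channel of one BatchNorm layer by the KL-divergence between
    the model's predictions before and after zeroing that channel.
    model, bn and images (the ~256 sampled training images) are assumed to
    exist; this only sketches the idea described on the slide."""
    model.eval()
    base = F.log_softmax(model(images), dim=1)           # reference predictions
    scores = []
    for c in range(bn.weight.numel()):
        g, b = bn.weight[c].clone(), bn.bias[c].clone()
        bn.weight[c] = 0.0                               # "remove" channel c by
        bn.bias[c] = 0.0                                 # zeroing gamma and beta
        removed = F.softmax(model(images), dim=1)
        scores.append(F.kl_div(base, removed, reduction="batchmean").item())
        bn.weight[c], bn.bias[c] = g, b                  # restore the channel
    return scores                                        # higher = more important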
11. Method
1. Pruning Residual-Connections
• Idea : (1) remove the channels one by one and (2) calculate the information loss
• Repeat this step for every output channel (256 times in the illustrated example), resulting in one importance score per channel
• For channels inside the residual block, only one filter needs to be erased at each step
• The top-k filters in this ranking (the least important ones) are removed, leading to a pruned small model
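A small, self-contained sketch of the ranking-and-removal step (the importance scores and the 25% pruning ratio below are made-up placeholders):

import numpy as np

def select_channels_to_prune(scores, ratio=0.25):
    """Given per-channel importance scores (e.g. the KL values computed as
    above), return the indices of the least important channels to remove.
    The pruning ratio is an arbitrary placeholder, not CURL's setting."""
    scores = np.asarray(scores)
    k = int(ratio * len(scores))
    return np.argsort(scores)[:k]          # smallest score = least information loss

# Example with 8 fake scores: the two channels with the lowest scores are pruned.
print(select_channels_to_prune([0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2]))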
12. Method
2. Prune with Limited Data
• The pruned small model is then fine-tuned on the target dataset
• Fine-tuning
- Data augmentation
- Knowledge distillation
13. Method
2. Prune with Limited Data
• Data Augmentation
Motivation : most discriminative information often lies in local image patches rather than in the global image
⇨ The small dataset is therefore expanded by cropping local patches from the training images (see the sketch below)
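A hedged sketch of that expansion idea using standard torchvision transforms (the crop scale and the number of copies per image are assumptions, not the paper's settings):

from PIL import Image
from torchvision import transforms

# Take local patches instead of the whole image, so each small-dataset image
# yields several training samples focused on local, discriminative regions.
crop_patch = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 0.8)),
    transforms.RandomHorizontalFlip(),
])

def expand_dataset(image_paths, copies=4):
    """Create several cropped variants of every image in the small dataset."""
    expanded = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        expanded.extend(crop_patch(img) for _ in range(copies))
    return expanded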
14. Method
2. Prune with Limited Data
• Label Refinement
- Fine-tuning & Knowledge Distillation (KD)
- Problem : because the teacher model has not seen the new (augmented) data, its output logits may be noisy
- Solution : update the noisy logits during training via SGD
- Two Steps
1) Fine-tuning on the original small dataset with KD plus mixup
2) Fine-tuning on the expanded dataset with label refinement
15. Method
2. Prune with Limited Data
• Label Refinement
Step 1) Knowledge Distillation with Mixup
- A new input is generated via mixup (formula below)
- Knowledge distillation is applied with this new input (formula below)
- With these two techniques, the pruned small model can converge to a good local minimum
https://blog.airlab.re.kr/2019/11/mixup
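For reference, the standard formulations these two bullets refer to (the exact loss weighting used by CURL is not shown on the slide, so the second line is only the generic form):

% mixup: blend two training inputs, with the mixing weight drawn from a Beta distribution
\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)
% generic knowledge-distillation loss on the mixed input: the pruned student S
% mimics the soft predictions of the unpruned teacher T via KL-divergence
\mathcal{L}_{\mathrm{KD}} = \mathrm{KL}\big( p_T(\hat{x}) \,\|\, p_S(\hat{x}) \big)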
16. Method
2. Prune with Limited Data
• Label Refinement
Step 2) Knowledge Distillation with Label Refinement
- Fine-tune the small model on the expanded dataset and update the logits to remove label noise
- The soft-target of each image is first extracted from the teacher and stored
- The stored soft-targets are then updated via SGD during training (see the sketch below)
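A hedged PyTorch sketch of this refinement loop (assuming teacher, student and an images tensor holding the expanded small dataset already exist; learning rate, batch size and epoch count are placeholders, not CURL's values):

import torch
import torch.nn.functional as F

# Step A: extract and store the teacher's (possibly noisy) logits once.
with torch.no_grad():
    stored_logits = teacher(images).clone()
stored_logits.requires_grad_(True)                  # make the soft labels learnable

opt = torch.optim.SGD(list(student.parameters()) + [stored_logits], lr=1e-3)

# Step B: fine-tune the pruned student against the stored soft labels while SGD
# simultaneously updates (refines) those labels, removing part of the label noise.
for epoch in range(100):                            # placeholder epoch count
    for idx in torch.randperm(len(images)).split(32):
        log_p = F.log_softmax(student(images[idx]), dim=1)
        q = F.softmax(stored_logits[idx], dim=1)    # current refined soft target
        # KL(q || p) written out so gradients reach both the student and the labels
        loss = (q * (q.clamp_min(1e-8).log() - log_p)).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()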
17. Experiments
1. Pruning ResNet50 on ImageNet
* Actual inference speed test (NVIDIA Tesla M40 GPU, mini-batch of 256 images)
- Hourglass (AutoPruner) : 0.21s
- Wallet (CURL, MACs : 1.39G, #Param : 7.83M) : 0.19s