This document discusses architecture-aware knowledge distillation (AKD), which uses reinforcement learning to find the optimal student network architecture for distilling knowledge from a given teacher model. It introduces AKD and explains how it guides the search for the best student architectures under latency constraints. Preliminary experiments show AKD networks achieve state-of-the-art results on ImageNet classification and generalize well to other tasks like face recognition and ensemble learning.
2. Contents
1. Introduction
2. Knowledge distillation
3. Architecture-aware knowledge distillation
4. Understanding the structural knowledge
5. Preliminary experiments on ImageNet
6. Towards million-level face retrieval
7. Neural architecture ensemble by AKD
8. Conclusion & further thought
3. Introduction
• Neural Architecture Search (NAS)
• Automates the process of neural architecture design via reinforcement learning, differentiable
search, evolutionary search, and other algorithms
• Knowledge Distillation (KD)
• Trains a (usually small) student neural network using the supervision of a (relatively large)
teacher network
• Previous works on KD mostly focus on transferring the teacher’s knowledge to a student with a
predefined architecture
• The optimal student architectures for different teacher models trained on the same task and
dataset might be different!
4. Introduction
• Architecture-aware Knowledge Distillation (AKD)
• Finds best student architectures for distilling the given teacher model
• A Reinforcement Learning (RL) based NAS process with a KD-based reward function
• Achieves state-of-the-art results on the ImageNet classification task under several latency
settings
• The optimal architecture obtained by AKD for the ImageNet classification task generalizes well to
other tasks such as million-level face recognition and ensemble learning
5. Knowledge distillation
• Knowledge in a neural network
• Input space: I, output space: O
• An ideal model is a connotative mapping function f : x ↦ y, x ∈ I, y ∈ O
• The model's conditional probability function: p(y|x)
• The knowledge of a neural network f̂ : x ↦ ŷ, x ∈ I, ŷ ∈ O
• The network's conditional probability function: p̂(ŷ|x)
• The difference between p̂(ŷ|x) and p(y|x) is the dark part of the neural network's knowledge
• Margin between classes
• One-hot targets y constrain the angular distances between every pair of classes to be the same (90°)
• Similar classes/samples should have smaller angular distances
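The 90° claim can be checked directly: any two distinct one-hot target vectors are orthogonal, so their angular distance is exactly 90°. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def angular_distance_deg(u, v):
    # angle between two vectors, in degrees
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# any two distinct one-hot targets are orthogonal, hence 90 degrees apart
d = angular_distance_deg([1, 0, 0], [0, 1, 0])   # 90.0
```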
6. Knowledge distillation
• Naïve distillation
• Dark knowledge distillation
• A student model trained with the objective of matching the full softmax distribution of the teacher model
• [8] : the distribution of logits of the wrong responses carries information on the similarity between
output categories
• [3] : the soft-target distribution acts as an importance-sampling weight based on the teacher's confidence
in its maximum value
• [42] : the posterior-entropy viewpoint, claiming that soft targets bring robustness by regularizing toward a
much more informed choice of alternatives than blind entropy regularization
7. Knowledge distillation
• Teacher-Student relationship
• Are all student networks equally capable of receiving knowledge from different teachers?
• Distilling the same teacher model to different students leads to different performance results, and no
single student architecture produces the best results across all teacher networks
8. Knowledge distillation
• Teacher-Student relationship
• Distribution
• 𝑇(𝐴) & 𝑇(𝐵) show the lowest KL divergence among all teacher pairs, which means their output
distributions are the closest
• 𝑆2 is the pearl (best student) in the eye of 𝑇(𝐴), whereas 𝑆2 is the last choice for 𝑇(𝐵)
• KL divergence over outputs is therefore too coarse; the distribution needs to be disentangled at a finer granularity
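The teacher-to-teacher comparison above relies on KL divergence between output distributions; a minimal sketch (the distributions here are made-up numbers, not the paper's measurements):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) for two discrete probability distributions
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# hypothetical averaged output distributions of two teachers over 3 classes
t_a = [0.70, 0.20, 0.10]
t_b = [0.65, 0.25, 0.10]
d = kl_divergence(t_a, t_b)   # small value -> the two distributions are close
```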
9. Knowledge distillation
• Teacher-Student relationship
• Accuracy
• 𝑇(𝐴) is the most accurate model, yet its students fail to achieve top performance
• [25] : the teacher's complexity can hinder the learning process, as the student does not have sufficient
capacity to mimic the teacher's behavior; however, it is worth noting that 𝑆2 performs significantly better than 𝑆1
even though they have similar capacity
• [6] : the output of a high-performance teacher network is not significantly different from the ground truth,
hence KD becomes less useful
• HOWEVER, a lower-performance model 𝑻(𝑭) is closer to GT than the high-performance model 𝑻(𝑨)
• Parameters or performance + architecture
• KD with a pre-defined student architecture forces the student to sacrifice its parameters to learn the
teacher's architecture, which ends up with a non-optimal solution
10. Architecture-aware knowledge distillation
• KD-guided NAS
• RL approach to search for latency-constrained Pareto-optimal solutions from a large factorized
hierarchical search space
• Adds a teacher in the loop and uses knowledge distillation to guide the search process
• RL agent
• An RNN-based actor-critic agent searches for the best architecture in the search space
• The RNN weights are updated with the PPO algorithm by maximizing the expected KD-guided reward
• Search space
• Similar to the MnasNet search space
• KD-guided reward
• KD-guided accuracy + latency on mobile devices = KD-guided reward
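The accuracy-plus-latency combination on the last line follows MnasNet's soft latency constraint; a sketch, where the target latency and the exponent `w` are MnasNet-style placeholder values rather than the paper's exact configuration:

```python
def kd_guided_reward(kd_accuracy, latency_ms, target_ms=75.0, w=-0.07):
    # MnasNet-style reward: accuracy scaled by a soft latency penalty,
    # reward = acc * (latency / target) ** w, with w < 0 penalizing slow models
    return kd_accuracy * (latency_ms / target_ms) ** w
```

At the target latency the scaling factor is 1, so the reward equals the KD-guided accuracy; faster models are rewarded slightly and slower ones penalized, which is what makes the search latency-constrained rather than latency-blind.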
11. Architecture-aware knowledge distillation
• Implementation details
• Inception-ResNet-v2 as the teacher model in the ImageNet experiments
• The same training settings as MnasNet:
https://arxiv.org/abs/1807.11626
12. Architecture-aware knowledge distillation
• Implementation details
• Each sampled model is trained for 5 epochs (or 15 epochs), including a first epoch with a warm-up learning rate
• Latency is measured on the single-thread big CPU core of Pixel 1 phones
• The reward is calculated from the combination of accuracy and latency
• The RL agent samples ~10K models
• The top models that meet the latency constraint are picked and trained for a further 400 epochs, either by
distilling the same teacher model or by using ground-truth labels
• Temperature T = 1, distillation weight α = 0.9
13. Architecture-aware knowledge distillation
• Understanding the search process
• AKDNet : searched by the KD-guided reward
• NASNet : searched by the classification-guided reward
• ↓ : statistical results of an AKD search process
14. Architecture-aware knowledge distillation
• Understanding the search process
• AKDNet : searched by the KD-guided reward
• NASNet : searched by the classification-guided reward
• ↓ : how different AKDNet and NASNet are
15. Understanding the structural knowledge
• Existence of structural knowledge
• The knowledge behind the teacher network includes structural knowledge, which sometimes can
hardly be transferred to parameters
• If two identical RL agents perform AKD on two different teacher architectures, will they converge to
different areas of the search space?
• If two different RL agents perform AKD on the same teacher, will they converge to similar areas?
16. Understanding the structural knowledge
• Existence of structural knowledge
• Different teachers with the same RL agent
• The teacher models are the off-the-shelf Inception-ResNet-v2 and EfficientNet-B7
• All other settings (random seed, latency target, and mini-val data) are fixed to the same values
• The final optimal architectures are clearly separable
17. Understanding the structural knowledge
• Existence of structural knowledge
• Same teacher with different RL agents
• Two AKD search programs run for the same teacher model
• Different random seeds for the RL agent and different mini-train / mini-val splits
• Not surprisingly, they eventually converge to close areas of the search space
18. Understanding the structural knowledge
• Existence of structural knowledge
• Difference between AKDNet and NASNet
• The statistical divergence between the architecture families of AKDNet and NASNet
• Expand the search space to make it continuous (e.g. from a 35-dim space to 77-dim for skip_op)
• Calculate the probability of each operator among the top 100 optimal architectures of AKDNet and NASNet
• Compared with NASNet, AKDNet favors larger kernel sizes but smaller expansion ratios for the depth-wise
convolutions, and tends to reduce the number of layers at the beginning of the network
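The per-operator probabilities mentioned above amount to a frequency count over the top architectures; a sketch using a made-up encoding (lists of operator names), not the paper's actual 77-dim representation:

```python
from collections import Counter

def operator_distribution(architectures):
    # architectures: one list of operator names per sampled model
    counts = Counter(op for arch in architectures for op in arch)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

# toy example: two 2-layer architectures from a hypothetical top-100 set
top_archs = [["conv3x3", "conv5x5"], ["conv3x3", "skip"]]
dist = operator_distribution(top_archs)
# {'conv3x3': 0.5, 'conv5x5': 0.25, 'skip': 0.25}
```

Comparing such distributions between the AKDNet and NASNet families is what reveals the systematic preferences (kernel size, expansion ratio, early-layer count) described above.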
20. Preliminary experiments on ImageNet
• Transfer AKDNet to advanced KD methods
• Investigate whether AKD overfits the original KD policy (algorithm and hyper-parameters)
• Even with a quite strong baseline when trained without KD, AKDNet-M (latency ≈ 33 ms) still gains
further improvement under all KD methods
21. Preliminary experiments on ImageNet
• Comparison with SOTA architectures
• MobileNet-v2, MnasNet, MobileNet-v3 : SOTA architectures with similar latency
• AKDNet achieves strong performance even when trained without KD, thanks to the well-designed
search space
22. Preliminary experiments on ImageNet
• Latency vs. FLOPS
• FLOPS and latency are empirically linearly correlated:
3.4 × (latency − 7) ≤ mFLOPS ≤ 10.47 × (latency − 7)
• The variance becomes larger as the model size grows
• To verify whether the conclusion still holds when searching with a slack FLOPS constraint, the latency
term in the reward function is replaced by a FLOPS term, with the target set to 300 mFLOPS
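The empirical bounds above translate directly into code; a sketch (the function name is illustrative):

```python
def flops_bounds(latency_ms):
    # empirical linear bounds from the slide: 3.4*(L-7) <= mFLOPS <= 10.47*(L-7)
    lo = 3.4 * (latency_ms - 7)
    hi = 10.47 * (latency_ms - 7)
    return lo, hi

# e.g. a ~33 ms model is expected to fall roughly between 88 and 272 mFLOPS
lo, hi = flops_bounds(33)
```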
23. Towards million-level face retrieval
• It is much harder for a tiny neural network to learn a complex data distribution
• Some previous works have shown that a huge improvement can be achieved by
introducing KD in complex metric learning
• MegaFace
• 3,530 probe faces / more than 1 million distractor faces
• Training on MS-Celeb-1M and testing on MegaFace
24. Neural architecture ensemble by AKD
• Does AKD still work well when the teacher model is an ensemble of multiple models whose
architectures are different?
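One straightforward way to form such an ensemble teacher is to average the member models' output distributions; a sketch under that assumption (the slide does not specify the exact combination rule):

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ensemble_teacher_probs(member_logits):
    # average the probability distributions of several teacher models;
    # the result can be used as the soft target in the KD reward
    return np.mean([softmax(z) for z in member_logits], axis=0)

# two hypothetical teachers with different architectures, same 3-class task
probs = ensemble_teacher_probs([[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]])
```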
25. Conclusion & further thought
• The significance of a model's structural knowledge in KD, motivated by the inconsistent
distillation performance between different student and teacher models
• A novel RL-based architecture-aware knowledge distillation method that distills the structural
knowledge into the student's architecture, leading to surprising results on multiple tasks
• Further thought: whether we can find a new metric space that measures the similarity between two
arbitrary architectures