This document discusses architecture-aware knowledge distillation (AKD), which uses reinforcement learning to find the optimal student network architecture for distilling knowledge from a given teacher model. It introduces AKD and explains how it guides the search for the best student architectures under latency constraints. Preliminary experiments show AKD networks achieve state-of-the-art results on ImageNet classification and generalize well to other tasks like face recognition and ensemble learning.
2. Contents
1. Introduction
2. Knowledge distillation
3. Architecture-aware knowledge distillation
4. Understanding the structural knowledge
5. Preliminary experiments on ImageNet
6. Towards million-level face retrieval
7. Neural architecture ensemble by AKD
8. Conclusion & further thought
3. Introduction
• Neural Architecture Search (NAS)
• Automates the process of neural architecture design via reinforcement learning, differentiable
search, evolutionary search, and other algorithms
• Knowledge Distillation (KD)
• Trains a (usually small) student neural network using the supervision of a (relatively large)
teacher network
• Previous works on KD mostly focus on transferring the teacher’s knowledge to a student with a
predefined architecture
• The optimal student architectures for different teacher models trained on the same task and
dataset might be different!
4. Introduction
• Architecture-aware Knowledge Distillation (AKD)
• Finds best student architectures for distilling the given teacher model
• A Reinforcement Learning (RL) based NAS process with a KD-based reward function
• Achieves state-of-the-art results on the ImageNet classification task under several latency
settings
• The optimal architecture obtained by AKD for the ImageNet classification task generalizes well to
other tasks such as million-level face recognition and ensemble learning
5. Knowledge distillation
• Knowledge in a neural network
• Input space: I, output space: O
• An ideal model is a connotative mapping function f : x ↦ y, x ∈ I, y ∈ O
• The model's conditional probability function: p(y|x)
• The knowledge of a neural network f̂ : x ↦ ŷ, x ∈ I, ŷ ∈ O
• The network's conditional probability function: p̂(ŷ|x)
• The difference between p̂(ŷ|x) and p(y|x) is the dark part of the neural network's knowledge
• Margin between classes
• One-hot targets y constrain the angular distances between every pair of classes to be the same (90°)
• Similar classes/samples should have smaller angular distances
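The 90° claim can be checked directly: any two distinct one-hot target vectors are orthogonal, so their angular distance is exactly 90°. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def angular_distance_deg(u, v):
    # angle between two vectors, in degrees
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# any two distinct one-hot targets are orthogonal, hence 90 degrees apart
d = angular_distance_deg([1, 0, 0], [0, 1, 0])   # 90.0
```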
6. Knowledge distillation
• Naïve distillation
• Dark knowledge distillation
• A student model trained with the objective of matching the full softmax distribution of the teacher model
• [8] : the distribution of logits of the wrong responses carries information on the similarity between
output categories
• [3] : the soft-target distribution acts as an importance-sampling weight based on the teacher's confidence
in its maximum value
• [42] : the posterior-entropy viewpoint, claiming that soft targets bring robustness by regularizing toward a
much more informed choice of alternatives than blind entropy regularization
7. Knowledge distillation
• Teacher-Student relationship
• Are all student networks equally capable of receiving knowledge from different teachers?
• Distilling the same teacher model to different students leads to different performance results, and no
single student architecture produces the best results across all teacher networks
8. Knowledge distillation
• Teacher-Student relationship
• Distribution
• 𝑇(𝐴) & 𝑇(𝐵) show the lowest KL divergence among all teacher pairs, which means their output
distributions are the closest
• 𝑆2 is the pearl (best student) in the eye of 𝑇(𝐴), whereas 𝑆2 is the last choice for 𝑇(𝐵)
• KL divergence over outputs is therefore too coarse; the distribution needs to be disentangled at a finer granularity
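The teacher-to-teacher comparison above relies on KL divergence between output distributions; a minimal sketch (the distributions here are made-up numbers, not the paper's measurements):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) for two discrete probability distributions
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# hypothetical averaged output distributions of two teachers over 3 classes
t_a = [0.70, 0.20, 0.10]
t_b = [0.65, 0.25, 0.10]
d = kl_divergence(t_a, t_b)   # small value -> the two distributions are close
```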
9. Knowledge distillation
• Teacher-Student relationship
• Accuracy
• 𝑇(𝐴) is the most accurate model, yet its students fail to achieve top performance
• [25] : the teacher's complexity can hinder the learning process, as the student does not have sufficient
capacity to mimic the teacher's behavior; however, it is worth noting that 𝑆2 performs significantly better than 𝑆1
even though they have similar capacity
• [6] : the output of a high-performance teacher network is not significantly different from the ground truth,
hence KD becomes less useful
• HOWEVER, a lower-performance model 𝑻(𝑭) is closer to GT than the high-performance model 𝑻(𝑨)
• Parameters or performance + architecture
• KD with a pre-defined student architecture forces the student to sacrifice its parameters to learn the
teacher's architecture, which ends up with a non-optimal solution
10. Architecture-aware knowledge distillation
• KD-guided NAS
• RL approach to search for latency-constrained Pareto-optimal solutions from a large factorized
hierarchical search space
• Adds a teacher in the loop and uses knowledge distillation to guide the search process
• RL agent
• An RNN-based actor-critic agent searches for the best architecture in the search space
• The RNN weights are updated with the PPO algorithm by maximizing the expected KD-guided reward
• Search space
• Similar to the MnasNet search space
• KD-guided reward
• KD-guided accuracy + latency on mobile devices = KD-guided reward
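The accuracy-plus-latency combination on the last line follows MnasNet's soft latency constraint; a sketch, where the target latency and the exponent `w` are MnasNet-style placeholder values rather than the paper's exact configuration:

```python
def kd_guided_reward(kd_accuracy, latency_ms, target_ms=75.0, w=-0.07):
    # MnasNet-style reward: accuracy scaled by a soft latency penalty,
    # reward = acc * (latency / target) ** w, with w < 0 penalizing slow models
    return kd_accuracy * (latency_ms / target_ms) ** w
```

At the target latency the scaling factor is 1, so the reward equals the KD-guided accuracy; faster models are rewarded slightly and slower ones penalized, which is what makes the search latency-constrained rather than latency-blind.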
11. Architecture-aware knowledge distillation
• Implementation details
• Inception-ResNet-v2 as the teacher model in the ImageNet experiments
• The same training settings as MnasNet:
https://arxiv.org/abs/1807.11626
12. Architecture-aware knowledge distillation
• Implementation details
• Each sampled model is trained for 5 epochs (or 15 epochs), including a first epoch with a warm-up learning rate
• Latency is measured on the single-thread big CPU core of Pixel 1 phones
• The reward is calculated from the combination of accuracy and latency
• The RL agent samples ~10K models
• The top models that meet the latency constraint are picked and trained for a further 400 epochs, either by
distilling the same teacher model or by using ground-truth labels
• Temperature T = 1, distillation weight α = 0.9
13. Architecture-aware knowledge distillation
• Understanding the search process
• AKDNet : searched by the KD-guided reward
• NASNet : searched by the classification-guided reward
• ↓ : statistical results of an AKD search process
14. Architecture-aware knowledge distillation
• Understanding the search process
• AKDNet : searched by the KD-guided reward
• NASNet : searched by the classification-guided reward
• ↓ : how different AKDNet and NASNet are
15. Understanding the structural knowledge
• Existence of structural knowledge
• The knowledge behind the teacher network includes structural knowledge, which sometimes can
hardly be transferred to parameters
• If two identical RL agents perform AKD on two different teacher architectures, will they converge to
different areas of the search space?
• If two different RL agents perform AKD on the same teacher, will they converge to similar areas?
16. Understanding the structural knowledge
• Existence of structural knowledge
• Different teachers with the same RL agent
• The teacher models are the off-the-shelf Inception-ResNet-v2 and EfficientNet-B7
• All other settings (random seed, latency target, and mini-val data) are fixed to the same values
• The final optimal architectures are clearly separable
17. Understanding the structural knowledge
• Existence of structural knowledge
• Same teacher with different RL agents
• Two AKD search programs run for the same teacher model
• Different random seeds for the RL agent and different mini-train / mini-val splits
• Not surprisingly, they eventually converge to close areas of the search space
18. Understanding the structural knowledge
• Existence of structural knowledge
• Difference between AKDNet and NASNet
• The statistical divergence between the architecture families of AKDNet and NASNet
• Expand the search space to make it continuous (e.g. from a 35-dim space to 77-dim for skip_op)
• Calculate the probability of each operator among the top 100 optimal architectures of AKDNet and NASNet
• Compared with NASNet, AKDNet favors larger kernel sizes but smaller expansion ratios for the depth-wise
convolutions, and tends to reduce the number of layers at the beginning of the network
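The per-operator probabilities mentioned above amount to a frequency count over the top architectures; a sketch using a made-up encoding (lists of operator names), not the paper's actual 77-dim representation:

```python
from collections import Counter

def operator_distribution(architectures):
    # architectures: one list of operator names per sampled model
    counts = Counter(op for arch in architectures for op in arch)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

# toy example: two 2-layer architectures from a hypothetical top-100 set
top_archs = [["conv3x3", "conv5x5"], ["conv3x3", "skip"]]
dist = operator_distribution(top_archs)
# {'conv3x3': 0.5, 'conv5x5': 0.25, 'skip': 0.25}
```

Comparing such distributions between the AKDNet and NASNet families is what reveals the systematic preferences (kernel size, expansion ratio, early-layer count) described above.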
20. Preliminary experiments on ImageNet
• Transfer AKDNet to advanced KD methods
• Investigate whether AKD overfits the original KD policy (algorithm and hyper-parameters)
• Even with a quite strong baseline when trained without KD, AKDNet-M (latency ≈ 33 ms) still gains
further improvement under all KD methods
21. Preliminary experiments on ImageNet
• Comparison with SOTA architectures
• MobileNet-v2, MnasNet, MobileNet-v3 : SOTA architectures with similar latency
• AKDNet achieves strong performance even when trained without KD, thanks to the well-designed
search space
22. Preliminary experiments on ImageNet
• Latency vs. FLOPS
• FLOPS and latency are empirically linearly correlated:
3.4 × (latency − 7) ≤ mFLOPS ≤ 10.47 × (latency − 7)
• The variance becomes larger as the model size grows
• To verify whether the conclusion still holds when searching with a slack FLOPS constraint, the latency
term in the reward function is replaced by a FLOPS term, with the target set to 300 mFLOPS
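The empirical bounds above translate directly into code; a sketch (the function name is illustrative):

```python
def flops_bounds(latency_ms):
    # empirical linear bounds from the slide: 3.4*(L-7) <= mFLOPS <= 10.47*(L-7)
    lo = 3.4 * (latency_ms - 7)
    hi = 10.47 * (latency_ms - 7)
    return lo, hi

# e.g. a ~33 ms model is expected to fall roughly between 88 and 272 mFLOPS
lo, hi = flops_bounds(33)
```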
23. Towards million-level face retrieval
• It is much harder for a tiny neural network to learn a complex data distribution
• Some previous works have shown that a huge improvement can be achieved by
introducing KD in complex metric learning
• MegaFace
• 3,530 probe faces / more than 1 million distractor faces
• Training on MS-Celeb-1M and testing on MegaFace
24. Neural architecture ensemble by AKD
• Does AKD still work well when the teacher model is an ensemble of multiple models whose
architectures are different?
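One straightforward way to form such an ensemble teacher is to average the member models' output distributions; a sketch under that assumption (the slide does not specify the exact combination rule):

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ensemble_teacher_probs(member_logits):
    # average the probability distributions of several teacher models;
    # the result can be used as the soft target in the KD reward
    return np.mean([softmax(z) for z in member_logits], axis=0)

# two hypothetical teachers with different architectures, same 3-class task
probs = ensemble_teacher_probs([[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]])
```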
25. Conclusion & further thought
• The significance of a model's structural knowledge in KD, motivated by the inconsistent
distillation performance between different student and teacher models
• A novel RL-based architecture-aware knowledge distillation method that distills the structural
knowledge into the student's architecture, leading to surprising results on multiple tasks
• Further thought: whether we can find a new metric space that measures the similarity between two
arbitrary architectures