Sungchul Kim
Contents
1. Introduction
2. Knowledge distillation
3. Architecture-aware knowledge distillation
4. Understanding the structural knowledge
5. Preliminary experiments on ImageNet
6. Towards million level face retrieval
7. Neural architecture ensemble by AKD
8. Conclusion & further thought
https://www.notion.so/Search-to-Distill-Pearls-are-Everywhere-but-not-the-Eyes-d8e57509df4244b68ef295273febc262
Introduction
• Neural Architecture Search (NAS)
• Automates the process of neural architecture design via reinforcement learning, differentiable
search, evolutionary search, and other algorithms
• Knowledge Distillation (KD)
• Trains a (usually small) student neural network using the supervision of a (relatively large)
teacher network
• Previous works on KD mostly focus on transferring the teacher’s knowledge to a student with a
predefined architecture
• The optimal student architectures for different teacher models trained on the same task and
dataset might be different!
• Architecture-aware Knowledge Distillation (AKD)
• Finds the best student architectures for distilling a given teacher model
• A Reinforcement Learning (RL) based NAS process with a KD-based reward function
• Achieves state-of-the-art results on the ImageNet classification task under several latency
settings
• The optimal architecture obtained by AKD for the ImageNet classification task generalizes well to
other tasks such as million-level face recognition and ensemble learning
Knowledge distillation
• Knowledge in a neural network
• input space : 𝐼, output space : 𝑂
• An ideal model is a connotative mapping function 𝑓 ∶ 𝑥 ↦ 𝑦, 𝑥 ∈ 𝐼, 𝑦 ∈ 𝑂
• The model’s conditional probability function : 𝑝(𝑦|𝑥)
• The knowledge of a neural network 𝑓̂ ∶ 𝑥 ↦ 𝑦̂, 𝑥 ∈ 𝐼, 𝑦̂ ∈ 𝑂
• The network’s conditional probability function : 𝑝̂(𝑦̂|𝑥)
• The difference between 𝑝̂(𝑦̂|𝑥) and 𝑝(𝑦|𝑥) is the dark part of the neural network’s knowledge
• Margin between classes
• One-hot labels 𝑦 constrain the angular distances between all class pairs to be the same (90°)
• Similar classes/samples should have smaller angular distances (see the sketch below)
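A minimal numpy illustration of this point (the class names and soft-target values below are made up for the example):

```python
import numpy as np

# One-hot targets: every pair of class vectors is orthogonal,
# so the angular distance between any two classes is exactly 90 degrees.
cat, dog, car = np.eye(3)  # one-hot targets for three hypothetical classes

def angle_deg(u, v):
    """Angle between two target vectors, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(angle_deg(cat, dog), angle_deg(cat, car))  # 90.0 90.0

# Soft targets break this symmetry: semantically similar classes
# (cat vs. dog) end up at a smaller angle than dissimilar ones (cat vs. car).
soft_cat = np.array([0.80, 0.15, 0.05])
soft_dog = np.array([0.15, 0.80, 0.05])
soft_car = np.array([0.05, 0.05, 0.90])
print(angle_deg(soft_cat, soft_dog))  # ~69 degrees
print(angle_deg(soft_cat, soft_car))  # ~83 degrees
```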
• Naïve distillation
• Dark knowledge distillation
• A student model is trained with the objective of matching the full softmax distribution of the teacher model
• [8] : the logits of the wrong responses carry information on the similarity between output categories
• [3] : the soft-target distribution acts as an importance-sampling weight based on the teacher’s confidence
in its maximum value
• [42] : a posterior-entropy viewpoint: soft targets bring robustness by regularizing towards a much more
informed choice of alternatives than blind entropy regularization (a minimal loss sketch follows below)
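A minimal sketch of this naïve (dark-knowledge) distillation objective, assuming a PyTorch setup; the temperature 𝑇 and weight 𝛼 correspond to the values listed later in the implementation details (𝑇 = 1, 𝛼 = 0.9):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.9):
    """Dark-knowledge distillation: an alpha-weighted KL term matching the
    teacher's softened softmax plus a (1 - alpha)-weighted hard-label term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients when T != 1
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```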
• Teacher-Student relationship
• Are all student networks equally capable of receiving knowledge from different teachers?
• Distilling the same teacher model to different students leads to different performance results, and no
student architecture produces the best results across all teacher networks
• Distribution
• 𝑇(𝐴) & 𝑇(𝐵) show the lowest KL divergence among all teacher pairs, meaning their output
distributions are the closest
• Yet 𝑆2 is the pearl (best student) in the eye of 𝑇(𝐴), whereas 𝑆2 is the last choice for 𝑇(𝐵)
• Distribution-level similarity is therefore not enough; the distributions need to be disentangled at a finer granularity (a KL sketch follows below)
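A sketch of the teacher-pair comparison, assuming the teachers' softmax outputs on a shared evaluation set are available as tensors:

```python
import torch

def avg_kl(probs_a, probs_b, eps=1e-8):
    """Mean KL(p_a || p_b) between two teachers' predictive distributions.
    probs_a, probs_b: [num_images, num_classes] softmax outputs on the
    same evaluation images."""
    kl = probs_a * (torch.log(probs_a + eps) - torch.log(probs_b + eps))
    return kl.sum(dim=1).mean()
```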
• Accuracy
• 𝑇(𝐴) is the most accurate teacher, yet its students fail to achieve the top performance
• [25] : teacher complexity can hinder the learning process, as the student does not have sufficient
capacity to mimic the teacher’s behavior; but it is worth noting that 𝑆2 performs significantly better
than 𝑆1 even though they have similar capacity
• [6] : the output of a high-performance teacher network is not significantly different from the ground truth,
hence KD becomes less useful
• HOWEVER, the lower-performance model 𝑻(𝑭) is closer to the GT than the high-performance model 𝑻(𝑨)
• Parameters or performance + architecture
• KD with a pre-defined student architecture forces the student to sacrifice its parameters to learn the
teacher’s architecture, which ends up in a non-optimal solution
Architecture-aware knowledge distillation
• KD-guided NAS
• RL approach to search for latency-constrained Pareto optimal solutions from a large factorized
hierarchical search space
• Add a teacher in the loop and use knowledge distillation to guide the search process
• RL agent
• An RNN-based actor-critic agent searches for the best architecture in the search space
• The RNN weights are updated with the PPO algorithm to maximize the expected KD-guided reward
• Search space
• Similar to the MNAS search space
• KD-guided reward
• KD-guided accuracy + latency on mobile devices = KD-guided reward (a reward sketch follows below)
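A minimal sketch of such a reward; the multiplicative soft-constraint form and the exponent w = -0.07 are taken from the MnasNet paper, since the slide only states that accuracy and latency are combined:

```python
def kd_guided_reward(kd_accuracy, latency_ms, target_ms=33.0, w=-0.07):
    """MnasNet-style soft-constraint reward, with plain validation accuracy
    replaced by KD-guided accuracy (the accuracy of a student trained under
    the teacher's supervision). w < 0 penalizes models slower than target."""
    return kd_accuracy * (latency_ms / target_ms) ** w
```

With this form, a model at exactly the target latency is scored by its KD-guided accuracy alone.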
• Implementation details
• Inception-ResNet-v2 as the teacher model in the ImageNet experiments
• The same settings as in MNAS
https://arxiv.org/abs/1807.11626
• Implementation details
• Training each sampled model for 5 epochs (or 15 epochs), including a first epoch with a warm-up learning rate
• Measuring its latency on the single-thread big CPU core of a Pixel 1 phone
• The reward is calculated based on the combination of accuracy and latency
• The RL agent samples ~10K models
• Pick the top models that meet the latency constraint, and train them for a further 400 epochs,
either distilling from the same teacher model or using ground-truth labels (an illustrative selection helper follows below)
• Temperature 𝑇 = 1 / distillation weight 𝛼 = 0.9
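An illustrative helper for the selection step above (the tuple layout is an assumption, not the authors' code):

```python
def pick_top_models(candidates, latency_limit_ms, k=3):
    """candidates: (arch_id, kd_reward, latency_ms) tuples for the ~10K
    sampled models. Keep the best-reward architectures that satisfy the
    latency constraint; these are the ones retrained for 400 epochs."""
    feasible = [c for c in candidates if c[2] <= latency_limit_ms]
    return sorted(feasible, key=lambda c: c[1], reverse=True)[:k]
```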
• Understanding the search process
• AKDNet : searched by the KD-guided reward
• NASNet : searched by a classification-guided reward
• ↓ : statistical results of an AKD search process
• ↓ : how different AKDNet and NASNet are
Understanding the structural knowledge
• Existence of structural knowledge
• The knowledge behind the teacher network includes structural knowledge, which can sometimes
hardly be transferred to parameters
• If two identical RL agents perform AKD on two different teacher architectures, will they converge to
different areas of the search space?
• If two different RL agents perform AKD on the same teacher, will they converge to a similar area?
• Different teachers with the same RL agent
• The teacher models are the off-the-shelf Inception-ResNet-v2 and EfficientNet-B7
• All other settings (random seed, latency target, and mini-val data) are fixed to be identical
• The final optimal architectures are clearly separable
• Same teacher with different RL agents
• Two AKD search runs for the same teacher model
• Different random seeds for the RL agent and different mini-train / mini-val splits
• Not surprisingly, they finally converge to close areas of the search space
• Difference between AKDNet and NASNet
• The statistical divergence between the architecture families of AKDNet and NASNet
• Expand the search space to make it continuous (e.g. from a 35-dim space to 77-dim for skip_op)
• Calculate the probability of each operator among the top 100 optimal architectures of AKDNet and of NASNet (a counting sketch follows below)
• Compared with NASNet, AKDNet favors larger kernel sizes but smaller expansion ratios in the depth-wise
convolutions, and tends to reduce the number of layers at the beginning of the network
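A small sketch of the operator-frequency statistic, assuming each architecture is encoded as a list of per-layer operator names (the naming scheme is illustrative):

```python
from collections import Counter

def operator_probabilities(architectures):
    """architectures: e.g. the top 100 of AKDNet or of NASNet, each given as
    a list of operator names such as 'dwconv_5x5_e3'. Returns the empirical
    probability of each operator within the family, so the AKDNet and NASNet
    families can be compared operator by operator."""
    counts = Counter(op for arch in architectures for op in arch)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}
```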
Preliminary experiments on ImageNet
• Transfer AKDNet to advanced KD methods
• Investigate whether AKD overfits the original KD policy (algorithm and hyper-parameters)
• Even with a quite strong baseline when trained without KD, AKDNet-M (latency ≈ 33 ms) still gains
further improvement under all of the KD methods
• Compare with SOTA architectures
• MobileNet-v2, MnasNet, MobileNet-v3 : SOTA architectures with similar latency
• AKDNet achieves strong performance even when trained without KD, thanks to the good search
space
• Latency vs. FLOPS
• FLOPS and latency are empirically linearly correlated:
3.4 × (latency − 7) ≤ mFLOPS ≤ 10.47 × (latency − 7) (see the helper below)
• The variance becomes larger as the models grow
• To verify whether the conclusion still holds when searching under a slack FLOPS constraint, the
latency term in the reward function is replaced by a FLOPS term, with the target set to 300 mFLOPS
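The empirical band above as a small helper (latency in ms, on the Pixel 1 measurement setup):

```python
def flops_range_for_latency(latency_ms):
    """Empirical mFLOPS band for a given latency:
    3.4 * (latency - 7) <= mFLOPS <= 10.47 * (latency - 7)."""
    return 3.4 * (latency_ms - 7), 10.47 * (latency_ms - 7)

print(flops_range_for_latency(33))  # a ~33 ms model: roughly 88 to 272 mFLOPS
```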
Towards million level face retrieval
• It is much harder for a tiny neural network to learn a complex data distribution
• Some previous works have shown that a huge improvement can be achieved by
introducing KD in complex metric learning
• MegaFace
• 3,530 probe faces / more than 1 million distractor faces
• Training with MS-Celeb-1M and testing with MegaFace
Neural architecture ensemble by AKD
• Does AKD still work well when the teacher model is an ensemble of multiple models whose
architectures are different? (a simple ensemble-teacher sketch follows below)
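One simple way to form such an ensemble teacher is to average the member models' softmax outputs; this is an assumption for illustration, not necessarily the paper's exact scheme:

```python
import torch

def ensemble_teacher_probs(logit_list):
    """logit_list: [batch, num_classes] logit tensors from architecturally
    different teacher models. Averages their softmax outputs to obtain a
    single teacher distribution for distillation."""
    probs = [torch.softmax(l, dim=1) for l in logit_list]
    return torch.stack(probs, dim=0).mean(dim=0)
```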
Conclusion & further thought
• The significance of a model’s structural knowledge in KD, motivated by the inconsistent
distillation performance between different student and teacher models
• A novel RL-based, architecture-aware knowledge distillation method that distills the structural
knowledge into the student’s architecture, leading to surprising results on multiple tasks
• Further thought: whether we can find a new metric space that measures the similarity between two
arbitrary architectures
Thank you