Distilling the knowledge in a Neural Network
Geoffrey Hinton (2015, 1,943 citations)
A DNN easily overfits the training data because it has so many parameters.
Overfitting of DNN
To improve generalization performance, an ensemble can be used.
DNN Ensemble
[Figure: several DNNs combined between a shared Input and a single Output]
Because an ensemble is composed of many models, it is computationally expensive.
Test sets are large (e.g. Google) => with that much data, the computational cost is too high.
Storage space is at a premium (e.g. mobile phones) => an ensemble takes up far too much storage!
** Of course, the models used during training can process large amounts of data in batches and use resources relatively freely.
But the model at the actual deployment stage has to process data in real time under resource constraints, so fast processing is what matters.
DNN Ensemble
Let's distill the ensemble's knowledge into a simple single shallow model, provided that it has:
• Good performance
• Low computation
Distilling Ensemble: Single Model
Distilling Ensemble: Single Model
This is the context for Distillation:
= from an ensemble model that uses many parameters,
separate out the knowledge that preserves its generalization and performance,
and build a lightweight model from it!
Distilling Ensemble: Single Model
How do we best teach model 1's training results to model 2? (= Model Compression)
Efficiently transfer the knowledge accumulated by model 1 to model 2!
(Here, model 2 is a single shallow net [a neural network with a single hidden layer],
while model 1 is the large, heavy one.)
Distilling Ensemble: Single Model -1-
In general, with more observations, generalization improves and performance gets better.
So let's increase the amount of data.
Distilling Ensemble: Single Model -1-
Current situation: we have only a little data. First, train an ensemble on it.
weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...
* Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535-541). ACM.
Distilling Ensemble: Single Model -1-
[Figure: unlabeled synthetic points, shown as "?", scattered around the original data]

Original data:
weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...

Oversampled data (Y not labeled):
weight  color   length  ...  Y
20      gray    80      ...  ?
32      yellow  205     ...  ?
10      white   102     ...  ?
8       gray    52      ...  ?
9       white   42      ...  ?
12      gray    45      ...  ?

Oversample the data, but do not attach labels yet.
* Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535-541). ACM.
Distilling Ensemble: Single Model -1-
Use the ensemble trained on the pre-oversampling data to predict the (?) labels (a semi-supervised flavor).

[Figure: the ensemble assigns labels to the previously unlabeled "?" points]

Original data:
weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...

Oversampled data, Y now predicted by the ensemble:
weight  color   length  ...  Y
20      gray    80      ...
32      yellow  205     ...
10      white   102     ...
8       gray    52      ...
9       white   42      ...
12      gray    45      ...

* Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535-541). ACM.
Distilling Ensemble: Single Model -1-
Now there is plenty of data, and the generated data carries the ensemble's information.
=> Training a single shallow net on this final dataset accomplishes the distillation.

Original data:
weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...

Generated data (labeled by the ensemble):
weight  color   length  ...  Y
20      gray    80      ...
32      yellow  205     ...
10      white   102     ...
8       gray    52      ...
9       white   42      ...
12      gray    45      ...

* Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535-541). ACM.
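The pipeline on the last few slides (train an ensemble on the small original dataset, oversample unlabeled pseudo-data, label it with the ensemble, then fit a single shallow net) can be summarized in a minimal sketch. This assumes scikit-learn; the synthetic data, Gaussian jitter, and model sizes are illustrative stand-ins for Buciluǎ et al.'s more careful MUNGE-style sampling, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Small labeled dataset (hypothetical stand-in for the weight/color/length table).
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1. Train the heavy teacher ensemble on the original data.
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# 2. Oversample: generate unlabeled pseudo-data near the original points
#    (simple Gaussian jitter as a placeholder for MUNGE).
idx = rng.integers(0, len(X), size=5000)
X_pseudo = X[idx] + rng.normal(scale=0.1, size=(5000, X.shape[1]))

# 3. Label the pseudo-data with the ensemble's predictions.
y_pseudo = teacher.predict(X_pseudo)

# 4. Train a single shallow net on original + ensemble-labeled data.
student = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
student.fit(np.vstack([X, X_pseudo]), np.concatenate([y, y_pseudo]))
```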
Distilling Ensemble: Single Model -1-
[Figure: accuracy vs. training set size — red: Best Ensemble, blue: Model Compression (single shallow net)]
As the training data size grows, the single shallow net catches up with the ensemble!
Distilling Ensemble: Single Model -2-
Wouldn't the student learn better if it knew the class probability distribution?
(In a sense, this is in the same spirit as increasing the amount of data.)
Distilling Ensemble: Single Model -2-
Instead of the class label, use each sample's logit value from the ensemble as the target y for the single shallow net.
The logit can be thought of as a score for each class, or as a stand-in for the class probability distribution.

Original data with the ensemble's logit as Y:
weight  color   length  ...  Y (logit)
10      red     80      ...  2
30      yellow  201     ...  1
15      white   100     ...  3
6       gray    50      ...  3
5       gray    40      ...  2

Generated data with the ensemble's logit as Y:
weight  color   length  ...  Y (logit)
20      gray    80      ...  -2
32      yellow  205     ...  3
10      white   102     ...  1
8       gray    52      ...  3
9       white   42      ...  -2
12      gray    45      ...  1

* Ba, J., & Caruana, R. (2014). Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (pp. 2654-2662).
Distilling Ensemble: Single Model -2-
[Figure: last-hidden-layer activations H are multiplied by weights W to give the logit; the softmax activation then gives the output O]
Logit = W * H,   O = activation(Logit)
The value obtained before it goes into the activation function (softmax) that computes the final output = the logit.
The logit values are used to train the student model.
** "The deep models are trained in the usual way using softmax output and cross-entropy cost function. The shallow mimic models, however, instead of being trained with cross-entropy on the 183 p values where p_k = e^{z_k} / Σ_j e^{z_j} output by the softmax layer from the deep model, are trained directly on the 183 log probability values z, also called logit, before the softmax activation."
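A minimal sketch of this logit-matching setup, assuming PyTorch and an already trained `teacher` network; the input size, 183 output classes, hidden width, and optimizer settings are illustrative, not Ba & Caruana's exact configuration. The student is a single shallow net trained to regress the teacher's pre-softmax logits with an L2 loss.

```python
import torch
import torch.nn as nn

# Single shallow net: one hidden layer, 183 output logits (sizes are illustrative).
student = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 183))
opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)
mse = nn.MSELoss()

def train_step(x_batch, teacher):
    with torch.no_grad():
        z_teacher = teacher(x_batch)      # teacher's pre-softmax logits as targets
    z_student = student(x_batch)          # student's logits (no softmax applied)
    loss = mse(z_student, z_teacher)      # regress directly on the logits
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```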
Distilling Ensemble: Single Model -2-
(Same tables as on the previous slide: original and generated data with the ensemble's logits as Y.)
[Figure: Logit = W * H, O = activation(Logit), repeated from the previous slide]
"Because the logits capture the logarithm relationships between the probability predictions, a student model trained on logits has to learn all of the additional fine detailed relationships between labels."
Distilling Ensemble: Single Model -2-
Result
The TIMIT speech corpus has 462 speakers in the training set
Distilling Ensemble: Single Model -3-
weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...

Hinton: use the softmax function to obtain a probability distribution.
* Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Distilling Ensemble: Single Model -3-
Hinton: use the softmax function to obtain a probability distribution.
Train the ensemble => compute probabilities (with the softmax function).

weight  color   length  ...  Y
10      red     80      ...
30      yellow  201     ...
15      white   100     ...
6       gray    50      ...
5       gray    40      ...

[Figure: example softmax output of the ensemble, e.g. 0.9 / 0.1]
Distilling Ensemble: Single Model -3-
Feeding the probabilities obtained this way into the single shallow net transfers the ensemble's knowledge.
Training on probabilities also acts as a kind of regularization.

weight  color   length  ...  Prob.
10      red     80      ...  0.90
30      yellow  201     ...  0.95
15      white   100     ...  0.10
6       gray    50      ...  0.70
5       gray    40      ...  0.75
Distilling Ensemble: Single Model -3-
Add a parameter T (Temperature) to the original softmax function:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
The higher T is, the softer the resulting probability distribution compared to the standard softmax.
** The name "Temperature" is quite metaphorical: when distilling, the temperature has to be controlled carefully for the distillation to work well, hence T.
** Even probabilities that are essentially 0 get stretched out softly as T grows.
** The standard softmax corresponds to T = 1; T in the range 2-5 reportedly works well.
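A small sketch of the temperature-scaled softmax above; the example logits below are made up to show how a larger T softens the distribution.

```python
import numpy as np

def softmax_T(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([8.0, 2.0, -1.0])    # hypothetical logits
print(softmax_T(z, T=1))          # ~[0.997, 0.002, 0.000]  (nearly one-hot)
print(softmax_T(z, T=4))          # ~[0.75, 0.17, 0.08]     (much softer)
```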
Distilling Ensemble: Single Model -3-
Raise the temperature moderately to uncover a sensible distribution over the classes,
then train the single shallow model on the resulting soft outputs.

weight  color   length  ...  Prob.
10      red     80      ...  0.90
30      yellow  201     ...  0.95
15      white   100     ...  0.10
6       gray    50      ...  0.70
5       gray    40      ...  0.75

Train the single model.
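A minimal sketch of this training step, assuming PyTorch: the student matches the teacher's temperature-softened outputs, optionally mixed with ordinary cross-entropy on the hard labels. The T^2 scaling follows Hinton et al.; the particular T and weighting alpha here are illustrative choices.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    # Soft-target term: match the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                              # scale by T^2 so gradients stay comparable
    # Hard-target term: ordinary cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```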
Distilling Ensemble: Single Model -3-
"When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate."
Result: Speech Recognition
A single model can be built using artificial data or soft targets* (probabilities), with:
- Good performance
- Low computation
Distilling Ensemble: Single Model -4-
Distilling Ensemble: Single Model -4-
Original data with Y = teacher logit + noise ε:
weight  color   length  ...  Y
10      red     80      ...  2 + ε
30      yellow  201     ...  1 + ε
15      white   100     ...  3 + ε
6       gray    50      ...  3 + ε
5       gray    40      ...  2 + ε

Generated data with Y = teacher logit + noise ε:
weight  color   length  ...  Y
20      gray    80      ...  -2 + ε
32      yellow  205     ...  3 + ε
10      white   102     ...  1 + ε
8       gray    52      ...  3 + ε
9       white   42      ...  -2 + ε
12      gray    45      ...  1 + ε

Adding noise to the logit values plays the role of a regularizer, and reportedly gives somewhat better performance.
* Sau, B. B., & Balasubramanian, V. N. (2016). Deep Model Compression: Distilling Knowledge from Noisy Teachers. arXiv preprint arXiv:1610.09650.
They got all worked up claiming it beats Hinton... but the citation count is only 34.
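A tiny sketch of the noisy-teacher variant, assuming PyTorch: zero-mean Gaussian noise is added to the teacher's logits before the logit-regression step, which is the gist of Sau & Balasubramanian; the noise scale and exact form here are illustrative.

```python
import torch

def noisy_teacher_targets(teacher_logits, sigma=0.1):
    noise = torch.randn_like(teacher_logits) * sigma   # zero-mean Gaussian noise
    return teacher_logits + noise                      # noise acts as a regularizer

# Usage: in the logit-matching step shown earlier, replace z_teacher with
# noisy_teacher_targets(z_teacher).
```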
Conclusion
No matter how large and complex a model is, there is no need to worry that it cannot be deployed as a real service.
Knowledge can be extracted from a teacher model and transferred to a much smaller student model.