Distilling the knowledge in a neural network

Geoffrey Hinton (2015, cited 1,943 times)

A DNN easily overfits the training data => because it has many parameters
Overfitting of DNN

An ensemble can be used to improve generalization performance.
DNN Ensemble
Because an ensemble is composed of many models, it is computationally expensive.
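
To make the cost concrete, here is a minimal sketch of ensemble prediction by averaging the members' softmax outputs. The random `member_logits` array stands in for the outputs of five hypothetical trained networks; averaging probabilities is one common way to combine members and is an assumption here, not something the slides specify.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Average class probabilities over ensemble members.

    member_logits has shape (n_models, n_samples, n_classes).
    """
    return softmax(member_logits).mean(axis=0)

rng = np.random.default_rng(0)
# Hypothetical logits: 5 members, 4 inputs, 3 classes.
member_logits = rng.normal(size=(5, 4, 3))
avg_probs = ensemble_predict(member_logits)
print(avg_probs.shape)
```

Every input needs one forward pass per member, so inference cost and storage grow linearly with the number of models.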

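Test sets are large (e.g. Google) => with that much data, the computational cost is too high
Storage space is at a premium (e.g. mobile phones) => the ensemble takes up too much storage!
** Of course, a model used for training can process large-scale data in batches and use resources relatively freely.
A model in actual deployment, however, must process data in real time under resource constraints, so fast processing matters.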
DNN Ensemble

Distill the ensemble's information into a simple single shallow model, provided that it keeps:
• Good performance
• Low computation
Distilling Ensemble: Single Model

In this context, distillation
= separating out, from an ensemble model that uses many parameters,
the knowledge that preserves its generalization and performance,
and using it to build a lightweight model!

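How can the training results of model 1 best be taught to model 2? (= Model Compression)
Transfer the knowledge accumulated by model 1 to model 2 efficiently!
(Here, model 2 is a single shallow net [a neural network with one hidden layer])
(Model 1: large and heavy)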

Distilling Ensemble: Single Model -1-
With more observations, generalization and performance simply get better.
So let's increase the amount of data.

We currently have little data, so first train the ensemble.
weight  color   length  ...  Y
10      red     80
30      yellow  201
15      white   100
6       gray    50
5       gray    40
* Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535-541). ACM.

Oversample the data, but do not attach labels to the new rows:
weight  color   length  ...  Y
20      gray    80           ?
32      yellow  205          ?
10      white   102          ?
8       gray    52           ?
9       white   42           ?
12      gray    45           ?
10      red     80
30      yellow  201
15      white   100
6       gray    50
5       gray    40

Use the ensemble trained on the pre-oversampling data to predict the unknown labels (?) (semi-supervised learning).

Now there is more data, and the generated rows carry the ensemble's information.
=> Training a single shallow net on this final dataset distills the ensemble.
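
The pipeline above (train the ensemble, oversample without labels, let the ensemble label the new rows, train the single model) can be sketched as follows. This is an illustration under loud assumptions: the `Y` values are hypothetical because the slide's labels are not recoverable, Gaussian jitter stands in for the oversampling step and is not the data-generation scheme of the cited paper, and least-squares linear models stand in for both the ensemble members and the shallow net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Numeric columns (weight, length) from the slide table; Y is hypothetical.
X_small = np.array([[10., 80.], [30., 201.], [15., 100.], [6., 50.], [5., 40.]])
y_small = np.array([1., 1., 0., 1., 1.])

def fit_linear(X, y):
    # Least-squares fit with a bias column (stand-in for training a model).
    A = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.c_[X, np.ones(len(X))] @ w

# Step 1: train the ensemble on bootstrap resamples of the small dataset.
members = []
for _ in range(10):
    idx = rng.integers(0, len(X_small), size=len(X_small))
    members.append(fit_linear(X_small[idx], y_small[idx]))

def teacher_predict(X):
    # The ensemble prediction is the average over its members.
    return np.mean([predict_linear(w, X) for w in members], axis=0)

# Step 2: oversample without labels (jittered copies of the real rows).
X_synth = np.repeat(X_small, 40, axis=0)
X_synth = X_synth + rng.normal(scale=0.05 * X_small.std(axis=0), size=X_synth.shape)

# Step 3: the ensemble fills in the missing labels.
y_synth = teacher_predict(X_synth)

# Step 4: train the single model on original plus ensemble-labeled rows.
X_all = np.vstack([X_small, X_synth])
y_all = np.concatenate([y_small, y_synth])
student = fit_linear(X_all, y_all)
print(X_all.shape, predict_linear(student, X_small))
```

Only step 4 produces the model that is deployed; the ensemble is needed at training time only.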

Red: ensemble
Blue: single shallow net
As the training data size grows, the single shallow net catches up!
[Figure labels: Best Ensemble, Model Compression]

Wouldn't learning go better if we knew the class probability distribution?
(In a sense, this is similar to increasing the amount of data.)

* Ba, J., & Caruana, R. (2014). Do deep nets really need to be deep?. In Advances in neural information processing systems (pp. 2654-2662).
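weight  color   length  ...  Logit
20      gray    80           -2
32      yellow  205          3
10      white   102          1
8       gray    52           3
9       white   42           -2
12      gray    45           1
10      red     80           2
30      yellow  201          1
15      white   100          3
6       gray    50           3
5       gray    40           2
Instead of the class, use each sample's logit value as the target y of the single shallow net.
A logit can be thought of as a score for the class, or as a stand-in for the class probability distribution.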

[Figure: network diagram with Logit = W*H and O = activation(Logit)]
Logit = the value just before it is passed to the activation function (softmax) that computes the final output.
The logit values are used to train the student model.
** The deep models are trained in the usual way using softmax output and cross-entropy cost function.
The shallow mimic models, however, instead of being trained with cross-entropy on the 183 p values
where $p_k = e^{z_k} / \sum_j e^{z_j}$ output by the softmax layer from the deep model, are trained directly on the
183 log probability values z, also called logit, before the softmax activation.
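
A small sketch of why regressing on logits can carry more information than matching probabilities: two logit vectors that differ by a constant give identical softmax outputs, yet a squared-error loss on the logits tells them apart. The function name `logit_regression_loss` and the example values are illustrative assumptions, written in the spirit of the L2 logit-matching objective rather than copied from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_regression_loss(student_logits, teacher_logits):
    """Half the squared L2 distance between logits, averaged over samples."""
    diff = student_logits - teacher_logits
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

# Hypothetical teacher logits and a copy shifted by a constant.
teacher_z = np.array([[10.0, 20.0, 30.0]])
shifted_z = np.array([[-10.0, 0.0, 10.0]])

# Softmax cannot distinguish the two vectors...
same_probs = np.allclose(softmax(teacher_z), softmax(shifted_z))
# ...but the logit regression loss between them is far from zero.
loss_between = logit_regression_loss(shifted_z, teacher_z)
print(same_probs, loss_between)
```

A student trained with this loss is pushed toward the teacher's actual pre-softmax scores, not just toward any logits that happen to produce the same probabilities.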

Because the logits capture the logarithm relationships between the
probability predictions, a student model trained on logits has to learn
all of the additional fine detailed relationships between labels

Result
The TIMIT speech corpus has 462 speakers in the training set

Hinton: use the softmax function to obtain a probability distribution
* Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Train the ensemble => compute probabilities (with the softmax function)

Feeding the computed probabilities to the single shallow net transfers the ensemble's information.
Training on probabilities acts as regularization.
weight  color   length  ...  Prob.
10      red     80           0.90
30      yellow  201          0.95
15      white   100          0.10
6       gray    50           0.70
5       gray    40           0.75

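Add a parameter T (temperature)
Original softmax function: $p_k = e^{z_k} / \sum_j e^{z_j}$
With temperature: $p_k = e^{z_k / T} / \sum_j e^{z_j / T}$
The higher T is, the softer the resulting probability distribution.
** The name "temperature" is itself quite metaphorical: distillation only works well when the temperature is controlled carefully, hence T.
** Even probabilities close to 0 become noticeably softer as T grows!
** The standard softmax corresponds to T = 1; values of T around 2 to 5 are said to work well.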
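
The effect of T is easy to check numerically. Below is a minimal sketch of a temperature-scaled softmax, which divides the logits by T before normalizing; the logit values are hypothetical and chosen only to show how the distribution flattens.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; T = 1 gives the standard softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([5.0, 2.0, -1.0])  # hypothetical teacher logits
for T in (1.0, 2.0, 5.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
```

As T increases, the largest probability shrinks and the smallest ones grow, so the relative ranking of the unlikely classes becomes visible to the student.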

Raise the temperature moderately to obtain a suitable distribution over the classes.
Then train the single shallow model on the resulting values.
Train single model

When the soft targets have high entropy, they provide much
more information per training case than hard targets
and much less variance in the gradient between
training cases, so the small model can often be trained
on much less data than the original cumbersome
model and using a much higher learning rate.
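
Training on soft targets amounts to minimizing the cross-entropy between the teacher's softened distribution and the student's softened distribution. The sketch below shows that loss and its gradient with respect to the student logits; the function names, the logit values, and T = 2 are illustrative assumptions, and the sketch omits any additional hard-label term.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    soft_targets = np.exp(log_softmax(teacher_logits / T))
    log_q = log_softmax(student_logits / T)
    return -np.mean(np.sum(soft_targets * log_q, axis=-1))

def soft_target_grad(student_logits, teacher_logits, T=2.0):
    """Per-sample gradient w.r.t. the student logits: (q - p) / T."""
    p = np.exp(log_softmax(teacher_logits / T))
    q = np.exp(log_softmax(student_logits / T))
    return (q - p) / T

# Hypothetical logits for two samples and three classes.
teacher_logits = np.array([[4.0, 1.0, -2.0], [0.5, 3.0, 0.0]])
student_logits = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])

loss_untrained = soft_target_loss(student_logits, teacher_logits)
loss_matched = soft_target_loss(teacher_logits, teacher_logits)
grad = soft_target_grad(student_logits, teacher_logits)
print(loss_untrained, loss_matched)
print(grad)
```

The gradient vanishes exactly when the student's softened distribution equals the teacher's, and every class contributes a nonzero signal, not only the correct one.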

Result : Speech Recognition
A single model can be built using artificial data or soft targets* (prob.)
- Good performance
- Low computation

weight  color   length  ...  Logit + noise
20      gray    80           -2 + ε
32      yellow  205          3 + ε
10      white   102          1 + ε
8       gray    52           3 + ε
9       white   42           -2 + ε
12      gray    45           1 + ε
10      red     80           2 + ε
30      yellow  201          1 + ε
15      white   100          3 + ε
6       gray    50           3 + ε
5       gray    40           2 + ε
Adding noise to the logit values reportedly acts as a regularizer and improves performance a little further.
* Sau, B. B., & Balasubramanian, V. N. (2016). Deep Model Compression: Distilling Knowledge from Noisy Teachers. arXiv preprint arXiv:1610.09650.
They huffed and puffed that it beats Hinton's method... but it has only 34 citations.
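
A minimal sketch of the noisy-teacher idea as the slide depicts it, with noise added to each teacher logit before the student is trained on it. Additive zero-mean Gaussian noise and the `sigma` parameter are assumptions made for illustration; the exact noise model in the cited paper may differ. The logit values are the ones shown in the slide table.

```python
import numpy as np

def perturb_logits(teacher_logits, sigma=0.1, rng=None):
    """Return teacher logits plus zero-mean Gaussian noise (assumed noise model)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(teacher_logits, dtype=float)
    return z + rng.normal(0.0, sigma, size=z.shape)

# Logit column from the slide table.
teacher_logits = np.array([-2, 3, 1, 3, -2, 1, 2, 1, 3, 3, 2], dtype=float)

rng = np.random.default_rng(0)
noisy_targets = perturb_logits(teacher_logits, sigma=0.5, rng=rng)
print(np.round(noisy_targets, 2))
```

Drawing fresh noise each time a sample is used means the student never sees exactly the same target twice, which is where the regularizing effect would come from.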

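Conclusion
However large and complex a model is, there is no need to worry that it cannot be deployed as a real service.
Knowledge can be extracted from the teacher model and transferred to a much smaller student model.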
