🧪
Knowledge Distillation
Presenter: 유용상
Date: February 23, 2023
Domain: Other
What is Knowledge Distillation?
Why is it needed?
Distilling the Knowledge in a Neural Network (NIPS 2014)
Procedure
Soft Label
Distillation Loss
Various KD models
DistilBERT (NeurIPS 2019)
TinyBERT (EMNLP 2020)
SEED: Self-Supervised Distillation for Visual Representation (ICLR 2021)
References
What is Knowledge Distillation?
Knowledge + Distillation
→ The process of transferring knowledge distilled from a Teacher Network to a Student Network
Why is it needed?
When it first appeared → argued to be necessary for model deployment
Today → studied for many reasons: to build lightweight models, to reduce the resources spent during training, and more!
Distilling the Knowledge in a Neural Network (NIPS 2014)
The paper that first defined the concept of KD
Because deploying a complex model (e.g., an ensemble) to users is difficult, KD is used to transfer what the large model has learned to a small model, and the performance of that small model is evaluated
Dataset: MNIST (multi-class classification)
Procedure
Train the Teacher Network
▼
Extract soft labels (soft outputs, dark knowledge) from the Teacher Network
▼
Combine the extracted knowledge with the CE loss between the Student model's predictions and the ground-truth labels to form the Distillation Loss
Soft Label
Suppose a typical classification model distinguishes 🐮, 🐶, 😺, and 🚗.
Ground truth (hard label, original target): a one-hot vector over the classes
The model's inferred output: a probability distribution over all classes
The paper pays attention not to the probability of the correct class but to the remaining values, which it calls Dark Knowledge
But the softmax function commonly used in classification pushes large values higher and small values lower.
So to extract the Teacher model's dark knowledge well, its output distribution needs to be made a bit softer!
A temperature T is added to the standard softmax: higher T makes the output softer, lower T makes it harder
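As a minimal sketch (the logit values below are made up for illustration, not taken from the paper), the temperature-scaled softmax is q_i = exp(z_i / T) / Σ_j exp(z_j / T), which in PyTorch amounts to dividing the logits by T before applying softmax:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for the classes 🐮, 🐶, 😺, 🚗.
logits = torch.tensor([8.0, 3.0, 1.0, -2.0])

for T in (1.0, 4.0, 10.0):
    # Dividing by T before softmax flattens the distribution as T grows,
    # exposing more of the "dark knowledge" in the small probabilities.
    soft = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {soft.tolist()}")
```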
Distillation Loss
🧐: How do we actually teach the Student model the knowledge extracted from the Teacher model?
→ Train the Student model to reproduce the Teacher model's soft labels! (KD Loss)
🧐: But doesn't training on soft labels alone just produce a model that copies the 'distribution' of the Teacher model's outputs instead of predicting the correct label?
→ Add a CE Loss that pulls the Student model's predictions toward the ground truth (hard labels)
The sum of these two losses is defined as the final loss function
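A minimal PyTorch sketch of this combined objective, assuming the common weighted-sum formulation with a KL term on the temperature-softened outputs; the weight alpha, temperature T, and tensor shapes below are illustrative choices, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    # KD term: match the student's softened distribution to the teacher's
    # soft labels. The T**2 factor keeps gradient magnitudes comparable
    # across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # CE term: keep the student anchored to the ground-truth hard labels.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce

# Example with random tensors (batch of 8, 10 MNIST classes).
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
```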
Experimental results
A student model was trained with knowledge distillation on MNIST with all digit-3 examples removed → even though it never trained on the digit 3, it reached 98.6% accuracy on test images of 3 purely from the information carried by the soft labels
The student model also matched the accuracy of a 10-model ensemble; given the cost of running a 10-model ensemble, knowledge distillation is remarkably effective!
PyTorch implementation blog: https://deep-learning-study.tistory.com/700
Various KD models
DistilBERT (NeurIPS 2019)
Teacher: a pre-trained BERT model (training uses dynamic masking, as in RoBERTa)
Student: token-type embeddings and the pooler removed + the number of layers halved
Uses three losses
1. Distillation Loss
: CE loss between the soft targets and the soft predictions
2. Masked Language Modeling Loss
: CE loss between the hard targets and the hard predictions
3. Cosine Embedding Loss
: a distance between the Teacher's and Student's hidden-state vectors, pushing the two models' states to point in the same direction
→ Compared to base BERT: half the layers (207 MB model size) + comparable (97%) performance + 60% faster inference
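The first two losses follow the same soft/hard cross-entropy pattern as the sketch above; the third term can be sketched with PyTorch's CosineEmbeddingLoss (the shapes and tensors here are placeholders for illustration, not the authors' training code):

```python
import torch
import torch.nn as nn

cos_loss = nn.CosineEmbeddingLoss()

# Hypothetical hidden states flattened to (num_tokens, hidden_dim).
teacher_h = torch.randn(32, 768)
student_h = torch.randn(32, 768)

# A target of +1 for every pair asks the loss to maximize cosine similarity,
# i.e., to align the student's hidden states with the teacher's.
target = torch.ones(teacher_h.size(0))
loss_cos = cos_loss(student_h, teacher_h, target)
```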
TinyBERT (EMNLP 2020)
Three losses
1. Transformer Distillation
: learns the Teacher model's Transformer-layer attention matrices (before normalization)
+ learns the Transformer-layer outputs (= hidden states)
2. Embedding-layer Distillation
: learns the Teacher model's embedding outputs
3. Prediction-layer Distillation
: soft CE loss on the final layer's outputs
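A rough sketch of the Transformer-layer terms, assuming the usual formulation with MSE on the attention score matrices and on the hidden states, where a learned projection maps the narrower student width to the teacher width; all dimensions, names, and random tensors below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerLayerDistill(nn.Module):
    def __init__(self, student_dim=384, teacher_dim=768):
        super().__init__()
        # Learned projection so the student's hidden states can be compared
        # against the wider teacher hidden states.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, attn_s, attn_t, hidden_s, hidden_t):
        # attn_*: (batch, heads, seq, seq) attention scores before softmax
        # hidden_*: (batch, seq, dim) Transformer-layer outputs
        attn_loss = F.mse_loss(attn_s, attn_t)
        hidden_loss = F.mse_loss(self.proj(hidden_s), hidden_t)
        return attn_loss + hidden_loss

# Example for one mapped student/teacher layer pair with random tensors.
distill = TransformerLayerDistill()
loss = distill(
    torch.randn(2, 12, 16, 16), torch.randn(2, 12, 16, 16),
    torch.randn(2, 16, 384), torch.randn(2, 16, 768),
)
```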
Two stages of distillation
1. General Distillation
: performs [Transformer Distillation, Embedding-layer Distillation] from the Teacher model
2. Task-Specific Distillation (addresses over-parameterization)
a. Data Augmentation
b. Task-Specific Distillation (fine-tuning)
→ 4-layer version: 7.5× smaller and 9.4× faster than BERT_base + 96.8% of its performance
→ 6-layer version: 40% fewer parameters + 2× faster + performance maintained
SEED: Self-Supervised Distillation for Visual Representation (ICLR 2021)
→ Brings KD into contrastive learning
The pre-trained Teacher model is frozen and distilled into a smaller model
An image is randomly augmented, both models compute features from it, and the similarity between the two models' probability scores is measured with CE
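As a hedged sketch of this objective (queue size, feature dimension, and temperatures below are illustrative assumptions, not the paper's settings): both models' features are scored against a queue of stored teacher features, and the student's softmax distribution over the queue is trained to match the frozen teacher's via cross-entropy.

```python
import torch
import torch.nn.functional as F

def seed_loss(student_feat, teacher_feat, queue, t_s=0.2, t_t=0.07):
    # student_feat, teacher_feat: (batch, dim) L2-normalized features of the
    # same augmented images; queue: (queue_size, dim) stored teacher features.
    logits_s = student_feat @ queue.t() / t_s  # student similarity scores
    logits_t = teacher_feat @ queue.t() / t_t  # teacher similarity scores
    p_t = F.softmax(logits_t, dim=-1)
    log_p_s = F.log_softmax(logits_s, dim=-1)
    # Cross-entropy between the teacher's and student's score distributions.
    return -(p_t * log_p_s).sum(dim=-1).mean()

# Example with random, normalized tensors.
dim, queue_size = 128, 4096
student = F.normalize(torch.randn(8, dim), dim=-1)
teacher = F.normalize(torch.randn(8, dim), dim=-1)
queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
loss = seed_loss(student, teacher, queue)
```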
References
https://baeseongsu.github.io/posts/knowledge-distillation/
https://deep-learning-study.tistory.com/699
https://deep-learning-study.tistory.com/700
https://velog.io/@dldydldy75/지식-증류-Knowledge-Distillation
https://syj9700.tistory.com/38
https://3months.tistory.com/436
https://facerain.club/distilbert-paper/
https://littlefoxdiary.tistory.com/64
A good article on representation learning, the broader concept that covers KD, SSL, and more: https://89douner.tistory.com/339
