Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

ⓒSaebyeol Yu. Saebyeol’s PowerPoint
NLP Team
김지연, 백지윤, 최상우, 주정헌, 민지원
IMPROVING BERT FINE-TUNING
VIA SELF-ENSEMBLE AND SELF-DISTILLATION

1. Introduction

• BERT: a leap forward for the NLP field
• Research so far has been biased toward the architecture of pre-trained models
• Pre-trained + feature extraction + fine-tuning
• Effect of feature extraction < effect of fine-tuning
< This paper studies how to get the most out of BERT
without using external data or knowledge >

STEP 1: SGD
Drawback: sensitive to the random seed and to the order of the training data
STEP 2: Ensemble
Averages the outputs of several base BERT models
Drawback: training cost
STEP 3: Self-Ensemble
Averages the parameters of several base models
STEP 4: Self-Distillation
Teacher model 1: average of the parameters at each fine-tuning step
Teacher model: gold labels

2. Related Work

Pretrained Language Model
1. BERT: a model pre-trained on a large cross-domain unlabeled corpus
2. BERT base is built from a 12-layer Transformer encoder; BERT Large is built from a 24-layer Transformer encoder
3. Input: a sequence of at most 512 tokens
4. Text classification: 1 segment; text matching task: 2 segments
5. [CLS]: special token placed at the start of every input sequence
6. [SEP]: token that separates one segment from the next

Fine-tuning
1. Additional training for a variety of tasks
2. For text classification or text matching, BERT takes the final hidden state h of the first token [CLS] as the representation of the input sentence or sentence pair.
3. A simple softmax classifier is added on top of BERT to predict the probability of label y.
4. p(y|h) = softmax(W h)
5. W is the task-specific parameter matrix.
6. To fine-tune BERT, the BERT parameters and W are trained jointly with a cross-entropy loss.

Knowledge Distillation
1. Pre-trained language models have a large number of parameters, so they are hard to apply in resource-constrained environments.
2. Knowledge distillation aims to transfer the knowledge of a large teacher model to a small student model.
3. The teacher model is usually already well trained before the knowledge distillation process starts.
4. Unlike commonly used knowledge distillation, the teacher model used here is an ensemble of several student models from previous time steps within the fine-tuning stage.
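The classifier head described under Fine-tuning, p(y|h) = softmax(W h) trained with cross entropy, can be sketched in a few lines of plain Python. This is only a toy illustration: the hidden size of 3, the two labels, and every number in `h` and `W` are made-up values, not anything from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(W, h):
    """p(y|h) = softmax(W h); W is (num_labels x hidden), h is the [CLS] vector."""
    logits = [sum(w * hj for w, hj in zip(row, h)) for row in W]
    return softmax(logits)

def cross_entropy(probs, gold):
    """Negative log-probability of the gold label index."""
    return -math.log(probs[gold])

# Hypothetical toy values: hidden size 3, two labels.
h = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3],
     [-0.3, 0.1, 0.0]]

probs = classify(W, h)
loss = cross_entropy(probs, gold=0)
print(probs, loss)
```

In real fine-tuning the gradient of this loss flows into both W and all BERT parameters; the sketch only shows the forward computation.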

3. Methodology

3.1 Ensemble BERT
• θ1 ~ θK: the parameters of ensemble models #1 through #K
• BERT(x; θk): the output probabilities of the k-th model, with parameters θk, for input x
=> For an input x, the output probabilities of the K models are summed per class, and the class with the highest sum becomes the final output
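A minimal sketch of this voting step, assuming each fine-tuned model has already produced a probability vector for the same input. The three probability vectors below are invented for illustration, and `vote` is a hypothetical helper name.

```python
def vote(prob_lists):
    """Sum class probabilities from K models; return (summed scores, winning class)."""
    num_classes = len(prob_lists[0])
    summed = [sum(p[c] for p in prob_lists) for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: summed[c])
    return summed, best

# Hypothetical outputs of K = 3 fine-tuned models for one input (2 classes).
model_outputs = [
    [0.60, 0.40],
    [0.30, 0.70],
    [0.45, 0.55],
]

summed, label = vote(model_outputs)
print(summed, label)
```

Note that every prediction needs K forward passes and K stored models, which is the cost the following slides try to remove.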

3.1 Averaged BERT
• θ1 ~ θK: the parameters of ensemble models #1 through #K
• BERT(x; θk): the output probabilities of the k-th model, with parameters θk, for input x
=> The parameters θ1 ~ θK are averaged, and the class with the highest probability under the averaged parameters becomes the final output
• Averaged BERT can therefore be seen as a single BERT built from the average of several fine-tuned BERTs
• As a result, Averaged BERT is better than Ensemble BERT in time and space complexity
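The parameter averaging behind Averaged BERT can be sketched as an element-wise mean over K parameter sets. The dict-of-lists representation and the tiny "w"/"b" tensors are assumptions made to keep the example self-contained; a real model would hold many large tensors.

```python
def average_parameters(thetas):
    """Element-wise mean of K parameter dicts sharing the same keys and shapes."""
    K = len(thetas)
    return {
        name: [sum(theta[name][i] for theta in thetas) / K
               for i in range(len(thetas[0][name]))]
        for name in thetas[0]
    }

# Hypothetical parameters of K = 3 fine-tuned models.
thetas = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [2.0, 0.0], "b": [2.0]},
]

theta_avg = average_parameters(thetas)
print(theta_avg)
```

After averaging, only `theta_avg` has to be kept and a prediction needs a single forward pass, which is where the time and space advantage over the voting ensemble comes from.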

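3.2 Self-Ensemble BERT
• However, several BERT models still have to be built
=> Using time steps instead of K models makes self-ensemble possible!
• θ1 ~ θt: the parameters of a single model at time steps #1 through #t
• BERT(x; θt): the output probabilities of the single model at the t-th time step for input x
=> The single model's parameters are averaged over all time steps, and the class with the highest probability under the averaged parameters becomes the final output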
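One way to keep the average over time steps without storing every checkpoint is an incremental mean, sketched below. The `RunningAverage` class and the flat two-element parameter vectors are illustrative assumptions, not the paper's code.

```python
class RunningAverage:
    """Incremental mean of parameter vectors seen at successive time steps."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, theta):
        self.count += 1
        if self.mean is None:
            self.mean = list(theta)
        else:
            # mean_t = mean_{t-1} + (theta_t - mean_{t-1}) / t
            self.mean = [m + (p - m) / self.count
                         for m, p in zip(self.mean, theta)]
        return self.mean

# Hypothetical parameters of one model at time steps 1, 2, 3.
avg = RunningAverage()
for theta_t in ([1.0, 0.0], [2.0, 2.0], [3.0, 4.0]):
    avg.update(theta_t)

print(avg.mean)
```

Only one extra copy of the parameters is held in memory, regardless of how many steps have been averaged.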

Self-Distillation BERT
• The training process of the self-ensemble model is the same as that of the base BERT model
• On top of this, the base model is improved through knowledge distillation
• Knowledge distillation: a method that passes the knowledge of a teacher model on to a student model
• Teacher model: the average over previous training steps
• This makes the student model more robust and more accurate.

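SDA: Self-Distillation Averaged
• Teacher model = self-ensemble model with parameter averaging
• Cross entropy: compares the BERT model's output with the gold labels
• MSE: the MSE between the output of the BERT model at each step and the output of the parameter-averaged (teacher) model
• Parameter update: the teacher's parameters are the average of the parameters over the last K steps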
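The SDA objective described above, cross entropy against the gold label plus λ times the MSE between student and teacher outputs, can be sketched as follows. The linear `forward` function is a toy stand-in for BERT, and the parameter values, K = 2, and λ = 1 are chosen only so the numbers are easy to follow.

```python
import math
from collections import deque

def forward(theta, x):
    """Toy stand-in for BERT: logits = theta @ x (theta is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]

def cross_entropy(logits, gold):
    """Cross entropy of softmax(logits) against the gold label index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[gold]

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def average_theta(snapshots):
    """Element-wise mean of the stored parameter snapshots (the teacher)."""
    K = len(snapshots)
    rows, cols = len(snapshots[0]), len(snapshots[0][0])
    return [[sum(s[r][c] for s in snapshots) / K for c in range(cols)]
            for r in range(rows)]

def sda_loss(student, snapshots, x, gold, lam=1.0):
    teacher = average_theta(list(snapshots))
    s_logits = forward(student, x)
    t_logits = forward(teacher, x)  # one teacher forward pass
    return cross_entropy(s_logits, gold) + lam * mse(s_logits, t_logits)

# Hypothetical history of the last K = 2 student parameter snapshots.
history = deque(maxlen=2)
history.append([[1.0, 0.0], [0.0, 1.0]])
history.append([[3.0, 0.0], [0.0, 1.0]])
student = [[2.0, 0.0], [0.0, 1.0]]

loss = sda_loss(student, history, x=[1.0, 1.0], gold=0, lam=1.0)
print(loss)
```

In this toy case the student happens to equal the averaged teacher, so the MSE term is zero and the loss reduces to the cross entropy alone.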

SDV: Self-Distillation Voted
• A variant that builds the teacher model by voting instead of averaging
• It is less efficient than SDA: SDA only has to update the parameter average, whereas SDV has to run the computation for the teacher models at every training step.
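The efficiency gap between the two teachers can be made concrete with a small sketch that counts forward passes. The tanh-based `forward` is a toy nonlinear stand-in for BERT (with a purely linear model the two teachers would coincide), and the two snapshots are made-up values.

```python
import math

calls = {"forward": 0}

def forward(theta, x):
    """Toy nonlinear stand-in for BERT: tanh of each row's dot product with x."""
    calls["forward"] += 1
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in theta]

def sdv_teacher(snapshots, x):
    """Voted teacher: average the OUTPUTS of the K past models (K forward passes)."""
    outputs = [forward(theta, x) for theta in snapshots]
    K = len(outputs)
    return [sum(o[c] for o in outputs) / K for c in range(len(outputs[0]))]

def sda_teacher(snapshots, x):
    """Averaged teacher: average the PARAMETERS first, then one forward pass."""
    K = len(snapshots)
    rows, cols = len(snapshots[0]), len(snapshots[0][0])
    avg = [[sum(s[r][c] for s in snapshots) / K for c in range(cols)]
           for r in range(rows)]
    return forward(avg, x)

# Hypothetical snapshots from the last K = 2 training steps.
snapshots = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 1.0]]]
x = [1.0, 1.0]

calls["forward"] = 0
voted = sdv_teacher(snapshots, x)
sdv_passes = calls["forward"]

calls["forward"] = 0
averaged = sda_teacher(snapshots, x)
sda_passes = calls["forward"]

print(voted, sdv_passes, averaged, sda_passes)
```

The voted teacher costs K forward passes per training step, while the averaged teacher costs one, which matches the efficiency argument on the slide.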

4. Experiment

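Datasets
Text Classification
• IMDb: movie review texts with positive/negative sentiment
• AG's News: news titles and short descriptions in four classes (World / Sports / Business / Sci-Tech)
• DBPedia: Wikipedia titles and abstracts in 14 non-overlapping classes
• Yelp Polarity: restaurant review texts labeled good/bad
• Yelp Full: restaurant review texts rated on a 5-star scale
NLI
• SNLI: sentence-pair relation data built at Stanford (contradiction, entailment, neutral)
• MNLI: sentence-pair relation data drawn from 10 genres (contradiction, entailment, neutral)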

Hyperparameters
Most hyperparameters are the same as those used to train the base BERT model
• AdamW optimizer
• Warm-up proportion 0.1
• Learning rate 2e-5
• Dropout 0.1
• Maximum length of 512 tokens; anything longer is truncated
• BERT base: batch size 4 + 4 gradient accumulation steps
• BERT Large: batch size 1 + 16 gradient accumulation steps
• Ensemble BERT: 4 different random seeds
++ Two additional main hyperparameters for BERT SDA and BERT SDV
• Self-distillation weight λ
• Teacher size K
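The settings listed above can be collected in a small config object, which also makes the effective batch size explicit: batch size times gradient accumulation steps gives 16 for both BERT base and BERT Large. `FineTuneConfig` and `truncate` are hypothetical names, and keeping only the first 512 tokens is an assumed truncation strategy; the slides do not say which part of a long text is kept.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    warmup_proportion: float = 0.1
    dropout: float = 0.1
    max_seq_length: int = 512
    batch_size: int = 4
    grad_accum_steps: int = 4

    @property
    def effective_batch_size(self) -> int:
        """Examples contributing to one optimizer update."""
        return self.batch_size * self.grad_accum_steps

def truncate(token_ids, max_len):
    """Assumed strategy: keep the first max_len tokens and drop the rest."""
    return token_ids[:max_len]

bert_base = FineTuneConfig()
bert_large = FineTuneConfig(batch_size=1, grad_accum_steps=16)

print(bert_base.effective_batch_size, bert_large.effective_batch_size)
print(len(truncate(list(range(600)), bert_base.max_seq_length)))
```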

Self-Distillation Weight: λ values compared on the IMDb data → λ = 1
Teacher Size: comparison of teacher sizes K → a different K is used for each dataset

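Model Analysis
• All models being compared use the same number of random seeds
• Comparing the models shows that BERT SDA performs considerably better than the vanilla model.
• However, because the seed has more influence than the value of K, this setup cannot be used to compare the effect of changing K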

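Convergence Curves
Figure 5. Comparing the effect of self-distillation
• For the base model, the test error no longer changes after epoch 3
• With self-distillation, the error keeps decreasing even after epoch 3
Figure 6. Confirming the effect of self-distillation
• Cross entropy = the student model's loss (the loss of the base model without self-distillation)
• MSE = the MSE between the logit outputs of the average of the student models (the teacher) and the student
• Over the training steps the CE value stays consistently low, while the MSE drops dramatically
• This confirms that using self-distillation leaves room for further improvement over the existing model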

Classification Tasks
• BERT SDA achieves the best performance
NLI Tasks
• BERT SDV achieves the best performance.
>> Overall, the self-distillation and self-ensemble models obtained lower error than the existing model.
BERT Large Model
• Classification performed best with larger values of K
• NLI performed best with K = 1.
• On both tasks, the self-distillation results are better.

Purpose
Improve the BERT model through fine-tuning, without additional data or knowledge
Method
Self-Ensemble / Self-Distillation
Result
The self-ensemble model can improve BERT, but its efficiency is poor.
The self-distillation model was able to improve fine-tuning.
In addition, the model could potentially be improved further through data augmentation and better hyperparameters.

Thank you

Q&A
