2. Problems with the Original BERT
Three problems: (a) excessive memory requirements, (b) excessive training time, (c) the parameter count is not used efficiently
1. Hardware issues caused by the growing number of parameters and memory footprint (※ the paper uses MRC tasks as its main example)
An obstacle to answering this question is the memory limitations of available hardware. Given that current
state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these
limitations as we try to scale our models.
2. Training time (8 days on 16 TPUs; several months on an ordinary GPU)
Training speed can also be significantly hampered in distributed training, as the communication overhead is
directly proportional to the number of parameters in the model
3. Increasing the parameter count does not translate into better performance (doubling the hidden size of the Large model actually degrades it)
We also observe that simply growing the hidden size of a model such as BERT-large (Devlin et al., 2019)
can lead to worse performance
3. Proposed Method
Factorized embedding parameterization
1. Factorized embedding parameterization
As such, untying the WordPiece embedding size E from the hidden layer size H allows us to make a
more efficient usage of the total model parameters as informed by modeling needs, which dictate that H
> E. Therefore, for ALBERT we use a factorization of the embedding parameters, decomposing them
into two smaller matrices. Instead of projecting the one-hot vectors directly into the hidden space of size
H, we first project them into a lower dimensional embedding space of size E, and then project it to the
hidden space.
O(V × H) ➔ O(V × E + E × H)
[Figure: the V × H embedding matrix is factorized into a V × E matrix multiplied by an E × H matrix, with H > E.
O(V × H): 30,000 × 768 ≈ 23M parameters  vs.  O(V × E + E × H): 30,000 × 200 = 6M plus 200 × 768 ≈ 0.15M parameters]
※ Reduces the number of parameters needed for the embedding layer
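Below is a minimal PyTorch sketch of this factorization (an illustration under assumed sizes, not the official ALBERT code), using the example from the figure above (V = 30,000, E = 200, H = 768; ALBERT itself uses E = 128):

import torch
import torch.nn as nn

V, E, H = 30_000, 200, 768

# BERT-style: project one-hot token vectors directly into the hidden space.
bert_style = nn.Embedding(V, H)                      # V x H parameters

# ALBERT-style: embed into a small space of size E, then project to H.
albert_embed = nn.Embedding(V, E)                    # V x E parameters
albert_proj = nn.Linear(E, H, bias=False)            # E x H parameters

token_ids = torch.randint(0, V, (2, 16))             # (batch, seq_len)
hidden = albert_proj(albert_embed(token_ids))        # (2, 16, H)

def n_params(*modules):
    return sum(p.numel() for m in modules for p in m.parameters())

print(n_params(bert_style))                 # 23,040,000  (~23M)
print(n_params(albert_embed, albert_proj))  # 6,153,600   (~6.15M)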
4. Proposed Method
Factorized embedding parameterization
1. Tried to find the factorized embedding parameterization in the huggingface github, but failed to locate it
- ALBERT embedding implementation
- ALBERT git repositories
https://github.com/huggingface/transformers
https://github.com/brightmart/albert_zh/
https://github.com/google-research/ALBERT
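For reference, the factorization is visible in the huggingface transformers configuration, where embedding_size (E) is decoupled from hidden_size (H). A quick hedged check (exact module attribute names may differ across library versions):

from transformers import AlbertConfig, AlbertModel

# embedding_size (E) is a separate config field from hidden_size (H).
config = AlbertConfig(vocab_size=30000, embedding_size=128, hidden_size=768,
                      num_hidden_layers=12, num_attention_heads=12,
                      intermediate_size=3072)
model = AlbertModel(config)

# The word embedding table is V x E rather than V x H.
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([30000, 128])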
6. Proposed Method
Factorized embedding parameterization
1. Factorized embedding parameterization
Table 3 (in the paper) shows the effect of changing the vocabulary embedding size E using an ALBERT-base configuration
setting (see Table 2), using the same set of representative downstream tasks. Under the non-shared
condition (BERT-style), larger embedding sizes give better performance, but not by much. Under the
all-shared condition (ALBERT-style), an embedding of size 128 appears to be the best. Based on these
results, we use an embedding size E = 128 in all future settings, as a necessary step to do further
scaling.
7. Proposed Method
Cross-layer parameter sharing
2. Cross-layer parameter sharing
For ALBERT, we propose cross-layer parameter sharing as another way to improve parameter
efficiency. There are multiple ways to share parameters, e.g., only sharing feed-forward network
(FFN) parameters across layers, or only sharing attention parameters.
[Figure: a stack of Transformer blocks, each consisting of Multi-Head Attention (Scaled Dot-Product Attention) → Add & Norm → Feed Forward → Add & Norm; in ALBERT the parameters are shared across Block-1, Block-2, … rather than learned separately per block.]
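A minimal PyTorch sketch of cross-layer parameter sharing (an illustration of the idea, not the actual ALBERT layer): one encoder layer is created once and reused at every depth, so the encoder carries only a single layer's worth of attention and FFN parameters.

import torch.nn as nn

# One shared encoder layer (attention + FFN); sizes follow a base-like config.
hidden, heads, ffn, num_layers = 768, 12, 3072, 12
shared_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                          dim_feedforward=ffn,
                                          batch_first=True)

def encode(x):  # x: (batch, seq_len, hidden)
    # The same weights are applied at every depth instead of 12 separate layers.
    for _ in range(num_layers):
        x = shared_layer(x)
    return x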
12. Proposed Method
Inter-sentence coherence loss
3. Inter-sentence coherence loss
However, subsequent studies (Yang et al., 2019; Liu et al., 2019) found NSP’s impact unreliable and
decided to eliminate it, a decision supported by an improvement in downstream task performance
across several tasks.
We conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task.
That is, for ALBERT, we use a sentence-order prediction (SOP) loss. The SOP loss uses as
positive examples the same technique as BERT (two consecutive segments from the same
document), and as negative examples the same two consecutive segments but with their order
swapped
NSP: Next Sentence Prediction / SOP: Sentence Order Prediction
[Example pairs]
positive   NSP: Seg A sentence + Seg B sentence → True     SOP: Seg A sentence + Seg B sentence → True
negative   NSP: Seg A sentence + random sentence → False   SOP: Seg B sentence + Seg A sentence → False
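A minimal sketch of how SOP training pairs could be constructed (a hypothetical helper, not the official ALBERT data pipeline): positives keep two consecutive segments from the same document in their original order, negatives simply swap them.

import random

def make_sop_example(seg_a, seg_b):
    # seg_a and seg_b are two consecutive segments from the same document.
    if random.random() < 0.5:
        return seg_a, seg_b, 1   # positive: original order
    return seg_b, seg_a, 0       # negative: same segments, order swapped

# NSP negatives instead pair a segment with one from a different document,
# so topic cues alone are often enough to solve the task.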
14. Additional Experiments
Inter-sentence coherence loss
- Trained the MLM task with additional data (the datasets used by XLNet and RoBERTa)
- Removing dropout improved performance
- With the additional data, SQuAD performance got worse (probably because out-of-domain data was mixed in)
15. Overall Experimental Results (※ experimental results from our own re-implementation)
Our test code can be checked here: https://github.com/jeehyun100/text_proj
BERT (parameter size: 414M)
{
  "exact": 80.21759697256385,
  "f1": 87.94263692549254,
  "total": 10570,
}

ALBERT (parameter size: 48M)
{
  "exact": 80.87038789025544,
  "f1": 88.67964179873631,
  "total": 10570,
}

Higher performance achieved with a much smaller parameter size.