DEBERTA
Decoding-Enhanced BERT with Disentangled Attention
arXiv 2020.06.05 || ICLR 2021
🤗
Deep Learning Paper Reading Group, 4th cohort, NLP team
진명훈 김수빈 신문종 아이린 최상우
TABLE OF CONTENTS
2
Introduction
Background
3 Contributions (2 Q&A)
Experiment
Conclusion
Introduction
3
• Model proposed by He et al., 2020
• DeBERTa: Decoding-enhanced BERT with Disentangled Attention
• Based on Google's BERT (2018) and Facebook's (now Meta) RoBERTa (2019)
• RoBERTa + disentangled attention + enhanced mask decoder
• With half of the data used in RoBERTa (80GB)
• Introduces Scale-Invariant Fine-Tuning (SiFT)
• Merged into 🤗transformers via PR #5929
• Outperforms RoBERTa on a majority of NLU tasks
• e.g., SQuAD, MNLI and RACE
Background
4
• Positional Information
• Masked Language Model
• Adversarial Training
Background: Positional Information
5
• The standard self-attention mechanism lacks a natural way to encode
word position information
• Add a positional bias (ref: the 딥논읽 Rotary Embedding presentation)
• Absolute Position Embedding
• Relative Position Embedding
Background: Masked Language Model
6
• Large-scale Transformer-based PLMs are typically pre-trained on large amounts of text to learn contextual word representations using a self-supervision objective, known as Masked Language Model (MLM)

$\max_\theta \log p_\theta(X \mid \tilde{X}) \approx \max_\theta \sum_{i \in C} \log p_\theta(\tilde{x}_i = x_i \mid \tilde{X})$
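As a rough illustration of this objective (not the authors' code): mask a subset C of tokens and compute the cross-entropy only on those positions. `model` (ids → logits), `mask_token_id`, and the 15% rate are assumptions for the sketch; special-token handling is omitted.

```python
# Minimal MLM-objective sketch: cross-entropy on the masked positions only.
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob  # set C
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id            # X~ : the corrupted sequence
    logits = model(corrupted)                  # (batch, seq_len, vocab_size)
    labels[~mask] = -100                       # ignore positions outside C
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                           ignore_index=-100)  # ~ -sum_{i in C} log p(x_i | X~)
```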
Background: Adversarial Training
7
• Train the model on clean data
• Train on clean data together with adversarial samples
• Improves generalization performance
3 Contributions
8
• Disentangled attention
• Decomposes attention additively, as in Transformer-XL
• Unlike Shaw's RPE and Transformer-XL, keeps the position-to-content term
• This also captures how attention changes with the position of the query token
• Drops the position-to-position term, which is unnecessary with relative position embeddings
• Enhanced Mask Decoder
• A new store opened beside the new mall
• Absolute position information also matters!
• Scale Invariant Fine-Tuning
• Adversarial training helps model generalization
• In NLP, the variance of embedding vector norms differs model by model and word by word
• So normalize the word embeddings first, then add the perturbation!
Disentangled Attention
9
• Disentangled Attention: a two-vector approach to content and position embedding
• The paper proposes the formula below, decomposing each token representation into two vectors: one for its content and one for its position
• This kind of decomposition was in fact already proposed in Transformer-XL!
$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
10
17.06.12 Transformer → 18.03.06 Shaw RPE → 18.09.12 Music Transformer → 19.01.09 Transformer-XL → 19.06.19 XLNet → 19.10.23 T5 → 20.06.05 DeBERTa
History: Relative Position Embedding
11
Transformer upgrade! Inject position information directly into the layers!
Self-Attention with Relative Position Representations
Music Transformer
History: Relative Position Embedding
12
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
$A_{i,j}^{abs} = \underbrace{E_{x_i}^T W_q^T W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^T W_q^T W_k U_j}_{(b)} + \underbrace{U_i^T W_q^T W_k E_{x_j}}_{(c)} + \underbrace{U_i^T W_q^T W_k U_j}_{(d)}$

(a) content-to-content, (b) content-to-position, (c) position-to-content, (d) position-to-position
History: Relative Position Embedding
13
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Compared with DeBERTa's disentangled decomposition:

$A_{i,j}^{abs} = \underbrace{E_{x_i}^T W_q^T W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^T W_q^T W_k U_j}_{(b)} + \underbrace{U_i^T W_q^T W_k E_{x_j}}_{(c)} + \underbrace{U_i^T W_q^T W_k U_j}_{(d)}$

$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
14
Let's analyze this equation:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$

First, define $Q = W_q E_{x_i}$, $U_q = W_q U_i$, $K = W_k E_{x_j}$, $U_k = W_k U_j$.

Then the equation simplifies to
$A_{i,j}^{abs} = Q^T K + Q^T U_k + U_q^T K + U_q^T U_k$

Grouping the terms,
$A_{i,j}^{abs} = (Q + U_q)^T (K + U_k)$

In other words,
$A_{i,j}^{abs} = (W_q (E_{x_i} + U_i))^T \, W_k (E_{x_j} + U_j)$
[Figure: toy attention example — token embeddings $t_1, t_2, t_3$ and </s> are embedded and projected into queries $q_1..q_4$ ($QW^Q \in \mathbb{R}^{4\times3}$) and keys $k_1..k_4$ ($KW^K \in \mathbb{R}^{3\times4}$), giving an attention matrix $Att \in \mathbb{R}^{4\times4}$ with entries $a_{11}, a_{12}, \dots$]
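A quick numerical check of the grouping derived above, with random toy vectors (nothing model-specific):

```python
# Toy check: with Q = W_q E_xi, U_q = W_q U_i, K = W_k E_xj, U_k = W_k U_j,
# the grouped form (Q + U_q)^T (K + U_k) equals the sum of the four terms.
import torch

d = 8
W_q, W_k = torch.randn(d, d), torch.randn(d, d)
E_xi, E_xj = torch.randn(d), torch.randn(d)   # content (word) embeddings
U_i, U_j = torch.randn(d), torch.randn(d)     # absolute position embeddings

Q, U_q = W_q @ E_xi, W_q @ U_i
K, U_k = W_k @ E_xj, W_k @ U_j

four_terms = Q @ K + Q @ U_k + U_q @ K + U_q @ U_k
grouped = (Q + U_q) @ (K + U_k)
print(torch.allclose(four_terms, grouped))    # True
```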
History: Relative Position Embedding
15
Absolute position attention, as derived above:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j = (W_q(E_{x_i} + U_i))^T W_k (E_{x_j} + U_j)$

What did Shaw's work change? (learned relative positions)
$A_{i,j}^{Shaw\_RPE} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k R_{i-j}$

What did Dai (lead author of Transformer-XL) change? (sinusoid relative positions)
$A_{i,j}^{RPE} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}$
History: Relative Position Embedding

Shaw's RPE: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_{m-n}$

Transformer-XL: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T \tilde{W}_k \tilde{p}_{m-n} + u^T W_k x_n + v^T \tilde{W}_k \tilde{p}_{m-n}$

T5: $q_m^T k_n = x_m^T W_q^T W_k x_n + b_{m,n}$

DeBERTa: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^T W_q^T W_k x_n$
Disentangled Attention
18
• Existing RPE approaches such as Shaw et al. compute attention weights using only the content-to-content (a) and content-to-position (b) terms
• But attention weights cannot be modeled from one direction alone
• The position-to-content (c) term matters as well!
• The position-to-position (d) term is already covered by the relative position embeddings, so it is removed
Disentangled Attention
19
• k: maximum relative distance
• $\delta(i, j) \in [0, 2k)$
• $\delta(i, j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise} \end{cases}$
→ a minimal sketch of this bucketing follows below
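A minimal sketch of the relative-distance bucketing defined above (illustrative only; the actual logic lives in the modeling_deberta.py file linked on the next slide):

```python
# Sketch of δ(i, j): maps a relative distance into [0, 2k) with clipping at ±k.
def relative_bucket(i: int, j: int, k: int) -> int:
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k

# e.g. with k = 3: distances far to the left of i clip to 0, far right to 2k-1.
print([relative_bucket(0, j, k=3) for j in range(7)])  # [3, 2, 1, 0, 0, 0, 0]
```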
Disentangled Attention
20
https://github.com/huggingface/transformers/blob/4210579522f8b288c3ae6c646e8a7f2e3a941c76/src/transformers/models/deberta/modeling_deberta.py#L660
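The linked implementation assembles the three retained terms with relative-position buckets. Below is a heavily simplified single-head sketch; the tensor names, shapes, and gather-based indexing are assumptions for illustration, not the actual 🤗 code.

```python
# Simplified single-head sketch of the disentangled attention scores:
# content-to-content, content-to-position and position-to-content.
import torch

def disentangled_scores(Hc, Pr, Wq_c, Wk_c, Wq_r, Wk_r, delta):
    """Hc: (L, d) content states, Pr: (2k, d) relative position embeddings,
    delta: (L, L) long tensor of bucket indices δ(i, j)."""
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c              # content query / key
    Qr, Kr = Pr @ Wq_r, Pr @ Wk_r              # position query / key
    c2c = Qc @ Kc.T                            # (a) content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, delta)    # (b) A[i,j] = Qc_i · Kr_{δ(i,j)}
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T  # (c) A[i,j] = Kc_j · Qr_{δ(j,i)}
    d = Hc.size(-1)
    return (c2c + c2p + p2c) / (3 * d) ** 0.5  # scaled by sqrt(3d) as in the paper
```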
Enhanced Mask Decoder
21
• DeBERTa is pre-trained with MLM
• MLM exploits the content and position information of the context words
• But it does not take absolute positions into account
• e.g., A new store opened beside the new mall
• BERT injects absolute positions at the input layer
• DeBERTa injects absolute positions after all the Transformer layers, just before the softmax layer used for masked token prediction
Enhanced Mask Decoder
22
Enhanced Mask Decoder
23
https://github.com/microsoft/DeBERTa/blob/master/experiments/language_model/mlm.sh
Enhanced Mask Decoder
24
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Fix the random seed
Return the tokenizer and task object
Load the eval and test data
Load the train data and build the model
Enhanced Mask Decoder
25
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/tasks/mlm_task.py
Enhanced Mask Decoder
26
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/models/masked_language_model.py
Enhanced Mask Decoder
27
https://youtu.be/gcMyKUXbY8s?t=1198
BertForMaskedLM
28
[Diagram: BERT-style masked LM — sub-word embedding + token type embedding + absolute position embedding are summed at the input of the BERT module; the stacked Transformer layers (drawn with disentangled attention) produce encoder outputs, and the final encoder output goes directly into lm_head to give lm_logits / lm_loss (CLS 딥 ##러 MASK 논 MASK 모임 SEP → CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
DeBERTaForMaskedLM with EDM
29
[Diagram: DeBERTa module — sub-word embedding + token type embedding (the absolute position embedding (1) is added at the input only if position_biased_input); a stack of Transformer layers with disentangled attention yields the encoder output (H). The EDM module (n=2) forms the query state (I) from H plus the absolute position embedding, runs two more Transformer layers with disentangled attention, and feeds the result to lm_head for lm_logits / lm_loss (CLS 딥 ##러 MASK 논 MASK 모임 SEP → CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
• (1) is added only when the position_biased_input option is True
• The pink Transformer layers are shared
• lm_head shares its weights with the word embedding matrix
• According to the authors, leaving out EDM does not affect the convergence of the PLM
• It only slightly affects the perplexity during MLM training
→ a rough sketch of the EDM flow follows below
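A rough sketch of the flow just described, under stated assumptions: `shared_layer` stands in for one of the shared disentangled-attention layers, and its `query`/`key`/`value` keyword signature is hypothetical, not the actual API of the microsoft/DeBERTa code.

```python
# Rough sketch of the Enhanced Mask Decoder (EDM) flow.
import torch

def enhanced_mask_decoder(H, abs_pos_emb, shared_layer, n=2):
    """H: (L, d) final encoder output; abs_pos_emb: (L, d) absolute position
    embeddings, injected only here rather than at the input layer."""
    query = H + abs_pos_emb                        # query state I
    for _ in range(n):
        # the query attends over the encoder output; the result becomes
        # the query state for the next EDM step
        query = shared_layer(query=query, key=H, value=H)
    return query                                   # then fed to lm_head
```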
Scale Invariant Fine-Tuning
30
• Virtual adversarial training is a regularization method
• It strengthens the model's generalization
• The goal is that a small perturbation (noise) on the input still yields the same output prediction under adversarial attack
• In NLP tasks the perturbation is applied to the word embeddings
• However, the norm of the embedding vectors varies model by model and word by word
• The bigger the model, the larger the variance, which makes adversarial training unstable
• Inspired by layer normalization, SiFT adds the perturbation to normalized word embeddings for adversarial fine-tuning (see the sketch below)
• It was applied only to the 1.5B model; a comprehensive study is left for future work
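As a generic illustration of the idea (not the exact SiFT code): perturb the normalized embeddings and penalize the divergence between clean and perturbed predictions. `model_from_embeds` and the symmetric-KL consistency loss are assumptions for this sketch.

```python
# Generic sketch of SiFT-style virtual adversarial training: perturb the
# LayerNorm-ed word embeddings and keep the prediction consistent.
import torch
import torch.nn.functional as F

def sift_consistency_loss(model_from_embeds, embeds, eps=0.02):
    normed = F.layer_norm(embeds, embeds.shape[-1:])   # scale-invariant input
    clean_logits = model_from_embeds(normed)
    delta = torch.randn_like(normed) * eps             # one random perturbation
    adv_logits = model_from_embeds(normed + delta)     # (full SiFT would refine
                                                       #  delta by gradient ascent)
    # symmetric KL between the clean and perturbed predictions
    log_p = F.log_softmax(clean_logits, dim=-1)
    log_q = F.log_softmax(adv_logits, dim=-1)
    return (F.kl_div(log_q, log_p.exp(), reduction="batchmean")
            + F.kl_div(log_p, log_q.exp(), reduction="batchmean"))
```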
Scale Invariant Fine-Tuning
31
https://github.com/microsoft/DeBERTa/tree/master/DeBERTa/sift
Scale Invariant Fine-Tuning
32
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Scale Invariant Fine-Tuning
33
[Diagram: fine-tuning pipeline — the Embedding Module sums the sub-word embedding and token type embedding (plus the absolute position embedding if position_biased_input) and applies LayerNorm; the DeBERTa Module encodes the input (CLS 딥 ##러 ##닝 논 ##문 모임 SEP); a task-specific layer (SuperGLUE) produces the outputs (O B-XX I-XX I-XX B-XX I-XX I-XX O) and the loss ℒ(logits, golden_truth). The SiFT hook is attached at the embedding output.]
Scale Invariant Fine-Tuning
34
[Same pipeline as above; the SiFT hook normalizes the embedding output with LayerNorm(inputs) before any perturbation is applied.]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
Scale Invariant Fine-Tuning
35
[Same pipeline as above; a perturbation $\delta \sim N(0, 0.02)$ is added to the normalized embeddings, and $\delta$ is clamped whenever $\delta \ge 0.04$ or $\delta \le -0.04$; the perturbed (adversarial) input is then fed through the model.]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
→ a minimal sketch of this perturbation step follows below
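A minimal sketch of the perturbation step above, taking 0.02 as the standard deviation and ±0.04 as the clamping bound (both read off the slide; the exact interpretation in the SiFT code may differ):

```python
# Sketch of the perturbation: δ ~ N(0, 0.02) added to the normalized
# embeddings, clamped to [-0.04, 0.04].
import torch

def perturb(normalized_embeds, std=0.02, bound=0.04):
    delta = torch.randn_like(normalized_embeds) * std
    delta = delta.clamp(-bound, bound)
    return normalized_embeds + delta   # adversarial input fed to the model
```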
Experiment: Pre-training
36
Experiment: Pre-training
37
• Dynamic data batching as in RoBERTa
• Span masking as in SpanBERT (see the sketch below)
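A rough illustration of the span-masking idea; the 15% budget, geometric p = 0.2, and max span of 10 follow SpanBERT's reported settings and are not necessarily DeBERTa's exact values.

```python
# Simplified SpanBERT-style span masking: sample geometric span lengths and
# mask contiguous positions until ~15% of the sequence is covered.
import random
import numpy as np

def span_mask_positions(seq_len, mask_ratio=0.15, p=0.2, max_span=10):
    budget = int(seq_len * mask_ratio)
    masked = set()
    while len(masked) < budget:
        span_len = min(int(np.random.geometric(p)), max_span)
        start = random.randrange(seq_len)
        masked.update(range(start, min(start + span_len, seq_len)))
    return sorted(masked)
```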
Experiment: Pre-training
38
Experiment: Pre-training
39
Experiment: Fine-tuning
40
Experiment: Fine-tuning
41
Experiment: Fine-tuning
42
Experiment: Ablation study
43
Experiment: SuperGLUE
44
Experiment: SIFT
45
Experiment: DeBERTa v2
46
https://huggingface.co/docs/transformers/model_doc/deberta_v2
Experiment: DeBERTa v2
47
Conclusion
48
• RoBERTa is improved with disentangled attention and an enhanced mask decoder
• SiFT is proposed to improve model generalization on downstream tasks
• In terms of macro score, DeBERTa surpasses human performance on the SuperGLUE benchmark
• That still does not mean human-level intelligence has been reached
• DeBERTa V3 came out recently! → We will cover it as a wrap-up in the next session.