The presentation explains the ELECTRA model.
ELECTRA stands for 'Efficiently Learning an Encoder that Classifies Token Replacements Accurately'.
The paper proposes replaced token detection, a pre-training task that is more compute-efficient than masked language modeling.
(11 March 2021)
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
1. Natural Language Processing Lab
M2020064
조단비
Published in: The 8th International Conference on Learning Representations (ICLR 2020)
URL: https://arxiv.org/abs/2003.10555
2. Content
1. Idea
2. Introduction
3. Method
4. Experiments and results
5. Summary
3. Idea
[BERT]
> Replaces some tokens with [MASK]
(masked language modeling)
[ELECTRA]
> Replaces some tokens with plausible alternatives sampled from a small generator network

Problem)
Masked language modeling requires large amounts of compute
Proposal)
A more sample-efficient pre-training task: replaced token detection

[BERT]
> Trains a model that predicts the original identities of the corrupted tokens
[ELECTRA]
> Trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not (see the worked example below)
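As a worked illustration of replaced token detection (the sentence is the example from Figure 1 of the paper; the code layout is our own sketch):

```python
# Replaced token detection on the paper's Figure 1 example sentence.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator swapped "cooked" -> "ate"

# The discriminator labels every position: 1 = replaced, 0 = original.
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0]
```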
5. Introduction
> SOTA representation learning methods can be viewed as learning denoising autoencoders (DAE)
> Proposed method: replaced token detection
> Goal: improve the efficiency of pre-training
[Masked language modeling (BERT, XLNet)]
> Corrupts the input tokens by masking (or attention masking) and restores the original input tokens
> Incurs substantial compute cost: the network only learns from 15% of the tokens per example

[Replaced token detection (ELECTRA)]
> Replaces input tokens using samples generated by a small masked language model
> Predicts, for each token, whether it is the original token or a replacement
> The model learns from all input tokens as a discriminator
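For a concrete sense of scale (512 tokens is a typical pre-training sequence length; the arithmetic is ours): with 15% masking, the MLM loss covers roughly 0.15 × 512 ≈ 77 positions per example, while the replaced-token-detection loss covers all 512 positions.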
8. Method
> Generator
(1) Input token sequence $\mathbf{x} = [x_1, x_2, \dots, x_n]$
(2) Select a random set of positions (between 1 and $n$) to mask: $m_i \sim \mathrm{unif}\{1, n\}$ for $i = 1$ to $k$ ($k = 0.15n$)
(3) Replace the tokens at the selected positions with [MASK]: $\mathbf{x}^{\mathrm{masked}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, [\mathrm{MASK}])$
(4) Learn to predict the original identities of the masked tokens using a small MLM (the generator)
(5) Sample the generator's predicted tokens from its softmax output: $\hat{x}_i \sim p_G(x_i \mid \mathbf{x}^{\mathrm{masked}})$ for $i \in \mathbf{m}$
(6) Replace the masked tokens with the generator's predictions: $\mathbf{x}^{\mathrm{corrupt}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, \hat{\mathbf{x}})$
(A minimal code sketch of these steps follows.)
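A minimal PyTorch sketch of the generator's steps (1)-(6). This is a toy illustration under assumed shapes, not the authors' implementation; `generator` stands for any module returning per-position vocabulary logits of shape [batch, n, vocab]:

```python
import torch

def corrupt(x, generator, mask_id, mask_prob=0.15):
    """Steps (1)-(6): mask 15% of positions, let the generator fill them in."""
    n = x.size(1)
    k = max(1, int(mask_prob * n))
    m = torch.randperm(n)[:k]                        # (2) positions to mask
    x_masked = x.clone()
    x_masked[:, m] = mask_id                         # (3) x_masked = REPLACE(x, m, [MASK])
    logits = generator(x_masked)                     # (4) small MLM predicts the masked tokens
    probs = torch.softmax(logits[:, m], dim=-1)      # (5) softmax over the vocabulary
    x_hat = torch.multinomial(probs.flatten(0, 1), 1).view(x.size(0), k)
    x_corrupt = x.clone()
    x_corrupt[:, m] = x_hat                          # (6) x_corrupt = REPLACE(x, m, x_hat)
    return x_masked, x_corrupt, m
```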
9. Method
> Discriminator
(1) Take the corrupted sequence produced by the generator: $\mathbf{x}^{\mathrm{corrupt}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, \hat{\mathbf{x}})$
(2) Learn to distinguish original tokens from replaced tokens (the discriminator)
(3) For every input token, output via a sigmoid whether it is original or replaced
> Loss function
- Minimize the combined loss (a code sketch follows):
$$\min_{\theta_G, \theta_D} \sum_{\mathbf{x} \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(\mathbf{x}, \theta_G) + \lambda\, \mathcal{L}_{\mathrm{Disc}}(\mathbf{x}, \theta_D)$$
(* $\mathcal{L}_{\mathrm{MLM}}$: loss of the generator, $\mathcal{L}_{\mathrm{Disc}}$: loss of the discriminator)
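Continuing the toy sketch above, the combined loss can be written as follows (λ = 50 is the weight reported in the paper; shapes and module behavior are assumptions):

```python
import torch
import torch.nn.functional as F

def electra_loss(x, x_masked, x_corrupt, m, generator, discriminator, lam=50.0):
    """L_MLM(x, theta_G) + lambda * L_Disc(x, theta_D) for one batch."""
    # Generator loss: cross-entropy on the masked positions only.
    gen_logits = generator(x_masked)[:, m]                    # [batch, k, vocab]
    loss_mlm = F.cross_entropy(gen_logits.flatten(0, 1), x[:, m].flatten())

    # Discriminator loss: sigmoid over EVERY position (1 = replaced, 0 = original).
    disc_logits = discriminator(x_corrupt)                    # [batch, n]
    labels = (x_corrupt != x).float()
    loss_disc = F.binary_cross_entropy_with_logits(disc_logits, labels)

    return loss_mlm + lam * loss_disc
```

Note that the labels come from comparing $\mathbf{x}^{\mathrm{corrupt}}$ with $\mathbf{x}$: when the generator happens to sample the correct original token, that position is labeled "original", as specified in the paper.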
10. Experiments and results
1) Experimental setup
2) Model Extensions
3) Small Models
4) Large Models
5) Efficiency Analysis
11. Experimental Setup
> Evaluation
- GLUE (General Language Understanding Evaluation): 9 tasks (average score)
- CoLA: Is the sentence grammatical or ungrammatical?
- SST: Is the movie review positive or negative?
- MRPC: Is the sentence B a paraphrase of sentence A?
- STS: How similar are sentences A and B?
- QQP: Are the two questions similar?
- MNLI: Does sentence A entail or contradict sentence B?
- QNLI: Does sentence B contain the answer to the question in sentence A?
- RTE: Does sentence A entail sentence B?
- WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun?
- SQuAD (Stanford Question Answering Dataset)
https://rajpurkar.github.io/SQuAD-explorer/
https://gluebenchmark.com/
12. Model Extensions
> Weight sharing
- Sharing weights between the generator and discriminator
- Tying all weights requires model size(generator) == model size(discriminator) (*model size = the number of hidden units)
#. Comparing the weight-tying strategies (GLUE score)
- no weight tying: 83.6
- tying token embeddings: 84.3 (*advantage: the MLM task is effective at learning the token embeddings)
- tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size)
- When model size(generator) < model size(discriminator), sharing only the token and positional embedding weights remains effective (see the sketch below)
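A minimal sketch of what tying the token embeddings means in practice (the module layout here is hypothetical, not the authors' code):

```python
import torch.nn as nn

vocab_size, embed_dim = 30522, 128  # toy sizes

# Hypothetical stand-ins for the two transformer encoders.
generator = nn.ModuleDict({"encoder": nn.Linear(embed_dim, embed_dim)})
discriminator = nn.ModuleDict({"encoder": nn.Linear(embed_dim, embed_dim)})

# Tie the token-embedding table: both networks point at the same parameters,
# so gradients from the MLM loss and the discriminator loss both update it.
shared = nn.Embedding(vocab_size, embed_dim)
generator["embeddings"] = shared
discriminator["embeddings"] = shared
assert generator["embeddings"].weight is discriminator["embeddings"].weight
```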
13. Model Extensions
> Smaller generators
- If the generator and discriminator are the same size, ELECTRA costs about twice as much compute per step as masked language modeling alone
- The generator is shrunk by decreasing the layer sizes while keeping the other hyperparameters constant
- GLUE scores are best when the generator is ¼ to ½ the size of the discriminator
[Figures: GLUE scores when the generator and discriminator are the same size vs. when their sizes differ]
14. Model Extensions
> Training algorithms (alternatives tried)
1. Train only the generator with the MLM loss for $n$ steps
2. Initialize the discriminator's weights with the generator's weights, then train the discriminator with the discriminator loss for $n$ steps, keeping the generator's weights frozen
- In addition, they explore training the generator adversarially, as in a GAN; the adversarially trained generator reaches only 58% MLM accuracy
- Problem 1: reinforcement learning is inefficient in the large action space of generating text
- Problem 2: the adversarially trained generator produces a low-entropy output distribution
(A sketch of the two-stage schedule follows.)
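A sketch of the two-stage schedule described above. All helpers (`corrupt`, `mlm_loss`, `disc_loss`, the optimizers, and the batch iterator) are assumed to be supplied by the caller; joint training, which the paper finds works best, instead optimizes the combined loss from the Method slide throughout:

```python
def two_stage_train(generator, discriminator, batches, corrupt, mask_id,
                    mlm_loss, disc_loss, gen_opt, disc_opt, n_steps):
    """Two-stage training variant tried in the paper (sketch, not the authors' code)."""
    # Stage 1: train only the generator with the MLM loss for n steps.
    for _ in range(n_steps):
        x = next(batches)
        x_masked, _, m = corrupt(x, generator, mask_id)
        loss = mlm_loss(generator(x_masked), x, m)
        loss.backward(); gen_opt.step(); gen_opt.zero_grad()

    # Stage 2: initialize the discriminator with the generator's weights
    # (this requires matching architectures), then train the discriminator
    # alone while the generator stays frozen.
    discriminator.load_state_dict(generator.state_dict())
    for p in generator.parameters():
        p.requires_grad = False
    for _ in range(n_steps):
        x = next(batches)
        _, x_corrupt, _ = corrupt(x, generator, mask_id)
        labels = (x_corrupt != x).float()
        loss = disc_loss(discriminator(x_corrupt), labels)
        loss.backward(); disc_opt.step(); disc_opt.zero_grad()
```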
15. Small Models - GLUE
16. Large Models - GLUE
[Tables: GLUE results on the dev set and the test set. ELECTRA trained for 400K steps uses about ¼ of RoBERTa's pre-training compute; trained for 1,750K steps it matches RoBERTa's compute.]
17. Small Models & Large Models - SQuAD
18. Efficiency Analysis
> Ablations isolating where ELECTRA's gains come from
1. ELECTRA 15%: the discriminator loss is computed only over the 15% of tokens that were masked
- tests how much is gained by computing the loss over all tokens
> ELECTRA (85.0) > ELECTRA 15% (82.4) on GLUE
2. Replace MLM: during masked language modeling, [MASK] tokens are replaced with generator samples
- tests how much BERT is hurt by the pre-train/fine-tune mismatch of [MASK] tokens
> Replace MLM (82.4) > BERT (82.2)
3. All-Tokens MLM: the model predicts all input tokens, not only the masked ones, using a sigmoid copy mechanism to decide whether to copy the input token
- tests the effect of predicting every token
> All-Tokens MLM (84.3) > Replace MLM (82.4)
(A sketch of the ELECTRA vs. ELECTRA 15% loss difference follows.)
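The ELECTRA 15% ablation amounts to restricting the discriminator loss to the masked positions; a sketch of the difference, with shapes as in the earlier toy code:

```python
import torch.nn.functional as F

def disc_loss(disc_logits, labels, m=None):
    """ELECTRA: loss over all n positions. ELECTRA 15%: only the masked positions m."""
    if m is None:  # full ELECTRA: every token contributes a learning signal
        return F.binary_cross_entropy_with_logits(disc_logits, labels)
    return F.binary_cross_entropy_with_logits(disc_logits[:, m], labels[:, m])
```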
19. Summary
> Proposal:
replaced token detection (a new self-supervised task for language representation learning)
> Key idea:
Training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network
> Performance:
ELECTRA is more compute-efficient than masked language modeling approaches and achieves better performance