The presentation explains the ELECTRA model.
ELECTRA stands for 'Efficiently Learning an Encoder that Classifies Token Replacements Accurately'.
The paper proposes replaced token detection, a pre-training task that is more compute-efficient than masked language modeling.
(11 March 2021)
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
1. Natural Language Processing Lab
M2020064
조단비
Published in: The 8th International Conference on Learning Representations (ICLR 2020)
URL: https://arxiv.org/abs/2003.10555
2. Content
1. Idea
2. Introduction
3. Method
4. Experiments and results
5. Summary
3. Idea
[BERT]
> Replaces some tokens with [MASK]
(masked language modeling)
[ELECTRA]
> Replaces some tokens with plausible alternatives sampled from a small generator network

Problem)
Masked language modeling requires large amounts of compute
Proposal)
A more sample-efficient pre-training task: replaced token detection

[BERT]
> Trains a model that predicts the original identities of the corrupted tokens
[ELECTRA]
> Trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not (see the worked example below)
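As a worked illustration of replaced token detection (the sentence is the example from Figure 1 of the paper; the code layout is our own sketch):

```python
# Replaced token detection on the paper's Figure 1 example sentence.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator swapped "cooked" -> "ate"

# The discriminator labels every position: 1 = replaced, 0 = original.
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0]
```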
5. Introduction
> SOTA representation learning methods can be viewed as learning denoising autoencoders (DAE)
> Proposed method: replaced token detection
> Goal: improve the efficiency of pre-training
[Masked language modeling (BERT, XLNet)]
> Corrupts the input tokens by masking (or attention masking) and restores the original input tokens
> Incurs substantial compute cost: the network only learns from 15% of the tokens per example

[Replaced token detection (ELECTRA)]
> Replaces input tokens using samples generated by a small masked language model
> Predicts, for each token, whether it is the original token or a replacement
> The model learns from all input tokens as a discriminator
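For a concrete sense of scale (512 tokens is a typical pre-training sequence length; the arithmetic is ours): with 15% masking, the MLM loss covers roughly 0.15 × 512 ≈ 77 positions per example, while the replaced-token-detection loss covers all 512 positions.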
8. Method
> Generator
(1) Input token sequence $\mathbf{x} = [x_1, x_2, \dots, x_n]$
(2) Select a random set of positions (between 1 and $n$) to mask: $m_i \sim \mathrm{unif}\{1, n\}$ for $i = 1$ to $k$ ($k = 0.15n$)
(3) Replace the tokens at the selected positions with [MASK]: $\mathbf{x}^{\mathrm{masked}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, [\mathrm{MASK}])$
(4) Learn to predict the original identities of the masked tokens using a small MLM (the generator)
(5) Sample the generator's predicted tokens from its softmax output: $\hat{x}_i \sim p_G(x_i \mid \mathbf{x}^{\mathrm{masked}})$ for $i \in \mathbf{m}$
(6) Replace the masked tokens with the generator's predictions: $\mathbf{x}^{\mathrm{corrupt}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, \hat{\mathbf{x}})$
(A minimal code sketch of these steps follows.)
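A minimal PyTorch sketch of the generator's steps (1)-(6). This is a toy illustration under assumed shapes, not the authors' implementation; `generator` stands for any module returning per-position vocabulary logits of shape [batch, n, vocab]:

```python
import torch

def corrupt(x, generator, mask_id, mask_prob=0.15):
    """Steps (1)-(6): mask 15% of positions, let the generator fill them in."""
    n = x.size(1)
    k = max(1, int(mask_prob * n))
    m = torch.randperm(n)[:k]                        # (2) positions to mask
    x_masked = x.clone()
    x_masked[:, m] = mask_id                         # (3) x_masked = REPLACE(x, m, [MASK])
    logits = generator(x_masked)                     # (4) small MLM predicts the masked tokens
    probs = torch.softmax(logits[:, m], dim=-1)      # (5) softmax over the vocabulary
    x_hat = torch.multinomial(probs.flatten(0, 1), 1).view(x.size(0), k)
    x_corrupt = x.clone()
    x_corrupt[:, m] = x_hat                          # (6) x_corrupt = REPLACE(x, m, x_hat)
    return x_masked, x_corrupt, m
```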
9. Method
> Discriminator
(1) Take the corrupted sequence produced by the generator: $\mathbf{x}^{\mathrm{corrupt}} = \mathrm{REPLACE}(\mathbf{x}, \mathbf{m}, \hat{\mathbf{x}})$
(2) Learn to distinguish original tokens from replaced tokens (the discriminator)
(3) For every input token, output via a sigmoid whether it is original or replaced
> Loss function
- Minimize the combined loss (a code sketch follows):
$$\min_{\theta_G, \theta_D} \sum_{\mathbf{x} \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(\mathbf{x}, \theta_G) + \lambda\, \mathcal{L}_{\mathrm{Disc}}(\mathbf{x}, \theta_D)$$
(* $\mathcal{L}_{\mathrm{MLM}}$: loss of the generator, $\mathcal{L}_{\mathrm{Disc}}$: loss of the discriminator)
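Continuing the toy sketch above, the combined loss can be written as follows (λ = 50 is the weight reported in the paper; shapes and module behavior are assumptions):

```python
import torch
import torch.nn.functional as F

def electra_loss(x, x_masked, x_corrupt, m, generator, discriminator, lam=50.0):
    """L_MLM(x, theta_G) + lambda * L_Disc(x, theta_D) for one batch."""
    # Generator loss: cross-entropy on the masked positions only.
    gen_logits = generator(x_masked)[:, m]                    # [batch, k, vocab]
    loss_mlm = F.cross_entropy(gen_logits.flatten(0, 1), x[:, m].flatten())

    # Discriminator loss: sigmoid over EVERY position (1 = replaced, 0 = original).
    disc_logits = discriminator(x_corrupt)                    # [batch, n]
    labels = (x_corrupt != x).float()
    loss_disc = F.binary_cross_entropy_with_logits(disc_logits, labels)

    return loss_mlm + lam * loss_disc
```

Note that the labels come from comparing $\mathbf{x}^{\mathrm{corrupt}}$ with $\mathbf{x}$: when the generator happens to sample the correct original token, that position is labeled "original", as specified in the paper.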
10. Experiments and results
1) Experimental setup
2) Model Extensions
3) Small Models
4) Large Models
5) Efficiency Analysis
11. Experimental Setup
> Evaluation
- GLUE (General Language Understanding Evaluation): 9 tasks (average score)
- CoLA: Is the sentence grammatical or ungrammatical?
- SST: Is the movie review positive or negative?
- MRPC: Is the sentence B a paraphrase of sentence A?
- STS: How similar are sentences A and B?
- QQP: Are the two questions similar?
- MNLI: Does sentence A entail or contradict sentence B?
- QNLI: Does sentence B contain the answer to the question in sentence A?
- RTE: Does sentence A entail sentence B?
- WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun?
- SQuAD (Stanford Question Answering Dataset)
https://rajpurkar.github.io/SQuAD-explorer/
https://gluebenchmark.com/
12. Model Extensions
> Weight sharing
- Sharing weights between the generator and discriminator
- Tying all weights requires model size(generator) == model size(discriminator) (*model size = the number of hidden units)
#. Comparing the weight-tying strategies (GLUE score)
- no weight tying: 83.6
- tying token embeddings: 84.3 (*advantage: the MLM task is effective at learning the token embeddings)
- tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size)
- When model size(generator) < model size(discriminator), sharing only the token and positional embedding weights remains effective (see the sketch below)
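A minimal sketch of what tying the token embeddings means in practice (the module layout here is hypothetical, not the authors' code):

```python
import torch.nn as nn

vocab_size, embed_dim = 30522, 128  # toy sizes

# Hypothetical stand-ins for the two transformer encoders.
generator = nn.ModuleDict({"encoder": nn.Linear(embed_dim, embed_dim)})
discriminator = nn.ModuleDict({"encoder": nn.Linear(embed_dim, embed_dim)})

# Tie the token-embedding table: both networks point at the same parameters,
# so gradients from the MLM loss and the discriminator loss both update it.
shared = nn.Embedding(vocab_size, embed_dim)
generator["embeddings"] = shared
discriminator["embeddings"] = shared
assert generator["embeddings"].weight is discriminator["embeddings"].weight
```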
13. Model Extensions
> Smaller generators
- If the generator and discriminator are the same size, ELECTRA costs about twice as much compute per step as masked language modeling alone
- The generator is shrunk by decreasing the layer sizes while keeping the other hyperparameters constant
- GLUE scores are best when the generator is ¼ to ½ the size of the discriminator
[Figures: GLUE scores when the generator and discriminator are the same size vs. when their sizes differ]
14. Model Extensions
> Training algorithms (alternatives tried)
1. Train only the generator with the MLM loss for $n$ steps
2. Initialize the discriminator's weights with the generator's weights, then train the discriminator with the discriminator loss for $n$ steps, keeping the generator's weights frozen
- In addition, they explore training the generator adversarially, as in a GAN; the adversarially trained generator reaches only 58% MLM accuracy
- Problem 1: reinforcement learning is inefficient in the large action space of generating text
- Problem 2: the adversarially trained generator produces a low-entropy output distribution
(A sketch of the two-stage schedule follows.)
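A sketch of the two-stage schedule described above. All helpers (`corrupt`, `mlm_loss`, `disc_loss`, the optimizers, and the batch iterator) are assumed to be supplied by the caller; joint training, which the paper finds works best, instead optimizes the combined loss from the Method slide throughout:

```python
def two_stage_train(generator, discriminator, batches, corrupt, mask_id,
                    mlm_loss, disc_loss, gen_opt, disc_opt, n_steps):
    """Two-stage training variant tried in the paper (sketch, not the authors' code)."""
    # Stage 1: train only the generator with the MLM loss for n steps.
    for _ in range(n_steps):
        x = next(batches)
        x_masked, _, m = corrupt(x, generator, mask_id)
        loss = mlm_loss(generator(x_masked), x, m)
        loss.backward(); gen_opt.step(); gen_opt.zero_grad()

    # Stage 2: initialize the discriminator with the generator's weights
    # (this requires matching architectures), then train the discriminator
    # alone while the generator stays frozen.
    discriminator.load_state_dict(generator.state_dict())
    for p in generator.parameters():
        p.requires_grad = False
    for _ in range(n_steps):
        x = next(batches)
        _, x_corrupt, _ = corrupt(x, generator, mask_id)
        labels = (x_corrupt != x).float()
        loss = disc_loss(discriminator(x_corrupt), labels)
        loss.backward(); disc_opt.step(); disc_opt.zero_grad()
```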
15. Small Models - GLUE
16. Large Models - GLUE
[Tables: GLUE results on the dev set and the test set. ELECTRA trained for 400K steps uses about ¼ of RoBERTa's pre-training compute; trained for 1,750K steps it matches RoBERTa's compute.]
17. Small Models & Large Models - SQuAD
18. Efficiency Analysis
> Ablations isolating where ELECTRA's gains come from
1. ELECTRA 15%: the discriminator loss is computed only over the 15% of tokens that were masked
- tests how much is gained by computing the loss over all tokens
> ELECTRA (85.0) > ELECTRA 15% (82.4) on GLUE
2. Replace MLM: during masked language modeling, [MASK] tokens are replaced with generator samples
- tests how much BERT is hurt by the pre-train/fine-tune mismatch of [MASK] tokens
> Replace MLM (82.4) > BERT (82.2)
3. All-Tokens MLM: the model predicts all input tokens, not only the masked ones, using a sigmoid copy mechanism to decide whether to copy the input token
- tests the effect of predicting every token
> All-Tokens MLM (84.3) > Replace MLM (82.4)
(A sketch of the ELECTRA vs. ELECTRA 15% loss difference follows.)
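The ELECTRA 15% ablation amounts to restricting the discriminator loss to the masked positions; a sketch of the difference, with shapes as in the earlier toy code:

```python
import torch.nn.functional as F

def disc_loss(disc_logits, labels, m=None):
    """ELECTRA: loss over all n positions. ELECTRA 15%: only the masked positions m."""
    if m is None:  # full ELECTRA: every token contributes a learning signal
        return F.binary_cross_entropy_with_logits(disc_logits, labels)
    return F.binary_cross_entropy_with_logits(disc_logits[:, m], labels[:, m])
```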
19. Summary
> Proposal:
replaced token detection (a new self-supervised task for language representation learning)
> Key idea:
Training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network
> Performance:
ELECTRA is more compute-efficient than masked language modeling approaches and achieves better performance