Natural Language Processing Lab
M2020064
Danbi Cho
Published in: The 8th International Conference on Learning Representations (ICLR 2020)
URL: https://arxiv.org/abs/2003.10555
Contents
1. Idea
2. Introduction
3. Method
4. Experiments and results
5. Summary
#Kookmin_University #Natural_Language_Processing_lab. 1
Idea
[BERT]
> Replaces some tokens with [MASK]
(masked language modeling)
[ELECTRA]
> Replaces some tokens with plausible alternatives sampled from a small generator network
Problem)
MLM pre-training requires large amounts of compute
Proposal)
A more sample-efficient pre-training task: replaced token detection
[BERT]
> Trains a model that predicts the original identities of the corrupted tokens
[ELECTRA]
> Trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not
Introduction
https://github.com/google-research/electra
“ELECTRA”
: Efficiently Learning an Encoder that Classifies Token Replacements Accurately
Introduction
> SOTA representation learning can be viewed as learning a denoising autoencoder (DAE)
> Proposed method: replaced token detection
> Goal: improve the efficiency of pre-training
[Masked language modeling (BERT, XLNet)]
- Corrupts the input tokens with masking or attention masks
- The network restores the original input tokens
- Incurs substantial compute cost: the network only learns from 15% of the tokens per example

[Replaced token detection (ELECTRA)]
- Corrupts the input tokens by replacement, using samples generated by a small masked language model
- The network predicts whether each token is the original or a replacement
- As a discriminator, the model learns from all input tokens
Introduction
> ELECTRA
- ELECTRA-small
- pre-training BERT dataset
- comparison with BERT, GPT
- ELECTRA-Large
- pre-training XLNet dataset
- comparison with RoBERTa, XLNet
Method
> ELECTRA
: generator 𝐺 + discriminator 𝐷
- Each network is a Transformer encoder that maps a sequence of input tokens into a sequence of contextualized vector representations
Method
> Generator
(1) Input token sequence x = [x_1, x_2, …, x_n]
(2) Select a random set of positions (between 1 and n) to mask: m_i ~ unif(1, n) for i = 1 to k (k = 0.15n)
(3) Replace the tokens at the selected positions with [MASK]: x^masked = REPLACE(x, m, [MASK])
(4) Learn to predict the original identities of the masked-out tokens using a small MLM (the generator)
(5) Sample predicted tokens from the generator's softmax output: x̂_i ~ P_G(x_i | x^masked) for i ∈ m
(6) Replace the masked tokens with the generator's predictions: x^corrupt = REPLACE(x, m, x̂)
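Steps (1)–(6) above can be sketched in a few lines of Python. Here `generator_sample` is a hypothetical stand-in for sampling x̂_i ~ P_G(x_i | x^masked) from the small MLM generator, not the paper's actual model:

```python
import random

MASK = "[MASK]"

def corrupt(tokens, generator_sample, mask_frac=0.15):
    """Sketch of ELECTRA's corruption steps (1)-(6) on one sequence."""
    n = len(tokens)
    k = max(1, int(mask_frac * n))                    # k = 0.15n
    positions = random.sample(range(n), k)            # (2) pick positions to mask
    masked = list(tokens)
    for i in positions:                               # (3) x_masked = REPLACE(x, m, [MASK])
        masked[i] = MASK
    corrupted = list(masked)
    for i in positions:                               # (5) sample x_hat_i from the generator
        corrupted[i] = generator_sample(masked, i)    # (6) x_corrupt = REPLACE(x, m, x_hat)
    return masked, corrupted, positions
```

The discriminator's binary labels then follow directly: a token at position i is labeled "replaced" exactly when `corrupted[i] != tokens[i]`.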
Method
> Discriminator
(1) Take the corrupted sequence produced by the generator: x^corrupt = REPLACE(x, m, x̂)
(2) Learn to distinguish original tokens from replaced tokens (the discriminator)
(3) For each input token, output via a sigmoid whether it is original or replaced
> Loss function
- Minimize the combined loss over the corpus X:
min_{θ_G, θ_D} Σ_{x ∈ X} [ L_MLM(x, θ_G) + λ · L_Disc(x, θ_D) ]
(* L_MLM: loss of the generator, L_Disc: loss of the discriminator)
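The combined objective can be illustrated numerically with a toy function. The inputs here are hypothetical probabilities, not real model outputs: cross-entropy over the masked positions for the generator, binary cross-entropy over all tokens for the discriminator, weighted by λ (the paper uses λ = 50):

```python
import math

def combined_loss(mlm_probs, disc_probs, replaced, lam=50.0):
    """Toy version of L_MLM(x, theta_G) + lambda * L_Disc(x, theta_D).

    mlm_probs:  generator probability assigned to the correct original
                token at each masked position (loss over masked tokens only)
    disc_probs: discriminator sigmoid probability that each token was
                replaced (loss over ALL input tokens)
    replaced:   per-token labels; True where the token was replaced
    """
    l_mlm = -sum(math.log(p) for p in mlm_probs) / len(mlm_probs)
    l_disc = -sum(
        math.log(p) if was_replaced else math.log(1.0 - p)
        for p, was_replaced in zip(disc_probs, replaced)
    ) / len(disc_probs)
    return l_mlm + lam * l_disc
```

Because L_Disc sums over every token while L_MLM only covers the ~15% masked positions, the discriminator term supplies a learning signal for the whole input.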
Experiments and results
1) Experimental setup
2) Model Extensions
3) Small Models
4) Large Models
5) Efficiency Analysis
Experimental Setup
> Evaluation
- GLUE (General Language Understanding Evaluation): 9 tasks (average score)
- CoLA: Is the sentence grammatical or ungrammatical?
- SST: Is the movie review positive, negative or neutral?
- MRPC: Is the sentence B a paraphrase of sentence A?
- STS: How similar are sentences A and B?
- QQP: Are the two questions similar?
- MNLI: Does sentence A entail or contradict sentence B?
- QNLI: Does sentence B contain the answer to the question in sentence A?
- RTE: Does sentence A entail sentence B?
- WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun?
- SQuAD (Stanford Question Answering Dataset)
https://rajpurkar.github.io/SQuAD-explorer/
https://gluebenchmark.com/
Model Extensions
> Weight sharing
- Share weights between the generator and discriminator
- When model size(generator) == model size(discriminator) (*model size = number of hidden units):
#. Comparing weight-tying strategies (GLUE score)
- no weight tying: 83.6
- tying token embeddings: 84.3 (*advantage: the MLM task is effective at learning token embeddings)
- tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size)
- When model size(generator) < model size(discriminator):
share only the token and positional embedding weights (this is effective)
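A minimal sketch of the embedding-sharing idea, with hypothetical classes: both networks hold references to the same token and positional embedding tables, while keeping separate encoder stacks of different sizes (the sizes and vocab below are illustrative):

```python
class EmbeddingTable:
    """Hypothetical stand-in for an embedding matrix."""
    def __init__(self, rows, dim):
        self.weight = [[0.0] * dim for _ in range(rows)]

class Encoder:
    """Hypothetical Transformer encoder; only the wiring is shown."""
    def __init__(self, token_emb, pos_emb, hidden_size):
        self.token_emb = token_emb      # shared table, not a copy
        self.pos_emb = pos_emb          # shared table, not a copy
        self.hidden_size = hidden_size  # per-model, NOT shared

# One set of embedding parameters, referenced by both networks:
token_emb = EmbeddingTable(rows=30522, dim=128)  # vocab size illustrative
pos_emb = EmbeddingTable(rows=512, dim=128)      # max position illustrative

generator = Encoder(token_emb, pos_emb, hidden_size=64)       # smaller G
discriminator = Encoder(token_emb, pos_emb, hidden_size=256)  # larger D

# Any update to the shared table is seen by both models:
token_emb.weight[0][0] = 1.0
```

Sharing references rather than copies is what makes an update through either model's loss move the same embedding parameters.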
Model Extensions
> Smaller generators
- Making the generator the same size as the discriminator is computationally expensive
- Reduce the generator by decreasing the layer sizes while keeping the other hyperparameters constant
- GLUE scores are best when the generator is ¼–½ the size of the discriminator
When the sizes of generator and discriminator are the same
When the sizes of generator and discriminator are different
Model Extensions
> Training algorithms (alternatives explored)
1. Train only the generator with the MLM loss for n steps
2. Initialize the discriminator's weights with the generator's weights,
then train the discriminator with the discriminator loss for n steps, keeping the generator's weights frozen
- In addition, the authors explore training the generator adversarially, as in a GAN (58%)
- Problem 1: reinforcement learning is inefficient in the large action space of generating text
- Problem 2: adversarial training yields a low-entropy generator output distribution
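The two-stage procedure above can be sketched as follows. `StubModel` and its methods are hypothetical stand-ins for a model plus one optimizer update, used only to show the ordering of the two stages:

```python
class StubModel:
    """Hypothetical model: records updates instead of training."""
    def __init__(self):
        self.steps, self.frozen, self.init_from = 0, False, None
    def train_step(self, batch):
        assert not self.frozen, "frozen models take no updates"
        self.steps += 1
    def load_weights(self, other):
        self.init_from = other
    def freeze(self):
        self.frozen = True

def two_stage_training(generator, discriminator, batches, n_steps):
    for batch in batches[:n_steps]:        # 1. generator only, MLM loss
        generator.train_step(batch)
    discriminator.load_weights(generator)  # 2. init D from G's weights
    generator.freeze()                     #    G stays frozen afterwards
    for batch in batches[:n_steps]:        #    train D with its own loss
        discriminator.train_step(batch)
```

Note this is the alternative the authors tried; ELECTRA's default is joint training of both networks under the combined loss.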
Small Models - GLUE
Large Models - GLUE
ELECTRA (400K steps) = ¼ of RoBERTa's pre-training compute
ELECTRA (1,750K steps) = RoBERTa's pre-training compute
Dev set
Test set
Small Models & Large Models - SQuAD
Efficiency Analysis
> Ablations validating ELECTRA's design (GLUE scores)
1. ELECTRA 15%: the discriminator loss is computed only over the 15% masked tokens
- tests the effect of computing the loss over all tokens
> ELECTRA (85.0) > ELECTRA 15% (82.4)
2. Replace MLM: [MASK] tokens are replaced with tokens sampled from the generator
- tests the effect of replacing [MASK] with generator samples
> Replace MLM (82.4) > BERT (82.2)
3. All-Tokens MLM: the model predicts all tokens, not only the masked ones
- tests the effect of a sigmoid copy mechanism that decides whether to copy the input token
> All-Tokens MLM (84.3) > Replace MLM (82.4)
Summary
> Proposal:
replaced token detection, a new self-supervised task for language representation learning
> Key idea:
train a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator
> Performance:
ELECTRA is more compute-efficient than masked language models and achieves better downstream performance
Thank You.
19
#Kookmin_University #Natural_Language_Processing_lab.

More Related Content

What's hot

Pegasus
PegasusPegasus
Pegasus
Hangil Kim
 
Clean code
Clean codeClean code
Clean code
Knoldus Inc.
 
Scala categorytheory
Scala categorytheoryScala categorytheory
Scala categorytheory
Knoldus Inc.
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!
ChenYiHuang5
 
Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...
Rama Irsheidat
 
Theory of Computation Unit 5
Theory of Computation Unit 5Theory of Computation Unit 5
Theory of Computation Unit 5
Jena Catherine Bel D
 

What's hot (6)

Pegasus
PegasusPegasus
Pegasus
 
Clean code
Clean codeClean code
Clean code
 
Scala categorytheory
Scala categorytheoryScala categorytheory
Scala categorytheory
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!
 
Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...
 
Theory of Computation Unit 5
Theory of Computation Unit 5Theory of Computation Unit 5
Theory of Computation Unit 5
 

Similar to ELECTRA_Pretraining Text Encoders as Discriminators rather than Generators

Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
Fwdays
 
Test-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptxTest-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptx
Victor Rentea
 
Spock Framework
Spock FrameworkSpock Framework
des mutants dans le code.pdf
des mutants dans le code.pdfdes mutants dans le code.pdf
des mutants dans le code.pdf
Guillaume Saint Etienne
 
Symbolic Execution And KLEE
Symbolic Execution And KLEESymbolic Execution And KLEE
Symbolic Execution And KLEE
Shauvik Roy Choudhary, Ph.D.
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for Benchmarks
Markus Scheidgen
 
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
Paulo Clavijo
 
Faculty of ScienceDepartment of ComputingFinal Examinati.docx
Faculty of ScienceDepartment of ComputingFinal Examinati.docxFaculty of ScienceDepartment of ComputingFinal Examinati.docx
Faculty of ScienceDepartment of ComputingFinal Examinati.docx
mydrynan
 
Idempotency of commands in distributed systems
Idempotency of commands in distributed systemsIdempotency of commands in distributed systems
Idempotency of commands in distributed systems
Max Małecki
 
A recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templatesA recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templates
Coen De Roover
 
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Benoit Combemale
 
Easymock Tutorial
Easymock TutorialEasymock Tutorial
Easymock Tutorial
Sbin m
 
GA.-.Presentation
GA.-.PresentationGA.-.Presentation
GA.-.Presentationoldmanpat
 
Desing pattern prototype-Factory Method, Prototype and Builder
Desing pattern prototype-Factory Method, Prototype and Builder Desing pattern prototype-Factory Method, Prototype and Builder
Desing pattern prototype-Factory Method, Prototype and Builder
paramisoft
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
University of Huddersfield
 
Working Effectively With Legacy Perl Code
Working Effectively With Legacy Perl CodeWorking Effectively With Legacy Perl Code
Working Effectively With Legacy Perl Codeerikmsp
 
Icpc2010 bettenburg
Icpc2010 bettenburgIcpc2010 bettenburg
Icpc2010 bettenburgSAIL_QU
 
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionAZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
SARADINDU SENGUPTA
 

Similar to ELECTRA_Pretraining Text Encoders as Discriminators rather than Generators (20)

Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi "Building a state-of-the-art a...
 
Test-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptxTest-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptx
 
Spock Framework
Spock FrameworkSpock Framework
Spock Framework
 
des mutants dans le code.pdf
des mutants dans le code.pdfdes mutants dans le code.pdf
des mutants dans le code.pdf
 
Symbolic Execution And KLEE
Symbolic Execution And KLEESymbolic Execution And KLEE
Symbolic Execution And KLEE
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for Benchmarks
 
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
 
Faculty of ScienceDepartment of ComputingFinal Examinati.docx
Faculty of ScienceDepartment of ComputingFinal Examinati.docxFaculty of ScienceDepartment of ComputingFinal Examinati.docx
Faculty of ScienceDepartment of ComputingFinal Examinati.docx
 
Idempotency of commands in distributed systems
Idempotency of commands in distributed systemsIdempotency of commands in distributed systems
Idempotency of commands in distributed systems
 
A recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templatesA recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templates
 
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
 
Easymock Tutorial
Easymock TutorialEasymock Tutorial
Easymock Tutorial
 
GA.-.Presentation
GA.-.PresentationGA.-.Presentation
GA.-.Presentation
 
Desing pattern prototype-Factory Method, Prototype and Builder
Desing pattern prototype-Factory Method, Prototype and Builder Desing pattern prototype-Factory Method, Prototype and Builder
Desing pattern prototype-Factory Method, Prototype and Builder
 
Js tips & tricks
Js tips & tricksJs tips & tricks
Js tips & tricks
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Working Effectively With Legacy Perl Code
Working Effectively With Legacy Perl CodeWorking Effectively With Legacy Perl Code
Working Effectively With Legacy Perl Code
 
Icpc2010 bettenburg
Icpc2010 bettenburgIcpc2010 bettenburg
Icpc2010 bettenburg
 
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionAZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
 
Lesson 39
Lesson 39Lesson 39
Lesson 39
 

More from Danbi Cho

Crf based named entity recognition using a korean lexical semantic network
Crf based named entity recognition using a korean lexical semantic networkCrf based named entity recognition using a korean lexical semantic network
Crf based named entity recognition using a korean lexical semantic network
Danbi Cho
 
Gpt models
Gpt modelsGpt models
Gpt models
Danbi Cho
 
Attention boosted deep networks for video classification
Attention boosted deep networks for video classificationAttention boosted deep networks for video classification
Attention boosted deep networks for video classification
Danbi Cho
 
A survey on deep learning based approaches for action and gesture recognition...
A survey on deep learning based approaches for action and gesture recognition...A survey on deep learning based approaches for action and gesture recognition...
A survey on deep learning based approaches for action and gesture recognition...
Danbi Cho
 
A survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in textA survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in text
Danbi Cho
 
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
Danbi Cho
 
Decision tree and ensemble
Decision tree and ensembleDecision tree and ensemble
Decision tree and ensemble
Danbi Cho
 
Can recurrent neural networks warp time
Can recurrent neural networks warp timeCan recurrent neural networks warp time
Can recurrent neural networks warp time
Danbi Cho
 
Man is to computer programmer as woman is to homemaker debiasing word embeddings
Man is to computer programmer as woman is to homemaker debiasing word embeddingsMan is to computer programmer as woman is to homemaker debiasing word embeddings
Man is to computer programmer as woman is to homemaker debiasing word embeddings
Danbi Cho
 
Situation recognition visual semantic role labeling for image understanding
Situation recognition visual semantic role labeling for image understandingSituation recognition visual semantic role labeling for image understanding
Situation recognition visual semantic role labeling for image understanding
Danbi Cho
 
Mitigating unwanted biases with adversarial learning
Mitigating unwanted biases with adversarial learningMitigating unwanted biases with adversarial learning
Mitigating unwanted biases with adversarial learning
Danbi Cho
 

More from Danbi Cho (11)

Crf based named entity recognition using a korean lexical semantic network
Crf based named entity recognition using a korean lexical semantic networkCrf based named entity recognition using a korean lexical semantic network
Crf based named entity recognition using a korean lexical semantic network
 
Gpt models
Gpt modelsGpt models
Gpt models
 
Attention boosted deep networks for video classification
Attention boosted deep networks for video classificationAttention boosted deep networks for video classification
Attention boosted deep networks for video classification
 
A survey on deep learning based approaches for action and gesture recognition...
A survey on deep learning based approaches for action and gesture recognition...A survey on deep learning based approaches for action and gesture recognition...
A survey on deep learning based approaches for action and gesture recognition...
 
A survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in textA survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in text
 
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne...
 
Decision tree and ensemble
Decision tree and ensembleDecision tree and ensemble
Decision tree and ensemble
 
Can recurrent neural networks warp time
Can recurrent neural networks warp timeCan recurrent neural networks warp time
Can recurrent neural networks warp time
 
Man is to computer programmer as woman is to homemaker debiasing word embeddings
Man is to computer programmer as woman is to homemaker debiasing word embeddingsMan is to computer programmer as woman is to homemaker debiasing word embeddings
Man is to computer programmer as woman is to homemaker debiasing word embeddings
 
Situation recognition visual semantic role labeling for image understanding
Situation recognition visual semantic role labeling for image understandingSituation recognition visual semantic role labeling for image understanding
Situation recognition visual semantic role labeling for image understanding
 
Mitigating unwanted biases with adversarial learning
Mitigating unwanted biases with adversarial learningMitigating unwanted biases with adversarial learning
Mitigating unwanted biases with adversarial learning
 

Recently uploaded

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
MayankTawar1
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Peter Caitens
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
KrzysztofKkol1
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
XfilesPro
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Jelle | Nordend
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 

Recently uploaded (20)

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 

ELECTRA_Pretraining Text Encoders as Discriminators rather than Generators

  • 1. 자연어처리 연구실 M2020064 조단비 Published in: The 8th International Conference on Learning Representations (ICLR 2020) URL: https://arxiv.org/abs/2003.10555
  • 2. Content 1. Idea 2. Introduce 3. Mothed 4. Experiments and results 5. Summary #Kookmin_University #Natural_Language_Processing_lab. 1
  • 3. Idea #Kookmin_University #Natural_Language_Processing_lab. 2 [BERT] > Replacing some token with [MASK] (masked language modeling) [ELECTRA] > Replacing some token with plausible alternatives sampled from a small generator network Problem) require large amounts of compute Proposal) a more sample-efficient pre training task : replaced token detection [BERT] > Train a model > Predicts the original identities of the corrupted token [ELECTRA] > Train a discriminative model > Predicts whether each token in the corrupted input was replaced by a generator sample or not
  • 5. Introduction > SOTA representation learning = learning denoising autoencoders (DAEs) > Proposed method: replaced token detection > Goal: improve the efficiency of pre-training #Kookmin_University #Natural_Language_Processing_lab. 4 [MLM: BERT, XLNet] Restore the original input tokens - Input tokens corrupted via masking or attention - Substantial compute cost is incurred - The network only learns from 15% of the tokens per example [Replaced token detection: ELECTRA] Predict whether each token is the original or a replacement - Input tokens corrupted via replacement with samples - Samples are generated by a small masked language model - The model learns from all input tokens as a discriminator
  • 6. Introduction > ELECTRA - ELECTRA-Small - pre-trained on the BERT dataset - compared with BERT, GPT - ELECTRA-Large - pre-trained on the XLNet dataset - compared with RoBERTa, XLNet #Kookmin_University #Natural_Language_Processing_lab. 5
  • 7. Method #Kookmin_University #Natural_Language_Processing_lab. 6 > ELECTRA : generator G + discriminator D - Each network is a Transformer encoder that maps a sequence of input tokens into a sequence of contextualized vector representations
  • 8. Method > Generator (1) Input token sequence x = [x1, x2, …, xn] (2) Select a random set of positions (between 1 and n) for masking: m_i ~ unif(1, n) for i = 1 to k (k = 0.15n) (3) Replace the tokens at the selected positions with [MASK]: x_masked = REPLACE(x, m, [MASK]) (4) Learn to predict the original identities of the masked tokens using a small MLM (the generator) (5) Output the predicted tokens from the generator with a softmax: x̂_i ~ P_G(x_i | x_masked) for i ∈ m (6) Replace the masked tokens with the generator's predicted tokens: x_corrupt = REPLACE(x, m, x̂) #Kookmin_University #Natural_Language_Processing_lab. 7
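Steps (1)–(3) can be sketched in plain Python (a simplified illustration; real implementations operate on token ids and batched tensors, and `mask_rate` and `seed` are illustrative parameters):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Pick k = mask_rate * n positions at random and replace the tokens
    there with [MASK]; return the masked sequence and the positions m.
    Positions are sampled without replacement here, a common simplification."""
    rng = random.Random(seed)
    n = len(tokens)
    k = max(1, round(mask_rate * n))
    positions = sorted(rng.sample(range(n), k))
    masked = list(tokens)                 # x_masked = REPLACE(x, m, [MASK])
    for i in positions:
        masked[i] = mask_token
    return masked, positions
```

Steps (4)–(6) would then run a small generator MLM over `masked`, sample its softmax predictions at `positions`, and splice those samples back in to form the corrupted input for the discriminator.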
  • 9. Method > Discriminator (1) Replace the masked tokens with the generator's predicted tokens: x_corrupt = REPLACE(x, m, x̂) (2) Learn to distinguish original tokens from replaced tokens (the discriminator) (3) Output the predicted type of each input token with a sigmoid > Loss function - Minimize the combined loss: min_{θ_G, θ_D} Σ_{x ∈ X} [ L_MLM(x, θ_G) + λ · L_Disc(x, θ_D) ] (* L_MLM: loss of the generator, L_Disc: loss of the discriminator) #Kookmin_University #Natural_Language_Processing_lab. 8
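The combined objective can be sketched numerically (a minimal pure-Python illustration; λ = 50 is the value used in the paper, and the toy inputs below are made up):

```python
import math

def mlm_loss(p_true_at_masked):
    """L_MLM: mean negative log-likelihood the generator assigns to the
    original tokens at the masked positions."""
    return -sum(math.log(p) for p in p_true_at_masked) / len(p_true_at_masked)

def disc_loss(logits, labels):
    """L_Disc: mean binary cross-entropy over ALL positions;
    label 1 = replaced, 0 = original."""
    loss = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid output of the discriminator
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(logits)

def combined_loss(p_true_at_masked, logits, labels, lam=50.0):
    """Per-example combined objective: L_MLM + lambda * L_Disc."""
    return mlm_loss(p_true_at_masked) + lam * disc_loss(logits, labels)
```

Note that L_Disc is averaged over every input position while L_MLM only covers the masked 15%, which is exactly why the discriminator learns from all tokens.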
  • 10. Experiments and results #Kookmin_University #Natural_Language_Processing_lab. 9 1) Experimental setup 2) Model Extensions 3) Small Models 4) Large Models 5) Efficiency Analysis
  • 11. Experimental Setup > Evaluation - GLUE (General Language Understanding Evaluation): 9 tasks (average score) - CoLA: Is the sentence grammatical or ungrammatical? - SST: Is the movie review positive, negative or neutral? - MRPC: Is sentence B a paraphrase of sentence A? - STS: How similar are sentences A and B? - QQP: Are the two questions similar? - MNLI: Does sentence A entail or contradict sentence B? - QNLI: Does sentence B contain the answer to the question in sentence A? - RTE: Does sentence A entail sentence B? - WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun? - SQuAD (Stanford Question Answering Dataset) #Kookmin_University #Natural_Language_Processing_lab. 10 https://rajpurkar.github.io/SQuAD-explorer/ https://gluebenchmark.com/
  • 12. Model Extensions > Weight sharing - Sharing weights between the generator and discriminator - When model size(generator) == model size(discriminator) (*model size = the number of hidden units), all weights can be tied #. Comparing the weight-tying strategies (GLUE score): - no weight tying: 83.6 - tying token embeddings only: 84.3 (*advantage: the MLM task is effective at learning the token embeddings) - tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size) - When model size(generator) < model size(discriminator): share only the token and positional embedding weights (this is effective) #Kookmin_University #Natural_Language_Processing_lab. 11
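Tying only the token embeddings can be sketched by having both networks hold a reference to the same table (an illustrative sketch; the class and parameter names are hypothetical, and real models use learned float matrices updated by both objectives):

```python
class EmbeddingTable:
    """A toy embedding table standing in for a learned weight matrix."""
    def __init__(self, vocab_size, dim):
        self.weight = [[0.0] * dim for _ in range(vocab_size)]

class Encoder:
    """A toy Transformer-encoder stand-in holding (possibly shared) embeddings."""
    def __init__(self, token_emb, hidden_size):
        self.token_emb = token_emb      # may be shared with the other network
        self.hidden_size = hidden_size  # the generator can be smaller

# Tie token embeddings: both models reference ONE table, so updates from
# either the MLM loss or the discriminator loss affect the same weights.
shared = EmbeddingTable(vocab_size=30522, dim=128)
generator = Encoder(shared, hidden_size=64)       # smaller generator
discriminator = Encoder(shared, hidden_size=256)  # larger discriminator
assert generator.token_emb is discriminator.token_emb
```

Because only the embeddings are shared, the two encoders are free to have different hidden sizes, which is what makes the smaller-generator setup possible.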
  • 13. Model Extensions > Smaller generators - If the generator and discriminator are the same size, the compute cost of pre-training is high - The generator is made smaller by decreasing the layer sizes while keeping the other hyperparameters constant - GLUE scores are best when the generator is ¼–½ the size of the discriminator #Kookmin_University #Natural_Language_Processing_lab. 12 (Figures: results when the generator and discriminator are the same size vs. different sizes)
  • 14. Model Extensions > Training algorithms (attempted) 1. Train only the generator with the MLM loss for n steps 2. Initialize the weights of the discriminator with the weights of the generator, then train the discriminator with the discriminator loss for n steps, keeping the generator’s weights frozen - In addition, training the generator adversarially as in a GAN was explored (58%) - Problem 1: inefficiency of reinforcement learning when working in the large action space of generating text - Problem 2: low entropy of the generator’s output distribution under adversarial learning #Kookmin_University #Natural_Language_Processing_lab. 13
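The two-stage schedule above can be sketched as a training driver (a sketch only; `train_generator_step`, `copy_generator_weights_to_discriminator`, and `train_discriminator_step` are hypothetical callbacks standing in for real optimizer steps):

```python
def two_stage_training(train_generator_step,
                       copy_generator_weights_to_discriminator,
                       train_discriminator_step,
                       n_steps):
    """Two-stage alternative to ELECTRA's joint training."""
    # Stage 1: train only the generator with the MLM loss for n steps.
    for _ in range(n_steps):
        train_generator_step()
    # Stage 2: initialize the discriminator from the generator's weights,
    # then train only the discriminator for n steps (generator frozen).
    copy_generator_weights_to_discriminator()
    for _ in range(n_steps):
        train_discriminator_step()
```

ELECTRA's default, by contrast, minimizes the combined loss jointly, updating both networks at every step.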
  • 15. Small Models - GLUE #Kookmin_University #Natural_Language_Processing_lab. 14
  • 16. Large Models - GLUE #Kookmin_University #Natural_Language_Processing_lab. 15 = ¼ of RoBERTa (400K steps) = RoBERTa (1,750K steps) Dev set Test set
  • 17. Small Models & Large Models - SQuAD #Kookmin_University #Natural_Language_Processing_lab. 16
  • 18. Efficiency Analysis > Ablations validating ELECTRA's design (GLUE scores) 1. ELECTRA 15%: the discriminator loss is computed only over the 15% of tokens that were masked - tests the effect of computing the loss over all tokens > ELECTRA (85.0) > ELECTRA 15% (82.4) 2. Replace MLM: [MASK] tokens are replaced with sampled tokens from the generator - tests the effect of replacing [MASK] tokens with generator samples > Replace MLM (82.4) > BERT (82.2) 3. All-Tokens MLM: the model predicts every token, not only the masked ones - tests the effect of a sigmoid copy mechanism deciding whether to copy the input token > All-Tokens MLM (84.3) > Replace MLM (82.4) #Kookmin_University #Natural_Language_Processing_lab. 17
  • 19. Summary > Proposal: replaced token detection (a new self-supervised task for language representation learning) > Key idea: training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network > Performance: ELECTRA is more compute-efficient and achieves better downstream performance than masked language models #Kookmin_University #Natural_Language_Processing_lab. 18