ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Natural Language Processing Lab
M2020064
Danbi Cho
Published in: The 8th International Conference on Learning Representations (ICLR 2020)
URL: https://arxiv.org/abs/2003.10555
Contents
1. Idea
2. Introduction
3. Method
4. Experiments and results
5. Summary
Idea
[BERT]
> Replaces some tokens with [MASK]
(masked language modeling)
[ELECTRA]
> Replaces some tokens with plausible alternatives
sampled from a small generator network
Problem)
Masked language modeling requires large amounts of compute
Proposal)
A more sample-efficient pre-training task
: replaced token detection
[BERT]
> Trains a model that predicts
the original identities of the corrupted tokens
[ELECTRA]
> Trains a discriminative model that predicts
whether each token in the corrupted input
was replaced by a generator sample or not
Introduction
https://github.com/google-research/electra
“ELECTRA”
Efficiently Learning an Encoder
that Classifies Token Replacements Accurately.
Introduction
> SOTA representation learning methods can be viewed as learning denoising autoencoders (DAE)
> Proposal method: replaced token detection
> Goal: improve the efficiency of pre-training
[Masked language modeling (BERT, XLNet)]
- Input tokens → masking (or attention) → restore the original input tokens
- The network only learns from 15% of the tokens per example, so substantial compute cost is incurred
[Replaced token detection (ELECTRA)]
- Input tokens → replacement using generator samples → predict whether each token is the original or a replacement
- Samples are generated by a small masked language model
- The model learns from all input tokens as a discriminator
Introduction
> ELECTRA
- ELECTRA-Small
- pre-trained on the same dataset as BERT
- compared with BERT, GPT
- ELECTRA-Large
- pre-trained on the same dataset as XLNet
- compared with RoBERTa, XLNet
Method
> ELECTRA
: generator 𝐺 + discriminator 𝐷
Each network is a Transformer encoder
that maps a sequence of input tokens into a sequence of contextualized vector representations
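To make the two-network setup concrete, here is a minimal, hypothetical sketch in PyTorch (not the official TensorFlow implementation at the GitHub link above): both networks are Transformer encoders, with a small vocabulary-prediction head on the generator and a per-token binary head on the discriminator. The layer sizes and vocabulary size are illustrative assumptions.

```python
# Hedged sketch (assumed PyTorch, not the official ELECTRA code):
# two Transformer encoders with their respective output heads.
import torch.nn as nn

VOCAB_SIZE = 30522  # assumed BERT-style WordPiece vocabulary size

def make_encoder(hidden_size: int, num_layers: int, num_heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=hidden_size, nhead=num_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Small generator and full-size discriminator (sizes are illustrative)
generator_encoder = make_encoder(hidden_size=256, num_layers=12, num_heads=4)
discriminator_encoder = make_encoder(hidden_size=768, num_layers=12, num_heads=12)

generator_head = nn.Linear(256, VOCAB_SIZE)  # softmax over the vocabulary
discriminator_head = nn.Linear(768, 1)       # sigmoid: original vs. replaced
```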
Method
> Generator
(1) Input token sequence x = [x_1, x_2, …, x_n]
(2) Select a random set of positions (between 1 and n) to mask: m_i ~ unif{1, n} for i = 1 to k (k = 0.15n)
(3) Replace the tokens at the selected positions with [MASK]: x^masked = REPLACE(x, m, [MASK])
(4) A small MLM (the generator) learns to predict the original identities of the masked tokens
(5) Sample predicted tokens from the generator's softmax output: x̂_i ~ p_G(x_i | x^masked) for i ∈ m
(6) Replace the masked tokens with the generator's samples: x^corrupt = REPLACE(x, m, x̂)
(A minimal code sketch of these steps follows below.)
[Figure: generator pipeline, annotated with steps (1)–(6) above]
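The following is a hedged sketch of steps (1)–(6), assuming PyTorch tensors of token ids and a generator callable that returns per-position vocabulary logits. MASK_ID and the 15% masking rate follow the BERT-style setup; everything else is an illustrative assumption rather than the paper's implementation.

```python
# Hedged sketch of the generator-side corruption (steps 1-6), not the official code.
import torch

MASK_ID = 103  # assumed [MASK] id in a BERT-style vocabulary

def corrupt_with_generator(x: torch.Tensor, generator, mask_prob: float = 0.15):
    """x: LongTensor [batch, n] of token ids -> (x_masked, x_corrupt, m)."""
    # (2) select a random ~15% of positions to mask
    m = torch.rand(x.shape, device=x.device) < mask_prob
    # (3) x_masked = REPLACE(x, m, [MASK])
    x_masked = x.masked_fill(m, MASK_ID)
    # (4)-(5) the generator (a small MLM) outputs a distribution over the
    # vocabulary; sample a plausible replacement for every position
    logits = generator(x_masked)                 # [batch, n, vocab]
    sampled = torch.distributions.Categorical(logits=logits).sample()
    # (6) x_corrupt = REPLACE(x, m, x_hat): keep the original tokens elsewhere
    x_corrupt = torch.where(m, sampled, x)
    return x_masked, x_corrupt, m
```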
Method
> Discriminator
(1) Take the corrupted sequence produced by the generator: x^corrupt = REPLACE(x, m, x̂)
(2) Learn to distinguish original tokens from replaced tokens (the discriminator)
(3) For each input token, output through a sigmoid whether it is original or replaced
> Loss function
- Minimize the combined loss over the corpus X:
min_{θ_G, θ_D} Σ_{x ∈ X} [ L_MLM(x, θ_G) + λ · L_Disc(x, θ_D) ]
(* L_MLM: loss of the generator, L_Disc: loss of the discriminator)
(A code sketch of this combined objective follows below.)
[Figure: discriminator pipeline, annotated with steps (1)–(3) above]
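Putting the two parts together, a hedged sketch of the combined objective is shown below. It reuses the corrupt_with_generator helper from the previous sketch; λ = 50 follows the value reported in the paper, while the model interfaces are assumptions.

```python
# Hedged sketch of the combined ELECTRA loss: L_MLM + lambda * L_Disc.
import torch
import torch.nn.functional as F

LAMBDA = 50.0  # discriminator loss weight reported in the paper

def electra_loss(x, generator, discriminator):
    # Corrupt the input with the small generator (see the previous sketch)
    x_masked, x_corrupt, m = corrupt_with_generator(x, generator)
    # L_MLM: cross-entropy on the masked positions only
    gen_logits = generator(x_masked)                 # [batch, n, vocab]
    l_mlm = F.cross_entropy(gen_logits[m], x[m])
    # L_Disc: per-token binary "replaced vs. original" classification; positions
    # where the generator happened to sample the original token count as original
    labels = (x_corrupt != x).float()                # 1 = replaced
    disc_logits = discriminator(x_corrupt)           # [batch, n]
    l_disc = F.binary_cross_entropy_with_logits(disc_logits, labels)
    return l_mlm + LAMBDA * l_disc
```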
Experiments and results
1) Experimental setup
2) Model Extensions
3) Small Models
4) Large Models
5) Efficiency Analysis
Experimental Setup
> Evaluation
- GLUE (General Language Understanding Evaluation): 9 tasks (average score)
- CoLA: Is the sentence grammatical or ungrammatical?
- SST: Is the movie review positive, negative or neutral?
- MRPC: Is the sentence B a paraphrase of sentence A?
- STS: How similar are sentences A and B?
- QQP: Are the two questions similar?
- MNLI: Does sentence A entail or contradict sentence B?
- QNLI: Does sentence B contain the answer to the question in sentence A?
- RTE: Does sentence A entail sentence B?
- WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun?
- SQuAD (Stanford Question Answering Dataset): extractive question answering
https://rajpurkar.github.io/SQuAD-explorer/
https://gluebenchmark.com/
Model Extensions
> Weight sharing
- Share weights between the generator and discriminator
- When model size(generator) == model size(discriminator), all weights can be tied (*model size = the number of hidden units)
#. Comparison of weight-tying strategies (GLUE score)
- no weight tying: 83.6
- tying token embeddings: 84.3 (*advantage: the MLM task is effective at learning token embeddings)
- tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size)
- When model size(generator) < model size(discriminator)
: only the token and positional embedding weights are shared (this is the effective setting)
(A sketch of embedding tying follows below.)
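A minimal sketch of the effective setting above (a smaller generator that ties only the token and positional embeddings with the discriminator) might look as follows; module names and sizes are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of tying token and positional embeddings between a small
# generator and a larger discriminator.
import torch.nn as nn

VOCAB_SIZE, MAX_LEN = 30522, 512
DISC_HIDDEN, GEN_HIDDEN = 768, 256   # generator is narrower than the discriminator

# Shared embedding tables, sized for the discriminator
token_embeddings = nn.Embedding(VOCAB_SIZE, DISC_HIDDEN)
position_embeddings = nn.Embedding(MAX_LEN, DISC_HIDDEN)

# The discriminator consumes the shared embeddings directly; the smaller
# generator projects them down to its own hidden size
generator_input_projection = nn.Linear(DISC_HIDDEN, GEN_HIDDEN)
```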
Model Extensions
> Smaller generators
- If the generator and discriminator are the same size, pre-training costs roughly twice the compute of MLM alone
- The generator is made smaller by decreasing the layer sizes while keeping the other hyperparameters constant
- Generators 1/4 to 1/2 the size of the discriminator give the best GLUE scores
[Figures: results when the generator and discriminator are the same size, and when their sizes differ]
Model Extensions
> Training algorithms (alternatives explored)
1. Train only the generator with the MLM loss for 𝑛 steps
2. Initialize the weights of the discriminator with the weights of the generator,
then train the discriminator with the discriminator loss for 𝑛 steps, keeping the generator's weights frozen
- In addition, the authors explore training the generator adversarially, as in a GAN (the adversarial generator reaches only 58% accuracy at masked language modeling)
- Problem 1: the poor efficiency of reinforcement learning in the large action space of generating text
- Problem 2: the adversarially trained generator produces a low-entropy output distribution
(A sketch of the two-stage procedure follows below.)
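A hedged sketch of the two-stage procedure described above, assuming the corrupt_with_generator helper from the earlier sketch, same-size networks so the weights can be copied, and illustrative optimizer settings:

```python
# Hedged sketch of two-stage training: generator-only MLM for n steps, then
# discriminator-only replaced token detection with the generator frozen.
import torch
import torch.nn.functional as F

def two_stage_training(generator, discriminator, batches, n_steps):
    # Stage 1: train only the generator with the MLM loss
    gen_opt = torch.optim.Adam(generator.parameters(), lr=5e-4)
    for _, x in zip(range(n_steps), batches):
        x_masked, _, m = corrupt_with_generator(x, generator)
        loss = F.cross_entropy(generator(x_masked)[m], x[m])
        gen_opt.zero_grad()
        loss.backward()
        gen_opt.step()

    # Stage 2: initialize the discriminator from the generator, freeze the
    # generator, and train only the discriminator
    discriminator.load_state_dict(generator.state_dict(), strict=False)
    for p in generator.parameters():
        p.requires_grad = False
    disc_opt = torch.optim.Adam(discriminator.parameters(), lr=5e-4)
    for _, x in zip(range(n_steps), batches):
        _, x_corrupt, _ = corrupt_with_generator(x, generator)
        labels = (x_corrupt != x).float()
        loss = F.binary_cross_entropy_with_logits(discriminator(x_corrupt), labels)
        disc_opt.zero_grad()
        loss.backward()
        disc_opt.step()
```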
Small Models - GLUE
Large Models - GLUE
- ELECTRA-Large trained for 400K steps ≈ 1/4 of RoBERTa's pre-training compute; trained for 1,750K steps ≈ RoBERTa's compute
[Tables: GLUE scores on the dev set and on the test set]
Small Models & Large Models - SQuAD
Efficiency Analysis
> Ablations validating ELECTRA's design
1. ELECTRA 15%: the discriminator loss is computed only over the 15% of tokens that were masked
- tests the effect of computing the loss over all tokens
> ELECTRA (85.0) > ELECTRA 15% (82.4) on GLUE
2. Replace MLM: masked language modeling in which [MASK] tokens are replaced by generator samples
- tests the effect of replacing [MASK] tokens with sampled tokens from the generator
> Replace MLM (82.4) > BERT (82.2)
3. All-Tokens MLM: the model predicts the identity of all input tokens, not only the masked ones
- uses a sigmoid copy mechanism to decide whether to copy the input token
> All-Tokens MLM (84.3) > Replace MLM (82.4)
(A sketch of the ELECTRA 15% variant follows below.)
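To make ablation 1 concrete, here is a hedged sketch contrasting the full discriminator loss (over all tokens) with the ELECTRA 15% variant that restricts it to the masked positions; it assumes the tensors produced by the corrupt_with_generator sketch above.

```python
# Hedged sketch: full ELECTRA discriminator loss vs. the "ELECTRA 15%" ablation.
import torch.nn.functional as F

def disc_loss_all_tokens(disc_logits, x, x_corrupt):
    """Full ELECTRA: every input position contributes to the loss."""
    labels = (x_corrupt != x).float()
    return F.binary_cross_entropy_with_logits(disc_logits, labels)

def disc_loss_masked_only(disc_logits, x, x_corrupt, m):
    """ELECTRA 15%: only the ~15% masked positions contribute to the loss."""
    labels = (x_corrupt != x).float()
    return F.binary_cross_entropy_with_logits(disc_logits[m], labels[m])
```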
Summary
> Proposal:
replaced token detection, a new self-supervised task for language representation learning
> Key idea:
Training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator
> Performance:
ELECTRA is more compute-efficient than masked language modeling approaches and achieves better downstream performance
Thank You.