BERT: Pre-training of
Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
-
Google AI Language
-
Slides by Park JeeHyun
28 FEB 19
Contents
1. Motivation
2. Language Representations
3. Basic Idea
4. Model Architecture
5. How to use BERT
6. Results
7. Findings
1. Motivation
• Goal: Build a general, pre-trained language representation
model.
• Why: This model can be adapted to various NLP tasks easily, so we
do not have to re-train a model from scratch every time.
• How: ?
2. Language Representations
1) Word Representations (Word embeddings)
• word2vec, GloVe
2) Contextual Representations
• Semi-Supervised Sequence Learning
• ELMo: Deep Contextual Word Embedding
• Generative Pre-Training
3) Problem with Previous Methods
2.1) Word Representation
Ref. [2]
2.1) Word Representation
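To make the limitation of static embeddings concrete before moving to contextual models, here is a purely illustrative sketch (toy vocabulary and made-up vector values, not real word2vec/GloVe output): a static table assigns each word exactly one vector, regardless of the sentence it appears in.

```python
import numpy as np

# Toy static word-embedding table (hypothetical 4-dimensional vectors).
# word2vec / GloVe assign ONE vector per word type, independent of context.
embeddings = {
    "bank":  np.array([0.2, -0.1, 0.7, 0.4]),
    "river": np.array([0.1,  0.9, 0.3, 0.0]),
    "money": np.array([0.8, -0.3, 0.1, 0.5]),
}

def embed(sentence):
    """Look up each token's fixed vector; unknown words get a zero vector."""
    return [embeddings.get(tok, np.zeros(4)) for tok in sentence.split()]

# "bank" receives the same vector in both sentences, even though the intended
# sense differs -- the limitation that contextual representations address.
v1 = embed("river bank")[1]
v2 = embed("money bank")[1]
assert np.allclose(v1, v2)
```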
2.2) Contextual Representations
2.2) Contextual Representations
• ELMo (Embeddings from Language Models)
2.2) Contextual Representations
• ELMo
• Deep Contextualized Word Representations
↘ neural network
↘ y = f(word, context)
↘ words as fundamental semantic unit
↘ embedding
Ref. [3]
2.2) Contextual Representations
• ELMo
Ref. [4]
2.2) Contextual Representations
• ELMo
2.2) Contextual Representations
• ELMo
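The presenter notes mention softmax-normalized weights s and a scalar parameter for combining the biLM layers; the sketch below shows that task-specific weighted combination in minimal form. Shapes, names, and values are illustrative, and this is not the actual AllenNLP implementation.

```python
import torch
import torch.nn.functional as F

def elmo_combine(layer_states, s_logits, gamma):
    """Collapse biLM layer outputs into one contextual embedding per token.

    layer_states: tensor of shape (num_layers, seq_len, dim) from the biLM
    s_logits:     learnable logits, softmax-normalized into layer weights s_j
    gamma:        learnable scalar that rescales the whole ELMo vector
    """
    s = F.softmax(s_logits, dim=0)                 # (num_layers,)
    weighted = s.view(-1, 1, 1) * layer_states     # weight each layer
    return gamma * weighted.sum(dim=0)             # (seq_len, dim)

# Toy usage: 3 biLM layers, 5 tokens, 8-dimensional states (all hypothetical).
states = torch.randn(3, 5, 8)
s_logits = torch.zeros(3, requires_grad=True)
gamma = torch.ones(1, requires_grad=True)
elmo_vectors = elmo_combine(states, s_logits, gamma)   # (5, 8)
```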
2.2) Contextual Representations
• GPT (Generative Pre-Training)
Ref. [2]
2.2) Contextual Representations
• GPT
• Unsupervised pre-training
• Supervised fine-tuning
• Task-specific input transformations
2.2) Contextual Representations
• GPT
Ref. [5]
2.3) Problem with Previous Methods
Ref. [2]
2.3) Problem with Previous Methods
Ref. [6]
2.3) Problem with Previous Methods
Ref. [2]
3. Basic Idea
1) Masked Language Model
2) Next Sentence Prediction
3) Input Representation
3.1) Masked Language Model
Ref. [2]
3.1) Masked Language Model
• Two downsides to MLM approach
i. MLM creates a mismatch between pre-training and fine-tuning,
since the [MASK] token is never seen during fine-tuning.
ii. MLM predicts only 15% of tokens in each batch, which suggests
that more pre-training steps may be required for the model to
converge.
3.1) Masked Language Model
 15% & 10% = 1.5%
: It does not seem t
o harm the model’s
language understan
d-ing capability.
 to bias the representation towards the actual observed word.
Ref. [2]
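A minimal sketch of the 80% / 10% / 10% masking recipe described above. It operates on word strings for readability; the real implementation works on WordPiece token ids, and the helper names are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT's masking recipe: pick ~15% of positions as prediction
    targets; of those, 80% become [MASK], 10% a random word, 10% unchanged."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must predict this
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: random word (1.5% overall)
            # else 10%: keep the observed word, biasing the representation
            # towards the actual token
    return inputs, targets

vocab = ["the", "man", "went", "to", "store", "dog", "cat"]
print(mask_tokens("my dog is hairy".split(), vocab))
```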
3.1) Masked Language Model
3.1) Masked Language Model
Ref. [7]
3.2) Next Sentence Prediction
Ref. [2]
3.2) Next Sentence Prediction
Ref. [7]
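A rough sketch of how next-sentence-prediction training pairs can be constructed: half the time segment B is the true next sentence (IsNext), otherwise a random sentence from the corpus (NotNext). The helper and data names are hypothetical and tokenization is simplified.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Sketch of next-sentence-prediction data creation: 50% of the time the
    second segment is the true next sentence (label IsNext), otherwise a
    random sentence from the corpus (label NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"
    # BERT packs the pair as: [CLS] sent_a [SEP] sent_b [SEP]
    return ["[CLS]", *sent_a, "[SEP]", *sent_b, "[SEP]"], label

doc = [["the", "man", "went", "to", "the", "store"],
       ["he", "bought", "a", "gallon", "of", "milk"]]
corpus = [["penguins", "are", "flightless", "birds"]]
print(make_nsp_example(doc, corpus))
```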
3.3) Input Representation
Ref. [2]
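Per the presenter note ("Embedding → element-wise adding"), BERT's input representation is the element-wise sum of token, segment, and position embeddings. Below is a minimal PyTorch sketch, assuming BERT-Base sizes (hidden = 768, max length = 512); the class name is illustrative.

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Minimal sketch of BERT's input representation: the element-wise sum of
    token, segment (sentence A/B), and learned position embeddings."""
    def __init__(self, vocab_size, hidden=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(2, hidden)        # segment A = 0, segment B = 1
        self.pos = nn.Embedding(max_len, hidden)  # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids)
                + self.seg(segment_ids)
                + self.pos(positions))            # broadcast over the batch

emb = BertInputEmbedding(vocab_size=30522)
x = emb(torch.tensor([[101, 2023, 102]]), torch.tensor([[0, 0, 0]]))  # (1, 3, 768)
```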
4. Model Architecture
1) Transformer
2) GELUs
4.1) Transformer
Ref. [2]
4.1) Transformer
4.1) Transformer
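The Transformer slides (Ref. [2]) are image-based; the core operation they depict is scaled dot-product self-attention, in which every position attends to every other position, so each token's representation is conditioned on the full left and right context. Here is a minimal single-head sketch without multi-head projections or masking; sizes are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: every position attends to every other
    position, so each token's output mixes information from the whole
    sequence (in BERT, from both left and right context)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) similarities
    weights = F.softmax(scores, dim=-1)                 # attention distribution
    return weights @ V                                  # context-mixed values

# Toy usage: 5 tokens, 64-dimensional head (sizes are illustrative only).
x = torch.randn(5, 64)
out = scaled_dot_product_attention(x, x, x)             # (5, 64)
```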
4.2) GELUs
• Gaussian Error Linear Units
• An activation function that combines properties from
dropout, zoneout, and ReLUs.
• ReLU
• deterministically multiplies the input by zero or one.
• dropout
• stochastically multiplies the input by zero.
• zoneout
• stochastically multiplies the input by one.
• To build a new activation function called GELU,
the authors merge these functionalities by multiplying the input by
zero or one, but the values of this zero-one mask are
stochastically determined while also dependent upon the input.
Ref. [8]
4.2) GELUs
• GELU’s zero-one mask
• multiply the neuron input x by m ~ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x),
X ~ N(0, 1), is the cumulative distribution function of the standard normal
distribution.
Ref. [8]
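Taking the expectation of this stochastic zero-one mask yields the deterministic activation GELU(x) = x · Φ(x). Below is a small sketch of the exact form and of the tanh approximation given in Ref. [8]; the approximation is the variant used in the original BERT code.

```python
import math
import torch

def gelu_exact(x):
    """GELU(x) = x * Phi(x): the expected value of applying the stochastic
    zero-one mask m ~ Bernoulli(Phi(x)) to the input x."""
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    """Tanh approximation from the GELU paper (Ref. [8])."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * x ** 3)))

x = torch.linspace(-3, 3, 7)
print(gelu_exact(x))
print(gelu_tanh_approx(x))
```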
5. How to use BERT
1) Fine Tuning
2) Task-Specific Models
5.1) Fine Tuning
• Requires only ONE additional output layer
Ref. [2]
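A minimal sketch of what the single additional output layer looks like for sequence classification: one linear layer on top of the final [CLS] representation. The encoder here is a stand-in placeholder, and the names and sizes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BertForSequenceClassification(nn.Module):
    """Sketch of fine-tuning: a pre-trained encoder plus ONE new output layer
    over the final [CLS] representation. `bert_encoder` is a placeholder for
    any module returning (batch, seq_len, hidden) states."""
    def __init__(self, bert_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder
        self.classifier = nn.Linear(hidden, num_labels)  # the only new weights

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # (B, T, H)
        cls = hidden_states[:, 0]          # representation of the [CLS] token
        return self.classifier(cls)        # task logits

# Toy usage with a stand-in encoder (random states, for shapes only).
# During fine-tuning, encoder and classifier are updated end-to-end.
dummy = lambda tok, seg: torch.randn(tok.size(0), tok.size(1), 768)
model = BertForSequenceClassification(dummy)
logits = model(torch.zeros(1, 8, dtype=torch.long),
               torch.zeros(1, 8, dtype=torch.long))       # (1, 2)
```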
5.2) Task-Specific Models
6. Results
6. Results
Ref. [9]
7. Findings
1) Is masked language modeling really more effective than sequential language modeling?
2) Is the next sentence prediction task necessary?
3) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
4) Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to
achieve high fine-tuning accuracy?
5) Does masked language modeling converge more slowly than left-to-right language modeling pretraining
(since masked language modeling only predicts 15% of the input tokens whereas left-to-right language
modeling predicts all of the tokens)?
6) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature extractor?
Ref. [10]
7.1)
Q) Is masked language modeling really more effective than sequential language modeling?
Ans) yes.
The authors tried training the Transformer on a left-to-right (LTR) language modeling task
instead of the masked language modeling task. The results for this setup can be seen in
the third row of the table below (“LTR & No NSP”).
7.2)
Q) Is the next sentence prediction task necessary?
Ans) yes.
For natural language inference and question answering (the MNLI-m, QNLI, and SQuAD
datasets), next sentence prediction seems to help a lot. For paraphrase detection (MRPC),
the performance change is much smaller, and for sentiment analysis (SST-2) the results
are virtually the same.
7.3)
Q) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
Ans) yes.
7.4)
Q) Does BERT really need such a large amount of pre-training (128,000 words/batch *
1,000,000 steps) to achieve high fine-tuning accuracy?
Ans) yes.
BERT-Base gains almost 1.0% additional accuracy on MNLI when trained for 1M steps
compared to 500k steps.
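For a sense of scale, that schedule amounts to on the order of 10^11 words processed during pre-training, roughly 40 passes over the ~3.3B-word BooksCorpus plus English Wikipedia corpus:

$$
128{,}000\ \tfrac{\text{words}}{\text{batch}} \times 1{,}000{,}000\ \text{steps} \approx 1.28 \times 10^{11}\ \text{words}
$$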
7.5)
Q) Does masked language modeling converge more slowly than left-to-right language modeling
pretraining (since masked language modeling only predicts 15% of the input tokens whereas
left-to-right language modeling predicts all of the tokens)?
Ans) yes & no.
For MNLI task, Left-to-right language modeling does converge faster, but masked
language modeling achieves a much higher accuracy with the same number of steps.
7.6)
Q) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature
extractor?
Ans) Not necessarily.
The authors tested how a BiLSTM model that uses fixed embeddings extracted from BERT
performs on the CoNLL-NER dataset (results in the table on the slide). It turns out
that concatenating the hidden activations from the last four layers provides very
strong performance, only 0.3 F1 behind fine-tuning the entire model. For those on a
strict computational budget, this feature extraction approach is a good option.
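A minimal sketch of this feature-based setup: run a frozen BERT once, concatenate the hidden activations of its last four layers, and feed the result to a downstream tagger (the BiLSTM itself is omitted; shapes and names are illustrative).

```python
import torch

def last_four_concat(all_layer_states):
    """Feature-extraction sketch: freeze BERT, run it once, and concatenate
    the hidden activations of its last four layers to use as fixed,
    contextual token features (e.g. as input to a BiLSTM tagger).

    all_layer_states: list of tensors, one per layer, each (seq_len, hidden).
    """
    with torch.no_grad():                          # no gradients flow into BERT
        feats = torch.cat(all_layer_states[-4:], dim=-1)   # (seq_len, 4*hidden)
    return feats

# Toy usage: 12 layers of 768-dim states for a 6-token sentence (shapes only).
states = [torch.randn(6, 768) for _ in range(12)]
features = last_four_concat(states)                # (6, 3072)
```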
References
[1] Pretrained Deep Bidirectional Transformers for Language Understanding (algorithm) | TDLS
(https://youtu.be/BhlOGGzC0Q0)
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
(https://nlp.stanford.edu/seminar/details/jdevlin.pdf)
[3] Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids
(http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)
[4] Word Embedding—ELMo
(https://medium.com/@online.rajib/word-embedding-elmo-7369c8f29bfc)
[5] Improving Language Understanding by Generative Pre-Training
(https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
[6] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
(https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
[7] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
(http://jalammar.github.io/illustrated-bert/)
[8] Gaussian Error Linear Units (GELUs)
(https://arxiv.org/abs/1606.08415)
[9] GLUE Benchmark
(https://gluebenchmark.com)
[10] Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
(http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)
Editor's Notes
  1. Feature-based approaches
  2. s = softmax-normalized weights, r = scalar parameter
  3. Fine-tuning approaches
  4. ELMo & GPT are unidirectional???
  5. Incrementally??? Deep bidirectionality vs. ELMo-style shallow bidirectionality
  6. Incrementally???
  7. Random word → The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability. Keep same → The purpose of this is to bias the representation towards the actual observed word.
  8. Embedding → element-wise adding
  9. Transformer learns features throughout all other words in the sequences.
  10. Linear decay = why?
  11. GLUE = The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. https://gluebenchmark.com/leaderboard
  12. Yes & no???
  13. CoNLL-NER (Named Entity Recognition): entities are annotated with LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous).