BERT: Pre-training of
Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
-
Google AI Language
-
Slides by Park JeeHyun
28 FEB 19
Contents
1. Motivation
2. Language Representations
3. Basic Idea
4. Model Architecture
5. How to use BERT
6. Results
7. Findings
1. Motivation
• Goal: Build a general, pre-trained language representation
model.
• Why: This model can be adapted to various NLP tasks easily,
we do not have to re-train a model from scratch every time.
• How: ?
2. Language Representations
1) Word Representations (Word embeddings)
• word2vec, GloVe
2) Contextual Representations
• Semi-Supervised Sequence Learning
• ELMo: Deep Contextualized Word Representations
• Generative Pre-Training
3) Problem with Previous Methods
2.1) Word Representation
Ref. [2]
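The figures referenced here (Ref. [2]) illustrate static word embeddings. A minimal sketch of the limitation they motivate, assuming a toy lookup table (the vectors are made-up values, not real word2vec/GloVe weights): a static embedding returns the same vector for a word in every context.

```python
# Toy lookup-table embedding (word2vec/GloVe style): one fixed vector per word.
# The 3-d vectors are made-up toy values, not trained weights.
embedding_table = {
    "bank":  [0.2, -0.7, 0.5],
    "river": [0.1,  0.4, -0.3],
    "money": [0.6, -0.1,  0.2],
}

def embed(tokens):
    return [embedding_table[t] for t in tokens]

s1 = embed(["river", "bank"])  # "bank" as in the side of a river
s2 = embed(["money", "bank"])  # "bank" as in a financial institution
assert s1[1] == s2[1]          # identical vector in both contexts: no disambiguation
```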
2.2) Contextual Representations
• ELMo (Embeddings from Language Models)
2.2) Contextual Representations
• ELMo
• Deep Contextualized Word Representations
↘ Deep: a neural network
↘ Contextualized: y = f(word, context)
↘ Word: words as the fundamental semantic unit
↘ Representations: embeddings
Ref. [3]
2.2) Contextual Representations
• ELMo
Ref. [4]
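The ELMo figures referenced above (Ref. [3], [4]) show a task-specific weighted combination of the biLM layer outputs; the speaker notes mention the softmax-normalized weights s and a scalar parameter. A minimal sketch of that combination, assuming the per-layer hidden states for one token are already computed (function name and shapes are illustrative):

```python
import numpy as np

def elmo_embedding(layer_states, layer_logits, gamma):
    """Weighted sum of biLM layer outputs for a single token.

    layer_states: (L+1, dim) array -- token layer plus each biLSTM layer
    layer_logits: (L+1,)     array -- learned per-task weights, softmax-normalized
    gamma:        scalar          -- task-specific scale
    """
    s = np.exp(layer_logits) / np.exp(layer_logits).sum()   # softmax weights
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# Toy example: 3 layers of 4-dimensional hidden states for one token.
states = np.random.randn(3, 4)
print(elmo_embedding(states, layer_logits=np.zeros(3), gamma=1.0))
```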
2.2) Contextual Representations
• GPT (Generative Pre-Training)
Ref. [2]
2.2) Contextual Representations
• GPT
• Unsupervised pre-training
• Supervised fine-tuning
• Task-specific input transformations
2.2) Contextual Representations
• GPT
Ref. [5]
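GPT's unsupervised pre-training stage maximizes a standard left-to-right language-modeling likelihood, the sum over positions of log P(u_i | u_1, ..., u_{i-1}) (Ref. [5]). A minimal sketch of that objective written as a loss; the `model` argument is a stand-in that returns next-token probabilities, not the actual Transformer decoder:

```python
import math

def lm_loss(model, tokens):
    """Negative log-likelihood of a left-to-right language model.

    model(prefix) is assumed to return a dict mapping next token -> probability.
    """
    nll = 0.0
    for i in range(1, len(tokens)):
        probs = model(tokens[:i])           # condition only on the left context
        nll -= math.log(probs[tokens[i]])   # log P(u_i | u_1 ... u_{i-1})
    return nll

# Stand-in "model": a uniform distribution over a tiny vocabulary.
vocab = ["the", "cat", "sat"]
uniform = lambda prefix: {w: 1.0 / len(vocab) for w in vocab}
print(lm_loss(uniform, ["the", "cat", "sat"]))  # = 2 * log(3)
```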
2.3) Problem with Previous Methods
• Unidirectionality: GPT reads text strictly left-to-right, and ELMo only concatenates independently trained left-to-right and right-to-left models (shallow bidirectionality), so no single layer conditions on both left and right context at once (see the attention-mask sketch below).
Ref. [2], [6]
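A minimal, purely illustrative sketch of the difference: a left-to-right model applies a causal attention mask (each position sees only earlier positions), whereas BERT's encoder lets every position attend to the whole sequence.

```python
import numpy as np

def causal_mask(n):
    """Left-to-right (GPT-style): position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def full_mask(n):
    """Deeply bidirectional (BERT-style): every position attends to every position."""
    return np.ones((n, n), dtype=bool)

print(causal_mask(4).astype(int))   # lower-triangular visibility pattern
print(full_mask(4).astype(int))     # all-ones visibility pattern
```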
3. Basic Idea
1) Masked Language Model
2) Next Sentence Prediction
3) Input Representation
3.1) Masked Language Model
Ref. [2]
3.1) Masked Language Model
• Two downsides to the MLM approach
i. MLM creates a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning.
ii. MLM predicts only 15% of tokens in each batch, which suggests that more pre-training steps may be required for the model to converge.
3.1) Masked Language Model
 15% & 10% = 1.5%
: It does not seem t
o harm the model’s
language understan
d-ing capability.
 to bias the representation towards the actual observed word.
Ref. [2]
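A minimal sketch of the masking rule summarized above: select 15% of the input tokens, then replace 80% of the selected tokens with [MASK], 10% with a random word, and leave 10% unchanged. The vocabulary and sentence are toy placeholders:

```python
import random

VOCAB = ["the", "man", "went", "to", "a", "store", "dog", "cute"]  # toy vocabulary

def mask_tokens(tokens, mask_rate=0.15):
    """Return (corrupted tokens, positions to predict) following the 80/10/10 rule."""
    corrupted, targets = list(tokens), {}
    n_pred = max(1, round(mask_rate * len(tokens)))
    for i in random.sample(range(len(tokens)), n_pred):
        targets[i] = tokens[i]                   # model must predict the original word
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: replace with a random word
        # else: 10%: keep the observed word unchanged
    return corrupted, targets

print(mask_tokens(["the", "man", "went", "to", "the", "store"]))
```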
3.1) Masked Language Model
Ref. [7]
3.2) Next Sentence Prediction
Ref. [2]
3.2) Next Sentence Prediction
Ref. [7]
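A minimal sketch of how next-sentence-prediction examples are built: 50% of the time sentence B is the actual next sentence (IsNext), and 50% of the time it is a random sentence from the corpus (NotNext). The documents below are toy placeholders, and the random pick may occasionally come from the same document; the sketch ignores that corner case.

```python
import random

def make_nsp_pair(document, corpus):
    """Return (sentence A, sentence B, label) for next sentence prediction."""
    i = random.randrange(len(document) - 1)
    a = document[i]
    if random.random() < 0.5:
        return a, document[i + 1], "IsNext"                     # the real next sentence
    return a, random.choice(random.choice(corpus)), "NotNext"   # a random corpus sentence

doc = ["the man went to the store", "he bought a gallon of milk"]
corpus = [doc, ["penguins are flightless birds", "they live mostly in the southern hemisphere"]]
print(make_nsp_pair(doc, corpus))
```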
3.3) Input Representation
Ref. [2]
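The input-representation figure (Ref. [2]) shows each input position's vector formed by element-wise addition of three embeddings: token (WordPiece), segment (sentence A or B), and position (the speaker notes also say "element-wise adding"). A minimal sketch with toy sizes and randomly initialized tables:

```python
import numpy as np

dim, vocab_size, max_len = 8, 100, 32        # toy sizes, not BERT's real dimensions
tok_emb = np.random.randn(vocab_size, dim)   # WordPiece token embeddings
seg_emb = np.random.randn(2, dim)            # segment embeddings: sentence A (0) / B (1)
pos_emb = np.random.randn(max_len, dim)      # learned position embeddings

def bert_input(token_ids, segment_ids):
    """Element-wise sum of token, segment and position embeddings."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

print(bert_input(np.array([1, 5, 7, 2]), np.array([0, 0, 1, 1])).shape)  # (4, 8)
```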
4. Model Architecture
1) Transformer
2) GELUs
4.1) Transformer
Ref. [2]
4.1) Transformer
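For reference, the two encoder sizes reported in the paper, written as a small config sketch (the values are from the paper; the dictionary layout itself is just illustrative):

```python
# Transformer encoder sizes from the BERT paper (layers L, hidden size H, attention heads A).
BERT_BASE  = {"layers": 12, "hidden": 768,  "heads": 12, "parameters": "110M"}
BERT_LARGE = {"layers": 24, "hidden": 1024, "heads": 16, "parameters": "340M"}
```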
4.2) GELUs
• Gaussian Error Linear Units
• An activation function that combines properties of dropout, zoneout, and ReLUs.
• ReLU: deterministically multiplies the input by zero or one (depending on the input's sign).
• dropout: stochastically multiplies the input by zero.
• zoneout: stochastically multiplies the input by one.
• To build the new GELU activation, the authors merge these behaviours: the input is multiplied by a zero-one mask whose value is determined stochastically and also depends on the input itself.
Ref. [8]
4.2) GELUs
• GELU's zero-one mask
• Multiply the neuron input x by m ~ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ~ N(0, 1), is the cumulative distribution function of the standard normal distribution.
Ref. [8]
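In practice the deterministic expectation of this stochastic mask is used as the activation: GELU(x) = x · Φ(x). A minimal sketch, using the exact error-function form and, for comparison, the tanh approximation given in Ref. [8]:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation from the GELU paper."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu(x):+.4f}  approx={gelu_tanh(x):+.4f}")
```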
5. How to use BERT
1) Fine-Tuning
2) Task-Specific Models
5.1) Fine-Tuning
• Requires only ONE additional output layer
Ref. [2]
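A minimal sketch of what "only ONE additional output layer" means for a sentence-level task: a single linear-plus-softmax classifier on the final hidden state of the [CLS] token; all other parameters come from the pre-trained model. The `encoder` below is a stand-in, not a real BERT API:

```python
import numpy as np

hidden, num_labels = 768, 2                     # BERT-Base hidden size, binary task
W = np.random.randn(num_labels, hidden) * 0.02  # the ONLY new task-specific parameters
b = np.zeros(num_labels)

def classify(encoder, token_ids):
    """Sentence classification head on top of the [CLS] representation."""
    h = encoder(token_ids)           # (seq_len, hidden); h[0] is the [CLS] position
    logits = W @ h[0] + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # class probabilities

# Stand-in encoder producing random hidden states.
fake_encoder = lambda ids: np.random.randn(len(ids), hidden)
print(classify(fake_encoder, [101, 2023, 2003, 102]))
```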
5.2) Task-Specific Models
6. Results
• GLUE (General Language Understanding Evaluation) benchmark results
Ref. [9]
7. Findings
1) Is masked language modeling really more effective than sequential language modeling?
2) Is the next sentence prediction task necessary?
3) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
4) Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to
achieve high fine-tuning accuracy?
5) Does masked language modeling converge more slowly than left-to-right language modeling pretraining
(since masked language modeling only predicts 15% of the input tokens whereas left-to-right language
modeling predicts all of the tokens)?
6) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature extractor?
Ref. [10]
7.1)
Q) Is masked language modeling really more effective than sequential language modeling?
Ans) Yes.
The authors tried training the Transformer on a left-to-right (LTR) language modeling task instead of the masked language modeling task. The results for this setup appear in the "LTR & No NSP" row of the ablation table (Ref. [10]).
7.2)
Q) Is the next sentence prediction task necessary?
Ans) Yes.
For natural language inference and question answering (the MNLI-m, QNLI, and SQuAD
datasets), next sentence prediction seems to help a lot. For paraphrase detection (MRPC),
the performance change is much smaller, and for sentiment analysis (SST-2) the results
are virtually the same.
7.3)
Q) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
Ans) Yes.
In the paper's ablations, larger models improve fine-tuning accuracy across all of the tasks examined, even those with very few labeled training examples.
7.4)
Q) Does BERT really need such a large amount of pre-training (128,000 words/batch *
1,000,000 steps) to achieve high fine-tuning accuracy?
Ans) Yes.
BERT-Base achieves almost 1.0% additional accuracy on MNLI when trained for 1M steps compared to 500k steps.
7.5)
Q) Does masked language modeling converge more slowly than left-to-right language modeling
pretraining (since masked language modeling only predicts 15% of the input tokens whereas
left-to-right language modeling predicts all of the tokens)?
Ans) Yes and no.
For the MNLI task, left-to-right language modeling does converge faster, but masked language modeling achieves a much higher accuracy with the same number of steps.
7.6)
Q) Do I have to fine-tune the entire BERT model? Can't I just use BERT as a fixed feature extractor?
Ans) No, full fine-tuning is not strictly necessary: BERT also works well as a fixed feature extractor.
The authors tested how a BiLSTM model that used fixed embeddings extracted from BERT would perform on the CoNLL-NER dataset. It turns out that using a concatenation of the hidden activations from the last four layers provides very strong performance, only 0.3 F1 behind fine-tuning the entire model. For those on a strict computational budget, this feature-extraction approach is a good option.
References
[1] Pretrained Deep Bidirectional Transformers for Language Understanding (algorithm) | TDLS
(https://youtu.be/BhlOGGzC0Q0)
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
(https://nlp.stanford.edu/seminar/details/jdevlin.pdf)
[3] Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids
(http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)
[4] Word Embedding—ELMo
(https://medium.com/@online.rajib/word-embedding-elmo-7369c8f29bfc)
[5] Improving Language Understanding by Generative Pre-Training
(https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
[6] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
(https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
[7] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
(http://jalammar.github.io/illustrated-bert/)
[8] Gaussian Error Linear Units (GELUs)
(https://arxiv.org/abs/1606.08415)
[9] GLUE Benchmark
(https://gluebenchmark.com)
[10] Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
(http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)


Editor's Notes

  • #9 Feature-based approaches
  • #12 s = softmax-normalized weights; r = scalar parameter
  • #14 Fine-tuning approaches
  • #17 ELMo & GPT are unidirectional???
  • #18 Incrementally??? Deep bidirectionality vs. ELMo-style shallow bidirectionality
  • #19 Incrementally???
  • #23 Random word → The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability. Keep same → The purpose of this is to bias the representation towards the actual observed word.
  • #28 Embedding → element-wise adding
  • #31 The Transformer learns features from all other words in the sequence.
  • #32 Linear decay = why?
  • #38 GLUE = The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. https://gluebenchmark.com/leaderboard
  • #44 Yes & no???
  • #46 CoNLL-NER (Named Entity Recognition): entities are annotated with LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous).