BERT: Pre-training of
Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
-
Google AI Language
-
Slides by Park JeeHyun
28 FEB 19
Contents
1. Motivation
2. Language Representations
3. Basic Idea
4. Model Architecture
5. How to use BERT
6. Results
7. Findings
1. Motivation
• Goal: Build a general, pre-trained language representation
model.
• Why: This model can be adapted to various NLP tasks easily,
we do not have to re-train a model from scratch every time.
• How: ?
2. Language Representations
1) Word Representations (Word embeddings)
• word2vec, GloVe
2) Contextual Representations
• Semi-Supervised Sequence Learning
• ELMo: Deep Contextualized Word Representations
• Generative Pre-Training
3) Problem with Previous Methods
2.1) Word Representation
Ref. [2]
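The figures referenced here (Ref. [2]) illustrate static word embeddings. A minimal sketch of the limitation they motivate, assuming a toy lookup table (the vectors are made-up values, not real word2vec/GloVe weights): a static embedding returns the same vector for a word in every context.

```python
# Toy lookup-table embedding (word2vec/GloVe style): one fixed vector per word.
# The 3-d vectors are made-up toy values, not trained weights.
embedding_table = {
    "bank":  [0.2, -0.7, 0.5],
    "river": [0.1,  0.4, -0.3],
    "money": [0.6, -0.1,  0.2],
}

def embed(tokens):
    return [embedding_table[t] for t in tokens]

s1 = embed(["river", "bank"])  # "bank" as in the side of a river
s2 = embed(["money", "bank"])  # "bank" as in a financial institution
assert s1[1] == s2[1]          # identical vector in both contexts: no disambiguation
```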
2.2) Contextual Representations
• ELMo (Embeddings from Language Models)
2.2) Contextual Representations
• ELMo
• Deep Contextualized Word Representations
↘ Deep: a neural network
↘ Contextualized: y = f(word, context)
↘ Word: words as the fundamental semantic unit
↘ Representations: embeddings
Ref. [3]
2.2) Contextual Representations
• ELMo
Ref. [4]
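The ELMo figures referenced above (Ref. [3], [4]) show a task-specific weighted combination of the biLM layer outputs; the speaker notes mention the softmax-normalized weights s and a scalar parameter. A minimal sketch of that combination, assuming the per-layer hidden states for one token are already computed (function name and shapes are illustrative):

```python
import numpy as np

def elmo_embedding(layer_states, layer_logits, gamma):
    """Weighted sum of biLM layer outputs for a single token.

    layer_states: (L+1, dim) array -- token layer plus each biLSTM layer
    layer_logits: (L+1,)     array -- learned per-task weights, softmax-normalized
    gamma:        scalar          -- task-specific scale
    """
    s = np.exp(layer_logits) / np.exp(layer_logits).sum()   # softmax weights
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# Toy example: 3 layers of 4-dimensional hidden states for one token.
states = np.random.randn(3, 4)
print(elmo_embedding(states, layer_logits=np.zeros(3), gamma=1.0))
```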
2.2) Contextual Representations
• GPT (Generative Pre-Training)
Ref. [2]
2.2) Contextual Representations
• GPT
• Unsupervised pre-training
• Supervised fine-tuning
• Task-specific input transformations
2.2) Contextual Representations
• GPT
Ref. [5]
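GPT's unsupervised pre-training stage maximizes a standard left-to-right language-modeling likelihood, the sum over positions of log P(u_i | u_1, ..., u_{i-1}) (Ref. [5]). A minimal sketch of that objective written as a loss; the `model` argument is a stand-in that returns next-token probabilities, not the actual Transformer decoder:

```python
import math

def lm_loss(model, tokens):
    """Negative log-likelihood of a left-to-right language model.

    model(prefix) is assumed to return a dict mapping next token -> probability.
    """
    nll = 0.0
    for i in range(1, len(tokens)):
        probs = model(tokens[:i])           # condition only on the left context
        nll -= math.log(probs[tokens[i]])   # log P(u_i | u_1 ... u_{i-1})
    return nll

# Stand-in "model": a uniform distribution over a tiny vocabulary.
vocab = ["the", "cat", "sat"]
uniform = lambda prefix: {w: 1.0 / len(vocab) for w in vocab}
print(lm_loss(uniform, ["the", "cat", "sat"]))  # = 2 * log(3)
```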
2.3) Problem with Previous Methods
• Unidirectionality: GPT reads text strictly left-to-right, and ELMo only concatenates independently trained left-to-right and right-to-left models (shallow bidirectionality), so no single layer conditions on both left and right context at once (see the attention-mask sketch below).
Ref. [2], [6]
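A minimal, purely illustrative sketch of the difference: a left-to-right model applies a causal attention mask (each position sees only earlier positions), whereas BERT's encoder lets every position attend to the whole sequence.

```python
import numpy as np

def causal_mask(n):
    """Left-to-right (GPT-style): position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def full_mask(n):
    """Deeply bidirectional (BERT-style): every position attends to every position."""
    return np.ones((n, n), dtype=bool)

print(causal_mask(4).astype(int))   # lower-triangular visibility pattern
print(full_mask(4).astype(int))     # all-ones visibility pattern
```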
3. Basic Idea
1) Masked Language Model
2) Next Sentence Prediction
3) Input Representation
3.1) Masked Language Model
Ref. [2]
3.1) Masked Language Model
• Two downsides to the MLM approach
i. MLM creates a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning.
ii. MLM predicts only 15% of tokens in each batch, which suggests that more pre-training steps may be required for the model to converge.
3.1) Masked Language Model
 15% & 10% = 1.5%
: It does not seem t
o harm the model’s
language understan
d-ing capability.
 to bias the representation towards the actual observed word.
Ref. [2]
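A minimal sketch of the masking rule summarized above: select 15% of the input tokens, then replace 80% of the selected tokens with [MASK], 10% with a random word, and leave 10% unchanged. The vocabulary and sentence are toy placeholders:

```python
import random

VOCAB = ["the", "man", "went", "to", "a", "store", "dog", "cute"]  # toy vocabulary

def mask_tokens(tokens, mask_rate=0.15):
    """Return (corrupted tokens, positions to predict) following the 80/10/10 rule."""
    corrupted, targets = list(tokens), {}
    n_pred = max(1, round(mask_rate * len(tokens)))
    for i in random.sample(range(len(tokens)), n_pred):
        targets[i] = tokens[i]                   # model must predict the original word
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: replace with a random word
        # else: 10%: keep the observed word unchanged
    return corrupted, targets

print(mask_tokens(["the", "man", "went", "to", "the", "store"]))
```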
3.1) Masked Language Model
Ref. [7]
3.2) Next Sentence Prediction
Ref. [2]
3.2) Next Sentence Prediction
Ref. [7]
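A minimal sketch of how next-sentence-prediction examples are built: 50% of the time sentence B is the actual next sentence (IsNext), and 50% of the time it is a random sentence from the corpus (NotNext). The documents below are toy placeholders, and the random pick may occasionally come from the same document; the sketch ignores that corner case.

```python
import random

def make_nsp_pair(document, corpus):
    """Return (sentence A, sentence B, label) for next sentence prediction."""
    i = random.randrange(len(document) - 1)
    a = document[i]
    if random.random() < 0.5:
        return a, document[i + 1], "IsNext"                     # the real next sentence
    return a, random.choice(random.choice(corpus)), "NotNext"   # a random corpus sentence

doc = ["the man went to the store", "he bought a gallon of milk"]
corpus = [doc, ["penguins are flightless birds", "they live mostly in the southern hemisphere"]]
print(make_nsp_pair(doc, corpus))
```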
3.3) Input Representation
Ref. [2]
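The input-representation figure (Ref. [2]) shows each input position's vector formed by element-wise addition of three embeddings: token (WordPiece), segment (sentence A or B), and position (the speaker notes also say "element-wise adding"). A minimal sketch with toy sizes and randomly initialized tables:

```python
import numpy as np

dim, vocab_size, max_len = 8, 100, 32        # toy sizes, not BERT's real dimensions
tok_emb = np.random.randn(vocab_size, dim)   # WordPiece token embeddings
seg_emb = np.random.randn(2, dim)            # segment embeddings: sentence A (0) / B (1)
pos_emb = np.random.randn(max_len, dim)      # learned position embeddings

def bert_input(token_ids, segment_ids):
    """Element-wise sum of token, segment and position embeddings."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

print(bert_input(np.array([1, 5, 7, 2]), np.array([0, 0, 1, 1])).shape)  # (4, 8)
```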
4. Model Architecture
1) Transformer
2) GELUs
4.1) Transformer
Ref. [2]
4.1) Transformer
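For reference, the two encoder sizes reported in the paper, written as a small config sketch (the values are from the paper; the dictionary layout itself is just illustrative):

```python
# Transformer encoder sizes from the BERT paper (layers L, hidden size H, attention heads A).
BERT_BASE  = {"layers": 12, "hidden": 768,  "heads": 12, "parameters": "110M"}
BERT_LARGE = {"layers": 24, "hidden": 1024, "heads": 16, "parameters": "340M"}
```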
4.2) GELUs
• Gaussian Error Linear Units
• An activation function that combines properties of dropout, zoneout, and ReLUs.
• ReLU: deterministically multiplies the input by zero or one (depending on the input's sign).
• dropout: stochastically multiplies the input by zero.
• zoneout: stochastically multiplies the input by one.
• To build the new GELU activation, the authors merge these behaviours: the input is multiplied by a zero-one mask whose value is determined stochastically and also depends on the input itself.
Ref. [8]
4.2) GELUs
• GELU's zero-one mask
• Multiply the neuron input x by m ~ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ~ N(0, 1), is the cumulative distribution function of the standard normal distribution.
Ref. [8]
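In practice the deterministic expectation of this stochastic mask is used as the activation: GELU(x) = x · Φ(x). A minimal sketch, using the exact error-function form and, for comparison, the tanh approximation given in Ref. [8]:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation from the GELU paper."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu(x):+.4f}  approx={gelu_tanh(x):+.4f}")
```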
5. How to use BERT
1) Fine-Tuning
2) Task-Specific Models
5.1) Fine-Tuning
• Requires only ONE additional output layer
Ref. [2]
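A minimal sketch of what "only ONE additional output layer" means for a sentence-level task: a single linear-plus-softmax classifier on the final hidden state of the [CLS] token; all other parameters come from the pre-trained model. The `encoder` below is a stand-in, not a real BERT API:

```python
import numpy as np

hidden, num_labels = 768, 2                     # BERT-Base hidden size, binary task
W = np.random.randn(num_labels, hidden) * 0.02  # the ONLY new task-specific parameters
b = np.zeros(num_labels)

def classify(encoder, token_ids):
    """Sentence classification head on top of the [CLS] representation."""
    h = encoder(token_ids)           # (seq_len, hidden); h[0] is the [CLS] position
    logits = W @ h[0] + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # class probabilities

# Stand-in encoder producing random hidden states.
fake_encoder = lambda ids: np.random.randn(len(ids), hidden)
print(classify(fake_encoder, [101, 2023, 2003, 102]))
```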
5.2) Task-Specific Models
6. Results
• GLUE (General Language Understanding Evaluation) benchmark results
Ref. [9]
7. Findings
1) Is masked language modeling really more effective than sequential language modeling?
2) Is the next sentence prediction task necessary?
3) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
4) Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to
achieve high fine-tuning accuracy?
5) Does masked language modeling converge more slowly than left-to-right language modeling pretraining
(since masked language modeling only predicts 15% of the input tokens whereas left-to-right language
modeling predicts all of the tokens)?
6) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature extractor?
Ref. [10]
7.1)
Q) Is masked language modeling really more effective than sequential language modeling?
Ans) Yes.
The authors tried training the Transformer on a left-to-right (LTR) language modeling task instead of the masked language modeling task. The results for this setup appear in the "LTR & No NSP" row of the ablation table (Ref. [10]).
7.2)
Q) Is the next sentence prediction task necessary?
Ans) Yes.
For natural language inference and question answering (the MNLI-m, QNLI, and SQuAD
datasets), next sentence prediction seems to help a lot. For paraphrase detection (MRPC),
the performance change is much smaller, and for sentiment analysis (SST-2) the results
are virtually the same.
7.3)
Q) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
Ans) Yes.
In the paper's ablations, larger models improve fine-tuning accuracy across all of the tasks examined, even those with very few labeled training examples.
7.4)
Q) Does BERT really need such a large amount of pre-training (128,000 words/batch *
1,000,000 steps) to achieve high fine-tuning accuracy?
Ans) Yes.
BERT-Base achieves almost 1.0% additional accuracy on MNLI when trained for 1M steps compared to 500k steps.
7.5)
Q) Does masked language modeling converge more slowly than left-to-right language modeling
pretraining (since masked language modeling only predicts 15% of the input tokens whereas
left-to-right language modeling predicts all of the tokens)?
Ans) Yes and no.
For the MNLI task, left-to-right language modeling does converge faster, but masked language modeling achieves a much higher accuracy with the same number of steps.
7.6)
Q) Do I have to fine-tune the entire BERT model? Can't I just use BERT as a fixed feature extractor?
Ans) No, full fine-tuning is not strictly necessary: BERT also works well as a fixed feature extractor.
The authors tested how a BiLSTM model that used fixed embeddings extracted from BERT would perform on the CoNLL-NER dataset. It turns out that using a concatenation of the hidden activations from the last four layers provides very strong performance, only 0.3 F1 behind fine-tuning the entire model. For those on a strict computational budget, this feature-extraction approach is a good option.
References
[1] Pretrained Deep Bidirectional Transformers for Language Understanding (algorithm) | TDLS
(https://youtu.be/BhlOGGzC0Q0)
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
(https://nlp.stanford.edu/seminar/details/jdevlin.pdf)
[3] Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids
(http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)
[4] Word Embedding—ELMo
(https://medium.com/@online.rajib/word-embedding-elmo-7369c8f29bfc)
[5] Improving Language Understanding by Generative Pre-Training
(https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
[6] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
(https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
[7] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
(http://jalammar.github.io/illustrated-bert/)
[8] Gaussian Error Linear Units (GELUs)
(https://arxiv.org/abs/1606.08415)
[9] GLUE Benchmark
(https://gluebenchmark.com)
[10] Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
(http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)


Editor's Notes

  • #9 Feature-based approaches
  • #12 s = softmax-normalized weights; r = scalar parameter
  • #14 Fine-tuning approaches
  • #17 ELMo & GPT are unidirectional???
  • #18 Incrementally??? Deep bidirectionality vs. ELMo-style shallow bidirectionality
  • #19 Incrementally???
  • #23 Random word → The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability. Keep same → The purpose of this is to bias the representation towards the actual observed word.
  • #28 Embedding → element-wise adding
  • #31 The Transformer learns features from all other words in the sequence.
  • #32 Linear decay = why?
  • #38 GLUE = The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. https://gluebenchmark.com/leaderboard
  • #44 Yes & no???
  • #46 CoNLL-NER (Named Entity Recognition): entities are annotated with LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous).