BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
https://arxiv.org/abs/1810.04805
!1
2018.11.25
Presented by Young Seok Kim
Articles & Useful Links
• Official

• ArXiv : https://arxiv.org/abs/1810.04805

• Blog : https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• GitHub : https://github.com/google-research/bert

• Unofficial

• Lyrn.ai blog : https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/

• Korean blog : https://rosinality.github.io/2018/10/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
!2
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049 : https://youtu.be/6zGgVIlStXs

• Tutorial with code : http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website : https://blog.openai.com/language-unsupervised/

• Paper : https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
Understanding.” (2018)

• Website : https://gluebenchmark.com/

• Paper : https://arxiv.org/abs/1804.07461
!3
Preliminaries
!4
Attention Is All You Need
• Introduced the Transformer module

• Replaced recurrence with self-attention, reducing the sequential computation required with respect to sequence length
!5
GLUE
• Benchmark introduced in Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” (2018)

• Contains 9 tasks (each described in the Results section below)
!6
BERT

Bidirectional Encoder Representations from Transformers
!7
Motivation
!8
[Figure: traditional RNN / LSTM / GRU units]
Motivation
!9
[Figure: commonly used bidirectional units]
Motivation
Problem
• Unfortunately, standard conditional language models can only be trained left-to-right or
right-to-left, since bidirectional conditioning would allow each word to indirectly “see
itself” in a multi-layered context.
!11
Problem
[Figure: a single Transformer layer mapping input embeddings E1 … EN to output representations T1 … TN ("Single Transformer Layer")]
!13
[Figure: two stacked layers of Transformer blocks mapping E1 … EN to T1 … TN ("Multi-layer Transformer")]
!14
E
1
T 1
E
2
… E
N
Transformer Transformer Transformer…
T2
TN
Transformer Transformer Transformer…
…
Multi-layer Transformer Layer
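Such a stack of bidirectional Transformer layers can be sketched with PyTorch's built-in encoder; the hyperparameters below follow BERT-Base (L=12, H=768, A=12), but this is a rough illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

# A BERT-Base-like bidirectional Transformer encoder: in every layer,
# every position attends to all positions, both left and right.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size H
    nhead=12,              # attention heads A
    dim_feedforward=3072,  # feed-forward size 4H
    activation="gelu",
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # L = 12 layers

embeddings = torch.randn(128, 1, 768)  # (seq_len, batch, hidden): inputs E_1..E_N
outputs = encoder(embeddings)          # contextual representations T_1..T_N
```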
Training Method
• Task #1 - Masked Language Model (MLM)

• Task #2 - Next Sentence Prediction (NSP)
Task #1 - Masked LM
!16
• Fill in the blank!

• Formally, a Cloze test (https://en.wikipedia.org/wiki/Cloze_test)

• Similar to CBOW in Word2Vec?
Masked LM Procedure
• Choose 15% of tokens at random (e.g. "hairy" in "My dog is hairy"). For each chosen token:

• 80% of the time, replace it with [MASK]: "My dog is [MASK]"

• 10% of the time, keep it unchanged: "My dog is hairy"

• 10% of the time, replace it with a random word: "My dog is apple"
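A minimal sketch of this 80/10/10 masking rule (tokenization is simplified to whitespace splitting, and the `[MASK]` string and `vocab` list are illustrative placeholders):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> unchanged, 10% -> random vocabulary word."""
    tokens = list(tokens)
    labels = [None] * len(tokens)       # prediction targets (None = not masked)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    for i in random.sample(range(len(tokens)), n_to_mask):
        labels[i] = tokens[i]           # the model must predict the original token
        r = random.random()
        if r < 0.8:
            tokens[i] = MASK_TOKEN      # 80%: replace with [MASK]
        elif r < 0.9:
            pass                        # 10%: keep the original token
        else:
            tokens[i] = random.choice(vocab)  # 10%: random replacement
    return tokens, labels

# Example: mask_tokens("my dog is hairy".split(), vocab=["apple", "dog", "run"])
```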
Task #2 - Next Sentence Prediction (NSP)
• Classification - [IsNext, NotNext]

• The final pre-trained model achieved 97-98% accuracy on this task.
!19
Embedding
!20
!21
!22
• The first token of every sequence is always the special classification embedding [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

• Sentence pairs are packed together into a single sequence. The authors separate them in two ways:

1. Separate them with the special token [SEP].

2. Add a learned sentence embedding to every token of the corresponding sentence.
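A sketch of how a sentence pair might be packed into one input sequence (the helper name is illustrative, not the official API; segment ids index the learned sentence embeddings):

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two sentences into one BERT input with [CLS]/[SEP] markers
    and segment ids (0 for sentence A, 1 for sentence B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair(["my", "dog", "is", "hairy"], ["he", "likes", "play"])
# tokens   : [CLS] my dog is hairy [SEP] he likes play [SEP]
# segments : 0     0  0   0  0     0     1  1     1    1
```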
Corpus
• BooksCorpus (800M words)

• English Wikipedia (2,500M words)

• Training dataset for next sentence prediction

• 50% - Two adjacent sentences (IsNext)

• 50% - A sentence paired with a random sentence from the corpus (NotNext)
!23
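A sketch of how these 50/50 NSP training pairs could be built (sentence segmentation and corpus handling are simplified; assumes each document has at least two sentences):

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one Next Sentence Prediction example: 50% of the time take
    two adjacent sentences (IsNext); otherwise pair a sentence with a
    random one drawn from the whole corpus (NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"
    return sent_a, sent_b, label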
Differences between BERT and OpenAI GPT
!24

Model      | Corpus                  | [CLS] / [SEP] tokens             | Steps                                  | Learning rate
BERT       | BooksCorpus + Wikipedia | Learned during pre-training      | 1M steps, batch size of 128,000 words  | Task-specific fine-tuning learning rate
OpenAI GPT | BooksCorpus             | Only introduced at fine-tuning   | 1M steps, batch size of 32,000 words   | Same learning rate of 5e-5 for all fine-tuning
Results
!25
Results: GLUE benchmark
!26
GLUE Benchmark
• MNLI: Multi-Genre Natural Language Inference 

• Given a pair of sentences, the goal is to predict whether the second sentence is an
entailment, contradiction, or neutral with respect to the first sentence.

• Two versions - MNLI matched, MNLI mismatched

• Two sentence, classification task
!27
GLUE Benchmark
• QQP: Quora Question Pairs

• Quora Question Pairs is a binary classification task where the goal is to determine if
two questions asked on Quora are semantically equivalent 

• Two sentence, binary classification task
!28
GLUE Benchmark
• QNLI: Question Natural Language Inference 

• Converted from SQuAD: the positive examples are (question, sentence) pairs that do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph that do not contain the answer.

• Two sentence, binary classification task
!29
GLUE Benchmark
• SST-2: Stanford Sentiment Treebank 

• Binary single-sentence classification task consisting of sentences extracted from
movie reviews with human annotations of their sentiment 

• One sentence, binary classification task
!30
GLUE Benchmark
• CoLA: Corpus of Linguistic Acceptability 

• Binary single-sentence classification task, where the goal is to predict whether an
English sentence is linguistically “acceptable” or not 

• One sentence, binary classification task
!31
GLUE Benchmark
• STS-B: The Semantic Textual Similarity Benchmark

• A collection of sentence pairs drawn from news headlines and other sources, annotated with a score from 1 to 5 denoting how similar the two sentences are in semantic meaning

• Two sentence, regression task
!32
GLUE Benchmark
• MRPC: Microsoft Research Paraphrase Corpus 

• Consists of sentence pairs automatically extracted from online news sources, with
human annotations for whether the sentences in the pair are semantically equivalent 

• Two sentence, binary classification task
!33
GLUE Benchmark
• RTE: Recognizing Textual Entailment 

• A binary entailment task similar to MNLI, but with much less training data 

• Two sentence, binary classification task
!34
GLUE Benchmark
• WNLI: Winograd Natural Language Inference

• A small natural language inference dataset derived from the Winograd Schema Challenge

• The GLUE webpage notes that there are issues with the construction of this dataset

• The authors therefore exclude this set
!35
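All of the GLUE tasks above are fine-tuned the same way: the final [CLS] hidden state feeds a single added classification layer, P = softmax(C Wᵀ). A minimal sketch (the class name is illustrative; dimensions follow BERT-Base, with 3 labels as for MNLI):

```python
import torch
import torch.nn as nn

class BertClassificationHead(nn.Module):
    """Softmax classifier over the final [CLS] hidden state C:
    P = softmax(C W^T)."""
    def __init__(self, hidden_size=768, num_labels=3):  # 3 labels for MNLI
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_hidden_state):            # shape: (batch, hidden_size)
        logits = self.classifier(cls_hidden_state)  # shape: (batch, num_labels)
        return torch.log_softmax(logits, dim=-1)
```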
!36
SQuAD v1.1
• The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs

• Given a question and a paragraph from Wikipedia containing the answer, the task is to
predict the answer text span in the paragraph
!38
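Fine-tuning for SQuAD adds only two learned vectors S and E; the score of token i being the start (or end) of the answer span is the dot product of S (or E) with its final hidden state T_i. A sketch under those assumptions (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class SquadSpanHead(nn.Module):
    """Predict answer spans: start/end logits are dot products between
    learned vectors S, E and each token's final hidden state T_i."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.start_vector = nn.Parameter(torch.randn(hidden_size))
        self.end_vector = nn.Parameter(torch.randn(hidden_size))

    def forward(self, sequence_output):                      # (batch, seq_len, hidden)
        start_logits = sequence_output @ self.start_vector   # (batch, seq_len)
        end_logits = sequence_output @ self.end_vector       # (batch, seq_len)
        return start_logits, end_logits

# At inference, pick the (start, end) pair with the highest combined score,
# subject to start <= end.
```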
Results on SQuAD v1.1
!39
SWAG
• Situations With Adversarial Generations Dataset

• Given a sentence from a video captioning dataset, the task is to decide among four
choices the most plausible continuation.
!40
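For SWAG, each of the four (context, continuation) pairs is run through BERT separately, and a learned vector scores each pair's [CLS] output; a softmax over the four scores picks the most plausible continuation. A sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class SwagChoiceHead(nn.Module):
    """Score each of the four candidate continuations via a dot product
    with its [CLS] vector, then softmax across the choices."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.score_vector = nn.Parameter(torch.randn(hidden_size))

    def forward(self, cls_vectors):                # (batch, 4, hidden): one [CLS] per choice
        scores = cls_vectors @ self.score_vector   # (batch, 4)
        return torch.softmax(scores, dim=-1)
```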
SWAG Results
GLUE Results
Ablation Study
!43
Model size
!44
Conclusion
• Unsupervised pre-training is now an integral part of many language understanding
systems.

• Now models can be truly trained with deep bidirectional architectures.

• State-of-the-art on almost every NLP task, in some cases surpassing human performance.
!45
Personal thoughts
• The paper is well written and easy to follow

• SOTA not just on one task/dataset but on almost all of them

• I think this method is going to be used universally as a baseline for future NLP research

• A more objective comparison between BERT and OpenAI GPT was possible because the BERT-Base parameters were chosen to be almost identical to OpenAI GPT's

• The model looks very simple, yet it is very flexible: it adapts to various tasks with simple modifications of the top layer

• Unsupervised pre-training followed by supervised fine-tuning might prevail in many domains.
!46
Thank you!
!47
References
• Images are either from 

• original papers or

• https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15

• https://colah.github.io/posts/2015-08-Understanding-LSTMs/

• https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
!48
