BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
https://arxiv.org/abs/1810.04805
!1
2018.11.25
Presented by Young Seok Kim
Articles & Useful Links
• Official

• ArXiv : https://arxiv.org/abs/1810.04805

• Blog : https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• GitHub : https://github.com/google-research/bert

• Unofficial

• Lyrn.ai blog : https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/

• Korean blog : https://rosinality.github.io/2018/10/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
!2
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049 : https://youtu.be/6zGgVIlStXs

• Tutorial with code : http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website : https://blog.openai.com/language-unsupervised/

• Paper : https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
Understanding.” (2018)

• Website : https://gluebenchmark.com/

• Paper : https://arxiv.org/abs/1804.07461
!3
Preliminaries
!4
Attention Is All You Need
• Introduced the Transformer module

• Replaced recurrence with self-attention, reducing the sequential computation required with respect to sequence length
!5
GLUE
• Benchmark introduced in Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” (2018)

• Contains 9 tasks (each described in the Results section below)
!6
BERT

Bidirectional Encoder Representations from Transformers
!7
Motivation
!8
[Figure: traditional RNN / LSTM / GRU units]
Motivation
!9
[Figure: commonly used bidirectional units]
Motivation
Problem
• Unfortunately, standard conditional language models can only be trained left-to-right or
right-to-left, since bidirectional conditioning would allow each word to indirectly “see
itself” in a multi-layered context.
!11
Problem
[Figure: a single Transformer layer mapping input embeddings E1 … EN to output representations T1 … TN ("Single Transformer Layer")]
!13
[Figure: two stacked layers of Transformer blocks mapping E1 … EN to T1 … TN ("Multi-layer Transformer")]
!14
E
1
T 1
E
2
… E
N
Transformer Transformer Transformer…
T2
TN
Transformer Transformer Transformer…
…
Multi-layer Transformer Layer
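Such a stack of bidirectional Transformer layers can be sketched with PyTorch's built-in encoder; the hyperparameters below follow BERT-Base (L=12, H=768, A=12), but this is a rough illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

# A BERT-Base-like bidirectional Transformer encoder: in every layer,
# every position attends to all positions, both left and right.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size H
    nhead=12,              # attention heads A
    dim_feedforward=3072,  # feed-forward size 4H
    activation="gelu",
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # L = 12 layers

embeddings = torch.randn(128, 1, 768)  # (seq_len, batch, hidden): inputs E_1..E_N
outputs = encoder(embeddings)          # contextual representations T_1..T_N
```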
Training Method
• Task #1 - Masked Language Model (MLM)

• Task #2 - Next Sentence Prediction (NSP)
Task #1 - Masked LM
!16
• Fill in the blank!

• Formally, a Cloze test (https://en.wikipedia.org/wiki/Cloze_test)

• Similar to CBOW in Word2Vec?
Masked LM Procedure
• Choose 15% of tokens at random (e.g. "hairy" in "My dog is hairy"). For each chosen token:

• 80% of the time, replace it with [MASK]: "My dog is [MASK]"

• 10% of the time, keep it unchanged: "My dog is hairy"

• 10% of the time, replace it with a random word: "My dog is apple"
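A minimal sketch of this 80/10/10 masking rule (tokenization is simplified to whitespace splitting, and the `[MASK]` string and `vocab` list are illustrative placeholders):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> unchanged, 10% -> random vocabulary word."""
    tokens = list(tokens)
    labels = [None] * len(tokens)       # prediction targets (None = not masked)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    for i in random.sample(range(len(tokens)), n_to_mask):
        labels[i] = tokens[i]           # the model must predict the original token
        r = random.random()
        if r < 0.8:
            tokens[i] = MASK_TOKEN      # 80%: replace with [MASK]
        elif r < 0.9:
            pass                        # 10%: keep the original token
        else:
            tokens[i] = random.choice(vocab)  # 10%: random replacement
    return tokens, labels

# Example: mask_tokens("my dog is hairy".split(), vocab=["apple", "dog", "run"])
```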
Task #2 - Next Sentence Prediction (NSP)
• Classification - [IsNext, NotNext]

• The final pre-trained model achieved 97-98% accuracy on this task.
!19
Embedding
!20
!21
!22
• The first token of every sequence is always the special classification embedding [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

• Sentence pairs are packed together into a single sequence. The authors separate them in two ways:

1. Separate them with the special token [SEP].

2. Add a learned sentence embedding to every token of the corresponding sentence.
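A sketch of how a sentence pair might be packed into one input sequence (the helper name is illustrative, not the official API; segment ids index the learned sentence embeddings):

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two sentences into one BERT input with [CLS]/[SEP] markers
    and segment ids (0 for sentence A, 1 for sentence B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair(["my", "dog", "is", "hairy"], ["he", "likes", "play"])
# tokens   : [CLS] my dog is hairy [SEP] he likes play [SEP]
# segments : 0     0  0   0  0     0     1  1     1    1
```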
Corpus
• BooksCorpus (800M words)

• English Wikipedia (2,500M words)

• Training dataset for next sentence prediction

• 50% - Two adjacent sentences (IsNext)

• 50% - A sentence paired with a random sentence from the corpus (NotNext)
!23
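A sketch of how these 50/50 NSP training pairs could be built (sentence segmentation and corpus handling are simplified; assumes each document has at least two sentences):

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one Next Sentence Prediction example: 50% of the time take
    two adjacent sentences (IsNext); otherwise pair a sentence with a
    random one drawn from the whole corpus (NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"
    return sent_a, sent_b, label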
Differences between BERT and OpenAI GPT
!24

Model      | Corpus                  | [CLS] / [SEP] tokens             | Steps                                  | Learning rate
BERT       | BooksCorpus + Wikipedia | Learned during pre-training      | 1M steps, batch size of 128,000 words  | Task-specific fine-tuning learning rate
OpenAI GPT | BooksCorpus             | Only introduced at fine-tuning   | 1M steps, batch size of 32,000 words   | Same learning rate of 5e-5 for all fine-tuning
Results
!25
Results: GLUE benchmark
!26
GLUE Benchmark
• MNLI: Multi-Genre Natural Language Inference 

• Given a pair of sentences, the goal is to predict whether the second sentence is an
entailment, contradiction, or neutral with respect to the first sentence.

• Two versions - MNLI matched, MNLI mismatched

• Two sentence, classification task
!27
GLUE Benchmark
• QQP: Quora Question Pairs

• Quora Question Pairs is a binary classification task where the goal is to determine if
two questions asked on Quora are semantically equivalent 

• Two sentence, binary classification task
!28
GLUE Benchmark
• QNLI: Question Natural Language Inference 

• Converted from SQuAD: the positive examples are (question, sentence) pairs that do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph that do not contain the answer.

• Two sentence, binary classification task
!29
GLUE Benchmark
• SST-2: Stanford Sentiment Treebank 

• Binary single-sentence classification task consisting of sentences extracted from
movie reviews with human annotations of their sentiment 

• One sentence, binary classification task
!30
GLUE Benchmark
• CoLA: Corpus of Linguistic Acceptability 

• Binary single-sentence classification task, where the goal is to predict whether an
English sentence is linguistically “acceptable” or not 

• One sentence, binary classification task
!31
GLUE Benchmark
• STS-B: The Semantic Textual Similarity Benchmark

• A collection of sentence pairs drawn from news headlines and other sources, annotated with a score from 1 to 5 denoting how similar the two sentences are in semantic meaning

• Two sentence, regression task
!32
GLUE Benchmark
• MRPC: Microsoft Research Paraphrase Corpus 

• Consists of sentence pairs automatically extracted from online news sources, with
human annotations for whether the sentences in the pair are semantically equivalent 

• Two sentence, binary classification task
!33
GLUE Benchmark
• RTE: Recognizing Textual Entailment 

• A binary entailment task similar to MNLI, but with much less training data 

• Two sentence, binary classification task
!34
GLUE Benchmark
• WNLI: Winograd Natural Language Inference

• A small natural language inference dataset derived from the Winograd Schema Challenge

• The GLUE webpage notes that there are issues with the construction of this dataset

• The authors therefore exclude this set
!35
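All of the GLUE tasks above are fine-tuned the same way: the final [CLS] hidden state feeds a single added classification layer, P = softmax(C Wᵀ). A minimal sketch (the class name is illustrative; dimensions follow BERT-Base, with 3 labels as for MNLI):

```python
import torch
import torch.nn as nn

class BertClassificationHead(nn.Module):
    """Softmax classifier over the final [CLS] hidden state C:
    P = softmax(C W^T)."""
    def __init__(self, hidden_size=768, num_labels=3):  # 3 labels for MNLI
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_hidden_state):            # shape: (batch, hidden_size)
        logits = self.classifier(cls_hidden_state)  # shape: (batch, num_labels)
        return torch.log_softmax(logits, dim=-1)
```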
!36
SQuAD v1.1
• The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs

• Given a question and a paragraph from Wikipedia containing the answer, the task is to
predict the answer text span in the paragraph
!38
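Fine-tuning for SQuAD adds only two learned vectors S and E; the score of token i being the start (or end) of the answer span is the dot product of S (or E) with its final hidden state T_i. A sketch under those assumptions (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class SquadSpanHead(nn.Module):
    """Predict answer spans: start/end logits are dot products between
    learned vectors S, E and each token's final hidden state T_i."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.start_vector = nn.Parameter(torch.randn(hidden_size))
        self.end_vector = nn.Parameter(torch.randn(hidden_size))

    def forward(self, sequence_output):                      # (batch, seq_len, hidden)
        start_logits = sequence_output @ self.start_vector   # (batch, seq_len)
        end_logits = sequence_output @ self.end_vector       # (batch, seq_len)
        return start_logits, end_logits

# At inference, pick the (start, end) pair with the highest combined score,
# subject to start <= end.
```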
Results on SQuAD v1.1
!39
SWAG
• Situations With Adversarial Generations Dataset

• Given a sentence from a video captioning dataset, the task is to decide among four
choices the most plausible continuation.
!40
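For SWAG, each of the four (context, continuation) pairs is run through BERT separately, and a learned vector scores each pair's [CLS] output; a softmax over the four scores picks the most plausible continuation. A sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class SwagChoiceHead(nn.Module):
    """Score each of the four candidate continuations via a dot product
    with its [CLS] vector, then softmax across the choices."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.score_vector = nn.Parameter(torch.randn(hidden_size))

    def forward(self, cls_vectors):                # (batch, 4, hidden): one [CLS] per choice
        scores = cls_vectors @ self.score_vector   # (batch, 4)
        return torch.softmax(scores, dim=-1)
```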
SWAG Results
GLUE Results
Ablation Study
!43
Model size
!44
Conclusion
• Unsupervised pre-training is now an integral part of many language understanding
systems.

• Now models can be truly trained with deep bidirectional architectures.

• State-of-the-art on almost every NLP task, in some cases surpassing human performance.
!45
Personal thoughts
• The paper is well written and easy to follow

• SOTA not just on one task/dataset but on almost all of them

• I think this method is going to be used universally as a baseline for future NLP research

• A more objective comparison between BERT and OpenAI GPT was possible because the BERT-Base parameters were chosen to be almost identical to OpenAI GPT's

• The model looks very simple, yet it is very flexible: it adapts to various tasks with simple modifications of the top layer

• Unsupervised pre-training followed by supervised fine-tuning might prevail in many domains.
!46
Thank you!
!47
References
• Images are either from 

• original papers or

• https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15

• https://colah.github.io/posts/2015-08-Understanding-LSTMs/

• https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
!48
