Published in Towards Data Science
Rani Horev
Nov 10, 2018 · 7 min read
BERT Explained: State of the art language model for NLP
BERT (Bidirectional Encoder Representations from Transformers) is a recent
paper published by researchers at Google AI Language. It has caused a stir
in the Machine Learning community by presenting state-of-the-art results in
a wide variety of NLP tasks, including Question Answering (SQuAD v1.1),
Natural Language Inference (MNLI), and others.
BERT’s key technical innovation is applying the bidirectional training of
Transformer, a popular attention model, to language modelling. This is in
contrast to previous efforts which looked at a text sequence either from left
to right or combined left-to-right and right-to-left training. The paper’s results
show that a language model which is bidirectionally trained can have a
deeper sense of language context and flow than single-direction language
models. In the paper, the researchers detail a novel technique named Masked LM (MLM), which makes bidirectional training possible in models where it previously was not.
Background
In the field of computer vision, researchers have repeatedly shown the value
of transfer learning — pre-training a neural network model on a known task,
for instance ImageNet, and then performing fine-tuning — using the trained
neural network as the basis of a new purpose-specific model. In recent
years, researchers have been showing that a similar technique can be useful
in many natural language tasks.
A different approach, which is also popular in NLP tasks and exemplified in
the recent ELMo paper, is feature-based training. In this approach, a pre-
trained neural network produces word embeddings which are then used as
features in NLP models.
How BERT works
BERT makes use of the Transformer, an attention-based architecture that learns contextual relations between words (or sub-words) in a text. In its vanilla
form, Transformer includes two separate mechanisms — an encoder that
reads the text input and a decoder that produces a prediction for the task.
Since BERT’s goal is to generate a language model, only the encoder
mechanism is necessary. The detailed workings of Transformer are
described in a paper by Google.
As opposed to directional models, which read the text input sequentially
(left-to-right or right-to-left), the Transformer encoder reads the entire
sequence of words at once. Therefore it is considered bidirectional, though it
would be more accurate to say that it’s non-directional. This characteristic
allows the model to learn the context of a word based on all of its
surroundings (left and right of the word).
The chart below is a high-level description of the Transformer encoder. The
input is a sequence of tokens, which are first embedded into vectors and
then processed in the neural network. The output is a sequence of vectors of
size H, in which each vector corresponds to an input token with the same
index.
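To make the shapes concrete, here is a minimal PyTorch sketch of an encoder-only stack with BERT_base-like dimensions (H = 768, 12 layers, values taken from the paper). It only illustrates the input and output shapes; it is not the actual BERT architecture or its trained weights.

    import torch
    import torch.nn as nn

    # A rough stand-in for the BERT_base encoder: 12 Transformer encoder layers with
    # hidden size H = 768. The real model also adds segment and position embeddings
    # and uses a WordPiece vocabulary; both are omitted here.
    H, LAYERS, HEADS, VOCAB = 768, 12, 12, 30522

    embed = nn.Embedding(VOCAB, H)
    layer = nn.TransformerEncoderLayer(d_model=H, nhead=HEADS, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=LAYERS)

    token_ids = torch.randint(0, VOCAB, (1, 10))   # one sequence of 10 token ids
    vectors = encoder(embed(token_ids))

    print(vectors.shape)   # torch.Size([1, 10, 768]): one H-sized vector per input token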
When training language models, a core challenge is defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
Masked LM (MLM)
Before feeding word sequences into BERT, 15% of the words in each
sequence are replaced with a [MASK] token. The model then attempts to
predict the original value of the masked words, based on the context
provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires the following steps (sketched in code after the list):
1. Adding a classification layer on top of the encoder output.
2. Multiplying the output vectors by the embedding matrix, transforming
them into the vocabulary dimension.
3. Calculating the probability of each word in the vocabulary with softmax.
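These three steps can be pictured with a short PyTorch sketch. The extra transform layer and GELU non-linearity loosely follow the released BERT code, but all names, shapes, and the random stand-in for the encoder output are illustrative rather than the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    H, VOCAB = 768, 30522                     # hidden size and vocabulary size (BERT_base-like)

    embedding = nn.Embedding(VOCAB, H)        # token embedding matrix (shared with the output)
    transform = nn.Linear(H, H)               # 1. classification layer on top of the encoder output

    encoder_output = torch.randn(1, 10, H)    # stand-in for the encoder output of a 10-token input

    hidden = F.gelu(transform(encoder_output))
    logits = hidden @ embedding.weight.t()    # 2. multiply by the embedding matrix -> vocabulary dimension
    probs = F.softmax(logits, dim=-1)         # 3. probability of each vocabulary word at each position

    # The MLM loss is cross-entropy computed only at the masked positions.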
The BERT loss function takes into consideration only the prediction of the
masked values and ignores the prediction of the non-masked words. As a
consequence, the model converges slower than directional models, a
characteristic which is offset by its increased context awareness (see
Takeaways #3).
Note: In practice, the BERT implementation is slightly more elaborate and doesn’t
replace all of the 15% masked words. See Appendix A for additional information.
Next Sentence Prediction (NSP)
In the BERT training process, the model receives pairs of sentences as input
and learns to predict if the second sentence in the pair is the subsequent
sentence in the original document. During training, 50% of the inputs are a
pair in which the second sentence is the subsequent sentence in the original
document, while in the other 50% a random sentence from the corpus is
chosen as the second sentence. The assumption is that the random sentence
will be disconnected from the first sentence.
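A minimal sketch of how such 50/50 sentence pairs might be constructed; the function name and labels are illustrative, and the actual pipeline operates on tokenized documents rather than raw sentence strings.

    import random

    def make_nsp_example(doc_sentences, all_sentences):
        """Build one Next Sentence Prediction example from a document (a list of sentences)."""
        i = random.randrange(len(doc_sentences) - 1)
        sentence_a = doc_sentences[i]
        if random.random() < 0.5:                        # 50%: the actual next sentence
            sentence_b, label = doc_sentences[i + 1], "IsNext"
        else:                                            # 50%: a random sentence from the corpus
            sentence_b, label = random.choice(all_sentences), "NotNext"
        return sentence_a, sentence_b, label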
To help the model distinguish between the two sentences during training, the input is processed in the following way before entering the model (a minimal embedding sketch follows the list):
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP]
token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to
each token. Sentence embeddings are similar in concept to token
embeddings with a vocabulary of 2.
3. A positional embedding is added to each token to indicate its position in
the sequence. The concept and implementation of positional embedding
are presented in the Transformer paper.
Source: BERT [Devlin et al., 2018], with modifications
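A minimal PyTorch sketch of how the three embeddings are summed into the model input. The token ids below are invented for illustration (the real model uses a WordPiece tokenizer), and 512 is the maximum sequence length used in the paper.

    import torch
    import torch.nn as nn

    H, VOCAB, MAX_LEN = 768, 30522, 512

    token_emb    = nn.Embedding(VOCAB, H)     # WordPiece token embeddings
    segment_emb  = nn.Embedding(2, H)         # "vocabulary of 2": sentence A vs. sentence B
    position_emb = nn.Embedding(MAX_LEN, H)   # learned positional embeddings

    # [CLS] my dog barks [SEP] he is loud [SEP]   (ids invented for illustration)
    token_ids   = torch.tensor([[101, 2026, 3899, 18223, 102, 2002, 2003, 5189, 102]])
    segment_ids = torch.tensor([[0,   0,    0,    0,     0,   1,    1,    1,    1]])
    positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

    # The model input is the element-wise sum of the three embeddings
    embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
    print(embeddings.shape)                   # torch.Size([1, 9, 768])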
To predict whether the second sentence is indeed connected to the first, the following steps are performed (a minimal code sketch follows the list):
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector,
using a simple classification layer (learned matrices of weights and
biases).
3. The probability of IsNextSequence is calculated with softmax.
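A minimal sketch of that classification head, using a random tensor as a stand-in for the Transformer output (the released model additionally passes the [CLS] vector through a small pooling layer, which is omitted here).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    H = 768
    nsp_classifier = nn.Linear(H, 2)          # simple classification layer: learned weights and biases

    sequence_output = torch.randn(1, 9, H)    # step 1: stand-in for the Transformer output
    cls_vector = sequence_output[:, 0, :]     # step 2: the output vector at the [CLS] position

    logits = nsp_classifier(cls_vector)       # a 2-dimensional score: IsNext vs. NotNext
    probs = F.softmax(logits, dim=-1)         # step 3: probability of IsNextSequence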
When training the BERT model, Masked LM and Next Sentence Prediction
are trained together, with the goal of minimizing the combined loss function
of the two strategies.
How to use BERT (Fine-tuning)
Using BERT for a specific task is relatively straightforward. BERT can be adapted to a wide variety of language tasks by adding only a small layer to the core model:
1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token (see the sketch after this list).
2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a
question regarding a text sequence and is required to mark the answer in
the sequence. Using BERT, a Q&A model can be trained by learning two
extra vectors that mark the beginning and the end of the answer.
3. In Named Entity Recognition (NER), the software receives a text
sequence and is required to mark the various types of entities (Person,
Organization, Date, etc) that appear in the text. Using BERT, a NER model
can be trained by feeding the output vector of each token into a
classification layer that predicts the NER label.
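As a concrete illustration of case 1, the sketch below loads a pre-trained BERT through the Hugging Face transformers package (an assumption of this example, not something used in the paper) and puts a small classification layer on the [CLS] output; fine-tuning would then train this layer, together with BERT itself, on labelled sentiment data.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer   # assumes the transformers package is installed

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    classifier = nn.Linear(bert.config.hidden_size, 2)  # the small layer added to the core model

    inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)

    cls_vector = outputs.last_hidden_state[:, 0, :]      # Transformer output for the [CLS] token
    sentiment_logits = classifier(cls_vector)            # 2-way sentiment scores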
In fine-tuning, most hyper-parameters stay the same as in BERT pre-training, and the paper gives specific guidance (Section 3.5) on the hyper-parameters that require tuning. The BERT team has used this technique to
achieve state-of-the-art results on a wide variety of challenging natural
language tasks, detailed in Section 4 of the paper.
Takeaways
1. Model size matters, even at huge scale. BERT_large, with 345 million
parameters, is the largest model of its kind. It is demonstrably superior
on small-scale tasks to BERT_base, which uses the same architecture
with “only” 110 million parameters.
2. With enough training data, more training steps == higher accuracy. For instance, on the MNLI task, BERT_base accuracy improves by 1.0% when trained for 1M steps with a 128,000-word batch size, compared to 500K steps with the same batch size.
3. BERT’s bidirectional approach (MLM) converges slower than left-to-right
approaches (because only 15% of words are predicted in each batch) but
bidirectional training still outperforms left-to-right training after a small
number of pre-training steps.
Source: BERT [Devlin et al., 2018]
Compute considerations (training and applying)
Conclusion
BERT is undoubtedly a breakthrough in the use of Machine Learning for
Natural Language Processing. The fact that it's approachable and allows fast fine-tuning will likely enable a wide range of practical applications in the future. In this summary, we attempted to describe the main ideas of the
paper while not drowning in excessive technical details. For those wishing
for a deeper dive, we highly recommend reading the full article and ancillary
articles referenced in it. Another useful reference is the BERT source code
and models, which cover 103 languages and were generously released as
open source by the research team.
Appendix A — Word Masking
Training the language model in BERT is done by predicting 15% of the input tokens, which are picked at random. These tokens are pre-processed as follows: 80% are replaced with a “[MASK]” token, 10% with a random word, and 10% are left as the original word. The intuition that led the
authors to pick this approach is as follows (Thanks to Jacob Devlin from
Google for the insight):
If we used [MASK] 100% of the time the model wouldn’t necessarily
produce good token representations for non-masked words. The non-
masked tokens were still used for context, but the model was optimized
for predicting masked words.
If we used [MASK] 90% of the time and random words 10% of the time,
this would teach the model that the observed word is never correct.
If we used [MASK] 90% of the time and kept the same word 10% of the
time, then the model could just trivially copy the non-contextual
embedding.
No ablation was done on the ratios of this approach, and it may have worked
better with different ratios. In addition, the model performance wasn’t tested
with simply masking 100% of the selected tokens.
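The 80/10/10 pre-processing described above can be sketched in a few lines of Python. The mask token id, the vocabulary size, and the -100 "ignore" label are illustrative choices, and the handling of special tokens is omitted.

    import random

    MASK_ID, VOCAB = 103, 30522                # illustrative values

    def mask_tokens(token_ids, mask_prob=0.15):
        """Pick ~15% of positions; replace 80% with [MASK], 10% with a random id, keep 10% unchanged."""
        masked = list(token_ids)
        labels = [-100] * len(token_ids)       # -100 marks positions that are not predicted (ignored by the loss)
        for i in range(len(token_ids)):
            if random.random() < mask_prob:
                labels[i] = token_ids[i]       # the model must predict the original token here
                r = random.random()
                if r < 0.8:
                    masked[i] = MASK_ID                  # 80%: replace with [MASK]
                elif r < 0.9:
                    masked[i] = random.randrange(VOCAB)  # 10%: replace with a random word
                # else: 10% keep the original token
        return masked, labels

    masked, labels = mask_tokens([2026, 3899, 2003, 5189, 1998, 2200, 4010])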
For more summaries of recent Machine Learning research, check out Lyrn.AI.