Transformer and BERT model
Contents
• Attention model over traditional RNN
• Transformer Architecture
• How does the Transformer model work?
• Pre-trained Transformer models
• Introduction to BERT
• How is BERT different?
• BERT Architecture
• BERT Embeddings
• Why BERT Embeddings?
• Pre-trained BERT models
Attention model over traditional RNN
[Figure: attention score]
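As a rough sketch of what the attention-score figure shows (the query and key vectors below are made-up toy values, not taken from any real model), the score between a query token and every key token is a scaled dot product passed through a softmax, so each row of the result sums to 1:

```python
import numpy as np

def attention_scores(Q, K):
    """Scaled dot-product attention scores: how strongly each query token attends to each key token."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # scale by sqrt of the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys

# Toy 2-dimensional query/key vectors (illustrative values only)
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.5, 0.5]])
print(attention_scores(Q, K))                               # each row sums to 1
```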
Transformer Architecture
The Transformer architecture is based on:
• An encoder-decoder structure but does not rely on
recurrence and convolutions in order to generate an
output.
• An attention mechanism that learns contextual
relations between words (or sub-words) in a text.
• Positional encoding to inject word-order information (a sketch follows this list)
• Takes advantage of parallelization:
• Processes all tokens at once
• Can process much more data at once than an RNN
Attention Is All You Need
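Because the encoder has no recurrence, positional encoding is what gives the model word order. A minimal sketch of the sinusoidal encoding described in "Attention Is All You Need" (the sequence length and model dimension below are toy values chosen for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(seq_len)[:, None]                        # token positions, shape (seq_len, 1)
    i = np.arange(d_model)[None, :]                          # embedding dimensions, shape (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# The encoding is simply added to the token embeddings before the first encoder layer.
print(positional_encoding(seq_len=4, d_model=8).shape)       # (4, 8)
```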
How does the Transformer model work?
Step 1. Input a natural-language sentence and embed each word.
Step 2. Perform multi-headed attention: multiply the embedded words by the respective weight matrices.
How does the Transformer model work? (contd.)
Step 3. Calculate the attention using the resulting Q, K, and V matrices.
Step 4. Concatenate the per-head results to produce the output matrix, which has the same dimensions as the input matrix (see the sketch below).
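A minimal NumPy sketch of Steps 2–4 (the weight matrices, shapes, and head count below are toy values for illustration, not the ones used in the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Steps 2-4: project the embeddings to Q/K/V per head, attend, concatenate, project."""
    d_head = X.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):
        # Step 2: multiply the embedded words by the per-head weight matrices
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        # Step 3: scaled dot-product attention using the resulting Q, K, V
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    # Step 4: concatenate the heads and project back to the input dimension
    return np.concatenate(heads, axis=-1) @ Wo

# Toy example: 5 tokens, embedding size 8, 2 heads (illustrative shapes only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(2, 8, 4)) for _ in range(3))
Wo = rng.normal(size=(8, 8))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (5, 8), same shape as the input
```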
Introduction to BERT
• BERT stands for Bidirectional Encoder Representations from Transformers.
• It was introduced by researchers at Google AI Language in 2018.
• Today, BERT powers Google Search.
• Historically, language models could only read text input sequentially -- either left-to-right
or right-to-left -- but couldn't do both at the same time.
• BERT’s key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling.
• There are two steps in BERT framework: pre-training and fine-tuning.
• During pre-training, the model is trained on unlabelled data over different pre-
training tasks.
• For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialized with the same pre-trained parameters (a minimal fine-tuning sketch follows this slide).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
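A rough illustration of the fine-tuning step using the Hugging Face transformers library (the two-example sentiment "dataset" and the hyperparameters are made up for the sketch; real fine-tuning loops over a full labelled dataset for several epochs):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# The encoder weights come from pre-training; the classification head on top is newly initialised.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]                 # toy labelled downstream data (assumed)
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)                  # passing labels makes the model return a loss
outputs.loss.backward()                                  # one toy gradient step
optimizer.step()
```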
How is BERT different?
Unlike the full Transformer, BERT uses only the encoder stack and has no decoder. BERT is pre-trained using two unsupervised tasks:
• Masked Language Modelling (MLM): some input tokens are replaced with a [MASK] token and the model learns to predict them (see the pipeline example below).
• Next Sentence Prediction (NSP): given a pair of sentences, the model predicts whether the second actually follows the first.
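The MLM objective can be seen directly through the transformers fill-mask pipeline (the example sentence is just an illustration; downloading the bert-base-uncased checkpoint is assumed to be possible):

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] -- this is exactly the MLM pre-training task.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```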
BERT Architecture
There are two types of pre-trained versions of BERT, depending on the scale of the model architecture (a loading sketch follows this slide):
• BERT-Base: 12 layers, 768 hidden units, 12 attention heads, 110M parameters (Cased, Uncased)
• BERT-Large: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters (Cased, Uncased)
Fun fact: BERT-Base was trained on 4 cloud TPUs
for 4 days and BERT-Large was trained on 16 TPUs
for 4 days!
** The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).
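A quick way to check the BERT-Base figures above, assuming the transformers library is installed and the checkpoint can be downloaded:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_hidden_layers,
      model.config.hidden_size,
      model.config.num_attention_heads)                                  # 12 768 12 for BERT-Base
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # roughly 110M
```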
BERT Embeddings
Note –
• WordPiece embeddings (2016) with a 30,000-token vocabulary
• The first token of every sequence is always a special classification token ([CLS])
• The sentences are differentiated in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned
embedding to every token indicating whether it belongs to sentence A or sentence B.
• BERT is designed to process input sequences of up to 512 tokens (a tokenization sketch follows below)
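A small sketch of how a sentence pair is turned into BERT's input format (WordPiece tokens, [CLS]/[SEP] markers, and segment ids), assuming the transformers library; the two sentences are arbitrary examples:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # WordPiece vocabulary of ~30,000 tokens
enc = tokenizer("How are you?", "I am fine.")                    # sentence A, sentence B

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
print(enc["token_type_ids"])                                     # 0s for sentence A, 1s for sentence B
```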
Why BERT Embeddings?
• BERT is used to extract features, namely word and sentence embedding vectors, from text data (sketched below)
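A minimal feature-extraction sketch with the transformers library; mean pooling over the token vectors is just one simple way to obtain a sentence embedding, chosen here for illustration:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT produces contextual embeddings.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

word_embeddings = outputs.last_hidden_state        # (1, seq_len, 768): one contextual vector per token
sentence_embedding = word_embeddings.mean(dim=1)   # (1, 768): simple mean pooling over the tokens
```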
Pre-trained models
Some of the pre-trained, modified versions of BERT are listed below (a loading sketch follows the hub link below):
• DistilBERT is a distilled form of the BERT model. The size of the BERT model was reduced by 40% via knowledge distillation during the pre-training phase, while retaining 97% of its language-understanding abilities and being 60% faster.
• RoBERTa, or Robustly Optimized BERT Pretraining Approach, is an improvement over BERT developed by Facebook AI. It is trained on a larger corpus of data and has some modifications to the training process that improve its performance.
• ALBERT, A Lite BERT for Self-supervised Learning of Language Representations, uses far fewer parameters.
• DeBERTa, Decoding-enhanced BERT with disentangled attention, improves on the BERT and RoBERTa models.
Pre-trained BERT models
https://huggingface.co/models?other=bert
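Each of the variants above can be loaded from the Hugging Face hub by name. A short sketch, assuming the standard hub identifiers below and that the checkpoints can be downloaded:

```python
from transformers import AutoModel, AutoTokenizer

for name in ["distilbert-base-uncased", "roberta-base", "albert-base-v2", "microsoft/deberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    print(name, f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```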
Thank You
