4. Transformer Architecture
The Transformer architecture is based on:
• An encoder-decoder structure that relies on neither recurrence nor convolutions to generate its output.
• An attention mechanism that learns contextual relations between words (or sub-words) in a text.
• Positional encodings that inject word-order information.
• Parallelization: all tokens are processed at once, so the model can handle much more data at a time than an RNN.
Attention Is All You Need
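The positional encoding mentioned above can be sketched with the sinusoidal formulation from the paper (a minimal NumPy illustration, not a full implementation; the sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model)[None, :]            # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])       # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```

Each position gets a unique pattern of values, so the model can recover token order even though all tokens are processed in parallel.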
5. How does the Transformer model work?
Step 1. Input a natural-language sentence and embed each word.
Step 2. Perform multi-headed attention: multiply the embedded words by the respective weight matrices to obtain the query (Q), key (K), and value (V) matrices.
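Steps 1 and 2 can be sketched as follows (random values stand in for learned embeddings and weights, and the shapes are illustrative, not the paper's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

# Step 1: embedded input tokens (random stand-ins for real embeddings)
X = rng.normal(size=(seq_len, d_model))

# Step 2: per-head weight matrices project X into queries, keys, values
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))

Q = np.einsum('sd,hdk->hsk', X, W_q)
K = np.einsum('sd,hdk->hsk', X, W_k)
V = np.einsum('sd,hdk->hsk', X, W_v)
print(Q.shape)  # (2, 4, 4): (heads, seq_len, d_head)
```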
6. How does the Transformer model work? (contd.)
Step 3. Calculate the attention using the resulting Q, K, and V matrices.
Step 4. Concatenate the per-head matrices to produce an output matrix with the same dimensions as the input matrix.
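Steps 3 and 4 can be sketched as scaled dot-product attention followed by head concatenation and an output projection (random values again stand in for the learned matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, batched over heads."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax over keys
    return weights @ V                                     # (heads, seq, d_head)

rng = np.random.default_rng(0)
n_heads, seq_len, d_head = 2, 4, 4
d_model = n_heads * d_head
Q, K, V = (rng.normal(size=(n_heads, seq_len, d_head)) for _ in range(3))

heads = scaled_dot_product_attention(Q, K, V)     # Step 3
# Step 4: concatenate the heads and project back to the model dimension
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate([heads[h] for h in range(n_heads)], axis=-1) @ W_o
print(out.shape)  # (4, 8) -- same shape as the input embeddings
```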
8. Introduction to BERT
• BERT stands for Bidirectional Encoder Representations from Transformers.
• It was introduced by researchers at Google AI Language in 2018.
• Today, BERT powers Google Search.
• Historically, language models could only read text input sequentially, either left-to-right or right-to-left, but couldn't do both at the same time.
• BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling.
• There are two steps in the BERT framework: pre-training and fine-tuning.
• During pre-training, the model is trained on unlabelled data over different pre-training tasks.
• For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream tasks. Each downstream task has a separate fine-tuned model, even though they are all initialized with the same pre-trained parameters.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
9. How is BERT different?
Unlike the full Transformer, BERT uses only the encoder stack and has no decoder. BERT is pre-trained using two unsupervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
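The masked-LM input corruption can be sketched as a toy version of the scheme from the BERT paper: 15% of tokens are selected, and of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged (the vocabulary and sentence below are made up):

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, rng):
    """Corrupt a token sequence for masked-LM pre-training."""
    labels = [None] * len(tokens)   # original tokens the model must predict
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:     # select 15% of positions
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)  # 10%: replace with random token
            # else 10%: keep the original token unchanged
    return out, labels

rng = random.Random(42)
masked, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], rng)
```

The model is then trained to predict the original token at every position where `labels` is not `None`.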
10. BERT Architecture
There are two types of pre-trained versions of BERT
depending on the scale of the model architecture:
• BERT-Base: 12 layers, 768 hidden nodes, 12 attention heads, 110M parameters (Cased, Uncased)
• BERT-Large: 24 layers, 1024 hidden nodes, 16 attention heads, 340M parameters (Cased, Uncased)
Fun fact: BERT-Base was trained on 4 cloud TPUs
for 4 days and BERT-Large was trained on 16 TPUs
for 4 days!
** The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).
11. BERT Embeddings
Note –
• WordPiece embeddings (2016): a 30,000-token vocabulary
• The first token of every sequence is always a special classification token ([CLS]).
• Sentence pairs are differentiated in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
• BERT is designed to process input sequences up to 512 tokens long.
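The input packing described above can be sketched as follows (the token strings are illustrative; real BERT maps WordPiece tokens to ids from its 30,000-token vocabulary):

```python
def pack_pair(tokens_a, tokens_b, max_len=512):
    """Pack a sentence pair into BERT's input format: [CLS], [SEP] separators,
    and segment ids marking whether each token belongs to sentence A or B."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # sentence A (plus [CLS] and its [SEP]) gets id 0, sentence B gets id 1
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    assert len(tokens) <= max_len   # BERT's 512-token limit
    return tokens, segment_ids

tokens, segs = pack_pair(["my", "dog", "is", "cute"], ["he", "likes", "play"])
print(tokens)
# ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '[SEP]']
print(segs)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```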
12. Why BERT Embeddings?
• BERT is used to extract features,
namely word and sentence
embedding vectors, from text data
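As a minimal sketch of what "word and sentence embedding vectors" means in practice: assume `token_vecs` holds the per-token hidden states produced by a BERT encoder (random stand-ins below, with BERT-Base's 768-dimensional hidden size); a common sentence embedding is the mean-pooled token vectors:

```python
import numpy as np

# Stand-in for per-token hidden states from a BERT encoder:
# 10 tokens, each a 768-dim vector (BERT-Base's hidden size).
rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(10, 768))

# Word embeddings: one 768-dim contextual vector per token.
word_embeddings = token_vecs

# Sentence embedding: mean-pool the token vectors into a single vector.
sentence_embedding = token_vecs.mean(axis=0)
print(sentence_embedding.shape)  # (768,)
```

Other pooling choices exist (e.g. taking the [CLS] vector); mean pooling is just one common option.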
13. Pre-trained models
Some of the pre-trained, modified versions of BERT are listed below:
• DistilBERT is a distilled form of the BERT model. Its size was reduced by 40% via knowledge distillation during the pre-training phase while retaining 97% of BERT's language-understanding abilities and running 60% faster.
• RoBERTa, or Robustly Optimized BERT Pretraining Approach, is an improvement over BERT developed by Facebook AI. It is trained on a larger corpus of data and has some modifications to the training process that improve its performance.
• ALBERT, A Lite BERT for Self-supervised Learning of Language Representations, uses far fewer parameters.
• DeBERTa, Decoding-enhanced BERT with disentangled attention, improves on the BERT and RoBERTa models.