Transformers
State Of The Art Natural Language Processing
Nilesh Verma
Full Stack Data Scientist
(Amlgo Labs India)
Agenda
• Recent Development on NLP
• Short History of NLP
• Why Transformers
• Transformer & Their Architecture
• Attention Mechanism
• Workings
• Types of Transformers
• Explain BERT
• Popular State-of-the-Art Language Models
Who I am
• I am Nilesh Verma
• Full Stack Data Scientist at Amlgo Labs India. Ex- Xceedance, Samsung AI
• 2+ years of industry experience.
• AutoWave (audio classification), DeepImageSearch, and DeepTextSearch are some of the Python libraries (open-source contributions) that I developed and maintain.
• More than 30K+ downloads.
• Secured 1st rank in The Great Indian Hiring Hackathon (Nov-20), based on foretelling the retail price, hosted by MachineHack.
• Placed 3rd in AppScript, a 48-hour hackathon conducted by IEEE APSIT on 6-7 Feb 2021.
• Cleared the NTA-NET and GATE exams on the first attempt.
• B.Sc. and M.Sc. in Computer Science (Gold Medalist)
• Covered by various state-level news outlets for developing real-time COVID-19 detection software based on CT scans.
Recent Development on NLP
Short History of NLP
• 1954 - Bag of Words (BoW)
• 1972 - TF-IDF
• 2001 - Neural language models (RNN, B-RNN, LSTM)
• 2008 - Multi-Task learning
• 2013 - Word embeddings (Word2Vec)
• 2013 - Neural networks for NLP
• 2014 - Sequence to sequence models(Encoder-Decoder)
• 2015 - Attention (For images but found useful for Text too)
• 2017 - Transformer
• 2018 - Pretrained language models (BERT, GPT, T5, etc.)
Transformer ??
Why Transformer
• Improve Contextual Understanding
• Parallelization (Faster Processing/Utilization of GPU/TPU Power)
What is Transformer
What is Transformer
The Transformer in NLP is a novel architecture that aims to solve sequence-to-
sequence tasks while handling long-range dependencies with ease. It relies entirely
on self-attention to compute representations of its input and
output WITHOUT using sequence-aligned RNNs or convolution.
Transformer Architecture
Transformer Architecture Breakdown
• We see an encoding component, a decoding component, and connections between them.
Transformer Architecture Breakdown
• The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s
nothing magical about the number six, one can definitely experiment with other arrangements). The
decoding component is a stack of decoders of the same number.
Transformer Architecture Breakdown
• The encoder’s inputs first flow through a self-attention layer
• The outputs of the self-attention layer are fed to a feed-forward neural network.
• The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence
Input Preprocessing
• Each word is embedded into a vector of size 512. We'll represent
those vectors with these simple boxes. The embedding only happens
in the bottom-most encoder.
Input Preprocessing
• To give the model a sense of the order of the words, we add positional encoding
vectors -- the values of which follow a specific pattern.
• Real example of positional encoding with a toy embedding size of 4
Input Preprocessing
Here “pos” refers to the position of the word in the sequence, “d” is the size of the word/token embedding, and “i” refers to each of the individual dimensions of the embedding (i.e. 0, 1, 2, 3, ...).
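To make the pattern concrete, here is a minimal NumPy sketch of the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the sizes below are toy values for illustration only.

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings from "Attention Is All You Need"
    pos = np.arange(seq_len)[:, None]             # word positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]               # embedding dimensions 0 .. d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])          # odd dimensions use cosine
    return pe

print(positional_encoding(seq_len=3, d_model=4))  # toy embedding size of 4, as on the slide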
Encoder
• The word at each position passes through a self-attention process. Then, they
each pass through a feed-forward neural network -- the exact same network with
each vector flowing through it separately.
Self-Attention
• Attention allowed us to focus on parts of our input sequence while we predicted
our output sequence
“Self attention, sometimes called intra-attention is an attention mechanism
relating different positions of a single sequence in order to compute a
representation of the sequence.”
Self-Attention in Detail
• The first step in calculating self-attention is to create a Query vector, a Key vector, and a Value vector. These vectors are
created by multiplying the embedding by three matrices that we trained during the training process.
• Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. This is an architectural choice that keeps the computation of multi-headed attention (mostly) constant.
Multiplying x1 by
the WQ weight matrix
produces q1, the "query"
vector associated with that
word. We end up creating a
"query", a "key", and a
"value" projection of each
word in the input sentence.
What are the “query”, “key”,
and “value” vectors?
Self-Attention in Detail
• The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first
word in this example, “Thinking”. We need to score each word of the input sentence against this word.
The score is calculated by taking the dot
product of the query vector with the key
vector of the respective word we’re
scoring. So if we’re processing the self-
attention for the word in position #1,
the first score would be the dot product
of q1 and k1. The second score would be
the dot product of q1 and k2.
Self-Attention in Detail
• The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the
paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default),
then pass the result through a SoftMax operation. SoftMax normalizes the scores so they’re all positive and add up to 1.
This SoftMax score determines how
much each word will be expressed at
this position. Clearly the word at this
position will have the highest SoftMax
score, but sometimes it’s useful to
attend to another word that is relevant
to the current word.
Self-Attention in Detail
• The fifth step is to multiply each value vector by
the SoftMax score (in preparation to sum them up).
The intuition here is to keep intact the values of the
word(s) we want to focus on, and drown-out
irrelevant words (by multiplying them by tiny
numbers like 0.001, for example).
• The sixth step is to sum up the weighted value
vectors. This produces the output of the self-
attention layer at this position (for the first word).
Matrix Calculation of Self-Attention
The first step is to calculate the Query, Key, and Value
matrices. We do that by packing our embeddings into a
matrix X, and multiplying it by the weight matrices we’ve
trained (WQ, WK, WV).
Finally, since we’re dealing with matrices, we can
condense steps two through six in one formula to
calculate the outputs of the self-attention layer.
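As a minimal sketch of that condensed formula, softmax(Q K^T / sqrt(d_k)) V, here it is in NumPy with random matrices standing in for the trained WQ, WK, WV:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Steps two through six condensed: softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # step 1: project each embedding
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # steps 2-3: scaled dot-product scores
    weights = softmax(scores, axis=-1)            # step 4: SoftMax over each row
    return weights @ V                            # steps 5-6: weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                     # two words, e.g. "Thinking Machines"
W_q, W_k, W_v = (rng.normal(size=(512, 64)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)     # (2, 64): one output vector per word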
The Beast With Many Heads
• The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves
the performance of the attention layer in two ways:
• It expands the model’s ability to focus on different positions.
• It gives the attention layer multiple “representation subspaces”. With multi-headed attention we have not just one but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
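A rough sketch of how the eight heads fit together (random matrices again stand in for the trained weights; the real model also learns the output projection WO):

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_heads, n_words = 512, 64, 8, 2

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(n_words, d_model))           # encoder input, one vector per word

head_outputs = []
for _ in range(n_heads):                          # eight independently initialized heads
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

W_o = rng.normal(size=(n_heads * d_k, d_model))   # learned output projection WO
Z = np.concatenate(head_outputs, axis=-1) @ W_o   # concatenate the heads, project back to d_model
print(Z.shape)                                    # (2, 512), same shape as the input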
The Beast With Many Heads
The Beast With Many Heads
The Beast With Many Heads
As we encode the word "it", one attention
head is focusing most on "the animal", while
another is focusing on "tired" -- in a sense, the
model's representation of the word "it" bakes
in some of the representation of both "animal"
and "tired".
The Residuals
• One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-
layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-
normalization step.
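In other words, each sub-layer computes LayerNorm(x + Sublayer(x)). A minimal sketch (the learned scale and bias of a real layer-normalization layer are omitted):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer_output)

x = np.random.default_rng(0).normal(size=(2, 512))   # input to a sub-layer (two words)
z = 0.5 * x                                          # stand-in for a self-attention / FFNN output
print(add_and_norm(x, z).shape)                      # (2, 512)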
Layer Normalization
If we’re to visualize the vectors and the layer-norm
operation associated with self attention, it would look like
this:
Final Linear and SoftMax Layer
1. The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer, which is followed by a SoftMax layer.
2. The Linear layer is a simple fully connected
neural network that projects the vector
produced by the stack of decoders, into a
much, much larger vector called a logits vector.
3. Let’s assume that our model knows 10,000
unique English words (our model’s “output
vocabulary”) that it’s learned from its training
dataset. This would make the logits vector
10,000 cells wide – each cell corresponding to
the score of a unique word. That is how we
interpret the output of the model followed by
the Linear layer.
4. The SoftMax layer then turns those scores
into probabilities (all positive, all add up to
1.0). The cell with the highest probability is
chosen, and the word associated with it is
produced as the output for this time step.
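A small sketch of that projection with a 10,000-word vocabulary, using a random matrix in place of the trained Linear layer:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 512

decoder_output = rng.normal(size=(d_model,))        # vector produced by the decoder stack
W_linear = rng.normal(size=(d_model, vocab_size))   # final Linear layer (trained in practice)

logits = decoder_output @ W_linear                  # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                         # SoftMax: all positive, sums to 1.0
print(probs.argmax())                               # index of the most probable output word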
Combined All
• This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked
encoders and decoders, it would look something like this:
Working
Working
Transformers are everywhere!
• Transformer models are used to solve all kinds of NLP tasks.
1. Feature Extraction (Get The Vector Representation Of A Text)
2. Fill-Mask (Masked Word Prediction)
3. NER (Named Entity Recognition)
4. Question-Answering
5. Sentiment-Analysis
6. Summarization
7. Text-Generation
8. Translation
9. Zero-Shot-Classification
• The companies and organizations using Transformer models
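Most of the tasks listed above can be tried in a few lines with the Hugging Face transformers library's pipeline API. A quick sketch (the default checkpoints are downloaded on first use and may change between library versions):

# pip install transformers
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP tasks much easier!"))

ner = pipeline("ner")
print(ner("Nilesh works at Amlgo Labs India."))

zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This talk introduces the Transformer architecture.",
                candidate_labels=["education", "politics", "sports"]))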
A bit of Transformer history
Here are some reference points in the (short) history of Transformer models:
A bit of Transformer history
The Transformer architecture was introduced in June 2017. The focus of the original research was
on translation tasks. This was followed by the introduction of several influential models, including:
• June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks
and obtained state-of-the-art results
• October 2018: BERT, another large pretrained model, this one designed to produce better
summaries of sentences (more on this in the next chapter!)
• February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly
released due to ethical concerns
• October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory,
and still retains 97% of BERT’s performance
• October 2019: BART and T5, two large pretrained models using the same architecture as the
original Transformer model (the first to do so)
• May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of
tasks without the need for fine-tuning (called zero-shot learning)
Types of Transformers
This list is far from comprehensive, and is just meant to highlight a few
of the different kinds of Transformer models. Broadly, they can be
grouped into three categories:
• Encoder (examples: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa) - sentence classification, named entity recognition, extractive question answering
• Decoder (examples: CTRL, GPT, GPT-2, Transformer XL) - text generation
• Encoder-decoder (examples: BART, T5, Marian, mBART) - summarization, translation, generative question answering
Transformers are language models
• All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as
language models. This means they have been trained on large amounts of raw text in a self-
supervised fashion. Self-supervised learning is a type of training in which the objective is
automatically computed from the inputs of the model. That means that humans are not needed
to label the data!
• This type of model develops a statistical understanding of the language it has been trained on,
but it’s not very useful for specific practical tasks. Because of this, the general pretrained model
then goes through a process called transfer learning. During this process, the model is fine-tuned
in a supervised way — that is, using human-annotated labels — on a given task.
BERT (Bidirectional Encoder Representations
from Transformers)
• BERT is a Natural Language Processing Model proposed by
researchers at Google Research in 2018.
• Individual NLP tasks have traditionally been solved by individual
models created for each specific task. That is, until— BERT!
• BERT revolutionized the NLP space by solving for 11+ of the most
common NLP tasks (and better than previous models) making it the
jack of all NLP trades.
Fun Fact 😁: You interact with NLP (and likely BERT) almost every single day!
Example of BERT
• BERT helps Google better surface (English) results for nearly all searches since
November of 2020.
• Here’s an example of how BERT helps Google better understand specific searches
like:
BERT’s Architecture
• BERT-base: 12 Transformer layers, hidden size 768, 12 attention heads, 110M parameters; trained on 4 TPUs for 4 days
• BERT-large: 24 Transformer layers, hidden size 1024, 16 attention heads, 340M parameters; trained on 16 TPUs for 4 days
How does BERT Work?
1. Large amounts of training data:
• A massive dataset of 3.3 Billion words has contributed to BERT’s continued success.
• BERT was specifically trained on Wikipedia (~2.5B words) and Google’s Books-Corpus (~800M words). These
large informational datasets contributed to BERT’s deep knowledge not only of the English language but also
of our world! 🚀
How does BERT Work?
2. Masked Language Model:
• MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing
BERT to bidirectionally use the words on either side of the covered word to predict the masked word. This
had never been done before!
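A quick way to see this bidirectional behaviour is the fill-mask pipeline with a pretrained BERT checkpoint (a sketch; requires the transformers library):

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT has to use the words on BOTH sides of [MASK] to fill in the blank
for prediction in unmasker("The man went to the [MASK] to buy a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))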
How does BERT Work?
3. Next Sentence Prediction:
• NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by
predicting if a given sentence follows the previous sentence or not.
Training Inputs
1. We give inputs to BERT using the above structure. The input consists of a pair of sentences, called sequences, and two special tokens: [CLS] and [SEP].
2. BERT first uses WordPiece tokenization to convert the sequence into tokens, then adds the [CLS] token at the start of the sequence and a [SEP] token at the end of each of the two sentences.
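A short sketch of this input format with the transformers tokenizer for bert-base-uncased (the exact tokens depend on the checkpoint's vocabulary); the token_type_ids it returns correspond to the segment embeddings discussed below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens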
Training Inputs
Token Embeddings: Token embeddings are obtained by indexing a matrix of size 30,000 x 768 (H). Here, 30,000 is the vocabulary size after WordPiece tokenization. The weights of this matrix are learned during training.
Training Inputs
Segment Embeddings: For tasks such as question answering, we need to specify which segment each token comes from. The segment embedding is an all-0 vector of length H if the token belongs to sentence 1, and an all-1 vector if it belongs to sentence 2.
Training Output
We define two vectors S and E (which are learned during fine-tuning), both of shape (1x768). We then take the dot product of these vectors with the second sentence's output vectors from BERT, giving us a score for each token. We then apply SoftMax over these scores to get probabilities. The training objective is the sum of the log-likelihoods of the correct start and end positions.
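A toy NumPy sketch of that scoring step, with random vectors standing in for BERT's token outputs and the fine-tuned S and E:

import numpy as np

rng = np.random.default_rng(0)
H, n_tokens = 768, 20                       # BERT hidden size, number of context tokens

T = rng.normal(size=(n_tokens, H))          # BERT output vectors for the context tokens
S = rng.normal(size=(H,))                   # start vector (learned during fine-tuning)
E = rng.normal(size=(H,))                   # end vector (learned during fine-tuning)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

start_probs = softmax(T @ S)                # probability of each token being the answer start
end_probs = softmax(T @ E)                  # probability of each token being the answer end
print(start_probs.argmax(), end_probs.argmax())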
BERT Training
Pre-Training
“What is language? What is context?”
Fine-Tuning
“How to use language for a specific task?”
Fine-Tuning
GLUE Benchmark
• GLUE (General Language Understanding Evaluation) benchmark is a group of resources for
training, measuring, and analyzing language models comparatively to one another. These
resources consist of nine “difficult” tasks designed to test an NLP model’s understanding.
GPT (Generative Pre-trained Transformer)
• The OpenAI GPT model was proposed in "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer pre-trained using language modelling on a large corpus with long-range dependencies, the Toronto Book Corpus.
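A small sketch of causal, left-to-right generation with the openai-gpt checkpoint exposed by the transformers library (any causal checkpoint such as gpt2 works the same way):

from transformers import pipeline

generator = pipeline("text-generation", model="openai-gpt")
# The model only conditions on tokens to the LEFT of the current position
print(generator("the transformer architecture was introduced in", max_new_tokens=20)[0]["generated_text"])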
T5(Text-To-Text Transfer Transformer)
• T5, or Text-to-Text Transfer Transformer, is a Transformer based architecture that uses a
text-to-text approach. Every task – including translation, question answering, and
classification – is cast as feeding the model text as input and training it to generate some
target text. T5 is pre-trained on web text extracted from Common Crawl.
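A brief sketch of that text-to-text interface using the t5-small checkpoint from the transformers library (requires sentencepiece; the task prefix is just part of the input text):

from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
# Translation, summarization, classification, etc. are all "text in, text out"
print(t5("translate English to German: The house is wonderful."))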
References
• https://jalammar.github.io/illustrated-transformer/
• https://www.analyticsvidhya.com/blog/2019/06/understanding-
transformers-nlp-state-of-the-art-models/
• https://towardsdatascience.com/transformers-89034557de14
• https://www.youtube.com/watch?v=TQQlZhbC5ps&t=60s
• https://arxiv.org/abs/1706.03762