Transformers 101
An Intuitive Approach for Learners
Venkata Reddy Konasani
Contents
•Background
•Word2vec Recap
•Weighted embeddings
•Attention mechanism
•Single Head Attention
•Multi-head Attention
•Positional encodings
•Seq to Seq Model
•Encoder and Decoder Model
•Transformers
Background
•Transformers are used for processing sequential data.
• Language Translation – Very frequently used.
• Text Generation: generating human-like text based on prompts (e.g., ChatGPT)
• Chatbots and Conversational Agents
• Genome sequence analysis
• Sound wave analysis
• Time series data analysis
•Transformers are an extension of the encoder-decoder mechanism used in LSTM-based
sequence-to-sequence models for language translation.
Word Embeddings – Revision
Word Embeddings – Revision
•None of the ML and DL models can work directly with text data.
•Model parameter calculations in ML and DL involve mathematical
computations
•While working on NLP problems we need to convert our text data into
numbers.
•One hot encoding is very rudimentary, doesn’t really work for solving
large and complex NLP problems.
The Problem Statement
•We want to perform some analysis on text data. How do we convert
text into numerical data?
• By keeping all the meaningful relations intact
• By losing as little information as possible in the process of conversion
•The idea is to convert text data into numerical data
• With minimal loss of information
• Without introducing any additional error
• By keeping all the information intact
• By preserving all the relationships in the text data
Word2vec introduction
•word2vec computes vector representation for words
•word2vec tries to convert words into numerical vectors so that similar
words share a similar vector representation.
•word2vec is the name of the concept and it is not a single algorithm
•word2vec is not a deep learning technique like RNN or CNN (there is no
unsupervised pre-training of layers)
Word2vec - Two major steps
•Word2vec comes from the idea of preserving local context
•It has two major steps
• Create training samples
• Use these samples to train the neural network model
•Word2Vec tries to create the training samples by parsing through the
data with a fixed window size
•After creating training samples, we will use a single layer neural
network to train the model
Word2vec – Step 1: Create Training Samples
Input → Output
king → strong
king → man
strong → king
strong → man
man → king
man → strong
queen → wise
queen → women
wise → queen
wise → women
women → queen
women → wise
• We are considering a window size of 2
king strong man
queen wise women
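A rough Python sketch of this pair-generation step (our own illustration; the function name make_pairs and its loop are assumptions, not code from the slides):

# Build (input, output) training pairs from a sentence with a fixed window size
def make_pairs(sentence, window=2):
    words = sentence.split()
    pairs = []
    for i in range(len(words)):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:                      # skip the centre word itself
                pairs.append((words[i], words[j]))
    return pairs

print(make_pairs("king strong man"))
# [('king', 'strong'), ('king', 'man'), ('strong', 'king'), ('strong', 'man'), ('man', 'king'), ('man', 'strong')]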
Word2vec – Step 2: Build a Neural Network Model
Input words: king, king, strong, strong, man, man, queen, queen, wise, wise, women, women
Output (context) words: strong, man, king, man, king, strong, wise, women, queen, women, queen, wise
A single hidden layer of K units sits between the input words and the output contexts; its weights are the K-dimensional embeddings.
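A minimal Keras sketch of such a single-hidden-layer network, assuming a toy 6-word vocabulary and K = 3 embedding dimensions (layer sizes and training settings are arbitrary, illustrative choices):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

vocab = ["king", "strong", "man", "queen", "wise", "women"]
word_to_id = {w: i for i, w in enumerate(vocab)}
pairs = [("king", "strong"), ("king", "man"), ("strong", "king"), ("strong", "man"),
         ("man", "king"), ("man", "strong"), ("queen", "wise"), ("queen", "women"),
         ("wise", "queen"), ("wise", "women"), ("women", "queen"), ("women", "wise")]

X = np.eye(len(vocab))[[word_to_id[w] for w, c in pairs]]   # one-hot input words
y = np.eye(len(vocab))[[word_to_id[c] for w, c in pairs]]   # one-hot context words

model = Sequential()
model.add(Dense(3, input_dim=len(vocab), use_bias=False))   # hidden layer: K = 3
model.add(Dense(len(vocab), activation="softmax"))          # predict the context word
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=500, verbose=0)

embeddings = model.layers[0].get_weights()[0]   # one K-dimensional vector per word
print(dict(zip(vocab, np.round(embeddings, 2))))

The rows of the hidden-layer weight matrix play the role of the word vectors shown on the next slide.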
Result of the word2vec
After training, the embeddings for the input words come from the hidden-layer weights of the network ("it will come from here"), one vector per word:
king 3.248315 -0.292609 2.028029
man 1.032173 3.037509 -1.810387
strong 3.659783 0.865091 -1.710116
queen -3.151272 -2.009174 0.550185
woman -2.420959 0.980081 -0.391963
Word2Vec - Example
1) I can bear the pain. I am used to playing all day. That
cricket bat is very light. I opened for my team.
2) It was very dark, there was no light. When she opened
the can, she found a bat and a cricket. Suddenly, a bear
appeared behind her.
Look at the above two sentences and imagine we are working with word2vec embeddings.
Word2Vec - Example
1) I can bear the pain. I am used to playing all day. That
cricket bat is very light. I opened for my team.
2) It was very dark, there was no light. When she opened
the can, she found a bat and a cricket. Suddenly, a bear
appeared behind her.
I -2.0280 0.2926 1.1242 3.2483
can -1.8104 3.0375 -0.8583 1.0322
bear -1.7101 0.8651 -0.9374 3.6598
the 0.5502 -2.0092 0.0753 -3.1513
pain -0.3920 0.9801 -0.3494 -2.4210
Problem with word2Vec
2) It was very dark,
there was no light. When
she opened the can, she
found a bat and a
cricket. Suddenly, a bear
appeared behind her.
1) I can bear the pain. I
am used to playing all
day. That cricket bat is
very light. I opened for
my team.
•Is the word “can” in the first sentence the same as the “can” in the second sentence?
•When we use word2vec, will we get the same vector or two different vectors?
Problem with word2Vec
2) It was very dark,
there was no light. When
she opened the can, she
found a bat and a
cricket. Suddenly, a bear
appeared behind her.
1) I can bear the pain. I
am used to playing all
day. That cricket bat is
very light. I opened for
my team.
•Similarly, is the word “light” in the first sentence the same as the “light” in the second sentence?
•When we use word2vec, will we get the same vector or two different vectors?
Problem with word2Vec
2) It was very dark,
there was no light. When
she opened the can, she
found a bat and a
cricket. Suddenly, a bear
appeared behind her.
1) I can bear the pain. I
am used to playing all
day. That cricket bat is
very light. I opened for
my team.
•can – to be able to vs. a can or tin
•light – light in weight vs. light as in brightness
•bear – to endure vs. a bear, the animal
•cricket bat – sports equipment vs. a cricket and a bat, the creatures
Problem with word2Vec
•can – to be able to vs. a can or tin
•light – light in weight vs. light as in brightness
•bear – to endure vs. a bear, the animal
•cricket bat – sports equipment vs. a cricket and a bat, the creatures
•As humans, we can understand the contextual relevance and the
difference between those words. Word2vec cannot distinguish them.
•Word2vec has no mechanism to differentiate these words based on their context.
•We need to add some more contextual relevance on top of the existing
word2vec embeddings.
Do you agree?
•For building really intelligent models, we need to weigh in some more
context.
•The context preserved in the Word2Vec is not sufficient for solving the
advanced NLP applications like
• Intelligent chatbots
• Search Engines
• Language Translation
• Voice Assistants
• Email filtering
•We need to preserve some more context over and above the word2Vec
embeddings
An example
•Consider these word embeddings.
•For the sake of simplicity, we
have considered rounded-off numbers.
a -0.2 -0.2 -1.1 3.2
bear -1.7 0.9 -0.9 3.7
appeared -0.2 -0.5 0.3 2.6
behind -1.3 -0.8 0.1 2.5
her -0.3 1.0 -2.2 -1.5
I -2.0 0.3 1.1 3.2
can -1.8 3.0 -0.9 1.0
bear -1.7 0.9 -0.9 3.7
the 0.6 -2.0 0.1 -3.2
pain -0.4 1.0 -0.3 -2.4
Word2Vec + more Contextual weights
•How to get additional context based
weights?
•Let us look at the word embeddings
for the words in the two sentences.
•The same embeddings can be seen for
bear in the two different sentences.
•The goal is to include context based
weights to modify these embeddings.
•Sending those adjusted embeddings to
the final NLP model will indeed give us
better accuracy.
a -0.2 -0.2 -1.1 3.2
bear -1.7 0.9 -0.9 3.7
appeared -0.2 -0.5 0.3 2.6
behind -1.3 -0.8 0.1 2.5
her -0.3 1.0 -2.2 -1.5
I -2.0 0.3 1.1 3.2
can -1.8 3.0 -0.9 1.0
bear -1.7 0.9 -0.9 3.7
the 0.6 -2.0 0.1 -3.2
pain -0.4 1.0 -0.3 -2.4
Lets focus on “bear”
•How to make “bear” in the first
sentence look different from “bear”
in the second sentence ?
•We are going to add extra weightage
to these embeddings.
a -0.2 -0.2 -1.1 3.2
bear -1.7 0.9 -0.9 3.7
appeared -0.2 -0.5 0.3 2.6
behind -1.3 -0.8 0.1 2.5
her -0.3 1.0 -2.2 -1.5
I -2.0 0.3 1.1 3.2
can -1.8 3.0 -0.9 1.0
bear -1.7 0.9 -0.9 3.7
the 0.6 -2.0 0.1 -3.2
pain -0.4 1.0 -0.3 -2.4
Weighted embeddings.
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Original embedding for
“bear” in sentence -1
This is the new weighted embedding for
“bear” in sentence -1
Weighted embeddings.
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
extra weights for “bear” with respect to “I”,
“can”, “bear”, ”the” and “pain”
Weighted embeddings.
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Weight of “bear” with respect to “I” = dot product of their embeddings:
(-1.7)*(-2) + (0.9)*(0.3) + (-0.9)*(1.1) + (3.7)*(3.2) = 14.52
Weighted embeddings.
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
weights for “bear” with respect to “I”, “can”,
“bear”, ”the” and “pain”
Weighted embeddings.
-29.0
4.4
16.0
46.5
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Multiply with
weight(14.52)
After multiplication
Weighted embeddings.
-29.0 -18.5 -30.9 -8.9 2.8
4.4 30.8 16.4 29.5 -7.0
16.0 -9.2 -16.4 -1.5 2.1
46.5 10.3 67.3 47.2 16.9
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Multiply corresponding weights for each
vector
Weighted embeddings.
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Add across all the
embedding
dimension
Weighted embeddings.
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
This is the new weighted embedding for
“bear” in sentence -1
Original embedding for
“bear” in sentence -1
Weighted embeddings.
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
This is the new weighted embedding for
“bear” in sentence -1
Original embedding for
“bear” in sentence -1
Normalizing the weights so that their sum is 1 helps keep the resulting vectors small.
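A NumPy sketch of the calculation above for sentence-1 (variable names are ours; unlike the raw multiplication shown on the previous slides, this version normalizes the scores with a softmax, as suggested in the note above):

import numpy as np

# columns: I, can, bear, the, pain  (rounded embeddings, d = 4)
E = np.array([[-2.0, -1.8, -1.7,  0.6, -0.4],
              [ 0.3,  3.0,  0.9, -2.0,  1.0],
              [ 1.1, -0.9, -0.9,  0.1, -0.3],
              [ 3.2,  1.0,  3.7, -3.2, -2.4]])

bear = E[:, 2]                                     # embedding of "bear"
scores = bear @ E                                  # 14.52, 10.27, 18.20, -14.75, -7.03
weights = np.exp(scores) / np.exp(scores).sum()    # normalize so the weights sum to 1
new_bear = E @ weights                             # weighted sum = new embedding for "bear"
print(np.round(scores, 2), np.round(new_bear, 2))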
Weighted embeddings. –Sentence-1
-84.5
74.0
-9.0
188.1
bear
-1.7
0.9
-0.9
3.7
This is the new weighted embedding for “bear”
in sentence -1
Original embedding for
“bear” in sentence -1
Weighted embeddings. –Sentence-2
a bear appeared behind her
-0.2 -1.7 -0.2 -1.3 -0.3
-0.2 0.9 -0.5 -0.8 1
-1.1 -0.9 0.3 0.1 -2.2
3.2 3.7 2.6 2.5 -1.5
Original embedding for
“bear” in sentence -2
• This will be the new weighted embedding
for “bear” in sentence -2
• Will it be the same as in sentence-1 or different? What do you think?
-2.6 -30.9 -1.8 -13.8 0.6 -48.6
-2.6 16.4 -4.6 -8.5 -2.2 -1.5
-14.3 -16.4 2.8 1.1 4.8 -22.1
41.6 67.3 24.0 26.6 3.2 162.8
a -0.2 -0.2 -1.1 3.2
12.99 18.20 9.24 10.65 -2.16 bear -1.7 0.9 -0.9 3.7
appeared -0.2 -0.5 0.3 2.6
behind -1.3 -0.8 0.1 2.5
her -0.3 1 -2.2 -1.5
a bear appeared behind her
-0.2 -1.7 -0.2 -1.3 -0.3
-0.2 0.9 -0.5 -0.8 1
-1.1 -0.9 0.3 0.1 -2.2
3.2 3.7 2.6 2.5 -1.5
Original embedding for
“bear” in sentence -2
Weighted embeddings. –Sentence-2
Weighted embeddings. –Sentence-1 vs 2
Sentence-1 Sentence-2
Word
embeddings
Word
embeddings
bear bear
-1.7
0.9
-0.9
3.7
-1.7
0.9
-0.9
3.7
Weighted embeddings. –Sentence-1 vs 2
Sentence-1 Sentence-2
Word
embeddings
Weighted
embeddings
Word
embeddings
Weighted
embeddings
bear bear(new) bear bear(new)
-1.7
0.9
-0.9
3.7
-84.5
74.0
-9.0
188.1
-1.7
0.9
-0.9
3.7
-48.6
-1.5
-22.1
162.8
Weighted embeddings. –Sentence-1 vs 2
•We now have two new embeddings for the same word. While training the model, we
will use different embeddings in different sentences.
•These new weighted embeddings give greater importance to the specific context,
and they will significantly improve the model accuracy.
Sentence-1 Sentence-2
Word
embeddings
Weighted
embeddings
Word
embeddings
Weighted
embeddings
bear bear(new) bear bear(new)
-1.7
0.9
-0.9
3.7
-84.5
74.0
-9.0
188.1
-1.7
0.9
-0.9
3.7
-48.6
-1.5
-22.1
162.8
Weighted embeddings procedure
•Have you understood the formula for calculating the new weighted
embeddings?
•The procedure that we followed until now is known as the attention
mechanism.
•And that's it… that's all you need to improve the accuracy of the
model: “Attention is all you need”.
•Let's go through attention once again.
Attention = Weighted embeddings
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
I -2 0.3 1.1 3.2
can -1.8 3 -0.9 1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
the 0.6 -2 0.1 -3.2
pain -0.4 1 -0.3 -2.4
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Attention Terminology
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
These vectors in this operation are called Queries (Q)
Attention Terminology
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
These vectors in this operation are called keys (K)
These vectors in this operation are called Queries (Q)
Attention Terminology
-29.0 -18.5 -30.9 -8.9 2.8
4.4 30.8 16.4 29.5 -7.0
16.0 -9.2 -16.4 -1.5 2.1
46.5 10.3 67.3 47.2 16.9
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
These vectors in this operation are called keys (K)
These vectors in this operation are called Queries (Q)
We multiply weights again with the
embeddings. These vectors are
known as values (V)
Attention Terminology
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
These vectors in this operation are called keys (K)
These vectors in this operation are called Queries (Q)
We multiply weights again with the
embeddings. These vectors are
known as values (V)
These are called scores. The scores are normalized so that their sum is 1.
Attention Terminology
V1, V2,V3,….Vn
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Attention Terminology
V1, V2,V3,….Vn
Q
bear -1.7 0.9 -0.9 3.7
Attention Terminology
V1, V2,V3,….Vn
Q K
bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Attention Terminology
V1, V2,V3,….Vn
Q K
matmul
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Matrix multiplication
Attention Terminology
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
-29.0 -18.5 -30.9 -8.9 2.8
4.4 30.8 16.4 29.5 -7.0
16.0 -9.2 -16.4 -1.5 2.1
46.5 10.3 67.3 47.2 16.9
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Attention Terminology
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
-29.0 -18.5 -30.9 -8.9 2.8 -84.5
4.4 30.8 16.4 29.5 -7.0 74.0
16.0 -9.2 -16.4 -1.5 2.1 -9.0
46.5 10.3 67.3 47.2 16.9 188.1
14.52 10.27 18.20 -14.75 -7.03 bear -1.7 0.9 -0.9 3.7
I can bear the pain
-2 -1.8 -1.7 0.6 -0.4
0.3 3 0.9 -2 1
1.1 -0.9 -0.9 0.1 -0.3
3.2 1 3.7 -3.2 -2.4
Note: only the calculation of one vector Yk is shown here. We do the same for all ‘n’ vectors.
Attention - Block
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
Attention block, more specifically a self-attention block
Do you Agree ?
•Attention
• Is nothing but weighted embeddings only
• Smart embeddings
• Advanced version of word2vec
• More contextualized embeddings
• With a different matrix calculation formula.
Trainable Parameters ?
•Are there any trainable parameters in the above process?
•Are we training anything? Or are we simply doing the multiplications
without any trainable weight parameters?
•Yes, until now there are NO trainable parameters.
•We need to bring trainable weights into this procedure to learn based
on the data.
•Each of the Q, K and V vectors needs to be multiplied by a weight matrix:
WQ, WK, WV
Add Trainable Parameters
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
WQ WK WV
Multiplying by a weight matrix is the same as adding a linear layer, in NN terminology.
Add Trainable Parameters
While training the model, we can treat this attention block as a linear
embedding layer and train its parameters based on the error signals.
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
WQ WK WV
Attention formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V   (the softmax is the "normalize" step; the paper scales QKᵀ by 1/√dₖ)
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
WQ WK WV
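A compact NumPy sketch of this block (toy dimensions; the random weight matrices Wq, Wk, Wv stand in for the trainable parameters, and the division by sqrt(dk) inside the softmax follows the paper's scaled dot-product formulation):

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_words, d) embeddings; returns one new vector Y per word
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # matmul + scaling
    A = softmax(scores, axis=-1)              # normalize: each row sums to 1
    return A @ V                              # matmul with the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                   # 5 words ("I can bear the pain"), d = 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)             # Y1 .. Y5
print(Y.shape)                                # (5, 4)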
Attention – Actual Figure from the paper
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
WQ WK WV
V1, V2,V3,….Vn
stacked attention blocks/layers
V1, V2,V3,….Vn
Y1, Y2,Y3,….Yn
We can also stack multiple self-attention blocks in the model. This will further improve the accuracy.
Self Attention visualization
• Many thanks to - Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved
from https://jalammar.github.io/illustrated-transformer/ Simulation and Colab code
Several parallel contexts
• I can bear the pain. I am used to playing all day. That cricket bat is very
light. I opened for my team.
• It was very dark, there was no light. When she opened the can, she found a
bat and a cricket. Suddenly, a bear appeared behind her.
• There are several parallel contexts present here
• Sports - cricket game
• Bat in sports - sports equipment
• Present and past events
• Bat/ cricket – insects
• Bear – Animals
• playing, opening, appearing – Actions
• These are the contexts from just two documents. For the large and diverse datasets there
will be multiple contexts. We need many more parallel attention stacks.
Single Head Attention.
V1, V2,V3,….Vn
Q K V
matmul
normalize matmul
Y1, Y2,Y3,….Yn
WQ WK WV
• Until now, we have discussed a
single attention stack either with
single block or multiple attention
blocks.
• At the end we will have a single
set of vectors [Y1, Y2,Y3,….Yn]
• This is known as single head
attention.
• For a large and diverse dataset, we
may have several parallel contexts.
• We may have to include parallel
attention stacks to learn parallel
contexts
Attention- with two heads
Multi-head Attention.
Multi-head Attention.
Concatenate
Linear Layer
Adding a linear layer is simply multiplying the concatenated vector by the output weight matrix WO.
MultiHeadAttention(Q, K, V) = concat(H1, H2, ..., Hh) WO
where Hi = Attention(Q WiQ, K WiK, V WiV)
Concatenate
Linear Layer
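A sketch of multi-head attention built from the single-head calculation (the two heads, all shapes and the random weights are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, Wq, Wk, Wv):                       # same single-head attention as before
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, heads, Wo):
    H = [one_head(X, *w) for w in heads]           # each head can focus on a different context
    return np.concatenate(H, axis=-1) @ Wo         # concatenate, then the linear layer Wo

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))                                                    # 5 words, d = 4
heads = [tuple(rng.normal(size=(4, 4)) for _ in range(3)) for _ in range(2)]   # two heads
Wo = rng.normal(size=(2 * 4, 4))                                               # project back to d = 4
print(multi_head(X, heads, Wo).shape)                                          # (5, 4)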
Multi-head Attention - Actual figure from
the paper
Multi-head Attention visualization - Example
Two attention heads focusing on two
different contexts w.r.t a word
Multi-head Attention visualization - Example
Three attention heads focusing on three
different contexts w.r.t the same word
Self attention - Limitations
•Until now we discussed self-attention with single head and multiple
heads.
•In self attention, a word looks at the surrounding words in the same
sentence to calculate the attention.
•Self attention is useful for many-to-one kind of models. For example
• Sentiment Analysis
• Document Classification
•We need to have a slightly complex architecture for many-to-many or
sequence to sequence models. For example
• Machine Translation
• Chatbot
Sequence to Sequence Models.
•Before discussing attention, we used LSTM or RNN for sequence to
sequence models.
•We used a Encoder LSTM for input sequence and a Decoder LSTM for
the output sequence.
•Let's revisit the LSTM/RNN-based seq-to-seq model using simple word
embeddings.
Encoder and Decoder – Revision
Encoder and Decoder – Revision
•Language translation is an example of sequence to sequence model
•This model has two components
• Encoder LSTM for understanding the structure of input language sequence
• Decoder LSTM for translating the text into the output language
x1 ….. xt
y1 … yt
Decoder LSTM
Encoder LSTM
Thought Vector
Encoder and Decoder – Revision
Encoder and Decoder – Revision
•Encoder:
• Takes the input sequence and creates a new vector based on the whole input
sequence. The resultant vector from encoder is also known as thought vector or
context vector
Encoder and Decoder – Revision
•Thought Vector / Context Vector:
• The overall crux of the input sequence is imparted into one vector. This will be the
only input to the decoder.
Encoder and Decoder – Revision
•Decoder:
• Decoder takes thought vector as input and generates a logical translated sequence.
Encoder and Decoder – Revision
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, Dense

model = Sequential()
model.add(Embedding(input_dim=30000, output_dim=256, input_length=20))  # Word2Vec-style embeddings
model.add(LSTM(128))                           # Encoder
model.add(RepeatVector(20))                    # Thought vector
model.add(LSTM(128, return_sequences=True))    # Decoder
model.add(Dense(30000, activation='softmax'))  # One probability per output-vocabulary word
LAB: LSTM – Seq to Seq Model
•From English to Spanish
Code: LSTM – Seq to Seq Model
•From English to Spanish
Encoder and Decoder issues
•The above procedure uses standard embeddings.
•Since we generally use LSTM or GRU, we can achieve decent accuracy.
•Less accuracy - For large and diverse datasets, we may not get very
high accuracy, due to the limitations of standard embeddings.
•Slow - For large and diverse datasets, we need very complex models
with very deep architecture. Parallel processing is difficult in
sequential models.
Attention + Encoder and Decoder
•We will now include multi-head attention inside the Encoder and
Decoder
•High Accuracy - For large and diverse datasets, we can use multi-head
attention mechanism to capture more relevant contexts, this will
indeed increase the accuracy.
•Speed - Since multi-head attention is nothing but parallel attention
channels, we can easily use GPUs and perform parallel processing.
Attention + Encoder and Decoder
•To solve sequence-to-sequence problems, we need to add three different
types of attention.
•For example, suppose we are building a machine translation model from English
to Spanish.
• We need an attention mechanism in the input language(English) to understand the
contexts in English.
• We need an attention mechanism in the output language(Spanish) to understand the
word relations in Spanish
• The third attention is the most important one, the word relations and contextual
relevance from English to Spanish.
English to Spanish Translation Model
x1 ….. xt
y1 … yt
Self-Attention in English
Self-Attention in Spanish
English to Spanish Attention
English to Spanish Translation Model
x1 ….. xt
y1 … yt
Self Attention in Encoder
Self Attention in Decoder
Encoder to Decoder Attention
Attention + Encoder and Decoder
y1 … yt
Self Attention in Decoder
Encoder to Decoder Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
Attention + Encoder and Decoder
y1 … yt
Encoder to Decoder Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
What is Masked Multi-Head Attention?
English - Encoder: “A nearly perfect photo”
• Inside encoder self-attention, while calculating the attention for the word “nearly”, do we consider all the rest of the words?
• We get all the input in one shot.
• At both training and test time we have access to all the words.
• Encoder self-attention doesn't need any special changes.
Spanish - Decoder: “Una foto casi perfecta”
• Inside decoder self-attention, while calculating the attention for the word “casi”, can we consider all the rest of the words?
• We can only use the words generated so far: while predicting the word at time t, we can use only the words up to time t-1 in the decoder.
• At test time, we will NOT have access to future words.
• The decoder attention calculation needs to be masked from the future words.
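A small NumPy sketch of this masking idea: positions to the right of the current word get a very large negative score before the softmax, so their attention weights become (almost) zero. The 4-word example and the random scores are purely illustrative:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.default_rng(2).normal(size=(4, 4))   # raw scores for "Una foto casi perfecta"
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)        # True above the diagonal = future words
masked = np.where(mask, -1e9, scores)                   # block the future positions
A = softmax(masked, axis=-1)
print(np.round(A, 2))   # row "Una" attends only to "Una"; row "foto" to "Una" and "foto"; etc.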
Masked Attention scores calculation
Una foto casi perfecta
0.07 0.00 0.00 0.00 0.07
0.14 0.00 0.00 0.00 0.14
0.03 0.00 0.00 0.00 0.03
0.26 0.00 0.00 0.00 0.26
-0.34 Una -0.2 -0.3 0.3 0.7
foto -0.4 -0.3 0.6 -0.2
casi -0.1 1.7 -0.4 -0.5
perfecta -0.8 -0.1 -0.9 1.4
Una foto casi perfecta
-0.2 -0.3 0.3 0.7
-0.4 -0.3 0.6 -0.2
-0.1 1.7 -0.4 -0.5
-0.8 -0.1 -0.9 1.4
While calculating the attention scores for “Una”, we should not consider the future words.
Masked Attention scores calculation
Una foto casi perfecta
-0.06 -0.40 0.00 0.00 -0.47
-0.12 -0.42 0.00 0.00 -0.55
-0.02 2.21 0.00 0.00 2.18
-0.22 -0.10 0.00 0.00 -0.32
-0.34 Una -0.2 -0.3 0.3 0.7
0.30 1.28 foto -0.4 -0.3 0.6 -0.2
casi -0.1 1.7 -0.4 -0.5
perfecta -0.8 -0.1 -0.9 1.4
Una foto casi perfecta
-0.2 -0.3 0.3 0.7
-0.4 -0.3 0.6 -0.2
-0.1 1.7 -0.4 -0.5
-0.8 -0.1 -0.9 1.4
While calculating the attention scores for “foto”, we should not consider future words like “casi” and “perfecta”.
Masked Attention scores calculation
Una foto casi perfecta
0.06 0.35 0.42 0.00 0.83
0.12 0.36 0.94 0.00 1.43
0.02 -1.91 -0.55 0.00 -2.44
0.22 0.09 -1.42 0.00 -1.11
-0.34 Una -0.2 -0.3 0.3 0.7
0.30 1.28 foto -0.4 -0.3 0.6 -0.2
-0.29 -1.11 1.58 casi -0.1 1.7 -0.4 -0.5
perfecta -0.8 -0.1 -0.9 1.4
Una foto casi perfecta
-0.2 -0.3 0.3 0.7
-0.4 -0.3 0.6 -0.2
-0.1 1.7 -0.4 -0.5
-0.8 -0.1 -0.9 1.4
The same applies to the attention score calculation for the word “casi”.
The Third Attention – “Inter- Attention”
y1 … yt
Encoder to Decoder Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Attention + Encoder and Decoder
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Multi-Head
Attention
y1, y2,y3,….yt
This is also a multi-head attention, but with Queries from the decoder and Keys/Values from the encoder.
Q, K, V for Self(Intra) Attention
A nearly perfect photo
-1.1 -1.2 0.1 -2.3 -4.54
-2.6 -6.5 -0.1 -3.1 -12.29
-1.7 -3.4 -0.4 -10.5 -15.90
0.9 7.4 0.0 2.0 10.29
A -0.7 0.3 0.2 0.6
nearly -1.5 1.5 -0.2 0.8
1.72 -4.43 0.42 -3.77 perfect -1.0 0.8 -0.8 2.8
photo 0.5 -1.7 0.0 -0.5
A nearly perfect photo
-0.7 0.3 0.2 0.6
-1.5 1.5 -0.2 0.8
-1.0 0.8 -0.8 2.8
0.5 -1.7 0.0 -0.5
(K)
(Q)
(V)
Q, K and V are all taken from the same set in a self-attention block.
Self Attention / Intra Attention
• Until now we discussed self attention,
or intra attention within the encoder
• For example, when we are translating
from English to Spanish
• Self attention inside the encoder
captures the word relation within
English language.
• Q, K and V all are from English
• Self attention inside the decoder
captures the word relation within
Spanish language.
• Q, K and V all are from Spanish
Attention / Inter- Attention
• For example, when we are translating from English to Spanish
• Inter attention inside the decoder captures the word relation between
decoder and encoder.
• Q – Queries will be from Spanish (Decoder)
• K-Keys and V-Values will be from English (Encoder)
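In code form, inter-attention is the same calculation, just with Q taken from the decoder side and K, V taken from the encoder side (a rough sketch; the matrices below are random placeholders for the encoder and decoder states):

import numpy as np

def attend(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(3)
enc = rng.normal(size=(4, 4))          # encoder states for "A nearly perfect photo"
dec = rng.normal(size=(4, 4))          # decoder states for "Una foto casi perfecta"

self_attn  = attend(dec, dec, dec)     # intra-attention: Q, K, V all from Spanish
inter_attn = attend(dec, enc, enc)     # inter-attention: Q from Spanish, K and V from English
print(inter_attn.shape)                # one context-over-English vector per Spanish word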
Attention / Inter- Attention - Calculation
A nearly perfect photo
-0.1 0.4 0.0 0.0 0.39
0.4 -1.4 0.1 0.1 -0.79
0.2 -1.2 -0.2 0.2 -0.90
-0.1 1.3 -0.1 0.0 1.09
Una -0.1 0.0 0.0 0.5
foto -0.6 0.7 -0.1 0.8
-0.15 -0.66 0.12 0.07 casi -0.5 0.4 -0.2 0.8
perfecta 0.1 -0.8 0.0 -0.2
A nearly perfect photo
0.3 -0.6 0.1 0.2
-2.3 2.1 0.8 1.6
-1.6 1.8 -1.3 2.4
0.9 -2.0 -0.5 -0.1
(K)
(Q)
(V)
• Queries-Q are from
decoder
• Keys-K and Values-V are
from encoder
Seq-to-Seq model with
Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Multi-Head
Attention
y1, y2,y3,….yt
Attention + Dense Layers for learning
•Until now, we have included only attention blocks.
•But eventually, these attention blocks only give us better embeddings.
•We need to add at least a couple of dense layers on top of these
intelligent embeddings.
•These dense layers will help us in learning all the patterns in the data.
•Dense layers are also known as hidden layers or feed forward layers
Final Seq-to-Seq model
with Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Preserving the sequential order
•Questions: In word2vec do we preserve the sequential order? In
attention are we preserving the sequential order?
•Answer: Nope, we are not. Context is decided based on the
surrounding words, not necessarily by retaining the sequence.
1. “the food was good, not bad”
2. “the food was bad, not good”
Self attention treats both the sentences the same way. The order and
the position of “good” and “bad” are very important in NLP.
Positional encodings
•How do we include the information about the sequential order or the
position of the sentence?
•By using “Positional encodings”
•It is a very simple concept. We will create a new vector based on the
relative position of the word in a sentence.
•We will now add the positional information to the embeddings and
pass it on as input
Simple Positional encodings
• We can add this extra information to input embeddings and pass it on as the input for
attention calculation
the food was good not bad
0 1 2 3 4 5
the food was bad not good
0 1 2 3 4 5
• In the first sentence, good has value-3 and bad has value-5
• In the second sentence, good has value-5 and bad has value-3
Adding Positional encodings - idea
the food was good not bad
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 -0.2 -1.1 3.2 -0.2 -0.30
-1.7 0.9 -0.9 3.7 -0.5 1.00
-0.2 -0.5 0.3 2.6 0.3 -2.20
-1.3 -0.8 0.1 2.5 2.6 -1.50
Embeddings
Adding Positional encodings - idea
the food was good not bad
pos 0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 -0.2 -1.1 3.2 -0.2 -0.30
-1.7 0.9 -0.9 3.7 -0.5 1.00
-0.2 -0.5 0.3 2.6 0.3 -2.20
-1.3 -0.8 0.1 2.5 2.6 -1.50
Positional index
Adding Positional encodings - idea
the food was good not bad
pos 0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
+0 +1 +2 +3 +4 +5
-0.2 -0.2 -0.2 0.8 -1.1 0.9 3.2 6.2 -0.2 3.8 -0.30 4.7
-1.7 -1.7 0.9 1.9 -0.9 1.1 3.7 6.7 -0.5 3.5 1.00 6.0
-0.2 -0.2 -0.5 0.5 0.3 2.3 2.6 5.6 0.3 4.3 -2.20 2.8
-1.3 -1.3 -0.8 0.2 0.1 2.1 2.5 5.5 2.6 6.6 -1.50 3.5
Add positional index
to embeddings to get
new position based
embeddings
Adding Positional encodings - idea
the food was good not bad
0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
+0 +1 +2 +3 +4 +5
-0.2 -0.2 -0.2 0.8 -1.1 0.9 3.2 6.2 -0.2 3.8 -0.30 4.7
-1.7 -1.7 0.9 1.9 -0.9 1.1 3.7 6.7 -0.5 3.5 1.00 6.0
-0.2 -0.2 -0.5 0.5 0.3 2.3 2.6 5.6 0.3 4.3 -2.20 2.8
-1.3 -1.3 -0.8 0.2 0.1 2.1 2.5 5.5 2.6 6.6 -1.50 3.5
• This method doesn't work.
• If there are too many words in a sequence, we end up adding very large values.
• If there are 50 words in a sentence, we would be adding 50 to the final word's embedding.
• We need a slightly modified function for positional encodings.
New formula Positional encodings
the food was good not bad
pos 0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 ? -0.2 ? -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 ? 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 ? -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 ? -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
New formula Positional encodings
the food was good not bad
pos 0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 ? -0.2 ? -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 ? 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 ? -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 ? -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
New formula Positional encodings
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 ? -0.2 ? -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 ? 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 ? -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 ? -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 ? -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 0 ; i = 0 ; d = 4 → PE(0,0) = sin(0 / 10000^(2*0/4)) = 0
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 ? -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 0 ; i = 1 ; d = 4 → PE(0,1) = sin(0 / 10000^(2*0.5/4)) = 0
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0 0.9 ? -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 0 -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 0 -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 1 ; i = 0 ; d = 4 → PE(1,0) = sin(1 / 10000^(0/4)) = sin(1) ≈ 0.84
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0 0.9 0.10 -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 0 -0.5 ? 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 0 -0.8 ? 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 1 ; i = 1 ; d = 4 → PE(1,1) = sin(1 / 10000^(1/4)) ≈ 0.10
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0 0.9 0.10 -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 0 -0.5 0.01 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 0 -0.8 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 1 ; i = 2 ; d = 4 → PE(1,2) = sin(1 / 10000^(2/4)) ≈ 0.01
Positional encodings - Calculation
the food was good not bad
pos 0 1 2 3 4 5
i=0 -0.2 -0.2 -1.1 3.2 -0.2 -0.3
i=1 -1.7 0.9 -0.9 3.7 -0.5 1.0
i=2 -0.2 -0.5 0.3 2.6 0.3 -2.2
i=3 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 ? 3.2 ? -0.2 ? -0.30 ?
-1.7 0 0.9 0.10 -0.9 ? 3.7 ? -0.5 ? 1.00 ?
-0.2 0 -0.5 0.01 0.3 ? 2.6 ? 0.3 ? -2.20 ?
-1.3 0 -0.8 0.1 ? 2.5 ? 2.6 ? -1.50 ?
Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d))
In our example
• 𝑑 – is the dimensionality of the
embeddings ; d=4 here
• 𝑖 – runs from 0 to 3
• 𝑝𝑜𝑠 – Takes values 0 to 5
pos = 1 ; i = 2 ; d = 4 → PE(1,2) = sin(1 / 10000^(2/4)) ≈ 0.01
Positional encodings - Calculation
the food was good not bad
-0.2 -0.2 -1.1 3.2 -0.2 -0.3
-1.7 0.9 -0.9 3.7 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 0.909 3.2 0.141 -0.2 -0.288 -0.30 0.913
-1.7 0 0.9 0.100 -0.9 0.199 3.7 0.296 -0.5 0.002 1.00 0.002
-0.2 0 -0.5 0.010 0.3 0.020 2.6 0.030 0.3 0.000 -2.20 0.000
-1.3 0 -0.8 0.001 0.1 0.002 2.5 0.003 2.6 0.000 -1.50 0.000
Positional encodings for
all the words in
sentence-1
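A NumPy sketch that computes this table using the slides' sine-only convention with d = 4 and 6 positions (effectively sin(pos / 10000^(i/d)) for the i-th embedding dimension, which reproduces the values above for the first few positions; the original paper alternates sine and cosine across dimensions, but we follow the slides here):

import numpy as np

def positional_encoding(n_positions, d):
    pos = np.arange(n_positions)[:, None]        # 0 .. n_positions-1
    i = np.arange(d)[None, :]                    # embedding-dimension index
    return np.sin(pos / 10000 ** (i / d))        # PE[pos, i] = sin(pos / 10000^(i/d))

PE = positional_encoding(6, 4)    # 6 words ("the food was good not bad"), d = 4
print(np.round(PE.T, 3))          # rows = dimensions, columns = positions, like the table above
# the updated inputs are then simply: word embeddings + positional encodings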
Embeddings + Positional encodings
the food was good not bad the food was good not bad
0 1 2 3 4 5 0 1 2 3 4 5
-0.2 -0.2 -1.1 3.2 -0.2 -0.3 -0.2 0.6 -0.2 3.3 -0.5 0.6
-1.7 0.9 -0.9 3.7 -0.5 1.0 -1.7 1.0 -0.7 4.0 -0.5 1.0
-0.2 -0.5 0.3 2.6 0.3 -2.2 -0.2 -0.5 0.3 2.6 0.3 -2.2
-1.3 -0.8 0.1 2.5 2.6 -1.5 -1.3 -0.8 0.1 2.5 2.6 -1.5
the food was good not bad
-0.2 0 -0.2 0.84 -1.1 0.909 3.2 0.141 -0.2 -0.288 -0.30 0.913
-1.7 0 0.9 0.100 -0.9 0.199 3.7 0.296 -0.5 0.002 1.00 0.002
-0.2 0 -0.5 0.010 0.3 0.020 2.6 0.030 0.3 0.000 -2.20 0.000
-1.3 0 -0.8 0.001 0.1 0.002 2.5 0.003 2.6 0.000 -1.50 0.000
Embeddings
updated with
positional
encoders
Sentence-2
the food was bad not good the food was bad not good
0 1 2 3 4 5 0 1 2 3 4 5
-0.2 -0.2 -1.1 -0.3 -0.2 3.2 -0.2 0.6 -0.2 -0.2 -0.5 4.1
-1.7 0.9 -0.9 1.0 -0.5 3.7 -1.7 1.0 -0.7 1.3 -0.5 3.7
-0.2 -0.5 0.3 -2.2 0.3 2.6 -0.2 -0.5 0.3 -2.2 0.3 2.6
-1.3 -0.8 0.1 -1.5 2.6 2.5 -1.3 -0.8 0.1 -1.5 2.6 2.5
the food was bad not good
-0.2 0.0 -0.2 0.84 -1.1 0.909 -0.3 0.141 -0.2 -0.288 3.2 0.913
-1.7 0.0 0.9 0.100 -0.9 0.199 1.0 0.296 -0.5 0.002 3.7 0.002
-0.2 0.0 -0.5 0.010 0.3 0.020 -2.2 0.030 0.3 0.000 2.6 0.000
-1.3 0.0 -0.8 0.001 0.1 0.002 -1.5 0.003 2.6 0.000 2.5 0.000
Sentence-1 vs Sentence-2 with positional
encodings
Sentence-2
the food was bad not good
0 1 2 3 4 5
-0.2 0.6 -0.2 -0.2 -0.5 4.1
-1.9 1.0 -0.7 1.3 -0.5 3.7
-1.2 -0.5 0.3 -2.2 0.3 2.6
-1.5 -0.8 0.1 -1.5 2.6 2.5
Sentence-1
the food was good not bad
0 1 2 3 4 5
-0.2 0.6 -0.2 3.3 -0.5 0.6
-1.9 1.0 -0.7 4.0 -0.5 1.0
-1.2 -0.5 0.3 2.6 0.3 -2.2
-1.5 -0.8 0.1 2.5 2.6 -1.5
Have you noticed the difference?
Sentence-1 vs Sentence-2 with positional
encodings
Sentence-2
the food was bad not good
0 1 2 3 4 5
-0.2 0.6 -0.2 -0.2 -0.5 4.1
-1.9 1.0 -0.7 1.3 -0.5 3.7
-1.2 -0.5 0.3 -2.2 0.3 2.6
-1.5 -0.8 0.1 -1.5 2.6 2.5
Sentence-1
the food was good not bad
0 1 2 3 4 5
-0.2 0.6 -0.2 3.3 -0.5 0.6
-1.9 1.0 -0.7 4.0 -0.5 1.0
-1.2 -0.5 0.3 2.6 0.3 -2.2
-1.5 -0.8 0.1 2.5 2.6 -1.5
Positional encodings - Conclusion
•Word2vec → a numerical representation of the words
•Positional encodings → a numerical representation of the position of
the word in the original sentence
•Positional encodings – An extra vector to capture the position
•Both Word2Vec and Positional encodings are vectors.
•These two vectors are added and passed on to encoder
Updated Seq-to-Seq
model with Attention
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Updated Seq-to-Seq
model with Attention
Positional
encodings
X1, X2,X3,….Xn
Multi-Head
Attention
Input embeddings
y1, y2,y3,….yt-1
Masked Multi-Head
Attention
Output embeddings
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Few more practical issues – The high risk
points in the network
•While solving practical problems, DL models work best when the
weights stay near zero, both positive and negative.
•If not, we run into problems like unstable training, vanishing
gradients, internal covariate shift, etc.
•There are some layers in the network that can have a significant
impact on the gradient calculation. Let's call them the “high risk
points”.
The high risk points
Multi-Head
Attention
Masked Multi-Head
Attention
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Here
Here
Here
Here
Here
• These are risky areas.
• The calculations at these places have a huge impact on the gradient values.
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Fixing the practical issues – Two steps
•We will now add two extra elements to our seq-to-seq model
architecture to keep the gradient flow smooth and to make the model
mathematically convenient for large-data problems.
1) Residual connection
2) Normalization
Step-1: Extra connections
Multi-Head
Attention
Masked Multi-Head
Attention
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Here
Here
Here
Here
Here
• Combine original input with multi head attention
output here
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Step-1: Extra connections
Multi-Head Attention
Masked Multi-Head
Attention
Multi-Head
Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Here
Here
Here
Here
Add
This is known as a
residual connection
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
All residual connections
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add
Add
Add
Add
Add
Helps us in keeping
the smooth flow of
gradient
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
All residual connections
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add
Add
Add
Add
Add
Helps us in keeping
the smooth flow of
gradient
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Step-2: Normalization
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Normalization is important for avoiding extremely high
and extremely low values after multiplying with weights.
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Residual Connections and Normalization
•Imagine a seq-to-seq model with a stack of 16 encoders and
16 decoders, along with several multi-head attentions.
•It is going to be a very complex optimization problem.
•Residual connections and normalization are helpful when training such
deep networks in practice.
•The main idea is multi-head attention; the remaining components in the
network are there to make the problem mathematically solvable.
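A minimal NumPy sketch of one "Add & Norm" step: a residual connection followed by layer normalization (this simple layer_norm omits the learnable scale/shift parameters, and the linear map below is just a stand-in for an attention or dense sub-layer):

import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)               # keeps the values centred near zero

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))            # residual connection, then normalization

rng = np.random.default_rng(4)
W = rng.normal(size=(4, 4))
x = rng.normal(size=(5, 4))                       # 5 words, d = 4
y = add_and_norm(x, lambda h: h @ W)              # sub-layer = a simple linear map here
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))   # roughly 0 mean and unit std per word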
Full and Final Architecture
Seq to Seq Model
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Add & Norm
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Full and Final Architecture
Seq to Seq Model
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Encoder
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Full and Final Architecture
Seq to Seq Model
Multi-Head Attention Masked Multi-Head Attention
Multi-Head Attention
y1, y2,y3,….yt
Dense Layer
Output Layer with SoftMax
Dense Layer
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Add & Norm
Encoder
X1, X2,X3,….Xn
Input embeddings
y1, y2,y3,….yt-1
Output embeddings
Decoder
Transformers
•The full and final seq-to-seq model architecture that we have seen
above is known as the Transformer architecture.
•Transformers are currently among the fastest, most accurate and most
widely used architectures for seq-to-seq problems.
• Transformers = Attention + Encoder & Decoder Architecture
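As a quick illustration, one encoder block of this architecture can be assembled from Keras' built-in layers (a rough sketch only: the sequence length, model width, number of heads and feed-forward size are arbitrary choices, and the positional encodings and the decoder are omitted):

import tensorflow as tf
from tensorflow.keras import layers

d_model = 64
inputs = layers.Input(shape=(20, d_model))                                  # 20 tokens, already embedded
attn = layers.MultiHeadAttention(num_heads=2, key_dim=32)(inputs, inputs)   # self-attention
x = layers.LayerNormalization()(layers.Add()([inputs, attn]))               # Add & Norm
ff = layers.Dense(128, activation="relu")(x)                                # feed-forward (dense) layer
ff = layers.Dense(d_model)(ff)
encoded = layers.LayerNormalization()(layers.Add()([x, ff]))                # Add & Norm
encoder = tf.keras.Model(inputs, encoded)
encoder.summary()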
Transformers Architecture diagram from the original paper
This paper is all
you need
Download link
LAB – Chatbot building using transformers
Code – Chatbot building using
transformers
Code and output explanation
Google uses BERT
• Since 2019, Google Search has used Google’s transformer neural network BERT for search
queries in over 70 languages.
• Prior to this change, a lot of information retrieval was keyword based, meaning Google checked its
crawled sites without strong contextual clues. Take the example word ‘bank’, which can have many
meanings depending on the context.
• The introduction of transformer neural networks to Google Search means that queries where words
such as ‘from’ or ‘to’ affect the meaning are better understood by Google. Users can search in more
natural English rather than adapting their search query to what they think Google will understand.
• An example from Google’s blog is the query “2019 brazil traveler to usa need a visa.” The position of
the word ‘to’ is very important for the correct interpretation of the query. The previous implementation
of Google Search was not able to pick up this nuance and returned results about USA citizens
traveling to Brazil, whereas the transformer model returns much more relevant pages.
• A further advantage of the transformer architecture is that learning in one language can be transferred
to other languages via transfer learning. Google was able to take the trained English model and adapt
it easily for the other languages’ Google Search.
My References
•https://data-science-blog.com/blog/2021/02/17/sequence-to-
sequence-models-back-bones-of-various-nlp-tasks/
•https://github.com/YasuThompson/Transformer_blog_codes/blob/mai
n/rnn_translation_attention_modified.ipynb
•https://www.youtube.com/watch?v=pLpzU-xGi2E&t=1359s
•https://github.com/YanXuHappygela/NLP-
study/blob/master/seq2seq_with_attention.ipynb
•https://www.tensorflow.org/text/tutorials/nmt_with_attention
•https://deepai.org/machine-learning-glossary-and-terms/transformer-
neural-network
•https://www.youtube.com/watch?v=23XUv0T9L5c
Publications
DataRace Android App
• Comprehensive Preparation: Get ready for data science &
ML interviews.
• 5000+ Questions: Diverse collection for thorough practice.
• Question Formats:
• MCQs
• Image-based
• Long answers
• Practice projects
• Scenario-based
• All-In-One Solution: Your go-to for data science & ML
interview prep.
• Boost Confidence: Gain proficiency and interview readiness.
• Success Assurance: Increase chances of success in
interviews.
Machine Learning Deep Learning AI and Data Science Venkata Reddy Konasani
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Venkata Reddy Konasani
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced featuresVenkata Reddy Konasani
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 

More from Venkata Reddy Konasani (20)

Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
ARIMA
ARIMA ARIMA
ARIMA
 

Recently uploaded

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

  • 13. Word2Vec - Example 1) I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team. 2) It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her. Word2vec embeddings for the sentence-1 words:
I: -2.0280, 0.2926, 1.1242, 3.2483
can: -1.8104, 3.0375, -0.8583, 1.0322
bear: -1.7101, 0.8651, -0.9374, 3.6598
the: 0.5502, -2.0092, 0.0753, -3.1513
pain: -0.3920, 0.9801, -0.3494, -2.4210
  • 14. Problem with word2Vec 2) It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her. 1) I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team. •Is the word "can" in the first sentence the same as the "can" in the second sentence? •When we use Word2Vec, will we get the same vector or two different vectors?
  • 15. Problem with word2Vec 2) It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her. 1) I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team. •Similarly, is the word "light" in the first sentence the same as the "light" in the second sentence? •When we use Word2Vec, will we get the same vector or two different vectors?
  • 16. Problem with word2Vec 2) It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her. 1) I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team. •can – to be able to vs. a can or tin •light – light in weight vs. light as in brightness •bear – to endure vs. bear the animal •cricket bat – the game and its equipment vs. cricket and bat the insects
  • 17. Problem with word2Vec •can – to be able to vs. a can or tin •light – light in weight vs. light as in brightness •bear – to endure vs. bear the animal •cricket bat – the game and its equipment vs. cricket and bat the insects •As humans, we can understand the contextual relevance and the difference between those words; Word2Vec cannot distinguish them. •The Word2Vec theory doesn't have any mechanism to differentiate these words based on their context. •We need to add some more contextual relevance to the existing word2vec formula.
  • 18. Do you agree? •For building really intelligent models, we need to weigh in some more context. •The context preserved in Word2Vec is not sufficient for solving advanced NLP applications like • Intelligent chatbots • Search Engines • Language Translation • Voice Assistants • Email filtering •We need to preserve some more context over and above the Word2Vec embeddings
  • 19. An example •Consider these word embeddings. •For the sake of simplicity we use rounded-off numbers. Sentence-2 words: a: -0.2, -0.2, -1.1, 3.2 | bear: -1.7, 0.9, -0.9, 3.7 | appeared: -0.2, -0.5, 0.3, 2.6 | behind: -1.3, -0.8, 0.1, 2.5 | her: -0.3, 1.0, -2.2, -1.5. Sentence-1 words: I: -2.0, 0.3, 1.1, 3.2 | can: -1.8, 3.0, -0.9, 1.0 | bear: -1.7, 0.9, -0.9, 3.7 | the: 0.6, -2.0, 0.1, -3.2 | pain: -0.4, 1.0, -0.3, -2.4
  • 20. Word2Vec + more contextual weights •How do we get additional context-based weights? •Look at the word embeddings for the words in the two sentences (the same two tables as on the previous slide). •The same embedding appears for "bear" in the two different sentences. •The goal is to include context-based weights to modify these embeddings, and to send those adjusted embeddings to the final NLP model, which will in turn give us better accuracy.
  • 21. Let's focus on "bear" •How do we make "bear" in the first sentence look different from "bear" in the second sentence? •We are going to add extra weightage to these embeddings.
  • 22. Weighted embeddings Sentence-1 embeddings (one column per word): I: -2, 0.3, 1.1, 3.2 | can: -1.8, 3, -0.9, 1 | bear: -1.7, 0.9, -0.9, 3.7 | the: 0.6, -2, 0.1, -3.2 | pain: -0.4, 1, -0.3, -2.4. Starting from the original embedding for "bear" in sentence-1, we will build a new weighted embedding for "bear".
  • 23. Weighted embeddings We first need extra weights for "bear" with respect to "I", "can", "bear", "the" and "pain".
  • 24. Weighted embeddings Bring the "I" weight into "bear" with a dot product: (-1.7)*(-2) + (0.9)*(0.3) + (-0.9)*(1.1) + (3.7)*(3.2) = 14.52
  • 25. Weighted embeddings Repeating this for every word gives the weights for "bear" with respect to "I", "can", "bear", "the" and "pain": 14.52, 10.27, 18.20, -14.75, -7.03
  • 26. Weighted embeddings Multiply the embedding of "I" with its weight (14.52): after multiplication we get -29.0, 4.4, 16.0, 46.5
  • 27. Weighted embeddings Multiply each word vector by its corresponding weight: I: -29.0, 4.4, 16.0, 46.5 | can: -18.5, 30.8, -9.2, 10.3 | bear: -30.9, 16.4, -16.4, 67.3 | the: -8.9, 29.5, -1.5, 47.2 | pain: 2.8, -7.0, 2.1, 16.9
  • 28. Weighted embeddings Add across all the words, dimension by dimension: -84.5, 74.0, -9.0, 188.1
  • 29. Weighted embeddings This is the new weighted embedding for "bear" in sentence-1; the original embedding was -1.7, 0.9, -0.9, 3.7.
  • 30. Weighted embeddings Normalizing the weights so that they sum to 1 helps us keep the resulting vectors to small numbers.
  • 31. Weighted embeddings – Sentence-1 Original embedding for "bear": -1.7, 0.9, -0.9, 3.7. New weighted embedding for "bear": -84.5, 74.0, -9.0, 188.1
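The arithmetic above can be reproduced in a few lines of NumPy. This is only an illustrative sketch (the array names are ours); it uses the rounded embeddings and the raw, un-normalized scores, so the output matches the numbers on the slides.

import numpy as np

# Rounded 4-dimensional embeddings for sentence-1 (rows: I, can, bear, the, pain)
E = np.array([[-2.0,  0.3,  1.1,  3.2],   # I
              [-1.8,  3.0, -0.9,  1.0],   # can
              [-1.7,  0.9, -0.9,  3.7],   # bear
              [ 0.6, -2.0,  0.1, -3.2],   # the
              [-0.4,  1.0, -0.3, -2.4]])  # pain

bear = E[2]

scores = E @ bear           # dot product of "bear" with every word -> 14.52, 10.27, 18.20, -14.75, -7.03
weighted_bear = scores @ E  # weighted sum of all word vectors     -> -84.5, 74.0, -9.0, 188.1 (approx.)

print(np.round(scores, 2), np.round(weighted_bear, 1))

In practice the scores are normalized (for example with a softmax) before the weighted sum, as the later slides do.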
  • 32. Weighted embeddings – Sentence-2 Sentence-2 embeddings (one column per word): a: -0.2, -0.2, -1.1, 3.2 | bear: -1.7, 0.9, -0.9, 3.7 | appeared: -0.2, -0.5, 0.3, 2.6 | behind: -1.3, -0.8, 0.1, 2.5 | her: -0.3, 1, -2.2, -1.5. •We now compute the new weighted embedding for "bear" in sentence-2. •Will it be the same as in sentence-1, or will it be different? What do you think?
  • 33. Weighted embeddings – Sentence-2 The weights for "bear" with respect to "a", "bear", "appeared", "behind" and "her" are 12.99, 18.20, 9.24, 10.65, -2.16. Multiplying each word vector by its weight and adding across the words gives the new weighted embedding for "bear" in sentence-2: -48.6, -1.5, -22.1, 162.8
  • 34. Weighted embeddings – Sentence-1 vs 2 The plain word embedding for "bear" is identical in both sentences: -1.7, 0.9, -0.9, 3.7
  • 35. Weighted embeddings – Sentence-1 vs 2 The weighted embeddings differ: bear(new) in sentence-1 is -84.5, 74.0, -9.0, 188.1, while bear(new) in sentence-2 is -48.6, -1.5, -22.1, 162.8
  • 36. Weighted embeddings – Sentence-1 vs 2 •We now have two new embeddings for the same word. While training the model, we will use a different embedding in each sentence. •These new weighted embeddings give greater importance to the specific context of each sentence, and they will significantly improve the model accuracy.
  • 37. Weighted embeddings procedure •Have you understood the formula for calculating the new weighted embeddings? •The procedure that we followed until now is known as the attention mechanism. •And that's it… that's all you need to improve the accuracy of the model: "Attention is all you need". •Let's go through attention once again.
  • 38. Attention = Weighted embeddings The weighted-embedding calculation we just walked through (scores, normalization, weighted sum) is exactly the attention mechanism.
  • 39. Attention terminology The vector of the word we are computing the new embedding for ("bear": -1.7, 0.9, -0.9, 3.7) is called the Query (Q) in this operation.
  • 40. Attention terminology The vectors of the words it is scored against ("I", "can", "bear", "the", "pain") are called the Keys (K) in this operation.
  • 41. Attention terminology We multiply the weights again with the embeddings; in this second multiplication the embedding vectors are known as the Values (V).
  • 42. Attention terminology The dot products (14.52, 10.27, 18.20, -14.75, -7.03) are called scores. Scores are normalized so that their sum is 1.
  • 43. Attention terminology The input word vectors are denoted V1, V2, V3, …, Vn.
  • 45. Attention terminology From these inputs we take the Query Q (the vector for "bear") and the Keys K (the vectors of all the words in the sentence).
  • 46. Attention terminology Q and K are combined with a matrix multiplication (matmul) to produce the scores: 14.52, 10.27, 18.20, -14.75, -7.03.
  • 47. Attention terminology The scores are normalized and then multiplied (matmul) with the Values V, i.e. the word embeddings.
  • 48. Attention terminology The result is the output vector. Note: only one output vector Yk is shown here; we can do the same for all 'n' input vectors to get Y1, Y2, Y3, …, Yn.
  • 49. Attention - Block The whole pipeline V1…Vn → Q, K, V → matmul → normalize → matmul → Y1…Yn is the attention block, more specifically the self-attention block.
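A minimal NumPy sketch of this self-attention block, assuming softmax as the normalization step (the function and variable names are ours, not from the slides):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (n_words, d) matrix of word embeddings; here Q = K = V = X
    scores = X @ X.T           # matmul: score every word against every word
    weights = softmax(scores)  # normalize: each row of weights sums to 1
    return weights @ X         # matmul: weighted sum of the value vectors

Y = self_attention(np.random.randn(5, 4))   # Y[i] is the context-weighted embedding of word i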
  • 50. Do you Agree ? •Attention • Is nothing but weighted embeddings only • Smart embeddings • Advanced version of word2vec • More contextualized embeddings • With a different matrix calculation formula.
  • 51. Trainable parameters? •Are there any trainable parameters in the above process? •Are we training anything, or are we simply doing multiplications without any trainable weight parameters? •Until now there are NO trainable parameters. •We need to bring trainable weights into this procedure so that it can learn from the data. •Each of the Q, K and V vectors needs to be multiplied by a weight matrix: WQ, WK and WV.
  • 52. Add trainable parameters V1, V2, V3, …, Vn → Q, K, V (via WQ, WK, WV) → matmul → normalize → matmul → Y1, Y2, Y3, …, Yn. Multiplying with a weight matrix is the same as adding a linear layer, in NN terminology.
  • 53. Add trainable parameters While training the model, we can treat this attention block as a linear embedding layer and train its parameters based on the error signals.
  • 54. Attention formula Attention(Q, K, V) = softmax(QKᵀ) V, where Q, K and V are obtained from the input vectors V1…Vn via the weight matrices WQ, WK and WV, and the pipeline is matmul → normalize → matmul → Y1, Y2, Y3, …, Yn.
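The same block with the trainable matrices WQ, WK and WV added, as a rough NumPy sketch. Random matrices stand in for the learned parameters, and the scores are also divided by √d_k as in the original paper (a detail not shown in the slide formula):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 4, 4
rng = np.random.default_rng(0)

# Stand-ins for the trainable parameters WQ, WK, WV (learned by backpropagation in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

def attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # the linear layers applied to the inputs
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot-product scores
    return softmax(scores) @ V            # normalize, then weighted sum of the values

Y = attention(rng.normal(size=(5, d_model)))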
  • 55. Attention – Actual figure from the paper [diagram: V1, V2, V3, …, Vn → WQ, WK, WV → Q, K, V → matmul → normalize → matmul → Y1, Y2, Y3, …, Yn]
  • 56. Stacked attention blocks/layers V1, V2, V3, …, Vn → … → Y1, Y2, Y3, …, Yn. We can also stack multiple self-attention blocks in the model; this will further improve the accuracy.
  • 57. Self Attention visualization • Many thanks to - Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/ Simulation and Colab code
  • 58. Several parallel contexts • I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team. • It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her. • There are several parallel contexts present here • Sports - cricket game • Bat in sports - sports equipment • Present and past events • Bat/ cricket – insects • Bear – Animals • playing, opening, appearing – Actions • These are the contexts from just two documents. For the large and diverse datasets there will be multiple contexts. We need many more parallel attention stacks.
  • 59. Single Head Attention. V1, V2,V3,….Vn Q K V matmul normalize matmul Y1, Y2,Y3,….Yn WQ WK WV • Until now, we have discussed a single attention stack either with single block or multiple attention blocks. • At the end we will have a single set of vectors [Y1, Y2,Y3,….Yn] • This is known as single head attention. • For a large and diverse dataset, we may have several parallel contexts. • We may have to include parallel attention stacks to learn parallel contexts
  • 62. Multi-head Attention. Concatenate → Linear Layer. Adding a linear layer is simply multiplying the concatenated vector with the output weight matrix WO.
  • 63. MultiHeadAttention(Q, K, V) = concat(H1, H2, …, Hh) WO, where Hi = Attention(Q WiQ, K WiK, V WiV). Concatenate the heads, then apply a linear layer.
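A rough NumPy sketch of this formula. The head count and dimensions are illustrative, and random matrices stand in for the learned WiQ, WiK, WiV and WO:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, h=2):
    n, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(1)
    heads = []
    for _ in range(h):
        W_Q = rng.normal(size=(d_model, d_k))   # per-head projections
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # H_i = Attention(Q WiQ, K WiK, V WiV)
    W_O = rng.normal(size=(h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_O             # concat(H_1, …, H_h) WO

Y = multi_head_attention(np.random.randn(5, 4))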
  • 64. Multi-head Attention - Actual figure from the paper
  • 65. Multi-head Attention visualization - Example Two attention heads focusing on two different contexts w.r.t a word
  • 66. Multi-head Attention visualization - Example Three attention heads focusing on three different contexts w.r.t the same word
  • 67. Self attention - Limitations •Until now we discussed self-attention with single head and multiple heads. •In self attention, a word looks at the surrounding words in the same sentence to calculate the attention. •Self attention is useful for many-to-one kind of models. For example • Sentiment Analysis • Document Classification •We need to have a slightly complex architecture for many-to-many or sequence to sequence models. For example • Machine Translation • Chatbot
  • 68. Sequence to Sequence Models •Before discussing attention, we used LSTMs or RNNs for sequence to sequence models. •We used an Encoder LSTM for the input sequence and a Decoder LSTM for the output sequence. •Let's revisit the LSTM/RNN based seq to seq model using simple word embeddings.
  • 69. Encoder and Decoder – Revision
  • 70. Encoder and Decoder – Revision •Language translation is an example of a sequence to sequence model •This model has two components • An Encoder LSTM for understanding the structure of the input language sequence • A Decoder LSTM for translating the text into the output language
  • 71. Encoder and Decoder – Revision [diagram: x1 … xt → Encoder LSTM → Thought Vector → Decoder LSTM → y1 … yt]
  • 72. Encoder and Decoder – Revision •Encoder: • Takes the input sequence and creates a new vector based on the whole input sequence. The resultant vector from encoder is also known as thought vector or context vector
  • 73. Encoder and Decoder – Revision •Thought Vector / Context Vector: • The overall crux of the input sequence is imparted into one vector. This will be the only input to the decoder.
  • 74. Encoder and Decoder – Revision •Decoder: • Decoder takes thought vector as input and generates a logical translated sequence.
  • 75. Encoder and Decoder – Revision
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, Dense

model = Sequential()
model.add(Embedding(input_dim=30000, output_dim=256, input_length=20))  # Word2Vec-style embedding layer
model.add(LSTM(128))                                                    # Encoder
model.add(RepeatVector(20))                                             # Thought vector, repeated for each output step
model.add(LSTM(128, return_sequences=True))                             # Decoder
model.add(Dense(30000, activation='softmax'))                           # Distribution over the output vocabulary
  • 76. LAB: LSTM – Seq to Seq Model •From English to Spanish
  • 77. Code: LSTM – Seq to Seq Model •From English to Spanish
  • 78. Encoder and Decoder issues •The above procedure uses standard embeddings. •Since we generally use LSTMs or GRUs, we can achieve decent accuracy. •Less accuracy - for large and diverse datasets we may not get very high accuracy, due to the limitations of standard embeddings. •Slow - for large and diverse datasets we need very complex models with very deep architectures, and parallel processing is difficult in sequential models.
  • 79. Attention + Encoder and Decoder •We will now include multi-head attention inside the Encoder and Decoder •High Accuracy - For large and diverse datasets, we can use multi-head attention mechanism to capture more relevant contexts, this will indeed increase the accuracy. •Speed - Since multi-head attention is nothing but parallel attention channels, we can easily use GPUs and perform parallel processing.
  • 80. Attention + Encoder and Decoder •To solve sequence to sequence problems, we need to add three different types of attention. •For example, suppose we are building a machine translation model from English to Spanish. • We need an attention mechanism in the input language (English) to understand the contexts in English. • We need an attention mechanism in the output language (Spanish) to understand the word relations in Spanish. • The third attention is the most important one: the word relations and contextual relevance from English to Spanish.
  • 81. English to Spanish Translation Model x1 ….. xt y1 … yt Self-Attention in English Self-Attention in Spanish English to Spanish Attention
  • 82. English to Spanish Translation Model x1 ….. xt y1 … yt Self Attention in Encoder Self Attention in Decoder Encoder to Decoder Attention
  • 83. Attention + Encoder and Decoder y1 … yt Self Attention in Decoder Encoder to Decoder Attention X1, X2,X3,….Xn Multi-Head Attention Input embeddings
  • 84. Attention + Encoder and Decoder y1 … yt Encoder to Decoder Attention X1, X2,X3,….Xn Multi-Head Attention Input embeddings y1, y2,y3,….yt-1 Masked Multi-Head Attention Output embeddings
  • 85. What is Masked Multi-Head Attention? English - Encoder: "A nearly perfect photo"; Spanish - Decoder: "Una foto casi perfecta". • Inside the encoder self-attention, while calculating the attention for the word "nearly", do we consider all the rest of the words? • Inside the decoder self-attention, while calculating the attention for the word "casi", can we consider all the rest of the words? • On the encoder side we get all the input in one shot, so at both training and test time we have access to all the words. • On the decoder side, while predicting the word at time t we can use only the words up to time t-1, and at test time we will NOT have access to future words. • Encoder self-attention therefore doesn't need any special changes. • The decoder attention calculation needs to be masked from the future words.
  • 86. Masked attention scores calculation Spanish (decoder) embeddings: Una: -0.2, -0.3, 0.3, 0.7 | foto: -0.4, -0.3, 0.6, -0.2 | casi: -0.1, 1.7, -0.4, -0.5 | perfecta: -0.8, -0.1, -0.9, 1.4. While calculating the attention scores for "Una" we should not consider the future words, so its scores against "foto", "casi" and "perfecta" are masked out (set to 0).
  • 87. Masked attention scores calculation While calculating the attention scores for "foto" we should not consider the future words "casi" and "perfecta".
  • 88. Masked attention scores calculation The same applies to the attention score calculation for the word "casi".
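In practice the mask is usually applied to the scores before the softmax, by setting the score of every future position to a very large negative number so that its normalized weight becomes (almost) zero. A minimal sketch (names are ours):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    n = X.shape[0]
    scores = X @ X.T
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future words
    scores = np.where(future, -1e9, scores)             # effectively minus infinity before the softmax
    weights = softmax(scores)                           # future words now receive ~0 weight
    return weights @ X

# Row 0 ("Una") attends only to itself, row 1 to the first two words, and so on
Y = masked_self_attention(np.random.randn(4, 4))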
  • 89. The Third Attention – “Inter- Attention” y1 … yt Encoder to Decoder Attention X1, X2,X3,….Xn Multi-Head Attention Input embeddings y1, y2,y3,….yt-1 Masked Multi-Head Attention Output embeddings
  • 90. Attention + Encoder and Decoder X1, X2,X3,….Xn Multi-Head Attention Input embeddings y1, y2,y3,….yt-1 Masked Multi-Head Attention Output embeddings Multi-Head Attention y1, y2,y3,….yt This is also a Multi-Head attention with a different set of Queries and Values
  • 91. Q, K, V for Self(Intra) Attention A nearly perfect photo -1.1 -1.2 0.1 -2.3 -4.54 -2.6 -6.5 -0.1 -3.1 -12.29 -1.7 -3.4 -0.4 -10.5 -15.90 0.9 7.4 0.0 2.0 10.29 A -0.7 0.3 0.2 0.6 nearly -1.5 1.5 -0.2 0.8 1.72 -4.43 0.42 -3.77 perfect -1.0 0.8 -0.8 2.8 photo 0.5 -1.7 0.0 -0.5 A nearly perfect photo -0.7 0.3 0.2 0.6 -1.5 1.5 -0.2 0.8 -1.0 0.8 -0.8 2.8 0.5 -1.7 0.0 -0.5 (K) (Q) (V) Q, K and V are considered from a same set in a self attention block
  • 92. Self Attention / Intra Attention • Until now we discussed self attention, or intra attention within the encoder • For example, when we are translating from English to Spanish • Self attention inside the encoder captures the word relation within English language. • Q, K and V all are from English • Self attention inside the decoder captures the word relation within Spanish language. • Q, K and V all are from Spanish
  • 93. Attention / Inter- Attention • For example, when we are translating from English to Spanish • Inter attention inside the decoder captures the word relation between decoder and encoder. • Q – Queries will be from Spanish (Decoder) • K-Keys and V-Values will be from English (Encoder)
  • 94. Attention / Inter- Attention - Calculation A nearly perfect photo -0.1 0.4 0.0 0.0 0.39 0.4 -1.4 0.1 0.1 -0.79 0.2 -1.2 -0.2 0.2 -0.90 -0.1 1.3 -0.1 0.0 1.09 Una -0.1 0.0 0.0 0.5 foto -0.6 0.7 -0.1 0.8 -0.15 -0.66 0.12 0.07 casi -0.5 0.4 -0.2 0.8 perfecta 0.1 -0.8 0.0 -0.2 A nearly perfect photo 0.3 -0.6 0.1 0.2 -2.3 2.1 0.8 1.6 -1.6 1.8 -1.3 2.4 0.9 -2.0 -0.5 -0.1 (K) (Q) (V) • Queries-Q are from decoder • Keys-K and Values-V are from encoder
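A sketch of this inter (cross) attention in NumPy: the Queries come from the decoder vectors, while the Keys and Values come from the encoder vectors. In the full model each of them also passes through its own learned weight matrix, which is omitted here for brevity; the names are ours:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    Q = decoder_states           # e.g. Spanish-side vectors, shape (t_dec, d)
    K = V = encoder_states       # e.g. English-side vectors, shape (t_enc, d)
    weights = softmax(Q @ K.T)   # how much each output word attends to each input word
    return weights @ V           # shape (t_dec, d)

Y = cross_attention(np.random.randn(4, 4), np.random.randn(4, 4))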
  • 95. Seq-to-Seq model with Attention X1, X2,X3,….Xn Multi-Head Attention Input embeddings y1, y2,y3,….yt-1 Masked Multi-Head Attention Output embeddings Multi-Head Attention y1, y2,y3,….yt
  • 96. Attention + Dense layers for learning •Until now, we have included only attention blocks. •But eventually these attention blocks only give us better embeddings. •We need to add at least a couple of dense layers on top of these intelligent embeddings. •These dense layers help the model learn the patterns in the data. •Dense layers are also known as hidden layers or feed-forward layers.
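In the Transformer these dense layers form a small position-wise feed-forward block applied to every token vector independently; a rough Keras-style sketch (the layer sizes are illustrative, not from the slides):

from tensorflow.keras import Sequential, layers

def feed_forward_block(d_model=256, d_ff=1024):
    # Two dense layers applied to the last axis, i.e. to each position independently
    return Sequential([layers.Dense(d_ff, activation="relu"),
                       layers.Dense(d_model)])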
  • 97. Final Seq-to-Seq model with Attention X1, X2,X3,….Xn Multi-Head Attention Input embeddings y1, y2,y3,….yt-1 Masked Multi-Head Attention Output embeddings Multi-Head Attention y1, y2,y3,….yt Dense Layer Output Layer with SoftMax Dense Layer
  • 98. Preserving the sequential order •Questions: In word2vec do we preserve the sequential order? In attention are we preserving the sequential order? •Answer: Nope, we are not. Context is decided based on the surrounding words, not necessarily by retaining the sequence. 1. “the food was good, not bad” 2. “the food was bad, not good” Self attention treats both the sentences the same way. The order and the position of “good” and “bad” are very important in NLP.
  • 99. Positional encodings •How do we include information about the sequential order, i.e. the position of each word in the sentence? •By using "positional encodings". •It is a very simple concept: we create a new vector based on the relative position of the word in the sentence. •We then add this positional information to the embeddings and pass the result on as the input.
  • 100. Simple Positional encodings • We can add this extra information to input embeddings and pass it on as the input for attention calculation the food was good not bad 0 1 2 3 4 5 the food was bad not good 0 1 2 3 4 5 • In the first sentence, good has value-3 and bad has value-5 • In the second sentence, good has value-5 and bad has value-3
  • 101. Adding positional encodings - idea Start from the word embeddings of "the food was good not bad".
  • 102. Adding positional encodings - idea Each word gets a positional index: the=0, food=1, was=2, good=3, not=4, bad=5.
  • 103. Adding positional encodings - idea Add the positional index to the embeddings to get new position-based embeddings (for example, in the first embedding dimension: the -0.2→-0.2, food -0.2→0.8, was -1.1→0.9, good 3.2→6.2, not -0.2→3.8, bad -0.3→4.7).
  • 104. Adding positional encodings - idea •This method doesn't work. •If there are too many words in a sequence we end up with very large values; with 50 words in a sentence we would be adding 50 to the final word. •We need a slightly modified function for positional encodings.
  • 105. New formula for positional encodings: Positional Encoding PE(pos, 2i) = sin(pos / 10000^(2i/d))
  • 106. New formula for positional encodings (the same formula, applied cell by cell to the table of words and dimensions)
  • 107. In our example: d is the dimensionality of the embeddings (d = 4 here); i runs from 0 to 3; pos takes values 0 to 5.
  • 108. Positional encodings - calculation: pos = 0, i = 0, d = 4: PE(0,0) = sin(0 / 10000^(2·0/4)) = 0
  • 109. Positional encodings - calculation: pos = 0, i = 1, d = 4: PE(0,1) = sin(0 / 10000^(1/4)) = 0
  • 110. Positional encodings - calculation: pos = 1, i = 0, d = 4: PE(1,0) = sin(1 / 10000^(0/4)) = sin(1) ≈ 0.84
  • 111. Positional encodings - calculation: pos = 1, i = 1, d = 4: PE(1,1) = sin(1 / 10000^(1/4)) ≈ 0.10
  • 112. Positional encodings - calculation: pos = 1, i = 2, d = 4: PE(1,2) = sin(1 / 10000^(2/4)) ≈ 0.01
  • 113. Positional encodings - calculation: repeating this for every position and dimension fills in the whole table.
  • 114. Positional encodings - Calculation Positional encodings for all the words in sentence-1 (one row per word, four dimensions):
the (pos 0): 0, 0, 0, 0
food (pos 1): 0.84, 0.100, 0.010, 0.001
was (pos 2): 0.909, 0.199, 0.020, 0.002
good (pos 3): 0.141, 0.296, 0.030, 0.003
not (pos 4): -0.288, 0.002, 0.000, 0.000
bad (pos 5): 0.913, 0.002, 0.000, 0.000
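The sine-only encoding used in these slides can be generated directly from the formula. A small NumPy sketch: following the slides' worked substitutions, each dimension i uses the exponent i/d (the original paper alternates sine and cosine over pairs of dimensions), and the rounded values for the later positions may differ slightly from the slide's table:

import numpy as np

def positional_encoding(n_positions, d):
    pos = np.arange(n_positions)[:, None]            # shape (n_positions, 1)
    i = np.arange(d)[None, :]                        # shape (1, d)
    return np.sin(pos / np.power(10000.0, i / d))    # PE[pos, i] = sin(pos / 10000^(i/d))

PE = positional_encoding(6, 4)   # 6 words, 4-dimensional embeddings, as in the example
# The encoder input is then the word embeddings plus their positional encodings: X_input = E + PE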
  • 115. Embeddings + Positional encodings the food was good not bad the food was good not bad 0 1 2 3 4 5 0 1 2 3 4 5 -0.2 -0.2 -1.1 3.2 -0.2 -0.3 -0.2 0.6 -0.2 3.3 -0.5 0.6 -1.7 0.9 -0.9 3.7 -0.5 1.0 -1.7 1.0 -0.7 4.0 -0.5 1.0 -0.2 -0.5 0.3 2.6 0.3 -2.2 -0.2 -0.5 0.3 2.6 0.3 -2.2 -1.3 -0.8 0.1 2.5 2.6 -1.5 -1.3 -0.8 0.1 2.5 2.6 -1.5 the food was good not bad -0.2 0 -0.2 0.84 -1.1 0.909 3.2 0.141 -0.2 -0.288 -0.30 0.913 -1.7 0 0.9 0.100 -0.9 0.199 3.7 0.296 -0.5 0.002 1.00 0.002 -0.2 0 -0.5 0.010 0.3 0.020 2.6 0.030 0.3 0.000 -2.20 0.000 -1.3 0 -0.8 0.001 0.1 0.002 2.5 0.003 2.6 0.000 -1.50 0.000 Embeddings updated with positional encoders
  • 116. Sentence-2 the food was bad not good the food was bad not good 0 1 2 3 4 5 0 1 2 3 4 5 -0.2 -0.2 -1.1 -0.3 -0.2 3.2 -0.2 0.6 -0.2 -0.2 -0.5 4.1 -1.7 0.9 -0.9 1.0 -0.5 3.7 -1.7 1.0 -0.7 1.3 -0.5 3.7 -0.2 -0.5 0.3 -2.2 0.3 2.6 -0.2 -0.5 0.3 -2.2 0.3 2.6 -1.3 -0.8 0.1 -1.5 2.6 2.5 -1.3 -0.8 0.1 -1.5 2.6 2.5 the food was bad not good -0.2 0.0 -0.2 0.84 -1.1 0.909 -0.3 0.141 -0.2 -0.288 3.2 0.913 -1.7 0.0 0.9 0.100 -0.9 0.199 1.0 0.296 -0.5 0.002 3.7 0.002 -0.2 0.0 -0.5 0.010 0.3 0.020 -2.2 0.030 0.3 0.000 2.6 0.000 -1.3 0.0 -0.8 0.001 0.1 0.002 -1.5 0.003 2.6 0.000 2.5 0.000
• 117. Sentence-1 vs Sentence-2 with positional encodings
Sentence-1: "the food was good not bad" (embeddings + positional encodings)
            the    food   was    good   not    bad
  i=0      -0.2    0.6   -0.2    3.3   -0.5    0.6
  i=1      -1.7    1.0   -0.7    4.0   -0.5    1.0
  i=2      -0.2   -0.5    0.3    2.6    0.3   -2.2
  i=3      -1.3   -0.8    0.1    2.5    2.6   -1.5

Sentence-2: "the food was bad not good" (embeddings + positional encodings)
            the    food   was    bad    not    good
  i=0      -0.2    0.6   -0.2   -0.2   -0.5    4.1
  i=1      -1.7    1.0   -0.7    1.3   -0.5    3.7
  i=2      -0.2   -0.5    0.3   -2.2    0.3    2.6
  i=3      -1.3   -0.8    0.1   -1.5    2.6    2.5

Have you noticed the difference?
• 118. Sentence-1 vs Sentence-2 with positional encodings
The raw word vectors for "good" and "bad" are identical in both sentences, but after adding the positional encodings the combined vectors differ, because the two words occupy different positions. The model can now tell "the food was good not bad" apart from "the food was bad not good".
• 119. Positional encodings - Conclusion
•Word2vec → a numerical representation of the words.
•Positional encoding → a numerical representation of the position of each word in the original sentence.
•The positional encoding is an extra vector that captures the position.
•Both the Word2Vec embedding and the positional encoding are vectors of the same dimension.
•The two vectors are added element-wise and the result is passed on to the encoder.
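A short sketch of that add step (NumPy, illustrative only, reusing the toy word vectors from the slides), showing that the same bag of words placed at different positions produces different inputs for the encoder:

```python
import numpy as np

# Toy word vectors (one 4-dimensional column per word), taken from the slides.
word_vecs = {
    "the":  [-0.2, -1.7, -0.2, -1.3],
    "food": [-0.2,  0.9, -0.5, -0.8],
    "was":  [-1.1, -0.9,  0.3,  0.1],
    "good": [ 3.2,  3.7,  2.6,  2.5],
    "not":  [-0.2, -0.5,  0.3,  2.6],
    "bad":  [-0.3,  1.0, -2.2, -1.5],
}

def encode(sentence, d=4):
    """Embedding matrix (d x seq_len) plus sine-only positional encodings."""
    emb = np.array([word_vecs[w] for w in sentence.split()]).T   # shape (d, seq_len)
    pos = np.arange(emb.shape[1])
    i = np.arange(d).reshape(-1, 1)
    pe = np.sin(pos / 10000 ** (i / d))                          # shape (d, seq_len)
    return emb + pe

x1 = encode("the food was good not bad")
x2 = encode("the food was bad not good")
# Same bag of words, but the two inputs are no longer identical:
print(np.allclose(x1, x2))   # False
```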
• 120. Updated Seq-to-Seq model with Attention
(Block diagram.) Encoder side: the input tokens X1, X2, X3, …, Xn are converted to input embeddings, passed through a Multi-Head Attention block and then through a Dense layer. Decoder side: the outputs generated so far, y1, y2, y3, …, y(t-1), are converted to output embeddings, passed through a Masked Multi-Head Attention block, then through a second Multi-Head Attention block that also attends to the encoder output, then a Dense layer, and finally an output layer with SoftMax that produces y1, y2, y3, …, yt.
• 121. Updated Seq-to-Seq model with Attention
Same block diagram as the previous slide, with one addition: positional encodings are added to the input embeddings on the encoder side and to the output embeddings on the decoder side before they enter the attention blocks.
• 122. Few more practical issues – The high-risk points in the network
•When solving practical problems, deep learning models train best when the weights stay near zero, both positive and negative.
•If not, we run into problems such as unstable training, vanishing gradients and internal covariate shift.
•Some layers in the network can have a significant impact on the gradient calculation. Let's call them the "high-risk points".
• 123. The high-risk points
(Same block diagram, with the risky areas marked.) The outputs of the Multi-Head Attention blocks, the Masked Multi-Head Attention block and the Dense layers are the high-risk points: the calculations at these places can have a huge impact on the gradient values.
• 124. Fixing the practical issues – Two steps
•We will now add two extra elements to our seq-to-seq model architecture to keep the gradient flow smooth and to make the model mathematically convenient for large-data problems.
 1) Residual connections
 2) Normalization
• 125. Step-1: Extra connections
(Block diagram.) At each of the high-risk points we combine the original input of the block with the multi-head attention output: the signal is routed around the block and added back to its output.
• 126. Step-1: Extra connections
This skip-and-add path is known as a residual connection (the "Add" blocks in the diagram).
• 127. All residual connections
(Block diagram with an "Add" block after every attention block and dense layer.) Adding a residual connection around every high-risk point helps keep the gradient flowing smoothly through the network.
• 129. Step-2: Normalization
(Block diagram: every "Add" block becomes "Add & Norm".) Each residual addition is followed by a normalization step. Normalization is important for avoiding extremely high and extremely low values after multiplying with the weights.
• 130. Residual Connections and Normalization
•Imagine a seq-to-seq model with a stack of 16 encoders and 16 decoders, along with several multi-head attention blocks.
•That is a very complex function to optimize.
•Residual connections and normalization make such deep networks practical to train.
•The main idea is still multi-head attention; the remaining components are there to make the optimization mathematically tractable.
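A minimal sketch of the "Add & Norm" pattern, assuming PyTorch; the class name AddAndNorm and the tiny dimensions are illustrative, not the deck's code:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        return self.norm(x + sublayer_output)   # "Add" then "Norm"

d_model = 4                                     # embedding size from the toy example
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
add_norm = AddAndNorm(d_model)

x = torch.randn(1, 6, d_model)                  # (batch, sequence of 6 words, d_model)
attn_out, _ = attn(x, x, x)                     # self-attention over the sequence
y = add_norm(x, attn_out)                       # residual + normalization around the block
print(y.shape)                                  # torch.Size([1, 6, 4])
```

The key line is self.norm(x + sublayer_output): the residual addition keeps a direct path for the gradient around the high-risk block, and the layer normalization keeps the values in a well-behaved range.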
• 131. Full and Final Architecture – Seq to Seq Model
(Full block diagram.) Encoder side: input tokens X1…Xn → input embeddings (plus positional encodings) → Multi-Head Attention → Add & Norm → Dense layer → Add & Norm. Decoder side: previous outputs y1…y(t-1) → output embeddings (plus positional encodings) → Masked Multi-Head Attention → Add & Norm → Multi-Head Attention over the encoder output → Add & Norm → Dense layer → Add & Norm → Output layer with SoftMax → y1…yt.
• 132. Full and Final Architecture – Seq to Seq Model
The left-hand stack (input embeddings, multi-head attention with its Add & Norm, dense layer with its Add & Norm) is the Encoder.
• 133. Full and Final Architecture – Seq to Seq Model
The right-hand stack (output embeddings, masked multi-head attention, the second multi-head attention, dense layer, their Add & Norm steps, and the SoftMax output layer) is the Decoder.
• 134. Transformers
•The full and final seq-to-seq architecture we have just built is known as the transformer architecture.
•Transformers are currently the most widely used architecture for seq-to-seq problems: they process whole sequences in parallel, so they train faster than recurrent models, and they set the benchmark for accuracy on most language tasks.
•Transformers = Attention + Encoder–Decoder Architecture
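For readers who want to see the whole stack in code, here is a minimal sketch built on PyTorch's nn.Transformer (positional encodings omitted for brevity). The vocabulary size, dimensions and layer counts are arbitrary illustrations, not the deck's lab settings:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 64, 1000

src_embed = nn.Embedding(vocab_size, d_model)        # input embeddings (encoder side)
tgt_embed = nn.Embedding(vocab_size, d_model)        # output embeddings (decoder side)

# Encoder-decoder stack with multi-head attention, Add & Norm and dense layers inside.
transformer = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=128, batch_first=True,
)
to_vocab = nn.Linear(d_model, vocab_size)            # output layer (logits before SoftMax)

src = torch.randint(0, vocab_size, (1, 10))          # source token ids X1..Xn
tgt = torch.randint(0, vocab_size, (1, 7))           # target token ids y1..y(t-1)

# Causal mask so the decoder's masked multi-head attention cannot look ahead.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = to_vocab(out)                               # shape: (1, 7, vocab_size)
print(logits.shape)
```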
  • 135. Transformers Architecture diagram from the original paper
• 136. This paper is all you need – "Attention Is All You Need" (Vaswani et al., 2017), the paper that introduced the transformer architecture. (Download link in the original slides.)
  • 137. LAB – Chatbot building using transformers
  • 138. Code – Chatbot building using transformers
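The lab's actual notebook code is not included in this text export. As a stand-in, here is a minimal illustrative chat loop using the Hugging Face transformers library and the pretrained microsoft/DialoGPT-small dialogue model; both the library choice and the model are assumptions, not necessarily what the lab uses:

```python
# A minimal, illustrative chatbot loop with a pretrained transformer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

chat_history = None
for _ in range(3):                                    # three chat turns
    user_text = input("You: ")
    new_ids = tokenizer.encode(user_text + tokenizer.eos_token, return_tensors="pt")
    # Append the new user message to the running conversation history.
    input_ids = new_ids if chat_history is None else torch.cat([chat_history, new_ids], dim=-1)
    chat_history = model.generate(
        input_ids,
        max_length=200,
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(chat_history[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    print("Bot:", reply)
```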
  • 139. Code and output explanation
• 140. Google uses BERT
• Since 2019, Google Search has used Google's transformer network BERT for search queries in over 70 languages.
• Before this change, much of the information retrieval was keyword based: Google matched its crawled pages without strong contextual clues. Take the word "bank", which can mean many different things depending on the context.
• With transformer networks in Google Search, queries where words such as "from" or "to" change the meaning are understood much better. Users can search in natural English rather than adapting their query to what they think Google will understand.
• An example from Google's blog is the query "2019 brazil traveler to usa need a visa." The position of the word "to" is crucial to the correct interpretation. The previous implementation of Google Search could not pick up this nuance and returned results about US citizens travelling to Brazil, whereas the transformer model returns much more relevant pages.
• A further advantage of the transformer architecture is that learning in one language can be transferred to other languages via transfer learning: Google took the trained English model and adapted it for Google Search in other languages.
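To see this contextual behaviour directly, here is a small illustrative sketch (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; not part of the deck) that compares the BERT vectors of the word "bank" in two different sentences:

```python
# Illustrative only: a transformer (BERT) gives the word "bank" different
# vectors depending on its context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual embedding of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_for("i deposited cash at the bank", "bank")
v2 = vector_for("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0).item())      # typically well below 1.0
```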
  • 143. DataRace Android App • Comprehensive Preparation: Get ready for data science & ML interviews. • 5000+ Questions: Diverse collection for thorough practice. • Question Formats: • MCQs • Image-based • Long answers • Practice projects • Scenario-based • All-In-One Solution: Your go-to for data science & ML interview prep. • Boost Confidence: Gain proficiency and interview readiness. • Success Assurance: Increase chances of success in interviews.