Ever wondered about the full form of ChatGPT? 🤔 It stands for Chat Generative Pre-trained Transformer. For those diving into the world of Transformers, I've been using this PPT during my lectures 📚. Thought it might be handy for some of you too! Check it out and let me know what you think! 🌟
3. Background
•Transformers are used for processing sequential data, for example:
 • Language translation – one of the most frequent applications
 • Text generation – generating human-like text based on prompts (e.g., ChatGPT)
 • Chatbots and conversational agents
 • Genome sequence analysis
 • Sound wave analysis
 • Time series data analysis
•The transformer is an extension of the encoder-decoder mechanism used in sequence-to-sequence models built with LSTMs for language translation.
5. Word Embeddings – Revision
•ML and DL models cannot work with raw text data directly.
•Model parameter estimation in ML and DL involves mathematical computations.
•While working on NLP problems, we therefore need to convert our text data into numbers.
•One-hot encoding is very rudimentary and doesn't really work for large and complex NLP problems.
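A minimal Python sketch (not from the deck, assuming a toy 5-word vocabulary) of why one-hot encoding is rudimentary: every pair of distinct words looks equally unrelated.

  import numpy as np

  vocab = ["king", "queen", "man", "woman", "strong"]
  one_hot = np.eye(len(vocab))          # each word becomes a 0/1 indicator vector

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  # cosine similarity between any two different one-hot vectors is always 0,
  # so "king" is no closer to "queen" than it is to "strong"
  print(cosine(one_hot[0], one_hot[1]), cosine(one_hot[0], one_hot[4]))   # 0.0 0.0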
6. The Problem Statement
•We want to perform some analysis on text data. How do we convert text into numerical data
 • while keeping all the meaningful relations intact
 • while losing as little information as possible in the conversion?
•The idea is to convert text data into numerical data
 • with very little loss of information
 • without introducing any additional error
 • keeping all the information intact
 • preserving all the relationships in the text data.
7. Word2vec introduction
•word2vec computes vector representations for words.
•word2vec tries to convert words into numerical vectors so that similar words share similar vector representations.
•word2vec is the name of the concept; it is not a single algorithm.
•word2vec is not a deep learning technique like RNN or CNN (there is no unsupervised pre-training of layers).
8. Word2vec - Two major steps
•Word2vec comes from the idea of preserving local context.
•It has two major steps:
 • Create training samples
 • Use these samples to train the neural network model
•Word2vec creates the training samples by parsing through the data with a fixed window size (see the sketch after the next slide).
•After creating the training samples, we use a single-layer neural network to train the model.
9. Word2vec – Step 1: Create Training Samples
• We are considering a window size of 2
• Sentences: "king strong man" and "queen wise women"

  Input     Output
  king      strong
  king      man
  strong    king
  strong    man
  man       king
  man       strong
  queen     wise
  queen     women
  wise      queen
  wise      women
  women     queen
  women     wise
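For concreteness, a minimal Python sketch (not from the deck) that generates these (input, output) pairs with a fixed window size of 2:

  corpus = ["king strong man", "queen wise women"]
  window = 2

  pairs = []
  for sentence in corpus:
      words = sentence.split()
      for i, center in enumerate(words):
          # every word within the window (excluding the center word itself)
          # becomes an output word for the center (input) word
          for j in range(max(0, i - window), min(len(words), i + window + 1)):
              if j != i:
                  pairs.append((center, words[j]))

  print(pairs)
  # [('king', 'strong'), ('king', 'man'), ('strong', 'king'), ('strong', 'man'), ...]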
10. Word2vec – Step 2: Build a neural network model
• [Diagram: the 12 (input word → output context word) training pairs from the previous slide feed a single-layer neural network; the hidden layer has K units, giving K-dimensional embeddings]
11. Result of the word2vec
• The embedding for each word comes from the trained network on the previous slide.
• Example result (3-dimensional embeddings):

  king     3.248315  -0.292609   2.028029
  man      1.032173   3.037509  -1.810387
  strong   3.659783   0.865091  -1.710116
  queen   -3.151272  -2.009174   0.550185
  woman   -2.420959   0.980081  -0.391963
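In practice, word2vec is rarely implemented by hand. A hedged sketch using the gensim library (assuming it is installed; vector_size=3 mirrors the 3-dimensional table above):

  from gensim.models import Word2Vec

  sentences = [["king", "strong", "man"],
               ["queen", "wise", "women"]]

  # sg=1 selects the skip-gram variant; window=2 matches our training samples
  model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=1)

  print(model.wv["king"])                 # a 3-dimensional vector for "king"
  print(model.wv.most_similar("king"))    # nearest words by cosine similarity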
12. Word2Vec - Example
1) I can bear the pain. I am used to playing all day. That
cricket bat is very light. I opened for my team.
2) It was very dark, there was no light. When she opened
the can, she found a bat and a cricket. Suddenly, a bear
appeared behind her.
Look at the above two sentences and imagine we are working with word2vec.
13. Word2Vec - Example
1) I can bear the pain. I am used to playing all day. That
cricket bat is very light. I opened for my team.
2) It was very dark, there was no light. When she opened
the can, she found a bat and a cricket. Suddenly, a bear
appeared behind her.
I -2.0280 0.2926 1.1242 3.2483
can -1.8104 3.0375 -0.8583 1.0322
bear -1.7101 0.8651 -0.9374 3.6598
the 0.5502 -2.0092 0.0753 -3.1513
pain -0.3920 0.9801 -0.3494 -2.4210
14. Problem with word2Vec
1) I can bear the pain. I am used to playing all day. That cricket bat is very light. I opened for my team.
2) It was very dark, there was no light. When she opened the can, she found a bat and a cricket. Suddenly, a bear appeared behind her.
•Is the word "can" in the first sentence the same as the "can" in the second sentence?
•When we use Word2Vec, will we get the same vector or two different vectors?
15. Problem with word2Vec
•Similarly, is the word "light" in the first sentence the same as the "light" in the second sentence?
•When we use Word2Vec, will we get the same vector or two different vectors?
16. Problem with word2Vec
•can – "to be able to" vs. a can or tin
•light – light weight vs. light as in vision/illumination
•bear – to endure vs. bear, the animal
•cricket bat – the game and its equipment vs. the cricket (insect) and the bat (animal)
17. Problem with word2Vec
•can – "to be able to" vs. a can or tin
•light – light weight vs. light as in vision/illumination
•bear – to endure vs. bear, the animal
•cricket bat – the game and its equipment vs. the cricket (insect) and the bat (animal)
•As humans, we can understand the contextual relevance and the difference between those words. Word2Vec cannot distinguish them.
•The Word2Vec theory doesn't have any mechanism to differentiate these words based on their context.
•We need to add some more contextual relevance on top of the existing word2vec formulation.
18. Do you agree?
•For building really intelligent models, we need to weigh in some more context.
•The context preserved in Word2Vec is not sufficient for solving advanced NLP applications like
 • Intelligent chatbots
 • Search engines
 • Language translation
 • Voice assistants
 • Email filtering
•We need to preserve some more context over and above the word2Vec embeddings.
19. An example
•Consider these word embeddings.
•For simplicity of discussion, we have used rounded-off numbers.

  Sentence-2 words:                        Sentence-1 words:
  a         -0.2  -0.2  -1.1   3.2         I      -2.0   0.3   1.1   3.2
  bear      -1.7   0.9  -0.9   3.7         can    -1.8   3.0  -0.9   1.0
  appeared  -0.2  -0.5   0.3   2.6         bear   -1.7   0.9  -0.9   3.7
  behind    -1.3  -0.8   0.1   2.5         the     0.6  -2.0   0.1  -3.2
  her       -0.3   1.0  -2.2  -1.5         pain   -0.4   1.0  -0.3  -2.4
20. Word2Vec + more Contextual weights
•How do we get additional context-based weights?
•Look at the word embeddings for the words in the two sentences (table on the previous slide).
•The same embedding appears for "bear" in the two different sentences.
•The goal is to include context-based weights to modify these embeddings.
•Sending those adjusted embeddings to the final NLP model will give us better accuracy.
21. Let's focus on "bear"
•How do we make "bear" in the first sentence look different from "bear" in the second sentence?
•We are going to add extra weightage to these embeddings (using the embeddings shown on slide 19).
22. Weighted embeddings.
•Sentence-1 embeddings, one column per word (each column is that word's 4-dimensional embedding):

        I     can   bear   the   pain
       -2.0  -1.8  -1.7    0.6  -0.4
        0.3   3.0   0.9   -2.0   1.0
        1.1  -0.9  -0.9    0.1  -0.3
        3.2   1.0   3.7   -3.2  -2.4

•The "bear" column is the original embedding for "bear" in sentence-1; the next slides compute a new weighted embedding for it.
23. Weighted embeddings.
•Step 1: compute extra weights for "bear" with respect to "I", "can", "bear", "the" and "pain", using the sentence-1 embeddings shown on the previous slide.
24. Weighted embeddings.
•Bring the "I" weight into "bear": take the dot product of the "bear" embedding with the "I" embedding.
  (-1.7)*(-2) + (0.9)*(0.3) + (-0.9)*(1.1) + (3.7)*(3.2) = 14.52
25. Weighted embeddings.
•Repeating the same dot product for every word gives the weights for "bear" with respect to "I", "can", "bear", "the" and "pain":
  14.52   10.27   18.20   -14.75   -7.03
26. Weighted embeddings.
•Multiply the "I" embedding by its weight (14.52). After multiplication:
  14.52 * [-2.0, 0.3, 1.1, 3.2] = [-29.0, 4.4, 16.0, 46.5]
28. Weighted embeddings.
•Do the same for every word, then add across the words for each embedding dimension:

          I      can    bear    the    pain     sum
        -29.0  -18.5  -30.9   -8.9     2.8    -84.5
          4.4   30.8   16.4   29.5    -7.0     74.0
         16.0   -9.2  -16.4   -1.5     2.1     -9.0
         46.5   10.3   67.3   47.2    16.9    188.1
29. Weighted embeddings.
•The column of sums [-84.5, 74.0, -9.0, 188.1] is the new weighted embedding for "bear" in sentence-1.
•The original embedding for "bear" in sentence-1 was [-1.7, 0.9, -0.9, 3.7].
30. Weighted embeddings.
•Normalizing the weights so that their sum is 1 (before multiplying) helps us keep the resulting vectors small. A sketch of the whole calculation follows below.
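The walkthrough for "bear" can be reproduced with a few lines of NumPy. A minimal sketch (dot products as weights, no normalization yet, exactly as on slides 24-29):

  import numpy as np

  words = ["I", "can", "bear", "the", "pain"]
  E = np.array([[-2.0,  0.3,  1.1,  3.2],   # I
                [-1.8,  3.0, -0.9,  1.0],   # can
                [-1.7,  0.9, -0.9,  3.7],   # bear
                [ 0.6, -2.0,  0.1, -3.2],   # the
                [-0.4,  1.0, -0.3, -2.4]])  # pain

  bear = E[2]

  # weights of "bear" with respect to every word = dot products
  weights = E @ bear              # approx. [14.52, 10.27, 18.20, -14.75, -7.03]

  # multiply each word's embedding by its weight and add across the words
  new_bear = weights @ E          # approx. [-84.5, 74.0, -9.0, 188.1]

  print(np.round(weights, 2))
  print(np.round(new_bear, 1))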
32. Weighted embeddings – Sentence-2
•Sentence-2 embeddings, one column per word:

        a     bear  appeared  behind   her
       -0.2  -1.7   -0.2      -1.3    -0.3
       -0.2   0.9   -0.5      -0.8     1.0
       -1.1  -0.9    0.3       0.1    -2.2
        3.2   3.7    2.6       2.5    -1.5

•The "bear" column is the original embedding for "bear" in sentence-2.
•We now compute the new weighted embedding for "bear" in sentence-2. Will it be the same as in sentence-1 or different? What do you think?
33. Weighted embeddings – Sentence-2
•Weights for "bear" with respect to "a", "bear", "appeared", "behind" and "her":
  12.99   18.20   9.24   10.65   -2.16
•Multiply each word's embedding by its weight and add across the words:

          a      bear   appeared  behind   her      sum
         -2.6   -30.9   -1.8     -13.8     0.6    -48.6
         -2.6    16.4   -4.6      -8.5    -2.2     -1.5
        -14.3   -16.4    2.8       1.1     4.8    -22.1
         41.6    67.3   24.0      26.6     3.2    162.8

•The column of sums [-48.6, -1.5, -22.1, 162.8] is the new weighted embedding for "bear" in sentence-2.
34. Weighted embeddings – Sentence-1 vs 2
•The plain word2vec embedding of "bear" is identical in both sentences:
  Sentence-1: bear [-1.7, 0.9, -0.9, 3.7]      Sentence-2: bear [-1.7, 0.9, -0.9, 3.7]
36. Weighted embeddings – Sentence-1 vs 2
•We now have two new embeddings for the same word. While training the model, we will use a different embedding for each sentence.
•These new weighted embeddings give greater importance to the specific contexts, and they will significantly improve the model accuracy.

  Word embedding (bear):           Sentence-1: [-1.7, 0.9, -0.9, 3.7]       Sentence-2: [-1.7, 0.9, -0.9, 3.7]
  Weighted embedding (bear, new):  Sentence-1: [-84.5, 74.0, -9.0, 188.1]   Sentence-2: [-48.6, -1.5, -22.1, 162.8]
37. Weighted embeddings procedure
•Have you understood the formula for calculating the new weighted embeddings?
•The procedure that we followed until now is known as the attention mechanism.
•And that's it... that's all you need to improve the accuracy of the model: "Attention is all you need".
•Let's go through attention once again.
39. Attention Terminology
•In this operation, the vector of the word we are focusing on ("bear": [-1.7, 0.9, -0.9, 3.7]) is called a Query (Q). Each word in the sentence gets its own query.
40. Attention Terminology
•The vectors of all the words in the sentence ("I", "can", "bear", "the", "pain") that the query is compared against are called Keys (K).
41. Attention Terminology
•We then multiply the weights again with the word embeddings; the vectors used in this second multiplication are known as Values (V). This produces the weighted contribution columns computed on slides 26-28.
42. Attention Terminology
•The weights (14.52, 10.27, 18.20, -14.75, -7.03) are called scores. The scores are normalized so that their sum becomes 1.
48. Attention Terminology
•Putting it together: inputs [V1, V2, V3, ..., Vn] → Q, K, V → matmul(Q, K) → normalize → matmul with V → outputs [Y1, Y2, Y3, ..., Yn]
•Note: only one output vector Yk calculation was shown here. We can do the same for all 'n' vectors.
49. Attention - Block
•Inputs [V1, V2, V3, ..., Vn] → Q, K, V → matmul → normalize → matmul → outputs [Y1, Y2, Y3, ..., Yn]
•This whole operation is called an attention block, more specifically a self-attention block.
50. Do you Agree ?
•Attention
• Is nothing but weighted embeddings only
• Smart embeddings
• Advanced version of word2vec
• More contextualized embeddings
• With a different matrix calculation formula.
51. Trainable Parameters?
•Are there any trainable parameters in the above process?
•Are we training anything, or are we simply doing multiplications without any trainable weight parameters?
•Until now there are NO trainable parameters.
•We need to bring trainable weights into this procedure so that it can learn from the data.
•Each of the Q, K and V vectors needs to be multiplied by a weight matrix: WQ, WK and WV.
52. Add Trainable Parameters
•Inputs [V1, V2, V3, ..., Vn] → WQ, WK, WV → Q, K, V → matmul → normalize → matmul → outputs [Y1, Y2, Y3, ..., Yn]
•Multiplying by a weight matrix is the same as adding a linear layer, in neural-network terminology.
53. Add Trainable Parameters
•While training the model, we can treat this attention block as a linear embedding layer and train its parameters (WQ, WK, WV) based on the error signals.
54. Attention formula
  Attention(Q, K, V) = softmax(Q·Kᵀ)·V
•In the block diagram: inputs [V1, ..., Vn] → WQ, WK, WV → Q, K, V → matmul → normalize → matmul → [Y1, Y2, Y3, ..., Yn] (a sketch follows below).
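A minimal NumPy sketch of this formula with trainable projection matrices WQ, WK, WV (random placeholders here; note that the original paper additionally divides the scores by sqrt(d_k) before the softmax):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def self_attention(X, W_Q, W_K, W_V):
      Q = X @ W_Q                     # queries
      K = X @ W_K                     # keys
      V = X @ W_V                     # values
      scores = Q @ K.T                # one score per (query word, key word) pair
      weights = softmax(scores)       # normalize so each row sums to 1
      return weights @ V              # weighted sum of values = new embeddings

  # toy example: 5 words ("I can bear the pain"), embedding dimension 4
  X = np.array([[-2.0, 0.3,  1.1,  3.2],
                [-1.8, 3.0, -0.9,  1.0],
                [-1.7, 0.9, -0.9,  3.7],
                [ 0.6, -2.0, 0.1, -3.2],
                [-0.4, 1.0, -0.3, -2.4]])
  rng = np.random.default_rng(0)
  W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
  Y = self_attention(X, W_Q, W_K, W_V)   # shape (5, 4): Y1 ... Yn
  print(Y.shape)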
55. Attention – Actual Figure from the paper
•[Figure from the paper, showing the same flow: Q, K, V (via WQ, WK, WV) → matmul → normalize → matmul → Y1, Y2, Y3, ..., Yn]
56. Stacked attention blocks/layers
•We can also stack multiple self-attention blocks in the model ([V1, ..., Vn] → attention block → attention block → [Y1, ..., Yn]). This can further improve the accuracy.
57. Self Attention visualization
• Many thanks to - Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved
from https://jalammar.github.io/illustrated-transformer/ Simulation and Colab code
58. Several parallel contexts
• I can bear the pain. I am used to playing all day. That cricket bat is very
light. I opened for my team.
• It was very dark, there was no light. When she opened the can, she found a
bat and a cricket. Suddenly, a bear appeared behind her.
• There are several parallel contexts present here
• Sports - cricket game
• Bat in sports - sports equipment
• Present and past events
• Bat/ cricket – insects
• Bear – Animals
• playing, opening, appearing – Actions
• These are the contexts from just two documents. For large and diverse datasets there will be many more contexts, and we need many more parallel attention stacks.
59. Single Head Attention.
•Until now, we have discussed a single attention stack, either with a single attention block or with multiple stacked blocks.
•At the end we have a single set of vectors [Y1, Y2, Y3, ..., Yn].
•This is known as single-head attention.
•For a large and diverse dataset, we may have several parallel contexts.
•We may have to include parallel attention stacks to learn these parallel contexts (multi-head attention, sketched below).
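A hedged sketch of a multi-head (parallel) attention layer using Keras, assuming TensorFlow is installed; the shapes and head count are illustrative only.

  import numpy as np
  import tensorflow as tf

  x = np.random.rand(1, 10, 64).astype("float32")   # (batch, words, embedding dim)

  mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

  # self-attention: query, key and value all come from the same sequence
  y = mha(query=x, value=x, key=x)
  print(y.shape)   # (1, 10, 64): one new contextualized vector per word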
67. Self attention - Limitations
•Until now we discussed self-attention with a single head and with multiple heads.
•In self-attention, a word looks at the surrounding words in the same sentence to calculate the attention.
•Self-attention is useful for many-to-one kinds of models, for example
 • Sentiment Analysis
 • Document Classification
•We need a slightly more complex architecture for many-to-many or sequence-to-sequence models, for example
 • Machine Translation
 • Chatbots
68. Sequence to Sequence Models.
•Before attention, we used LSTMs or RNNs for sequence-to-sequence models.
•We used an Encoder LSTM for the input sequence and a Decoder LSTM for the output sequence.
•Let's revisit the LSTM/RNN based seq-to-seq model that uses simple word embeddings.
70. Encoder and Decoder – Revision
•Language translation is an example of a sequence-to-sequence model.
•This model has two components:
 • An Encoder LSTM for understanding the structure of the input language sequence
 • A Decoder LSTM for translating the text into the output language
71. Encoder and Decoder – Revision
•[Diagram: input sequence x1 ... xt → Encoder LSTM → Thought Vector → Decoder LSTM → output sequence y1 ... yt]
72. Encoder and Decoder – Revision
•Encoder:
• Takes the input sequence and creates a new vector based on the whole input
sequence. The resultant vector from encoder is also known as thought vector or
context vector
73. Encoder and Decoder – Revision
•Thought Vector / Context Vector:
 • The overall crux of the input sequence is compressed into one vector. This will be the only input to the decoder.
74. Encoder and Decoder – Revision
•Decoder:
• Decoder takes thought vector as input and generates a logical translated sequence.
75. Encoder and Decoder – Revision

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, Dense

model = Sequential()
model.add(Embedding(input_dim=30000, output_dim=256, input_length=20))  # Word2Vec-style embeddings
model.add(LSTM(128))                                   # Encoder
model.add(RepeatVector(20))                            # Thought vector, repeated for each output step
model.add(LSTM(128, return_sequences=True))            # Decoder
model.add(Dense(30000, activation='softmax'))          # Output layer over the vocabulary
76. LAB: LSTM – Seq to Seq Model
•From English to Spanish
77. Code: LSTM – Seq to Seq Model
•From English to Spanish
78. Encoder and Decoder issues
•The above procedure uses standard embeddings.
•Since we generally use LSTM or GRU, we can achieve decent accuracy.
•Less accuracy – for large and diverse datasets, we may not get very high accuracy, due to the limitations of standard embeddings.
•Slow – for large and diverse datasets, we need very complex models with very deep architectures, and parallel processing is difficult in sequential models.
79. Attention + Encoder and Decoder
•We will now include multi-head attention inside the Encoder and Decoder.
•High accuracy – for large and diverse datasets, the multi-head attention mechanism captures more relevant contexts, which increases the accuracy.
•Speed – since multi-head attention is nothing but parallel attention channels, we can easily use GPUs and perform parallel processing.
80. Attention + Encoder and Decoder
•To solve sequence-to-sequence problems, we need to add three different types of attention.
•For example, suppose we are building a machine translation model from English to Spanish:
 • We need an attention mechanism in the input language (English) to understand the contexts in English.
 • We need an attention mechanism in the output language (Spanish) to understand the word relations in Spanish.
 • The third attention is the most important one: the word relations and contextual relevance from English to Spanish.
81. English to Spanish Translation Model
•[Diagram: input x1 ... xt with Self-Attention in English; output y1 ... yt with Self-Attention in Spanish; English-to-Spanish Attention connecting the two]
82. English to Spanish Translation Model
•[Same diagram in general terms: Self-Attention in the Encoder, Self-Attention in the Decoder, and Encoder-to-Decoder Attention connecting them]
83. Attention + Encoder and Decoder
•[Diagram: input embeddings X1, X2, X3, ..., Xn → Multi-Head Attention on the encoder side; the decoder side has its own Self-Attention plus Encoder-to-Decoder Attention, producing y1 ... yt]
85. What is Masked Multi-Head Attention?
Example: English (Encoder) "A nearly perfect photo" → Spanish (Decoder) "Una foto casi perfecta"
•Encoder self-attention: while calculating the attention for the word "nearly", do we consider all the rest of the words? Yes – we get the whole input in one shot, and at both training and test time we have access to all the input words. Encoder self-attention doesn't need any special changes.
•Decoder self-attention: while calculating the attention for the word "casi", can we consider all the rest of the words? No – while predicting the word at time t we can only use the words up to time t-1, because at test time we will NOT have access to the future words. The decoder attention calculation therefore needs to be masked from the future words.
86. Masked Attention scores calculation
•Decoder embeddings ("Una foto casi perfecta"):
  Una       -0.2  -0.3   0.3   0.7
  foto      -0.4  -0.3   0.6  -0.2
  casi      -0.1   1.7  -0.4  -0.5
  perfecta  -0.8  -0.1  -0.9   1.4
•While calculating the attention scores for "Una" we should not consider the future words, so "foto", "casi" and "perfecta" are masked out. Score for "Una": -0.34.
•Weighted contributions (Una, foto, casi, perfecta) and sum:
  0.07  0.00  0.00  0.00   0.07
  0.14  0.00  0.00  0.00   0.14
  0.03  0.00  0.00  0.00   0.03
  0.26  0.00  0.00  0.00   0.26
87. Masked Attention scores calculation
•While calculating the attention scores for "foto" we should not consider the future words "casi" and "perfecta". Scores: 0.30 (Una), 1.28 (foto); the future words are masked out.
•Weighted contributions (Una, foto, casi, perfecta) and sum:
  -0.06  -0.40  0.00  0.00   -0.47
  -0.12  -0.42  0.00  0.00   -0.55
  -0.02   2.21  0.00  0.00    2.18
  -0.22  -0.10  0.00  0.00   -0.32
88. Masked Attention scores calculation
•The same applies to the attention score calculation for the word "casi". Scores: -0.29 (Una), -1.11 (foto), 1.58 (casi); "perfecta" is masked out.
•Weighted contributions (Una, foto, casi, perfecta) and sum:
  0.06   0.35   0.42  0.00    0.83
  0.12   0.36   0.94  0.00    1.43
  0.02  -1.91  -0.55  0.00   -2.44
  0.22   0.09  -1.42  0.00   -1.11
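A minimal NumPy sketch (not from the deck) of this look-ahead masking, applied before the normalization step; the scores here are random placeholders.

  import numpy as np

  words = ["Una", "foto", "casi", "perfecta"]
  scores = np.random.rand(4, 4)                  # raw attention scores, one row per word

  # look-ahead mask: position (i, j) is allowed only if j <= i
  mask = np.tril(np.ones((4, 4), dtype=bool))
  masked_scores = np.where(mask, scores, -1e9)   # future words get a huge negative score

  # softmax row by row: masked positions end up with (almost) zero weight
  e = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
  weights = e / e.sum(axis=-1, keepdims=True)

  print(np.round(weights, 2))   # upper triangle is ~0: "Una" cannot see "foto", etc.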
89. The Third Attention – "Inter-Attention"
•[Diagram: Input embeddings X1, X2, X3, ..., Xn → Multi-Head Attention (encoder); Output embeddings y1, y2, y3, ..., yt-1 → Masked Multi-Head Attention (decoder); an Encoder-to-Decoder Attention connects the two to produce y1 ... yt]
90. Attention + Encoder and Decoder
•[Diagram: Input embeddings → Multi-Head Attention (encoder); Output embeddings → Masked Multi-Head Attention (decoder); a further Multi-Head Attention combines them to produce y1, y2, y3, ..., yt]
•This third block is also a multi-head attention, but with a different set of Queries and Values.
91. Q, K, V for Self (Intra) Attention
•In a self-attention block, Q, K and V are all taken from the same set of vectors.
•Example, English sentence "A nearly perfect photo":

  Embeddings (used as Q, K and V):
  A        -0.7   0.3   0.2   0.6
  nearly   -1.5   1.5  -0.2   0.8
  perfect  -1.0   0.8  -0.8   2.8
  photo     0.5  -1.7   0.0  -0.5

  Scores for "perfect":  1.72  -4.43   0.42  -3.77
  Weighted contributions (A, nearly, perfect, photo) and sum:
   -1.1   -1.2    0.1   -2.3    -4.54
   -2.6   -6.5   -0.1   -3.1   -12.29
   -1.7   -3.4   -0.4  -10.5   -15.90
    0.9    7.4    0.0    2.0    10.29
92. Self Attention / Intra Attention
•Until now we discussed self-attention, or intra-attention, within the encoder.
•For example, when we are translating from English to Spanish:
 • Self-attention inside the encoder captures the word relations within the English language: Q, K and V are all from English.
 • Self-attention inside the decoder captures the word relations within the Spanish language: Q, K and V are all from Spanish.
93. Attention / Inter-Attention
•For example, when we are translating from English to Spanish:
 • Inter-attention inside the decoder captures the word relations between the decoder and the encoder.
 • Q – the Queries will be from Spanish (Decoder).
 • K – the Keys and V – the Values will be from English (Encoder).
94. Attention / Inter-Attention - Calculation
•Queries (Q) are from the decoder (Spanish):
  Una       -0.1   0.0   0.0   0.5
  foto      -0.6   0.7  -0.1   0.8
  casi      -0.5   0.4  -0.2   0.8
  perfecta   0.1  -0.8   0.0  -0.2
•Keys (K) and Values (V) are from the encoder (English), one column per word:
        A     nearly  perfect  photo
        0.3   -0.6     0.1      0.2
       -2.3    2.1     0.8      1.6
       -1.6    1.8    -1.3      2.4
        0.9   -2.0    -0.5     -0.1
•Example for the query "casi": scores against the English words: -0.15  -0.66   0.12   0.07
•Weighted contributions (A, nearly, perfect, photo) and sum:
  -0.1   0.4   0.0   0.0    0.39
   0.4  -1.4   0.1   0.1   -0.79
   0.2  -1.2  -0.2   0.2   -0.90
  -0.1   1.3  -0.1   0.0    1.09
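A hedged Keras sketch of this inter- (cross-) attention, assuming TensorFlow is installed; the tensors are random placeholders standing in for the encoder and decoder sequences.

  import numpy as np
  import tensorflow as tf

  enc = np.random.rand(1, 4, 64).astype("float32")   # encoder output: "A nearly perfect photo"
  dec = np.random.rand(1, 4, 64).astype("float32")   # decoder states:  "Una foto casi perfecta"

  cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

  # Q from the decoder (Spanish), K and V from the encoder (English)
  y = cross_attention(query=dec, value=enc, key=enc)
  print(y.shape)   # (1, 4, 64): one context-aware vector per decoder position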
96. Attention + Dense Layers for learning
•Until now, we have included only attention blocks.
•But eventually, these attention blocks only give us better embeddings.
•We need to add at least a couple of dense layers on top of these intelligent embeddings.
•These dense layers help us learn the patterns in the data.
•Dense layers are also known as hidden layers or feed-forward layers.
97. Final Seq-to-Seq model with Attention
•[Diagram: Input embeddings X1, ..., Xn → Multi-Head Attention → Dense Layer (encoder side); Output embeddings y1, ..., yt-1 → Masked Multi-Head Attention → Multi-Head (encoder-decoder) Attention → Dense Layer → Output Layer with SoftMax → y1, y2, y3, ..., yt (decoder side)]
98. Preserving the sequential order
•Questions: In word2vec do we preserve the sequential order? In
attention are we preserving the sequential order?
•Answer: Nope, we are not. Context is decided based on the
surrounding words, not necessarily by retaining the sequence.
1. “the food was good, not bad”
2. “the food was bad, not good”
Self attention treats both the sentences the same way. The order and
the position of “good” and “bad” are very important in NLP.
99. Positional encodings
•How do we include the information about the sequential order or the
position of the sentence?
•By using “Positional encodings”
•It is a very simple concept. We will create a new vector based on the
relative position of the word in a sentence.
•We will now add the positional information to the embeddings and
pass it on as input
100. Simple Positional encodings
• We can add this extra information to input embeddings and pass it on as the input for
attention calculation
the food was good not bad
0 1 2 3 4 5
the food was bad not good
0 1 2 3 4 5
• In the first sentence, good has value-3 and bad has value-5
• In the second sentence, good has value-5 and bad has value-3
101. Adding Positional encodings - idea
•Embeddings of the sentence "the food was good not bad", one column per word:

        the   food   was   good   not   bad
       -0.2  -0.2   -1.1   3.2   -0.2  -0.3
       -1.7   0.9   -0.9   3.7   -0.5   1.0
       -0.2  -0.5    0.3   2.6    0.3  -2.2
       -1.3  -0.8    0.1   2.5    2.6  -1.5
102. Adding Positional encodings - idea
•Each word also gets a positional index:

        the   food   was   good   not   bad
  pos    0     1      2     3      4     5
  (embeddings as on the previous slide)
103. Adding Positional encodings - idea
•Add the positional index (0, 1, 2, 3, 4, 5) to every dimension of the word's embedding to get new position-based embeddings:

        the   food   was   good   not   bad
       -0.2   0.8    0.9   6.2    3.8   4.7
       -1.7   1.9    1.1   6.7    3.5   6.0
       -0.2   0.5    2.3   5.6    4.3   2.8
       -1.3   0.2    2.1   5.5    6.6   3.5
104. Adding Positional encodings - idea
•This method doesn't work:
 • If there are too many words inside a sequence, we end up adding very large values.
 • If there are 50 words in a sentence, we will be adding 50 to the final word's embedding.
•We need a slightly modified function for the positional encodings.
105. New formula Positional encodings

  Positional Encoding:  PE(pos, 2i) = sin(pos / 10000^(2i/d))

        the   food   was   good   not   bad
  pos    0     1      2     3      4     5
       -0.2  -0.2   -1.1   3.2   -0.2  -0.3
       -1.7   0.9   -0.9   3.7   -0.5   1.0
       -0.2  -0.5    0.3   2.6    0.3  -2.2
       -1.3  -0.8    0.1   2.5    2.6  -1.5

•For each word (pos) and each embedding dimension we will compute a positional encoding value and add it to the embedding.
107. New formula Positional encodings

  PE(pos, 2i) = sin(pos / 10000^(2i/d))

        the   food   was   good   not   bad
  pos    0     1      2     3      4     5
  i=0  -0.2  -0.2   -1.1   3.2   -0.2  -0.3
  i=1  -1.7   0.9   -0.9   3.7   -0.5   1.0
  i=2  -0.2  -0.5    0.3   2.6    0.3  -2.2
  i=3  -1.3  -0.8    0.1   2.5    2.6  -1.5

In our example:
• d – the dimensionality of the embeddings; d = 4 here
• i – indexes the embedding dimensions and runs from 0 to 3
• pos – takes values 0 to 5
108. Positional encodings - Calculation
•pos = 0 ("the"), first dimension, d = 4:
  PE(0, 0) = sin(0 / 10000^(0/4)) = sin(0) = 0
109. Positional encodings - Calculation
•pos = 0 ("the"), second dimension, d = 4:
  PE(0, 1) = sin(0 / 10000^(1/4)) = sin(0) = 0
•For pos = 0, every dimension of the positional encoding is 0.
110. Positional encodings - Calculation
•pos = 1 ("food"), first dimension, d = 4:
  PE(1, 0) = sin(1 / 10000^(0/4)) = sin(1) ≈ 0.84
111. Positional encodings - Calculation
•pos = 1 ("food"), second dimension, d = 4:
  PE(1, 1) = sin(1 / 10000^(1/4)) = sin(0.1) ≈ 0.10
112. Positional encodings - Calculation
•pos = 1 ("food"), third dimension, d = 4:
  PE(1, 2) = sin(1 / 10000^(2/4)) = sin(0.01) ≈ 0.01
114. Positional encodings - Calculation
•Continuing the same calculation gives the positional encodings for all the words in sentence-1:

        the    food    was     good    not      bad
  PE    0      0.84    0.909   0.141   -0.288   0.913
        0      0.100   0.199   0.296    0.002   0.002
        0      0.010   0.020   0.030    0.000   0.000
        0      0.001   0.002   0.003    0.000   0.000
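A minimal NumPy sketch of the sin-based positional encoding used in this walkthrough (the original paper alternates sin and cos across the dimensions); it reproduces the pos = 0 and pos = 1 columns above.

  import numpy as np

  def positional_encoding(num_positions, d):
      pe = np.zeros((d, num_positions))
      for pos in range(num_positions):
          for k in range(d):
              # dimension k of position pos gets sin(pos / 10000**(k/d))
              pe[k, pos] = np.sin(pos / 10000 ** (k / d))
      return pe

  # 6 words ("the food was good not bad"), embedding dimension 4
  print(np.round(positional_encoding(6, 4), 3))
  # pos=0 column is all 0; pos=1 column is approximately [0.841, 0.100, 0.010, 0.001]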
115. Embeddings + Positional encodings
•Adding the positional encodings to the embeddings gives the updated, position-aware embeddings:

  Original embeddings                            Embeddings updated with positional encodings
        the   food   was   good   not   bad            the   food   was   good   not   bad
       -0.2  -0.2   -1.1   3.2   -0.2  -0.3           -0.2   0.6   -0.2   3.3   -0.5   0.6
       -1.7   0.9   -0.9   3.7   -0.5   1.0           -1.7   1.0   -0.7   4.0   -0.5   1.0
       -0.2  -0.5    0.3   2.6    0.3  -2.2           -0.2  -0.5    0.3   2.6    0.3  -2.2
       -1.3  -0.8    0.1   2.5    2.6  -1.5           -1.3  -0.8    0.1   2.5    2.6  -1.5
116. Sentence-2
•The same procedure for sentence-2, "the food was bad not good":

  Original embeddings                            Embeddings updated with positional encodings
        the   food   was   bad    not   good           the   food   was   bad    not   good
       -0.2  -0.2   -1.1  -0.3   -0.2   3.2            -0.2   0.6   -0.2  -0.2   -0.5   4.1
       -1.7   0.9   -0.9   1.0   -0.5   3.7            -1.7   1.0   -0.7   1.3   -0.5   3.7
       -0.2  -0.5    0.3  -2.2    0.3   2.6            -0.2  -0.5    0.3  -2.2    0.3   2.6
       -1.3  -0.8    0.1  -1.5    2.6   2.5            -1.3  -0.8    0.1  -1.5    2.6   2.5
117. Sentence-1 vs Sentence-2 with positional encodings

  Sentence-1: the food was good not bad          Sentence-2: the food was bad not good
        0     1     2     3     4     5                0     1     2     3     4     5
       -0.2   0.6  -0.2   3.3  -0.5   0.6             -0.2   0.6  -0.2  -0.2  -0.5   4.1
       -1.9   1.0  -0.7   4.0  -0.5   1.0             -1.9   1.0  -0.7   1.3  -0.5   3.7
       -1.2  -0.5   0.3   2.6   0.3  -2.2             -1.2  -0.5   0.3  -2.2   0.3   2.6
       -1.5  -0.8   0.1   2.5   2.6  -1.5             -1.5  -0.8   0.1  -1.5   2.6   2.5

Have you noticed the difference? The vectors for "good" and "bad" now differ between the two sentences, because their positions differ.
119. Positional encodings - Conclusion
•Word2vec gives a numerical representation of the words.
•Positional encoding gives a numerical representation of the position of each word in the original sentence.
•Positional encodings are an extra vector to capture the position.
•Both the Word2Vec embedding and the positional encoding are vectors.
•These two vectors are added and passed on to the encoder.
122. Few more practical issues – The high risk points in the network
•While solving practical problems, DL models prefer weights that stay near zero, both positive and negative.
•If not, we face multiple problems like unstable training, vanishing gradients, internal covariate shift, etc.
•There are some layers in the network that can have a significant impact on the gradient calculation. Let's call them the "high risk points".
123. The high risk points
•[Diagram: the seq-to-seq architecture with the high risk points marked at the outputs of the Multi-Head Attention, Masked Multi-Head Attention, encoder-decoder Multi-Head Attention and the Dense Layers]
•These are risky areas: the calculations at these places have a huge impact on the gradient values.
124. Fixing the practical issues – Two steps
•We will now add two extra elements to our seq-to-seq model architecture to keep the gradient flow smooth and to make the model mathematically convenient for large data problems (a sketch follows below):
1) Residual connection
2) Normalization
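A hedged Keras sketch of both steps wrapped around a single attention block, assuming TensorFlow is installed; shapes are illustrative only.

  import numpy as np
  import tensorflow as tf

  x = np.random.rand(1, 10, 64).astype("float32")      # block input
  attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
  layer_norm = tf.keras.layers.LayerNormalization()

  block_output = attention(query=x, value=x, key=x)

  # 1) residual connection: add the original input back to the block output
  # 2) normalization: keep the resulting values in a small, stable range
  y = layer_norm(block_output + x)
  print(y.shape)   # (1, 10, 64)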
125. Step-1: Extra connections
•[Same diagram, with the high risk points marked]
•Combine the original input with the multi-head attention output at these points.
126. Step-1: Extra connections
•[Diagram: an "Add" step combines each block's input with its output]
•This is known as a residual connection.
127. All residual connections
•[Diagram: "Add" residual connections around the Multi-Head Attention, Masked Multi-Head Attention, encoder-decoder Attention and the Dense Layers]
•Residual connections help us keep a smooth flow of the gradient.
129. Step-2: Normalization
•[Diagram: each "Add" becomes "Add & Norm"]
•Normalization is important for avoiding extremely high and extremely low values after multiplying with the weights.
130. Residual Connections and Normalization
•Imagine a seq-to-seq model with a stack of 16 encoders and 16 decoders along with several multi-head attentions.
•It is going to be a very complex optimization problem.
•Residual connections and normalization are helpful when training such deep networks in practice.
•The main idea is multi-head attention; the remaining components in the network are there to make the problem mathematically solvable.
131. Full and Final Architecture
Seq to Seq Model
•[Diagram: the complete seq-to-seq model.
  Encoder: Input embeddings X1, ..., Xn → Multi-Head Attention → Add & Norm → Dense Layer → Add & Norm.
  Decoder: Output embeddings y1, ..., yt-1 → Masked Multi-Head Attention → Add & Norm → Multi-Head (encoder-decoder) Attention → Add & Norm → Dense Layer → Add & Norm → Output Layer with SoftMax → y1, y2, y3, ..., yt]
134. Transformers
•The full and final seq-to-seq model architecture that we have seen above is known as the transformer architecture.
•Transformers are among the fastest, most accurate and most widely used architectures for solving seq-to-seq problems.
• Transformers = Attention + Encoder & Decoder Architecture
140. Google uses BERT
• Since 2019, Google Search has used Google's transformer neural network BERT for search queries in over 70 languages.
• Prior to this change, a lot of information retrieval was keyword based, meaning Google checked its
crawled sites without strong contextual clues. Take the example word ‘bank’, which can have many
meanings depending on the context.
• The introduction of transformer neural networks to Google Search means that queries where words
such as ‘from’ or ‘to’ affect the meaning are better understood by Google. Users can search in more
natural English rather than adapting their search query to what they think Google will understand.
• An example from Google’s blog is the query “2019 brazil traveler to usa need a visa.” The position of
the word ‘to’ is very important for the correct interpretation of the query. The previous implementation
of Google Search was not able to pick up this nuance and returned results about USA citizens
traveling to Brazil, whereas the transformer model returns much more relevant pages.
• A further advantage of the transformer architecture is that learning in one language can be transferred
to other languages via transfer learning. Google was able to take the trained English model and adapt
it easily for the other languages’ Google Search.
143. DataRace Android App
• Comprehensive Preparation: Get ready for data science &
ML interviews.
• 5000+ Questions: Diverse collection for thorough practice.
• Question Formats:
• MCQs
• Image-based
• Long answers
• Practice projects
• Scenario-based
• All-In-One Solution: Your go-to for data science & ML
interview prep.
• Boost Confidence: Gain proficiency and interview readiness.
• Success Assurance: Increase chances of success in
interviews.