© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
An introduction to the Transformers
architecture and BERT
Suman Debnath
Principal Developer Advocate
Amazon Web Services, India
- Text Classification
- Text Summarization
- Q&A with the goal of helping our agents and members
• Natural language processing is a sub-field of {linguistics, computer science}
• Human-generated language is complex for computers to understand and interpret
• In NLP, the goal is to make computers understand this complex language structure and retrieve
meaningful pieces of information from it.
• Some of the NLP use cases
• Text Classification
• Speech Recognition
• Text Summarization
• Topic Modelling
• Question Answering
Natural Language Processing
Problem to solve?
How can we build a mathematical representation of language that can help solve all these different use cases?
Evolution of NLP algorithms
• Word2Vec (2013): simple neural network; predicts a word based on the context window of the other words in the sentence
• GloVe (2014): Global Vectors for Word Representation; based on matrix factorization
• FastText (2015): extension of Word2Vec; each word is treated as a set of sub-words
• Transformer (2017): "Attention Is All You Need"
• BERT (2018): "Pre-training of Deep Bidirectional Transformers for Language Understanding"
How the transformer works (language translation task)
[Diagram: the source sentence "I am good" goes into the Encoder, which produces a Representation; the Decoder uses this representation to generate the translation "je vais bien"]
The encoder of the transformer
• Stack of N encoders
• The output of one encoder is sent as input to the encoder above it
Questions?
• How exactly does the encoder work?
• How is it generating the representation for the given source sentence (input sentence)?
[Diagram: "I am good" → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer N → Representation]
How exactly does the encoder work?
• All the encoder blocks are identical
• Each encoder block consists of two sublayers:
  • Multi-head attention
  • Feedforward network (FFN)
[Diagram: within each encoder layer, the input "I am good" flows through the multi-head attention sublayer and then the FFN sublayer; the output of the top layer is the Representation]
Self-attention mechanism
“A dog ate the food because it was hungry”
How exactly does this work?
The embedding of the word I is :
x1 = [1.76, 2.22, … ,6.66]
The embedding of the word am is :
x2 = [7.77, 0.631, … ,5.35]
The embedding of the word good is :
x3 = [11.44, 10.10, … ,3.33]
I am good
*let the embedding dimension be 512
3 new matrices: {query (Q), key (K), value (V)}
• Q, K, and V are obtained by multiplying the input embeddings with three weight matrices WQ, WK, WV
• The weight matrices WQ, WK, WV are randomly initialized and learned during training
• The first rows of Q, K, and V are the query, key, and value vectors of the word "I"
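To make this concrete, here is a minimal NumPy sketch of deriving Q, K, and V; the embedding values and weight matrices are random stand-ins for the learned parameters, not the real ones:

```python
import numpy as np

d_model, d_k = 512, 64          # embedding size and query/key/value size
np.random.seed(0)

# Toy embedding matrix X for "I am good": one 512-dim row per word
X = np.random.rand(3, d_model)

# Weight matrices WQ, WK, WV: randomly initialized, learned during training
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_k)

# Query, key, and value matrices: one row per word
Q = X @ W_Q    # shape (3, 64); row 0 is the query vector of "I"
K = X @ W_K    # shape (3, 64); row 0 is the key vector of "I"
V = X @ W_V    # shape (3, 64); row 0 is the value vector of "I"
```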
Why are we computing this?
What is the use of query, key, and value matrices?
How is this going to help us?
4-step process
• Step 1: "Dot product" between the query matrix, Q, and the transpose of the key matrix, KT
• Step 2: "Divide" the result by the square root of the dimension of the key vector (*here the dimension of the key vector (dk) is 64)
• Step 3: "Normalize" the scores by applying the softmax function, e.g. the word "I" is related to:
  - itself by 90%
  - am by 7%
  - good by 3%
• Step 4: the final step in the self-attention mechanism is to compute the "attention matrix"
Self-attention of the word "I" is computed as the sum of the value vectors weighted by the scores
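The four steps map directly onto a few lines of NumPy; a self-contained sketch with random stand-in Q, K, and V matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability, then normalize rows to probabilities
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                # Step 1: dot product between Q and K^T
    scores = scores / np.sqrt(d_k)  # Step 2: divide by sqrt(d_k)
    weights = softmax(scores)       # Step 3: normalize the scores with softmax
    Z = weights @ V                 # Step 4: attention matrix = score-weighted sum of value vectors
    return Z, weights

# Stand-in Q, K, V for the 3 words "I", "am", "good" (d_k = 64)
rng = np.random.default_rng(0)
Q, K, V = (rng.random((3, 64)) for _ in range(3))
Z, weights = self_attention(Q, K, V)
print(weights[0])   # how strongly "I" attends to "I", "am", "good" (each row sums to 1)
```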
Multi-head attention mechanism
• Instead of having a single attention head, we can use multiple attention heads
• This is particularly useful in circumstances where the meaning of the actual word is ambiguous, e.g.
  "A dog ate the food because it was hungry"
• Multi-head attention = Concatenation(Z1, Z2, Z3, …, Zi, …, Z8) · W0
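A sketch of the multi-head version with 8 heads, matching the Concatenation(Z1 … Z8) · W0 formula above; all weights here are random stand-ins for the learned parameters:

```python
import numpy as np

def multi_head_attention(X, num_heads=8, d_k=64):
    """Run num_heads independent self-attention heads, concatenate the Z_i, project with W0."""
    rng = np.random.default_rng(0)
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head has its own (randomly initialized) W_Q, W_K, W_V
        W_Q, W_K, W_V = (rng.random((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = (Q @ K.T) / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the scores
        heads.append(weights @ V)                          # Z_i for this head
    W_0 = rng.random((num_heads * d_k, d_model))           # output projection
    return np.concatenate(heads, axis=-1) @ W_0            # Concatenation(Z1..Z8) · W0

X = np.random.default_rng(1).random((3, 512))   # toy embeddings for "I am good"
print(multi_head_attention(X).shape)            # (3, 512)
```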
Positional encoding
• Before being fed to the encoder, the embedding of each word is summed with a positional encoding vector: input embedding + positional encoding
• But how is the positional encoding generated?
[Figure: the positional encoding for the 30th word, a vector of values such as -1, -0.25, 0.91, -0.25, …]
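The original "Attention Is All You Need" paper generates these values with fixed sine and cosine functions of the token position; assuming that scheme, a sketch that produces a vector like the one shown for the 30th word:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(max_len=50)
print(pe[30, :4])   # first few values of the positional encoding for the 30th word

# The encoder input is then simply: input embedding + positional encoding
```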
Let's recall
[Diagram: the encoder as seen so far ("I am good" → multi-head attention → FFN → Representation), now updated so that the positional encoding is added to the input embeddings before they enter the multi-head attention sublayer]
Add and norm component
[Diagram: Encoder block — "I am good" + positional encoding → multi-head attention → Add & Norm → FFN → Add & Norm]
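As a sketch of what "Add & Norm" means in code, here is a minimal PyTorch encoder block, assuming the sizes used so far (d_model = 512, 8 heads) and the feedforward size of 2048 from the original paper; this is illustrative, not the exact implementation used in the demo:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention sublayer
        x = self.norm1(x + attn_out)       # Add & Norm: residual connection + layer normalization
        ffn_out = self.ffn(x)              # feedforward network sublayer
        x = self.norm2(x + ffn_out)        # Add & Norm again
        return x

x = torch.rand(1, 3, 512)                  # batch of 1 sentence, 3 tokens ("I am good")
print(EncoderBlock()(x).shape)             # torch.Size([1, 3, 512])
```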
Putting it all together
[Diagram: "I am good" → input embedding + positional encoding → Encoder 1 (multi-head attention → Add & Norm → FFN → Add & Norm) → Encoder 2 → …]
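For reference, PyTorch ships this block ready-made; a sketch of stacking N identical encoder layers (N = 6 here, matching the original paper, purely as an example):

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head attention + Add & Norm + FFN + Add & Norm
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# The full encoder is N identical layers stacked on top of each other
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.rand(1, 3, 512)       # input embeddings + positional encoding for "I am good"
representation = encoder(x)     # output of the top encoder layer
print(representation.shape)     # torch.Size([1, 3, 512])
```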
BERT (Bidirectional Encoder Representations from Transformers)
[Diagram: the sentence "Python is my favorite programming language" passes through a stack of encoders (Encoder 1 → Encoder 2 → … → Encoder N); the output is a contextual representation R for every token: RPython Ris Rmy Rfavorite Rprogramming Rlanguage]
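A sketch of getting these per-token representations R from a pre-trained BERT with the Hugging Face transformers library; the bert-base-uncased checkpoint is an assumption here, any BERT checkpoint works the same way:

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Python is my favorite programming language", return_tensors="pt")
outputs = model(**inputs)

# One 768-dim contextual representation R per token ([CLS] and [SEP] included)
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 8, 768])
```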
How BERT works? Pre-training tasks
• BERT is pre-trained on two tasks:
  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)
• BERT was trained to perform these two tasks purely as a way to force it to develop a sophisticated understanding of language, e.g.
  "Kolkata is a beautiful city. I love Kolkata" (with "city" crossed out)
• Here's what BERT is supposed to do:
  • MLM: predict the crossed-out word (correct answer is "city")
  • NSP: was sentence B found immediately after sentence A, or did it come from somewhere else? (correct answer is that they are consecutive)
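The MLM behaviour can be checked directly; a small sketch with the Hugging Face fill-mask pipeline (the checkpoint name is an assumption, and the exact scores depend on the model):

```python
from transformers import pipeline

# Masked Language Model: ask BERT to predict the crossed-out word
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Kolkata is a beautiful [MASK]. I love Kolkata."):
    print(prediction["token_str"], round(prediction["score"], 3))
# "city" should rank among the top predictions
```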
How BERT works? Pre-training tasks

INPUT: [CLS] Kolkata is a beautiful [MASK] [SEP] I love Kolkata [SEP]

Token Embeddings:    E[CLS] EKolkata Eis Ea Ebeautiful E[MASK] E[SEP] EI Elove EKolkata E[SEP]
Segment Embeddings:  EA     EA       EA  EA EA         EA      EA     EB EB    EB       EB
Position Embeddings: E0     E1       E2  E3 E4         E5      E6     E7 E8    E9       E10

These three embeddings are summed and passed through Encoder Layer 1, Encoder Layer 2, …, Encoder Layer 12.

OUTPUT (Enhanced Embedding): R[CLS] RKolkata Ris Ra Rbeautiful R[MASK] R[SEP] RI Rlove RKolkata R[SEP]
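The token and segment inputs above are exactly what a BERT tokenizer produces for a sentence pair; a sketch (position ids are added implicitly by the model, and the exact word pieces depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sentence A and sentence B, with the [MASK] token in sentence A
encoded = tokenizer("Kolkata is a beautiful [MASK]", "I love Kolkata")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'kolkata', 'is', 'a', 'beautiful', '[MASK]', '[SEP]', 'i', 'love', 'kolkata', '[SEP]']
# (words missing from the vocabulary are split into sub-word pieces)

print(encoded["token_type_ids"])
# segment ids: 0 for the sentence A tokens, 1 for the sentence B tokens
```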
How BERT works? Pre-training & fine-tuning
INPUT: [CLS] Suman loves Kolkata [SEP]
The input passes through Encoder Layer 1, Encoder Layer 2, …, Encoder Layer 12.
OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
How BERT works? Pre-training & fine-tuning (sentiment analysis)
INPUT: [CLS] Suman loves Kolkata [SEP] → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12 → OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
The R[CLS] representation is fed to an FFN + Softmax head, which outputs the sentiment probabilities: Positive (.9), Negative (.1)
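A sketch of that sentiment setup: a BERT encoder with a small classification head over R[CLS], using Hugging Face's sequence-classification wrapper; the head here is freshly initialized, so it would still need fine-tuning on labeled data before the probabilities mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# BERT encoder + an (untrained) FFN classification head applied to the [CLS] representation
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Suman loves Kolkata", return_tensors="pt")
logits = model(**inputs).logits            # shape (1, 2)
probs = torch.softmax(logits, dim=-1)      # e.g. [positive, negative] after fine-tuning
print(probs)
```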
How BERT works? Pre-training & fine-tuning (named entity recognition)
INPUT: [CLS] Suman loves Kolkata [SEP] → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12 → OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
• Sentiment analysis: R[CLS] → FFN + Softmax → Positive (.9), Negative (.1)
• Named entity recognition: each token representation → Classifier → e.g. RSuman → PERSON, RKolkata → LOCATION
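For named entity recognition the same encoder outputs are reused, but a classifier is applied to every token representation Ri instead of only R[CLS]; a sketch with the Hugging Face token-classification (NER) pipeline, which downloads a default fine-tuned checkpoint (its entity labels, e.g. PER/LOC, may differ from the slide's PERSON/LOCATION):

```python
from transformers import pipeline

# Token classification: every output representation R_i gets an entity label
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Suman loves Kolkata"):
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))
# expected along the lines of:  Suman PER ...   Kolkata LOC ...
```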
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
Thank you!
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Suman Debnath
Principal Developer Advocate
Amazon Web Services, India
