© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
An introduction to the Transformers
architecture and BERT
Suman Debnath
Principal Developer Advocate
Amazon Web Services, India
- Text Classification
- Text Summarization
- Q&A with the goal of helping our agents and members
• Natural language processing is a sub-field of {linguistics, computer science}
• Human-generated language is complex for computers to understand and interpret
• In NLP, the goal is to make computers understand this complex language structure and retrieve
meaningful pieces of information from it.
• Some of the NLP use cases
• Text Classification
• Speech Recognition
• Text Summarization
• Topic Modelling
• Question Answering
Natural Language Processing
Problem to solve?
How can we build a mathematical representation of language that can help solve all these different use cases?
Evolution of NLP algorithms
• Word2Vec (2013): simple neural network; predicts a word based on the context window of the other words in the sentence
• GloVe (2014): Global Vectors for Word Representation; based on matrix factorization
• FastText (2015): extension of Word2Vec; each word is treated as a set of sub-words
• Transformer (2017): "Attention Is All You Need"
• BERT (2018): "Pre-training of Deep Bidirectional Transformers for Language Understanding"
How the transformer works (language translation task)
[Diagram: the source sentence "I am good" goes into the Encoder, which produces a Representation; the Decoder uses this representation to generate the translation "je vais bien"]
The encoder of the transformer
• Stack of N encoders
• The output of one encoder is sent as input to the encoder above it
Questions?
• How exactly does the encoder work?
• How is it generating the representation for the given source sentence (input sentence)?
[Diagram: "I am good" → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer N → Representation]
How exactly does the encoder work?
• All the encoder blocks are identical
• Each encoder block consists of two sublayers:
  • Multi-head attention
  • Feedforward network (FFN)
[Diagram: within each encoder layer, the input "I am good" flows through the multi-head attention sublayer and then the FFN sublayer; the output of the top layer is the Representation]
Self-attention mechanism
“A dog ate the food because it was hungry”
How exactly does this work?
The embedding of the word I is :
x1 = [1.76, 2.22, … ,6.66]
The embedding of the word am is :
x2 = [7.77, 0.631, … ,5.35]
The embedding of the word good is :
x3 = [11.44, 10.10, … ,3.33]
I am good
*let the embedding dimension be 512
3 new matrices: {query (Q), key (K), value (V)}
• Q, K, and V are obtained by multiplying the input embeddings with three weight matrices WQ, WK, WV
• The weight matrices WQ, WK, WV are randomly initialized and learned during training
• The first rows of Q, K, and V are the query, key, and value vectors of the word "I"
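To make this concrete, here is a minimal NumPy sketch of deriving Q, K, and V; the embedding values and weight matrices are random stand-ins for the learned parameters, not the real ones:

```python
import numpy as np

d_model, d_k = 512, 64          # embedding size and query/key/value size
np.random.seed(0)

# Toy embedding matrix X for "I am good": one 512-dim row per word
X = np.random.rand(3, d_model)

# Weight matrices WQ, WK, WV: randomly initialized, learned during training
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_k)

# Query, key, and value matrices: one row per word
Q = X @ W_Q    # shape (3, 64); row 0 is the query vector of "I"
K = X @ W_K    # shape (3, 64); row 0 is the key vector of "I"
V = X @ W_V    # shape (3, 64); row 0 is the value vector of "I"
```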
Why are we computing this?
What is the use of query, key, and value matrices?
How is this going to help us?
4-step process
• Step 1: "Dot product" between the query matrix, Q, and the transpose of the key matrix, KT
• Step 2: "Divide" the result by the square root of the dimension of the key vector (*here the dimension of the key vector (dk) is 64)
• Step 3: "Normalize" the scores by applying the softmax function, e.g. the word "I" is related to:
  - itself by 90%
  - am by 7%
  - good by 3%
• Step 4: the final step in the self-attention mechanism is to compute the "attention matrix"
Self-attention of the word "I" is computed as the sum of the value vectors weighted by the scores
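The four steps map directly onto a few lines of NumPy; a self-contained sketch with random stand-in Q, K, and V matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability, then normalize rows to probabilities
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                # Step 1: dot product between Q and K^T
    scores = scores / np.sqrt(d_k)  # Step 2: divide by sqrt(d_k)
    weights = softmax(scores)       # Step 3: normalize the scores with softmax
    Z = weights @ V                 # Step 4: attention matrix = score-weighted sum of value vectors
    return Z, weights

# Stand-in Q, K, V for the 3 words "I", "am", "good" (d_k = 64)
rng = np.random.default_rng(0)
Q, K, V = (rng.random((3, 64)) for _ in range(3))
Z, weights = self_attention(Q, K, V)
print(weights[0])   # how strongly "I" attends to "I", "am", "good" (each row sums to 1)
```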
Multi-head attention mechanism
• Instead of having a single attention head, we can use multiple attention heads
• This is particularly useful in circumstances where the meaning of the actual word is ambiguous, e.g.
  "A dog ate the food because it was hungry"
• Multi-head attention = Concatenation(Z1, Z2, Z3, …, Zi, …, Z8) · W0
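A sketch of the multi-head version with 8 heads, matching the Concatenation(Z1 … Z8) · W0 formula above; all weights here are random stand-ins for the learned parameters:

```python
import numpy as np

def multi_head_attention(X, num_heads=8, d_k=64):
    """Run num_heads independent self-attention heads, concatenate the Z_i, project with W0."""
    rng = np.random.default_rng(0)
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head has its own (randomly initialized) W_Q, W_K, W_V
        W_Q, W_K, W_V = (rng.random((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = (Q @ K.T) / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the scores
        heads.append(weights @ V)                          # Z_i for this head
    W_0 = rng.random((num_heads * d_k, d_model))           # output projection
    return np.concatenate(heads, axis=-1) @ W_0            # Concatenation(Z1..Z8) · W0

X = np.random.default_rng(1).random((3, 512))   # toy embeddings for "I am good"
print(multi_head_attention(X).shape)            # (3, 512)
```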
Positional encoding
• Before being fed to the encoder, the embedding of each word is summed with a positional encoding vector: input embedding + positional encoding
• But how is the positional encoding generated?
[Figure: the positional encoding for the 30th word, a vector of values such as -1, -0.25, 0.91, -0.25, …]
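The original "Attention Is All You Need" paper generates these values with fixed sine and cosine functions of the token position; assuming that scheme, a sketch that produces a vector like the one shown for the 30th word:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(max_len=50)
print(pe[30, :4])   # first few values of the positional encoding for the 30th word

# The encoder input is then simply: input embedding + positional encoding
```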
Let's recall
[Diagram: the encoder as seen so far ("I am good" → multi-head attention → FFN → Representation), now updated so that the positional encoding is added to the input embeddings before they enter the multi-head attention sublayer]
Add and norm component
[Diagram: Encoder block — "I am good" + positional encoding → multi-head attention → Add & Norm → FFN → Add & Norm]
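As a sketch of what "Add & Norm" means in code, here is a minimal PyTorch encoder block, assuming the sizes used so far (d_model = 512, 8 heads) and the feedforward size of 2048 from the original paper; this is illustrative, not the exact implementation used in the demo:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention sublayer
        x = self.norm1(x + attn_out)       # Add & Norm: residual connection + layer normalization
        ffn_out = self.ffn(x)              # feedforward network sublayer
        x = self.norm2(x + ffn_out)        # Add & Norm again
        return x

x = torch.rand(1, 3, 512)                  # batch of 1 sentence, 3 tokens ("I am good")
print(EncoderBlock()(x).shape)             # torch.Size([1, 3, 512])
```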
Putting it all together
[Diagram: "I am good" → input embedding + positional encoding → Encoder 1 (multi-head attention → Add & Norm → FFN → Add & Norm) → Encoder 2 → …]
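For reference, PyTorch ships this block ready-made; a sketch of stacking N identical encoder layers (N = 6 here, matching the original paper, purely as an example):

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head attention + Add & Norm + FFN + Add & Norm
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# The full encoder is N identical layers stacked on top of each other
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.rand(1, 3, 512)       # input embeddings + positional encoding for "I am good"
representation = encoder(x)     # output of the top encoder layer
print(representation.shape)     # torch.Size([1, 3, 512])
```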
BERT (Bidirectional Encoder Representations from Transformers)
[Diagram: the sentence "Python is my favorite programming language" passes through a stack of encoders (Encoder 1 → Encoder 2 → … → Encoder N); the output is a contextual representation R for every token: RPython Ris Rmy Rfavorite Rprogramming Rlanguage]
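A sketch of getting these per-token representations R from a pre-trained BERT with the Hugging Face transformers library; the bert-base-uncased checkpoint is an assumption here, any BERT checkpoint works the same way:

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Python is my favorite programming language", return_tensors="pt")
outputs = model(**inputs)

# One 768-dim contextual representation R per token ([CLS] and [SEP] included)
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 8, 768])
```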
How BERT works? Pre-training tasks
• BERT is pre-trained on two tasks:
  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)
• BERT was trained to perform these two tasks purely as a way to force it to develop a sophisticated understanding of language, e.g.
  "Kolkata is a beautiful city. I love Kolkata" (with "city" crossed out)
• Here's what BERT is supposed to do:
  • MLM: predict the crossed-out word (correct answer is "city")
  • NSP: was sentence B found immediately after sentence A, or did it come from somewhere else? (correct answer is that they are consecutive)
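The MLM behaviour can be checked directly; a small sketch with the Hugging Face fill-mask pipeline (the checkpoint name is an assumption, and the exact scores depend on the model):

```python
from transformers import pipeline

# Masked Language Model: ask BERT to predict the crossed-out word
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Kolkata is a beautiful [MASK]. I love Kolkata."):
    print(prediction["token_str"], round(prediction["score"], 3))
# "city" should rank among the top predictions
```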
How BERT works? Pre-training tasks

INPUT: [CLS] Kolkata is a beautiful [MASK] [SEP] I love Kolkata [SEP]

Token Embeddings:    E[CLS] EKolkata Eis Ea Ebeautiful E[MASK] E[SEP] EI Elove EKolkata E[SEP]
Segment Embeddings:  EA     EA       EA  EA EA         EA      EA     EB EB    EB       EB
Position Embeddings: E0     E1       E2  E3 E4         E5      E6     E7 E8    E9       E10

These three embeddings are summed and passed through Encoder Layer 1, Encoder Layer 2, …, Encoder Layer 12.

OUTPUT (Enhanced Embedding): R[CLS] RKolkata Ris Ra Rbeautiful R[MASK] R[SEP] RI Rlove RKolkata R[SEP]
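The token and segment inputs above are exactly what a BERT tokenizer produces for a sentence pair; a sketch (position ids are added implicitly by the model, and the exact word pieces depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sentence A and sentence B, with the [MASK] token in sentence A
encoded = tokenizer("Kolkata is a beautiful [MASK]", "I love Kolkata")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'kolkata', 'is', 'a', 'beautiful', '[MASK]', '[SEP]', 'i', 'love', 'kolkata', '[SEP]']
# (words missing from the vocabulary are split into sub-word pieces)

print(encoded["token_type_ids"])
# segment ids: 0 for the sentence A tokens, 1 for the sentence B tokens
```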
How BERT works? Pre-training & fine-tuning
INPUT: [CLS] Suman loves Kolkata [SEP]
The input passes through Encoder Layer 1, Encoder Layer 2, …, Encoder Layer 12.
OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
How BERT works? Pre-training & fine-tuning (sentiment analysis)
INPUT: [CLS] Suman loves Kolkata [SEP] → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12 → OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
The R[CLS] representation is fed to an FFN + Softmax head, which outputs the sentiment probabilities: Positive (.9), Negative (.1)
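A sketch of that sentiment setup: a BERT encoder with a small classification head over R[CLS], using Hugging Face's sequence-classification wrapper; the head here is freshly initialized, so it would still need fine-tuning on labeled data before the probabilities mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# BERT encoder + an (untrained) FFN classification head applied to the [CLS] representation
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Suman loves Kolkata", return_tensors="pt")
logits = model(**inputs).logits            # shape (1, 2)
probs = torch.softmax(logits, dim=-1)      # e.g. [positive, negative] after fine-tuning
print(probs)
```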
How BERT works? Pre-training & fine-tuning (named entity recognition)
INPUT: [CLS] Suman loves Kolkata [SEP] → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12 → OUTPUT (Enhanced Embedding): R[CLS] RSuman Rloves RKolkata R[SEP]
• Sentiment analysis: R[CLS] → FFN + Softmax → Positive (.9), Negative (.1)
• Named entity recognition: each token representation → Classifier → e.g. RSuman → PERSON, RKolkata → LOCATION
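For named entity recognition the same encoder outputs are reused, but a classifier is applied to every token representation Ri instead of only R[CLS]; a sketch with the Hugging Face token-classification (NER) pipeline, which downloads a default fine-tuned checkpoint (its entity labels, e.g. PER/LOC, may differ from the slide's PERSON/LOCATION):

```python
from transformers import pipeline

# Token classification: every output representation R_i gets an entity label
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Suman loves Kolkata"):
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))
# expected along the lines of:  Suman PER ...   Kolkata LOC ...
```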
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
Thank you!
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Suman Debnath
Principal Developer Advocate
Amazon Web Services, India
