Transformers for Natural Language Processing
Dr. Ash Pahwa
©2022 Dr. Ash Pahwa
Bio: Dr. Ash Pahwa
 Ph.D. Computer Science
 Website: www.AshPahwa.com
 Affiliation
 California Institute of Technology, Pasadena
 UC Irvine, UCLA, UCSD, Chapman
 Field of Expertise
 Machine Learning, Deep Learning, Digital Image Processing,
Database Management, CD-ROM/DVD
 Worked for
 General Electric, AT&T Bell Laboratories, Oracle, UC Santa Barbara
Outline
1. What are Transformers?
2. Transformers – Applications: GPT+BERT
3. Problem – Context Sensitive Embeddings
1. Bank Word Embeddings
4. Transformer Architecture
1. Word2Vec Embeddings
2. Positional Encoding
3. Self Attention
4. Feed Forward Neural Network
What are Transformers?
 Transformers are a new (2017) family of deep learning neural network architectures
 They solve problems experienced by the RNN (Recurrent Neural Network) architecture
 The Transformer architecture contains
 Encoder
 Decoder
 Primary application: translation
Attention is All You Need
Google Research: NIPS 2017
Transformers Applications
BERT and GPT
 Google: BERT’s model architecture is based on the Encoder of the Transformer
 OpenAI: GPT’s model architecture is based on the Decoder of the Transformer
OpenAI – GPT: Generative Pre-trained Transformer
GPT-1 Paper: 2018 + GPT-2 Paper: 2019 + GPT-3 Paper: 2020
 GPT-2: Language Models Are Unsupervised Multitask Learners, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
 GPT-3: Language Models are Few-Shot Learners, by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al.
Bi-Directional Transformer
BERT: a Sesame Street character
 ELMo: 2018: Bi-directional RNN/LSTM
 Embeddings from Language Models (ELMo): Univ. of Washington
 2018: BERT from Google
 Bi-directional Encoder Representations from Transformers
 Based on the Transformer Encoder
 BERT: the word vector for the same word differs from sentence 1 to sentence 2
Derivative Architecture from
BERT
GPT & BERT Applications
 GPT:
 Machine Translation
 Text Generation
 BERT
 Context sensitive enhanced word
embeddings
 Used in Google search engine
Copyright 2021 - Dr. Ash Pahwa 10
Synonymy & Polysemy
 Synonymy refers to cases where two different words have the same meaning
 Cars & Automobile
 Polysemy refers to cases where the same word has different meaning based on
the context
 Example
 I banked on my husband; he was about to drop me to the bank. He got
late and I wanted to take a cab but there was a taxi strike. I ended up
driving my husband’s vehicle. It was showing low fuel warning, I had to go
to gas station to refill, by the time I reached the bank, car parking was full.
 Synonymy:
 cab, taxi
 vehicle, car
 fuel, gas
 Polysemy
 bank, bank
Copyright 2021 - Dr. Ash Pahwa 11
Test Your English Language Skills
 I went to a “bank” to deposit money. What is the meaning of the word ‘bank’? (Answer: A)
 I went to a “bank” of a river to take a walk. What is the meaning of the word ‘bank’? (Answer: B)
How Do Humans Interpret Language?
 Humans read the whole sentence to interpret the meaning of a word
 I went to a “bank” to deposit money
 I went to a “bank” of a river to take a walk
How Does BERT Interpret Language?
 BERT uses the self-attention method
 I went to a “bank” to deposit money
 I went to a “bank” of a river to take a walk
Self Attention
 Bank embedding: a bunch of floating-point numbers
 Self attention separates the two senses into distinct embeddings:
 Bank-1: financial institution
 Bank-2: bank of a river
Transformer Architecture
1. Word Embeddings
2. Positional Embeddings (Excel + Python)
3. Self-Attention
4. Masking
5. Train the Neural Network
Word Embeddings
Word Embeddings: Word2Vec: Google Research
Mikolov + Chen + Corrado + Dean: ICLR 2013
Example: Vector Math:
King – Man + Woman = Queen
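A minimal sketch of this vector math, assuming gensim and its pretrained word2vec-google-news-300 vectors (a large one-time download):

```python
# Sketch: king - man + woman ~= queen with pretrained word2vec vectors.
# Assumes gensim is installed; the model download is ~1.6 GB.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top match: ('queen', ~0.71)
```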
Positional Encoding
Position of a Word in a Sentence: BERT
 In any sequence data (as in NLP), the position of a word is important
Need for Positional Encoding
 We feed all the words into the model at once, in parallel
 Hence the need for positional encoding
 Positional encoding represents the order of the words in a sentence
 Advantages
 Decrease in training time
 Learns long-term dependencies between words
Positional Encoding

$PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$
Positional Encoding: Excel
Positional Encoding Value
 $d_{model} = 10$
 Position: 0 – 5
 Value of ‘i’: 0 – 9

$PE\_Value_{(pos,\,2i)} = \dfrac{pos}{10000^{2i/d_{model}}}$
Sine and Cosine of the Values

$PE\_Value_{(pos,\,2i)} = \dfrac{pos}{10000^{2i/d_{model}}}$

$PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$
Positional Encoding: Python
Positional Embedding: Implementation in Python
 Load libraries
 Define constants
 $d_{model} = 10$
 Position: 0 – 5
 Value of ‘i’: 0 – 9
Positional Vector Value

$PE\_Value_{(pos,\,2i)} = \dfrac{pos}{10000^{2i/d_{model}}}$
Sine and Cosine Values

$PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$
Positional Encoding in Python
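The slide's code appears as an image; a minimal numpy sketch of the same computation, assuming $d_{model} = 10$ and positions 0 – 5 as above:

```python
# Sketch of the positional encoding defined by the formulas above.
import numpy as np

d_model = 10
n_positions = 6

pos = np.arange(n_positions)[:, np.newaxis]   # column of positions 0..5
i = np.arange(d_model)[np.newaxis, :]         # row of dimension indices 0..9

# PE_Value(pos, 2i) = pos / 10000^(2i / d_model); i // 2 pairs sin/cos dims
values = pos / np.power(10000.0, 2 * (i // 2) / d_model)

pe = np.zeros((n_positions, d_model))
pe[:, 0::2] = np.sin(values[:, 0::2])   # even dimensions: sine
pe[:, 1::2] = np.cos(values[:, 1::2])   # odd dimensions: cosine
print(pe.round(3))
```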
Heat-Map: Plot of Positional Encoding (d_model = 10, positions 0 – 5)

Heat-Map: Plot of Positional Encoding (d_model = 512, positions 0 – 100)
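A sketch of the heat-map plot, assuming matplotlib and the `pe` array from the previous snippet; for the second plot, rerun with d_model = 512 and 100 positions:

```python
# Heat-map of the positional encoding matrix (rows = positions).
import matplotlib.pyplot as plt

plt.pcolormesh(pe, cmap="RdBu")
plt.xlabel("embedding dimension (i)")
plt.ylabel("position (pos)")
plt.colorbar(label="PE value")
plt.title("Positional encoding: d_model = 10, positions 0-5")
plt.show()
```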
Self Attention
Transformer Architecture
[Figure: two stacked encoders. Text + positional embeddings enter Encoder 1 (multi-head attention followed by a feedforward network); its output enters Encoder 2 (self attention followed by a feedforward network), which outputs the modified text embeddings.]
Embeddings for all the words of a sentence
Sentence: I am good
Suppose each word embedding contains 512 floating-point numbers
X matrix dimension = 3 x 512
Matrix X: embeddings of all the words of the sentence

X     | 1     | 2     | … | 512
I     | 1.76  | 2.22  | … | 6.66
am    | 7.77  | 0.631 | … | 5.35
good  | 11.44 | 10.10 | … | 3.33
Weight Matrices
 Initialized randomly
 Learned by training

W_Q: 512 x 64   W_K: 512 x 64   W_V: 512 x 64
Create Q, K, V Vectors

Q = X·W_Q = (3 x 512) * (512 x 64) = 3 x 64
K = X·W_K = (3 x 512) * (512 x 64) = 3 x 64
V = X·W_V = (3 x 512) * (512 x 64) = 3 x 64

(X is the 3 x 512 embedding matrix above; W_Q, W_K, W_V are the 512 x 64 weight matrices.)
Create Q, K, V Vectors

Q     | 1     | 2     | … | 64
I     | 3.69  | 7.42  | … | 4.44
am    | 11.11 | 7.07  | … | 76.7
good  | 99.3  | 3.69  | … | 0.85

K     | 1     | 2     | … | 64
I     | 5.31  | 6.78  | … | 0.96
am    | 11.71 | 0.86  | … | 11.31
good  | 10.10 | 11.44 | … | 5.11

V     | 1     | 2     | … | 64
I     | 67.85 | 91.2  | … | 0.13
am    | 13.13 | 63.1  | … | 4.44
good  | 12.12 | 96.1  | … | 43.4
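A minimal sketch of these projections for the 3-word sentence; X and the weight matrices are random here for illustration (in a real model the weights are learned, and the slide's numbers are examples):

```python
# Sketch: project the 3 x 512 sentence matrix into Q, K, V (each 3 x 64).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))      # one 512-dim embedding per word
W_Q = rng.normal(size=(512, 64))   # query projection weights
W_K = rng.normal(size=(512, 64))   # key projection weights
W_V = rng.normal(size=(512, 64))   # value projection weights

Q = X @ W_Q    # (3, 64)
K = X @ W_K    # (3, 64)
V = X @ W_V    # (3, 64)
print(Q.shape, K.shape, V.shape)
```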
Self Attention: Step-1
 Compute Q·Kᵀ
 Dimensions
 Q = 3 x 64
 K = 3 x 64
 Kᵀ = 64 x 3
 Q·Kᵀ = 3 x 3
Self Attention Matrix = Q·Kᵀ
 The Q·Kᵀ matrix displays the self-attention data
 It shows how strongly the words are related to each other
 Q·Kᵀ = 3 x 3

Q·Kᵀ  | I   | am  | good
I     | 110 | 90  | 80
am    | 70  | 99  | 70
good  | 90  | 70  | 100
Word Relations
 The word ‘I’ is most strongly related to the word ‘I’
 The word ‘am’ is most strongly related to the word ‘am’
 The word ‘good’ is most strongly related to the word ‘good’

Q·Kᵀ  | I   | am  | good
I     | 110 | 90  | 80
am    | 70  | 99  | 70
good  | 90  | 70  | 100

 I went to a “bank” to deposit money: what is the meaning of the word ‘bank’?
 I went to a “bank” of a river to take a walk: what is the meaning of the word ‘bank’?
Step-2
 Normalize the self-attention matrix
 Compute $\dfrac{QK^T}{\sqrt{\text{dimension of key vector}}} = \dfrac{QK^T}{\sqrt{64}}$

Q·Kᵀ  | I   | am  | good
I     | 110 | 90  | 80
am    | 70  | 99  | 70
good  | 90  | 70  | 100

Q·Kᵀ/√d | I               | am              | good
I       | 110/√64 = 13.75 | 90/√64 = 11.25  | 80/√64 = 10
am      | 70/√64 = 8.75   | 99/√64 = 12.375 | 70/√64 = 8.75
good    | 90/√64 = 11.25  | 70/√64 = 8.75   | 100/√64 = 12.5
Step-3: Normalize + Softmax
 Apply the softmax function to each row of the normalized matrix

Q·Kᵀ/√d | I               | am              | good
I       | 110/√64 = 13.75 | 90/√64 = 11.25  | 80/√64 = 10
am      | 70/√64 = 8.75   | 99/√64 = 12.375 | 70/√64 = 8.75
good    | 90/√64 = 11.25  | 70/√64 = 8.75   | 100/√64 = 12.5

softmax(Q·Kᵀ/√d) | I     | am   | good
I                | 0.90  | 0.07 | 0.03
am               | 0.025 | 0.95 | 0.025
good             | 0.21  | 0.03 | 0.76
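As a quick sanity check, the softmax of the ‘I’ row can be computed directly (the slide's entries are rounded approximations):

```python
# Softmax of the scaled scores for the word "I".
import numpy as np

row = np.array([13.75, 11.25, 10.0])
e = np.exp(row - row.max())           # subtract max for numerical stability
print((e / e.sum()).round(3))         # ~[0.904 0.074 0.021]
```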
Step-4: Attention Matrix $Z = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d}}\right)V$
 Dimensions of Z = (3 x 64)

softmax(Q·Kᵀ/√d) | I     | am   | good
I                | 0.90  | 0.07 | 0.03
am               | 0.025 | 0.95 | 0.025
good             | 0.21  | 0.03 | 0.76

V     | 1     | 2    | … | 64
I     | 67.85 | 91.2 | … | 0.13
am    | 13.13 | 63.1 | … | 4.44
good  | 12.12 | 96.1 | … | 43.4

V = X·W_V = (3 x 512) * (512 x 64) = 3 x 64
Step-4: Attention Matrix $Z = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d}}\right)V$
Each output row is a softmax-weighted sum of the rows of V:
 z1 = 0.90·[67.85, 91.2, …, 0.13] + 0.07·[13.13, 63.1, …, 4.44] + 0.03·[12.12, 96.1, …, 43.4]
 z2 = 0.025·[67.85, 91.2, …, 0.13] + 0.95·[13.13, 63.1, …, 4.44] + 0.025·[12.12, 96.1, …, 43.4]
 z3 = 0.21·[67.85, 91.2, …, 0.13] + 0.03·[13.13, 63.1, …, 4.44] + 0.76·[12.12, 96.1, …, 43.4]
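Steps 1 – 4 amount to scaled dot-product attention; a minimal end-to-end sketch, using random Q, K, V matrices in place of the slide's example values:

```python
# Sketch of steps 1-4: scaled dot-product attention for a 3-word sentence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 64)) for _ in range(3))

scores = Q @ K.T                          # Step 1: (3 x 3) word relations
scaled = scores / np.sqrt(K.shape[-1])    # Step 2: divide by sqrt(64) = 8
weights = softmax(scaled, axis=-1)        # Step 3: each row sums to 1
Z = weights @ V                           # Step 4: (3 x 64) attention output
print(Z.shape)                            # (3, 64)
```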
Masking Approach
 Key words of every sentence are masked
 The neural network predicts the masked word
 After training, the following three weight matrices converge to their final values:

W_Q (512 x 64)   W_K (512 x 64)   W_V (512 x 64)
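For intuition, the masked-word objective can be tried with a pretrained BERT; this sketch assumes the Hugging Face transformers library and illustrates the idea rather than the training procedure itself:

```python
# Sketch: masked-word prediction with a pretrained BERT model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I went to the [MASK] to deposit money."):
    print(pred["token_str"], round(pred["score"], 3))  # "bank" should rank high
```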
Train the Neural Network
 Train the neural network with the training data
 Compute the modified embeddings

Training data with masked words → Embeddings of words → Q, K, V matrices + self-attention vectors → Modified embeddings
How Does BERT Interpret Language?
 BERT uses the self-attention method
 I went to a “bank” to deposit money
 I went to a “bank” of a river to take a walk
Summary
1. What are Transformers?
2. Transformers – Applications: GPT+BERT
3. Problem – Context Sensitive Embeddings
1. Bank Word Embeddings
4. Transformer Architecture
1. Word2Vec Embeddings
2. Positional Encoding
3. Self Attention
4. Feed Forward Neural Network