Lecture 11: Attention and Transformers
Fei-Fei Li, Jiajun Wu, Ruohan Gao. May 03, 2022
Administrative
- Project proposal grades released. Check feedback on Gradescope!
- Project milestone due May 7th, Saturday 11:59pm PT. Check Ed and the course website for requirements.
Last Time: Recurrent Neural Networks
Last Time: Variable-length computation graph with shared weights
[Figure: an unrolled RNN. Inputs x1, ..., xT feed hidden states h0 -> h1 -> ... -> hT through a shared cell fW with weights W; each step produces an output yt with loss Lt, and the per-step losses sum to the total loss L.]
Sequence to Sequence with RNNs
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014

Input: sequence x1, ..., xT (e.g. "we are eating bread")
Output: sequence y1, ..., yT' (e.g. "estamos comiendo pan")

Encoder: ht = fW(xt, ht-1)
From the final hidden state, predict:
- Initial decoder state s0
- Context vector c (often c = hT)

Decoder: st = gU(yt-1, st-1, c), starting from y0 = [START] and generating y1 ("estamos"), y2 ("comiendo"), ... until it emits [STOP].

Problem: the input sequence is bottlenecked through a single fixed-size vector. What if T = 1000?
Idea: use a new context vector at each step of the decoder!
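As a rough illustration of the recurrences above, here is a minimal NumPy sketch of the vanilla seq2seq forward pass, with no attention. The tanh RNN cells, the weight names (Wxh, Whh, Wyh, Wsh, Wch, Wout, E), and greedy argmax decoding are illustrative assumptions rather than the exact parameterization from Sutskever et al.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, T_out = 64, 1000, 10                     # hidden size, vocab size, max output length

# Illustrative (randomly initialized) parameters; real models learn these.
Wxh = rng.normal(size=(D, D)) * 0.1            # encoder input-to-hidden
Whh = rng.normal(size=(D, D)) * 0.1            # encoder hidden-to-hidden
Wyh = rng.normal(size=(D, D)) * 0.1            # decoder: previous word
Wsh = rng.normal(size=(D, D)) * 0.1            # decoder: previous state
Wch = rng.normal(size=(D, D)) * 0.1            # decoder: context vector
Wout = rng.normal(size=(D, V)) * 0.1           # state -> vocabulary logits
E = rng.normal(size=(V, D)) * 0.1              # word embedding table

def encoder_step(x_t, h_prev):                 # h_t = fW(x_t, h_{t-1})
    return np.tanh(x_t @ Wxh + h_prev @ Whh)

def decoder_step(y_prev, s_prev, c):           # s_t = gU(y_{t-1}, s_{t-1}, c)
    return np.tanh(y_prev @ Wyh + s_prev @ Wsh + c @ Wch)

x = rng.normal(size=(4, D))                    # 4 input word vectors: "we are eating bread"
h = np.zeros(D)
for t in range(x.shape[0]):
    h = encoder_step(x[t], h)

s, c = h, h                                    # s0 and context c = hT, both from the final hidden state
y_prev = np.zeros(D)                           # stand-in embedding for [START]
for t in range(T_out):
    s = decoder_step(y_prev, s, c)             # the same fixed c at every step: the bottleneck
    token = int(np.argmax(s @ Wout))           # greedy choice of the next word
    y_prev = E[token]                          # feed the chosen word back in at the next step
```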
Sequence to Sequence with RNNs and Attention
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Input: sequence x1, ..., xT. Output: sequence y1, ..., yT'.
Encoder: ht = fW(xt, ht-1). From the final hidden state, predict only the initial decoder state s0.

At each decoder timestep t:
- Compute (scalar) alignment scores: et,i = fatt(st-1, hi), where fatt is an MLP.
- Normalize the alignment scores with a softmax to get attention weights: 0 < at,i < 1 and ∑i at,i = 1.
- Compute the context vector as a linear combination of the encoder hidden states: ct = ∑i at,i hi.
- Use the context vector in the decoder: st = gU(yt-1, st-1, ct).

Intuition: the context vector attends to the relevant part of the input sequence. When generating "estamos" ("we are"), maybe a1,1 = a1,2 = 0.45 and a1,3 = a1,4 = 0.05; when generating "comiendo" ("eating"), maybe a2,1 = a2,4 = 0.05, a2,2 = 0.1, a2,3 = 0.8.

This is all differentiable! There is no supervision on the attention weights; we simply backprop through everything.

Repeat at every step: use s1 to compute a new context vector c2, use c2 to compute s2 and y2, and so on until [STOP].

Using a different context vector at each timestep of the decoder means:
- The input sequence is not bottlenecked through a single vector.
- At each timestep of the decoder, the context vector "looks at" different parts of the input sequence.

Example: English-to-French translation.
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."
Visualizing the attention weights at,i shows mostly diagonal attention (words that correspond in order), plus off-diagonal structure where attention figures out different word orders (e.g. "European Economic Area" vs. "zone économique européenne").

Note that the decoder doesn't use the fact that the hi form an ordered sequence; it just treats them as an unordered set {hi}. We can use a similar architecture given any set of input hidden vectors {hi}!
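A minimal NumPy sketch of a single attention-based decoder step, following the recipe above. The concrete two-layer form of fatt and the tanh update for gU are illustrative assumptions; only the score, softmax, weighted-sum, decoder-update structure is taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 64                                    # input length ("we are eating bread"), hidden size
h = rng.normal(size=(T, D))                     # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=D)                     # previous decoder state s_{t-1}
y_prev = rng.normal(size=D)                     # embedding of the previous output word y_{t-1}

# fatt: small MLP on [s_{t-1}; h_i] -> scalar score (illustrative parameterization)
W1 = rng.normal(size=(2 * D, D)) * 0.1
w2 = rng.normal(size=D) * 0.1
def f_att(s, h_i):
    return np.tanh(np.concatenate([s, h_i]) @ W1) @ w2

e = np.array([f_att(s_prev, h[i]) for i in range(T)])   # alignment scores e_{t,i}
a = np.exp(e - e.max()); a /= a.sum()                    # softmax -> attention weights a_{t,i}
c = a @ h                                                # context vector c_t = sum_i a_{t,i} h_i

# gU: decoder update that consumes the fresh context vector (illustrative tanh form)
Wy = rng.normal(size=(D, D)) * 0.1
Ws = rng.normal(size=(D, D)) * 0.1
Wc = rng.normal(size=(D, D)) * 0.1
s_t = np.tanh(y_prev @ Wy + s_prev @ Ws + c @ Wc)        # s_t = gU(y_{t-1}, s_{t-1}, c_t)
print(a)   # e.g. weights peaked on "we"/"are" when generating "estamos"
```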
Image Captioning using spatial features
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Input: image I. Output: sequence y = y1, y2, ..., yT (e.g. "person wearing hat [END]").

Extract spatial features from a pretrained CNN: a grid z of shape H x W x D with entries zi,j.

Encoder: h0 = fW(z), where z is the spatial CNN feature grid and fW(.) is an MLP.
Decoder: yt = gV(yt-1, ht-1, c), where the context vector c is often c = h0. Decoding starts from y0 = [START] and continues until [END].

Problem: the input is "bottlenecked" through c.
- The model needs to encode everything it wants to say within c.
- This is a problem if we want to generate really long descriptions, hundreds of words long.
Image Captioning with RNNs and Attention
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Attention idea: use a new context vector at every time step; each context vector will attend to different image regions (loosely analogous to attention saccades in humans).

Extract spatial features z (H x W x D) from a pretrained CNN. At each decoder timestep t:
- Compute alignment scores (scalars) over the grid: et,i,j = fatt(ht-1, zi,j), giving an H x W score map; fatt(.) is an MLP.
- Normalize with a softmax to get an H x W map of attention weights: 0 < at,i,j < 1, and the attention values sum to 1.
- Compute the context vector: ct = ∑i,j at,i,j zi,j.
- Decoder: yt = gV(yt-1, ht-1, ct). A new context vector at every time step, so each timestep of the decoder looks at different parts of the input image while generating "person", "wearing", "hat", ..., [END].

This entire process is differentiable: the model chooses its own attention weights, and no attention supervision is required.

[Figure: qualitative results from Xu et al. Soft attention produces smooth attention maps over the image; hard attention selects a single region per step and requires reinforcement learning to train. Example attention maps highlight the image region corresponding to each generated word. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.]
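The same attention step over a spatial feature grid, as a small NumPy sketch. Using a bilinear score in place of the fatt MLP is a simplifying assumption; the softmax runs over all H*W positions and the context vector is the attention-weighted sum of the grid features.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 3, 3, 64
z = rng.normal(size=(H, W, D))                 # spatial CNN features z_{i,j}
h_prev = rng.normal(size=D)                    # previous decoder hidden state

Wa = rng.normal(size=(D, D)) * 0.1             # assumed bilinear form standing in for f_att
e = np.einsum('ijd,de,e->ij', z, Wa, h_prev)   # alignment scores, shape H x W
a = np.exp(e - e.max()); a /= a.sum()          # softmax over all H*W grid positions
c = np.einsum('ij,ijd->d', a, z)               # context vector c_t = sum_{i,j} a_{t,i,j} z_{i,j}
```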
Attention we just saw in image captioning

Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)
Operations:
- Alignment: ei,j = fatt(h, zi,j)
- Attention: a = softmax(e)
- Output: c = ∑i,j ai,j zi,j
Outputs:
- Context vector: c (shape: D)

General attention layer
The attention operation is permutation invariant:
- It doesn't care about the ordering of the features.
- So stretch the H x W grid into a set of N = H*W input vectors.

Inputs: input vectors x (shape: N x D), query h (shape: D)
Operations: alignment ei = fatt(h, xi); attention a = softmax(e); output c = ∑i ai xi

Change fatt(.) to a simple dot product: ei = h ᐧ xi
- This only works well together with the key and value transformation trick introduced a few slides below.

Change fatt(.) to a scaled dot product: ei = h ᐧ xi / √D
- Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher.
- Large-magnitude vectors will produce much higher logits, so the post-softmax distribution has lower entropy (assuming the logits are IID).
- Ultimately, these large-magnitude vectors cause the softmax to peak and assign very little weight to all other inputs.
- Dividing by √D reduces the effect of large-magnitude vectors.
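A quick numerical check of the √D argument above: for random vectors with IID unit-variance entries, the variance of the raw dot product grows linearly with D, while dividing by √D keeps it roughly constant. This is purely illustrative; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (16, 256, 4096):
    h = rng.normal(size=(10000, D))
    x = rng.normal(size=(10000, D))
    raw = (h * x).sum(axis=1)                  # raw dot products h . x_i
    print(D, raw.var().round(1), (raw / np.sqrt(D)).var().round(2))
    # variance of the raw logits grows like D; after dividing by sqrt(D) it stays near 1
```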
Multiple query vectors
- Each query creates a new output context vector.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D)
Operations: alignment ei,j = qj ᐧ xi / √D; attention a = softmax(e), normalized over i for each query; output yj = ∑i ai,j xi
Outputs: context vectors y (shape: M x D)

Notice that the input vectors are used for both the alignment and the attention-weighted output. We can add more expressivity to the layer by adding a different FC layer before each of the two steps:
- Key vectors: k = xWk
- Value vectors: v = xWv
Operations: alignment ei,j = qj ᐧ ki / √D; attention a = softmax(e); output yj = ∑i ai,j vi
The input and output dimensions can now change depending on the key and value FC layers: queries q (shape: M x Dk), outputs y (shape: M x Dv).

Recall that in image captioning the query vector was itself a function of the input vectors: h0 was computed from the spatial CNN features z by an MLP (h0 = fW(z)).
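A minimal NumPy sketch of the full general attention layer in matrix form, with keys and values computed from the inputs and the queries supplied from outside (e.g. from a decoder state). The specific dimensions and random weights are assumptions; the scaling here uses √Dk, the query/key dimension.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, M, D, Dk, Dv = 9, 4, 64, 32, 48
x = rng.normal(size=(N, D))                    # input vectors
q = rng.normal(size=(M, Dk))                   # query vectors (supplied from elsewhere)
Wk = rng.normal(size=(D, Dk)) * 0.1            # key projection
Wv = rng.normal(size=(D, Dv)) * 0.1            # value projection

k = x @ Wk                                     # keys,   N x Dk
v = x @ Wv                                     # values, N x Dv
e = q @ k.T / np.sqrt(Dk)                      # alignment scores, M x N
a = softmax(e, axis=1)                         # attention weights; each row sums to 1
y = a @ v                                      # outputs, M x Dv (one context vector per query)
```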
Self-attention layer
There are no separate input query vectors anymore. Since we can calculate the query vectors from the input vectors, we define a "self-attention" layer: the queries are computed with one more FC layer.

Inputs: input vectors x (shape: N x D)
Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
Outputs: context vectors y (shape: N x Dv), one per input vector

The self-attention layer attends over sets of inputs, and it is permutation equivariant: permuting the input vectors permutes the output vectors in the same way, so the layer doesn't care about the order of its inputs.

Problem: how can we encode ordered sequences like language, or spatially ordered image features?
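Self-attention in the same style as the previous sketch: the only change is that the queries are also computed from x with one more FC layer. The final assert illustrates the permutation-equivariance property discussed above; dimensions and weights are assumed.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # queries, keys, values all come from the inputs
    a = softmax(q @ k.T / np.sqrt(Wk.shape[1]), axis=1)
    return a @ v                               # one output vector per input vector

rng = np.random.default_rng(0)
N, D, Dk, Dv = 6, 64, 32, 64
x = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, d)) * 0.1 for d in (Dk, Dk, Dv))
y = self_attention(x, Wq, Wk, Wv)

# Permutation equivariance: permuting the inputs permutes the outputs the same way.
perm = rng.permutation(N)
assert np.allclose(self_attention(x[perm], Wq, Wk, Wv), y[perm])
```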
Positional encoding
Concatenate or add a special positional encoding pj to each input vector xj. We use a function pos: N -> R^d that maps the position j of a vector to a d-dimensional vector, so pj = pos(j). The encoded inputs then go through self-attention as before.

Desiderata for pos(.):
1. It should output a unique encoding for each time-step (a word's position in a sentence).
2. The distance between any two time-steps should be consistent across sentences of different lengths.
3. The model should generalize to longer sentences without any effort; its values should be bounded.
4. It must be deterministic.

Options for pos(.):
1. Learn a lookup table:
   ○ Learn parameters to use for pos(t) for t in [0, T).
   ○ The lookup table contains T x d parameters.
2. Design a fixed function with the desiderata. Vaswani et al. use interleaved sinusoids:
   p(t)[2i] = sin(t / 10000^(2i/d)), p(t)[2i+1] = cos(t / 10000^(2i/d)), for i = 0, ..., d/2 - 1.
   Intuition: each dimension oscillates at a different, geometrically spaced frequency (much like the bits of a binary counter), so every position gets a unique, bounded encoding and nearby positions get similar ones.

Vaswani et al., "Attention is all you need", NeurIPS 2017
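A sketch of the fixed sinusoidal option (option 2 above): even dimensions get a sine and odd dimensions a cosine, with geometrically spaced frequencies, so the values are bounded and the function is deterministic. An even dimension d is assumed here.

```python
import numpy as np

def pos_encoding(T, d):
    """p[t, 2i] = sin(t / 10000^(2i/d)),  p[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = np.arange(T)[:, None]                        # positions 0 .. T-1
    i = np.arange(0, d, 2)[None, :]                  # even dimension indices
    angles = t / np.power(10000.0, i / d)
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

x = np.random.default_rng(0).normal(size=(5, 64))    # 5 input vectors of dimension 64
x = x + pos_encoding(5, 64)                           # add (or concatenate) before self-attention
```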
Masked self-attention layer
- Prevent vectors from looking at future vectors.
- Manually set those alignment scores to -infinity before the softmax, so the corresponding attention weights become 0.

Operations (otherwise the same as self-attention):
- Key vectors: k = xWk; value vectors: v = xWv; query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D, with ei,j = -∞ whenever key i comes after query j
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
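A minimal sketch of the causal mask: alignment scores for future positions are set to -infinity before the softmax, so each position only attends to itself and earlier positions. Shapes and random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 64
x = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

e = q @ k.T / np.sqrt(D)                       # e[j, i]: query j scored against key i
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
e[mask] = -np.inf                              # query j may not look at keys i > j
a = np.exp(e - e.max(axis=1, keepdims=True))
a = a / a.sum(axis=1, keepdims=True)           # each row is a distribution over the past only
y = a @ v
```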
Multi-head self-attention layer
- Run multiple self-attention heads in parallel: split the inputs across H heads (head0, head1, ..., headH-1), apply self-attention within each head, then concatenate the per-head outputs to form the final output vectors y0, y1, y2, ...
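A sketch of multi-head self-attention: each head projects the inputs into its own lower-dimensional subspace, runs the same single-head attention, and the head outputs are concatenated. Treating the "split" as per-head projections (rather than literally slicing the input) and the particular head count are assumptions.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(Wk.shape[1]), axis=1) @ v

def multi_head_self_attention(x, heads):
    # heads: list of (Wq, Wk, Wv) triples, one per head; head outputs are concatenated
    return np.concatenate([one_head(x, *w) for w in heads], axis=1)

rng = np.random.default_rng(0)
N, D, H = 6, 64, 8
Dh = D // H                                    # each head works in a D/H-dimensional subspace
heads = [tuple(rng.normal(size=(D, Dh)) * 0.1 for _ in range(3)) for _ in range(H)]
y = multi_head_self_attention(rng.normal(size=(N, D)), heads)   # y has shape N x D again
```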
General attention versus self-attention
[Figure: side by side. The general attention layer takes keys k, values v, and queries q as separate inputs and produces outputs y; the self-attention layer takes only the input vectors x and computes q, k, v from them internally.]
Example: CNN with Self-Attention
Zhang et al., "Self-Attention Generative Adversarial Networks", ICML 2018. Slide credit: Justin Johnson. Cat image is free to use under the Pixabay License.

- Input image -> CNN -> features of shape C x H x W.
- Three 1x1 convolutions map the features to queries (C' x H x W), keys (C' x H x W), and values (C' x H x W).
- Multiply the (transposed) queries with the keys and apply a softmax to get attention weights of shape (H x W) x (H x W): every spatial position attends over every position.
- Multiply the attention weights with the values to get attended features of shape C' x H x W.
- A final 1x1 convolution maps back to C x H x W.
- A residual connection adds the result to the original features; together these steps form the self-attention module.
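A minimal NumPy sketch of the self-attention module above. A 1x1 convolution is just a per-position linear map over channels, so the whole module reduces to reshapes and matrix multiplies; the channel sizes are assumptions and the residual connection is simplified relative to the paper.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, Cp, H, W = 64, 32, 7, 7
feat = rng.normal(size=(C, H, W))              # CNN features, C x H x W
x = feat.reshape(C, H * W).T                   # N x C, with N = H*W spatial positions

# 1x1 convolutions == per-position linear maps over channels
Wq, Wk, Wv = (rng.normal(size=(C, Cp)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(Cp, C)) * 0.1            # final 1x1 conv back to C channels

q, k, v = x @ Wq, x @ Wk, x @ Wv               # each N x C'
att = softmax(q @ k.T, axis=1)                 # (H*W) x (H*W) attention weights
out = (att @ v) @ Wo                           # N x C attended features
out = feat + out.T.reshape(C, H, W)            # residual connection around the module
```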
Comparing RNNs to Transformers
RNNs:
(+) LSTMs work reasonably well for long sequences.
(-) Expect an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.

Transformers:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Require a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).
"ImageNet Moment for Natural Language Processing"
Pretraining: download a lot of text from the internet and train a giant Transformer model for language modeling.
Finetuning: fine-tune the Transformer on your own NLP task.
Image Captioning using Transformers
Input: image I. Output: sequence y = y1, y2, ..., yT.

- Extract spatial features from a pretrained CNN: a grid z of shape H x W x D with entries z0,0, ..., z2,2.
- Encoder: c = TW(z), where z is the spatial CNN feature grid and TW(.) is a transformer encoder producing context vectors c0,0, ..., c2,2.
- Decoder: yt = TD(y0:t-1, c), where TD(.) is a transformer decoder; starting from y0 = [START], it generates "person wearing hat [END]".
The Transformer encoder block
Vaswani et al., "Attention is all you need", NeurIPS 2017

The encoder is made up of N encoder blocks (in Vaswani et al., N = 6 and Dq = 512). Inside one encoder block:
- Add positional encoding to the input vectors x0, x1, x2, ...
- Multi-head self-attention: attention attends over all the vectors.
- Residual connection.
- LayerNorm, applied to each vector individually.
- MLP, applied to each vector individually.
- Residual connection.
- Layer norm.

Transformer Encoder Block:
- Inputs: set of vectors x. Outputs: set of vectors y.
- Self-attention is the only interaction between vectors.
- Layer norm and the MLP operate independently per vector.
- Highly scalable and highly parallelizable, but high memory usage.
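A minimal NumPy sketch of one encoder block with the layer order listed above: self-attention, add, layer norm, then a per-vector MLP, add, layer norm. Single-head attention, a ReLU MLP, and leaving out the learned LayerNorm scale/shift are simplifying assumptions.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):                   # normalizes each vector independently
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(Wk.shape[1]), axis=1) @ v

def encoder_block(x, p):
    x = layer_norm(x + self_attention(x, p['Wq'], p['Wk'], p['Wv']))   # attention + residual + LN
    h = np.maximum(0.0, x @ p['W1']) @ p['W2']                          # per-vector MLP
    return layer_norm(x + h)                                            # MLP + residual + LN

rng = np.random.default_rng(0)
N, D = 9, 64
p = {name: rng.normal(size=shape) * 0.1 for name, shape in
     [('Wq', (D, D)), ('Wk', (D, D)), ('Wv', (D, D)), ('W1', (D, 4 * D)), ('W2', (4 * D, D))]}
y = encoder_block(rng.normal(size=(N, D)), p)   # stack N such blocks for the full encoder
```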
The Transformer decoder block
Vaswani et al., "Attention is all you need", NeurIPS 2017

The decoder is made up of N decoder blocks (in Vaswani et al., N = 6 and Dq = 512). It takes the shifted outputs y0 = [START], y1, y2, ... together with the encoder outputs c0,0, ..., c2,2 and predicts y1, ..., yT ("person wearing hat [END]"). Inside one decoder block, most of the network is the same as the transformer encoder:
- Add positional encoding to the inputs x0, ..., x3.
- Masked multi-head self-attention, followed by a residual connection and layer norm.
- Multi-head attention over the transformer encoder outputs: keys k and values v come from the encoder outputs c, queries q come from the decoder. For image captioning, this is how we inject image features into the decoder. Followed by a residual connection and layer norm.
- MLP, followed by a residual connection and layer norm.
- A final FC layer produces the outputs y0, ..., y3.

Transformer Decoder Block:
- Inputs: set of vectors x and set of context vectors c. Outputs: set of vectors y.
- Masked self-attention only interacts with past inputs.
- The multi-head attention block is NOT self-attention: it attends over the encoder outputs.
- Highly scalable and highly parallelizable, but high memory usage.
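A minimal sketch of the decoder block's two attention steps: masked self-attention over the already-generated outputs, then cross-attention in which queries come from the decoder and keys/values come from the encoder outputs c. Single-head attention and the omission of the layer norms and MLP are simplifications; all shapes are assumed.

```python
import numpy as np

def softmax(e, axis=-1):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Tdec, Nenc, D = 4, 9, 64
x = rng.normal(size=(Tdec, D))                 # decoder inputs (shifted outputs + positional enc.)
c = rng.normal(size=(Nenc, D))                 # encoder outputs (e.g. attended image features)
W = {name: rng.normal(size=(D, D)) * 0.1 for name in ('q1', 'k1', 'v1', 'q2', 'k2', 'v2')}

# 1) masked self-attention: each position looks only at past decoder positions
e = (x @ W['q1']) @ (x @ W['k1']).T / np.sqrt(D)
e[np.triu(np.ones((Tdec, Tdec), dtype=bool), k=1)] = -np.inf
x = x + softmax(e, axis=1) @ (x @ W['v1'])     # residual connection (layer norm omitted)

# 2) cross-attention: queries from the decoder, keys/values from the encoder outputs
e = (x @ W['q2']) @ (c @ W['k2']).T / np.sqrt(D)
x = x + softmax(e, axis=1) @ (c @ W['v2'])     # this is how image features enter the decoder
```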
Image Captioning using Transformers: no recurrence at all.
- The CNN extracts spatial features z (H x W x D), the transformer encoder turns them into context vectors c, and the transformer decoder generates "person wearing hat [END]".
- Perhaps we don't need convolutions at all?

Image Captioning using ONLY Transformers
- Transformers from pixels to language: replace the CNN with a transformer encoder that operates directly on image patches.
Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
Colab link to an implementation of vision transformers
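A small sketch of the "image is worth 16x16 words" idea behind this step: cut the image into non-overlapping patches, flatten and linearly project each patch, add positional information, and feed the resulting token sequence to a standard transformer encoder. The patch size, projection dimension, and random stand-ins for learned embeddings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Himg, Wimg, C, P, D = 224, 224, 3, 16, 64
img = rng.normal(size=(Himg, Wimg, C))

# Split into non-overlapping P x P patches and flatten each one
patches = img.reshape(Himg // P, P, Wimg // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)        # (14*14) x 768 patch vectors

Wproj = rng.normal(size=(P * P * C, D)) * 0.02  # linear patch embedding (learned in practice)
tokens = patches @ Wproj                        # 196 x D token sequence
tokens = tokens + rng.normal(size=tokens.shape) * 0.02   # stand-in for positional embeddings
# `tokens` would now go through the same transformer encoder blocks as in the text case.
```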
Vision Transformers vs. ResNets
Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
Colab link to an implementation of vision transformers

Vision Transformers
- Carion et al., "End-to-End Object Detection with Transformers", ECCV 2020
- Fan et al., "Multiscale Vision Transformers", ICCV 2021
- Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV 2021

ConvNets strike back!
- Liu et al., "A ConvNet for the 2020s", CVPR 2022
Summary
- Adding attention to RNNs allows them to "attend" to different parts of the input at every time step.
- The general attention layer is a new type of layer that can be used to design new neural network architectures.
- Transformers are a type of layer that uses self-attention and layer norm.
  ○ Highly scalable and highly parallelizable.
  ○ Faster training, larger models, better performance across vision and language tasks.
  ○ They are quickly replacing RNNs and LSTMs, and may(?) even replace convolutions.
Next time: Video Understanding

經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
 
半導體實驗室.pptx
半導體實驗室.pptx半導體實驗室.pptx
半導體實驗室.pptx
 
B 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptxB 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptx
 
20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf
 
電動拖板車100208.pptx
電動拖板車100208.pptx電動拖板車100208.pptx
電動拖板車100208.pptx
 
電動拖板車100121.pptx
電動拖板車100121.pptx電動拖板車100121.pptx
電動拖板車100121.pptx
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdf
 
FIN330 Chapter 20.pptx
FIN330 Chapter 20.pptxFIN330 Chapter 20.pptx
FIN330 Chapter 20.pptx
 
FIN330 Chapter 22.pptx
FIN330 Chapter 22.pptxFIN330 Chapter 22.pptx
FIN330 Chapter 22.pptx
 
FIN330 Chapter 01.pptx
FIN330 Chapter 01.pptxFIN330 Chapter 01.pptx
FIN330 Chapter 01.pptx
 
Thoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdfThoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdf
 
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
 
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
 

Recently uploaded

Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 

Recently uploaded (20)

Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 


  • 1. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 1 Lecture 11: Attention and Transformers
  • 2. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 2 Administrative - Project proposal grades released. Check feedback on GradeScope! - Project milestone due May 7th Saturday 11:59pm PT Check Ed and course website for requirements
  • 3. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 3 Last Time: Recurrent Neural Networks
  • 4. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 4 Last Time: Variable length computation graph with shared weights h0 fW h1 fW h2 fW h3 x3 yT … x2 x1 W hT y3 y2 y1 L1 L2 L3 LT L
  • 5. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 bread x4 h4 Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Sequence to Sequence with RNNs 5
  • 6. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Sequence to Sequence with RNNs 6
  • 7. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 s1 x1 we are eating x2 x3 h1 h2 h3 s0 [START] y0 y1 bread x4 h4 estamos c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Decoder: st = gU (yt-1 , st-1 , c) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Sequence to Sequence with RNNs 7
  • 8. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 s1 x1 we are eating x2 x3 h1 h2 h3 s0 s2 [START] y0 y1 y1 y2 bread x4 h4 estamos comiendo estamos c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Decoder: st = gU (yt-1 , st-1 , c) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Sequence to Sequence with RNNs 8
  • 9. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 s1 x1 we are eating x2 x3 h1 h2 h3 s0 s2 [START] y0 y1 y1 y2 bread x4 h4 estamos comiendo pan y2 y3 estamos comiendo s3 s4 y3 y4 pan [STOP] c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Decoder: st = gU (yt-1 , st-1 , c) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Sequence to Sequence with RNNs 9
  • 10. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 s1 x1 we are eating x2 x3 h1 h2 h3 s0 s2 [START] y0 y1 y1 y2 bread x4 h4 estamos comiendo pan y2 y3 estamos comiendo s3 s4 y3 y4 pan [STOP] c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Decoder: st = gU (yt-1 , st-1 , c) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Problem: Input sequence bottlenecked through fixed-sized vector. What if T=1000? Sequence to Sequence with RNNs 10
  • 11. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 s1 x1 we are eating x2 x3 h1 h2 h3 s0 s2 [START] y0 y1 y1 y2 bread x4 h4 estamos comiendo pan y2 y3 estamos comiendo s3 s4 y3 y4 pan [STOP] c Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014 Encoder: ht = fW (xt , ht-1 ) Decoder: st = gU (yt-1 , st-1 , c) From final hidden state predict: Initial decoder state s0 Context vector c (often c=hT ) Problem: Input sequence bottlenecked through fixed-sized vector. What if T=1000? Idea: use new context vector at each step of decoder! Sequence to Sequence with RNNs 11
  • 12. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Input: Sequence x1 , … xT Output: Sequence y1 , …, yT’ Encoder: ht = fW (xt , ht-1 ) From final hidden state: Initial decoder state s0 Sequence to Sequence with RNNs and Attention 12
  • 13. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 e11 e12 e13 e14 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Compute (scalar) alignment scores et,i = fatt (st-1 , hi ) (fatt is an MLP) Sequence to Sequence with RNNs and Attention From final hidden state: Initial decoder state s0 13
  • 14. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 e11 e12 e13 e14 softmax a11 a12 a13 a14 From final hidden state: Initial decoder state s0 Normalize alignment scores to get attention weights 0 < at,i < 1 ∑i at,i = 1 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Compute (scalar) alignment scores et,i = fatt (st-1 , hi ) (fatt is an MLP) Sequence to Sequence with RNNs and Attention 14
  • 15. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 e11 e12 e13 e14 softmax a11 a12 a13 a14 c1 ✖ + ✖ ✖ ✖ s1 y0 y1 estamos Normalize alignment scores to get attention weights 0 < at,i < 1 ∑i at,i = 1 Compute context vector as linear combination of hidden states ct = ∑i at,i hi Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Compute (scalar) alignment scores et,i = fatt (st-1 , hi ) (fatt is an MLP) [START] Sequence to Sequence with RNNs and Attention From final hidden state: Initial decoder state s0 15
  • 16. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 e11 e12 e13 e14 softmax a11 a12 a13 a14 c1 ✖ + ✖ ✖ ✖ s1 y0 y1 estamos Normalize alignment scores to get attention weights 0 < at,i < 1 ∑i at,i = 1 Compute context vector as linear combination of hidden states ct = ∑i at,i hi Use context vector in decoder: st = gU (yt-1 , st-1 , ct ) Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Compute (scalar) alignment scores et,i = fatt (st-1 , hi ) (fatt is an MLP) [START] Sequence to Sequence with RNNs and Attention Intuition: Context vector attends to the relevant part of the input sequence “estamos” = “we are” so maybe a11 =a12 =0.45, a13 =a14 =0.05 From final hidden state: Initial decoder state s0 16
  • 17. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 e11 e12 e13 e14 softmax a11 a12 a13 a14 c1 ✖ + ✖ ✖ ✖ s1 y0 y1 estamos Normalize alignment scores to get attention weights 0 < at,i < 1 ∑i at,i = 1 Compute context vector as linear combination of hidden states ct = ∑i at,i hi Use context vector in decoder: st = gU (yt-1 , st-1 , ct ) Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Compute (scalar) alignment scores et,i = fatt (st-1 , hi ) (fatt is an MLP) [START] Sequence to Sequence with RNNs and Attention Intuition: Context vector attends to the relevant part of the input sequence “estamos” = “we are” so maybe a11 =a12 =0.45, a13 =a14 =0.05 From final hidden state: Initial decoder state s0 17 This is all differentiable! No supervision on attention weights – backprop through everything
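As a concrete illustration of the mechanism on slides 13-17, here is a minimal NumPy sketch of one attention step in the RNN decoder, assuming toy dimensions and an arbitrary small two-layer MLP standing in for fatt; the names (s_prev, H, W1, W2) are illustrative, not the lecture's reference code.

import numpy as np

def f_att(s_prev, h_i, W1, W2):
    # Tiny MLP alignment function: e = w2^T tanh(W1 [s_prev; h_i])  (illustrative form)
    concat = np.concatenate([s_prev, h_i])
    return W2 @ np.tanh(W1 @ concat)

def attention_step(s_prev, H, W1, W2):
    # H: (T, D) encoder hidden states, s_prev: (D,) previous decoder state
    e = np.array([f_att(s_prev, h_i, W1, W2) for h_i in H])  # alignment scores e_{t,i}
    a = np.exp(e - e.max()); a /= a.sum()                    # softmax -> attention weights a_{t,i}
    c = a @ H                                                # context vector c_t = sum_i a_{t,i} h_i
    return c, a

# toy example
D, T = 4, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(T, D))        # encoder states h_1..h_T
s0 = rng.normal(size=D)            # initial decoder state
W1 = rng.normal(size=(8, 2 * D))   # hypothetical MLP weights
W2 = rng.normal(size=8)
c1, a1 = attention_step(s0, H, W1, W2)
print(a1.sum())                    # attention weights sum to 1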
  • 18. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 s1 [START] y0 y1 estamos c1 c2 e21 e22 e23 e24 softmax a21 a22 a23 a24 ✖ ✖ ✖ ✖ + Repeat: Use s1 to compute new context vector c2 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Sequence to Sequence with RNNs and Attention 18
  • 19. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 s1 [START] y0 y1 estamos c1 c2 e21 e22 e23 e24 softmax a21 a22 a23 a24 ✖ ✖ ✖ ✖ + s2 y2 comiendo y1 Use c2 to compute s2 , y2 estamos Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Sequence to Sequence with RNNs and Attention Repeat: Use s1 to compute new context vector c2 19
  • 20. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 s1 [START] y0 y1 estamos c1 c2 e21 e22 e23 e24 softmax a21 a22 a23 a24 ✖ ✖ ✖ ✖ + s2 y2 comiendo y1 Use c2 to compute s2 , y2 estamos Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Sequence to Sequence with RNNs and Attention Intuition: Context vector attends to the relevant part of the input sequence “comiendo” = “eating” so maybe a21 =a24 =0.05, a22 =0.1, a23 =0.8 Repeat: Use s1 to compute new context vector c2 20
  • 21. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 s1 s2 [START] y0 y1 y2 estamos comiendo pan estamos comiendo s3 s4 y3 y4 pan [STOP] c1 y1 c2 y2 c3 y3 c4 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Use a different context vector in each timestep of decoder - Input sequence not bottlenecked through single vector - At each timestep of decoder, context vector “looks at” different parts of the input sequence Sequence to Sequence with RNNs and Attention 21
  • 22. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Example: English to French translation Input: “The agreement on the European Economic Area was signed in August 1992.” Output: “L’accord sur la zone économique européenne a été signé en août 1992.” Visualize attention weights at,i Sequence to Sequence with RNNs and Attention 22
  • 23. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Example: English to French translation Input: “The agreement on the European Economic Area was signed in August 1992.” Output: “L’accord sur la zone économique européenne a été signé en août 1992.” Visualize attention weights at,i Diagonal attention means words correspond in order Diagonal attention means words correspond in order Sequence to Sequence with RNNs and Attention 23
  • 24. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 Example: English to French translation Input: “The agreement on the European Economic Area was signed in August 1992.” Output: “L’accord sur la zone économique européenne a été signé en août 1992.” Visualize attention weights at,i Attention figures out different word orders Diagonal attention means words correspond in order Diagonal attention means words correspond in order Sequence to Sequence with RNNs and Attention 24
  • 25. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 x1 we are eating x2 x3 h1 h2 h3 s0 bread x4 h4 s1 s2 [START] y0 y1 y2 estamos comiendo pan estamos comiendo s3 s4 y3 y4 pan [STOP] c1 y1 c2 y2 c3 y3 c4 Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015 The decoder doesn’t use the fact that hi form an ordered sequence – it just treats them as an unordered set {hi } Can use similar architecture given any set of input hidden vectors {hi }! Sequence to Sequence with RNNs and Attention 25
  • 26. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 26 CNN Features: H x W x D Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Input: Image I Output: Sequence y = y1 , y2 ,..., yT
  • 27. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 27 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 MLP Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP Input: Image I Output: Sequence y = y1 , y2 ,..., yT
  • 28. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 28 CNN Features: H x W x D h0 [START] Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 y0 h1 [START] y1 person MLP Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP Input: Image I Output: Sequence y = y1 , y2 ,..., yT Decoder: yt = gV (yt-1 , ht-1 , c) where context vector c is often c = h0 c
  • 29. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 29 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 MLP Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP Input: Image I Output: Sequence y = y1 , y2 ,..., yT Decoder: yt = gV (yt-1 , ht-1 , c) where context vector c is often c = h0 h0 [START] y0 h1 [START] y1 h2 y2 y1 person person wearing c
  • 30. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 30 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 MLP Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP Input: Image I Output: Sequence y = y1 , y2 ,..., yT Decoder: yt = gV (yt-1 , ht-1 , c) where context vector c is often c = h0 h0 [START] y0 h1 [START] y1 h2 y2 y1 h3 y3 y2 person wearing person wearing hat c
  • 31. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 31 CNN Features: H x W x D h0 [START] Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 y0 h1 [START] y1 h2 y2 y1 h3 y3 y2 person wearing hat h4 y4 y3 person wearing hat [END] MLP c Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP Input: Image I Output: Sequence y = y1 , y2 ,..., yT Decoder: yt = gV (yt-1 , ht-1 , c) where context vector c is often c = h0
  • 32. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning using spatial features 32 CNN Features: H x W x D h0 [START] Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 y0 h1 [START] y1 h2 y2 y1 h3 y3 y2 person wearing hat h4 y4 y3 person wearing hat [END] MLP Problem: Input is "bottlenecked" through c - Model needs to encode everything it wants to say within c This is a problem if we want to generate really long descriptions, e.g. 100s of words long c
  • 33. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 33 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Attention idea: New context vector at every time step. Each context vector will attend to different image regions gif source Attention Saccades in humans
  • 34. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 34 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 Alignment scores: H x W Compute alignments scores (scalars): fatt (.) is an MLP
  • 35. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 35 CNN Features: H x W x D h0 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 a1,0,0 a1,0,1 a1,0,2 a1,1,0 a1,1,1 a1,1,2 a1,2,0 a1,2,1 a1,2,2 Alignment scores: H x W Attention: H x W Normalize to get attention weights: 0 < at, i, j < 1, attention values sum to 1 Compute alignments scores (scalars): fatt (.) is an MLP
  • 36. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 36 CNN Features: H x W x D h0 c1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 a1,0,0 a1,0,1 a1,0,2 a1,1,0 a1,1,1 a1,1,2 a1,2,0 a1,2,1 a1,2,2 Alignment scores: H x W Attention: H x W X Compute alignments scores (scalars): fatt (.) is an MLP Compute context vector: Normalize to get attention weights: 0 < at, i, j < 1, attention values sum to 1
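A minimal NumPy sketch of the spatial attention step above: flatten the H x W grid of CNN features, score each location against the decoder state, softmax over all locations, and take the weighted sum as the context vector. For brevity the alignment function here is a dot product, whereas the slides use a small MLP for fatt; all sizes are toy values.

import numpy as np

H_, W_, D = 3, 3, 5
rng = np.random.default_rng(1)
z = rng.normal(size=(H_, W_, D))     # CNN features z_{i,j}
h_prev = rng.normal(size=D)          # previous decoder hidden state

z_flat = z.reshape(-1, D)            # flatten the grid to (H*W, D)
e = z_flat @ h_prev                  # alignment scores e_{t,i,j} (dot-product stand-in for the MLP fatt)
a = np.exp(e - e.max()); a /= a.sum()        # attention weights over all H*W positions, sum to 1
c = (a[:, None] * z_flat).sum(axis=0)        # context vector c_t, shape (D,)
print(a.reshape(H_, W_))             # attention map over the image grid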
  • 37. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Each timestep of decoder uses a different context vector that looks at different parts of the input image Image Captioning with RNNs and Attention 37 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 person z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Decoder: yt = gV (yt-1 , ht-1 , ct ) New context vector at every time step
  • 38. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 38 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 person z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Decoder: yt = gV (yt-1 , ht-1 , ct ) New context vector at every time step e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 a1,0,0 a1,0,1 a1,0,2 a1,1,0 a1,1,1 a1,1,2 a1,2,0 a1,2,1 a1,2,2 Alignment scores: H x W Attention: H x W c2 X
  • 39. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 39 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 h2 y2 c2 y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 person person wearing z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Each timestep of decoder uses a different context vector that looks at different parts of the input image Decoder: yt = gV (yt-1 , ht-1 , ct ) New context vector at every time step
  • 40. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 40 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 h2 y2 c2 y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 h3 y3 c3 y2 person wearing person wearing hat z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Each timestep of decoder uses a different context vector that looks at different parts of the input image Decoder: yt = gV (yt-1 , ht-1 , ct ) New context vector at every time step
  • 41. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 41 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 h2 y2 c2 y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 h3 y3 c3 y2 person wearing hat h4 y4 c4 y3 person wearing hat [END] z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Each timestep of decoder uses a different context vector that looks at different parts of the input image Decoder: yt = gV (yt-1 , ht-1 , ct ) New context vector at every time step
  • 42. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 42 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 h2 y2 c2 y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 h3 y3 c3 y2 person wearing hat h4 y4 c4 y3 person wearing hat [END] z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 This entire process is differentiable. - model chooses its own attention weights. No attention supervision is required e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 a1,0,0 a1,0,1 a1,0,2 a1,1,0 a1,1,1 a1,1,2 a1,2,0 a1,2,1 a1,2,2 Alignment scores: H x W Attention: H x W X
  • 43. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 43 Soft attention Image Captioning with Attention Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission. Hard attention (requires reinforcement learning)
  • 44. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 44 Image Captioning with Attention Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
  • 45. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Extract spatial features from a pretrained CNN Image Captioning with RNNs and Attention 45 CNN Features: H x W x D h0 c1 y0 h1 [START] y1 h2 y2 c2 y1 Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 z0,0 h3 y3 c3 y2 person wearing hat h4 y4 c4 y3 person wearing hat [END] z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 This entire process is differentiable. - model chooses its own attention weights. No attention supervision is required e1,0,0 e1,0,1 e1,0,2 e1,1,0 e1,1,1 e1,1,2 e1,2,0 e1,2,1 e1,2,2 a1,0,0 a1,0,1 a1,0,2 a1,1,0 a1,1,1 a1,1,2 a1,2,0 a1,2,1 a1,2,2 Alignment scores: H x W Attention: H x W X
  • 46. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Attention we just saw in image captioning 46 Features h z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Inputs: Features: z (shape: H x W x D) Query: h (shape: D)
  • 47. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Attention we just saw in image captioning Alignment 47 Features h z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 e0,0 e0,1 e0,2 e1,0 e1,1 e1,2 e2,0 e2,1 e2,2 Inputs: Features: z (shape: H x W x D) Query: h (shape: D) Operations: Alignment: ei,j = fatt (h, zi,j )
  • 48. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Attention we just saw in image captioning Alignment 48 Features h z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 a0,0 a0,1 a0,2 a1,0 a1,1 a1,2 a2,0 a2,1 a2,2 Attention Inputs: Features: z (shape: H x W x D) Query: h (shape: D) softmax Operations: Alignment: ei,j = fatt (h, zi,j ) Attention: a = softmax(e) e0,0 e0,1 e0,2 e1,0 e1,1 e1,2 e2,0 e2,1 e2,2
  • 49. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Attention we just saw in image captioning Alignment 49 Features h z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Attention Inputs: Features: z (shape: H x W x D) Query: h (shape: D) softmax c mul + add Outputs: context vector: c (shape: D) Operations: Alignment: ei,j = fatt (h, zi,j ) Attention: a = softmax(e) Output: c = ∑i,j ai,j zi,j e0,0 e0,1 e0,2 e1,0 e1,1 e1,2 e2,0 e2,1 e2,2 a0,0 a0,1 a0,2 a1,0 a1,1 a1,2 a2,0 a2,1 a2,2
  • 50. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Attention operation is permutation invariant. - Doesn't care about ordering of the features - Stretch H x W = N into N vectors General attention layer Alignment 50 Input vectors h Attention Inputs: Input vectors: x (shape: N x D) Query: h (shape: D) softmax c mul + add Outputs: context vector: c (shape: D) Operations: Alignment: ei = fatt (h, xi ) Attention: a = softmax(e) Output: c = ∑i ai xi x2 x1 x0 e2 e1 e0 a2 a1 a0
  • 51. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Change fatt (.) to a simple dot product - only works well with key & value transformation trick (will mention in a few slides) General attention layer Alignment 51 h Attention Inputs: Input vectors: x (shape: N x D) Query: h (shape: D) softmax c mul + add Outputs: context vector: c (shape: D) Operations: Alignment: ei = h ᐧ xi Attention: a = softmax(e) Output: c = ∑i ai xi x2 x1 x0 e2 e1 e0 a2 a1 a0 Input vectors
  • 52. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Change fatt (.) to a scaled simple dot product - Larger dimensions means more terms in the dot product sum. - So, the variance of the logits is higher. Large magnitude vectors will produce much higher logits. - So, the post-softmax distribution has lower-entropy, assuming logits are IID. - Ultimately, these large magnitude vectors will cause softmax to peak and assign very little weight to all others - Divide by √D to reduce effect of large magnitude vectors General attention layer Alignment 52 h Attention Inputs: Input vectors: x (shape: N x D) Query: h (shape: D) softmax c mul + add Outputs: context vector: c (shape: D) Operations: Alignment: ei = h ᐧ xi / √D Attention: a = softmax(e) Output: c = ∑i ai xi x2 x1 x0 e2 e1 e0 a2 a1 a0 Input vectors
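A quick NumPy illustration of the scaling argument above, assuming IID standard-normal vectors and an arbitrary dimension D = 512: without the 1/√D factor the logits have much higher variance and the softmax tends to collapse onto a single input.

import numpy as np

rng = np.random.default_rng(0)
D = 512
h = rng.normal(size=D)               # query vector
x = rng.normal(size=(6, D))          # six input vectors

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

e_raw    = x @ h                     # unscaled dot products: std grows like sqrt(D)
e_scaled = x @ h / np.sqrt(D)        # scaled dot products: std stays around 1
print(softmax(e_raw).round(3))       # typically peaked: almost all mass on one input
print(softmax(e_scaled).round(3))    # flatter, better-behaved attention distribution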
  • 53. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Multiple query vectors - each query creates a new output context vector mul(→) + add (↑) Multiple query vectors General attention layer Alignment 53 q0 Attention Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x D) softmax (↑) y1 Outputs: context vectors: y (shape: D) Operations: Alignment: ei,j = qj ᐧ xi / √D Attention: a = softmax(e) Output: yj = ∑i ai,j xi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors
  • 54. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Notice that the input vectors are used for both the alignment as well as the attention calculations. - We can add more expressivity to the layer by adding a different FC layer before each of the two steps. mul(→) + add (↑) General attention layer Alignment 54 q0 Attention Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x D) softmax (↑) y1 Outputs: context vectors: y (shape: D) Operations: Alignment: ei,j = qj ᐧ xi / √D Attention: a = softmax(e) Output: yj = ∑i ai,j xi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors
  • 55. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Notice that the input vectors are used for both the alignment as well as the attention calculations. - We can add more expressivity to the layer by adding a different FC layer before each of the two steps. General attention layer 55 q0 Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x Dk ) Operations: Key vectors: k = xWk Value vectors: v = xWv x2 x1 x0 q1 q2 Input vectors k2 k1 k0 v2 v1 v0
  • 56. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 The input and output dimensions can now change depending on the key and value FC layers Notice that the input vectors are used for both the alignment as well as the attention calculations. - We can add more expressivity to the layer by adding a different FC layer before each of the two steps. mul(→) + add (↑) General attention layer Alignment 56 q0 Attention Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x Dk ) softmax (↑) y1 Outputs: context vectors: y (shape: Dv ) Operations: Key vectors: k = xWk Value vectors: v = xWv Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors k2 k1 k0 v2 v1 v0
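A minimal NumPy sketch of this general attention layer in matrix form, with learned key and value projections and several query vectors processed at once; the weight shapes and sizes are illustrative assumptions.

import numpy as np

def general_attention(x, q, Wk, Wv):
    # x: (N, D) input vectors, q: (M, Dk) query vectors
    k = x @ Wk                                   # key vectors   (N, Dk)
    v = x @ Wv                                   # value vectors (N, Dv)
    e = q @ k.T / np.sqrt(k.shape[1])            # alignment scores (M, N), scaled dot product
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)         # softmax over the inputs, separately per query
    y = a @ v                                    # one context vector per query (M, Dv)
    return y, a

rng = np.random.default_rng(2)
N, M, D, Dk, Dv = 4, 2, 8, 6, 5
x = rng.normal(size=(N, D))
q = rng.normal(size=(M, Dk))
Wk = rng.normal(size=(D, Dk))
Wv = rng.normal(size=(D, Dv))
y, a = general_attention(x, q, Wk, Wv)
print(y.shape, a.sum(axis=1))        # (2, 5); each row of attention sums to 1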
  • 57. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Recall that the query vector was a function of the input vectors mul(→) + add (↑) General attention layer Alignment 57 q0 Attention Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x Dk ) softmax (↑) y1 Outputs: context vectors: y (shape: Dv ) Operations: Key vectors: k = xWk Value vectors: v = xWv Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors k2 k1 k0 v2 v1 v0 z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 MLP Encoder: h0 = fW (z) where z is spatial CNN features fW (.) is an MLP h0 CNN
  • 58. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 No input query vectors anymore Self attention layer 58 q0 Inputs: Input vectors: x (shape: N x D) Queries: q (shape: M x Dk ) Operations: Key vectors: k = xWk Value vectors: v = xWv Query vectors: q = xWq Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 q1 q2 Input vectors We can calculate the query vectors from the input vectors, therefore, defining a "self-attention" layer. Instead, query vectors are calculated using a FC layer.
  • 59. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 mul(→) + add (↑) Self attention layer Alignment 59 q0 Attention Inputs: Input vectors: x (shape: N x D) softmax (↑) y1 Outputs: context vectors: y (shape: Dv ) Operations: Key vectors: k = xWk Value vectors: v = xWv Query vectors: q = xWq Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors k2 k1 k0 v2 v1 v0
  • 60. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 mul(→) + add (↑) Self attention layer - attends over sets of inputs Alignment 60 q0 Attention Inputs: Input vectors: x (shape: N x D) softmax (↑) y1 Outputs: context vectors: y (shape: Dv ) Operations: Key vectors: k = xWk Value vectors: v = xWv Query vectors: q = xWq Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 e2,0 e1,0 e0,0 a2,0 a1,0 a0,0 e2,1 e1,1 e0,1 e2,2 e1,2 e0,2 a2,1 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors k2 k1 k0 v2 v1 v0 x2 x1 x0 self-attention y1 y2 y0
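A compact PyTorch sketch of the single-head self-attention layer just described, where queries, keys, and values are all computed from the input vectors; this is an illustrative module, not the course's reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    # Single-head self-attention: q, k, v all come from the inputs x.
    def __init__(self, dim_in, dim_qk, dim_v):
        super().__init__()
        self.Wq = nn.Linear(dim_in, dim_qk, bias=False)
        self.Wk = nn.Linear(dim_in, dim_qk, bias=False)
        self.Wv = nn.Linear(dim_in, dim_v, bias=False)
        self.scale = dim_qk ** 0.5

    def forward(self, x):                  # x: (N, dim_in), a set of input vectors
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        e = q @ k.t() / self.scale         # alignment scores (N, N)
        a = F.softmax(e, dim=-1)           # attention weights, each row sums to 1
        return a @ v                       # one output vector per input (N, dim_v)

x = torch.randn(3, 16)
layer = SelfAttention(16, 8, 8)
y = layer(x)                               # permutation equivariant: permuting x permutes y
print(y.shape)                             # torch.Size([3, 8])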
  • 61. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Self attention layer - attends over sets of inputs 61 x2 x1 x0 self-attention y1 y2 y0 Permutation equivariant Self-attention layer doesn’t care about the orders of the inputs! Problem: How can we encode ordered sequences like language or spatially ordered image features? x0 x1 x2 self-attention y1 y0 y2 x2 x0 x1 self-attention y0 y2 y1
  • 62. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Concatenate/add special positional encoding pj to each input vector xj We use a function pos: N →Rd to process the position j of the vector into a d-dimensional vector So, pj = pos(j) Positional encoding 62 x2 x1 x0 self-attention y1 y2 y0 p2 p1 p0 Desiderata of pos(.) : 1. It should output a unique encoding for each time-step (word’s position in a sentence) 2. Distance between any two time-steps should be consistent across sentences with different lengths. 3. Our model should generalize to longer sentences without any efforts. Its values should be bounded. 4. It must be deterministic. position encoding x2 x1 x0
  • 63. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Positional encoding 63 Options for pos(.) 1. Learn a lookup table: ○ Learn parameters to use for pos(t) for t ε [0, T) ○ Lookup table contains T x d parameters. Vaswani et al, “Attention is all you need”, NeurIPS 2017 Desiderata of pos(.) : 1. It should output a unique encoding for each time-step (word’s position in a sentence) 2. Distance between any two time-steps should be consistent across sentences with different lengths. 3. Our model should generalize to longer sentences without any efforts. Its values should be bounded. 4. It must be deterministic. Concatenate special positional encoding pj to each input vector xj We use a function pos: N →Rd to process the position j of the vector into a d-dimensional vector So, pj = pos(j) x2 x1 x0 self-attention y1 y2 y0 p2 p1 p0 position encoding x2 x1 x0
  • 64. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Positional encoding 64 Options for pos(.) 1. Learn a lookup table: ○ Learn parameters to use for pos(t) for t ε [0, T) ○ Lookup table contains T x d parameters. 2. Design a fixed function with the desiderata p(t) = where Concatenate special positional encoding pj to each input vector xj We use a function pos: N →Rd to process the position j of the vector into a d-dimensional vector So, pj = pos(j) x2 x1 x0 self-attention y1 y2 y0 p2 p1 p0 x2 x0 x1 position encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017
  • 65. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Positional encoding 65 Options for pos(.) 1. Learn a lookup table: ○ Learn parameters to use for pos(t) for t ε [0, T) ○ Lookup table contains T x d parameters. 2. Design a fixed function with the desiderata p(t) = where Intuition: image source Concatenate special positional encoding pj to each input vector xj We use a function pos: N →Rd to process the position j of the vector into a d-dimensional vector So, pj = pos(j) x2 x1 x0 self-attention y1 y2 y0 p2 p1 p0 x2 x1 x0 position encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017
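A sketch of the fixed sinusoidal encoding pos(t) referenced above (Vaswani et al.), assuming an even embedding dimension d; it satisfies the desiderata, giving each position a unique, bounded, deterministic vector.

import numpy as np

def pos_encoding(T, d):
    # p(t)[2i]   = sin(t / 10000^(2i/d))
    # p(t)[2i+1] = cos(t / 10000^(2i/d))
    t = np.arange(T)[:, None]                  # positions 0..T-1
    i = np.arange(0, d, 2)[None, :]            # even dimension indices
    angles = t / np.power(10000.0, i / d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P                                   # (T, d): one encoding vector per position

P = pos_encoding(T=50, d=16)
print(P.shape, P.min() >= -1, P.max() <= 1)    # bounded values in [-1, 1]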
  • 66. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 mul(→) + add (↑) Masked self-attention layer Alignment 66 q0 Attention Inputs: Input vectors: x (shape: N x D) softmax (↑) y1 Outputs: context vectors: y (shape: Dv ) Operations: Key vectors: k = xWk Value vectors: v = xWv Query vectors: q = xWq Alignment: ei,j = qj ᐧ ki / √D Attention: a = softmax(e) Output: yj = ∑i ai,j vi x2 x1 x0 -∞ -∞ e0,0 0 0 a0,0 -∞ e1,1 e0,1 e2,2 e1,2 e0,2 0 a1,1 a0,1 a2,2 a1,2 a0,2 q1 q2 y2 y0 Input vectors k2 k1 k0 v2 v1 v0 - Prevent vectors from looking at future vectors. - Manually set alignment scores to -infinity
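A small PyTorch sketch of the masking trick: set the alignment scores for future positions to -infinity before the softmax, so position t can only attend to inputs 0..t. The projection weights here are random toy tensors rather than learned layers.

import torch
import torch.nn.functional as F

N, D = 4, 8
x = torch.randn(N, D)
Wq, Wk, Wv = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)

q, k, v = x @ Wq, x @ Wk, x @ Wv
e = q @ k.t() / D ** 0.5                        # (N, N) alignment scores
causal_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
e = e.masked_fill(causal_mask, float('-inf'))   # block attention to future positions
a = F.softmax(e, dim=-1)                        # row t attends only to inputs 0..t
y = a @ v
print(a)                                        # upper triangle is exactly 0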
  • 67. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Multi-head self attention layer - Multiple self-attention heads in parallel 67 x2 x1 x0 Self-attention y1 y2 y0 x2 x1 x0 Self-attention y1 y2 y0 x2 x1 x0 Self-attention y1 y2 y0 head0 head1 ... headH-1 x2 x1 x0 y1 y2 y0 Concatenate Split
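A sketch of multi-head self-attention showing the split-into-heads / concatenate idea from the figure. PyTorch also ships nn.MultiheadAttention; this hand-rolled module is only illustrative and its hyperparameters are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # one projection producing q, k, v
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (N, dim)
        N, dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split each of q, k, v into heads: (num_heads, N, dh)
        q, k, v = (t.reshape(N, self.h, self.dh).transpose(0, 1) for t in (q, k, v))
        e = q @ k.transpose(-2, -1) / self.dh ** 0.5      # per-head scores (h, N, N)
        a = F.softmax(e, dim=-1)
        y = (a @ v).transpose(0, 1).reshape(N, dim)       # concatenate head outputs
        return self.proj(y)

y = MultiHeadSelfAttention(dim=32, num_heads=4)(torch.randn(5, 32))
print(y.shape)                                            # torch.Size([5, 32])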
  • 68. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 General attention versus self-attention 68 x2 x1 x0 self-attention y1 y2 y0 k2 k1 k0 attention y1 y2 y0 v2 v1 v0 q2 q1 q0
  • 69. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson Example: CNN with Self-Attention 69
  • 70. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Queries: C’ x H x W Keys: C’ x H x W Values: C’ x H x W 1x1 Conv 1x1 Conv 1x1 Conv Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Example: CNN with Self-Attention Slide credit: Justin Johnson 70
  • 71. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Queries: C’ x H x W Keys: C’ x H x W Values: C’ x H x W 1x1 Conv 1x1 Conv 1x1 Conv x Transpose softmax Attention Weights (H x W) x (H x W) Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Example: CNN with Self-Attention Slide credit: Justin Johnson 71
  • 72. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Queries: C’ x H x W Keys: C’ x H x W Values: C’ x H x W 1x1 Conv 1x1 Conv 1x1 Conv x Transpose softmax Attention Weights (H x W) x (H x W) x C’ x H x W Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Example: CNN with Self-Attention Slide credit: Justin Johnson 72
  • 73. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Queries: C’ x H x W Keys: C’ x H x W Values: C’ x H x W 1x1 Conv 1x1 Conv 1x1 Conv x Transpose softmax Attention Weights (H x W) x (H x W) x C’ x H x W 1x1 Conv C x H x W Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Example: CNN with Self-Attention Slide credit: Justin Johnson 73
  • 74. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Cat image is free to use under the Pixabay License Input Image CNN Features: C x H x W Queries: C’ x H x W Keys: C’ x H x W Values: C’ x H x W 1x1 Conv 1x1 Conv 1x1 Conv x Transpose softmax Attention Weights (H x W) x (H x W) x C’ x H x W 1x1 Conv + C x H x W Self-Attention Module Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Residual Connection Example: CNN with Self-Attention Slide credit: Justin Johnson 74
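A PyTorch sketch of the self-attention module drawn on slides 69-74: 1x1 convolutions produce queries, keys, and values, the (H x W) x (H x W) attention weights re-mix the values, and a final 1x1 convolution plus a residual connection map back to C x H x W. Channel sizes are illustrative, and this is a simplification of the Zhang et al. design rather than their exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSelfAttention(nn.Module):
    def __init__(self, C, C_qk):
        super().__init__()
        self.q = nn.Conv2d(C, C_qk, kernel_size=1)
        self.k = nn.Conv2d(C, C_qk, kernel_size=1)
        self.v = nn.Conv2d(C, C_qk, kernel_size=1)
        self.out = nn.Conv2d(C_qk, C, kernel_size=1)

    def forward(self, x):                     # x: (B, C, H, W) CNN feature map
        B, C, H, W = x.shape
        q = self.q(x).flatten(2)              # (B, C', H*W)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)        # (B, H*W, H*W) attention weights
        y = (v @ attn.transpose(1, 2)).view(B, -1, H, W)       # (B, C', H, W) re-mixed values
        return x + self.out(y)                                 # 1x1 conv back to C + residual

x = torch.randn(2, 64, 8, 8)
print(CNNSelfAttention(64, 16)(x).shape)      # torch.Size([2, 64, 8, 8])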
  • 75. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Comparing RNNs to Transformer RNNs (+) LSTMs work reasonably well for long sequences. (-) Expects an ordered sequences of inputs (-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done. Transformer: (+) Good at long sequences. Each attention calculation looks at all inputs. (+) Can operate over unordered sets or ordered sequences with positional encodings. (+) Parallel computation: All alignment and attention scores for all inputs can be done in parallel. (-) Requires a lot of memory: N x M alignment and attention scalers need to be calculated and stored for a single self-attention head. (but GPUs are getting bigger and better) 75
  • 76. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 76 “ImageNet Moment for Natural Language Processing” Pretraining: Download a lot of text from the internet Train a giant Transformer model for language modeling Finetuning: Fine-tune the Transformer on your own NLP task
  • 77. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 77
  • 78. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 78 Extract spatial features from a pretrained CNN Image Captioning using Transformers CNN Features: H x W x D z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Input: Image I Output: Sequence y = y1 , y2 ,..., yT
  • 79. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 79 Extract spatial features from a pretrained CNN Image Captioning using Transformers CNN Features: H x W x D z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Encoder: c = TW (z) where z is spatial CNN features TW (.) is the transformer encoder Input: Image I Output: Sequence y = y1 , y2 ,..., yT z0,2 z0,1 z0,0 z2,2 ... Transformer encoder c0,2 c0,1 c0,0 c2,2 ...
  • 80. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 80 Extract spatial features from a pretrained CNN Image Captioning using Transformers CNN Features: H x W x D z0,0 z0,1 z0,2 z1,0 z1,1 z1,2 z2,0 z2,1 z2,2 Encoder: c = TW (z) where z is spatial CNN features TW (.) is the transformer encoder Input: Image I Output: Sequence y = y1 , y2 ,..., yT z0,2 z0,1 z0,0 z2,2 ... Transformer encoder c0,2 c0,1 c0,0 c2,2 ... y0 [START] y1 y2 y1 y3 y2 person wearing hat y4 y3 person wearing hat [END] Transformer decoder Decoder: yt = TD (y0:t-1 , c) where TD (.) is the transformer decoder
  • 81. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 The Transformer encoder block 81 z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Made up of N encoder blocks. In Vaswani et al., N = 6, Dq = 512 Vaswani et al, “Attention is all you need”, NeurIPS 2017
  • 82. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 82 x2 x1 x0 x2 z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Let's dive into one encoder block Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 83. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 83 x2 x1 x0 Positional encoding x2 z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 84. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 84 x2 x1 x0 Positional encoding x2 Multi-head self-attention z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Attention attends over all the vectors Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 85. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 85 x2 x1 x0 Positional encoding x2 Multi-head self-attention + z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Attention attends over all the vectors Residual connection Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 86. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 86 x2 x1 x0 Positional encoding x2 Multi-head self-attention + Layer norm z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N LayerNorm over each vector individually Attention attends over all the vectors Residual connection Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 87. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 87 x2 x1 x0 Positional encoding x2 Multi-head self-attention + Layer norm MLP z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N MLP over each vector individually LayerNorm over each vector individually Attention attends over all the vectors Residual connection Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 88. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 88 x2 x1 x0 Positional encoding x2 Multi-head self-attention + Layer norm MLP + z0,2 z0,1 z0,0 z2,2 ... c0,2 c0,1 c0,0 c2,2 ... Transformer encoder ... x N Residual connection MLP over each vector individually LayerNorm over each vector individually Attention attends over all the vectors Residual connection Add positional encoding Vaswani et al, “Attention is all you need”, NeurIPS 2017 The Transformer encoder block
  • 89. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 89 The Transformer encoder block. Transformer Encoder Block: Inputs: a set of vectors x. Outputs: a set of vectors y. Self-attention is the only interaction between vectors; layer norm and the MLP operate independently on each vector. Highly scalable, highly parallelizable, but high memory usage. Vaswani et al, "Attention is all you need", NeurIPS 2017. [Diagram: x → positional encoding → multi-head self-attention → add & layer norm → MLP → add & layer norm → y]
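Collecting the steps above, a minimal sketch of one post-norm encoder block as drawn on the slide (self-attention → add & LayerNorm → per-vector MLP → add & LayerNorm), using PyTorch's nn.MultiheadAttention; hyperparameters such as d_ff = 2048 and the ReLU MLP are illustrative assumptions, not the lecture's reference values.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention is the only interaction between the vectors in x.
        a, _ = self.attn(x, x, x)           # queries, keys, and values all come from x
        x = self.norm1(x + a)               # residual connection + LayerNorm per vector
        x = self.norm2(x + self.mlp(x))     # MLP applied to each vector independently
        return x

blocks = nn.Sequential(*[EncoderBlock() for _ in range(6)])   # N = 6 as in Vaswani et al.
y = blocks(torch.randn(2, 49, 512))                           # set of vectors in, set of vectors out
```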
  • 90. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 90 The Transformer decoder block. Made up of N decoder blocks; in Vaswani et al., N = 6 and Dq = 512. Vaswani et al, "Attention is all you need", NeurIPS 2017. [Diagram: context vectors c and shifted caption tokens "[START] person wearing hat" → (decoder block) x N → predicted tokens "person wearing hat [END]"]
  • 91. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 91 The Transformer decoder block. Let's dive into the transformer decoder block: it takes the context vectors c from the encoder and the previous output tokens y0, ..., yt-1, and ends with a fully connected (FC) layer over its outputs. Vaswani et al, "Attention is all you need", NeurIPS 2017
  • 92. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 92 The Transformer Decoder block. Most of the network is the same as the transformer encoder: positional encoding, masked multi-head self-attention, residual connections, layer norm, and a per-vector MLP, followed by a final FC layer. Vaswani et al, "Attention is all you need", NeurIPS 2017
  • 93. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 93 The Transformer Decoder block. A multi-head attention block attends over the transformer encoder outputs: queries (q) come from the decoder, while keys (k) and values (v) come from the encoder outputs c. For image captioning, this is how we inject image features into the decoder. Vaswani et al, "Attention is all you need", NeurIPS 2017
  • 94. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 94 The Transformer Decoder block. Transformer Decoder Block: Inputs: a set of vectors x and a set of context vectors c. Outputs: a set of vectors y. Masked self-attention only interacts with past inputs. The multi-head attention block is NOT self-attention: it attends over the encoder outputs. Highly scalable, highly parallelizable, but high memory usage. Vaswani et al, "Attention is all you need", NeurIPS 2017. [Diagram: x → positional encoding → masked multi-head self-attention → add & layer norm → multi-head attention over c (q from decoder, k and v from c) → add & layer norm → MLP → add & layer norm → FC → y]
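A matching sketch of one decoder block, again with nn.MultiheadAttention: the causal (masked) self-attention and the cross-attention over the encoder outputs c follow the slide, while details such as the ReLU MLP and d_ff = 2048 are assumptions rather than the lecture's values.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        T = y.shape[1]
        # Causal mask: position t may only attend to positions <= t (past outputs).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=y.device), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=mask)      # masked self-attention
        y = self.norm1(y + a)
        # Cross-attention (NOT self-attention): queries from the decoder,
        # keys and values from the encoder outputs c (the image features).
        a, _ = self.cross_attn(y, c, c)
        y = self.norm2(y + a)
        return self.norm3(y + self.mlp(y))                  # per-vector MLP + residual

block = DecoderBlock()
c = torch.randn(2, 49, 512)          # encoder outputs (context vectors)
y = torch.randn(2, 5, 512)           # embedded caption tokens generated so far
out = block(y, c)                    # (2, 5, 512); a final FC layer would map to vocabulary scores
```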
  • 95. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 95 Image Captioning using transformers. Extract spatial features from a pretrained CNN (CNN features: H x W x D), feed them to the transformer encoder, and decode the caption with the transformer decoder. No recurrence at all. [Diagram: CNN feature grid → Transformer encoder → Transformer decoder generating "person wearing hat [END]"]
  • 96. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 96 Image Captioning using transformers. Perhaps we don't need convolutions at all? [Diagram: the same pipeline as the previous slide, with the pretrained CNN feature extractor highlighted]
  • 97. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 97 Image Captioning using ONLY transformers. Transformers from pixels to language: the CNN is replaced by a transformer encoder operating directly on image patches. Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ArXiv 2020. Colab link to an implementation of vision transformers. [Diagram: image patches → Transformer encoder → context vectors c → Transformer decoder generating "person wearing hat [END]"]
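A minimal sketch of the Vision Transformer idea (Dosovitskiy et al.): split the image into 16x16 patches, linearly embed each patch, add positional embeddings, and run a standard transformer encoder on the resulting token sequence. This omits the [CLS] token and classification head, and is not the paper's or the linked Colab's code.

```python
import torch
import torch.nn as nn

patch, d_model = 16, 512
# A strided conv is equivalent to: split into 16x16 patches, then linearly embed each patch.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
pos_embed = nn.Parameter(torch.zeros(1, (224 // patch) ** 2, d_model))  # learned positions
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

img = torch.randn(2, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (2, 196, 512): one token per patch
tokens = tokens + pos_embed
feats = encoder(tokens)                                # "an image is worth 16x16 words"
```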
  • 98. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 98 Vision Transformers vs. ResNets Dosovitskiy et al, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ArXiv 2020 Colab link to an implementation of vision transformers
  • 99. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 99 Vision Transformers Carion et al, “End-to-End Object Detection with Transformers”, ECCV 2020 Fan et al, “Multiscale Vision Transformers”, ICCV 2021 Liu et al, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, CVPR 2021
  • 100. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 100 ConvNets strike back! "A ConvNet for the 2020s", Liu et al., CVPR 2022
  • 101. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 101
  • 102. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 Summary - Adding attention to RNNs allows them to "attend" to different parts of the input at every time step - The general attention layer is a new type of layer that can be used to design new neural network architectures - Transformers are blocks built from self-attention and layer norm ○ They are highly scalable and highly parallelizable ○ Faster training, larger models, better performance across vision and language tasks ○ They are quickly replacing RNNs and LSTMs, and may(?) even replace convolutions. 102
  • 103. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 11 - May 03, 2022 103 Next time: Video Understanding