Show, Attend and Tell:
Neural Image Caption Generation with Visual Attention
by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel,
Yoshua Bengio, ICML 2015
Presented by Eun-ji Lee
2015.10.14
Data Mining Research Lab
Sogang University
Contents
1. Introduction
2. Image Caption Generation with Attention Mechanism
a. LSTM Tutorial
b. Model Details: Encoder & Decoder
3. Learning Stochastic “Hard” vs Deterministic “Soft” Attention
a. Stochastic “Hard” Attention
b. Deterministic “Soft” Attention
c. Training Procedure
4. Experiments
1. Introduction
"Scene understanding"
"Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed."
"hard" attention & "soft" attention
2-a. LSTM tutorial (1)
• $\mathbf{x}_t$ : the input to the memory cell layer at time $t$
• $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o$ : weight matrices
• $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_c, \mathbf{b}_o$ : bias vectors
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$ (Input gate)
2. $\tilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$ (Candidate state)
3. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$ (Forget gate)
4. $\mathbf{C}_t = \mathbf{i}_t * \tilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$ (Memory cells' new state)
5. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + V_o \mathbf{C}_t + \mathbf{b}_o)$ (Output gate)
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$ (Outputs, or hidden states)
http://deeplearning.net/tutorial/lstm.html#lstm
[Figure: LSTM memory cell with input $\mathbf{x}_t$, gates $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$, states $\mathbf{C}_{t-1}$, $\mathbf{C}_t$, and output $\mathbf{h}_t$]
2-a. LSTM tutorial (2)
• $\mathbf{x}_t$ : the input to the memory cell layer at time $t$
• $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o$ : weight matrices
• $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_c, \mathbf{b}_o$ : bias vectors
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$
2. $\tilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$
3. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$
4. $\mathbf{C}_t = \mathbf{i}_t * \tilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$
5. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + V_o \mathbf{C}_t + \mathbf{b}_o) \Rightarrow \mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o)$
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$
http://deeplearning.net/tutorial/lstm.html#lstm
[Figure: same LSTM memory cell diagram, with the peephole connection $V_o \mathbf{C}_t$ dropped from the output gate]
2-a. LSTM tutorial (3)
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$
2. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$
3. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o)$
4. $\tilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$
5. $\mathbf{C}_t = \mathbf{i}_t * \tilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$
http://deeplearning.net/tutorial/lstm.html#lstm
[Figure: LSTM memory cell diagram, as before]
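To make the recurrences concrete, here is a minimal NumPy sketch of a single LSTM step following equations 1-6 above (the simplified form without the peephole term $V_o \mathbf{C}_t$); the dimensions and random parameters are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following equations 1-6 above (no peephole term)."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
    C_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate state
    C_t = i_t * C_tilde + f_t * C_prev                                # new memory state
    h_t = o_t * np.tanh(C_t)                                          # hidden state
    return h_t, C_t

# Illustrative dimensions: input dim d=4, hidden dim n=3.
rng = np.random.default_rng(0)
d, n = 4, 3
p = {f"W_{g}": rng.normal(size=(n, d)) for g in "ifoc"}
p.update({f"U_{g}": rng.normal(size=(n, n)) for g in "ifoc"})
p.update({f"b_{g}": np.zeros(n) for g in "ifoc"})
h, C = np.zeros(n), np.zeros(n)
h, C = lstm_step(rng.normal(size=d), h, C, p)
```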
2-b. Model Details: Encoder
The model takes a single raw image and generates a caption $\mathbf{y}$ encoded as a sequence of 1-of-$K$ encoded words.
• Caption : $y = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}$, $\mathbf{y}_i \in \mathbb{R}^K$ ($K$: vocab size, $C$: caption length)
• Image : $a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$ ($D$: dim. of the representation corresponding to a part of the image)
[Figure: a CNN maps the image to annotation vectors $\mathbf{a}_1, \ldots, \mathbf{a}_L$, from which caption words $\mathbf{y}_i$ are generated]
2-b. Model Details: Encoder
• Caption : $y = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}$, $\mathbf{y}_i \in \mathbb{R}^K$ ($K$: vocab size, $C$: caption length)
• Image : $a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$ ($D$: dim. of the representation corresponding to a part of the image)
"We extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer."
2-b. Model Details: Decoder (LSTM)
• We use an LSTM [1] that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words.
[Figure: the decoder cell takes $\mathbf{y}_{t-1}$, $\mathbf{h}_{t-1}$ and $\mathbf{z}_t$ and emits $\mathbf{y}_t$]
[1] Hochreiter & Schmidhuber, 1997
2-b. LSTM
• $\mathbf{i}_t = \sigma(W_i E\mathbf{y}_{t-1} + U_i \mathbf{h}_{t-1} + Z_i \mathbf{z}_t + \mathbf{b}_i)$,
• $\mathbf{f}_t = \sigma(W_f E\mathbf{y}_{t-1} + U_f \mathbf{h}_{t-1} + Z_f \mathbf{z}_t + \mathbf{b}_f)$,
• $\mathbf{c}_t = \mathbf{f}_t \mathbf{c}_{t-1} + \mathbf{i}_t \tanh(W_c E\mathbf{y}_{t-1} + U_c \mathbf{h}_{t-1} + Z_c \mathbf{z}_t + \mathbf{b}_c)$,
• $\mathbf{o}_t = \sigma(W_o E\mathbf{y}_{t-1} + U_o \mathbf{h}_{t-1} + Z_o \mathbf{z}_t + \mathbf{b}_o)$,
• $\mathbf{h}_t = \mathbf{o}_t \tanh(\mathbf{c}_t)$.
$\mathbf{i}_t, \mathbf{f}_t, \mathbf{c}_t, \mathbf{o}_t, \mathbf{h}_t$ are the input gate, forget gate, memory, output gate and hidden state of the LSTM.
$W_\bullet$, $U_\bullet$, $Z_\bullet$ and $\mathbf{b}_\bullet$ are learned weight matrices and biases.
$\mathbf{E} \in \mathbb{R}^{m \times K}$ : an embedding matrix ($m$: embedding dim., $n$: LSTM dim.).
$\sigma$ : logistic sigmoid activation.
2-b. Context vector $\mathbf{z}_t$
• A dynamic representation of the relevant part of the image input at time $t$:
$\mathbf{z}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$, $\quad e_{ti} = f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1})$, $\quad \alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$
• The weight $\alpha_i$ of each annotation vector $\mathbf{a}_i$ ($a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$) is computed by an attention model $f_{att}$ for which we use a multilayer perceptron conditioned on $\mathbf{h}_{t-1}$ (see the sketch below).
• The weight $\alpha_{t,i}$ can be read two ways:
- (Stochastic attention) : the probability that location $i$ is the right place to focus for producing the next word.
- (Deterministic attention) : the relative importance to give to location $i$ in blending the $\mathbf{a}_i$'s together.
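A minimal NumPy sketch of the attention step above: score each annotation vector against $\mathbf{h}_{t-1}$ with a small MLP $f_{att}$ and softmax-normalize the scores. The single-hidden-layer shape and the parameter names (W_a, W_h, w_e) are assumptions for illustration; the paper only specifies that $f_{att}$ is an MLP conditioned on $\mathbf{h}_{t-1}$.

```python
import numpy as np

def attention_weights(a, h_prev, W_a, W_h, w_e):
    """alpha_{t,i} = softmax_i(f_att(a_i, h_{t-1})) over the L locations.

    a      : (L, D) annotation vectors
    h_prev : (n,)   previous LSTM hidden state
    W_a: (H, D), W_h: (H, n), w_e: (H,) for an assumed hidden size H.
    """
    e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ w_e   # e_{t,i} = f_att(a_i, h_{t-1}), shape (L,)
    e = e - e.max()                                  # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()              # softmax over locations
    return alpha
```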
2-b. Initialization (LSTM)
• The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors ($a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$) fed through two separate MLPs ($f_{init,c}$ and $f_{init,h}$):
$\mathbf{c}_0 = f_{init,c}\left(\frac{1}{L}\sum_i^L \mathbf{a}_i\right)$, $\quad \mathbf{h}_0 = f_{init,h}\left(\frac{1}{L}\sum_i^L \mathbf{a}_i\right)$
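A sketch of this initialization, assuming (as an illustration) that $f_{init,c}$ and $f_{init,h}$ are single tanh layers; the slide only says they are two separate MLPs.

```python
import numpy as np

def init_lstm_state(a, W_c0, b_c0, W_h0, b_h0):
    """c_0 and h_0 from the mean annotation vector; a is (L, D)."""
    a_mean = a.mean(axis=0)                 # (1/L) * sum_i a_i
    c0 = np.tanh(W_c0 @ a_mean + b_c0)      # f_init,c
    h0 = np.tanh(W_h0 @ a_mean + b_h0)      # f_init,h
    return c0, h0
```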
2-b. Output word probability
• We use a deep output layer (Pascanu et al., 2014) to compute the output word probability:
$p(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp\left(\mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \mathbf{z}_t)\right)$
where $\mathbf{L}_0 \in \mathbb{R}^{K \times m}$, $\mathbf{L}_h \in \mathbb{R}^{m \times n}$, $\mathbf{L}_z \in \mathbb{R}^{m \times D}$ and $\mathbf{E}$ are learned parameters initialized randomly.
• Vector exponential:
$\exp(\mathbf{v}) = \mathbf{1} + \mathbf{v} + \frac{1}{2!}\mathbf{v}^2 + \frac{1}{3!}\mathbf{v}^3 + \cdots = \mathbf{1}\cosh(\|\mathbf{v}\|) + \frac{\mathbf{v}}{\|\mathbf{v}\|}\sinh(\|\mathbf{v}\|)$
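In code, the deep output layer reduces to a softmax over logits built from the three projections; a minimal NumPy sketch using the shapes given above (note the exp here is the ordinary elementwise one used in a softmax):

```python
import numpy as np

def word_probabilities(y_prev_onehot, h_t, z_t, E, L0, Lh, Lz):
    """p(y_t | a, y_{1..t-1}) ∝ exp(L0 (E y_{t-1} + L_h h_t + L_z z_t)).

    E: (m, K), L0: (K, m), Lh: (m, n), Lz: (m, D).
    """
    logits = L0 @ (E @ y_prev_onehot + Lh @ h_t + Lz @ z_t)   # (K,) vocab logits
    logits = logits - logits.max()                            # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                        # softmax over the vocabulary
```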
3-a. Stochastic "Hard" Attention
• We represent the location variable $s_t$ as where the model decides to focus attention when generating the $t$-th word (i.e., the part to attend to at time $t$). $s_{t,i}$ is an indicator one-hot variable which is set to 1 if the $i$-th location (out of $L$) is the one used to extract visual features.
$p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i}$, $\quad \mathbf{z}_t = \sum_i s_{t,i} \mathbf{a}_i$
(One-hot vs. binary encoding: binary 00, 01, 10, 11 correspond to one-hot 0001, 0010, 0100, 1000.)
Recall: $\mathbf{a} = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$, $\quad \alpha_i = \dfrac{\exp(f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1}))}{\sum_{k=1}^{L} \exp(f_{att}(\mathbf{a}_k, \mathbf{h}_{t-1}))}$, $\quad \mathbf{z}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$, with $s_t$ the attention location variable.
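A minimal sketch of hard attention: draw one location from the multinoulli defined by $\alpha$ and use that single annotation vector as the context.

```python
import numpy as np

def hard_context(a, alpha, rng):
    """z_t = a_i with i ~ Multinoulli(alpha); s_t is the one-hot of i."""
    i = rng.choice(len(alpha), p=alpha)     # sample the attention location
    s_t = np.eye(len(alpha))[i]             # one-hot indicator variable s_t
    z_t = a[i]                              # z_t = sum_i s_{t,i} a_i
    return z_t, s_t
```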
3-a. A new objective function $L_s$
• A variational lower bound on the marginal log-likelihood $\log p(\mathbf{y}|\mathbf{a})$ of observing the sequence of words $\mathbf{y}$ given image features $\mathbf{a}$ (the inequality is Jensen's):
$L_s = \sum_s p(s|\mathbf{a}) \log p(\mathbf{y}|s,\mathbf{a}) \le \log \sum_s p(s|\mathbf{a})\, p(\mathbf{y}|s,\mathbf{a}) = \log p(\mathbf{y}|\mathbf{a})$
$\dfrac{\partial L_s}{\partial W} = \sum_s p(s|\mathbf{a}) \left[ \dfrac{\partial \log p(\mathbf{y}|s,\mathbf{a})}{\partial W} + \log p(\mathbf{y}|s,\mathbf{a}) \dfrac{\partial \log p(s|\mathbf{a})}{\partial W} \right]$
3-a. Approximation of the gradient
• Monte Carlo based sampling approximation of the gradient with respect to the model parameters:
$\tilde{s}_t^n \sim \text{Multinoulli}_L(\{\alpha_t^n\})$, $\quad \tilde{s}^n = (s_1^n, s_2^n, \ldots)$
$\dfrac{\partial L_s}{\partial W} \approx \dfrac{1}{N} \sum_{n=1}^{N} \left[ \dfrac{\partial \log p(\mathbf{y}|\tilde{s}^n,\mathbf{a})}{\partial W} + \log p(\mathbf{y}|\tilde{s}^n,\mathbf{a}) \dfrac{\partial \log p(\tilde{s}^n|\mathbf{a})}{\partial W} \right]$
Monte Carlo method
• An algorithm that computes the value of a function stochastically using random numbers.
• Used for approximate computation when the target quantity has no closed-form expression or is too complex to evaluate exactly.
(ex) Estimating π: (number of points inside the circle) / (total number of points) ≈ π/4 (see the sketch below).
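The π example in a few lines of Python (a standard Monte Carlo illustration, not from the paper): sample uniform points in the square $[-1,1]^2$ and count the fraction that lands in the unit circle.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
pts = rng.uniform(-1.0, 1.0, size=(N, 2))     # points in the square [-1, 1]^2
inside = (pts ** 2).sum(axis=1) <= 1.0        # which points fall inside the unit circle
print(4 * inside.mean())                      # ≈ 3.141...
```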
3-a. Variance Reduction
• A moving average baseline: upon seeing the $k$-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log-likelihoods with exponential decay:
$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} \mid \tilde{s}_k, \mathbf{a})$
• An entropy term on the multinoulli distribution, $H[s]$, is added:
$\dfrac{\partial L_s}{\partial W} \approx \dfrac{1}{N} \sum_{n=1}^{N} \left[ \dfrac{\partial \log p(\mathbf{y}|\tilde{s}^n,\mathbf{a})}{\partial W} + \lambda_r \left(\log p(\mathbf{y}|\tilde{s}^n,\mathbf{a}) - b\right) \dfrac{\partial \log p(\tilde{s}^n|\mathbf{a})}{\partial W} + \lambda_e \dfrac{\partial H[\tilde{s}^n]}{\partial W} \right]$
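A framework-agnostic sketch of how the baseline and the weighted log-derivative term enter training. The surrogate-objective formulation and the values of $\lambda_r$, $\lambda_e$ are illustrative assumptions; in an autodiff framework the reward weight would be detached (held constant) when differentiating.

```python
def update_baseline(b_prev, log_p_y):
    """b_k = 0.9 * b_{k-1} + 0.1 * log p(y | s~_k, a)."""
    return 0.9 * b_prev + 0.1 * log_p_y

def surrogate_objective(log_p_y, log_p_s, entropy_s, b, lam_r=1.0, lam_e=0.01):
    """Differentiating this w.r.t. the parameters, with `reward` treated as a
    constant weight on log_p_s, reproduces the gradient estimator above
    for a single sample s~."""
    reward = log_p_y - b   # variance-reduced learning signal
    return log_p_y + lam_r * reward * log_p_s + lam_e * entropy_s
```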
3-a. Stochastic "Hard" Attention
• In making a hard choice at every point, $\phi(\{\mathbf{a}_i\}, \{\alpha_i\})$ is a function that returns a sampled $\mathbf{a}_i$ at every point in time, based upon a multinoulli distribution parameterized by $\alpha$.
3-b. Deterministic "Soft" Attention
• Take the expectation of the context vector $\mathbf{z}_t$ directly,
$\mathbb{E}_{p(s_t|a)}[\mathbf{z}_t] = \sum_{i=1}^{L} \alpha_{t,i} \mathbf{a}_i$,
and formulate a deterministic attention model by computing a soft attention weighted annotation vector $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_i^L \alpha_i \mathbf{a}_i$.
• This corresponds to feeding in a soft $\alpha$-weighted context into the system.
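The deterministic counterpart replaces sampling with a single weighted sum; a minimal sketch, reusing the shapes from the hard-attention sketch above.

```python
import numpy as np

def soft_context(a, alpha):
    """E[z_t] = sum_i alpha_{t,i} * a_i; a is (L, D), alpha is (L,)."""
    return alpha @ a    # (D,) expected context vector
```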
3-b. Deterministic "Soft" Attention
• Learning the deterministic attention can be understood as approximately optimizing the marginal likelihood under the attention locations $s_t$.
• The hidden activation of the LSTM, $\mathbf{h}_t$, is a linear projection of the stochastic context vector $\mathbf{z}_t$ followed by a tanh non-linearity.
• To the first-order Taylor approximation, the expected value $\mathbb{E}_{p(s_t|a)}[\mathbf{h}_t]$ is equal to computing $\mathbf{h}_t$ using a single forward prop with the expected context vector $\mathbb{E}_{p(s_t|a)}[\mathbf{z}_t]$.
3-b. Deterministic "Soft" Attention
• Let $\mathbf{n}_t = \mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \mathbf{z}_t)$ ($\mathbf{n}_{t,i}$ : $\mathbf{n}_t$ computed by setting $\mathbf{z}_t = \mathbf{a}_i$).
• Define the normalized weighted geometric mean (NWGM) for the softmax $k$-th word prediction:
$NWGM[p(y_t = k \mid a)] = \dfrac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1|a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1|a)}} = \dfrac{\exp(\mathbb{E}_{p(s_t|a)}[n_{t,k}])}{\sum_j \exp(\mathbb{E}_{p(s_t|a)}[n_{t,j}])}$
3-b. Deterministic "Soft" Attention
• The NWGM can be approximated well by $\mathbb{E}[\mathbf{n}_t] = \mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbb{E}[\mathbf{h}_t] + \mathbf{L}_z \mathbb{E}[\mathbf{z}_t])$. (It shows that the NWGM of a softmax unit is obtained by applying softmax to the expectations of the underlying linear projections.)
• Also, from the results in (Baldi & Sadowski, 2014), $NWGM[p(\mathbf{y}_t = k \mid \mathbf{a})] \approx \mathbb{E}[p(\mathbf{y}_t = k \mid \mathbf{a})]$ under softmax activation.
• This means the expectation of the outputs over all possible attention locations induced by the random variable $s_t$ is computed by simple feedforward propagation with the expected context vector $\mathbb{E}[\mathbf{z}_t]$.
• In other words, the deterministic attention model is an approximation to the marginal likelihood over the attention locations.
(Marginal likelihood over $\theta$: $p(X \mid \alpha) = \int_\theta p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta$.)
3-b-1. Doubly Stochastic Attention
• By construction, $\sum_i \alpha_{t,i} = 1$, as the $\alpha_{t,i}$ are the output of a softmax: $\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$.
• In training the deterministic version of our model, we introduce a form of doubly stochastic regularization where $\sum_t \alpha_{t,i} \approx 1$. (This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation.)
• This penalty was important for improving the overall BLEU score, and it leads to richer and more descriptive captions.
3-b-1. Doubly Stochastic Attention
• In addition, the soft attention model predicts a gating scalar $\beta$ from the previous hidden state $\mathbf{h}_{t-1}$ at each time step $t$, s.t.
$\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_i^L \alpha_i \mathbf{a}_i$, where $\beta_t = \sigma(f_\beta(\mathbf{h}_{t-1}))$.
• This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step.
• Qualitatively, we observe that the gating variable is larger when the decoder describes an object in the image.
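With the gate added, the context computation becomes the following (a sketch; treating $f_\beta$ as a single linear layer is an assumption for illustration).

```python
import numpy as np

def gated_soft_context(a, alpha, h_prev, w_beta, b_beta):
    """phi({a_i}, {alpha_i}) = beta_t * sum_i alpha_i a_i,
    with beta_t = sigmoid(f_beta(h_{t-1}))."""
    beta_t = 1.0 / (1.0 + np.exp(-(w_beta @ h_prev + b_beta)))  # scalar gate in (0, 1)
    return beta_t * (alpha @ a)
```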
3-b. Soft Attention Model
• The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:
$L_d = -\log p(y \mid a) + \lambda \sum_i^L \left(1 - \sum_t^C \alpha_{ti}\right)^2$
where we simply fixed $\tau$ to 1 (the general form of the penalty uses a target value $\tau$ in place of the 1).
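In code, the penalized objective is straightforward; a sketch where `alphas` is the $(C, L)$ matrix of attention weights collected over all $C$ time steps and `log_p_y` is the caption log-likelihood produced by the model (both assumed given).

```python
import numpy as np

def soft_attention_loss(log_p_y, alphas, lam):
    """L_d = -log p(y|a) + lambda * sum_i (1 - sum_t alpha_{t,i})^2."""
    penalty = ((1.0 - alphas.sum(axis=0)) ** 2).sum()   # sum over the L locations
    return -log_p_y + lam * penalty
```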
3-c. Training
• Both variants of our attention model were trained with SGD using adaptive learning rate algorithms.
• To create the $a_i$, we used the Oxford VGGnet pretrained on ImageNet without finetuning. We use the 14 × 14 × 512 feature map of the 4th convolutional layer before max pooling. This means our decoder operates on the flattened 196 × 512 ($L \times D$) encoding.
• (MS COCO) The soft attention model took less than 3 days to train (NVIDIA Titan Black GPU).
• GoogLeNet or Oxford VGG can give a boost in performance over using AlexNet.
4. Experiments
• Data:
Dataset    | Flickr8k            | Flickr30k           | MS COCO
Images     | 8,000               | 30,000              | 82,738
References | 5 sentences / image | 5 sentences / image | more than 5 / image
• Metric : BLEU (Bilingual Evaluation Understudy)
 - An algorithm for evaluating the quality of text which has been machine translated from one natural language to another.
 - Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU.
4. Experiments
• We are able to significantly improve the state-of-the-art METEOR performance on MS COCO, which we speculate is connected to some of the regularization techniques and our lower-level representation.
• Our approach is much more flexible, since the model can attend to "non-object" salient regions.
Reference
• Papers
 - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Kelvin Xu et al., ICML 2015
• Useful websites
 - Deep learning library overview and RNN tutorial (in Korean): http://aikorea.org/
 - LSTM tutorial: http://deeplearning.net/tutorial/lstm.html#lstm
 - BLEU: a Method for Automatic Evaluation of Machine Translation (http://www.aclweb.org/anthology/P02-1040.pdf)