Multimodal Residual Networks for Visual QA

Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak,
Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha,  
Byoung-Tak Zhang
9 June 2016
Biointelligence Lab.
Seoul National University @jnhwkim

Table of Contents
1. Visual QA Challenge
2. Background
   1. Deep Residual Learning
   2. Stacked Attention Networks
   3. Element-wise Multiplication
3. Multimodal Residual Networks
4. Results
   1. Quantitative Analysis
   2. Qualitative Analysis
5. References

Visual QA Challenge
■ What is VQA?
• VQA is a new dataset containing open-ended questions about images.
• These questions require an understanding of vision, language and
commonsense knowledge to answer.
[Agrawal et al., 2015]

Deep Residual Learning
[Figure: three architectures compared layer by layer: VGG-19, a 34-layer plain network, and a 34-layer residual network. Each takes an image as input, the output size shrinks from 224 through 112, 56, 28, 14 and 7 to 1, and each ends in fc 1000.]
Figure 2. Residual learning: a building block. The input x passes through two weight layers with a relu in between to give F(x); the identity shortcut adds x, and a relu is applied to F(x) + x.
[He et al., 2015]
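The building block in Figure 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the weight layers are plain dense matrices rather than the convolutions used by He et al., and the names `residual_block`, `w1` and `w2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Residual building block: relu(F(x) + x) with an identity shortcut."""
    fx = w2 @ relu(w1 @ x)   # F(x): weight layer -> relu -> weight layer
    return relu(fx + x)      # add the shortcut x, then apply the final relu

rng = np.random.default_rng(0)
dim = 4                      # assumed toy dimension
x = rng.standard_normal(dim)
w1 = 0.1 * rng.standard_normal((dim, dim))
w2 = 0.1 * rng.standard_normal((dim, dim))
y = residual_block(x, w1, w2)

# With zero weights F(x) = 0, so the block reduces to relu(x):
zeros = np.zeros((dim, dim))
y_identity = residual_block(x, zeros, zeros)
```

The zero-weight case shows why the shortcut helps: the block only has to learn the residual F(x), and doing nothing is the easy default.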

Stacked Attention Networks
[Figure (a): Stacked Attention Network for Image QA. The question ("What are sitting in the basket on a bicycle?") is encoded by a CNN/LSTM into a query; a CNN provides feature vectors of different parts of the image; the query passes through attention layer 1 and attention layer 2, and a softmax gives the answer ("dogs").]
[Figure (b): Visualization of the learned multiple attention layers (original image, first attention layer, second attention layer). The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket and objects in the basket (dogs), in the first attention layer, then further narrows down the focus in the second layer and finds the answer, dog.]
[Yang et al., 2015]

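Stacked Attention Networks
■ Attentional Parameters
• For linear combination of visual features
• Shortcut for question vector
■ Representative Bottleneck
• The question contributes to the joint representation only weakly, through the coefficients p,
• which may cause a “bottleneck”
[Yang et al., 2015]
[Figure: two stacked attention layers. Layer 1 combines vQ and vI into hA and the attention weights pI, giving the attended vector vI’ and the refined query u; Layer 2 repeats this with u and vI.]
$$u^k = \tilde{v}_I^k + u^{k-1}$$
$$\tilde{v}_I^k = \sum_i p_i^k v_i$$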
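The two update rules on this slide (the attended visual vector as a weighted sum of region features, and the query refined by adding it to the previous query) can be sketched as follows. How the weights p are computed is an assumption here: a simplified, bias-free tanh scoring in the spirit of Yang et al. The function name, dimensions and random weights are all hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_layer(v_i, u_prev, w_i, w_q, w_p):
    """One attention layer. v_i: (m, d) region features, u_prev: (d,) query."""
    h_a = np.tanh(v_i @ w_i.T + w_q @ u_prev)  # (m, k); query broadcast to every region
    p = softmax(h_a @ w_p)                     # (m,) attention weights over regions
    v_tilde = p @ v_i                          # sum_i p_i * v_i
    return v_tilde + u_prev, p                 # u^k = v_tilde^k + u^{k-1}

rng = np.random.default_rng(0)
m, d, k = 5, 6, 4                              # regions, feature size, hidden size (assumed)
v_i = rng.standard_normal((m, d))
v_q = rng.standard_normal(d)

def random_params():
    return (rng.standard_normal((k, d)),
            rng.standard_normal((k, d)),
            rng.standard_normal(k))

u1, p1 = attention_layer(v_i, v_q, *random_params())  # layer 1 starts from the question vector
u2, p2 = attention_layer(v_i, u1, *random_params())   # layer 2 refines the query
```

Note that the question enters the attended vector only through the scalar weights p, which is the bottleneck the slide points out.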

Multimodal Learning for VQA
■ A Strong Baseline by Lu et al., 2015
• A simple method of element-wise multiplication after linear-tanh
embeddings
• Outperforms some recent works, DPPnet (Noh et al., 2015) and
D-NMN (Andreas et al., 2016).
[Figure: vQ and vI each pass through a tanh embedding, are multiplied element-wise (◉), and the result is fed to a softmax.]
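A minimal sketch of this baseline: both modalities are embedded with linear-tanh layers, multiplied element-wise and classified with a softmax. The classifier matrix `w_out`, the dimensions and the function name `multiply_and_classify` are assumptions for illustration, not details taken from Lu et al.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multiply_and_classify(v_q, v_i, w_q, w_i, w_out):
    """Joint representation by element-wise multiplication of tanh embeddings."""
    joint = np.tanh(w_q @ v_q) * np.tanh(w_i @ v_i)  # both embeddings share one size
    return softmax(w_out @ joint), joint

rng = np.random.default_rng(0)
d_q, d_i, d_joint, n_answers = 6, 8, 5, 3            # assumed toy sizes
v_q = rng.standard_normal(d_q)
v_i = rng.standard_normal(d_i)
w_q = rng.standard_normal((d_joint, d_q))
w_i = rng.standard_normal((d_joint, d_i))
w_out = rng.standard_normal((n_answers, d_joint))
probs, joint = multiply_and_classify(v_q, v_i, w_q, w_i, w_out)
```

Because each tanh output lies in [-1, 1], every question unit scales the matching visual unit directly, unlike the scalar attention weights of SAN.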

Multimodal Residual Networks
■ Residual Networks for Multimodal Inputs
• A shortcut mapping of SAN (question vector)
• Element-wise multiplication for the joint residual function
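[Figure: MRN overview. The question Q ("What kind of animals are these ?") is encoded by word embedding (word2vec, Mikolov et al., 2013) and an RNN (skip-thought vectors, Kiros et al., 2015); the visual features V come from a CNN (ResNet, He et al., 2016). Q and V are combined by element-wise multiplication with question shortcuts, and a softmax gives the answer A ("sheep").]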

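[Figure: MRN with three learning blocks. In each block the question-side input (Q, then H1, then H2) passes through Linear-Tanh, V passes through Linear-Tanh-Linear-Tanh, the two are multiplied element-wise (⊙), and the product is added (⊕) to a Linear shortcut of the question-side input, giving H1, H2 and H3. A final Linear and Softmax produce the answer A.]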
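The stacked learning blocks in the figure can be sketched as below, assuming tanh for the nonlinearity and a Linear shortcut in every block, as the diagram suggests. The helper names `make_block` and `mrn_forward`, the dictionary layout, the dimensions and the random weights are hypothetical, so this is an outline of the data flow rather than the trained model.

```python
import numpy as np

def joint_residual(h, v, blk):
    """F(h, v) = tanh(W_q h) * tanh(W_2 tanh(W_1 v))."""
    return np.tanh(blk["w_q"] @ h) * np.tanh(blk["w_2"] @ np.tanh(blk["w_1"] @ v))

def mrn_forward(q, v, blocks, w_out):
    h = q
    for blk in blocks:
        # Linear shortcut of the question-side input plus the joint residual
        h = blk["w_short"] @ h + joint_residual(h, v, blk)
    logits = w_out @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # softmax over candidate answers

def make_block(rng, d_in, d_v, d):
    return {"w_short": 0.1 * rng.standard_normal((d, d_in)),
            "w_q": 0.1 * rng.standard_normal((d, d_in)),
            "w_1": 0.1 * rng.standard_normal((d, d_v)),
            "w_2": 0.1 * rng.standard_normal((d, d))}

rng = np.random.default_rng(0)
d_q, d_v, d, n_answers = 10, 12, 8, 5        # assumed toy sizes
q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
blocks = [make_block(rng, d_q, d_v, d)] + [make_block(rng, d, d_v, d) for _ in range(2)]
w_out = rng.standard_normal((n_answers, d))
probs = mrn_forward(q, v, blocks, w_out)
```

Only the first block maps from the question dimension; later blocks take the previous H as their question-side input while V is fed to every block unchanged.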

Exploring Alternative Models
[Figure: alternative learning-block designs (a)-(e). Each combines the question-side input (Q or Hl) and V through Linear and Tanh layers, an element-wise multiplication (⊙) and a shortcut addition (⊕). The variants differ in the number of Linear-Tanh layers and in the shortcut; the conditional labels read "if l=1 ... else Identity" and "if l=1 Linear else none".]
Table 1: The results of alternative models (a)-(e) on the test-dev (Open-Ended).
Model  All    Y/N    Num.   Other
(a)    60.17  81.83  38.32  46.61
(b)    60.53  82.53  38.34  46.78
(c)    60.19  81.91  37.87  46.70
(d)    59.69  81.67  37.23  46.00
(e)    60.20  81.98  38.25  46.57
Table 2: The effect of the visual features and # of target answers on the test-dev results (Open-Ended). Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.
Features  All    Y/N    Num.   Other
Vgg, 1k   60.53  82.53  38.34  46.78
Vgg, 2k   60.79  82.13  38.87  47.52
Vgg, 3k   60.68  82.40  38.69  47.10
Res, 1k   61.45  82.36  38.40  48.81
Res, 2k   61.68  82.28  38.82  49.25
Res, 3k   61.47  82.28  39.09  48.76

Results on VQA test-standard
Table 3: The VQA test-standard results. Some accuracies [30, 1] are reported with one digit less precision than the others, so they are zero-filled to match.
               Open-Ended                   Multiple-Choice
               All    Y/N    Num.   Other   All    Y/N    Num.   Other
DPPnet [21]    57.36  80.28  36.92  42.24   62.69  80.35  38.79  52.79
D-NMN [1]      58.00  -      -      -       -      -      -      -
Deep Q+I [11]  58.16  80.56  36.53  43.73   63.09  80.59  37.70  53.64
SAN [30]       58.90  -      -      -       -      -      -      -
ACK [27]       59.44  81.07  37.12  45.83   -      -      -      -
FDA [9]        59.54  81.34  35.67  46.10   64.18  81.25  38.30  55.20
DMN+ [28]      60.36  80.43  36.82  48.33   -      -      -      -
MRN            61.84  82.39  38.23  49.41   66.33  82.41  39.57  58.40
Human [2]      83.30  95.77  83.39  72.67   -      -      -      -
5.1 Visualization
In Equation 3, the left term $\sigma(W_q q)$ can be seen as a masking (attention) vector to select a part of visual information. We assume that the difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$ indicates an attention effect caused by the masking vector. Then, the attention effect $L_{att} = \frac{1}{2} \| V - F \|^2$ is visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$.

Visualization
■ Attentive Effect
• The difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$, caused by the masking vector $\sigma(W_q q)$.
■ Visualization of Input Gradient
• Then, the attention effect is visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$, while treating $F$ as a constant.
$$L_{att} = \frac{1}{2} \| V - F \|^2$$
$$\frac{\partial L_{att}}{\partial I} = \frac{\partial V}{\partial I} (V - F)$$
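The gradient rule above can be checked numerically on a toy example. Since no CNN is available here, this sketch differentiates with respect to the visual feature vector v rather than the image I; that substitution, the tanh nonlinearity, the function names and the random weights are all assumptions. F is computed once and then held constant, as the slide describes.

```python
import numpy as np

def visual_branch(v, w_1, w_2):
    a = np.tanh(w_1 @ v)
    return np.tanh(w_2 @ a), a               # V and the inner activation

def attention_loss(v, f_const, w_1, w_2):
    big_v, _ = visual_branch(v, w_1, w_2)
    return 0.5 * np.sum((big_v - f_const) ** 2)

def attention_grad(v, f_const, w_1, w_2):
    """Chain rule for dL/dv = (dV/dv)^T (V - F), with F treated as a constant."""
    big_v, a = visual_branch(v, w_1, w_2)
    delta = (big_v - f_const) * (1.0 - big_v ** 2)   # back through the outer tanh
    delta = (w_2.T @ delta) * (1.0 - a ** 2)         # back through W_2 and the inner tanh
    return w_1.T @ delta                             # back through W_1

rng = np.random.default_rng(0)
d_q, d_v, d = 5, 7, 4                                # assumed toy sizes
q, v = rng.standard_normal(d_q), rng.standard_normal(d_v)
w_q = rng.standard_normal((d, d_q))
w_1 = 0.5 * rng.standard_normal((d, d_v))
w_2 = 0.5 * rng.standard_normal((d, d))

big_v, _ = visual_branch(v, w_1, w_2)
f_const = np.tanh(w_q @ q) * big_v                   # masked vector F(q, v), then frozen
grad = attention_grad(v, f_const, w_1, w_2)

# Central finite differences as an independent check
eps = 1e-5
numeric = np.array([
    (attention_loss(v + eps * e, f_const, w_1, w_2)
     - attention_loss(v - eps * e, f_const, w_1, w_2)) / (2 * eps)
    for e in np.eye(d_v)
])
```

Large entries of `grad` mark the inputs whose change would most alter the gap between V and its masked version, which is what the image-level visualization highlights.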

Visualization (Cont’d)
[Figure: the pretrained three-block MRN (Q and V in, blocks H1, H2, H3, then Linear and Softmax to A), annotated with the 1st, 2nd and 3rd visualizations, one per learning block.]

Visualization Examples
(a) What kind of animals are these ? sheep
(b) What animal is the picture ? elephant
(c) What is this animal ? zebra
(d) What game is this person playing ? tennis
(e) How many cats are here ? 2
(f) What color is the bird ? yellow
(g) What sport is this ? surfing
(h) Is the horse jumping ? yes

Visualization Examples
(f) What color is the bird ? yellow
(h) Is the horse jumping ? yes

Acknowledgments
This work was supported by Naver Corp.
and partly by the Korea government (IITP-R0126-16-1072-
SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF,
ADD-UD130070ID-BMRR).

References
• Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. arXiv:1505.00468v1, 1–16.
• Noh, H., Seo, P. H., & Han, B. (2015). Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. Retrieved from http://arxiv.org/abs/1511.05756
• Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2015). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274. Retrieved from http://arxiv.org/abs/1511.02274
• He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved from http://arxiv.org/abs/1512.03385
• Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks. Retrieved from http://arxiv.org/abs/1507.06228
• Kim, J.-H., Kim, J., Ha, J.-W., & Zhang, B.-T. (2016). TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26, pp. 165–166).
• Léonard, N., Waghmare, S., Wang, Y., & Kim, J.-H. (2015). rnn : Recurrent Library for Torch. arXiv:1511.07889. Retrieved from http://arxiv.org/abs/1511.07889
• Xiong, C., Merity, S., & Socher, R. (2016). Dynamic Memory Networks for Visual and Textual Question Answering. Retrieved from http://arxiv.org/abs/1603.01417
• Gal, Y. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287. Retrieved from http://arxiv.org/abs/1512.05287
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR.
• Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-Thought Vectors. In Advances in Neural Information Processing Systems 28 (pp. 3294–3302).

More Examples
(a) Does the man have good posture ? no
(b) Did he fall down ? yes
(c) Are there two cats in the picture ? no
(d) What color are the bears ? brown
(e) What are many of the people carrying ? umbrellas
(f) What color is the dog ? black
(g) Are these animals tall ? yes
(h) What animal is that ? sheep
(i) Are all the cows the same color ? no
(j) What is the reflection of in the mirror ? dog
(k) What are the giraffe in the foreground doing ? eating
(l) What animal is standing in the water other than birds ? bear

Comparative Analysis
(a1) What is the animal on the left ? giraffe
(a2) Can you see trees ? yes
(b1) What is the lady riding ? motorcycle
(b2) Is she riding the motorcycle on the street ? no

Failure Examples
(a) What animals are these ? bears ducks
(b) What are these animals ? cows goats
(c) What animals are visible ? sheep horses
(d) How many animals are depicted ? 2 1
(e) What flavor donut is this ? chocolate strawberry
(f) What is the man doing ? playing tennis frisbee
(g) What color are the giraffes eyelashes ? brown black
(h) What food is the bear trying to eat ? banana papaya
(i) What kind of animal is used to herd these animals ? sheep dog
(j) What species of tree are in the background ? pine palm
(k) Are there any birds on the photo ? no yes
(l) Why is the hydrant smiling ? happy someone drew on it

TrimZero in Torch rnn
■ MaskZero - a naive approach
[Figure: a batch of zero-padded sentences ("0 what is ..", "0 0 is ..", "how was your ..") with its 0/1 mask is fed to the LM, which yields the states "0 s1 s2 ..", "0 0 s1 ..", "s1 s2 s3 ..". In the second diagram only the non-zero inputs are passed to the LM, and the full batch layout is restored by a recovery at every step.]
■ TrimZero - training time reduced by 37.5%
[Kim et al., 2016]
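The contrast between the two approaches can be sketched for a single time step. This is not the Torch `rnn` module code: `step` is a hypothetical stand-in for a recurrent step, hidden state is omitted for brevity, and padded positions are assumed to be all-zero input rows.

```python
import numpy as np

def step(x, w, b):
    """Stand-in for one recurrent step (hypothetical, stateless)."""
    return np.tanh(x @ w + b)

def mask_zero_step(x, w, b):
    """Naive approach: compute every row, then zero the padded ones."""
    out = step(x, w, b)
    keep = np.any(x != 0, axis=1)        # rows that hold a real token
    out[~keep] = 0.0
    return out

def trim_zero_step(x, w, b):
    """Trim padded rows before computing, then recover the batch layout."""
    keep = np.any(x != 0, axis=1)
    out = np.zeros((x.shape[0], w.shape[1]))
    if keep.any():
        out[keep] = step(x[keep], w, b)  # compute only the non-padded rows
    return out                           # recovery: same shape as the full batch

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
x = np.array([[0.0, 0.0],                # padded position
              [0.5, -1.0],
              [1.0, 2.0]])
masked = mask_zero_step(x, w, b)
trimmed = trim_zero_step(x, w, b)
```

Both functions return the same values; the trimmed version simply avoids spending computation on padded rows, which matters more as the share of zeros in a batch grows.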

Note on TrimZero
■ GPU Computation
• The efficiency of TrimZero is degraded for CUDA computing;
• however, it mainly depends on batch size, rnn size and the number of
zeros in the inputs.
• Empirically, natural language sentences (mean length around 7~8, max
length 26) with batch size = 200 and rnn size = 2400 (skip-thought vectors)
still gain a decent computational advantage (37.5% less training time) for CUDA computing.
■ Citation
• Jin-Hwa Kim, Jeonghee Kim, Jung-Woo Ha and Byoung-Tak Zhang,
(2016). TrimZero: A Torch Recurrent Module for Efficient Natural
Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26,
pp. 165–166).

BGRUs (approx. accuracy+.5%)
[Figure: PennTreeBank benchmark, nn.GRU vs. nn.BGRU (dropout). Perplexity (axis range 70-150) against epoch (1-251) for four curves: GRU-TRAIN, GRU-VALID, BGRU-TRAIN, BGRU-VALID.]
Results on PennTreeBank
            Standard GRU  Dropout GRU  Bayesian GRU
Val perp.   130.75        102.78       99.47
Test perp.  125.10        99.13        95.31
Model configurations
Standard GRUs  Dropout GRUs  Bayesian GRUs
LookupTable    LookupTable   LookupTable
-              Dropout(.5)   -
GRUs           GRUs          BGRUs(.25)
-              Dropout(.5)   Dropout(.5)
weight_decay=1e-4 (two of the three configurations)
[Gal, 2015]
