Multimodal Residual Networks for Visual QA
Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang
9 June 2016
Biointelligence Lab.
Seoul National University @jnhwkim
Table of Contents
1. Visual QA Challenge
2. Background
   1. Deep Residual Learning
   2. Stacked Attention Networks
   3. Element-wise Multiplication
3. Multimodal Residual Networks
4. Results
   1. Quantitative Analysis
   2. Qualitative Analysis
5. References
Visual QA Challenge
■ What is VQA?
• VQA is a new dataset containing open-ended questions about images.
• These questions require an understanding of vision, language and
commonsense knowledge to answer.
[Agrawal et al., 2015]
Deep Residual Learning
[Figure: layer-by-layer comparison of three architectures, VGG-19, a 34-layer plain network and a 34-layer residual network, with output sizes 224, 112, 56, 28, 14, 7 and 1.]
[He et al., 2015]
[Figure: residual building block. The input x passes through two weight layers with a ReLU in between to give F(x); an identity shortcut carries x, and F(x) + x is followed by a ReLU.]
Figure 2. Residual learning: a building block.
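The building block in Figure 2 can be sketched in a few lines. This is a minimal illustration, assuming fully connected weight layers in place of the convolutions used by He et al.; the helper names (`linear`, `residual_block`) and the toy weights are hypothetical.

```python
def relu(xs):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in xs]


def linear(weights, xs):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = W2 relu(W1 x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


# With all-zero weights F(x) = 0, so a non-negative input passes
# through the identity shortcut unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, zeros)

# With identity weights, F(x) = relu(x) is added on top of x.
eye = [[1.0, 0.0], [0.0, 1.0]]
boosted = residual_block([1.0, -1.0], eye, eye)
```

The zero-weight case shows why the shortcut helps: a block that has learned nothing still forwards its input.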
Stacked Attention Networks
[Figure: the question "What are sitting in the basket on a bicycle?" is encoded by a CNN/LSTM into a query; a CNN provides feature vectors of different parts of the image; attention layers 1 and 2 combine them with the query, and a softmax gives the answer "dogs".]
(a) Stacked Attention Network for Image QA
[Figure panels: Original Image, First Attention Layer, Second Attention Layer.]
(b) Visualization of the learned multiple attention layers. The
stacked attention network first focuses on all referred concepts,
e.g., bicycle, basket and objects in the basket (dogs) in
the first attention layer and then further narrows down the focus in
the second layer and finds out the answer dog.
[Yang et al., 2015]
■ Attentional Parameters
• For linear combination of visual features
• Shortcut for question vector
■ Representative Bottleneck
• The question contributes to the joint representation only weakly, through the coefficients p,
• which may cause a “bottleneck”
Stacked Attention Networks
[Yang et al., 2015]
[Figure: SAN attention layers 1 and 2, with nodes vQ, vI, hA, pI, vI' and u.]
$u^{k} = \tilde{v}_I^{k} + u^{k-1}$
$\tilde{v}_I^{k} = \sum_i p_i^{k} v_i$
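The SAN update on this slide (attended visual vector plus the previous query) can be sketched as follows. This is an illustrative sketch only: the attention logits are passed in directly as a hypothetical `scores` argument rather than computed from hA, and `san_update` is not a function from any SAN implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def san_update(u_prev, region_feats, scores):
    """One attention layer: u^k = sum_i p_i^k v_i + u^{k-1}."""
    p = softmax(scores)
    dim = len(u_prev)
    attended = [sum(p_i * v[d] for p_i, v in zip(p, region_feats))
                for d in range(dim)]
    u = [a + q for a, q in zip(attended, u_prev)]
    return u, p


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # v_i, one vector per image region
u0 = [0.5, -0.5]                                # question vector (hypothetical values)
u1, p1 = san_update(u0, regions, [0.0, 0.0, 0.0])
```

Note that the question enters the attended vector only through the coefficients p and the additive shortcut, which is the bottleneck the slide points at.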
Multimodal Learning for VQA
■ A Strong Baseline by Lu et al., 2015
• A simple method of element-wise multiplication after linear-tanh
embeddings
• Outperforms some recent works, such as DPPnet (Noh et al., 2015) and D-NMN (Andreas et al., 2016).
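[Figure: vQ and vI each pass through a tanh embedding; the two embeddings are multiplied element-wise (◉) and fed to a softmax.]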
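A minimal sketch of the baseline's joint representation, assuming plain linear maps without biases: each modality goes through a linear-tanh embedding, the embeddings are multiplied element-wise, and a softmax scores the answers. The function and parameter names (`joint_scores`, `w_q`, `w_i`, `w_out`) are hypothetical, as are the toy weights.

```python
import math


def linear(weights, xs):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def tanh_embed(weights, xs):
    """Linear map followed by an element-wise tanh."""
    return [math.tanh(z) for z in linear(weights, xs)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_scores(v_q, v_i, w_q, w_i, w_out):
    """softmax(W_out (tanh(W_q v_q) * tanh(W_i v_i)))."""
    joint = [a * b for a, b in zip(tanh_embed(w_q, v_q), tanh_embed(w_i, v_i))]
    return softmax(linear(w_out, joint))


eye = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three candidate answers
# The two embeddings are non-zero in different dimensions, so their
# element-wise product is zero and every answer gets the same score.
probs = joint_scores([1.0, 0.0], [0.0, 1.0], eye, eye, w_out)
```

The example shows the multiplicative behaviour: a dimension contributes to the joint vector only when both modalities activate it.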
Multimodal Residual Networks
■ Residual Networks for Multimodal Inputs
• A shortcut mapping of SAN (question vector)
• Element-wise multiplication for the joint residual function
[Figure: a question Q (RNN) and an image V (CNN) are combined to predict an answer A through a softmax.]
Multimodal Residual Networks
[Figure: MRN overview for the question "What kind of animals are these ?" with the answer "sheep". Labels: word embedding; question shortcuts; element-wise multiplication; word2vec (Mikolov et al., 2013); skip-thought vectors (Kiros et al., 2015); ResNet (He et al., 2016).]
Multimodal Residual Networks
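[Figure: MRN with three learning blocks. Each block combines its question-side input (Q, then H1, then H2) with V through Linear and Tanh layers, multiplies the two branches element-wise (⊙) and adds a shortcut (⊕), producing H1, H2 and H3. A final Linear and Softmax produce the answer A.]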
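The three-block structure can be sketched as below. This is a rough illustration, assuming the joint residual function F(q, v) = σ(W_q q) ⊙ σ(W_2 σ(W_1 v)) quoted in the Visualization part of this deck, with σ = tanh and a linear shortcut; biases, dropout and the final classifier are omitted, and the parameter names (`w_s`, `w_q`, `w_1`, `w_2`) are hypothetical.

```python
import math


def linear(weights, xs):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def tanh_vec(xs):
    return [math.tanh(x) for x in xs]


def mrn_block(h, v, params):
    """H_l = W_s H_{l-1} + tanh(W_q H_{l-1}) * tanh(W_2 tanh(W_1 v))."""
    mask = tanh_vec(linear(params["w_q"], h))
    visual = tanh_vec(linear(params["w_2"], tanh_vec(linear(params["w_1"], v))))
    joint = [m * x for m, x in zip(mask, visual)]
    shortcut = linear(params["w_s"], h)
    return [s + j for s, j in zip(shortcut, joint)]


def mrn(q, v, blocks):
    """Stack learning blocks; the same visual input v feeds every block."""
    h = q
    for params in blocks:
        h = mrn_block(h, v, params)
    return h


eye = [[1.0, 0.0], [0.0, 1.0]]
block = {"w_s": eye, "w_q": eye, "w_1": eye, "w_2": eye}

# With a zero visual input the residual term vanishes and the question
# vector travels through the shortcuts of all three blocks.
h_zero = mrn([1.0, -1.0], [0.0, 0.0], [block, block, block])

# With a non-zero visual input the residual term changes the output.
h_one = mrn_block([1.0, -1.0], [1.0, 1.0], block)
```

Unlike SAN, the question here acts multiplicatively on every dimension of the visual embedding instead of only through attention coefficients.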
Exploring Alternative Models
[Figure: five alternative learning-block designs, (a) to (e). Each combines Q (or Hl) and V through a different arrangement of Linear and Tanh layers, an element-wise multiplication (⊙) and a shortcut addition (⊕). Annotations on the figure: "if l=1 ... else Identity" and "if l=1 Linear, else none".]
Table 1: The results of alternative models (a)-(e) on the test-dev.
Open-Ended
       All     Y/N     Num.    Other
(a)    60.17   81.83   38.32   46.61
(b)    60.53   82.53   38.34   46.78
(c)    60.19   81.91   37.87   46.70
(d)    59.69   81.67   37.23   46.00
(e)    60.20   81.98   38.25   46.57
Table 2: The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.
Open-Ended
           All     Y/N     Num.    Other
Vgg, 1k    60.53   82.53   38.34   46.78
Vgg, 2k    60.79   82.13   38.87   47.52
Vgg, 3k    60.68   82.40   38.69   47.10
Res, 1k    61.45   82.36   38.40   48.81
Res, 2k    61.68   82.28   38.82   49.25
Res, 3k    61.47   82.28   39.09   48.76
Results on VQA test-standard
Table 3: The VQA test-standard results. Some accuracies [30, 1] are reported with one fewer digit of precision than the others, so they are zero-filled to match.
                 Open-Ended                       Multiple-Choice
                 All     Y/N     Num.    Other    All     Y/N     Num.    Other
DPPnet [21]      57.36   80.28   36.92   42.24    62.69   80.35   38.79   52.79
D-NMN [1]        58.00   -       -       -        -       -       -       -
Deep Q+I [11]    58.16   80.56   36.53   43.73    63.09   80.59   37.70   53.64
SAN [30]         58.90   -       -       -        -       -       -       -
ACK [27]         59.44   81.07   37.12   45.83    -       -       -       -
FDA [9]          59.54   81.34   35.67   46.10    64.18   81.25   38.30   55.20
DMN+ [28]        60.36   80.43   36.82   48.33    -       -       -       -
MRN              61.84   82.39   38.23   49.41    66.33   82.41   39.57   58.40
Human [2]        83.30   95.77   83.39   72.67    -       -       -       -
5.1 Visualization
In Equation 3, the left term $\sigma(W_q q)$ can be seen as a masking (attention) vector to select a part of visual information. We assume that the difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$ indicates an attention effect caused by the masking vector. Then, the attention effect $L_{att} = \frac{1}{2}\|V - F\|^2$ is visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$.
Visualization
■ Attentive Effect
• The difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$, caused by the masking vector $\sigma(W_q q)$.
■ Visualization of Input Gradient
• Then, the attention effect is visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$, while treating $F$ as a constant.

$L_{att} = \frac{1}{2}\|V - F\|^2$
$\frac{\partial L_{att}}{\partial I} = \frac{\partial V}{\partial I}(V - F)$
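The gradient above can be checked numerically on a toy model. This is a sketch under strong simplifications: a single tanh layer V = tanh(W x) stands in for the whole visual pathway from image to V, the vector `x` stands in for the image I, and F is a fixed vector. All names and values are hypothetical.

```python
import math


def forward(w, x):
    """V = tanh(W x): a one-layer stand-in for the visual branch."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def att_loss(w, x, f):
    """L_att = 1/2 * ||V - F||^2."""
    v = forward(w, x)
    return 0.5 * sum((vi - fi) ** 2 for vi, fi in zip(v, f))


def att_grad(w, x, f):
    """dL_att/dx = W^T ((1 - V^2) * (V - F)), treating F as a constant."""
    v = forward(w, x)
    delta = [(1.0 - vi * vi) * (vi - fi) for vi, fi in zip(v, f)]
    return [sum(w[r][c] * delta[r] for r in range(len(w)))
            for c in range(len(x))]


def numeric_grad(w, x, f, eps=1e-6):
    """Central finite differences, used only to check att_grad."""
    grads = []
    for c in range(len(x)):
        hi = x[:c] + [x[c] + eps] + x[c + 1:]
        lo = x[:c] + [x[c] - eps] + x[c + 1:]
        grads.append((att_loss(w, hi, f) - att_loss(w, lo, f)) / (2 * eps))
    return grads


W = [[0.5, -0.3], [0.2, 0.8]]
x = [0.4, -0.1]   # stand-in for the image input I
F = [0.1, 0.0]    # masked vector, held constant
analytic = att_grad(W, x, F)
numeric = numeric_grad(W, x, F)
```

In the full model the same quantity would be obtained by backpropagating (V − F) through the visual branch and the pretrained CNN down to the image pixels.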
Visualization (Cont’d)
[Figure: the three-block MRN diagram repeated, annotated with "pretrained model" and the 1st, 2nd and 3rd visualizations.]
Visualization Examples
(a) What kind of animals are these ? sheep
(b) What animal is the picture ? elephant
(c) What is this animal ? zebra
(d) What game is this person playing ? tennis
(e) How many cats are here ? 2
(f) What color is the bird ? yellow
(g) What sport is this ? surfing
(h) Is the horse jumping ? yes
Visualization Examples
(f) What color is the bird ? yellow
(h) Is the horse jumping ? yes
Acknowledgments
This work was supported by Naver Corp.
and partly by the Korea government (IITP-R0126-16-1072-
SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF,
ADD-UD130070ID-BMRR).
References
• Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. arXiv:1505.00468v1, 1–16.
• Noh, H., Seo, P. H., & Han, B. (2015). Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. Retrieved from http://arxiv.org/abs/1511.05756
• Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2015). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274. Retrieved from http://arxiv.org/abs/1511.02274
• He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved from http://arxiv.org/abs/1512.03385
• Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks. Retrieved from http://arxiv.org/abs/1507.06228
• Kim, J.-H., Kim, J., Ha, J.-W., & Zhang, B.-T. (2016). TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26, pp. 165–166).
• Léonard, N., Waghmare, S., Wang, Y., & Kim, J.-H. (2015). rnn: Recurrent Library for Torch. arXiv:1511.07889. Retrieved from http://arxiv.org/abs/1511.07889
• Xiong, C., Merity, S., & Socher, R. (2016). Dynamic Memory Networks for Visual and Textual Question Answering. Retrieved from http://arxiv.org/abs/1603.01417
• Gal, Y. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287. Retrieved from http://arxiv.org/abs/1512.05287
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR.
• Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-Thought Vectors. In Advances in Neural Information Processing Systems 28 (pp. 3294–3302).
Any Questions?
Appendix
More Examples
(a) Does the man have good posture ? no
(b) Did he fall down ? yes
(c) Are there two cats in the picture ? no
(d) What color are the bears ? brown
(e) What are many of the people carrying ? umbrellas
(f) What color is the dog ? black
(g) Are these animals tall ? yes
(h) What animal is that ? sheep
(i) Are all the cows the same color ? no
(j) What is the reflection of in the mirror ? dog
(k) What are the giraffe in the foreground doing ? eating
(l) What animal is standing in the water other than birds ? bear
Comparative Analysis
(a1) What is the animal on the left ? giraffe
(a2) Can you see trees ? yes
(b1) What is the lady riding ? motorcycle
(b2) Is she riding the motorcycle on the street ? no
Failure Examples
(a) What animals are these ? bears / ducks
(b) What are these animals ? cows / goats
(c) What animals are visible ? sheep / horses
(d) How many animals are depicted ? 2 / 1
(e) What flavor donut is this ? chocolate / strawberry
(f) What is the man doing ? playing tennis / frisbee
(g) What color are the giraffes eyelashes ? brown / black
(h) What food is the bear trying to eat ? banana / papaya
(i) What kind of animal is used to herd these animals ? sheep / dog
(j) What species of tree are in the background ? pine / palm
(k) Are there any birds on the photo ? no / yes
(l) Why is the hydrant smiling ? happy / someone drew on it
TrimZero in Torch rnn
■ MaskZero - a naive approach
[Figure: a zero-padded batch of questions ("0 what is ..", "0 0 is ..", "how was your ..") with its mask. With MaskZero, every row goes through the LM at each step, including the zero rows. With TrimZero, only the non-zero rows (s1, s2, ..) go through the LM, and the original batch layout is recovered at every step.]
■ TrimZero - training time reduced by 37.5%
[Kim et al., 2016]
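The difference between the two approaches can be sketched for a single time step. This is an illustration in plain Python, not the Torch rnn API: `mask_zero_step`, `trim_zero_step` and the toy `double` step function are hypothetical, and rows are lists of floats instead of tensors.

```python
def is_zero(row):
    """A padded position is represented by an all-zero input row."""
    return all(x == 0 for x in row)


def mask_zero_step(step_fn, batch):
    """Naive approach: run step_fn on every row, then zero the padded ones."""
    outputs = [step_fn(row) for row in batch]
    return [[0.0] * len(out) if is_zero(row) else out
            for row, out in zip(batch, outputs)]


def trim_zero_step(step_fn, batch, out_dim):
    """Run step_fn only on non-zero rows, then restore the batch layout."""
    keep = [i for i, row in enumerate(batch) if not is_zero(row)]
    outputs = [[0.0] * out_dim for _ in batch]
    for i in keep:
        outputs[i] = step_fn(batch[i])
    return outputs, len(keep)


calls = {"n": 0}


def double(row):
    """Toy stand-in for one recurrent step; counts how often it runs."""
    calls["n"] += 1
    return [2.0 * x for x in row]


# First time step of a left-padded batch: only the third question has a word.
step1 = [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
masked = mask_zero_step(double, step1)
calls_masked = calls["n"]
calls["n"] = 0
trimmed, kept = trim_zero_step(double, step1, out_dim=2)
calls_trimmed = calls["n"]
```

Both functions return the same outputs, but the trimmed version evaluates the step function once instead of three times here; the saving grows with the share of zero rows in the batch.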
Note on TrimZero
■ GPU Computation
• The efficiency gain of TrimZero is reduced under CUDA computing;
• it depends mainly on the batch size, the rnn size and the number of zeros in the inputs.
• Empirically, natural language sentences (mean length around 7-8, max length 26) with batch size = 200 and rnn size = 2400 (skip-thought vectors) gain a decent computational advantage (+37.5%) under CUDA computing.
■ Citation
• Jin-Hwa Kim, Jeonghee Kim, Jung-Woo Ha and Byoung-Tak Zhang,
(2016). TrimZero: A Torch Recurrent Module for Efficient Natural
Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26,
pp. 165–166).
BGRUs (approx. accuracy+.5%)
[Figure: PennTreeBank Benchmark, nn.GRU vs. nn.BGRU (dropout). Perplexity (axis from 70 to 150) against epoch (1 to 251) for four curves: GRU-TRAIN, GRU-VALID, BGRU-TRAIN and BGRU-VALID.]
Results on PennTreeBank
              Standard GRU   Dropout GRU   Bayesian GRU
Val perp.     130.75         102.78        99.47
Test perp.    125.10         99.13         95.31

Standard GRUs   Dropout GRUs   Bayesian GRUs
LookupTable     LookupTable    LookupTable
-               Dropout(.5)    -
GRUs            GRUs           BGRUs(.25)
-               Dropout(.5)    Dropout(.5)
weight_decay=1e-4 (shown twice on the slide; column assignment not recoverable)
[Gal, 2015]
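The idea from Gal (2015) behind the Bayesian GRU is to sample one dropout mask per sequence and reuse it at every time step, instead of drawing a fresh mask per step. The sketch below illustrates only that masking scheme; it is not the nn.BGRU code, the inverted-dropout scaling is an assumption, and the function names are hypothetical.

```python
import random


def sample_mask(size, p, rng):
    """Inverted-dropout mask: 0 with probability p, else 1 / (1 - p)."""
    return [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(size)]


def per_step_dropout(sequence, p, rng):
    """Standard dropout: a fresh mask at every time step."""
    return [[x * m for x, m in zip(step, sample_mask(len(step), p, rng))]
            for step in sequence]


def per_sequence_dropout(sequence, p, rng):
    """Variational dropout: one mask, sampled once and reused at every step."""
    mask = sample_mask(len(sequence[0]), p, rng)
    return [[x * m for x, m in zip(step, mask)] for step in sequence]


rng = random.Random(0)
seq = [[1.0] * 8 for _ in range(5)]   # 5 time steps, 8 units (toy values)
tied = per_sequence_dropout(seq, 0.25, rng)
untied = per_step_dropout(seq, 0.25, rng)
```

With a constant input, every time step of `tied` is identical because the same units are dropped throughout the sequence, whereas `untied` generally differs from step to step.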

 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

Multimodal Residual Networks for Visual QA

  • 1. Multimodal Residual Networks for Visual QA. Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang. 9 June 2016. Biointelligence Lab., Seoul National University. @jnhwkim
  • 2. Table of Contents 1. Visual QA Challenge 2. Background (2.1 Deep Residual Learning, 2.2 Stacked Attention Networks, 2.3 Element-wise Multiplication) 3. Multimodal Residual Networks 4. Results (4.1 Quantitative Analysis, 4.2 Qualitative Analysis) 5. References
  • 3. Visual QA Challenge ■ What is VQA? • VQA is a new dataset containing open-ended questions about images. • These questions require an understanding of vision, language and commonsense knowledge to answer. [Agrawal et al., 2015]
  • 4. Deep Residual Learning [Figure: layer-by-layer comparison of three networks (VGG-19, a 34-layer plain network and a 34-layer residual network), with output sizes from 224 down to 1.] [He et al., 2015] Figure 2. Residual learning: a building block. Two weight layers with ReLU compute F(x); an identity shortcut carries x, and the block outputs F(x) + x.
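The building block in Figure 2 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming dense weight layers instead of the convolutions used in the networks above; the names `residual_block`, `w1` and `w2` are hypothetical and do not come from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F is two weight layers with a ReLU between."""
    f = relu(x @ w1) @ w2   # residual function F(x)
    return relu(f + x)      # identity shortcut added before the final ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # batch of 4 inputs, width 8 (assumed sizes)
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)

# With all-zero weights F(x) vanishes, so the block reduces to relu(x).
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)
```

The zero-weight case shows why the shortcut helps: a block that learns nothing still passes its input through instead of destroying it.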
  • 5. Stacked Attention Networks (a) Stacked Attention Network for Image QA. [Figure: the question "What are sitting in the basket on a bicycle?" is encoded by a CNN/LSTM into a query; a CNN encodes the image into feature vectors of different parts of the image; the query passes through attention layer 1 and attention layer 2, and a softmax gives the answer "dogs".] (b) Visualization of the learned multiple attention layers (original image, first attention layer, second attention layer). The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket and objects in the basket (dogs), in the first attention layer, and then further narrows down the focus in the second layer and finds the answer, dog. [Yang et al., 2015]
  • 6. Stacked Attention Networks [Yang et al., 2015] ■ Attentional Parameters • For linear combination of visual features • Shortcut for question vector ■ Representative Bottleneck • question weakly contributes to the joint only through coefficients p • which may cause a "bottleneck" [Figure: two-layer attention diagram with nodes v_Q, v_I, h_A, p_I, v_I' and u.] $\tilde{v}_I^k = \sum_i p_i^k v_i$, $u^k = \tilde{v}_I^k + u^{k-1}$
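The two equations on this slide can be sketched as follows. This is a hedged illustration, not the implementation of Yang et al.: the way h_A and p are computed here (a tanh of summed projections followed by a softmax) is an assumption based on the diagram, biases are omitted, and the weight names `w_i`, `w_q`, `w_p` and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_layer(v_regions, u_prev, w_i, w_q, w_p):
    """One attention layer: returns (u_k, p_k) given region features and u_{k-1}."""
    # h_A: joint hidden state, one row per image region (query is broadcast).
    h_a = np.tanh(v_regions @ w_i + u_prev @ w_q)
    # p: attention coefficients over regions.
    p = softmax(h_a @ w_p)
    # v_tilde = sum_i p_i v_i: linear combination of visual features.
    v_tilde = p @ v_regions
    # u_k = v_tilde + u_{k-1}: the query acts as a shortcut.
    return v_tilde + u_prev, p

rng = np.random.default_rng(0)
m, d, k = 5, 8, 6                      # regions, feature size, hidden size (assumed)
v_regions = rng.normal(size=(m, d))
v_q = rng.normal(size=d)
layers = [(rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k))
          for _ in range(2)]

u = v_q
for w_i, w_q, w_p in layers:           # two stacked attention layers
    u_before = u
    u, p = attention_layer(v_regions, u, w_i, w_q, w_p)
```

Note that the question only reaches the visual features through the scalar coefficients `p`, which is the bottleneck the slide points out.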
  • 7. Multimodal Learning for VQA ■ A Strong Baseline by Lu et al., 2015 • A simple method of element-wise multiplication after linear-tanh embeddings • Outperforms some recent works, DPPnet (Noh et al., 2015) and D-NMN (Andreas et al., 2016). [Figure: v_Q and v_I each pass through a tanh embedding, are multiplied element-wise (◉), and feed a softmax.]
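A minimal sketch of this baseline, assuming one linear layer per modality and a linear output layer before the softmax; the function name, weight names and dimensions below are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def baseline_answer_probs(v_q, v_i, w_q, w_i, w_out):
    """Linear-tanh embed each modality, multiply element-wise, classify."""
    joint = np.tanh(v_q @ w_q) * np.tanh(v_i @ w_i)   # element-wise multiplication
    return softmax(joint @ w_out)

rng = np.random.default_rng(0)
d_q, d_i, d_joint, n_answers = 10, 12, 8, 5           # assumed sizes
w_q = rng.normal(scale=0.1, size=(d_q, d_joint))
w_i = rng.normal(scale=0.1, size=(d_i, d_joint))
w_out = rng.normal(scale=0.1, size=(d_joint, n_answers))

probs = baseline_answer_probs(rng.normal(size=d_q), rng.normal(size=d_i),
                              w_q, w_i, w_out)
```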
  • 8. Multimodal Residual Networks ■ Residual Networks for Multimodal Inputs • A shortcut mapping of SAN (question vector) • Element-wise multiplication for the joint residual function [Figure: the question "What kind of animals are these ?" goes through word embedding and an RNN to give Q; the image goes through a CNN to give V; Q and V enter the Multimodal Residual Networks, which combine question shortcuts with element-wise multiplication, and a softmax outputs the answer "sheep". Components noted: word2vec (Mikolov et al., 2013), skip-thought vectors (Kiros et al., 2015), ResNet (He et al., 2016).]
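  • 9. Multimodal Residual Networks [Figure: three stacked learning blocks producing H1, H2 and H3 from Q and V. In each block the question path is Linear → Tanh, the visual path is Linear → Tanh → Linear → Tanh, the two are multiplied element-wise (⊙), and a Linear shortcut of the block input is added (⊕). A final Linear → Softmax produces the answer A.]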
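The three-block architecture on this slide can be sketched as below. This is an illustrative reading of the diagram under stated assumptions, not the authors' Torch code: biases and dropout are omitted, and `mrn_block`, `make_block`, the weight names and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mrn_block(h_prev, v, params):
    """One learning block: Linear shortcut plus the joint residual function."""
    q_mask = np.tanh(h_prev @ params["w_q"])                      # Linear -> Tanh
    v_emb = np.tanh(np.tanh(v @ params["w_1"]) @ params["w_2"])   # Linear -> Tanh, twice
    return h_prev @ params["w_short"] + q_mask * v_emb            # shortcut + (mask * V)

def make_block(rng, d_in, d_v, d_hid):
    scale = 0.1
    return {
        "w_q": rng.normal(scale=scale, size=(d_in, d_hid)),
        "w_short": rng.normal(scale=scale, size=(d_in, d_hid)),
        "w_1": rng.normal(scale=scale, size=(d_v, d_hid)),
        "w_2": rng.normal(scale=scale, size=(d_hid, d_hid)),
    }

def mrn_forward(q, v, blocks, w_out):
    h = q
    for params in blocks:          # H1, H2, H3 in the diagram
        h = mrn_block(h, v, params)
    return softmax(h @ w_out)      # final Linear -> Softmax

rng = np.random.default_rng(0)
d_q, d_v, d_hid, n_answers = 10, 12, 8, 5   # assumed sizes
blocks = [make_block(rng, d_q, d_v, d_hid)]
blocks += [make_block(rng, d_hid, d_v, d_hid) for _ in range(2)]
w_out = rng.normal(scale=0.1, size=(d_hid, n_answers))

probs = mrn_forward(rng.normal(size=d_q), rng.normal(size=d_v), blocks, w_out)
```

Unlike the attention coefficients of SAN, the question here contributes a full vector both through the mask and through the shortcut at every block.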
  • 10. Exploring Alternative Models [Figure: block diagrams of the alternative models (a)-(e), each combining Q and V into H_l through Linear and Tanh layers, an element-wise multiplication (⊙) and a shortcut addition (⊕). Shortcut annotations in the figure: "if l=1 else Identity" and "if l=1 Linear else none".]
    Table 1: The results of alternative models (a)-(e) on the test-dev (Open-Ended).
          All    Y/N    Num.   Other
    (a)   60.17  81.83  38.32  46.61
    (b)   60.53  82.53  38.34  46.78
    (c)   60.19  81.91  37.87  46.70
    (d)   59.69  81.67  37.23  46.00
    (e)   60.20  81.98  38.25  46.57
    Table 2: The effect of the visual features and # of target answers on the test-dev results (Open-Ended). Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.
              All    Y/N    Num.   Other
    Vgg, 1k   60.53  82.53  38.34  46.78
    Vgg, 2k   60.79  82.13  38.87  47.52
    Vgg, 3k   60.68  82.40  38.69  47.10
    Res, 1k   61.45  82.36  38.40  48.81
    Res, 2k   61.68  82.28  38.82  49.25
    Res, 3k   61.47  82.28  39.09  48.76
  • 11. Results on VQA test-standard
    Table 3: The VQA test-standard results. The accuracies reported in [30, 1] have one fewer digit of precision than the others, so they are zero-filled to match.
                    Open-Ended                   Multiple-Choice
                    All    Y/N    Num.   Other   All    Y/N    Num.   Other
    DPPnet [21]     57.36  80.28  36.92  42.24   62.69  80.35  38.79  52.79
    D-NMN [1]       58.00  -      -      -       -      -      -      -
    Deep Q+I [11]   58.16  80.56  36.53  43.73   63.09  80.59  37.70  53.64
    SAN [30]        58.90  -      -      -       -      -      -      -
    ACK [27]        59.44  81.07  37.12  45.83   -      -      -      -
    FDA [9]         59.54  81.34  35.67  46.10   64.18  81.25  38.30  55.20
    DMN+ [28]       60.36  80.43  36.82  48.33   -      -      -      -
    MRN             61.84  82.39  38.23  49.41   66.33  82.41  39.57  58.40
    Human [2]       83.30  95.77  83.39  72.67   -      -      -      -
    5.1 Visualization. In Equation 3, the left term $\sigma(W_q q)$ can be seen as a masking (attention) vector to select a part of visual information. We assume that the difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$ indicates an attention effect caused by the masking vector. Then, the attention effect $L_{att} = \frac{1}{2} \lVert V - F \rVert^2$ is visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$.
  • 12. Visualization ■ Attentive Effect • The attention effect is the difference between the right term $V = \sigma(W_2 \sigma(W_1 v))$ and the masked vector $F(q, v)$, caused by the masking vector $\sigma(W_q q)$. ■ Visualization of Input Gradient • The attention effect is then visualized on the image by calculating the gradient of $L_{att}$ with respect to a given image $I$, while treating $F$ as a constant.
    $L_{att} = \frac{1}{2} \lVert V - F \rVert^2$, $\frac{\partial L_{att}}{\partial I} = \frac{\partial V}{\partial I} (V - F)$
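The gradient on this slide can be checked numerically on a toy example. The sketch below is an assumption-laden simplification: it differentiates with respect to the visual feature vector v rather than the image I (going back to pixels would also require backpropagating through the CNN), it takes σ to be tanh, and the names and sizes are hypothetical.

```python
import numpy as np

def visual_embedding(v, w1, w2):
    a = np.tanh(w1 @ v)
    return a, np.tanh(w2 @ a)          # V = tanh(W2 tanh(W1 v))

def attention_gradient(v, mask, w1, w2):
    """Gradient of L_att = 0.5 * ||V - F||^2 w.r.t. v, with F held constant."""
    a, big_v = visual_embedding(v, w1, w2)
    f = mask * big_v                   # masked vector F; no gradient flows through it
    delta = big_v - f                  # (V - F)
    g2 = delta * (1.0 - big_v ** 2)    # back through the outer tanh
    g1 = (w2.T @ g2) * (1.0 - a ** 2)  # back through W2 and the inner tanh
    return w1.T @ g1, f                # back through W1

def att_loss(v, f, w1, w2):
    _, big_v = visual_embedding(v, w1, w2)
    return 0.5 * np.sum((big_v - f) ** 2)

rng = np.random.default_rng(0)
v = rng.normal(size=6)
w1 = rng.normal(scale=0.5, size=(5, 6))
w2 = rng.normal(scale=0.5, size=(4, 5))
mask = np.tanh(rng.normal(size=4))     # stands in for the masking vector

grad, f_const = attention_gradient(v, mask, w1, w2)

# Central finite differences, keeping F fixed at its original value.
eps = 1e-6
numeric = np.array([
    (att_loss(v + eps * e, f_const, w1, w2) - att_loss(v - eps * e, f_const, w1, w2))
    / (2 * eps)
    for e in np.eye(v.size)
])
```

Holding `f_const` fixed in the finite-difference check mirrors the "treating F as a constant" condition; the magnitude of `grad` per input dimension is what would be rendered as a heat map.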
  • 14. Visualization Examples (a) What kind of animals are these ? sheep (b) What animal is the picture ? elephant (c) What is this animal ? zebra (d) What game is this person playing ? tennis (e) How many cats are here ? 2 (f) What color is the bird ? yellow (g) What sport is this ? surfing (h) Is the horse jumping ? yes
  • 15. Visualization Examples (f) What color is the bird ? yellow (h) Is the horse jumping ? yes
  • 16. Acknowledgments This work was supported by Naver Corp. and partly by the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF, ADD-UD130070ID-BMRR).
  • 17. References
    • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. arXiv:1505.00468v1, 1–16.
    • Noh, H., Seo, P. H., & Han, B. (2015). Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. Retrieved from http://arxiv.org/abs/1511.05756
    • Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2015). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274. Retrieved from http://arxiv.org/abs/1511.02274
    • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved from http://arxiv.org/abs/1512.03385
    • Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks. Retrieved from http://arxiv.org/abs/1507.06228
    • Kim, J.-H., Kim, J., Ha, J.-W., & Zhang, B.-T. (2016). TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26, pp. 165–166).
    • Léonard, N., Waghmare, S., Wang, Y., & Kim, J.-H. (2015). rnn: Recurrent Library for Torch. arXiv:1511.07889. Retrieved from http://arxiv.org/abs/1511.07889
    • Xiong, C., Merity, S., & Socher, R. (2016). Dynamic Memory Networks for Visual and Textual Question Answering. Retrieved from http://arxiv.org/abs/1603.01417
    • Gal, Y. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287. Retrieved from http://arxiv.org/abs/1512.05287
    • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR.
    • Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-Thought Vectors. In Advances in Neural Information Processing Systems 28 (pp. 3294–3302).
  • 20. More Examples (a) Does the man have good posture ? no (b) Did he fall down ? yes (c) Are there two cats in the picture ? no (d) What color are the bears ? brown (e) What are many of the people carrying ? umbrellas (f) What color is the dog ? black (g) Are these animals tall ? yes (h) What animal is that ? sheep (i) Are all the cows the same color ? no (j) What is the reflection of in the mirror ? dog (k) What are the giraffe in the foreground doing ? eating (l) What animal is standing in the water other than birds ? bear
  • 21. Comparative Analysis (a1) What is the animal on the left ? giraffe (a2) Can you see trees ? yes (b1) What is the lady riding ? motorcycle (b2) Is she riding the motorcycle on the street ? no
  • 22. Failure Examples (a) What animals are these ? bears ducks (b) What are these animals ? cows goats (c) What animals are visible ? sheep horses (d) How many animals are depicted ? 2 1 (e) What flavor donut is this ? chocolate strawberry (f) What is the man doing ? playing tennis frisbee (g) What color are the giraffes eyelashes ? brown black (h) What food is the bear trying to eat ? banana papaya (i) What kind of animal is used to herd these animals ? sheep dog (j) What species of tree are in the background ? pine palm (k) Are there any birds on the photo ? no yes (l) Why is the hydrant smiling ? happy someone drew on it
  • 23. TrimZero in Torch rnn ■ MaskZero - a naive approach [Figure: a zero-padded batch of sentences ("what is ..", "is ..", "how was your ..") with its 0/1 mask is fed to the LM, which emits states s1, s2, s3, .. for the non-padded positions and 0 for the padded ones; annotation: "recovery at every step".] ■ TrimZero - training time reduced by 37.5% [Kim et al., 2016]
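The difference between the two approaches can be illustrated with a toy recurrent step in NumPy. This is a sketch of the general idea only, assuming that masking means "compute every row, then zero the padded ones" and trimming means "compute only the non-padded rows, then put the results back"; it is not the Torch rnn implementation, and `step_maskzero`, `step_trimzero` and `toy_step` are hypothetical names.

```python
import numpy as np

def step_maskzero(x, h, step_fn):
    """Compute every row, then zero the outputs of all-zero (padded) input rows."""
    keep = np.any(x != 0, axis=1)
    out = step_fn(x, h)
    out[~keep] = 0.0
    return out

def step_trimzero(x, h, step_fn):
    """Compute only the non-zero rows, then scatter results into a zero batch."""
    keep = np.any(x != 0, axis=1)
    out = np.zeros((x.shape[0], h.shape[1]))
    if keep.any():
        out[keep] = step_fn(x[keep], h[keep])   # smaller batch, less computation
    return out

rng = np.random.default_rng(0)
w_x = rng.normal(scale=0.1, size=(4, 3))
w_h = rng.normal(scale=0.1, size=(3, 3))

def toy_step(x, h):
    return np.tanh(x @ w_x + h @ w_h)           # stand-in for one RNN step

x = rng.normal(size=(5, 4))
x[[0, 3]] = 0.0                                  # two padded rows in this time step
h = rng.normal(size=(5, 3))

out_mask = step_maskzero(x, h, toy_step)
out_trim = step_trimzero(x, h, toy_step)
```

Both functions return the same result; the trimmed version simply runs the step on fewer rows, which is where a saving would come from when many positions are padding.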
  • 24. Note on TrimZero ■ GPU Computation • The efficiency of TrimZero is degraded for CUDA computing; • however, it mainly depends on the batch size, the rnn size and the number of zeros in the inputs. • Empirically, natural language sentences (mean length around 7 to 8, max length 26) with batch size = 200 and rnn size = 2400 (skip-thought vectors) gain a decent computational advantage (+37.5%) for CUDA computing. ■ Citation • Jin-Hwa Kim, Jeonghee Kim, Jung-Woo Ha and Byoung-Tak Zhang (2016). TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing. In Proceedings of KIIS Spring Conference (Vol. 26, pp. 165–166).
  • 25. BGRUs (approx. accuracy +.5%) PennTreeBank Benchmark: nn.GRU vs. nn.BGRU (dropout) [Gal, 2015]
    [Figure: perplexity (70 to 150) against epoch (1 to 251) for GRU-TRAIN, GRU-VALID, BGRU-TRAIN and BGRU-VALID.]
    Results on PennTreeBank
                  Standard GRU   Dropout GRU   Bayesian GRU
    Val perp.     130.75         102.78        99.47
    Test perp.    125.10         99.13         95.31
    Configurations
    Standard GRUs   Dropout GRUs   Bayesian GRUs
    LookupTable     LookupTable    LookupTable
    -               Dropout(.5)    -
    GRUs            GRUs           BGRUs(.25)
    -               Dropout(.5)    Dropout(.5)
    weight_decay = 1e-4 (listed for two of the three configurations)