Multimodal Residual Learning
for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA
Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
- 248,349 Training questions
(82,783 Images)
- 121,512 Validation questions
(40,504 Images)
- 244,302 Testing questions
(81,434 Images)
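The evaluation metric above can be sketched in a few lines. This assumes the standard VQA rule from Antol et al., where an answer gets full credit if at least 3 human annotators gave it: min(#matching humans / 3, 1). The function name `vqa_accuracy` and the sample annotations are illustrative, not from the slides.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the human answers for a question.

    Assumes the standard VQA rule: min(#humans who gave that answer / 3, 1).
    """
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)


# Illustrative (made-up) annotations for a single question.
humans = ["dogs"] * 6 + ["dog"] * 2 + ["puppies"] * 2
print(vqa_accuracy("dogs", humans))   # full credit: 6 humans agree
print(vqa_accuracy("dog", humans))    # partial credit: only 2 humans agree
print(vqa_accuracy("cats", humans))   # no credit: nobody gave this answer
```

Dividing by 3 rather than requiring a single ground truth is what makes the metric tolerant of disagreement between annotators. The official evaluation also normalizes answer strings before comparing, which this sketch omits.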
Stacked Attention Network
Motivation
- Answering a question requires
multi-step reasoning
- With {bicycles, window, street,
baskets, dogs} objects
- To answer the question well,
pinpoint the relevant region.
Q: what are sitting in the basket
on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- Extension of the attention mechanism, which has been
successfully applied in captioning, translation, etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model
• Extract image feature using
CNN
- Question Model
• Extract semantic vector
using CNN or LSTM
- Stacked Attention
• Multi-step reasoning
with attention layer
Stacked Attention
Multi-step reasoning
using attention layer
Image / Question Model
- Image Model
• Get feature map from
raw image pixels
• Rescale image to 448x448,
take features from pool5 of
VGGNet (14x14x512)
• Additional layer to match the
dimension of the question feature
- Question Model
• Extract the question's semantic vector using a CNN or LSTM
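The shape bookkeeping for the image model can be checked with a small NumPy sketch. Random numbers stand in for real VGGNet activations; the question-vector size `d`, the weight names `W_I` / `b_I`, and the tanh activation are assumptions for illustration, not values given in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pool5 output of VGGNet on a 448x448 image.
# VGGNet halves the spatial size five times: 448 / 2**5 = 14.
side = 448 // 2**5
feature_map = rng.standard_normal((side, side, 512))

# Flatten the grid into 196 region vectors of 512 dimensions each.
regions = feature_map.reshape(side * side, 512)

# Hypothetical projection so each region matches the question vector size d.
d = 1024  # assumed question-vector dimension, not given in the slides
W_I = rng.standard_normal((512, d)) * 0.01
b_I = np.zeros(d)
v_I = np.tanh(regions @ W_I + b_I)

print(regions.shape, v_I.shape)
```

Each of the 196 rows of `v_I` corresponds to one 32x32-pixel cell of the rescaled input, which is what lets the attention layers point at image regions.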
Stacked Attention Model
- A global image feature leads to
suboptimal results due to noise from
irrelevant objects / regions.
- Instead, use the stacked attention model (SAM)
to pinpoint the relevant regions
- Given the image feature matrix
and the question vector, compute a
14x14 attention distribution
- Get the weighted sum of image
vectors from each region.
- Add it to the question vector to get the
refined query vector
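The attention steps listed above can be sketched as follows. This is a minimal NumPy version assuming the single-hidden-layer form used in the SAN paper (tanh, then softmax over regions); all parameter names and the toy sizes `d` and `k` are placeholders, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_layer(v_I, u, W_IA, W_QA, b_A, w_P, b_P):
    """One attention step over image regions.

    v_I: (m, d) region vectors, u: (d,) current query vector.
    """
    # Combine every region with the query (the query is broadcast over regions).
    h_A = np.tanh(v_I @ W_IA + (u @ W_QA + b_A))   # (m, k)
    # One score per region, normalized into an attention distribution.
    p_I = softmax(h_A @ w_P + b_P)                 # (m,)
    # Weighted sum of region vectors, then add the query to refine it.
    v_att = p_I @ v_I                              # (d,)
    return v_att + u, p_I


rng = np.random.default_rng(0)
m, d, k = 196, 64, 32   # 14x14 regions; d and k are arbitrary toy sizes
v_I = rng.standard_normal((m, d))
v_Q = rng.standard_normal(d)

u = v_Q
for _ in range(2):      # two stacked layers, each with its own parameters
    params = (rng.standard_normal((d, k)) * 0.1,
              rng.standard_normal((d, k)) * 0.1,
              np.zeros(k),
              rng.standard_normal(k) * 0.1,
              0.0)
    u, p_I = attention_layer(v_I, u, *params)

print(p_I.reshape(14, 14).shape, round(float(p_I.sum()), 6))
```

Stacking works because the refined query from one layer becomes the query of the next, so the second layer can attend based on what the first layer already found.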
Result
Residual Learning
Problem of degradation
- More depth gives more accuracy, but gradients in deep networks can
vanish/explode
• BN, Xavier init, Dropout can handle this (~30 layers)
- Even deeper, the degradation problem occurs
• Not just overfitting: training error increases too
Residual Network (ResNet)
Residual Block
- To avoid the degradation
problem, add a shortcut
connection.
- Element-wise addition of
F(x) and the shortcut connection,
then pass through ReLU.
- Similar to LSTM
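The residual block described above computes y = ReLU(F(x) + x). The sketch below uses fully-connected layers in NumPy as a simplification; actual ResNets use convolutions with batch normalization, and the weight names here are placeholders.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), with F a small two-layer mapping."""
    f_x = relu(x @ W1) @ W2      # residual mapping F(x)
    return relu(f_x + x)         # element-wise addition with the shortcut


rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# With zero weights F(x) = 0, so the block reduces to ReLU(x):
# the shortcut lets the input pass through unchanged (up to the ReLU).
W_zero = np.zeros((8, 8))
y = residual_block(x, W_zero, W_zero)
print(np.allclose(y, relu(x)))
```

This is the intuition behind avoiding degradation: an extra block only has to learn F(x) = 0 to do no harm, instead of learning an identity mapping from scratch.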
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
Multimodal Residual
Network
Introduction
- Extends deep residual learning to visual QA
- Achieves state-of-the-art results on the visual QA
dataset (not today :( ).
- Introduces a method to visualize the spatial attention
effect of joint residual mappings
Background
SAN
- But question info contributes
weakly, which causes a bottleneck
Baseline [Lu et al.]
- With just element-wise multiplication,
visual and question features
embed very well.
MRN
- Shortcut mapping and
stacking architecture
- No weighted sum
- Instead uses global
multiplication, as [Lu et al.] does.
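An MRN learning block, as outlined above, can be sketched in NumPy: a shortcut on the question branch plus a joint residual formed by element-wise multiplication. The tanh activations, the two-layer visual embedding, and every weight name here are my reading of the MRN paper and should be treated as assumptions; the weights are random and the sizes are toy values.

```python
import numpy as np


def mrn_block(q, v, W_q, W_1, W_2, W_s):
    """One learning block: linear shortcut on q plus a joint residual."""
    # Joint residual: question and visual embeddings fused by
    # element-wise multiplication (no attention-weighted sum).
    joint = np.tanh(q @ W_q) * np.tanh(np.tanh(v @ W_1) @ W_2)
    # Shortcut with a linear mapping rather than a pure identity.
    return q @ W_s + joint


rng = np.random.default_rng(0)
d_q, d_v, d = 16, 2048, 16   # toy sizes; 2048 matches ResNet-152 features
q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)   # a single global visual feature vector

h = q
for _ in range(3):   # L = 3 stacked learning blocks
    W_q = rng.standard_normal((h.size, d)) * 0.1
    W_1 = rng.standard_normal((d_v, d)) * 0.01
    W_2 = rng.standard_normal((d, d)) * 0.1
    W_s = rng.standard_normal((h.size, d)) * 0.1
    h = mrn_block(h, v, W_q, W_1, W_2, W_s)

print(h.shape)
```

Unlike the SAN sketch, the same global visual vector `v` is fed to every block and no softmax over regions is computed; the question representation is what gets refined from block to block.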
Quantitative Analysis
- (a) shows a large improvement
over SAN; (b) is better.
- (c) adding an extra embedding for the
question causes overfitting.
- (d) an identity shortcut causes
degradation (an extra linear
mapping is needed).
- (e) performs reasonably, but the
extra shortcut is not essential.
Quantitative Analysis
# of Learning blocks
- 58.85% (L=1), 59.44% (L=2),
60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly
better than VGGNet
- Even though ResNet has a smaller
feature dim (2048 vs 4096).
# of Answer Classes
- Trade-off among
answer types, but 2k is best
- Implicit attention with multiplication
- Get high-resolution attention map
Reference
- Yang, Zichao, et al. "Stacked attention networks for image question
answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv
preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of
the IEEE International Conference on Computer Vision. 2015.

Design and Analysis of Algorithms Lecture Notes
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunication
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
Versatile Engineering Construction Firms
Versatile Engineering Construction FirmsVersatile Engineering Construction Firms
Versatile Engineering Construction Firms
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
Overview of IS 16700:2023 (by priyansh verma)
Overview of IS 16700:2023 (by priyansh verma)Overview of IS 16700:2023 (by priyansh verma)
Overview of IS 16700:2023 (by priyansh verma)
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Substation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRHSubstation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRH
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 

Multimodal Residual Learning for Visual QA

  • 1. Multimodal Residual Learning for Visual QA NamHyuk Ahn
  • 2. Table of Contents 1. Visual QA 2. Stacked Attention Network (SAN) 3. Residual Learning 4. Multimodal Residual Network (MRN)
  • 3. Visual QA Evaluation Metric - Robust to inter-human variability - Human accuracy is almost 90% - 248,349 training questions (82,783 images) - 121,512 validation questions (40,504 images) - 244,302 testing questions (81,434 images)
  • 5. Motivation - Answering a question requires multi-step reasoning - The image contains {bicycles, window, street, baskets, dogs} objects - To answer the question well, pinpoint the relevant region. Q: what are sitting in the basket on a bicycle
  • 6. Stacked Attention Network (SAN) - SAN allows multi-step reasoning for visual QA - Extension of the attention mechanism, which has been successfully applied in captioning, translation, etc. Q: what are sitting in the basket on a bicycle
  • 7. Stacked Attention Network - Image Model • Extract image features using a CNN - Question Model • Extract a semantic vector using a CNN or LSTM - Stacked Attention • Multi-step reasoning with attention layers
  • 8. Image / Question Model - Image Model • Get a feature map from the raw image pixels • Rescale the image to 448x448, take features from pool5 of VGGNet (14x14x512) • Additional layer to match the question feature - Question Model •
  • 9. Stacked Attention Model - A global image feature leads to suboptimal results due to noise from irrelevant objects / regions. - Instead, use the stacked attention model (SAM) to pinpoint the relevant region. - Given the image feature matrix and the question vector, compute a 14x14 attention distribution. - Get the weighted sum of the image vectors from each region. - Refined query vector
  • 13. Problem of degradation - Greater depth gives better accuracy, but gradients in deep networks can vanish or explode • BN, Xavier init, and Dropout can handle this (~30 layers) - Going even deeper, the degradation problem occurs • Not just overfitting: training error also increases
  • 14. Residual Network (ResNet) Residual Block - To avoid the degradation problem, add a shortcut connection. - Element-wise addition of F(x) and the shortcut connection, then pass through ReLU. - Similar to LSTM http://torch.ch/blog/2016/02/04/resnets.html
  • 16. Introduction - Extends deep residual learning to visual QA - Achieved state-of-the-art results on the visual QA dataset (not today :( ) - Introduces a method to visualize the spatial attention effect of joint residual mappings
  • 17. Background SAN - Question information contributes only weakly, which causes a bottleneck Baseline [Lu et al.] - With just element-wise multiplication, visual and question features embed very well. MRN - Shortcut mapping and stacking architecture - No weighted sum - Instead uses global multiplication, as [Lu et al.] does.
  • 20. Quantitative Analysis - (a) shows a large improvement over SAN; (b) is better. - (c) adding an extra embedding to the question causes overfitting. - (d) an identity shortcut causes degradation (an extra linear mapping is needed). - (e) performs reasonably, but the extra shortcut is not essential.
  • 21. Quantitative Analysis # of Learning Blocks - 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4) Visual Features - ResNet-152 is significantly better than VGGNet, even though ResNet has a smaller feature dimension (2048 vs 4096). # of Answer Classes - Trade-off among answer types, but 2k is best
  • 22. - Implicit attention with multiplication - Get high-resolution attention map
  • 24. Reference - Yang, Zichao, et al. "Stacked Attention Networks for Image Question Answering." arXiv preprint arXiv:1511.02274 (2015). - Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016). - Antol, Stanislaw, et al. "VQA: Visual Question Answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
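
Slide 3 says the evaluation metric is robust to inter-human variability but does not show the formula. A minimal sketch, assuming the metric from the VQA paper (an answer counts as fully correct when at least 3 human annotators gave it). The function name `vqa_accuracy` and the sample answers are made up for illustration, and the answer normalization used by the official evaluation is left out.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the human answers for a question.

    Assumed form: min(number of humans who gave that answer / 3, 1).
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations: ten humans answering the same question.
humans = ["dogs"] * 7 + ["dog"] * 2 + ["puppies"]

print(vqa_accuracy("dogs", humans))   # majority answer, full credit
print(vqa_accuracy("dog", humans))    # minority answer, partial credit
print(vqa_accuracy("cats", humans))   # nobody said this, no credit
```

Partial credit for minority answers is what makes the score tolerant of disagreement among annotators.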
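
The attention step on slide 9 (attention distribution over 14x14 regions, weighted sum, refined query) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the score uses a single tanh hidden layer as in the SAN paper, biases are omitted, the sizes are tiny, weights are random, and the same weights are reused for both attention layers for brevity (the paper uses separate parameters per layer). All names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(regions, query, W_img, W_qry, w_score):
    """One attention step: returns (attention weights, refined query)."""
    q_proj = matvec(W_qry, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_img, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_score, hidden)))
    weights = softmax(scores)  # one probability per image region
    # Weighted sum of the region vectors.
    attended = [sum(p * v[j] for p, v in zip(weights, regions))
                for j in range(len(query))]
    # Refined query: attended image vector plus the previous query.
    refined = [a + q for a, q in zip(attended, query)]
    return weights, refined


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, m = 8, 4, 14 * 14  # feature dim, hidden dim, number of regions
regions = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
query = [rng.uniform(-1, 1) for _ in range(d)]
W_img, W_qry = random_matrix(k, d, rng), random_matrix(k, d, rng)
w_score = [rng.uniform(-0.1, 0.1) for _ in range(k)]

u = query
for _ in range(2):  # two stacked attention layers
    weights, u = attention_layer(regions, u, W_img, W_qry, w_score)
```

Each pass feeds the refined query back in, which is the "multi-step reasoning" the slides refer to.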
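
The residual block on slide 14 computes ReLU(F(x) + x). A minimal sketch, assuming F is two small fully connected layers rather than the convolution and batch-norm layers an actual ResNet uses. The names `residual_block`, `W1`, and `W2` are illustrative.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, xi) for xi in x]


def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), with F(x) = W2 * ReLU(W1 * x)."""
    fx = matvec(W2, relu(matvec(W1, x)))
    # Shortcut connection: element-wise addition of F(x) and the input.
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block reduces to ReLU(x).
print(residual_block(x, zeros, zeros))  # [1.0, 0.0, 3.0]
```

The zero-weight case shows why the shortcut helps: a block that learns nothing still passes its input through, so adding blocks need not make training error worse.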
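
Slides 17 and 20 describe the MRN learning block as element-wise multiplication of question and visual features, plus a shortcut that needs a linear mapping. A rough sketch, assuming the formulation in the MRN paper: F(q, v) = tanh(W_q q) ⊙ tanh(W_2 tanh(W_1 v)) and H(q, v) = W_short q + F(q, v). Sizes and random weights are made up, and the question and hidden dimensions are set equal only to keep the example short.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def tanh_vec(x):
    return [math.tanh(xi) for xi in x]


def mrn_block(q, v, params):
    """One learning block: linear shortcut on q plus the joint residual."""
    W_q, W_1, W_2, W_short = params
    q_branch = tanh_vec(matvec(W_q, q))
    v_branch = tanh_vec(matvec(W_2, tanh_vec(matvec(W_1, v))))
    # Joint residual: element-wise multiplication of the two branches.
    joint = [a * b for a, b in zip(q_branch, v_branch)]
    shortcut = matvec(W_short, q)  # linear mapping, not an identity
    return [s + j for s, j in zip(shortcut, joint)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_h, d_v = 6, 10  # hidden/question dim and visual feature dim (toy sizes)
q = [rng.uniform(-1, 1) for _ in range(d_h)]
v = [rng.uniform(-1, 1) for _ in range(d_v)]

h = q
for _ in range(3):  # L = 3 blocks, the best setting reported on slide 21
    params = (random_matrix(d_h, d_h, rng), random_matrix(d_h, d_v, rng),
              random_matrix(d_h, d_h, rng), random_matrix(d_h, d_h, rng))
    h = mrn_block(h, v, params)
```

Unlike the stacked attention layer, there is no softmax or weighted sum over regions here: the visual input is one global feature vector and the interaction is purely multiplicative.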