Image and Video Processing
Team Members:
Parth Vinay Maheshwari - IIT2021202
Ritesh Kumar Gupta - IIT2021135
Sneha Choudhary - IIT2021132
Pankti Salvi - IIT2021134
Gaurav Singh - IIT2021120
Tushar Kumar - IIT2021203
Aditya Ranjan - IIT2021194
Visual Question Answering using Explicit Visual Attention
Table of Contents
● What is VQA?
● Requirements
● Explicit Visual Attention
● Related Research Works
● Architecture of Model
● Training of Model
● Advantages and Limitations
● Application and Scope
What is Visual Question Answering?
Visual Question Answering (VQA) is a research area at the intersection of computer vision
and natural language processing. It involves developing algorithms and models that can
understand and answer questions about visual content, such as images or videos
(Vinyals et al., 2015).
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit
Visual Attention," Dept. of Informatics, Aristotle University of Thessaloniki,
Thessaloniki, 54124, Greece.
INTERNAL WORKING
[Diagram: Computer Vision + Natural Language Processing]
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention Networks for Visual Question
Answering," College of Information Engineering, Shanghai, 2020.
REQUIREMENTS
Implementing a Visual Question Answering (VQA) project requires a combination
of technical skills, tools, and resources. Here's a list of requirements you'll need to
consider:
1. Programming Languages and Libraries:
Python: The primary language for implementing machine learning and deep
learning models.
Deep Learning Frameworks: TensorFlow, PyTorch, or Keras for building and
training neural networks.
Libraries for Image Processing: OpenCV, PIL (Python Imaging Library) for image
preprocessing and manipulation.
2. Hardware and Software:
A computer with sufficient computational power (e.g., a GPU) to speed up training.
Development environment with Python and necessary libraries installed.
3. Dataset:
Gather or obtain a VQA dataset that includes images, questions, and
corresponding answers. We have used the COCO-QA dataset (Ren et al., 2015a, b, c).
4. Pretrained Models and Embeddings:
Pretrained image classification models (e.g., ResNet, VGG, etc.) for feature
extraction. Pre-trained word embeddings (e.g., GloVe, Word2Vec) for representing
words in questions.
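To make the requirements above concrete, here is a minimal sketch (assuming PyTorch and torchvision; the image path, vocabulary, and random embedding matrix are placeholders standing in for real GloVe/Word2Vec vectors) of extracting image features with a pretrained ResNet and embedding a tokenized question.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained CNN backbone for visual features; the classifier head is dropped
# so a 7x7 grid of region features is kept instead of class scores.
# (torchvision >= 0.13 API; older versions use pretrained=True.)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")          # placeholder image path
with torch.no_grad():
    img_feats = backbone(preprocess(image).unsqueeze(0))  # shape (1, 2048, 7, 7)

# Toy word-embedding lookup; a real setup would load GloVe/Word2Vec vectors
# and build the vocabulary from the question corpus.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "apple": 4}
embeddings = torch.randn(len(vocab), 300)                 # stand-in for 300-d GloVe
question = "what color is the apple".split()
question_vecs = embeddings[torch.tensor([vocab[w] for w in question])]
print(img_feats.shape, question_vecs.shape)               # (1, 2048, 7, 7), (5, 300)
```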
Technique for achieving VQA: Explicit Visual Attention
Explicit Visual Attention
Explicit Visual Attention in the context of Visual Question Answering (VQA) is a
concept that enhances the accuracy and interpretability of VQA models by
explicitly guiding the model's focus to specific regions of an image that are
relevant to the question being asked.
[1] S. Manmadhan and B. C. Kovoor, "Visual question answering: a state-of-the-art review," IEEE, 8 April 2020.
Explicit Visual Attention: Working
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
How does Explicit Visual Attention work?
● Identifying relevant parts of an image for a question.
● Generating an attention map highlighting those regions.
● Combining the map with the question's context.
● Enhancing the model's focus on crucial visual details.
● Leading to accurate and contextually aware answers in Visual
Question Answering.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124,
Greece.
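A toy sketch of this workflow, using random placeholder features and simple dot-product scoring (one common choice, not necessarily the scoring function used in the cited paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
regions = torch.randn(5, 64)    # features of 5 image regions (placeholder)
question = torch.randn(64)      # encoded question vector (placeholder)

scores = regions @ question               # relevance of each region to the question
attention_map = F.softmax(scores, dim=0)  # normalized attention map over regions
attended = attention_map @ regions        # question-guided summary of the image

print(attention_map)            # which regions the model "looks at"
print(attended.shape)           # fused visual context passed on to answer prediction
```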
Significance of Using Explicit Visual Attention in VQA
Beyond Soft Attention:
Explicit Visual Attention takes attention mechanisms a step further. Unlike soft
attention, which distributes focus diffusely across the whole image, explicit attention
directly guides the model's gaze to specific image regions. This precise focus enhances
the model's understanding of image-question relationships, leading to more accurate and
contextually relevant answers.
Interpretable Focus:
Explicit Visual Attention's localized focus provides transparency into how the
model processes images and questions.
[1] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, "Explicit Knowledge-based
Reasoning for Visual Question Answering," School of Computer Science, The University of
Adelaide, November 13, 2015.
Addressing Ambiguity with Clarity
VQA questions can be ambiguous. Explicit Visual Attention helps the model
address ambiguity by attending to the most relevant part of the image, clarifying
the context and intent of the question. This ensures more accurate answers, even
when the question phrasing might lead to multiple interpretations.
Capturing Fine Details
Images often contain intricate details critical for answering specific questions.
Explicit Visual Attention ensures the model doesn't miss these details, leading to
more precise answers.
[1] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, "Explicit Knowledge-
based Reasoning for Visual Question Answering," School of Computer Science, The
University of Adelaide, November 13, 2015.
Existing Projects and Research Works
An explicitly trained attention model inspired by the theory of the
pictorial superiority effect. The model uses attention-oriented
word embeddings that increase the efficiency of learning common
representation spaces.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question
Answering using Explicit Visual Attention," IEEE, 2018.
A novel model called Multi-modal Explicit Sparse Attention
Networks (MESAN), which concentrates the model’s attention by
explicitly selecting the parts of the input features that are the most
relevant to answering the input question.
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention
Networks for Visual Question Answering," College of
Information Engineering, Shanghai, 2020.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right Image
Regions: A Visual Attention Regularization Approach," Shandong University and China
University of Petroleum (East China), China, IEEE.
How is explicit visual attention different from other attention mechanisms?
Explicit Visual Attention is a specific approach within the broader field of attention
mechanisms used in tasks like Visual Question Answering (VQA). It differentiates
itself from other attention mechanisms through its focus on providing a more
controlled and interpretable way of directing a model's attention to specific regions
of an image.
Soft Attention: It distributes attention across the entire image, allowing the model
to consider all regions simultaneously. The model assigns varying degrees of
attention to different parts of the image, but there's no strict focus on specific
regions.
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention Networks for Visual Question
Answering," College of Information Engineering, Shanghai, 2020.
Hard Attention:
It forces the model to focus on a single region of the image. Only one region is
chosen for processing, leading to limitations in handling complex scenes or
capturing multiple relevant details. It can be computationally expensive and may
not always be practical for complex tasks.
Implicit Visual Attention:
It refers to attention mechanisms where the model's focus on image regions is
learned implicitly as part of the model's internal processes. While it may capture
relevant regions, the exact regions attended to might not be transparent or easily
interpretable.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124,
Greece.
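The contrast between soft and hard attention can be seen in a few lines; the scores below are toy values, and the argmax selection merely illustrates hard attention (being non-differentiable, it is usually trained with sampling-based estimators in practice):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0.2, 1.5, -0.3, 0.9])    # toy relevance scores for 4 regions

# Soft attention: every region keeps a non-zero weight.
soft_weights = F.softmax(scores, dim=0)

# Hard attention: exactly one region is selected (argmax shown only for
# illustration; REINFORCE-style sampling is the usual training strategy).
hard_weights = F.one_hot(scores.argmax(), num_classes=scores.numel()).float()

print(soft_weights)   # distributed focus over all regions
print(hard_weights)   # all focus on the single highest-scoring region
```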
Architecture of this model:
[1] Z. Lei, G. Zhang, L. Wu, K. Zhang, and R. Liang, "A Multi-level Mesh Mutual Attention Model
for Visual Question Answering," IEEE, 2022.
Input Layer: The input consists of an image and a question. The image is fed into
a CNN, and the question is tokenized and processed by an RNN.
CNN (Convolutional Neural Network): The CNN processes the image and
extracts high-level visual features. These features capture the image's content,
including objects, patterns, and structures.
RNN (Recurrent Neural Network): The RNN processes the question tokens
sequentially and generates a question representation. The question representation
captures the semantic context of the question.
Attention Mechanism: The attention mechanism combines the image features
from the CNN and the question features from the RNN. It computes attention
scores that determine the importance of different image regions for answering the
question.
Attention Map Generation:
The attention scores are used to generate an attention map that highlights
relevant regions of the image. The attention map guides the model's focus on
specific image regions.
Enhanced Image Features:
The attention map is combined with the original image features, creating
enhanced image features. These enhanced features emphasize the regions
of the image that are important for the given question.
Joint Representation:
The enhanced image features and the question representation from the RNN
are combined. This creates a joint representation that incorporates both visual
and textual information, with explicit attention.
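A minimal sketch of these stages, from attention scores to the joint representation; the feature dimensions and the additive (tanh) scoring form are illustrative assumptions rather than the exact design of the cited models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitAttentionFusion(nn.Module):
    """Attention map -> enhanced image features -> joint representation."""

    def __init__(self, img_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (B, R, img_dim) region features from the CNN
        # q_feat:    (B, q_dim)      question representation from the RNN
        joint = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_feat).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=1)   # attention map (B, R)
        enhanced = (attn.unsqueeze(-1) * img_feats).sum(dim=1)   # enhanced image features
        fused = torch.cat([enhanced, q_feat], dim=-1)            # joint representation
        return fused, attn

# Usage with random placeholder tensors:
model = ExplicitAttentionFusion()
img = torch.randn(2, 49, 2048)      # 2 images, 7x7 = 49 regions
q = torch.randn(2, 512)
fused, attn_map = model(img, q)
print(fused.shape, attn_map.shape)  # (2, 2560), (2, 49)
```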
Answer Prediction:
The joint representation is used to predict the answer to the question. The
model leverages the enhanced features and contextually focused attention for
more accurate responses.
Attention Visualization:
Optionally, the attention map can be visualized to show which regions of the
image were attended to during answer prediction. This helps users
understand the model's reasoning process.
Training and Optimization:
The model is trained using labeled VQA data, which includes questions,
images, and answers. During training, the model learns to generate accurate
attention maps and make correct predictions.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right
Image Regions: A Visual Attention Regularization Approach," Shandong University and
China University of Petroleum (East China), China, IEEE.
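A hedged sketch of answer prediction and one training step over such a joint representation (assuming the 2560-dimensional output of the earlier fusion sketch); the answer-vocabulary size, classifier shape, and optimizer settings are illustrative placeholders:

```python
import torch
import torch.nn as nn

num_answers = 1000   # assumed answer-vocabulary size (dataset dependent)
classifier = nn.Sequential(
    nn.Linear(2560, 1024),   # 2560 = 2048 image dims + 512 question dims
    nn.ReLU(),
    nn.Linear(1024, num_answers),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

fused = torch.randn(8, 2560)                   # joint representations for a mini-batch
answers = torch.randint(0, num_answers, (8,))  # ground-truth answer indices

logits = classifier(fused)        # answer prediction over the answer vocabulary
loss = criterion(logits, answers)
optimizer.zero_grad()
loss.backward()
optimizer.step()                  # one optimization step on labeled VQA data
print(float(loss))
```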
Training of this model:
Data Preparation:
Gather a dataset of images, corresponding questions, and ground truth answers.
Preprocess the images by resizing, normalizing pixel values, and extracting features using a
pre-trained CNN. Tokenize and encode the questions using word embeddings and an RNN.
Attention Mechanism Setup:
Choose an attention mechanism architecture that combines image and question features to
generate attention scores. Design the attention mechanism to generate attention maps that
highlight important image regions.
Loss Function:
The loss function measures the difference between predicted answers and ground truth
answers. We will choose a loss function that suits the problem. Common loss functions for
VQA include:
Cross-Entropy Loss: Suitable for single-label classification tasks where each question has a
single correct answer.
Ranking Loss: Used when multiple correct answers are available for a question. The model
is trained to rank the correct answer higher than incorrect ones.
Attention Mechanism Loss: If using attention, consider incorporating attention-related loss
terms that encourage the model to focus on the right image regions.
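One way these pieces might be combined, sketched below: cross-entropy on the answer plus a KL-based attention term that encourages the predicted map to match attention targets (e.g., derived from bounding boxes); the weighting factor is an assumed hyperparameter, not a value from the cited work:

```python
import torch
import torch.nn.functional as F

def vqa_loss(logits, answer_ids, pred_attn, target_attn, attn_weight=0.5):
    # Standard answer classification loss.
    ce = F.cross_entropy(logits, answer_ids)
    # KL divergence between normalized attention distributions over regions,
    # pushing the predicted map towards the target (e.g., box-derived) map.
    attn_term = F.kl_div(pred_attn.clamp_min(1e-8).log(), target_attn,
                         reduction="batchmean")
    return ce + attn_weight * attn_term

logits = torch.randn(4, 1000)                            # predicted answer scores
answers = torch.randint(0, 1000, (4,))                   # ground-truth answers
pred_attn = torch.softmax(torch.randn(4, 49), dim=1)     # model's attention maps
target_attn = torch.softmax(torch.randn(4, 49), dim=1)   # stand-in attention targets
print(vqa_loss(logits, answers, pred_attn, target_attn))
```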
Fine-Tuning and Transfer Learning:
Leverage pre-trained models to bootstrap our VQA model's performance.
Transfer Learning: Start with pre-trained models for both vision (CNNs) and language
(BERT, GPT) to provide a strong initial representation for our model.
Fine-Tuning: Fine-tune the pre-trained models on our VQA dataset to adapt them to the
specific task.
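A short sketch of this freeze-then-fine-tune pattern with a torchvision ResNet; which layers to unfreeze and the learning rate are illustrative choices:

```python
import torch
import torchvision.models as models

# Transfer learning: start from a pretrained backbone, freeze its weights,
# and fine-tune only the later layers plus the VQA-specific heads.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

for param in resnet.parameters():
    param.requires_grad = False          # freeze everything first

for param in resnet.layer4.parameters():
    param.requires_grad = True           # unfreeze the last block for fine-tuning

trainable = [p for p in resnet.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)   # smaller LR for fine-tuning
print(sum(p.numel() for p in trainable), "trainable parameters")
```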
Validation and Evaluation:
Regularly evaluate the model's performance on a validation set to monitor its progress.
Validation Metrics: Use accuracy, top-k accuracy, or other relevant metrics to track
performance.
Hyperparameter Tuning: Adjust hyperparameters based on validation performance.
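A small helper for top-k accuracy on a validation batch (toy random tensors stand in for real model outputs):

```python
import torch

def topk_accuracy(logits, targets, k=5):
    """Fraction of questions whose ground-truth answer is in the top-k predictions."""
    topk = logits.topk(k, dim=1).indices             # (B, k) predicted answer ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(16, 1000)                       # validation-batch predictions (toy)
targets = torch.randint(0, 1000, (16,))
print("top-1:", topk_accuracy(logits, targets, k=1))
print("top-5:", topk_accuracy(logits, targets, k=5))
```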
Comparison of results
[1] S. Manmadhan and B. C. Kovoor, "Visual
question answering: a state-of-the-art
review," IEEE, 2020.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual
Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle
University of Thessaloniki, Thessaloniki, 54124,
Greece.
Graphical Interpretation
[1] V. Lioutas, N. Passalis, and A. Tefas, "Explicit ensemble attention learning for improving visual
question answering," Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki
54124, Greece.
Advantages of this model:
Accurate Answers: By combining visual information from CNNs with semantic context from
RNNs, the architecture enables the model to generate more accurate answers. The explicit
attention mechanism further enhances accuracy by focusing on relevant image
regions (Lopes et al., 2017).
Contextual Understanding: The joint representation created by merging image features
and question features captures the contextual relationship between visual and textual
elements, facilitating a deeper understanding of the question (Vinyals et al., 2015).
Handling Complex Scenes: CNNs excel at capturing visual details in images, enabling the
model to handle complex scenes with multiple objects, textures, and patterns, enhancing the
model's ability to provide meaningful answers (Antol et al., 2015).
Limitations of this model:
Computational Complexity: Combining CNNs and RNNs can lead to higher computational
demands during both training and inference. This might result in longer training times and
slower predictions, especially for real-time applications (Szegedy et al., 2015).
Data Requirements: Training an architecture with both CNNs and RNNs requires labeled
data that includes images, questions, and answers. Gathering such data can be
time-consuming and expensive (Antol et al., 2015).
Hyperparameter Tuning: The architecture involves multiple components with their own set
of hyperparameters (e.g., learning rates, sequence lengths). Fine-tuning these parameters
for optimal performance can be challenging and time-intensive (Xu et al., 2015).
Overfitting: The combination of complex components might increase the risk of overfitting,
where the model performs well on the training data but struggles to generalize to new,
unseen examples (Malinowski et al., 2015).
APPLICATIONS
Visual Question Answering (VQA) has several potential applications across
various domains. Here are some ways in which VQA can be used:
1) Image Captioning and Enhancement: VQA can be used to automatically
generate descriptive captions for images or videos, enhancing their accessibility
and understandability. This is particularly useful for visually impaired individuals
who can benefit from textual descriptions of visual content (Vinyals et al., 2015).
2) Interactive AI Systems: VQA can be integrated into interactive AI systems,
such as virtual assistants or chatbots, to allow users to ask questions about
images or scenes. For example, a user could ask a virtual shopping assistant
about the details of a product shown in an image (Antol et al., 2015).
3) Visual Content Retrieval: VQA can assist in retrieving specific visual content
from large databases based on textual queries. This is helpful for searching
through image or video collections, enabling more intuitive and specific
searches (Gordo et al., 2017).
4) Education and Learning: VQA can be used in educational settings to create
interactive learning materials. Students can ask questions about images or
diagrams to deepen their understanding of the subject matter (Malinowski et al., 2015).
5) Medical Imaging: In the medical field, VQA can aid doctors in understanding
medical images like X-rays. By asking questions about specific aspects of the
images, clinicians can receive insights to support diagnosis decisions (Lopes et al., 2017).
6) Safety and Surveillance: VQA can be used in surveillance systems to answer
questions about monitored scenes, helping operators identify relevant objects or events.
SCOPE
1. Improved Performance: Explicit visual attention has been shown to improve
the performance of models in tasks that involve understanding images in the
context of natural language questions (Anderson et al., 2018).
2. Interpretability: Attention maps generated by explicit visual attention provide
insights into the reasoning process of the model. Users and researchers can
visualize these maps to understand which image regions contribute most to the
answer, making the decision-making process more transparent and
interpretable (Xu et al., 2015).
3. Fine-Grained Understanding: Explicit attention mechanisms enable models to
pay selective attention to specific objects, regions, or details within an image,
leading to a deeper understanding of the visual content (Xu et al., 2015).
4. Robustness to Clutter: In images with multiple objects or complex scenes,
explicit visual attention can help the model focus on the relevant objects or regions
while ignoring distractions or irrelevant parts of the image (Ba et al., 2014).
5. Adaptability to Different Questions: Attention mechanisms can adapt to the
nature of the question being asked. For example, for questions like "What color is
the apple?" the attention can be directed to the region containing the apple (Luong
et al., 2015).
6. Localization in Images: Explicit visual attention can be extended to tasks like
image localization, where the model is required to identify and highlight specific
objects or regions within an image (Xu et al., 2015).
7. Personalized User Interfaces: In applications involving human-computer
interaction, attention maps can be used to create user interfaces that highlight the
areas of an image the model is focusing on (Chen et al., 2018).
CURRENT STATE OF THE ART
There are several interesting directions for future work. More deliberate
techniques could be employed for combining multiple attention models, e.g., the
AdaBoost technique. Furthermore, in the proposed method the attention model
was not trained when a question does not contain ground-truth bounding boxes.
Exploiting the information contained in these image-question pairs, in a way
similar to implicit attention, can lead to a hybrid implicit-explicit attention model
that further improves visual question answering accuracy.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
Advanced pooling techniques, such as BoF pooling, can also be used to improve
the scale invariance of the attention model and provide more reliable attention
information. The proposed methodology can also be applied to other tasks that
require high-level visual understanding, such as image caption generation and
video caption generation. Finally, the proposed approach could also be used to
improve the precision of multi-modal information retrieval, where providing
accurate visual attention information given a textual query from the user is of
critical significance.
These models integrate attention mechanisms and have led to competitive results.
Pretrained language models like BERT have been fine-tuned for VQA. By
leveraging large amounts of textual data, these models can capture contextual
relationships between question and answer elements.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right Image
Regions: A Visual Attention Regularization Approach," Shandong University and China
University of Petroleum (East China), China, IEEE.
CONCLUSION
Visual Question Answering is an interesting combination of Computer Vision
and Natural Language Processing which is growing by utilizing the power of
deep learning methods. It is a very challenging task that requires solving many
subtasks such as object detection, activity recognition, reasoning about spatial
relationships between objects, and commonsense reasoning.
In conclusion, our implementation of explicit visual attention in our VQA model,
which employs CNNs for computer vision and RNNs for text processing, has
yielded substantial improvements in terms of both accuracy and loss.
The main obstacle on the path of VQA models towards the broader goal of AI is
that it is not clear what the source of improvement is, and to what extent the
model has understood the visual-language concepts.
[1] S. Manmadhan and B. C. Kovoor, "Visual question answering: a state-of-the-art
review," IEEE, 8 April 2020.
REFERENCES
Authors: Vasileios Lioutas, Nikolaos Passalis, Anastasios Tefas
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8351158
Authors: Yibing Liu, Yangyang Guo, Jianhua Yin, and Xuemeng Song
https://arxiv.org/pdf/2102.01916.pdf
Authors: Zhi Lei, Guixian Zhang, Lijuan Wu, Kui Zhang, Rongjiao Liang
https://link.springer.com/article/10.1007/s41019-022-00200-9
Authors: Zihan Guo and Dezhi Han
https://www.mdpi.com/1424-8220/20/23/6758
Thank You !!