Image and Video Processing
Team Members:
Parth Vinay Maheshwari - IIT2021202
Ritesh Kumar Gupta - IIT2021135
Sneha Choudhary - IIT2021132
Pankti Salvi - IIT2021134
Gaurav Singh - IIT2021120
Tushar Kumar - IIT2021203
Aditya Ranjan - IIT2021194
Visual Question Answering using Explicit Visual Attention
Table of Contents
● What is VQA?
● Requirements
● Explicit Visual Attention
● Related Research Works
● Architecture of Model
● Training of Model
● Advantages and Limitations
● Application and Scope
What is Visual Question Answering?
Visual Question Answering (VQA) is a research area at the intersection of computer vision
and natural language processing. It involves developing algorithms and models that can
understand and answer questions about visual content, such as images or videos
(Vinyals et al., 2015).
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit
Visual Attention," Dept. of Informatics, Aristotle University of Thessaloniki,
Thessaloniki, 54124, Greece.
INTERNAL WORKING
[Diagram: Computer Vision + Natural Language Processing]
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention Networks for Visual Question
Answering," College of Information Engineering, Shanghai, 2020.
REQUIREMENTS
Implementing a Visual Question Answering (VQA) project requires a combination
of technical skills, tools, and resources. Here's a list of requirements you'll need to
consider:
1. Programming Languages and Libraries:
Python: The primary language for implementing machine learning and deep
learning models.
Deep Learning Frameworks: TensorFlow, PyTorch, or Keras for building and
training neural networks.
Libraries for Image Processing: OpenCV, PIL (Python Imaging Library) for image
preprocessing and manipulation.
2. Hardware and Software:
A computer with sufficient computational power (e.g., a GPU) to speed up training.
Development environment with Python and necessary libraries installed.
3. Dataset:
Gather or obtain a VQA dataset that includes images, questions, and
corresponding answers. We have used the COCO-QA dataset (Ren et al., 2015a, b, c).
4. Pretrained Models and Embeddings:
Pretrained image classification models (e.g., ResNet, VGG, etc.) for feature
extraction. Pre-trained word embeddings (e.g., GloVe, Word2Vec) for representing
words in questions.
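To make the requirements above concrete, here is a minimal sketch (assuming PyTorch and torchvision; the image path, vocabulary, and random embedding matrix are placeholders standing in for real GloVe/Word2Vec vectors) of extracting image features with a pretrained ResNet and embedding a tokenized question.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained CNN backbone for visual features; the classifier head is dropped
# so a 7x7 grid of region features is kept instead of class scores.
# (torchvision >= 0.13 API; older versions use pretrained=True.)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")          # placeholder image path
with torch.no_grad():
    img_feats = backbone(preprocess(image).unsqueeze(0))  # shape (1, 2048, 7, 7)

# Toy word-embedding lookup; a real setup would load GloVe/Word2Vec vectors
# and build the vocabulary from the question corpus.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "apple": 4}
embeddings = torch.randn(len(vocab), 300)                 # stand-in for 300-d GloVe
question = "what color is the apple".split()
question_vecs = embeddings[torch.tensor([vocab[w] for w in question])]
print(img_feats.shape, question_vecs.shape)               # (1, 2048, 7, 7), (5, 300)
```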
Technique for achieving VQA: Explicit Visual Attention
Explicit Visual Attention
Explicit Visual Attention in the context of Visual Question Answering (VQA) is a
concept that enhances the accuracy and interpretability of VQA models by
explicitly guiding the model's focus to specific regions of an image that are
relevant to the question being asked.
[1] S. Manmadhan and B. C. Kovoor, "Visual question answering: a state-of-the-art review," IEEE, 8 April 2020.
Explicit Visual Attention: Working
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
How does Explicit Visual Attention work?
● Identifying relevant parts of an image for a question.
● Generating an attention map highlighting those regions.
● Combining the map with the question's context.
● Enhancing the model's focus on crucial visual details.
● Leading to accurate and contextually aware answers in Visual
Question Answering.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124,
Greece.
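A toy sketch of this workflow, using random placeholder features and simple dot-product scoring (one common choice, not necessarily the scoring function used in the cited paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
regions = torch.randn(5, 64)    # features of 5 image regions (placeholder)
question = torch.randn(64)      # encoded question vector (placeholder)

scores = regions @ question               # relevance of each region to the question
attention_map = F.softmax(scores, dim=0)  # normalized attention map over regions
attended = attention_map @ regions        # question-guided summary of the image

print(attention_map)            # which regions the model "looks at"
print(attended.shape)           # fused visual context passed on to answer prediction
```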
Significance of Using Explicit Visual Attention in VQA
Beyond Soft Attention:
Explicit Visual Attention takes attention mechanisms a step further. Unlike soft
attention, which distributes focus diffusely across the whole image, explicit attention
directly guides the model's gaze to specific image regions. This precise focus enhances
the model's understanding of image-question relationships, leading to more accurate and
contextually relevant answers.
Interpretable Focus:
Explicit Visual Attention's localized focus provides transparency into how the
model processes images and questions.
[1] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, "Explicit Knowledge-based
Reasoning for Visual Question Answering," School of Computer Science, The University of
Adelaide, November 13, 2015.
Addressing Ambiguity with Clarity
VQA questions can be ambiguous. Explicit Visual Attention helps the model
address ambiguity by attending to the most relevant part of the image, clarifying
the context and intent of the question. This ensures more accurate answers, even
when the question phrasing might lead to multiple interpretations.
Capturing Fine Details
Images often contain intricate details critical for answering specific questions.
Explicit Visual Attention ensures the model doesn't miss these details, leading to
more precise answers.
[1] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, "Explicit Knowledge-
based Reasoning for Visual Question Answering," School of Computer Science, The
University of Adelaide, November 13, 2015.
Existing Projects and Research Works
An explicitly trained attention model inspired by the theory of the
pictorial superiority effect. The model uses attention-oriented
word embeddings that increase the efficiency of learning common
representation spaces.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question
Answering using Explicit Visual Attention," IEEE, 2018.
A novel model called Multi-modal Explicit Sparse Attention
Networks (MESAN), which concentrates the model’s attention by
explicitly selecting the parts of the input features that are the most
relevant to answering the input question.
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention
Networks for Visual Question Answering," College of
Information Engineering, Shanghai, 2020.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right Image
Regions: A Visual Attention Regularization Approach," Shandong University and China
University of Petroleum (East China), China, IEEE.
How is explicit visual attention different from other attention mechanisms?
Explicit Visual Attention is a specific approach within the broader field of attention
mechanisms used in tasks like Visual Question Answering (VQA). It differentiates
itself from other attention mechanisms through its focus on providing a more
controlled and interpretable way of directing a model's attention to specific regions
of an image.
Soft Attention: It distributes attention across the entire image, allowing the model
to consider all regions simultaneously. The model assigns varying degrees of
attention to different parts of the image, but there's no strict focus on specific
regions.
[1] Z. Guo and D. Han, "Multi-Modal Explicit Sparse Attention Networks for Visual Question
Answering," College of Information Engineering, Shanghai, 2020.
Hard Attention:
It forces the model to focus on a single region of the image. Only one region is
chosen for processing, leading to limitations in handling complex scenes or
capturing multiple relevant details. It can be computationally expensive and may
not always be practical for complex tasks.
Implicit Visual Attention:
It refers to attention mechanisms where the model's focus on image regions is
learned implicitly as part of the model's internal processes. While it may capture
relevant regions, the exact regions attended to might not be transparent or easily
interpretable.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124,
Greece.
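The contrast between soft and hard attention can be seen in a few lines; the scores below are toy values, and the argmax selection merely illustrates hard attention (being non-differentiable, it is usually trained with sampling-based estimators in practice):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0.2, 1.5, -0.3, 0.9])    # toy relevance scores for 4 regions

# Soft attention: every region keeps a non-zero weight.
soft_weights = F.softmax(scores, dim=0)

# Hard attention: exactly one region is selected (argmax shown only for
# illustration; REINFORCE-style sampling is the usual training strategy).
hard_weights = F.one_hot(scores.argmax(), num_classes=scores.numel()).float()

print(soft_weights)   # distributed focus over all regions
print(hard_weights)   # all focus on the single highest-scoring region
```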
Architecture of this model:
[1] Z. Lei, G. Zhang, L. Wu, K. Zhang, and R. Liang, "A Multi-level Mesh Mutual Attention Model
for Visual Question Answering," IEEE, 2022.
Input Layer: The input consists of an image and a question. The image is fed into
a CNN, and the question is tokenized and processed by an RNN.
CNN (Convolutional Neural Network): The CNN processes the image and
extracts high-level visual features. These features capture the image's content,
including objects, patterns, and structures.
RNN (Recurrent Neural Network): The RNN processes the question tokens
sequentially and generates a question representation. The question representation
captures the semantic context of the question.
Attention Mechanism: The attention mechanism combines the image features
from the CNN and the question features from the RNN. It computes attention
scores that determine the importance of different image regions for answering the
question.
Attention Map Generation:
The attention scores are used to generate an attention map that highlights
relevant regions of the image. The attention map guides the model's focus on
specific image regions.
Enhanced Image Features:
The attention map is combined with the original image features, creating
enhanced image features. These enhanced features emphasize the regions
of the image that are important for the given question.
Joint Representation:
The enhanced image features and the question representation from the RNN
are combined. This creates a joint representation that incorporates both visual
and textual information, with explicit attention.
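A minimal sketch of these stages, from attention scores to the joint representation; the feature dimensions and the additive (tanh) scoring form are illustrative assumptions rather than the exact design of the cited models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitAttentionFusion(nn.Module):
    """Attention map -> enhanced image features -> joint representation."""

    def __init__(self, img_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (B, R, img_dim) region features from the CNN
        # q_feat:    (B, q_dim)      question representation from the RNN
        joint = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_feat).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=1)   # attention map (B, R)
        enhanced = (attn.unsqueeze(-1) * img_feats).sum(dim=1)   # enhanced image features
        fused = torch.cat([enhanced, q_feat], dim=-1)            # joint representation
        return fused, attn

# Usage with random placeholder tensors:
model = ExplicitAttentionFusion()
img = torch.randn(2, 49, 2048)      # 2 images, 7x7 = 49 regions
q = torch.randn(2, 512)
fused, attn_map = model(img, q)
print(fused.shape, attn_map.shape)  # (2, 2560), (2, 49)
```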
Answer Prediction:
The joint representation is used to predict the answer to the question. The
model leverages the enhanced features and contextually focused attention for
more accurate responses.
Attention Visualization:
Optionally, the attention map can be visualized to show which regions of the
image were attended to during answer prediction. This helps users
understand the model's reasoning process.
Training and Optimization:
The model is trained using labeled VQA data, which includes questions,
images, and answers. During training, the model learns to generate accurate
attention maps and make correct predictions.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right
Image Regions: A Visual Attention Regularization Approach," Shandong University and
China University of Petroleum (East China), China, IEEE.
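A hedged sketch of answer prediction and one training step over such a joint representation (assuming the 2560-dimensional output of the earlier fusion sketch); the answer-vocabulary size, classifier shape, and optimizer settings are illustrative placeholders:

```python
import torch
import torch.nn as nn

num_answers = 1000   # assumed answer-vocabulary size (dataset dependent)
classifier = nn.Sequential(
    nn.Linear(2560, 1024),   # 2560 = 2048 image dims + 512 question dims
    nn.ReLU(),
    nn.Linear(1024, num_answers),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

fused = torch.randn(8, 2560)                   # joint representations for a mini-batch
answers = torch.randint(0, num_answers, (8,))  # ground-truth answer indices

logits = classifier(fused)        # answer prediction over the answer vocabulary
loss = criterion(logits, answers)
optimizer.zero_grad()
loss.backward()
optimizer.step()                  # one optimization step on labeled VQA data
print(float(loss))
```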
Training of this model:
Data Preparation:
Gather a dataset of images, corresponding questions, and ground truth answers.
Preprocess the images by resizing, normalizing pixel values, and extracting features using a
pre-trained CNN. Tokenize and encode the questions using word embeddings and an RNN.
Attention Mechanism Setup:
Choose an attention mechanism architecture that combines image and question features to
generate attention scores. Design the attention mechanism to generate attention maps that
highlight important image regions.
Loss Function:
The loss function measures the difference between predicted answers and ground truth
answers. We will choose a loss function that suits the problem. Common loss functions for
VQA include:
Cross-Entropy Loss: Suitable for single-label classification tasks where each question has a
single correct answer.
Ranking Loss: Used when multiple correct answers are available for a question. The model
is trained to rank the correct answer higher than incorrect ones.
Attention Mechanism Loss: If using attention, consider incorporating attention-related loss
terms that encourage the model to focus on the right image regions.
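One way these pieces might be combined, sketched below: cross-entropy on the answer plus a KL-based attention term that encourages the predicted map to match attention targets (e.g., derived from bounding boxes); the weighting factor is an assumed hyperparameter, not a value from the cited work:

```python
import torch
import torch.nn.functional as F

def vqa_loss(logits, answer_ids, pred_attn, target_attn, attn_weight=0.5):
    # Standard answer classification loss.
    ce = F.cross_entropy(logits, answer_ids)
    # KL divergence between normalized attention distributions over regions,
    # pushing the predicted map towards the target (e.g., box-derived) map.
    attn_term = F.kl_div(pred_attn.clamp_min(1e-8).log(), target_attn,
                         reduction="batchmean")
    return ce + attn_weight * attn_term

logits = torch.randn(4, 1000)                            # predicted answer scores
answers = torch.randint(0, 1000, (4,))                   # ground-truth answers
pred_attn = torch.softmax(torch.randn(4, 49), dim=1)     # model's attention maps
target_attn = torch.softmax(torch.randn(4, 49), dim=1)   # stand-in attention targets
print(vqa_loss(logits, answers, pred_attn, target_attn))
```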
Fine-Tuning and Transfer Learning:
Leverage pre-trained models to bootstrap our VQA model's performance.
Transfer Learning: Start with pre-trained models for both vision (CNNs) and language
(BERT, GPT) to provide a strong initial representation for our model.
Fine-Tuning: Fine-tune the pre-trained models on our VQA dataset to adapt them to the
specific task.
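A short sketch of this freeze-then-fine-tune pattern with a torchvision ResNet; which layers to unfreeze and the learning rate are illustrative choices:

```python
import torch
import torchvision.models as models

# Transfer learning: start from a pretrained backbone, freeze its weights,
# and fine-tune only the later layers plus the VQA-specific heads.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

for param in resnet.parameters():
    param.requires_grad = False          # freeze everything first

for param in resnet.layer4.parameters():
    param.requires_grad = True           # unfreeze the last block for fine-tuning

trainable = [p for p in resnet.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)   # smaller LR for fine-tuning
print(sum(p.numel() for p in trainable), "trainable parameters")
```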
Validation and Evaluation:
Regularly evaluate the model's performance on a validation set to monitor its progress.
Validation Metrics: Use accuracy, top-k accuracy, or other relevant metrics to track
performance.
Hyperparameter Tuning: Adjust hyperparameters based on validation performance.
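A small helper for top-k accuracy on a validation batch (toy random tensors stand in for real model outputs):

```python
import torch

def topk_accuracy(logits, targets, k=5):
    """Fraction of questions whose ground-truth answer is in the top-k predictions."""
    topk = logits.topk(k, dim=1).indices             # (B, k) predicted answer ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(16, 1000)                       # validation-batch predictions (toy)
targets = torch.randint(0, 1000, (16,))
print("top-1:", topk_accuracy(logits, targets, k=1))
print("top-5:", topk_accuracy(logits, targets, k=5))
```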
Comparison of results
[1] S. Manmadhan and B. C. Kovoor, "Visual
question answering: a state-of-the-art
review," IEEE, 2020.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual
Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle
University of Thessaloniki, Thessaloniki, 54124,
Greece.
Graphical Interpretation
[1] V. Lioutas, N. Passalis, and A. Tefas, "Explicit ensemble attention learning for improving visual
question answering," Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki
54124, Greece.
Advantages of this model:
Accurate Answers: By combining visual information from CNNs with semantic context from
RNNs, the architecture enables the model to generate more accurate answers. The explicit
attention mechanism further enhances accuracy by focusing on relevant image
regions (Lopes et al., 2017).
Contextual Understanding: The joint representation created by merging image features
and question features captures the contextual relationship between visual and textual
elements, facilitating a deeper understanding of the question (Vinyals et al., 2015).
Handling Complex Scenes: CNNs excel at capturing visual details in images, enabling the
model to handle complex scenes with multiple objects, textures, and patterns, enhancing the
model's ability to provide meaningful answers (Antol et al., 2015).
Limitations of this model:
Computational Complexity: Combining CNNs and RNNs can lead to higher computational
demands during both training and inference. This might result in longer training times and
slower predictions, especially for real-time applications (Szegedy et al., 2015).
Data Requirements: Training an architecture with both CNNs and RNNs requires labeled
data that includes images, questions, and answers. Gathering such data can be
time-consuming and expensive (Antol et al., 2015).
Hyperparameter Tuning: The architecture involves multiple components with their own set
of hyperparameters (e.g., learning rates, sequence lengths). Fine-tuning these parameters
for optimal performance can be challenging and time-intensive (Xu et al., 2015).
Overfitting: The combination of complex components might increase the risk of overfitting,
where the model performs well on the training data but struggles to generalize to new,
unseen examples (Malinowski et al., 2015).
APPLICATIONS
Visual Question Answering (VQA) has several potential applications across
various domains. Here are some ways in which VQA can be used:
1) Image Captioning and Enhancement: VQA can be used to automatically
generate descriptive captions for images or videos, enhancing their accessibility
and understandability. This is particularly useful for visually impaired individuals
who can benefit from textual descriptions of visual content (Vinyals et al., 2015).
2) Interactive AI Systems: VQA can be integrated into interactive AI systems,
such as virtual assistants or chatbots, to allow users to ask questions about
images or scenes. For example, a user could ask a virtual shopping assistant
about the details of a product shown in an image (Antol et al., 2015).
3) Visual Content Retrieval: VQA can assist in retrieving specific visual content
from large databases based on textual queries. This is helpful for searching
through image or video collections, enabling more intuitive and specific
searches (Gordo et al., 2017).
4) Education and Learning: VQA can be used in educational settings to create
interactive learning materials. Students can ask questions about images or
diagrams to deepen their understanding of the subject matter (Malinowski et al., 2015).
5) Medical Imaging: In the medical field, VQA can aid doctors in understanding
medical images like X-rays. By asking questions about specific aspects of the
images, clinicians can receive insights to support diagnosis decisions (Lopes et al., 2017).
6) Safety and Surveillance: VQA can be used in surveillance systems to answer
questions about monitored scenes, helping operators identify relevant objects or events.
SCOPE
1. Improved Performance: Explicit visual attention has been shown to improve
the performance of models in tasks that involve understanding images in the
context of natural language questions (Anderson et al., 2018).
2. Interpretability: Attention maps generated by explicit visual attention provide
insights into the reasoning process of the model. Users and researchers can
visualize these maps to understand which image regions contribute most to the
answer, making the decision-making process more transparent and
interpretable (Xu et al., 2015).
3. Fine-Grained Understanding: Explicit attention mechanisms enable models to
pay selective attention to specific objects, regions, or details within an image,
leading to a deeper understanding of the visual content (Xu et al., 2015).
4. Robustness to Clutter: In images with multiple objects or complex scenes,
explicit visual attention can help the model focus on the relevant objects or regions
while ignoring distractions or irrelevant parts of the image (Ba et al., 2014).
5. Adaptability to Different Questions: Attention mechanisms can adapt to the
nature of the question being asked. For example, for questions like "What color is
the apple?" the attention can be directed to the region containing the apple (Luong
et al., 2015).
6. Localization in Images: Explicit visual attention can be extended to tasks like
image localization, where the model is required to identify and highlight specific
objects or regions within an image (Xu et al., 2015).
7. Personalized User Interfaces: In applications involving human-computer
interaction, attention maps can be used to create user interfaces that highlight the
areas of an image the model is focusing on (Chen et al., 2018).
CURRENT STATE OF THE ART
There are several interesting directions for future work. More deliberate
techniques could be employed for combining multiple attention models, e.g., the
AdaBoost technique. Furthermore, in the proposed method the attention model
was not trained when a question does not contain ground-truth bounding boxes.
Exploiting the information contained in these image-question pairs, in a way
similar to implicit attention, can lead to a hybrid implicit-explicit attention model
that further improves visual question answering accuracy.
[1] V. Lioutas, N. Passalis, and A. Tefas, "Visual Question Answering using Explicit Visual
Attention," Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
Advanced pooling techniques, such as BoF pooling, can also be used to improve
the scale invariance of the attention model and provide more reliable attention
information. The proposed methodology can also be applied to other tasks that
require high-level visual understanding, such as image caption generation and
video caption generation. Finally, the proposed approach could also be used to
improve the precision of multi-modal information retrieval, where providing
accurate visual attention information given a textual query from the user is of
critical significance.
These models integrate attention mechanisms and have led to competitive results.
Pretrained language models like BERT have been fine-tuned for VQA. By
leveraging large amounts of textual data, these models can capture contextual
relationships between question and answer elements.
[1] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, and L. Nie, "Answer Questions with Right Image
Regions: A Visual Attention Regularization Approach," Shandong University and China
University of Petroleum (East China), China, IEEE.
CONCLUSION
Visual Question Answering is an interesting combination of Computer Vision
and Natural Language Processing which is growing by utilizing the power of
deep learning methods. It is a very challenging task that requires solving many
subtasks such as object detection, activity recognition, reasoning about spatial
relationships between objects, and commonsense reasoning.
In conclusion, our implementation of explicit visual attention in our VQA model,
which employs CNNs for computer vision and RNNs for text processing, has
yielded substantial improvements in terms of both accuracy and loss.
The main obstacle on the path of VQA models towards the broader goal of AI is
that it is not clear what the source of improvement is, and to what extent the
model has understood the visual-language concepts.
[1] S. Manmadhan and B. C. Kovoor, "Visual question answering: a state-of-the-art
review," IEEE, 8 April 2020.
REFERENCES
Authors: Vasileios Lioutas, Nikolaos Passalis, Anastasios Tefas
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8351158
Authors: Yibing Liu, Yangyang Guo, Jianhua Yin, and Xuemeng Song
https://arxiv.org/pdf/2102.01916.pdf
Authors: Zhi Lei, Guixian Zhang, Lijuan Wu, Kui Zhang, Rongjiao Liang
https://link.springer.com/article/10.1007/s41019-022-00200-9
Authors: Zihan Guo and Dezhi Han
https://www.mdpi.com/1424-8220/20/23/6758
Thank You !!