Disturbing Image Detection Using LMM-Elicited Emotion
Embeddings
Maria Tzelepi and Vasileios Mezaris
CERTH-ITI, Thermi, Thessaloniki, Greece
LVLM 2024: Integrating Image
Processing with Large-Scale
Language/Vision Models for
Advanced Visual Understanding
Workshop at IEEE ICIP 2024, Abu
Dhabi, United Arab Emirates,
Oct. 2024
Outline
• Introduction
○ Problem statement
○ Motivation
• Proposed method
○ Preliminaries: MiniGPT-4
○ Preliminaries: CLIP
○ DID using LMM-elicited emotion embeddings
• Experimental evaluation
○ Dataset and implementation details
○ Experimental results
• Conclusions
Problem statement
• Disturbing Image Detection (DID): detecting content in images that can cause
trauma to viewers
• Such content may include images that depict violence, pornography, or animal cruelty
• Such content elicits anxiety and/or fear in viewers
• DID is a task of significant importance
• Limited literature, due to the challenging nature of creating datasets, which in turn restricts
the generalization ability of the trained models
○ [1] proposed a framework that exploits large-scale multimedia datasets to automatically
extend initial training datasets with hard examples. An EfficientNet-b1 is trained on the
augmented dataset to address the DID task.
[1] Sarridis, Ioannis, et al. "Leveraging large-scale multimedia datasets to refine content moderation models." 2022 IEEE
Eighth International Conference on Multimedia Big Data (BigMM). IEEE, 2022.
Motivation
• Large Language Models (LLMs) have demonstrated exceptional performance in
several downstream vision recognition tasks
• Goal: Address the DID task exploiting knowledge encoded in LLMs and
particularly in Large Multimodal Models (LMMs)
• [2] proposed to use an LMM in order to extract semantic descriptions for the
images of a dataset, and use them in order to address generic image
classification tasks
• Apart from these generic semantic descriptions, we propose to extract responses
linked with a complementary task, i.e., emotion recognition
• We argue that we can advance the performance in the DID task by also extracting
LMM-elicited emotions for each image of the dataset
[2] Tzelepi, Maria, and Vasileios Mezaris. "Exploiting LMM-based knowledge for image classification tasks."
International Conference on Engineering Applications of Neural Networks. Cham: Springer Nature Switzerland, 2024.
Preliminaries: MiniGPT-4
• GPT-4 is the first model to accept both
text and image input and produce text
output; however, the technical details
behind GPT-4 remain undisclosed
• MiniGPT-4 aligns a frozen visual
encoder with a frozen LLM, using a
single projection layer
- LLM: Vicuna
- Visual encoder: ViT-G/14 from
EVA-CLIP and a Q-Former network
• MiniGPT-4 only requires training the
linear projection layer to align the
visual features with the LLM
The MiniGPT-4 model [3]
[3] Zhu, Deyao, et al. "Minigpt-4: Enhancing vision-language understanding with advanced large language models."
arXiv preprint arXiv:2304.10592 (2023).
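A conceptual sketch of the alignment idea described above: a single trainable linear projection maps the frozen visual encoder's (Q-Former) output tokens into the frozen LLM's embedding space. The dimensions below (768 for the Q-Former output, 5120 for Vicuna-13B) are illustrative assumptions, not values stated on this slide.

```python
import torch
import torch.nn as nn

# Assumed dimensions: BLIP-2-style Q-Former output width and Vicuna-13B hidden size
qformer_dim, llm_dim = 768, 5120

# The only trainable component in MiniGPT-4-style alignment: a single projection layer
projection = nn.Linear(qformer_dim, llm_dim)

visual_tokens = torch.randn(1, 32, qformer_dim)  # placeholder for frozen Q-Former outputs
llm_inputs = projection(visual_tokens)           # aligned tokens, fed to the frozen LLM
print(llm_inputs.shape)                          # torch.Size([1, 32, 5120])
```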
Preliminaries: CLIP
• CLIP comprises an image encoder and a text encoder
• It is trained on (image, text) pairs to predict
which of the possible pairings actually occurred
• To do so, it learns a multimodal embedding
space by jointly training the image and text
encoders to maximize the cosine similarity
between the embeddings of correct (image, text)
pairs, while minimizing the cosine similarity
between the embeddings of incorrect pairs
• CLIP provides outstanding zero-shot
classification performance
• Another approach is to use the CLIP image
encoder to extract image embeddings and feed
them to a separate classifier (both usage modes are sketched below)
The CLIP model [4]
[4] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International
conference on machine learning. PMLR, 2021.
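A minimal sketch of the two CLIP usage modes mentioned above, using the openai `clip` package; the library choice, the image path, and the class prompts are assumptions for illustration.

```python
import torch
import clip  # openai CLIP package (assumed library choice)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image
prompts = clip.tokenize(["a disturbing photo", "a non-disturbing photo"]).to(device)  # assumed prompts

with torch.no_grad():
    image_emb = model.encode_image(image)   # (1, 768) image embedding for ViT-L/14
    text_emb = model.encode_text(prompts)   # (2, 768) text embeddings

    # Zero-shot classification: softmax over cosine similarities of normalized embeddings
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

# Alternatively, the image embedding can be fed to a separately trained classifier.
```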
DID using LMM-elicited emotion embeddings
Proposed method for Disturbing Image Detection:
• Prompt MiniGPT-4 for obtaining 10 semantic descriptions for each image of the dataset
• Prompt MiniGPT-4 for obtaining 10 elicited emotions for each image of the dataset
• Extract the CLIP text embeddings for both sets of MiniGPT-4-generated responses
• These two text embeddings are concatenated with the corresponding CLIP image embeddings and
propagated to a simple classifier for performing the DID task (trained using the cross-entropy loss; see the sketch below)
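A minimal sketch of how the input feature vector could be assembled from the three embedding sources; the example responses, the mean-pooling of each set of responses into a single text embedding, and the file name are assumptions, as the slide only states that the text embeddings are concatenated with the image embedding.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def encode_responses(texts):
    """Encode a list of LMM responses with CLIP's text encoder and mean-pool them (pooling assumed)."""
    tokens = clip.tokenize(texts, truncate=True).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return emb.mean(dim=0)  # one 768-d vector per response set

# Hypothetical MiniGPT-4 responses for one image (10 of each in the actual method)
semantic_descriptions = ["a crowd gathered in a sunny city square", "people walking near a fountain"]
elicited_emotions = ["curiosity", "calmness"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_emb = model.encode_image(image).squeeze(0)

# Concatenate image, semantic-description, and emotion embeddings: 3 x 768 = 2304-d input
feature = torch.cat([img_emb,
                     encode_responses(semantic_descriptions),
                     encode_responses(elicited_emotions)])
```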
Dataset & implementation details
Dataset
• DID-Aug.: augmented DID dataset using hard examples from the YFCC dataset.
• 30,106 training images (8,070 disturbing and 22,036 non-disturbing images)
• 1,080 test images (405 disturbing and 675 non-disturbing images)
Implementation Details
• MiniGPT-4 with Vicuna-13B, run locally
• CLIP version: ViT-L/14
• Classification head: three linear layers of 512, 256, and 2 neurons (sketched below)
• The model is trained for 500 epochs, with the learning rate set to 0.001 and the batch
size set to 32 samples
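A minimal sketch of the classification head and training setup listed above; the input dimension (3 × 768 = 2304 for ViT-L/14 embeddings), the ReLU activations, and the Adam optimizer are assumptions not stated on the slide.

```python
import torch
import torch.nn as nn

class DIDHead(nn.Module):
    """Three linear layers of 512, 256, and 2 neurons, as listed on the slide."""
    def __init__(self, in_dim: int = 3 * 768):  # concatenated image + two text embeddings (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 2),  # disturbing / non-disturbing logits
        )

    def forward(self, x):
        return self.net(x)

model = DIDHead()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # optimizer choice assumed
criterion = nn.CrossEntropyLoss()

# Training skeleton: 500 epochs, batch size 32, as listed above ("loader" is a hypothetical
# DataLoader yielding (B, 2304) feature tensors and (B,) labels).
# for epoch in range(500):
#     for features, labels in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(features), labels)
#         loss.backward()
#         optimizer.step()
```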
Experimental results
MiniGPT-4 responses (semantic descriptions and elicited emotions) for a non-disturbing image.
Example of a test image that was misclassified by the baseline method but correctly classified
as disturbing by the proposed method, along with the LMM-generated responses.
Experimental results
• The proposed method significantly improves the baseline performance (using only
image embeddings) in terms of accuracy, also achieving superior performance
over the current state of the art
Method                         Test Accuracy (%)
CLIP - Image Embeddings [4]    94.444
EfficientNet-b1 [1]            95.000
CLIP - Proposed                96.907

Table 1: Test accuracy on the DID-Aug. dataset - Comparison with the state-of-the-art.
Experimental results
Method                                                                                      Test Accuracy (%)
CLIP - Image Embeddings                                                                     94.444 ± 0.131
CLIP - Emotion Embeddings                                                                   91.092 ± 0.108
CLIP - Semantic Description Embeddings                                                      92.592 ± 0.058
CLIP - Image Embeddings + Emotion Embeddings                                                95.462 ± 0.101
CLIP - Image Embeddings + Semantic Description Embeddings                                   96.222 ± 0.107
CLIP - Emotion Embeddings + Semantic Description Embeddings                                 95.185 ± 0.261
CLIP - Image Embeddings + Emotion Embeddings + Semantic Description Embeddings (proposed)   96.907 ± 0.125

Table 2: Test accuracy on the DID-Aug. dataset - Ablation study.
Experimental results
• Using only the MiniGPT-4-derived knowledge to represent the images of the dataset results,
as expected, in lower performance than the image embeddings, but it remains very competitive
• Interestingly, using only the LMM-elicited emotion embeddings, the model achieves
rather high performance, while combining the two text embeddings leads to a
significant improvement over using either one alone
• Combining the image embeddings with each of the text embeddings leads to
enhanced performance
• The proposed method achieves the highest performance, validating our claim that
the elicited emotion embeddings, which are tailored to the DID task, yield a further
improvement
Conclusions
• We dealt with the DID problem by leveraging knowledge encoded in LMMs
• We appropriately prompted the LMM to extract generic semantic descriptions,
as well as elicited emotions
• We used CLIP's text encoder to obtain the text embeddings of both the
generic semantic descriptions and the LMM-elicited emotions
• We used them, along with the corresponding CLIP image embeddings, to address
the downstream task of DID
• The proposed method achieved very good performance in terms of classification
accuracy, superior to the current state-of-the-art on the DID-Aug. dataset
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
This work has been funded by the European Union as part of the Horizon Europe
Framework Program, under grant agreements 101070190 (AI4TRUST) and 101070109
(TransMIXR).
