This paper describes our participating system for the multi-modal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and large pre-trained multimodal models bridging vision and language, very limited work has been done on applying multimodal techniques to automate the fact-checking process, particularly given the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored: an ensemble model (combining two uni-modal models) and a multi-modal attention network (modeling the interaction between the image and text pairs from the claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets. Exploratory analysis of the Factify dataset is also carried out and uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypotheses. Finally, we highlight challenges of the task and the multimodal dataset for future research.
1. Logically at the Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval Techniques and Transformer Encoder Architecture
Pim Jordi Verschuuren, Jie Gao, Adelize Van Eeden, Stylianos Oikonomou, Anil Bandhakavi
February 2023
2. Table of Contents
1 Introduction
2 Dataset and Evaluation
3 Challenges
4 Data Analysis
5 Our Participating System
6 Results and Experiments
7 Conclusion
3. Factify 2 Challenge: Introduction - Task
Claim
Text: BREAKING: Meghan, the Duchess of Sussex, is in labor with her first child, Buckingham Palace announces. https://abcn.ws/2GXAYNy
Document
Text: By Lauren Said-Moorhouse, CNN. Buckingham Palace tells CNN that the Duchess of Sussex went into labor in the early hours on Monday morning. … Harry and Meghan married … and announced they were expecting in October when they touched down in Australia for their first overseas tour as a married couple.
Label: SUPPORT_MULTIMODAL
● SUPPORT: the document text supports the claim text
● _MULTIMODAL: the claim and document images are similar / depict the same scene/entity
5. Factify 2 Challenge: Introduction - Challenges
Major technical challenges
● Combining different modalities such as text, images, videos and audio into a single system to accurately detect false information
● Subtle differences between fine-grained veracity categories
● Feature representation from multiple sources in order to identify discrepancies
● Integrating NLP for context understanding and semantics within each data source type to correctly assess the factuality presented by various media outlets
● Scalability to large data volumes and the modeling complexity of long multimodal sequences (the document text in this task)
6. Data Analysis - Text Length Distribution
Long text sequence challenge
● Claim/document text and OCR text length distributions from the train set are shown (right)
● Text length distributions differ between the claim and document text
Observations
● Texts are much shorter and less varied in the “Refute” category, in both claim and document
● “Support_Multimodal” and “Support_Text” have document text lengths that are longer on average and more widely spread
● “Insufficient_Multimodal” and “Insufficient_Text” have document text lengths that are shorter on average
● OCR text distributions show less variability between claim and document
● “Refute” has the longest OCR texts on average
○ Followed by the two text-related categories
Fig 2 (a). Claim text length dist. Fig 2 (b). Doc text length dist.
Fig 3 (a). Claim OCR length dist. Fig 3 (b). Doc OCR length dist.
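The per-category length statistics above can be reproduced with a few lines of pandas. The sketch below is a minimal illustration only; the file path and column names (claim, document, claim_ocr, document_ocr, Category) are assumptions, and whitespace tokenisation is used as a rough length proxy.

```python
# Minimal sketch of the per-category text length analysis (paths/columns are assumptions).
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the Factify 2 train split

# Token counts per field, using whitespace splitting as a rough proxy for length.
fields = ["claim", "document", "claim_ocr", "document_ocr"]
for field in fields:
    df[f"{field}_len"] = df[field].fillna("").str.split().str.len()

# Length statistics grouped by the five veracity categories.
stats = df.groupby("Category")[[f"{f}_len" for f in fields]].describe()
print(stats)
```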
7. Data Analysis - Image Similarity Distribution
Image matching challenge
● A pre-trained CLIP model is used to encode images
● Correlation between claim and document images is measured with cosine similarity
Observations
● Image similarity is higher for the two “Multimodal” categories than for the other three categories
○ I.e., it can be leveraged to verify the multimodal entailment categories
● The correlation between image similarity and labels has increased considerably compared to the Factify 1 dataset
Fig 4 (a). Claim Image and Doc Image Similarity Histogram
Fig 4 (b). Claim Image and Doc Image Similarity Dist.
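A minimal sketch of this image-similarity check, using a CLIP checkpoint exposed through sentence-transformers; the model name (clip-ViT-B-32) and image paths are illustrative assumptions, not necessarily the exact setup used in the system.

```python
# Minimal sketch: cosine similarity between CLIP embeddings of the claim and document images.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # assumed CLIP checkpoint

claim_emb = clip.encode(Image.open("claim_image.jpg"), convert_to_tensor=True)
doc_emb = clip.encode(Image.open("doc_image.jpg"), convert_to_tensor=True)

# Higher cosine similarity suggests the two images depict the same scene/entity,
# a useful signal for the *_Multimodal categories.
print(util.cos_sim(claim_emb, doc_emb).item())
```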
8. Data Analysis - Multimodal Similarity Distribution
Cross-modality matching challenge
● A pre-trained CLIP model is used to encode both text and images
● Cross-modal pair correlation is measured with cosine similarity
Observations
● “Support_Multimodal” shows a relatively higher pairwise similarity between the claim text and the evidence (document) image
● “Insufficient_Text” has the lowest pairwise similarity between the claim text and the document image
● No significant correlation among the other multimodal pairs
Fig. 5 (b). Claim Text and Doc Image Similarity Dist.
Fig. 6 (c). Claim Image and Doc Text Similarity Dist.
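The cross-modal check mirrors the image-image sketch above, except that the claim text goes through CLIP's text encoder and the document image through its image encoder; because CLIP projects both into a shared space, a single cosine score captures their agreement. Model name and inputs below are again illustrative assumptions.

```python
# Minimal sketch: cosine similarity between a claim-text embedding and a document-image embedding.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # assumed CLIP checkpoint

text_emb = clip.encode("Meghan, the Duchess of Sussex, is in labor with her first child.",
                       convert_to_tensor=True)
image_emb = clip.encode(Image.open("doc_image.jpg"), convert_to_tensor=True)

print(util.cos_sim(text_emb, image_emb).item())
```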
9. Our Participating System - Overall Architecture
A two-stage, evidence-based seq2seq veracity detection system consisting of:
● a textual evidence retrieval component
● a transformer-based seq2seq cross-modal veracity model
10. Our Participating System - Evidence Retrieval Component
Textual evidence retrieval
● “multi-qa-mpnet-base-dot-v1” (from SBERT) is used as a bi-encoder to compute embeddings of claim-document text pairs
● The top K passages are selected, ranked and concatenated
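A minimal sketch of this retrieval step with the sentence-transformers bi-encoder follows; the document source, passage splitting and top-K value (5) are assumptions for illustration.

```python
# Minimal sketch of the evidence retrieval step with the SBERT bi-encoder
# (document path, passage splitting and K are assumptions).
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

claim = "Meghan, the Duchess of Sussex, is in labor with her first child."
doc_text = open("evidence_document.txt").read()       # hypothetical evidence document
passages = [p for p in doc_text.split("\n") if p.strip()]

claim_emb = retriever.encode(claim, convert_to_tensor=True)
passage_embs = retriever.encode(passages, convert_to_tensor=True)

# Rank passages by dot-product score (this checkpoint is trained for dot-product search)
# and keep the top K as the condensed textual evidence.
hits = util.semantic_search(claim_emb, passage_embs, top_k=5, score_function=util.dot_score)[0]
evidence = " ".join(passages[hit["corpus_id"]] for hit in hits)
print(evidence)
```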
11. Our Participating System - Seq2seq Cross-Modal Veracity Model
Transformer-based seq2seq cross-modality veracity prediction
● Embedding layer
○ a pre-trained cross-modal model (i.e. CLIP)
○ a pre-trained text embedding (w2v)
● Cross-modal embeddings of the six modality inputs are concatenated (listwise)
● Token embeddings of the claim and evidence passage pairs are concatenated
● Both sequences are fed into two separate transformer encoders before being concatenated and passed through an MLP classifier
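A compact PyTorch sketch of this two-branch design is given below. The embedding dimensions, layer counts, head counts and mean-pooling fusion are assumptions, since the slide only specifies the overall structure: two transformer encoders over the modality and text sequences, concatenation, then an MLP classifier over the five categories.

```python
# Minimal sketch of the two-branch transformer-encoder veracity model (hyperparameters are assumptions).
import torch
import torch.nn as nn

class VeracityModel(nn.Module):
    def __init__(self, modal_dim=512, text_dim=300, hidden=256, num_classes=5):
        super().__init__()
        make_encoder = lambda d: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.modal_encoder = make_encoder(modal_dim)  # over the listwise CLIP embeddings (6 modality inputs)
        self.text_encoder = make_encoder(text_dim)    # over the claim + retrieved-evidence token embeddings
        self.classifier = nn.Sequential(
            nn.Linear(modal_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, modal_seq, text_seq):
        # modal_seq: (batch, 6, modal_dim) precomputed CLIP embeddings
        # text_seq:  (batch, seq_len, text_dim) precomputed token embeddings
        m = self.modal_encoder(modal_seq).mean(dim=1)  # mean-pool each encoded sequence
        t = self.text_encoder(text_seq).mean(dim=1)
        return self.classifier(torch.cat([m, t], dim=-1))

# Dummy forward pass to show the expected shapes.
logits = VeracityModel()(torch.randn(2, 6, 512), torch.randn(2, 128, 300))
print(logits.shape)  # (2, 5)
```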
12. Results and Experiments
Validating and optimising the effect of evidence retrieval settings
● Settings
○ with or without evidence selection
○ varying the length of evidence document text ranked by the evidence retriever
○ passage ranking at paragraph level versus sentence level
○ text-to-text alignment with SBERT vs. cross-modal alignment with CLIP
○ validating whether an SBERT model trained on QA datasets performs better than a general-purpose SBERT model
● Preliminary Results
○ QA-enhanced/fine-tuned models perform better than the all-round model
○ combining SBERT-QA with top-K sentence-level evidence passage retrieval achieves the best performance
○ the best model, "SBERT-QA_sentence_ER_top5", obtains 0.79 weighted avg. F1 after 20 epochs
13. Conclusion
Lessons learnt
● Cross-modal pre-trained models (such as CLIP) exhibit great transferability in zero-shot or few-shot scenarios
● Self-attentive modules based on the transformer encoder architecture are a very effective fine-grained alignment technique for learning hidden relationships across modalities
Future work
● Finer-grained cross-modal representations for intra- and inter-modality alignment (sentence words or entities / visual objects)
● More focus should be placed on real-world challenges
○ cross-modal retrieval
○ diverse contexts and domains
○ large, high-quality multimodal fact-checking datasets reflecting real-world scenarios
○ explainability