This paper describes our participating system for the multi-modal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and large pre-trained multimodal models bridging vision and language, very limited work has been done on applying multimodal techniques to automate the fact-checking process, particularly given the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored: an ensemble model (combining two uni-modal models) and a multi-modal attention network (modeling the interaction between the image and text pairs from the claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets. Exploratory analysis of the Factify dataset is also carried out and uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypotheses. Finally, we highlight challenges of the task and the multimodal dataset for future research.
1. Logically at the Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval Techniques and Transformer Encoder Architecture
Pim Jordi Verschuuren, Jie Gao, Adelize Van Eeden, Stylianos Oikonomou, Anil Bandhakavi
February 2023
2. Table of Contents
1 Introduction
2 Dataset and Evaluation
3 Challenges
4 Data Analysis
5 Our Participating System
6 Results and Experiments
7 Conclusion
3. Factify 2 Challenge: Introduction - Task
Claim
Text: BREAKING: Meghan, the Duchess of Sussex, is in labor with her first child, Buckingham Palace announces. https://abcn.ws/2GXAYNy
Document
Text: By Lauren Said-Moorhouse, CNN. Buckingham Palace tells CNN that the Duchess of Sussex went into labor in the early hours on Monday morning. … Harry and Meghan married … and announced they were expecting in October when they touched down in Australia for their first overseas tour as a married couple.
Label: SUPPORT_MULTIMODAL
● SUPPORT: the document text supports the claim text
● _MULTIMODAL: the claim and document images are similar / depict the same scene/entity
5. Factify 2 Challenge: Introduction - Challenges
Major technical challenges
● Combining different modalities such as text, images, videos and audio into a single system to accurately detect false information
● Subtle differences between fine-grained veracity categories
● Feature representation from multiple sources in order to identify discrepancies
● Integrating NLP for context understanding and semantics within each data source type to correctly assess the factuality presented by various media outlets
● Scalability to large data volumes and the modeling complexity of long multimodal sequences (the document text in this task)
6. Data Analysis - Text Length Distribution
Long text sequence challenge
● Claim/document text and OCR text length distributions from the train set are shown (right)
● Text length distributions differ between the claim and document text
Observations
● Texts are much shorter and less varied in the “Refute” category, in both claim and document
● “Support_Multimodal” and “Support_Text” have document text lengths that are longer on average and more widely spread
● “Insufficient_Multimodal” and “Insufficient_Text” have document text lengths that are shorter on average
● OCR text distributions show less variability between claim and document
● “Refute” has the longest OCR texts on average
○ Followed by the two text-related categories
Fig 2 (a). Claim text length dist. Fig 2 (b). Doc text length dist.
Fig 3 (a). Claim OCR length dist. Fig 3 (b). Doc OCR length dist.
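The per-category length statistics above can be reproduced with a few lines of pandas. The sketch below is a minimal illustration only; the file path and column names (claim, document, claim_ocr, document_ocr, Category) are assumptions, and whitespace tokenisation is used as a rough length proxy.

```python
# Minimal sketch of the per-category text length analysis (paths/columns are assumptions).
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the Factify 2 train split

# Token counts per field, using whitespace splitting as a rough proxy for length.
fields = ["claim", "document", "claim_ocr", "document_ocr"]
for field in fields:
    df[f"{field}_len"] = df[field].fillna("").str.split().str.len()

# Length statistics grouped by the five veracity categories.
stats = df.groupby("Category")[[f"{f}_len" for f in fields]].describe()
print(stats)
```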
7. Data Analysis - Image Similarity Distribution
Image matching challenge
● A pre-trained CLIP model is used to encode images
● Correlation between claim and document images is measured with cosine similarity
Observations
● Image similarity is higher for the two “Multimodal” categories than for the other three categories
○ I.e., it can be leveraged to verify the multimodal entailment categories
● The correlation between image similarity and labels has increased considerably compared to the Factify 1 dataset
Fig 4 (a). Claim Image and Doc Image Similarity Histogram
Fig 4 (b). Claim Image and Doc Image Similarity Dist.
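A minimal sketch of this image-similarity check, using a CLIP checkpoint exposed through sentence-transformers; the model name (clip-ViT-B-32) and image paths are illustrative assumptions, not necessarily the exact setup used in the system.

```python
# Minimal sketch: cosine similarity between CLIP embeddings of the claim and document images.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # assumed CLIP checkpoint

claim_emb = clip.encode(Image.open("claim_image.jpg"), convert_to_tensor=True)
doc_emb = clip.encode(Image.open("doc_image.jpg"), convert_to_tensor=True)

# Higher cosine similarity suggests the two images depict the same scene/entity,
# a useful signal for the *_Multimodal categories.
print(util.cos_sim(claim_emb, doc_emb).item())
```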
8. Data Analysis - Multimodal Similarity Distribution
Cross-modality matching challenge
● A pre-trained CLIP model is used to encode both text and images
● Cross-modal pair correlation is measured with cosine similarity
Observations
● “Support_Multimodal” shows a relatively higher pairwise similarity between the claim text and the evidence (document) image
● “Insufficient_Text” has the lowest pairwise similarity between the claim text and the document image
● No significant correlation among the other multimodal pairs
Fig. 5 (b). Claim Text and Doc Image Similarity Dist.
Fig. 6 (c). Claim Image and Doc Text Similarity Dist.
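The cross-modal check mirrors the image-image sketch above, except that the claim text goes through CLIP's text encoder and the document image through its image encoder; because CLIP projects both into a shared space, a single cosine score captures their agreement. Model name and inputs below are again illustrative assumptions.

```python
# Minimal sketch: cosine similarity between a claim-text embedding and a document-image embedding.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # assumed CLIP checkpoint

text_emb = clip.encode("Meghan, the Duchess of Sussex, is in labor with her first child.",
                       convert_to_tensor=True)
image_emb = clip.encode(Image.open("doc_image.jpg"), convert_to_tensor=True)

print(util.cos_sim(text_emb, image_emb).item())
```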
9. Our Participating System - Overall Architecture
A two-stage, evidence-based seq2seq veracity detection system consisting of:
● a textual evidence retrieval component
● a transformer-based seq2seq cross-modal veracity model
10. Our Participating System - Evidence Retrieval Component
Textual evidence retrieval
● “multi-qa-mpnet-base-dot-v1” (from SBERT) is used as a bi-encoder to compute embeddings of claim-document text pairs
● The top K passages are selected, ranked and concatenated
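A minimal sketch of this retrieval step with the sentence-transformers bi-encoder follows; the document source, passage splitting and top-K value (5) are assumptions for illustration.

```python
# Minimal sketch of the evidence retrieval step with the SBERT bi-encoder
# (document path, passage splitting and K are assumptions).
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

claim = "Meghan, the Duchess of Sussex, is in labor with her first child."
doc_text = open("evidence_document.txt").read()       # hypothetical evidence document
passages = [p for p in doc_text.split("\n") if p.strip()]

claim_emb = retriever.encode(claim, convert_to_tensor=True)
passage_embs = retriever.encode(passages, convert_to_tensor=True)

# Rank passages by dot-product score (this checkpoint is trained for dot-product search)
# and keep the top K as the condensed textual evidence.
hits = util.semantic_search(claim_emb, passage_embs, top_k=5, score_function=util.dot_score)[0]
evidence = " ".join(passages[hit["corpus_id"]] for hit in hits)
print(evidence)
```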
11. Our Participating System - Seq2seq Cross-Modal Veracity Model
Transformer-based seq2seq cross-modality veracity prediction
● Embedding layer
○ a pre-trained cross-modal model (i.e. CLIP)
○ a pre-trained text embedding (w2v)
● Cross-modal embeddings of the six modality inputs are concatenated (listwise)
● Token embeddings of the claim and evidence passage pairs are concatenated
● Both sequences are fed into two separate transformer encoders before being concatenated and passed through an MLP classifier
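A compact PyTorch sketch of this two-branch design is given below. The embedding dimensions, layer counts, head counts and mean-pooling fusion are assumptions, since the slide only specifies the overall structure: two transformer encoders over the modality and text sequences, concatenation, then an MLP classifier over the five categories.

```python
# Minimal sketch of the two-branch transformer-encoder veracity model (hyperparameters are assumptions).
import torch
import torch.nn as nn

class VeracityModel(nn.Module):
    def __init__(self, modal_dim=512, text_dim=300, hidden=256, num_classes=5):
        super().__init__()
        make_encoder = lambda d: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.modal_encoder = make_encoder(modal_dim)  # over the listwise CLIP embeddings (6 modality inputs)
        self.text_encoder = make_encoder(text_dim)    # over the claim + retrieved-evidence token embeddings
        self.classifier = nn.Sequential(
            nn.Linear(modal_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, modal_seq, text_seq):
        # modal_seq: (batch, 6, modal_dim) precomputed CLIP embeddings
        # text_seq:  (batch, seq_len, text_dim) precomputed token embeddings
        m = self.modal_encoder(modal_seq).mean(dim=1)  # mean-pool each encoded sequence
        t = self.text_encoder(text_seq).mean(dim=1)
        return self.classifier(torch.cat([m, t], dim=-1))

# Dummy forward pass to show the expected shapes.
logits = VeracityModel()(torch.randn(2, 6, 512), torch.randn(2, 128, 300))
print(logits.shape)  # (2, 5)
```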
12. Results and Experiments
Validating and optimising the effect of evidence retrieval settings
● Settings
○ with or without evidence selection
○ varying the length of evidence document text ranked by the evidence retriever
○ passage ranking at paragraph level versus sentence level
○ text-to-text alignment with SBERT vs. cross-modal alignment with CLIP
○ validating whether an SBERT model trained on QA datasets performs better than a general-purpose SBERT model
● Preliminary Results
○ QA-enhanced/fine-tuned models perform better than the all-round model
○ combining SBERT-QA with top-K sentence-level evidence passage retrieval achieves the best performance
○ the best model, "SBERT-QA_sentence_ER_top5", obtains 0.79 weighted avg. F1 after 20 epochs
13. Conclusion
Lessons learnt
● Cross-modal pre-trained models (such as CLIP) exhibit great transferability in zero-shot or few-shot scenarios
● Self-attentive modules based on the transformer encoder architecture are a very effective fine-grained alignment technique for learning hidden relationships across modalities
Future work
● Finer-grained cross-modal representations for intra- and inter-modality alignment (sentence words or entities / visual objects)
● More focus should be placed on real-world challenges
○ cross-modal retrieval
○ diverse contexts and domains
○ large, high-quality multimodal fact-checking datasets reflecting real-world scenarios
○ explainability