This paper describes our participating system for the multimodal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and large pre-trained multimodal models across vision and language, very limited work has been done on applying multimodal techniques to automate the fact-checking process, particularly considering the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored: an ensemble model (combining two unimodal models) and a multimodal attention network (modeling the interaction between the image and text pairs from the claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets. Exploratory analysis of the Factify dataset is also carried out and uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypotheses. Finally, we highlight challenges of the task and the multimodal dataset for future research.
1. Logically at Factify 2022:
Multimodal Fact Verification
Jie Gao, Hella-Franziska Hoffmann, Stylianos Oikonomou,
David Kiskovski, Anil Bandhakavi
Feb 28, 2022
3. Factify Challenge: Introduction - Task
Claim
Text: China’s famed wandering elephants are on the move again, heading southwest while a male who broke from the herd is still keeping his distance. https://t.co/o5j7PDDveJ
Document
Text: By Julia Hollingsworth and Zixu Wang, CNN. Updated 1:03 AM ET, Fri June 11, 2021. (CNN) At least a dozen buzzing drones monitor them around the clock. Wherever they go, they're escorted by police. And when they eat or sleep, they're watched by millions online. CNN's Jessie Yeung contributed to this report.
Label: SUPPORT_MULTIMODAL
● images similar / about the same situation
● doc text supports claim text
Data challenge as part of De-Factify at AAAI ‘22
Train pairs: 35000
Validation pairs: 7500
Test pairs: 7500
4 weeks to train/eval, 1 week to apply to test
4. Factify Challenge: Introduction - Usage
● Entailment prediction is a technique for claim verification, i.e., predicting whether the evidence supports or refutes the claim
● Typically, given a tweet with a text message and an image, and a potential evidence article, can we automatically predict its veracity?
Overview (fact-checking pipeline): Claim Detection (worthiness, prioritising, claim matching) → Claim Verification (evidence retrieval → veracity prediction → produce justification). Factify (multimodal entailment) sits within claim verification.
6. Solution: Ensemble Model
● Train two unimodal models:
○ 3-way Textual Entailment:
“What is the relationship between
document and claim?”
support / refute / neutral
○ Image Relatedness:
“Is the doc. image contextually
related to the claim text + image?”
Y / N
● Combine the two unimodal models
with data-specific features into a
multimodal 5-way classifier.
Approach
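The combination step above can be illustrated with a fixed rule that maps the two unimodal outputs onto the 5-way Factify label space (the actual system learns this combination with a classifier, and the function and label strings here are only illustrative):

```python
# Illustrative rule combining the two unimodal predictions into the
# 5-way Factify label space. The submitted system uses a learned
# classifier with additional features, not this fixed rule.

def combine(text_label: str, images_related: bool) -> str:
    """text_label: 'support' | 'refute' | 'neutral' (3-way textual entailment);
    images_related: output of the image-relatedness model (Y/N)."""
    if text_label == "refute":
        return "Refute"  # Refute has no text/multimodal split
    if text_label == "support":
        return "Support_Multimodal" if images_related else "Support_Text"
    # neutral text entailment -> insufficient evidence
    return "Insufficient_Multimodal" if images_related else "Insufficient_Text"
```

This makes explicit why the 3-way text model and the binary image model together span the five Factify classes.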
7. Experiments: 5-way Multimodal Entailment
● Ensemble Model:
sklearn's DecisionTreeClassifier with ‘best’ split and ‘gini’ impurity as the training criterion and a maximum tree depth of 8.
● Feature Creation:
○ Text Entailment: pre-trained BigBird model fine-tuned on the Factify dataset
○ pre-trained ResNet-50 for image cosine similarity
○ sklearn one-hot encoders for image domains
Experiment setup
[Validation and test score tables]
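The ensemble classifier setup can be sketched as follows; the feature values here are synthetic stand-ins for the real BigBird entailment scores, ResNet-50 cosine similarities, and one-hot image-domain encodings, so only the classifier configuration reflects the slide:

```python
# Minimal sketch of the ensemble classifier configuration, assuming a
# toy feature matrix in place of the real BigBird / ResNet-50 / domain
# features described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# toy features: e.g. [text-entailment scores, image cosine sim, domain one-hots]
X = rng.random((200, 6))
y = rng.integers(0, 5, size=200)  # 5-way Factify labels encoded as 0..4

# 'best' split, 'gini' impurity, depth capped at 8, as in the slide
clf = DecisionTreeClassifier(splitter="best", criterion="gini",
                             max_depth=8, random_state=0)
clf.fit(X, y)
```

A shallow tree over a handful of interpretable features keeps the ensemble easy to inspect, which matters given the data biases discussed later.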
8. Experiments: 5-way Multimodal Entailment
[Leaderboard results; same experiment setup as the previous slide]
9. Experiments: 3-way Textual Entailment
● As part of the design we chose to train a separate model to address the textual entailment part of the multi-modal task:
“Given a claim and an evidence document, determine if the text evidence supports, refutes, or is neutral towards the claim.”
● Best model setup:
○ pre-trained Hugging Face BigBird
○ fine-tuned for pairwise classification of claim / doc text pairs over 2 epochs with the AdamW optimizer, learning rate 2e-5, epsilon 1e-8, batch size 4, and a max. sequence length of 1396 tokens.
Experiment setup
Validation Scores
Label Mapping (Factify label → text entailment label):
Support_Multimodal → Support
Support_Text → Support
Insufficient_Multimodal → Insufficient_Evidence
Insufficient_Text → Insufficient_Evidence
Refute → Refute
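The label mapping on this slide, written out as a plain lookup table for use when preparing the 3-way training data:

```python
# The slide's mapping from 5-way Factify labels to the 3-way labels
# used to train the textual-entailment model.
FACTIFY_TO_TEXT_LABEL = {
    "Support_Multimodal": "Support",
    "Support_Text": "Support",
    "Insufficient_Multimodal": "Insufficient_Evidence",
    "Insufficient_Text": "Insufficient_Evidence",
    "Refute": "Refute",
}
```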
10. Data Bias
Text Length Distribution by Label (Train)
OCR Text Length Distribution by Label (Train)
Many of our model choices were inspired by
inherent biases observed in the data.
Generating large annotated gold data sets that
appropriately represent the real-world fact
checking domain remains an ongoing challenge.
Text Word Overlap by Label (Train/Val)
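An analysis like "Text Word Overlap by Label" can be computed with a simple set-overlap measure between claim and document tokens; Jaccard similarity over lowercased whitespace tokens is an assumption here, not necessarily the exact metric used for the plot:

```python
# Illustrative word-overlap measure between claim and document text,
# of the kind used for the bias analysis above. The exact metric
# (Jaccard over lowercased whitespace tokens) is an assumption.
def word_overlap(claim: str, doc: str) -> float:
    a, b = set(claim.lower().split()), set(doc.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

If such a cheap surface statistic separates the labels well, models can exploit it as a shortcut instead of learning genuine entailment, which is exactly the bias concern raised here.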
11. Data Bias
Img Similarity by Label (Val)
Claim Image Source Distribution by Label (Train)
12. Incorrect and Ambiguous Labels
Insufficient_Multimodal
claim: Special counsel Robert Mueller did not have sufficient evidence to prosecute obstruction, but does not exonerate President Trump. https://t.co/nfbBsVjDBG https://t.co/83P7RDQadK
doc: “Attorney General William Barr will now review the report. Robert Mueller ends Russia investigation without more indictments: Source. Special counsel Robert Mueller's much-anticipated report -- the product of nearly two years of investigation -- [..]
Support_Text
claim: President Trump and first lady Melania Trump paid their respects to Supreme Court Justice Ruth Bader Ginsburg as a crowd booed and chanted "Vote him out." https://t.co/M7m7kEIBg7 https://t.co/tWYfyKIdIF
doc: In an unprecedented move, her casket has been placed outside on the court steps. Remembering Supreme Court Justice Ruth Bader Ginsburg. Three days of public mourning for Justice Ruth Bader Ginsburg, a champion of equality and pioneer of women's rights, began Wednesday when her casket arrived at the Supreme Court [..]
13. Conclusion and Discussion
Learnings:
● DecisionTree classifier as the best-performing model
● 3-way text entailment as a separate task with its own value
● DNN-based multimodal model suffers from overfitting (refer to the paper for details)
● Clear data bias and ambiguous labels (e.g. “support_multimodal” vs “support_text”)
Recommendations:
● Improve data creation process to reduce bias
● More practical labels and annotation scheme for real-world applications/challenges
● Further experimentation with multimodal architectures