Stance classification. Uni Cambridge 22 Jan 2021

REVISITING AND RE-EVALUATING RUMOUR STANCE CLASSIFICATION
University of Cambridge, 22nd January 2021
Carolina Scarton
c.scarton@sheffield.ac.uk
carolscarton

ONLINE RUMOURS
“circulating story of questionable veracity,
which is apparently credible but hard to verify,
and produces sufficient skepticism and/or
anxiety so as to motivate finding out the actual
truth” (Zubiaga et al., 2015)

RUMOUR STANCE CLASSIFICATION
➢ What is being said about a rumour?

➢ Stance of replies can help in predicting veracity (Mendoza et al., 2010;
Kumar and Carley, 2019) → especially denies (Zubiaga et al., 2016)
➢ However:
• Four-class classification problem: support, deny, query, comment
• Highly imbalanced problem
• Support and deny are the most important classes
• Different from the traditional stance classification task

➢ RumourEval 2017 and 2019 → most used datasets (PHEME project)
• Task A: rumour stance classification
• Current models and official evaluation metrics:
• not robust for four-class imbalanced problems
• not robust for problems where classes have different importance

RUMOUREVAL 2017 → ACCURACY SCORE
WINNER - ACC: 0.784

Adjusted weights

Two-step classification

Over-sampling

DEALING WITH IMBALANCED DATA
FOR STANCE CLASSIFICATION
Yue Li and Carolina Scarton (2020): Revisiting Rumour Stance Classification: Dealing with Imbalanced Data. RDSM 2020.

GOING BACK TO BASICS...
➢ RumourEval 2017 data
➢ Feature-based classifier:
• GloVe word embeddings (averaged Twitter embeddings)
• Features from Twitter metadata (Aker et al., 2017): number of replies, has URL, verified account, number of followers, etc.
• Textual features (Aker et al., 2017): sentiment analysis, emoticon analysis, has slang or curse word, surprise/doubt scores, etc.
macro-F1: 0.486
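
To make the feature-based setup concrete, the sketch below builds one feature vector per tweet by averaging word embeddings and appending a few metadata features. It is only an illustration under stated assumptions: the three-dimensional `TOY_EMBEDDINGS` table, the tweet field names (`text`, `num_replies`, `verified`, `num_followers`) and the log scaling of the follower count are hypothetical, not taken from the slides or from Aker et al. (2017). A real system would load pre-trained GloVe Twitter vectors instead.

```python
import math
from typing import Dict, List

# Hypothetical toy embeddings; a real setup would load pre-trained GloVe vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "is": [0.1, 0.0, 0.2],
    "this": [0.0, 0.3, 0.1],
    "true": [0.5, 0.1, 0.0],
}
DIM = 3


def average_embedding(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def tweet_features(tweet: dict) -> List[float]:
    """Concatenate the averaged embedding with simple metadata features."""
    tokens = tweet["text"].lower().split()
    text_vec = average_embedding(tokens, TOY_EMBEDDINGS, DIM)
    metadata = [
        float(tweet["num_replies"]),               # number of replies
        1.0 if "http" in tweet["text"] else 0.0,   # has URL
        1.0 if tweet["verified"] else 0.0,         # verified account
        math.log1p(tweet["num_followers"]),        # followers (log scale is an assumption)
    ]
    return text_vec + metadata


example = {"text": "Is this true http://t.co/x", "num_replies": 2,
           "verified": False, "num_followers": 0}
features = tweet_features(example)
```

The resulting vector can be passed to any standard classifier; textual features such as sentiment or surprise/doubt scores would be appended in the same way.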

… LOOKING INTO SOTA
➢ BERT model → fine-tuning BERT for the stance classification task
macro-F1: 0.516 (BERT) vs. macro-F1: 0.486 (feature-based classifier)

DEALING WITH IMBALANCED DATA (TRADITIONAL METHODS)
➢ Data-based approaches:
• Random over- and under-sampling: ROS and RUS
• Synthetic over-sampling:
• SMOTE: uses the k-nearest neighbours of each observation in the minority class
• ADASYN: uses the level of hardness of learning each data observation
• Hybrid sampling: SMOTEENN → data cleaning
➢ Learning-based approach: threshold moving (TM) →
changing probabilities of predicted classes
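
The data-based approaches listed above can be sketched in a few lines. The code below is a minimal, standard-library illustration and not the implementation used in the work presented here: `random_undersample` and `random_oversample` are hypothetical helper names, and `smote_like_point` only shows the interpolation step of SMOTE, assuming the neighbour has already been chosen among the k-nearest minority-class observations.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(X, y):
    """Map each label to the list of its feature rows."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def _resample(X, y, pick_target, seed):
    """Resample every class to the size returned by pick_target (min or max)."""
    rng = random.Random(seed)
    groups = _group_by_label(X, y)
    target = pick_target(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        if len(rows) >= target:
            chosen = rng.sample(rows, target)  # drop rows, without replacement
        else:
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            chosen = rows + extra              # duplicate rows, with replacement
        X_out.extend(chosen)
        y_out.extend([label] * len(chosen))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: shrink every class to the size of the smallest class."""
    return _resample(X, y, min, seed)


def random_oversample(X, y, seed=0):
    """ROS: grow every class to the size of the largest class."""
    return _resample(X, y, max, seed)


def smote_like_point(observation, neighbour, rng):
    """Synthetic point on the segment between an observation and a neighbour."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(observation, neighbour)]


X = [[float(i)] for i in range(6)]
y = ["comment"] * 4 + ["deny", "support"]
X_rus, y_rus = random_undersample(X, y)
X_ros, y_ros = random_oversample(X, y)
```

Resampling should be applied to the training folds only, so that the evaluation data keeps its original class distribution.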

METHODOLOGY - MODEL SELECTION
➢ Training data: RumourEval 2017 training set
➢ Evaluation: RumourEval 2017 test set

➢ Training process: 4-fold cross-validation for hyperparameter
tuning, including the parameters of synthetic over-sampling

➢ Each experiment is run 10 times to assess model stability

➢ Evaluation metrics: Macro-F1, geometric mean of Recall (GMR)
➢ Feature-based classifiers: logistic regression (LR), random forest (RF) and multilayer perceptron (MLP)
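
The two evaluation metrics can be computed from per-class precision and recall. The following sketch uses the standard definitions (macro-F1 as the unweighted mean of per-class F1, GMR as the geometric mean of per-class recall); the function names and the tiny label lists are made up for illustration.

```python
import math

LABELS = ["support", "deny", "query", "comment"]


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


def gmr(y_true, y_pred, labels):
    """Geometric mean of per-class recall; zero if any class is never recovered."""
    scores = per_class_scores(y_true, y_pred, labels)
    return math.prod(recall for _, recall, _ in scores.values()) ** (1 / len(labels))


# Toy example: the single deny reply is misclassified as comment.
y_true = ["support", "deny", "query", "comment", "comment", "comment"]
y_pred = ["support", "comment", "query", "comment", "comment", "comment"]
toy_macro_f1 = macro_f1(y_true, y_pred, LABELS)
toy_gmr = gmr(y_true, y_pred, LABELS)
```

In this toy case five of six replies are classified correctly, yet GMR drops to zero because no deny reply is recovered, while macro-F1 stays around 0.71.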

RESULTS
● RUS → improves the performance of feature-based classifiers
● TM performs similarly to RUS
● TM works best for the two neural network models, BERT and MLP → good estimation of posterior
probabilities
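
Threshold moving depends on the quality of the posterior probabilities because it only rescales them at prediction time. The sketch below assumes one common formulation, dividing each predicted probability by the class prior and renormalising; the slides do not spell out the exact variant, and the priors and probabilities used here are invented for illustration.

```python
def threshold_moving(probs, priors):
    """Rescale posterior probabilities by the inverse class prior, then renormalise."""
    adjusted = {label: probs[label] / priors[label] for label in probs}
    total = sum(adjusted.values())
    return {label: value / total for label, value in adjusted.items()}


def predict(probs):
    """Pick the label with the highest probability."""
    return max(probs, key=probs.get)


# Hypothetical class priors of an imbalanced training set.
priors = {"support": 0.18, "deny": 0.08, "query": 0.08, "comment": 0.66}
# Hypothetical classifier output for one reply.
probs = {"support": 0.15, "deny": 0.20, "query": 0.05, "comment": 0.60}

before = predict(probs)
after = predict(threshold_moving(probs, priors))
```

Here the prediction moves from comment to deny: the classifier gives deny more than twice its prior probability, which the rescaling rewards. No retraining is needed, so the method is cheap to combine with any probabilistic model.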

RESULTS
● It is very important to assess and select models considering multiple metrics!

RESULTS - RUMOUREVAL 2017 AND RUMOUREVAL 2019

CONCLUSIONS
➢ Feature-based approaches can still be competitive
➢ Traditional methods for dealing with imbalanced data improve both
feature-based and BERT-based approaches
➢ BERT-based approaches → SOTA
• Still room for improvement → support and deny
➢ Clever ways of using thread information may help
➢ Evaluation needs to be more detailed

RE-EVALUATING STANCE
CLASSIFICATION TASK
Carolina Scarton, Diego Furtado Silva and Kalina Bontcheva (2020): Measuring What Counts: The case of Rumour Stance
Classification. AACL 2020.

5th - ACC: 0.709 7th - ACC: 0.641

RUMOUREVAL 2019 → MACRO-F1
WINNER - macro-F1: 0.619

3rd - macro-F1: 0.578

7th - macro-F1: 0.370

RUMOUR STANCE CLASSIFICATION EVALUATION
➢ New metrics are needed to reliably evaluate models
• Deal with data imbalance
• Give higher value to the most important classes: support and deny

➢ GMR (geometric mean of recall): heavily penalises models that achieve a low score
for a given class
➢ wAUC: weighted version of AUC
• ROC → relationship between recall (R) and false positive rate (FPR)
➢ wFβ: weighted version of macro-Fβ
• β = 1 → precision and recall have the same importance
• β > 1 → recall has more importance
➢ Weights → empirically defined:
• w_support = 0.40
• w_deny = 0.40
• w_query = 0.15
• w_comment = 0.05
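
As an illustration of how such a weighted metric behaves, the sketch below computes a weighted Fβ with β = 2 using the empirically defined weights above. It assumes the metric is the weighted sum of per-class Fβ scores, which is a reading of "weighted version of macro-Fβ" rather than a definition quoted from the slides; the function names and toy labels are hypothetical.

```python
WEIGHTS = {"support": 0.40, "deny": 0.40, "query": 0.15, "comment": 0.05}


def f_beta(precision, recall, beta):
    """Standard F-beta: beta > 1 gives recall more importance than precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


def weighted_f_beta(y_true, y_pred, weights, beta=2.0):
    """Weighted sum of per-class F-beta scores (assumed form of wF-beta)."""
    score = 0.0
    for label, weight in weights.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score += weight * f_beta(precision, recall, beta)
    return score


y_true = ["support", "deny", "query", "comment", "comment", "comment"]
all_comment = ["comment"] * len(y_true)                       # majority-class baseline
one_miss = ["support", "comment", "query", "comment", "comment", "comment"]

baseline_wf2 = weighted_f_beta(y_true, all_comment, WEIGHTS)
one_miss_wf2 = weighted_f_beta(y_true, one_miss, WEIGHTS)
```

On this toy data the majority-class baseline reaches 50% accuracy but a wF2 of only about 0.04, and missing the single deny reply alone costs 0.40 of the achievable score.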

RUMOUREVAL 2017 → WF2
WINNER - wF2: 0.296 2nd - wF2: 0.294

7th - wF2: 0.230

1st - wF2: 0.509 2nd - wF2: 0.506 3rd - wF2: 0.499

WINNER - wF2: 0.602

4th - wF2: 0.325

2nd - wF2: 0.514 3rd - wF2: 0.505

WEIGHTS DISCUSSION
➢ Weights need to:
Weights based only on data distribution:
Mama Edha:
- w_support = 0.157
- w_deny = 0.396
- w_query = 0.399
- w_comment = 0.048
UPV:
- w_support = 0.200
- w_deny = 0.350
- w_query = 0.350
- w_comment = 0.100

CONCLUSION
➢ Evaluation needs to take into account the purpose of the task:
• Rumour stance classification → improve veracity classification / rumour analysis
• Most informative classes: support and deny
• Highly imbalanced four-class classification problem
➢ Recall-based metrics → higher priority to minority classes
➢ Weighted metrics → higher priority to the most important classes
➢ Ideal evaluation: takes into account multiple metrics!

THANK YOU FOR YOUR ATTENTION!
www.weverify.eu
@WeVerify
Try it yourself: https://cloud.gate.ac.uk/shopfront#tagged=WeVerify
Thanks to Yue Li for a lot of the slides (and work done!)
Collaboration with Kalina Bontcheva and Diego Silva
