Stance classification - Presentation QMUL by Carolina Scarton, USFD
1. REVISITING AND RE-EVALUATING RUMOUR STANCE CLASSIFICATION
Queen Mary University of London, 11th November 2020
Carolina Scarton
c.scarton@sheffield.ac.uk
carolscarton
4. A LITTLE BIT ABOUT MYSELF...
➢ UG and MSc from the University of São Paulo, Brazil (2013)
➢ PhD from the University of Sheffield (2017)
➢ Research interests:
• Machine Translation
• Text Simplification
• NLP for social media
• Multi-word expressions processing
• NLP evaluation
• Personalised NLP
• NLP for healthcare
• …
6. ONLINE RUMOURS
“circulating story of questionable veracity, which is apparently credible but hard to verify, and produces sufficient skepticism and/or anxiety so as to motivate finding out the actual truth” (Zubiaga et al., 2015)
12. RUMOUR STANCE CLASSIFICATION
➢ Stance of replies can help in predicting veracity (Mendoza et al., 2010; Kumar and Carley, 2019) → especially denials (Zubiaga et al., 2016)
➢ However:
• four-class classification problem: support, deny, query, comment
• highly imbalanced problem
• support and deny, the most important classes, are the minority classes
• different from the traditional stance classification task
14. RUMOUR STANCE CLASSIFICATION
➢ RumourEval 2017 and 2019 → most used datasets (PHEME project)
• Task A: rumour stance classification
• Current models and official evaluation metrics are:
• not robust to four-class imbalanced problems
• not robust to problems where classes have different importance
19. DEALING WITH IMBALANCED DATA FOR STANCE CLASSIFICATION
Yue Li and Carolina Scarton (to appear): Revisiting Rumour Stance Classification: Dealing with Imbalanced Data. RDSM 2020.
23. GOING BACK TO BASICS...
➢ RumourEval 2017 data
➢ Feature-based classifier:
• GloVe word embeddings (GloVe Twitter vectors, averaged per tweet)
• Features from Twitter metadata (Aker et al., 2017):
• number of replies
• has URL
• verified account
• number of followers, etc.
• Textual features (Aker et al., 2017):
• sentiment analysis
• emoticon analysis
• has slang or curse word
• surprise/doubt scores, etc.
macro-F1: 0.486
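A minimal sketch of the feature-based pipeline above, assuming pre-trained GloVe Twitter vectors are already loaded into a token → vector dict; the metadata field names (n_replies, has_url, etc.) are illustrative placeholders, not the exact Aker et al. (2017) feature extractor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(tokens, glove, dim=200):
    """Average the GloVe vectors of in-vocabulary tokens."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def featurise(tweet, glove):
    """Concatenate the averaged embedding with metadata features
    (hypothetical field names standing in for Aker et al., 2017)."""
    meta = [
        tweet["n_replies"],        # number of replies
        int(tweet["has_url"]),     # tweet contains a URL
        int(tweet["verified"]),    # posted by a verified account
        tweet["n_followers"],      # follower count
    ]
    return np.concatenate([embed(tweet["tokens"], glove), meta])

# X = np.vstack([featurise(t, glove) for t in tweets])
# y = stance labels: support / deny / query / comment
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```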
25. … LOOKING INTO SOTA
➢ RumourEval 2017 data
➢ BERT model → fine-tuning BERT for the stance classification task
macro-F1: 0.516 (BERT) vs. macro-F1: 0.486 (feature-based)
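A minimal sketch of the fine-tuning setup, using the Hugging Face transformers library; the checkpoint choice and the one-step toy loop are assumptions for illustration, not the exact configuration behind the 0.516 score.

```python
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer)

LABELS = ["support", "deny", "query", "comment"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# One training step over a toy batch (reply text only; a full
# system might also encode the source tweet or thread context).
batch = tok(["I seriously doubt this is true", "Confirmed by the BBC"],
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # deny, support
loss = model(**batch, labels=labels).loss
loss.backward()  # followed by an optimiser step in a real loop
```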
29. DEALING WITH IMBALANCED DATA (TRADITIONAL METHODS)
➢ Data-based approaches:
• Random over- and under-sampling: ROS and RUS
• Synthetic over-sampling:
• SMOTE: interpolates between each minority-class observation and its k-nearest neighbours
• ADASYN: focuses synthesis on the observations that are harder to learn
• Hybrid sampling: SMOTEENN → over-sampling followed by data cleaning (Edited Nearest Neighbours)
➢ Learning-based approach: threshold moving (TM) → adjusting the probabilities of the predicted classes, as sketched below
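These methods map directly onto the imbalanced-learn library; a sketch follows, assuming a feature matrix X and stance labels y as in the earlier feature-based setup. The threshold-moving helper implements one common variant (rescaling posteriors by inverse class priors), which may differ in detail from the TM used in the paper.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

samplers = {
    "ROS": RandomOverSampler(),       # duplicate minority examples
    "RUS": RandomUnderSampler(),      # drop majority examples
    "SMOTE": SMOTE(k_neighbors=5),    # interpolate between neighbours
    "ADASYN": ADASYN(),               # focus on hard-to-learn regions
    "SMOTEENN": SMOTEENN(),           # SMOTE + ENN data cleaning
}
# X_res, y_res = samplers["SMOTE"].fit_resample(X, y)

def threshold_moving(probs, class_priors):
    """Boost minority classes by dividing each predicted probability
    by its class prior, then take the argmax per row."""
    return (probs / np.asarray(class_priors)).argmax(axis=1)
```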
34. METHODOLOGY - MODEL SELECTION
➢ Training data: RumourEval 2017 training set
➢ Evaluation: RumourEval 2017 test set
➢ Training process: 4-fold cross-validation for hyperparameter tuning, including the parameters of the synthetic over-sampling methods
➢ Each experiment is run 10 times to assess model stability
➢ Evaluation metrics: macro-F1 and the geometric mean of recall (GMR); see the sketch below
➢ Feature-based classifiers: logistic regression (LR), random forest (RF), multi-layer perceptron (MLP)
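A small sketch of the two selection metrics; the toy labels are invented for illustration. GMR is also available off the shelf as imblearn.metrics.geometric_mean_score.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = ["support", "deny", "comment", "comment", "query"]
y_pred = ["support", "comment", "comment", "comment", "query"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_recall = recall_score(y_true, y_pred, average=None)
gmr = np.prod(per_class_recall) ** (1 / len(per_class_recall))
# GMR collapses to 0 whenever any class has zero recall (here,
# deny), which is what makes it sensitive to neglected minorities.
```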
50. CONCLUSIONS
➢ Feature-based approaches can still be competitive
➢ Traditional methods for dealing with imbalanced data improve both
feature-based and BERT-based approaches
➢ BERT-based approaches → SOTA
• still room for improvement on support and deny
➢ Clever ways of using thread information may help
➢ Evaluation needs to be more detailed
61. RUMOUR STANCE CLASSIFICATION EVALUATION
➢ New metrics are needed to reliably evaluate models
• Deal with data imbalance
• Give higher value to the most important classes: support and deny
➢ Candidate metrics:
• GMR → heavily penalises models that achieve a low score for any single class
• weighted version of AUC → ROC captures the relationship between recall (R) and the false positive rate (FPR)
• weighted version of macro-Fβ:
• β = 1 → precision and recall have the same importance
• β > 1 → recall has more importance
➢ Weights → empirically defined: w_support = 0.40, w_deny = 0.40, w_query = 0.15, w_comment = 0.05 (see the worked sketch below)
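A worked sketch of the weighted macro-Fβ above, combining per-class Fβ scores with the empirically defined weights; the choice β = 2 is an illustrative assumption, since the slide only states that β > 1 favours recall.

```python
import numpy as np
from sklearn.metrics import fbeta_score

LABELS = ["support", "deny", "query", "comment"]
WEIGHTS = np.array([0.40, 0.40, 0.15, 0.05])  # empirically defined

def weighted_macro_fbeta(y_true, y_pred, beta=2.0):
    """Weighted mean of per-class F_beta; beta > 1 favours recall."""
    per_class = fbeta_score(y_true, y_pred, beta=beta,
                            labels=LABELS, average=None)
    return float(np.dot(WEIGHTS, per_class))
```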
71. WEIGHTS DISCUSSION
➢ Weights need to:
• Deal with data imbalance
• Give higher value to the most important classes: support and deny
Weights based only on the data distribution:
Mama Edha: w_support = 0.157, w_deny = 0.396, w_query = 0.399, w_comment = 0.048
UPV: w_support = 0.200, w_deny = 0.350, w_query = 0.350, w_comment = 0.100
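For contrast, a sketch of how distribution-only weights of this kind can arise (inverse class frequency, normalised to sum to one); this is an illustrative scheme, not necessarily the exact derivation used by Mama Edha or UPV.

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(y, labels):
    """Weight each class by the inverse of its frequency in y,
    normalised so the weights sum to 1."""
    counts = Counter(y)
    inv = np.array([1.0 / counts[label] for label in labels])
    return inv / inv.sum()

# With RumourEval-style data (comment dominating, deny/query rare),
# this gives deny and query high weights and comment a low one,
# regardless of which classes actually matter for veracity.
```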
75. CONCLUSION
➢ Evaluation needs to take into account the task purposes:
• Rumour Stance Classification → improve veracity classification / rumour analysis
• Most informative classes: support and deny
• Highly imbalanced four-class classification problem
➢ Recall-based metrics → higher priority to minority classes
➢ Weighted metrics → higher priority to most important classes
Ideal evaluation: takes into account multiple metrics!
76. THANK YOU FOR YOUR ATTENTION!
www.weverify.eu
@WeVerify
Thanks to Yue Li for many of the slides (and the work behind them!)
Collaboration with Kalina Bontcheva and Diego Silva