Stance classification - Presentation QMUL by Carolina Scarton, USFD
1. REVISITING AND RE-EVALUATING RUMOUR STANCE CLASSIFICATION
Queen Mary University of London, 11th November 2020
Carolina Scarton
c.scarton@sheffield.ac.uk
carolscarton
4. A LITTLE BIT ABOUT MYSELF...
➢ UG and MSc from the University of São Paulo, Brazil (2013)
➢ PhD from the University of Sheffield (2017)
➢ Research interests:
• Machine Translation
• Text Simplification
• NLP for social media
• Multi-word expressions processing
• NLP evaluation
• Personalised NLP
• NLP for healthcare
• …
6. ONLINE RUMOURS
“circulating story of questionable veracity, which is apparently credible but hard to verify, and produces sufficient skepticism and/or anxiety so as to motivate finding out the actual truth” (Zubiaga et al., 2015)
12. RUMOUR STANCE CLASSIFICATION
➢ Stance of replies can help in predicting veracity (Mendoza et al., 2010; Kumar and Carley, 2019) → especially denials (Zubiaga et al., 2016)
➢ However:
• four-class classification problem: support, deny, query, comment
• highly imbalanced problem
• support and deny, the most important classes, are the minority classes
• different from the traditional stance classification task
14. RUMOUR STANCE CLASSIFICATION
➢ RumourEval 2017 and 2019 → most used datasets (PHEME project)
• Task A: rumour stance classification
• Current models and official evaluation metrics are:
• not robust to four-class imbalanced problems
• not robust to problems where classes have different importance
19. DEALING WITH IMBALANCED DATA FOR STANCE CLASSIFICATION
Yue Li and Carolina Scarton (to appear): Revisiting Rumour Stance Classification: Dealing with Imbalanced Data. RDSM 2020.
23. GOING BACK TO BASICS...
➢ RumourEval 2017 data
➢ Feature-based classifier:
• GloVe word embeddings (GloVe Twitter vectors, averaged per tweet)
• Features from Twitter metadata (Aker et al., 2017):
• number of replies
• has URL
• verified account
• number of followers, etc.
• Textual features (Aker et al., 2017):
• sentiment analysis
• emoticon analysis
• has slang or curse word
• surprise/doubt scores, etc.
macro-F1: 0.486
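A minimal sketch of the feature-based pipeline above, assuming pre-trained GloVe Twitter vectors are already loaded into a token → vector dict; the metadata field names (n_replies, has_url, etc.) are illustrative placeholders, not the exact Aker et al. (2017) feature extractor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(tokens, glove, dim=200):
    """Average the GloVe vectors of in-vocabulary tokens."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def featurise(tweet, glove):
    """Concatenate the averaged embedding with metadata features
    (hypothetical field names standing in for Aker et al., 2017)."""
    meta = [
        tweet["n_replies"],        # number of replies
        int(tweet["has_url"]),     # tweet contains a URL
        int(tweet["verified"]),    # posted by a verified account
        tweet["n_followers"],      # follower count
    ]
    return np.concatenate([embed(tweet["tokens"], glove), meta])

# X = np.vstack([featurise(t, glove) for t in tweets])
# y = stance labels: support / deny / query / comment
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```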
25. … LOOKING INTO SOTA
➢ RumourEval 2017 data
➢ BERT model → fine-tuning BERT for the stance classification task
macro-F1: 0.516 (BERT) vs. macro-F1: 0.486 (feature-based)
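A minimal sketch of the fine-tuning setup, using the Hugging Face transformers library; the checkpoint choice and the one-step toy loop are assumptions for illustration, not the exact configuration behind the 0.516 score.

```python
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer)

LABELS = ["support", "deny", "query", "comment"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# One training step over a toy batch (reply text only; a full
# system might also encode the source tweet or thread context).
batch = tok(["I seriously doubt this is true", "Confirmed by the BBC"],
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # deny, support
loss = model(**batch, labels=labels).loss
loss.backward()  # followed by an optimiser step in a real loop
```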
29. DEALING WITH IMBALANCED DATA (TRADITIONAL METHODS)
➢ Data-based approaches:
• Random over- and under-sampling: ROS and RUS
• Synthetic over-sampling:
• SMOTE: interpolates between each minority-class observation and its k-nearest neighbours
• ADASYN: focuses synthesis on the observations that are harder to learn
• Hybrid sampling: SMOTEENN → over-sampling followed by data cleaning (Edited Nearest Neighbours)
➢ Learning-based approach: threshold moving (TM) → adjusting the probabilities of the predicted classes, as sketched below
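These methods map directly onto the imbalanced-learn library; a sketch follows, assuming a feature matrix X and stance labels y as in the earlier feature-based setup. The threshold-moving helper implements one common variant (rescaling posteriors by inverse class priors), which may differ in detail from the TM used in the paper.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

samplers = {
    "ROS": RandomOverSampler(),       # duplicate minority examples
    "RUS": RandomUnderSampler(),      # drop majority examples
    "SMOTE": SMOTE(k_neighbors=5),    # interpolate between neighbours
    "ADASYN": ADASYN(),               # focus on hard-to-learn regions
    "SMOTEENN": SMOTEENN(),           # SMOTE + ENN data cleaning
}
# X_res, y_res = samplers["SMOTE"].fit_resample(X, y)

def threshold_moving(probs, class_priors):
    """Boost minority classes by dividing each predicted probability
    by its class prior, then take the argmax per row."""
    return (probs / np.asarray(class_priors)).argmax(axis=1)
```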
34. METHODOLOGY - MODEL SELECTION
➢ Training data: RumourEval 2017 training set
➢ Evaluation: RumourEval 2017 test set
➢ Training process: 4-fold cross-validation for hyperparameter tuning, including the parameters of the synthetic over-sampling methods
➢ Each experiment is run 10 times to assess model stability
➢ Evaluation metrics: macro-F1 and the geometric mean of recall (GMR); see the sketch below
➢ Feature-based classifiers: logistic regression (LR), random forest (RF), multi-layer perceptron (MLP)
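A small sketch of the two selection metrics; the toy labels are invented for illustration. GMR is also available off the shelf as imblearn.metrics.geometric_mean_score.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = ["support", "deny", "comment", "comment", "query"]
y_pred = ["support", "comment", "comment", "comment", "query"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_recall = recall_score(y_true, y_pred, average=None)
gmr = np.prod(per_class_recall) ** (1 / len(per_class_recall))
# GMR collapses to 0 whenever any class has zero recall (here,
# deny), which is what makes it sensitive to neglected minorities.
```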
50. CONCLUSIONS
➢ Feature-based approaches can still be competitive
➢ Traditional methods for dealing with imbalanced data improve both
feature-based and BERT-based approaches
➢ BERT-based approaches → SOTA
• still room for improvement on support and deny
➢ Clever ways of using thread information may help
➢ Evaluation needs to be more detailed
61. RUMOUR STANCE CLASSIFICATION EVALUATION
➢ New metrics are needed to reliably evaluate models
• Deal with data imbalance
• Give higher value to the most important classes: support and deny
➢ Candidate metrics:
• GMR → heavily penalises models that achieve a low score for any single class
• weighted version of AUC → ROC captures the relationship between recall (R) and the false positive rate (FPR)
• weighted version of macro-Fβ:
• β = 1 → precision and recall have the same importance
• β > 1 → recall has more importance
➢ Weights → empirically defined: w_support = 0.40, w_deny = 0.40, w_query = 0.15, w_comment = 0.05 (see the worked sketch below)
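A worked sketch of the weighted macro-Fβ above, combining per-class Fβ scores with the empirically defined weights; the choice β = 2 is an illustrative assumption, since the slide only states that β > 1 favours recall.

```python
import numpy as np
from sklearn.metrics import fbeta_score

LABELS = ["support", "deny", "query", "comment"]
WEIGHTS = np.array([0.40, 0.40, 0.15, 0.05])  # empirically defined

def weighted_macro_fbeta(y_true, y_pred, beta=2.0):
    """Weighted mean of per-class F_beta; beta > 1 favours recall."""
    per_class = fbeta_score(y_true, y_pred, beta=beta,
                            labels=LABELS, average=None)
    return float(np.dot(WEIGHTS, per_class))
```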
71. WEIGHTS DISCUSSION
➢ Weights need to:
• Deal with data imbalance
• Give higher value to the most important classes: support and deny
Weights based only on the data distribution:
Mama Edha: w_support = 0.157, w_deny = 0.396, w_query = 0.399, w_comment = 0.048
UPV: w_support = 0.200, w_deny = 0.350, w_query = 0.350, w_comment = 0.100
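For contrast, a sketch of how distribution-only weights of this kind can arise (inverse class frequency, normalised to sum to one); this is an illustrative scheme, not necessarily the exact derivation used by Mama Edha or UPV.

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(y, labels):
    """Weight each class by the inverse of its frequency in y,
    normalised so the weights sum to 1."""
    counts = Counter(y)
    inv = np.array([1.0 / counts[label] for label in labels])
    return inv / inv.sum()

# With RumourEval-style data (comment dominating, deny/query rare),
# this gives deny and query high weights and comment a low one,
# regardless of which classes actually matter for veracity.
```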
75. CONCLUSION
➢ Evaluation needs to take into account the task purposes:
• Rumour Stance Classification → improve veracity classification / rumour analysis
• Most informative classes: support and deny
• Highly imbalanced four-class classification problem
➢ Recall-based metrics → higher priority to minority classes
➢ Weighted metrics → higher priority to most important classes
Ideal evaluation: takes into account multiple metrics!
76. THANK YOU FOR YOUR ATTENTION!
www.weverify.eu
@WeVerify
Thanks to Yue Li for many of the slides (and the work behind them!)
Collaboration with Kalina Bontcheva and Diego Silva