Weakly Supervised Learning for Fake News Detection on Twitter

08/30/18 Stefan Helmstetter, Heiko Paulheim

Motivation
• Social media...
– ...are an increasingly important source of information
– ...can be manipulated easily

Motivation
• Fake news detection: a straightforward machine learning problem
– Simplest case: two classes
– Researched for several decades
– Used, e.g., for spam filtering

Motivation
• Challenge
– The more training data, the better
– Mass labeling data is difficult (e.g., requires investigations)
● cf. spam filtering: labeling can be done “on the fly” by laymen

Approach
• We cannot easily tell a fake news tweet from a real one
• But we have information on fake and trustworthy sources

Approach
• Naive mass labeling:
– every tweet from a fake source is a fake tweet
– every tweet from a trustworthy source is a true tweet
• Our collection:
– 65 fake news sources
– 47 trustworthy news sources
– 401k tweets
● 111k fake news
● 291k real news
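
The naive labeling rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the account names, the `Tweet` type, and the helpers `weak_label` and `label_all` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical account names, for illustration only.
FAKE_SOURCES: Set[str] = {"fake_outlet_a", "fake_outlet_b"}
TRUSTED_SOURCES: Set[str] = {"trusted_outlet_a"}


@dataclass
class Tweet:
    author: str
    text: str


def weak_label(tweet: Tweet) -> Optional[str]:
    """Label a tweet by its source alone; the text is never inspected."""
    if tweet.author in FAKE_SOURCES:
        return "fake"
    if tweet.author in TRUSTED_SOURCES:
        return "real"
    return None  # source is on neither list: leave the tweet unlabeled


def label_all(tweets: Iterable[Tweet]) -> List[Tuple[Tweet, str]]:
    """Keep only tweets from listed sources, paired with their weak label."""
    labeled = []
    for tweet in tweets:
        label = weak_label(tweet)
        if label is not None:
            labeled.append((tweet, label))
    return labeled
```

Because the label depends only on the author, a true story posted by a listed fake source is still labeled "fake"; this is the label noise that makes the supervision weak.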

Approach
• Skew towards 2017
– time of crawling, limitations of Twitter API
– more real than fake news (intentionally!)

Approach
• Mind the classification task
– if we train a classifier, we learn to identify tweets from untrustworthy sources
– not necessarily the same as fake news tweets
• Assumption
– the training dataset is large
– non-fake news are also covered by trustworthy sources
– trustworthy copies outnumber fake news ones
● incidental skew in the dataset

Approach
• Leaving that caveat aside, we use
– 53 user-level features, e.g., no. of followers, tweet frequency
– 69 tweet-level features, e.g., length, no. of hashtags, no. of URLs
– text features as BoW (60k features) or doc2vec model (300 features)
– topic features: 10-200 topics created using LDA
– eight features using sentiment and polarity analysis
• Classifiers
– Naive Bayes, Decision Trees, SVM, Neural Net (1 hidden layer), Random Forest, xgboost
– Voting and weighted voting of the above
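
Two pieces of the list above lend themselves to a short sketch: the surface-level tweet features and the weighted vote. The slides do not say how hashtags and URLs were counted, how the vote weights were chosen, or whether voting was hard or soft. The regular expressions, the function names, and the soft-voting scheme (a weighted average of per-classifier fake probabilities) are therefore assumptions.

```python
import re
from typing import Dict, List

# Simple patterns, assumed for illustration; real tweet parsing is messier.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def tweet_level_features(text: str) -> Dict[str, int]:
    """Compute three of the surface features named on the slide."""
    return {
        "length": len(text),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_urls": len(URL_RE.findall(text)),
    }


def weighted_vote(probabilities: List[float], weights: List[float]) -> float:
    """Weighted average of per-classifier P(fake) estimates (soft voting)."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total
```

With equal weights, `weighted_vote` reduces to plain averaging, which corresponds to the unweighted voting variant.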

Approach
• Optimal selection of features per classifier

Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Two variants each
– with and without user level features
– idea: judging tweets from known and unknown sources
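
The two variants and the cross-validation split can be sketched as follows. This is an illustration under assumptions, not the authors' setup: the slides give neither the number of folds nor a feature naming scheme, so the `user_` prefix and both helper functions are hypothetical. The folds here are contiguous and unshuffled; a real experiment would normally shuffle or stratify first.

```python
from typing import Dict, Iterator, List, Tuple


def drop_user_features(row: Dict[str, float], prefix: str = "user_") -> Dict[str, float]:
    """Return a copy of a feature row without the user-level features."""
    return {name: value for name, value in row.items() if not name.startswith(prefix)}


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds over n items."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Running the same folds once on the full rows and once on rows passed through `drop_user_features` gives the "known source" and "unknown source" variants.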

Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Results
– up to .78 without user-level features
– up to .94 with user-level features
– xgboost and voting work best

Evaluation
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Results
– up to .77 without user-level features
– up to .89 with user-level features
– neural net works best
• Observation:
– results are not much worse than for setting 1
– i.e.: source labels seem to be a suitable proxy for tweet labels
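
Scoring predictions against a hand-labeled gold standard can be sketched as below. The slides do not name the metric behind the reported scores; this sketch assumes the F1 score on the fake class, and the function name and label strings are hypothetical.

```python
from typing import Sequence


def f1_score(gold: Sequence[str], predicted: Sequence[str], positive: str = "fake") -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In Setting 1 the `gold` labels would be the source-derived weak labels; in Setting 2 they would be the manually verified tweet labels, while the classifier is still trained on the weak labels.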

Evaluation
• Feature weighting by xgboost:
– most important features are user level features

Evaluation
• Without user level features
– surface level features are strong
– content/topics are not too important

Conclusion
• Fake news detection is a straightforward classification task
– but training data is scarce
• Inexact mass-labeling can be done
– by using source instead of tweet labels
– collection of large-scale training data is easy
– automatic re-collection is possible
(e.g., for new topics, changed Twitter behavior)
• Results for tweet labeling
– not much worse than for source labeling
