
Weakly Supervised Learning for Fake News Detection on Twitter


The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate with respect to the new classification target (not every tweet from an untrustworthy source is fake news, and vice versa), we show that, despite this noisy dataset, it is possible to detect fake news with an F1 score of up to 0.9.


Weakly Supervised Learning for Fake News Detection on Twitter

  1. 08/30/18 Stefan Helmstetter, Heiko Paulheim — Weakly Supervised Learning for Fake News Detection on Twitter
  2. Motivation • Social media... – ...are an increasingly important source of information – ...can be manipulated easily
  3. Motivation • Fake news detection: a straightforward machine learning problem – Simplest case: two classes – Researched for several decades – Used, e.g., for spam filtering
  4. Motivation • Challenge – The more training data, the better – Mass labeling of data is difficult (e.g., requires investigations) ● cf. spam filtering: labeling can be done “on the fly” by laymen
  5. Approach • We cannot easily tell a fake news tweet from a real one • But we have information on fake and trustworthy sources
  6. Approach • Naive mass labeling: – every tweet from a fake source is a fake tweet – every tweet from a trustworthy source is a true tweet • Our collection: – 65 fake news sources – 47 trustworthy news sources – 401k tweets ● 111k fake news ● 291k real news
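The naive mass-labeling rule on this slide can be sketched in a few lines. The account names and tweets below are hypothetical placeholders for illustration, not the paper's actual source lists:

```python
# Sketch of naive mass labeling: every tweet inherits the trust label
# of its source account. FAKE_SOURCES / TRUSTED_SOURCES stand in for
# the paper's 65 fake and 47 trustworthy source accounts.
FAKE_SOURCES = {"hoaxnewsdaily", "totallyrealnews"}
TRUSTED_SOURCES = {"reuters", "bbcworld"}

def weak_label(tweet):
    """Label a tweet 'fake' or 'real' purely by its source account."""
    if tweet["source"] in FAKE_SOURCES:
        return "fake"
    if tweet["source"] in TRUSTED_SOURCES:
        return "real"
    return None  # unknown source: excluded from the training corpus

tweets = [
    {"source": "reuters", "text": "Markets close higher."},
    {"source": "hoaxnewsdaily", "text": "Aliens endorse candidate."},
]
labels = [weak_label(t) for t in tweets]
# labels == ["real", "fake"]
```

Note that the label describes the source, not the individual tweet — which is exactly the caveat discussed on slide 9.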
  7. Approach • Skew towards 2017 – time of crawling, limitations of the Twitter API – more real than fake news (intentionally!)
  8. Approach • Naive mass labeling: – every tweet from a fake source is a fake tweet – every tweet from a trustworthy source is a true tweet
  9. Approach • Mind the classification task – if we train a classifier, we learn to identify tweets from untrustworthy sources – not necessarily the same as fake news tweets • Assumption – the training dataset is large – non-fake news items are also covered by trustworthy sources – trustworthy copies outnumber fake news ones ● incidental skew in the dataset
  10. Approach • Leaving that caveat aside, we use – 53 user-level features, e.g., no. of followers, tweet frequency – 69 tweet-level features, e.g., length, no. of hashtags, no. of URLs – text features as BoW (60k features) or a doc2vec model (300 features) – topic features: 10–200 topics created using LDA – eight features using sentiment and polarity analysis • Classifiers – Naive Bayes, decision trees, SVM, neural net (1 hidden layer), random forest, xgboost – voting and weighted voting of the above
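A minimal sketch of the (weighted) voting combination over the base classifiers, assuming each produces a binary prediction (1 = fake, 0 = real). The predictions and weights below are made up for illustration; the paper's base models are those listed above:

```python
# Weighted majority voting over per-classifier binary predictions.
# With no weights given, this degrades to plain majority voting.
def weighted_vote(predictions, weights=None):
    """Combine binary predictions (1 = fake, 0 = real) by weighted majority."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain (unweighted) voting
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score > sum(weights) / 2 else 0

# Three of five equally weighted classifiers say "fake":
weighted_vote([1, 1, 1, 0, 0])                            # -> 1
# Weighting (e.g., by per-classifier validation score) can overrule
# the raw majority:
weighted_vote([1, 1, 1, 0, 0], weights=[1, 1, 1, 4, 4])   # -> 0
```

In practice the weights would be derived from each classifier's performance on held-out data; the values above are arbitrary.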
  11. Approach • Optimal selection of features per classifier
  12. Evaluation • Setting 1 – Cross-validation on the training set – Remember: actual target is trustworthiness of the source • Setting 2 – Validation against a gold standard – Target here: trustworthiness of the tweet • Two variants each – with and without user-level features – idea: judging tweets from known and unknown sources
  13. Evaluation • Setting 1 – Cross-validation on the training set – Remember: actual target is trustworthiness of the source • Results – up to .78 without user-level features – up to .94 with user-level features – xgboost and voting work best
  14. Evaluation • Setting 2 – Validation against a gold standard – Target here: trustworthiness of the tweet • Results – up to .77 without user-level features – up to .89 with user-level features – neural net works best • Observation: – results are not much worse than for Setting 1 – i.e., source labels seem to be a suitable proxy for tweet labels
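The reported numbers are F1 scores. For reference, this is the standard computation over the positive ("fake") class — a textbook definition, not code from the paper; the toy labels below are made up:

```python
# Standard F1 over the positive class: harmonic mean of precision
# (how many predicted fakes are truly fake) and recall (how many
# true fakes were caught).
def f1_score(y_true, y_pred, positive="fake"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "fake"]
round(f1_score(y_true, y_pred), 2)   # -> 0.5
```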
  15. Evaluation • Feature weighting by xgboost: – the most important features are user-level features
  16. Evaluation • Without user-level features – surface-level features are strong – content/topic features are not too important
  17. Conclusion • Fake news detection is a straightforward classification task – but training data is scarce • Inexact mass labeling can be done – by using source instead of tweet labels – collection of a large-scale training set is easy – automatic re-collection is possible (e.g., for new topics, changed Twitter behavior) • Results for tweet labeling – not much worse than for source labeling
  18. Weakly Supervised Learning for Fake News Detection on Twitter — Stefan Helmstetter, Heiko Paulheim
