The Grammar of Truth and Lies

Peter Bleackley

Using NLP to detect Fake News. Slides from a talk given at a PyData London meetup on 4th June 2019

The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk

The Problem
● “A lie can run around the world before the truth can get its
boots on.”
● Fake News spreads six times faster than real news on Twitter
● The spread of true and false news online, Soroush Vosoughi,
Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp.
1146-1151, 9th March 2018
● https://science.sciencemag.org/content/359/6380/1146

The Data
● “Getting Real about Fake News” Kaggle Dataset
● https://www.kaggle.com/mrisdal/fake-news
● 12999 articles from sites flagged as unreliable by the BS Detector
Chrome extension
● Reuters-21578, Distribution 1.0 Corpus
● 10000 articles from Reuters Newswire, 1987
● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
● Available from NLTK

Don’t Use Vocabulary!
● Potential for bias, especially as corpora are from different
time periods
● Difficult to generalise
● Could be reverse-engineered by a bad actor

Sentence structure features
● Perform Part of Speech tagging with TextBlob
● Concatenate tags to form a feature for each sentence
● “Pete Bleackley is a self-employed data scientist and
computational linguist.”
● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
● Very large, very sparse feature set
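
The feature construction above can be sketched as follows. This is a minimal illustration, not the code from the talk: the function names are hypothetical, and the TextBlob import is deferred so that the tag-joining step can be tried without TextBlob or its tagger corpora installed.

```python
from typing import Iterable, List, Tuple


def structure_feature(tagged_sentence: Iterable[Tuple[str, str]]) -> str:
    """Join the POS tags of one sentence into a single feature string."""
    return "_".join(tag for _word, tag in tagged_sentence)


def document_features(text: str) -> List[str]:
    """Return one sentence-structure feature per sentence of `text`.

    Assumes TextBlob is installed together with the corpora its tagger
    needs (for example via `python -m textblob.download_corpora`).
    """
    from textblob import TextBlob  # deferred: optional dependency

    return [structure_feature(s.tags) for s in TextBlob(text).sentences]


# (word, tag) pairs matching the tag sequence shown on the slide.
example = [
    ("Pete", "NNP"), ("Bleackley", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("self-employed", "JJ"), ("data", "NNS"), ("scientist", "NN"),
    ("and", "CC"), ("computational", "JJ"), ("linguist", "NN"),
]
print(structure_feature(example))  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```

Because each distinct tag sequence becomes its own feature, long sentences almost never repeat, which is where the sparsity comes from.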

First model
● Train LSI model (Gensim) on sentence structure features
from whole dataset
● 70/30 split between training and test data
● Sentence structure features => LSI => Logistic Regression
(scikit-learn)
● https://www.kaggle.com/petebleackley/the-grammar-of-truth-an

Performance
● Precision 61%
● Recall 96%
● Accuracy 70%
● Matthews Correlation Coefficient 50%
● Precision measures our ability to catch the bad guys.
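
For reference, the four scores can be computed from confusion-matrix counts as sketched below. The counts in the example are made up for illustration and are not results from the talk; which class is treated as "positive" is not stated on the slides, so the function leaves that choice to the caller.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, accuracy and Matthews correlation coefficient."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": _ratio(tp, tp + fp),   # how many flagged items are right
        "recall": _ratio(tp, tp + fn),      # how many positives are found
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "matthews": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
scores = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(scores)
```

Unlike accuracy, the Matthews coefficient uses all four cells of the confusion matrix, so it stays informative when precision and recall diverge as they do here.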

Sentiment analysis
● Used VADER model in NLTK
● Produces Positive, Negative and Neutral scores for each
sentence
● Sum over document
● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
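
The "sum over document" step might look like the sketch below. The function names and the stand-in scorer are hypothetical; `vader_scorer` assumes NLTK and its `vader_lexicon` resource are installed, and is kept separate so that the summation itself runs without them.

```python
from typing import Callable, Dict, Iterable

Scorer = Callable[[str], Dict[str, float]]


def document_sentiment(sentences: Iterable[str], score: Scorer) -> Dict[str, float]:
    """Sum per-sentence positive, negative and neutral scores over a document."""
    totals = {"pos": 0.0, "neg": 0.0, "neu": 0.0}
    for sentence in sentences:
        sentence_scores = score(sentence)
        for key in totals:
            totals[key] += sentence_scores[key]
    return totals


def vader_scorer() -> Scorer:
    """Return NLTK's VADER scoring function (needs the 'vader_lexicon' resource)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # deferred import

    return SentimentIntensityAnalyzer().polarity_scores


def toy_scorer(sentence: str) -> Dict[str, float]:
    """Stand-in scorer with fixed values, used only to exercise the summation."""
    return {"pos": 0.5, "neg": 0.25, "neu": 0.25, "compound": 0.0}


totals = document_sentiment(["First sentence.", "Second sentence."], toy_scorer)
print(totals)  # {'pos': 1.0, 'neg': 0.5, 'neu': 0.5}
```

With the real scorer, `document_sentiment(sentences, vader_scorer())` would give three dense features per document.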

Sentence Structure + Sentiments
● Precision 74%
● Recall 90%
● Accuracy 81%
● Matthews 64%
● Slight improvement, but it looks like sentiment is doing
most of the work

Random Forests
                    Precision  Recall  Accuracy  Matthews
Sentence structure  83%        89%     86%       71%
Sentiments          75%        75%     78%       76%
Both                84%        89%     87%       76%

Understanding the models
● Out of 333264 sentence structure features, 298332 occur
only in a single document
● Out of 23000 documents, 11276 have no features in
common with others
● We need some denser features
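
Counts of this kind can be obtained with a document-frequency tally, sketched here on toy data. The function name and the example documents are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from typing import Sequence, Tuple


def sparsity_report(docs: Sequence[Sequence[str]]) -> Tuple[int, int, int]:
    """Return (distinct features, features in only one document, isolated documents).

    A document is "isolated" when none of its features occurs in any other
    document. An empty document counts as isolated.
    """
    # Count each feature once per document, however often it repeats inside it.
    doc_freq = Counter(feature for doc in docs for feature in set(doc))
    single_doc_features = sum(1 for n in doc_freq.values() if n == 1)
    isolated_docs = sum(1 for doc in docs if all(doc_freq[f] == 1 for f in doc))
    return len(doc_freq), single_doc_features, isolated_docs


toy_docs = [["A", "B"], ["B", "C"], ["D"], ["E", "E"]]
print(sparsity_report(toy_docs))  # (5, 4, 2)
```

An isolated document contributes nothing that a model can compare against the rest of the corpus, which is the motivation for looking for denser features.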

Function words
● Pronouns, prepositions, conjunctions, auxiliaries
● Present in every document – most common words
● Usually discarded as “stopwords”...
● ...but useful for stylometric analysis, e.g. document
attribution
● NLTK stopwords corpus
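
A small sketch of extracting function-word counts from a document. The word list below is a short illustrative subset and the tokenising regular expression is an assumption; `nltk_function_words` shows how the NLTK stopwords corpus could be substituted, assuming NLTK and its `stopwords` resource are installed.

```python
import re
from collections import Counter
from typing import Collection

# Illustrative subset only; the talk uses the NLTK stopwords corpus.
FUNCTION_WORDS = frozenset({
    "i", "you", "he", "she", "it", "we", "they",       # pronouns
    "of", "in", "on", "to", "with", "by",              # prepositions
    "and", "but", "or",                                # conjunctions
    "is", "are", "was", "have", "has", "will", "can",  # auxiliaries
    "the", "a", "an",                                  # articles
})


def function_word_counts(text: str,
                         vocabulary: Collection[str] = FUNCTION_WORDS) -> Counter:
    """Count occurrences of each function word in `text`, ignoring case."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token in vocabulary)


def nltk_function_words() -> frozenset:
    """Load the English list from the NLTK stopwords corpus."""
    from nltk.corpus import stopwords  # deferred: optional dependency

    return frozenset(stopwords.words("english"))


counts = function_word_counts("The cat and the dog are on the mat.")
print(counts)
```

Since nearly every document contains most of these words, the resulting features are dense, in contrast to the sentence-structure features.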

New model
● Sentence structure features + function words => LSI =>
Logistic Regression
● Precision 90%
● Recall 96%
● Accuracy 93%
● Matthews 87%

What have we learnt?
● Grammatical and stylistic features can be used to
distinguish between real and fake news
● Good choice of features is the key to success
● Will this generalise to other sources?

See also...
● The (mis)informed citizen
● Alan Turing Institute project
● https://www.turing.ac.uk/research/research-projects/misinforme
