
The Grammar of Truth and Lies


  1. The Grammar of Truth and Lies
  Using NLP to detect Fake News
  Peter J Bleackley, Playful Technology Limited
  peter.bleackley@playfultechnology.co.uk
  2. The Problem
  ● “A lie can run around the world before the truth can get its boots on.”
  ● Fake News spreads six times faster than real news on Twitter
  ● “The spread of true and false news online”, Soroush Vosoughi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146–1151, 9 March 2018
  ● https://science.sciencemag.org/content/359/6380/1146
  3. The Data
  ● “Getting Real about Fake News” Kaggle dataset
  ● https://www.kaggle.com/mrisdal/fake-news
  ● 12,999 articles from sites flagged as unreliable by the BS Detector Chrome extension
  ● Reuters-21578, Distribution 1.0 corpus
  ● 10,000 articles from the Reuters newswire, 1987
  ● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
  ● Available from NLTK
  4. Don’t Use Vocabulary!
  ● Potential for bias, especially as the corpora come from different time periods
  ● Difficult to generalise
  ● Could be reverse-engineered by a bad actor
  5. Sentence structure features
  ● Perform Part-of-Speech tagging with TextBlob
  ● Concatenate tags to form a feature for each sentence
  ● “Pete Bleackley is a self-employed data scientist and computational linguist.”
  ● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
  ● Very large, very sparse feature set
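The tag-concatenation step can be sketched without running a tagger; the (word, tag) pairs below are hard-coded stand-ins for what TextBlob's `TextBlob(sentence).tags` returns for the example sentence on the slide:

```python
# Build one sentence-structure feature from a POS-tagged sentence.
# The pairs below stand in for TextBlob(sentence).tags output.
tagged = [
    ("Pete", "NNP"), ("Bleackley", "NNP"), ("is", "VBZ"),
    ("a", "DT"), ("self-employed", "JJ"), ("data", "NNS"),
    ("scientist", "NN"), ("and", "CC"),
    ("computational", "JJ"), ("linguist", "NN"),
]

def sentence_structure_feature(tags):
    """Concatenate POS tags into a single feature string for the sentence."""
    return "_".join(tag for _, tag in tags)

feature = sentence_structure_feature(tagged)
print(feature)  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```

Because every distinct tag sequence becomes its own feature, even small grammatical variations produce new features, which is why the resulting feature set is very large and very sparse.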
  6. First model
  ● Train LSI model (Gensim) on sentence structure features from the whole dataset
  ● 70/30 split between training and test data
  ● Sentence structure features => LSI => Logistic Regression (scikit-learn)
  ● https://www.kaggle.com/petebleackley/the-grammar-of-truth-an
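A minimal sketch of that pipeline, with two caveats: the talk used Gensim's `LsiModel`, for which `TruncatedSVD` over a count matrix is the scikit-learn analogue, and the tiny corpus and labels here are synthetic stand-ins for the real data:

```python
# Sketch: sentence-structure "tokens" -> LSI -> logistic regression.
# Each whitespace-separated "word" is one sentence's concatenated tag string.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "NNP_VBZ_DT_NN NNP_VBZ_JJ",
    "PRP_VBP_DT_JJ_NN NNP_VBZ_JJ",
    "NNP_VBZ_DT_NN PRP_VBP_NN",
    "PRP_VBP_DT_JJ_NN PRP_VBP_NN",
]
labels = [0, 0, 1, 1]  # synthetic: 0 = real, 1 = fake

model = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),  # keep whole tag strings as tokens
    TruncatedSVD(n_components=2),           # LSI-style dimensionality reduction
    LogisticRegression(),
)
model.fit(docs, labels)
predictions = model.predict(docs)
```

In the real experiment the model is of course fit on the 70% training split and evaluated on the held-out 30%.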
  7. Performance
  ● Precision 61%
  ● Recall 96%
  ● Accuracy 70%
  ● Matthews Correlation Coefficient 50%
  ● Recall measures our ability to catch the bad guys; precision measures how often an article we flag as fake really is fake.
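All four figures follow from the confusion matrix. A quick sketch (the counts in the usage example are made up for illustration, not taken from the experiment):

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and Matthews correlation
    from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return precision, recall, accuracy, mcc

# Illustrative counts only:
precision, recall, accuracy, mcc = classification_metrics(
    tp=96, fp=61, fn=4, tn=39
)
```

The Matthews coefficient is the most honest single summary here, since it penalises a model that buys high recall with many false positives.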
  8. Sentiment analysis
  ● Used the VADER model in NLTK
  ● Produces Positive, Negative and Neutral scores for each sentence
  ● Sum over the document
  ● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
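The document-level feature is just the per-sentence scores summed. The dicts below are stand-ins for the output of NLTK's `SentimentIntensityAnalyzer().polarity_scores(sentence)`:

```python
# Sum per-sentence VADER scores into document-level sentiment features.
# The dicts stand in for SentimentIntensityAnalyzer().polarity_scores output.
sentence_scores = [
    {"pos": 0.4, "neg": 0.1, "neu": 0.5},
    {"pos": 0.0, "neg": 0.6, "neu": 0.4},
    {"pos": 0.2, "neg": 0.2, "neu": 0.6},
]

def document_sentiment(scores):
    """Sum positive, negative and neutral scores over all sentences."""
    totals = {"pos": 0.0, "neg": 0.0, "neu": 0.0}
    for s in scores:
        for key in totals:
            totals[key] += s[key]
    return totals

doc_features = document_sentiment(sentence_scores)
```

Summing rather than averaging means document length contributes to the feature, which may itself be informative.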
  9. Sentence Structure + Sentiments
  ● Precision 74%
  ● Recall 90%
  ● Accuracy 81%
  ● Matthews 64%
  ● Slight improvement, but it looks like sentiment is doing most of the work
  10. Random Forests

                        Precision   Recall   Accuracy   Matthews
  Sentence structure       83%        89%       86%        71%
  Sentiments               75%        75%       78%        76%
  Both                     84%        89%       87%        76%
  11. Understanding the models
  ● Out of 333,264 sentence structure features, 298,332 occur only in a single document
  ● Out of 23,000 documents, 11,276 have no features in common with any other
  ● We need some denser features
  12. Function words
  ● Pronouns, prepositions, conjunctions, auxiliaries
  ● Present in every document – the most common words
  ● Usually discarded as “stopwords”...
  ● ...but useful for stylometric analysis, e.g. authorship attribution
  ● NLTK stopwords corpus
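A function-word feature vector is dense by construction: every document contains at least some of these words. The short word list below stands in for NLTK's stopwords corpus (`nltk.corpus.stopwords.words("english")`):

```python
# Relative frequency of each function word: a dense stylometric feature set.
# The list here is a small stand-in for NLTK's English stopwords corpus.
FUNCTION_WORDS = ["the", "a", "and", "of", "in", "is", "it", "that", "to", "he"]

def function_word_features(text):
    """Map each function word to its relative frequency in the document."""
    tokens = text.lower().split()
    return {w: tokens.count(w) / len(tokens) for w in FUNCTION_WORDS}

features = function_word_features(
    "The cat sat in the hat and it is a good cat"
)
```

Because the same small vocabulary appears in every document, these features avoid the single-document-only sparsity problem that slide 11 identified in the sentence-structure features.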
  13. New model
  ● Sentence structure features + function words => LSI => Logistic Regression
  ● Precision 90%
  ● Recall 96%
  ● Accuracy 93%
  ● Matthews 87%
  14. What have we learnt?
  ● Grammatical and stylistic features can be used to distinguish between real and fake news
  ● Good choice of features is the key to success
  ● Will this generalise to other sources?
  15. See also...
  ● The (mis)informed citizen
  ● Alan Turing Institute project
  ● https://www.turing.ac.uk/research/research-projects/misinforme
