
Techniques for Automating Quality Assessment of Context-specific Content on Social Media Services


Online Social Networks have become a cornerstone of the Web 2.0 era. Internet users around the world use Online Social Networks as primary sources to consume news, updates, and information about events around the world. However, given the enormous volume and velocity of content generated and shared on these networks, it is impossible to moderate all of it manually. This phenomenon enables hostile entities to generate and promote various types of poor quality content (including but not limited to scams, fake news, false information, rumors, and untrustworthy or unreliable information) and pollute the information stream for monetary gain, to hinder the user experience, or to compromise system reputation. We aim to address the challenge of automatically identifying poor quality content on Online Social Networks. We focus our work on Facebook, currently the biggest Online Social Network.


  1. 1. Techniques for Automating Quality Assessment of Context-specific Content on Social Media Services Prateek Dewan PhD Thesis Defense November 14, 2017 prateekd@iiitd.ac.in Committee members Dr. Alessandra Sala Dr. Sanasam Ranbir Singh Dr. Aditya Telang Dr. Ponnurangam Kumaraguru (Advisor)
  2. 2. Who am I? • Data Scientist at Apple • PhD student since February, 2012 – IIIT-Delhi • Masters (2010 – 2012), IIIT-Delhi • Collaborations • IBM IRL (Delhi and Bengaluru), Symantec Research Labs (Pune), Dublin City University (Ireland), UFMG (Brazil) • Worked in Privacy and Security on Online Social Media • Research interests • Applied Machine Learning • Natural Language Processing • Web Security 2
  3. 3. Online Social Media: The Big Picture 3
  4. 4. “With great power comes great responsibility” 4
  5. 5. Thesis statement • To design and evaluate automated techniques for quality assessment of context-specific content on social media services in real time • Focus: Facebook • Biggest Online Social Media service • 2.01 billion monthly active users • 2 out of every 7 people on the planet use Facebook • Most sought-after OSN for news 5
  6. 6. Proposed Solution 6 Identify → Characterize → Model → Prototype → Deploy → Evaluate
  7. 7. Facebook Inspector: Demo 7
  8. 8. Scope • Establishing the definition of poor quality content • What content counts as poor quality? • Untrustworthy • Child unsafe • Misleading information • Hoaxes, scams, clickbait • Violence, hate speech • Definition conforming to • Facebook’s community standards 1 • Definitions of page spam 8 1 https://www.facebook.com/communitystandards
  9. 9. Approach •Poor quality posts published on Facebook •Facebook pages publishing poor quality content •Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 9
  10. 10. Approach • Poor quality posts published on Facebook •Facebook pages publishing poor quality content •Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 10
  11. 11. Dataset 11
Data Type | Quantity
Unique posts | 4,465,371
Unique entities | 3,373,953
Unique users | 2,983,707
Unique pages | 390,246
Unique URLs | 480,407
Unique posts with one or more URLs | 1,222,137
Unique entities posting URLs | 856,758
Unique posts with one or more malicious URLs | 11,217
Unique entities posting one or more malicious URLs | 7,962
Unique malicious URLs | 4,622
  12. 12. Establishing Ground Truth • Extracted posts containing one or more URLs • 1.2 million out of 4.4 million posts in total • 480k unique URLs • Used six URL blacklists • Google Safebrowsing (malware / phishing) • VirusTotal (spam / malware / phishing) • Surbl (spam) • Web of Trust (trust score)* • SpamHaus (spam) • Phishtank (phishing) • Posts containing one or more blacklisted URLs were marked as poor quality posts (11,217 in all) 12
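A minimal sketch of this labeling step, assuming the six blacklist lookups have already been aggregated into a local set; the post structure and field names are hypothetical stand-ins:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def label_posts(posts, blacklisted_urls):
    """Mark posts containing any blacklisted URL as poor quality.

    `posts` is assumed to be a list of dicts with a "message" field;
    `blacklisted_urls` is a pre-aggregated set of URLs flagged by any
    of the six blacklist services."""
    labeled = []
    for post in posts:
        urls = URL_PATTERN.findall(post.get("message", ""))
        poor_quality = any(url in blacklisted_urls for url in urls)
        labeled.append({**post, "poor_quality": poor_quality})
    return labeled
```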
  13. 13. Web of Trust 13 A URL is marked Malicious if its Reputation is Unsatisfactory / Poor / Very poor (less than 60) with High confidence (greater than 10), OR its Category is Negative.
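The rule above, encoded directly; the argument names are illustrative rather than the actual WOT API response format:

```python
def is_malicious_wot(reputation: int, confidence: int, category_negative: bool) -> bool:
    """Web of Trust rule from the slide: reputation below 60 with
    confidence above 10, OR a negative category, marks a URL malicious."""
    poor_reputation = reputation < 60 and confidence > 10
    return poor_reputation or category_negative
```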
  14. 14. Findings • Facebook’s current techniques do not suffice • 65% of all poor quality posts were still live on Facebook 4 (or more) months after being posted • Gathered likes from 52,169 unique users; comments from 8,784 unique users • Facebook’s partnership with Web of Trust? • 88% of all malicious URLs had poor reputation on WOT • No warning pages were shown 14
  15. 15. Platforms used to post 15
  16. 16. Distribution of poor quality posts 16 [Figure: distribution of poor quality posts across pages, users, and other entities]
  17. 17. Approach •Poor quality posts published on Facebook • Facebook pages publishing poor quality content •Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 17
  18. 18. Facebook Pages posting poor quality content 18 Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. Prateek Dewan, Shrey Bagroy, and Ponnurangam Kumaraguru (Short paper). Published at IEEE/ACM Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, USA. 2016.
  19. 19. Ground Truth extraction: Facebook pages 19 From 4.4 million posts, 10,341 malicious posts (from 1,557 pages and 5,868 users) were identified; pages with 1 or more malicious URLs in their most recent 100 posts yielded 627 malicious pages.
  20. 20. Dataset of pages posting poor quality content 20
WOT response | No. of pages | No. of posts
Child unsafe | 387 | 10,891
Untrustworthy | 317 | 8,057
Questionable | 312 | 8,859
Negative | 266 | 5,863
Adult content | 162 | 3,290
Spam | 124 | 4,985
Phishing | 39 | 495
Total | 627 (31) | 20,999
• Numbers in parentheses denote Verified pages
  21. 21. Content analysis (page names) 21 • Sentence Tokenization → Word Tokenization → Case normalization → Stemming → Stopword removal • N-gram analysis (n = 1, 2, 3) • Politically polarized entities amongst poor quality pages • British National Party (BNP), The Tea Party, English Defense League, American Defense League, American Conservatives, Geert Wilders supporters…
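One possible rendering of this preprocessing pipeline in Python with NLTK; the slide does not name a library, so the specific tokenizers and stemmer here are assumptions:

```python
# requires: nltk.download('punkt'); nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

def preprocess(text: str, n: int = 2):
    """Tokenize, normalize, stem, remove stopwords, then emit n-grams."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = []
    for sentence in nltk.sent_tokenize(text):          # sentence tokenization
        for word in nltk.word_tokenize(sentence):      # word tokenization
            word = word.lower()                        # case normalization
            if word.isalpha() and word not in stop:    # stopword removal
                tokens.append(stemmer.stem(word))      # stemming
    return list(ngrams(tokens, n))                     # n-gram analysis
```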
  22. 22. Network analysis 22 • Collusive behavior within pages posting poor quality content [Figures: shares, likes, and comments networks]
  23. 23. Temporal activity • Activity ratio: (no. of time units active during the event) / (total no. of time units during the complete observation period) • Malicious pages are more active than benign pages 23
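A small sketch of this metric, assuming days as the time unit (the slide does not fix the unit); `post_times` is a hypothetical list of post datetimes for one page:

```python
from datetime import datetime

def activity_ratio(post_times, start: datetime, end: datetime) -> float:
    """Fraction of days in [start, end] on which the page posted."""
    total_days = (end - start).days + 1
    active_days = {t.date() for t in post_times if start <= t <= end}
    return len(active_days) / total_days
```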
  24. 24. Approach •Poor quality posts published on Facebook •Facebook pages publishing poor quality content • Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 24
  25. 25. Why?: The Human Brain - Images versus text • Human brain processes images 60,000 times faster than text 25
  26. 26. Are we doing enough to "understand" images? • Most research to analyze social media content focuses on text • Topic modelling • Sentiment analysis • Does it capture everything? • Studies related to images are limited in scale • A few hundred images manually annotated and analyzed • What can be done? • Automated techniques for image summarization; Deep Learning and Convolutional Neural Networks (CNNs) to scale across large numbers of images • Domain transfer learning • Optical Character Recognition 26
  27. 27. Methodology • Images posted on Facebook during the Paris Attacks, November 2015 • 3-tier pipeline for extracting high level image descriptors from images 27
Unique posts | 131,548
Unique users | 106,275
Posts with images | 75,277
Total images extracted | 57,748
Total unique images | 15,123
Pipeline: Tier 1: Visual Themes — Inception v3; Tier 2: Image Sentiment — DeCAF trained on SentiBank; Tier 3: Text embedded in images — Optical Character Recognition → Text Sentiment (LIWC) + Topics (TF); Manual calibration → human understandable descriptors
  28. 28. Tier I: Visual Themes • ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2012 • 1.2 million images, 1,000 categories • Winner: Google’s Inception-v3 (top-1 error: 17.2%) • 48-layer Deep Convolutional Neural Network 28
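A hedged sketch of Tier I labeling using a pretrained Inception-v3 as packaged in Keras; the exact model distribution used in the thesis is not specified here, so this is an equivalent reconstruction rather than the original code:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, decode_predictions, preprocess_input)
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights="imagenet")  # pretrained on ILSVRC (1,000 classes)

def label_image(path: str, top: int = 1):
    """Return the top-k ImageNet labels for one image."""
    img = image.load_img(path, target_size=(299, 299))  # Inception-v3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)
    return decode_predictions(preds, top=top)[0]  # [(wnid, label, score), ...]
```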
  29. 29. Tier I: Visual Themes contd. • All images labeled using Inception-v3 • Validation: • Random sample of 2,545 images annotated by 3 human annotators • 38.87% accuracy (majority voting) • Manual calibration • Renamed 7 out of the top 30 (most frequently occurring) labels • New accuracy: 51.3% • Why rename? Example: an image Inception-v3 labels "Bolo Tie" is actually "PeaceForParis" in our dataset 29
  30. 30. Tier II: Image Sentiment • Domain Transfer Learning • Inception-v3’s last layer retrained using SentiBank • SentiBank • Images collected from Flickr using Adjective Noun Pairs (ANPs) as search query • ANPs: happy dog, adorable baby, abandoned house • Weakly labeled dataset of images carrying emotion • Final training set – 133,108 negative + 305,100 positive sentiment images • 10-fold random subsampling • 69.8% accuracy 30
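A sketch of the domain transfer learning step described above: freeze the Inception-v3 convolutional base and retrain only a new final layer for binary (positive/negative) image sentiment. Layer choices and hyperparameters are illustrative, not the thesis configuration:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet", include_top=False)
for layer in base.layers:
    layer.trainable = False  # keep pretrained visual features fixed

x = GlobalAveragePooling2D()(base.output)
output = Dense(1, activation="sigmoid")(x)  # positive vs. negative sentiment
model = Model(inputs=base.input, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) on the weakly labeled SentiBank-style image set
```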
  31. 31. Tier III: Text embedded in images • Optical Character Recognition (OCR) • Tesseract OCR (Python) • 31,689 images had text • Manually extracted text from a random sample of 1,000 images • Compared with OCR output using string similarity metrics • ~62% accuracy 31 Tesseract output: No-one thinks that these people are representative of Christians. So why do so many think that these people are representative of Muslims?
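A sketch of the OCR and validation steps using the pytesseract wrapper and a simple string similarity ratio; the slide names Tesseract OCR (Python) and string similarity metrics without fixing a metric, so SequenceMatcher is an assumption:

```python
import difflib

import pytesseract
from PIL import Image

def ocr_text(path: str) -> str:
    """Extract text embedded in an image via Tesseract."""
    return pytesseract.image_to_string(Image.open(path))

def similarity(ocr_output: str, manual_text: str) -> float:
    """Ratio in [0, 1]; one of several possible string similarity metrics."""
    return difflib.SequenceMatcher(
        None, ocr_output.lower(), manual_text.lower()).ratio()
```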
  32. 32. Image and post text had different topics • Text embedded in images depicted more negative sentiment than user generated textual content 32 [Figure: word clouds of text embedded in images vs. user generated text]
  33. 33. Sentiment: Images versus text • Image sentiment was more positive than text sentiment 33 [Figure: sentiment value / volume fraction vs. no. of hours after the attacks, for post text, image text, and image volume fraction]
  34. 34. Poor quality image content popular on Facebook 34
  35. 35. Approach •Poor quality posts published on Facebook •Facebook pages publishing poor quality content •Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 35
  36. 36. Revisiting -- Establishing Ground Truth • Extracted posts containing one or more URLs • 1.2 million out of 4.4 million posts in total • 480k unique URLs • Used six URL blacklists • Google Safebrowsing (malware / phishing) • VirusTotal (spam / malware / phishing) • Surbl (spam) • Web of Trust (trust score)* • SpamHaus (spam) • Phishtank (phishing) • Posts containing one or more blacklisted URLs were marked as poor quality posts (11,217 in all) 36
  37. 37. Ground Truth extraction – Dataset II •What if a post does not have a URL? • 500 random Facebook posts x 17 events x 3 annotators • Definition of malicious post • “Any irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc. are categorized as spam. In terms of online social media, social spam is any content which is irrelevant / unrelated to the event under consideration, and / or aimed at spreading phishing, malware, advertisements, self promotion etc., including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, scams, fake information etc.” • Final dataset (all 3 annotators agreed on the same label) • 571 malicious posts • 3,841 benign posts 37
  38. 38. Feature set: Facebook Posts 38
Source | Features
Entity (9) | isPage, gender, pageCategory, hasUsername, usernameLength, nameLength, numWordsInName, locale, pageLikes
Textual content (18) | Presence of !, ?, !!, ??, emoticons (smile, frown), numWords, avgWordLength, numSentences, avgSentenceLength, numDictionaryWords, numHashtags, hashtagsPerWord, numCharacters, numURLs, URLsPerWord, numUppercaseCharacters, numWords / numUniqueWords
Metadata (10) | Application, Presence of facebook.com URL, Presence of apps.facebook.com URL, Presence of Facebook event URL, hasMessage, hasStory, hasPicture, hasLink, type, linkLength
Link (7) | http / https, numHyphens, numParameters, avgParameterLength, numSubdomains, pathLength
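For illustration, a few of the textual features above computed in Python; the definitions are simplified guesses at the slide's feature names, not the thesis's exact extraction code:

```python
def textual_features(message: str) -> dict:
    """Compute a handful of the 18 textual features for one post."""
    words = message.split()
    # Crude sentence split on terminal punctuation, for illustration only.
    sentences = [s for s in message.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "numWords": len(words),
        "avgWordLength": sum(len(w) for w in words) / max(len(words), 1),
        "numSentences": len(sentences),
        "numHashtags": sum(w.startswith("#") for w in words),
        "numCharacters": len(message),
        "numUppercaseCharacters": sum(c.isupper() for c in message),
        "hasExclamation": "!" in message,
    }
```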
  39. 39. Supervised learning: Dataset I 39 (accuracy, %, per feature set)
Classifier | Entity | Text | Metadata | Link | All | Top 7
Naïve Bayes | 54.79 | 52.41 | 71.60 | 69.25 | 56.15 | 74.72
Decision Tree | 63.02 | 64.78 | 80.56 | 82.34 | 84.67 | 86.17
Random Forest | 63.47 | 66.25 | 80.67 | 82.56 | 85.05 | 86.62
SVMrbf | 61.77 | 64.89 | 78.75 | 81.45 | 75.89 | 83.66
  40. 40. Supervised learning: Dataset II 40 (accuracy, %, per feature set)
Classifier | Entity | Text | Metadata | Link | All
Naïve Bayes | 51.67 | 51.60 | 72.45 | 77.58 | 67.63
Decision Tree | 51.66 | 73.16 | 79.01 | 81.04 | 76.17
Random Forest | 52.86 | 76.56 | 79.87 | 81.49 | 80.56
SVMrbf | 53.16 | 76.52 | 78.18 | 80.37 | 73.79
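A sketch of how these experiments could be reproduced with scikit-learn; the feature matrix here is stand-in data, and the thesis's exact training setup (folds, parameters) may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 44 features mirrors the size of the post feature set.
X, y = make_classification(n_samples=500, n_features=44, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2%}")
```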
  41. 41. Feature set: Facebook Pages 41
Page features | Likes, talking about, description length, bio, category, name, location, check-ins, …
Posting behavior | Daily activity ratio, post types, post likes, post comments, post shares, post engagement ratio, post language, average post length, no. of unique URLs in posts, no. of unique domains in posts, etc.
• Supervised learning • Page + post features • 55 features from page information • 41 features from posting behavior • Bag of words • Content generated by pages
  42. 42. Supervised learning: Page + post features 42
Classifier | Feature set | Accuracy (%) | ROC AUC
Naïve Bayesian | Page | 63.95 | 0.685
Naïve Bayesian | Post | 69.61 | 0.753
Naïve Bayesian | Page + Post | 70.81 | 0.776
Logistic Regression | Page | 67.38 | 0.745
Logistic Regression | Post | 76.55 | 0.825
Logistic Regression | Page + Post | 76.71 | 0.846
Decision Trees | Page | 65.55 | 0.668
Decision Trees | Post | 71.37 | 0.720
Decision Trees | Page + Post | 70.81 | 0.758
Random Forest | Page | 67.86 | 0.750
Random Forest | Post | 74.95 | 0.829
Random Forest | Page + Post | 75.27 | 0.837
  43. 43. Supervised learning: Bag of words 43
Classifier | Feature set | Accuracy (%) | ROC AUC
Naïve Bayesian | Unigrams | 68.27 | 0.682
Naïve Bayesian | Bigrams | 69.06 | 0.690
Naïve Bayesian | Trigrams | 69.77 | 0.697
Logistic Regression | Unigrams | 74.18 | 0.795
Logistic Regression | Bigrams | 74.34 | 0.791
Logistic Regression | Trigrams | 73.93 | 0.789
Decision Trees | Unigrams | 68.12 | 0.678
Decision Trees | Bigrams | 67.05 | 0.678
Decision Trees | Trigrams | 66.63 | 0.672
Random Forest | Unigrams | 72.26 | 0.794
Random Forest | Bigrams | 71.80 | 0.802
Random Forest | Trigrams | 72.18 | 0.794
Sparse NN | Unigrams | 81.74 | 0.862
Sparse NN | Bigrams | 84.12 | 0.872
Sparse NN | Trigrams | 84.13 | 0.900
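A minimal bag-of-words pipeline matching the n-gram setup above (n = 1, 2, 3); the texts and labels are stand-ins, and the classifier choice is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["free prizes click here now", "community meetup this saturday"]  # stand-in page content
labels = [1, 0]  # 1 = malicious, 0 = benign

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # unigrams, bigrams, and trigrams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["click here for free prizes"]))
```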
  44. 44. Model for real time detection • Model for pages depends on posts published by pages • Can’t be used for detection in real time • Two-fold supervised learning based model using post features • Utilizing class probabilities for decision making 44
  45. 45. Decision boundary 45 [Figure: decision boundary over class probabilities (0 to 1) output by Classifier 1 and Classifier 2; posts scored High by both classifiers are marked Malicious, Low by both are marked Benign]
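One possible encoding of the two-fold decision rule sketched in the figure; the thresholds and the handling of disagreements are illustrative assumptions, not the thesis's values:

```python
def classify(post_features, clf1, clf2, high: float = 0.8, low: float = 0.2) -> str:
    """Combine class probabilities from two trained classifiers."""
    p1 = clf1.predict_proba([post_features])[0][1]  # P(malicious) from classifier 1
    p2 = clf2.predict_proba([post_features])[0][1]  # P(malicious) from classifier 2
    if p1 >= high and p2 >= high:
        return "malicious"
    if p1 <= low and p2 <= low:
        return "benign"
    return "undecided"  # illustrative fallback for disagreement between classifiers
```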
  46. 46. Approach •Poor quality posts published on Facebook •Facebook pages publishing poor quality content •Misinformation spread on Facebook through images Characterize •Ground truth extraction using URL blacklists, and human annotation •Experiments with multiple supervised learning techniques •Two-fold model to identify malicious content in real time Model •Facebook Inspector (FbI) Architecture •Live deployment via REST API and browser plug-ins for Chrome and Firefox •3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed •Evaluation in terms of response time, performance, and usability Implement 46
  47. 47. Facebook Inspector (FbI): Architecture 47
  48. 48. FbI stats 48
Date of public launch | August 23, 2015
Total incoming requests | 9 million+
Total public posts analyzed | 3.5 million+
Total downloads | 5,000+
Daily active users | 250+
Total unique browsers | 1,250+
Posts marked as malicious | 615,000+
Posts marked as benign | 2.9 million+
  49. 49. FbI evaluation: Response time 49 • ~80% of posts processed within 3 seconds • Average time per post: 2.635 seconds
  50. 50. FbI evaluation: Usability • Usability study with 53 participants • SUS score: 81.36 (A grade) • Higher perceived usability than > 90% of all systems evaluated using the SUS scale • 98.1% of participants found FbI “easy to use” • 67.9% of participants would like to use FbI frequently • Quotes from users: • “Saves your time spent on spam links and hence enhances user experience.” • “[Facebook Inspector] Can be useful for minors and people who lack the judgement to decide how the post is.” 50
  51. 51. Contributions summary • Identified and characterized poor quality content spread on Facebook, with the purpose of identifying poor quality posts published during news-making events in real time • Evaluated supervised learning approaches for identifying poor quality posts on Facebook in real time, using entity, textual, metadata, and URL features • Deployed and evaluated a novel framework and system for real time detection of poor quality posts on Facebook during news-making events 51
  52. 52. How does it help? • Social media services are the primary source of information for a majority of Internet users • Content is unmoderated and crowd-sourced; not everything you see is true • Facebook Inspector provides a useful and usable real world solution to assist users • Methodology for fast and accurate summarization of image datasets pertaining to a given topic • Government agencies / brands can use this methodology to quickly produce high-level summaries of events / products and gauge the pulse of the masses 52
  53. 53. Real world impact • Real time system Facebook Inspector built to identify poor quality content is used by 250+ Facebook users, and has processed over 9 million requests • A unique dataset of Facebook posts containing malicious URLs, pages posting malicious content, and images depicting misinformation from 20+ news-making events 53
  54. 54. Limitations and future work • Current system does not incorporate user feedback • We would like to enable users to provide feedback to make a more personalized detection model • Computer vision techniques have limited accuracy on social media content • The object detection, sentiment analysis, and optical character recognition techniques we used are not thoroughly tested on social media content • Identify and rank users on the basis of degree of malice • The more malicious content generated, the higher the ranking 54
  55. 55. Acknowledgements • NIXI for travel support (eCRS, 2014) • IIIT-Delhi for travel support (ASONAM, 2017) • Govt. of India for funding during PhD • Collaborators and co-authors: Dr. Anand Kashyap, Shrey Bagroy, Anshuman Suri, Varun Bharadhwaj, Aditi Mithal • Monitoring committee: Dr. Vinayak and Dr. Sambuddho • Peers: Dr. Niharika Sachdeva, Anupama Aggarwal, Dr. Paridhi Jain, Dr. Aditi Gupta, Srishti Gupta, Rishabh Kaushal • Members of Precog@IIITD and CERC • Everyone else who has been part of my journey… 55
  56. 56. Publications – Part of thesis • Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook. Book chapter, Lecture Notes in Social Networks, Springer 2017 (To appear) • Dewan, P., Suri, A., Bharadhwaj, V., Mithal, A., and Kumaraguru, P. Towards Understanding Crisis Events On Online Social Networks Through Pictures. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017. • Dewan, P., and Kumaraguru, P. Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook. Social Network Analysis and Mining Journal (SNAM), 2017. Volume 7, Issue 1. • Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2016 (Short paper) • Dewan, P., and Kumaraguru, P. Towards Automatic Real Time Identification of Malicious Posts on Facebook. Thirteenth Annual Conference on Privacy, Security and Trust (PST), 2015 • Dewan, P., Kashyap, A., and Kumaraguru, P. Analyzing Social and Stylometric Features to Identify Spear phishing Emails. APWG eCrime Research Symposium (eCRS), 2014 56
  57. 57. Publications – Other • Kaushal, R., Chandok, S., Jain P., Dewan, P., Gupta, N., and Kumaraguru, P. Nudging Nemo: Helping Users Control Linkability across Social Networks. 9th International Conference on Social Informatics (SocInfo), 2017 (Short paper). • Deshpande, P., Joshi, S., Dewan, P., Murthy, K., Mohania, M., Agrawal, S. The Mask of ZoRRo: preventing information leakage from documents. Knowledge and Information Systems Journal, 2014 • Mittal, S., Gupta, N., Dewan, P., Kumaraguru, P. Pinned it! A large scale study of the Pinterest network. 1st ACM IKDD Conference on Data Sciences (CoDS), 2014 • Dewan, P., Gupta, M., Goyal, K., and Kumaraguru, P. MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media. IBM ICARE, 2013 • Magalhães, T., Dewan, P., Kumaraguru, P., Melo-Minardi, R., and Almeida, V. uTrack: Track Yourself! Monitoring Information on Online Social Media. 22nd International World Wide Web Conference (WWW), 2013 • Conway M., Dewan P., Kumaraguru P., McInerney L. 'White Pride Worldwide': A Meta-analysis of Stormfront.org. Internet, Politics, Policy 2012: Big Data, Big Challenges?, Oxford Internet Institute, University of Oxford. 57
  58. 58. Thank you! prateekd@iiitd.ac.in http://precog.iiitd.edu.in/people/prateek
