Identifying and Mitigating Cross-Platform Phone Number Abuse on Social Channels

Telephony has become a cost-effective medium for spammers, and phone numbers are now being used to drive call traffic to spammer-operated resources. The convergence of telephony and the Internet through technologies like Voice over IP (VoIP) is fueling the growth of Over-The-Top (OTT) messaging applications (like WhatsApp and Viber) that allow smartphone users to communicate with each other in myriad ways. These social channels (Online Social Networks (OSNs) and OTT applications) and VoIP applications (like Skype and Google Hangouts) are used by millions of users around the globe. In fact, the volume of messages sent via OTT messaging applications has overtaken traditional SMS and e-mail. As a result, these social channels have become an attractive attack vector for spammers and malicious actors, who now abuse them for illicit activities like delivering spam and phishing messages. In this work, we aim to detect cybercriminals and spammers that use phone numbers to spread spam on OSNs. We divide this thesis into 4 parts: (1) understanding the threat landscape of phone attacks on OTT messaging applications leveraging information from OSNs, (2) uncovering the spam ecosystem on OSNs and identifying spammers who contribute to spreading spam, (3) evaluating the trustworthiness of current caller ID services and machine learning models that identify spam calls and spammers, and (4) proposing a robust phone reputation score for identifying spam phone numbers on OSNs.

Published in: Engineering

  1. 1. Identifying and Mitigating Cross-Platform Phone Number Abuse on Social Channels linkedin/in/srishti-gupta-627aa738 @Srishti_Gupta14 fb.com/gupta.srishti14 Committee Members Dr. Fabricio Benevenuto Dr. Pawan Goyal Dr. Sameep Mehta Dr. Ponnurangam Kumaraguru (Advisor) Srishti Gupta PhD Thesis Defense April 25, 2019 IIIT-Delhi 1
  2. 2. Who am I? ◆ Research Scientist at American Express ◆ PhD student since December, 2013 - IIIT-Delhi ▶ Masters (2011 - 2013, IIIT-Delhi) ◆ Collaborations ▶ New York University (Abu Dhabi), Georgia Institute of Technology (Atlanta), Microsoft IDC (Hyderabad), Pindrop Security (Atlanta) ◆ Worked in Privacy and Security in Online Social Networks ◆ Research Interests ▶ Applied Machine Learning ▶ Natural Language Processing ▶ Web Security 2
  3. 3. Motivation 3
  4. 4. 4
  5. 5. Keys to the Kingdom! Outgoing Spam Communication! 5
  6. 6. What difference does it make? ◆ Numbers under spammer control - no spoofing ◆ Incoming services like Truecaller don’t work for outgoing spammers ◆ Difference w.r.t. URLs? ▶ Medium different, more trust ▶ Minimal defense solutions, unlike spam filters ▶ The propagating and damaging channels are different 6
  7. 7. What can we do? 7
  8. 8. Locate spammers and take them down! 8
  9. 9. Challenges ◆ Lack of useful header data ◆ Difficulty in handling audio streams ◆ Lack of ground truth ◆ Country of origin of phone numbers unknown (toll-free numbers) ◆ Temporary disposable numbers 9
  10. 10. Thesis Statement Cross-platform phone-based spam campaigns can be disintegrated across social channels by identifying and mitigating spam using relational similarity that thrives on identifiable and discriminative public attributes 10
  11. 11. Contributions Summary ◆ Building automated frameworks to identify and characterise phone-based spam campaigns on social channels ◆ Evaluating the effectiveness of existing state-of-the-art tools in detecting spam phone numbers ◆ Mitigating phone-based spam campaigns by building SpamDoctor, a supervised detection method to flag phone numbers abused on OSNs. 11
  12. 12. SpamDoctor: Demo 12
  13. 13. Contributions Summary ◆ Building automated frameworks to identify and characterise phone-based spam campaigns on social channels ◆ Evaluating the effectiveness of existing state-of-the-art tools in detecting spam phone numbers ◆ Mitigating phone-based spam campaigns by building SpamDoctor, a supervised detection method to flag phone numbers abused on OSNs. 13
  14. 14. Targeted Attacks on Over-The-Top (OTT) Messaging Applications 14 Malicious Entity: Advertisements, random contact requests Service Provider: Inefficient filtering mechanisms OTT User: Spam activities not yet seen
  15. 15. System Architecture 15 Gupta, S., Gupta, P., Ahamad, M., and Kumaraguru, P. Exploiting Phone Numbers and Cross-Application Features in Targeted Mobile Attacks. Accepted at the 6th Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM), 2016
  16. 16. Information Gathering ◆ Leveraged using Truecaller ◆ Information like name, address, photo URL, OSN handles (Twitter and Facebook), e-mail ◆ Facebook Graph API ◆ Public feeds, posts, albums (public sources) 16
  17. 17. Scalability 17
  18. 18. Success: Amazon Mechanical Turk 18 Social (69.2) > Spear (54.3) > Non-targeted (34.5)
  19. 19. Threat landscape of phone-based spam campaigns on Online Social Networks? 19
  20. 20. System Architecture 20 Gupta, S., Kuchhal, D., Gupta, P., Ahamad, M., Gupta, M., and Kumaraguru, P. Under the Shadow of Sunshine: Characterizing Spam Campaigns Abusing Phone Numbers Across Online Social Networks. Accepted in the 10th ACM Conference on Web Science, Amsterdam, 27-30 May 2018
  21. 21. Data Collection ◆ Tweet Collection: using keywords like: “call”, “ring”, “reach”, “SMS”, “WhatsApp” etc. ◆ Data stored - phone number, posts, author details, URLs, suspended accounts’ information ◆ Google - Existing Internet Infrastructure 21
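The collection step on this slide (keep posts that mention a contact keyword, then store the phone numbers they contain) can be sketched as follows. This is a minimal illustration, not the pipeline used in the thesis: the keyword list is taken from the slide, while the regular expression `PHONE_RE` and the helper names are assumptions made for the example.

```python
import re

# Keywords named on the slide, lower-cased for matching.
KEYWORDS = ("call", "ring", "reach", "sms", "whatsapp")

# Hypothetical pattern (an assumption, not from the slides): an optional "+",
# then 8 to 15 digits that may be separated by a single space, dot, or hyphen.
PHONE_RE = re.compile(r"\+?\d(?:[\s.\-]?\d){7,14}")


def has_keyword(text):
    """True if the post contains one of the contact keywords as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(keyword in words for keyword in KEYWORDS)


def extract_phone_numbers(text):
    """Return digit-only phone number candidates found in a post."""
    return [re.sub(r"\D", "", match) for match in PHONE_RE.findall(text)]


def collect(posts):
    """Yield (post, phone_numbers) for posts with a keyword and a number."""
    for post in posts:
        if has_keyword(post):
            numbers = extract_phone_numbers(post)
            if numbers:
                yield post, numbers


posts = [
    "Call us now at +1 800-555-0199 for support",
    "Nice weather today 2019",
    "WhatsApp 9876543210",
]
collected = list(collect(posts))
```

A production collector would need country-aware parsing and validation; the pattern here only approximates what a phone number looks like.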
  22. 22. Needle in the haystack! 22 P: Phone number; T: Tweet; U: Unigrams
  23. 23. Ground Truth Creation ◆ Suspended accounts ◆ Overlap with FTC dataset ◆ Overlap with existing Truecaller services ◆ Duplicate posts by single and multiple accounts 23
  24. 24. Dataset ◆ ~22M posts ▶ 22,390 campaigns ▶ 1,845,150 distinct phone numbers ▶ 3,365,017 distinct user accounts ◆ ~4.9M posts ▶ Manually verified 202 campaigns ▶ 2,346 distinct phone numbers ▶ 157,494 distinct user accounts 24
  25. 25. Modus Operandi ◆ Advance fee ◆ Selling Products ◆ Alternating beliefs (LoveGuru) ◆ Tech Support 25
  26. 26. Where does Phone Spam Originate? ◆ Country code using Google libphonenumber ◆ Automated calling using ◆ Google Speech API (Audio to text) 26
  27. 27. Where does Phone Spam Originate? 27
  28. 28. How do campaigns spread across Online Social Networks? 28
  29. 29. Case Study: Tech Support Campaign ◆ 43,552 posts ◆ Used toll-free numbers ◆ Majority of phone numbers registered between 2014 and 2016 29
      Feature                  Twitter  Facebook  GooglePlus  YouTube  Flickr
      Total Posts               28,984     2,151       7,830    2,850   1,737
      Distinct Phone Numbers        41        33          37       39      20
      Distinct User IDs            748       289         360      433      79
  30. 30. Cross-Pollination ◆ Is a particular OSN preferred? Is there a specific pattern? 30
  31. 31. Existing Web Intelligence useful? ◆ 68.7% accounts never suspended ◆ However, 92% accounts suspended within 3 days in URL based spam campaigns ◆ 4,581 unique URLs, 594 distinct domains ◆ 10% URLs suspended by Web Of Trust (WOT); none by Google Safe Browsing 31
  32. 32. Can cross-platform intelligence from Online Social Networks be used? 32
  33. 33. Homogeneous Identities Same identity across networks; Levenshtein distance on usernames 33
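Matching usernames across networks by Levenshtein distance can be illustrated with the standard dynamic-programming edit distance. This is a sketch only: the slide does not state the cut-off that was used, so `max_distance` below is a hypothetical parameter, and the lower-casing step is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def same_identity(username_1, username_2, max_distance=2):
    """Treat two usernames as one identity if they are within max_distance edits.

    max_distance is a hypothetical threshold chosen for this example.
    """
    return levenshtein(username_1.lower(), username_2.lower()) <= max_distance
```

For example, `same_identity("TechHelp247", "techhelp_247")` holds because the names differ by a single inserted underscore after lower-casing.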
  34. 34. Cross-Platform Intelligence ◆ 65 instances of homogeneous identities ◆ 52% more posts on GooglePlus; 93.3% more accounts suspended on Twitter ◆ Intelligence propagation from Twitter to other OSNs ◆ Reducing financial loss and victims: collected friends, followers, and likes on Facebook, GooglePlus, and YouTube 34
  35. 35. Cross-Platform Intelligence (I) ◆ Can save approximately 8.8M USD ▶ 21,053 friends on Facebook ▶ 11,538 followers on GooglePlus ▶ 2,816 likes on YouTube ▶ Total - 670,164 users ▶ Average cost of TechSupport spam - $290.9 per victim ▶ Total money saved - 670,164*290.9 = $8.8M 35
  36. 36. Do legitimate campaigns exist on OSNs? How are they different from spam campaigns? 36
  37. 37. Comparing Spam and Legitimate Identities 16 brands targeted, like Microsoft, Facebook, Yahoo, McAfee etc. 37
      Category                          Spam     Legitimate
      Number of posts                   269,652  5,712
      Number of unique phone numbers    1,164    279
      Number of unique IDs              6,077    794
      Number of suspended IDs           67,757   47
  38. 38. Spammers vs. Non-spammers 38 Legitimate accounts post about 1 brand while spammers promote multiple brands Larger lifetime of legitimate phone number than phone numbers used in spam campaigns
  39. 39. Network Characteristics 39 Non-spammers Spammers
  40. 40. Takeaways ◆ Cross-platform spam campaigns span multiple countries: the top ones are Indonesia, USA, India, and UAE ◆ URL spammers are suspended within 3 days, while 68.7% of phone-based spammers are never suspended ◆ Cross-platform intelligence can be shared across OSNs: Twitter is able to suspend 93.3% more accounts than Facebook. Around 35,407 victims can be protected and $8.8M saved ◆ Spammers collude and form dense communities to expand their reach 40
  41. 41. Contributions Summary ◆ Building automated frameworks to identify and characterise phone-based spam campaigns on social channels ◆ Evaluating the effectiveness of existing state-of-the-art tools in detecting spam phone numbers ◆ Mitigating phone-based spam campaigns by building SpamDoctor, a supervised detection method to flag phone numbers abused on OSNs. 41
  42. 42. Fake Registration ◆ No means of identity verification ◆ Social media accounts can be linked ◆ Similar situation for multiple Applications like Whitepages Pro, Contactive, Whoscall, Hello 42
  43. 43. Trust in Caller ID Applications 43
  44. 44. Spam Phone Number Coverage ◆ FTC Do Not Call complaint dataset (0.001%) ▶ Information reported by consumers ▶ Do Not Call and robocall complaints ◆ Truecaller - 0.4% ▶ Exploiting search endpoint to crawl data ◆ MalwareBytes - 20.3% ▶ Coverage with only TechSupport campaign https://www.ftc.gov/site-information/open-government/data-sets/do-not-call-data 44
  45. 45. How to mitigate phone-based spam campaigns? 45
  46. 46. Contributions Summary ◆ Building automated frameworks to identify and characterise phone-based spam campaigns on social channels ◆ Evaluating the effectiveness of existing state-of-the-art tools in detecting spam phone numbers ◆ Mitigating phone-based spam campaigns by building SpamDoctor, a supervised detection method to flag phone numbers abused on OSNs. 46
  47. 47. Revisiting Dataset ◆ Campaigns with at least one suspended user: 3,370 / 22,390 ◆ 670,257 unique user accounts, 5,593 already suspended ◆ 26,160 unique phone numbers ◆ 893,808 unique URLs 47 Gupta, S., Khattar, A., Gogia, A., Kumaraguru, P., and Chakraborty, T. Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach. Accepted at The Web Conf 2018 (formerly WWW Conference).
  48. 48. Heterogeneous Networks and Meta-Paths Meta-Paths: Two users can be connected via different paths viz. user-phone-user, user-url-user, user-phone-url-user Collective Classification: Combined classification of nodes based on correlations between known and unknown nodes Known nodes: Already suspended users by Twitter 48
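The user-phone-user and user-url-user meta-paths can be pictured with a small sketch that links every pair of users who posted the same entity. The input format (a list of `(user, entity)` pairs) and the function name are assumptions made for illustration; the slide does not describe how the network is stored.

```python
from collections import defaultdict
from itertools import combinations


def users_linked_by_entity(user_entity_pairs):
    """Map each user pair to the entities (phones or URLs) that connect them.

    user_entity_pairs: iterable of (user, entity) tuples, one per observed post.
    A pair (u, v) in the result means a u-entity-v meta-path exists.
    """
    users_by_entity = defaultdict(set)
    for user, entity in user_entity_pairs:
        users_by_entity[entity].add(user)

    links = defaultdict(set)
    for entity, users in users_by_entity.items():
        for user_a, user_b in combinations(sorted(users), 2):
            links[(user_a, user_b)].add(entity)
    return dict(links)


pairs = [("u1", "p1"), ("u2", "p1"), ("u3", "p2"), ("u1", "p2")]
links = users_linked_by_entity(pairs)
```

In this toy input, `u1` and `u2` are connected through phone `p1`, and `u1` and `u3` through `p2`; `u2` and `u3` share no entity and are not directly linked.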
  49. 49. Methodology 49
  50. 50. Hierarchical Meta-Path Based Score (HMPS) Local HMPS score calculated for each campaign: HMPS value for spammer in campaign1 can be shared by non-spammer in campaign2 50
  51. 51. Edge Weights ◆ W(User_i, Phone_j): This is the weight of the edge connecting a user and a phone number, and is measured as the ratio of tweets propagated by User_i containing Phone_j over all the tweets propagated by User_i ◆ W(User_i, URL_j): This is the weight of the edge connecting a user and a URL, and is measured as the ratio of tweets propagated by User_i containing URL_j over all the tweets propagated by User_i 51
  52. 52. Edge Weights (I) ◆ W(Camp_i, Phone_j): This is the weight of the edge connecting a campaign and a phone number, and is measured as the ratio of tweets containing Phone_j in Camp_i over all the tweets containing phone numbers in Camp_i ◆ W(Camp_i, URL_j): This is the weight of the edge connecting a campaign and a URL, and is measured as the ratio of tweets containing URL_j in Camp_i over all the tweets containing URLs in Camp_i 52
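The two kinds of edge weights defined on these slides are simple tweet-count ratios, which the following sketch computes. The input layout (a list of `(owner, entities)` records, where `entities` holds only phone numbers or only URLs) and the function names are assumptions for the example, not the thesis code.

```python
from collections import Counter


def user_entity_weights(tweets):
    """W(User_i, E_j): tweets by User_i containing E_j over all tweets by User_i.

    tweets: list of (user, entities), entities being a set of phones or URLs.
    """
    total = Counter()
    containing = Counter()
    for user, entities in tweets:
        total[user] += 1
        for entity in set(entities):
            containing[(user, entity)] += 1
    return {(u, e): n / total[u] for (u, e), n in containing.items()}


def campaign_entity_weights(tweets):
    """W(Camp_i, E_j): tweets in Camp_i containing E_j over all tweets in
    Camp_i that contain at least one entity of that type.

    tweets: list of (campaign, entities), entities being a set of phones or URLs.
    """
    with_any = Counter()
    containing = Counter()
    for campaign, entities in tweets:
        entities = set(entities)
        if not entities:
            continue  # tweets without this entity type are not in the denominator
        with_any[campaign] += 1
        for entity in entities:
            containing[(campaign, entity)] += 1
    return {(c, e): n / with_any[c] for (c, e), n in containing.items()}


user_w = user_entity_weights(
    [("u1", {"p1"}), ("u1", {"p1", "p2"}), ("u1", set()), ("u1", {"p2"})]
)
camp_w = campaign_entity_weights(
    [("c1", {"p1"}), ("c1", {"p1"}), ("c1", set()), ("c1", {"p2"})]
)
```

The only difference between the two functions is the denominator: the user-level weight divides by all of the user's tweets, while the campaign-level weight divides only by tweets that carry an entity of the same type.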
  53. 53. HMPS(User_1) ◆ Weight between User_1 and User_2, W_1 = W(User_1, Phone_2) * W(User_2, Phone_2) ◆ Weight between User_1 and User_4, W_2: maximum score calculated over the 2 possible meta-paths, i.e., User_1-URL_1-User_4 and User_1-Phone_2-Camp_1-URL_1-User_4; W_2 = max([W(User_1, URL_1) * W(User_4, URL_1)], [W(User_1, Phone_2) * W(Camp_1, Phone_2) * W(Camp_1, URL_1) * W(User_4, URL_1)]) ◆ The final HMPS of User_1: HMPS(User_1) = W_1 + W_2 53
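The worked example on this slide can be traced numerically with a short sketch. The node names follow the slide, but every edge weight below is a made-up value chosen only to show the arithmetic; none of them come from the thesis data.

```python
def hmps_user1(w):
    """HMPS(User1) = W1 + W2 for the toy network on the slide.

    w maps an (endpoint, entity) edge to its weight.
    """
    # W1: the single User1-Phone2-User2 meta-path.
    w1 = w[("User1", "Phone2")] * w[("User2", "Phone2")]

    # W2: best of the two meta-paths that reach User4.
    direct = w[("User1", "URL1")] * w[("User4", "URL1")]
    via_campaign = (
        w[("User1", "Phone2")]
        * w[("Camp1", "Phone2")]
        * w[("Camp1", "URL1")]
        * w[("User4", "URL1")]
    )
    w2 = max(direct, via_campaign)
    return w1 + w2


# Hypothetical edge weights, for illustration only.
weights = {
    ("User1", "Phone2"): 0.5,
    ("User2", "Phone2"): 0.4,
    ("User1", "URL1"): 0.2,
    ("User4", "URL1"): 0.5,
    ("Camp1", "Phone2"): 0.8,
    ("Camp1", "URL1"): 1.0,
}
score = hmps_user1(weights)
```

With these values W1 = 0.5 * 0.4 = 0.2, the direct path scores 0.2 * 0.5 = 0.1, the campaign path scores 0.5 * 0.8 * 1.0 * 0.5 = 0.2, so W2 = 0.2 and the score is about 0.4.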
  54. 54. Challenges ◆ Imbalanced dataset ◆ Manual labelling needs human effort ◆ Individual campaigns might not have sufficient training samples 54
  55. 55. Challenges ◆ Imbalanced dataset - One-class classifier! ◆ Manual labelling needs human effort - One-class classifier! ◆ Individual campaigns might not have sufficient training samples: ▶ Active learning with feedback used ▶ Gather cues for unknown users from multiple campaigns (21% overlapping users) 55
  56. 56. Active Learning with Feedback 56
  57. 57. Selection Criterion Given (a) a one-class classifier C, represented by the function f(x) which, for an instance x, provides the distance of x from the classification boundary, and (b) X, a set of unlabeled instances, we take the maximum distance among all the training samples from the decision boundary, T^c_max = max_{x ∈ X} f(x). Now, from the unknown set X_u, whose instances are labeled by C, we choose those instances X'_u such that ∀x ∈ X'_u : f(x) ≥ T^c_max. Note that the threshold T^c_max is specific to a campaign 57
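The selection rule above amounts to a per-campaign threshold followed by a filter, sketched below. The scoring function `f` stands in for the one-class classifier's distance from the decision boundary; here it is a toy lookup, and the function and variable names are assumptions made for the example.

```python
def select_for_feedback(f, training_instances, unknown_instances):
    """Pick unknown instances scoring at least as high as the best training sample.

    f: callable returning an instance's distance from the decision boundary.
    The threshold is recomputed per campaign, as the slide notes.
    """
    threshold = max(f(x) for x in training_instances)
    selected = [x for x in unknown_instances if f(x) >= threshold]
    return threshold, selected


# Toy distances standing in for a trained classifier's output (illustrative).
distances = {"a": 0.2, "b": 0.5, "c": 0.3, "u1": 0.1, "u2": 0.5, "u3": 0.9, "u4": 0.4}
threshold, selected = select_for_feedback(
    distances.get, ["a", "b", "c"], ["u1", "u2", "u3", "u4"]
)
```

Only `u2` and `u3` reach the threshold of 0.5 set by training sample `b`, so only they would be fed back as confident predictions in this toy run.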
  58. 58. Comparison with Baselines Baseline 1: Profile based features like tweets, followers, hashtags etc. [1]. Baseline 2: URL based features like number of URLs, number of words in the URL etc.[2]. Baseline 3: Content based features like tweets, hashtags, mentions, popularity ratio etc.[3]. [1] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. 2010. Detecting spammers on twitter. In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), Vol. 6. 1–12. [2] Usman US Khan, Mazhar Ali, Assad Abbas, Samee Khan, and Albert Zomaya. 2016. Segregating Spammers and Unsolicited Bloggers from Genuine Experts on Twitter. IEEE Transactions on Dependable and Secure Computing (2016). [3] Kayode Sakariyah Adewole, Nor Badrul Anuar, Amirrudin Kamsin, and Arun Kumar Sangaiah. 2017. SMSAD: a framework for spam message and spam account detection. Multimedia Tools and Applications (2017), 1–36. 58
  59. 59. Evaluation ◆ Setting 1: Leave-one-out cross-validation ◆ Setting 2: ▶ Human annotation ▶ Convenience sampling to pick users who are part of multiple campaigns ▶ 700 users sampled 59
  60. 60. Results 60
      Method      Feature                          Setting 1   Setting 2
                                                   Accuracy    P      R      F1     AUC
      Baseline 1  OSN1                             0.62        0.86   0.71   0.77   0.48
      Baseline 2  OSN2                             0.58        0.84   0.92   0.87   0.52
      Baseline 3  OSN3                             0.62        0.86   0.66   0.74   0.47
      Our         HMPS                             0.77        0.99   0.87   0.93   0.88
                  HMPS + OSN1                      0.76        0.89   0.90   0.89   0.72
                  HMPS + OSN2                      0.84        0.98   0.88   0.93   0.87
                  HMPS + OSN3                      0.70        0.88   0.73   0.80   0.59
      Our         HMPS + OSN2 - Active Learning    -           0.42   0.98   0.55   0.51
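For readers less familiar with the P, R, and F1 columns in the results above, the following sketch shows how precision, recall, and F1 are conventionally derived from confusion counts. It uses the standard definitions; the counts in the example are invented and do not correspond to any row of the table.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts: 8 spammers flagged correctly, 2 false alarms, 2 missed.
p, r, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.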
  61. 61. 1-class vs. 2-class Classifier 61
      Method                        Precision  Recall  F1-Score  AUC
      Baseline 1                    0.68       0.69    0.65      0.50
      Baseline 2                    0.47       0.57    0.51      0.50
      Baseline 3                    0.79       0.78    0.78      0.57
      HMPS + 2-class classifiers
        LR                          0.61       0.58    0.55      0.58
        LDA                         0.61       0.58    0.55      0.58
        DT                          0.83       0.83    0.83      0.83
        NB                          0.60       0.58    0.57      0.58
        SVM                         0.65       0.63    0.62      0.63
        RF                          0.83       0.82    0.82      0.82
      HMPS + OSN2                   0.95       0.90    0.93      0.92
  62. 62. Feedback vs. Oversampling 62
      Method                                        Precision  Recall  F1-Score  AUC
      Oversampling + default one-class classifier
        Ratio = 0.20                                0.90       0.64    0.64      0.59
        Ratio = 0.30                                0.88       0.74    0.74      0.63
        Ratio = 0.50                                0.81       0.71    0.68      0.58
        Ratio = 0.75                                0.91       0.68    0.69      0.56
        Ratio = 1                                   0.91       0.68    0.70      0.57
      Feedback + default one-class classifier       0.95       0.90    0.93      0.92
  63. 63. Contributions Summary ◆ Building automated frameworks to identify and characterise phone-based spam campaigns on social channels ◆ Evaluating the effectiveness of existing state-of-the-art tools in detecting spam phone numbers ◆ Mitigating phone-based spam campaigns by building SpamDoctor, a supervised detection method to flag phone numbers abused on OSNs. 63
  64. 64. How does this thesis help? ◆ Online Social Networks are a primary source of information consumption by Internet users ▶ Unmoderated content; SpamDoctor provides a useful and usable solution to fight back against phone-based spam attacks ◆ Bridging the gap between different channels, i.e. Telephony and Web ▶ Help telecom service providers in blocking incoming and outgoing services to these phone numbers ◆ Early spam detection on OSNs due to transfer learning ▶ Cross-platform intelligence can be shared across OSNs to augment spam detection 64
  65. 65. Limitations and Future Work ◆ Address different types of campaigns differently ▶ Study spam and scam campaigns differently ◆ Utilize crowdsourcing to personalize and improve the performance of automated techniques for spam campaign identification ▶ Crowdsourced feedback to improve accuracy of models ◆ Explore the impact of images and cross-referenced posts in OSNs ▶ Augmenting cross-platform intelligence 65
  66. 66. Acknowledgements ◆ Collaborators and co-authors: Dr. Payas Gupta, Prof. Mustaque Ahamad, Manish Gupta, Dhruv Kuchhal, Abhinav Khattar, Arpit Gogia, Gurpreet Singh, Saksham Suri ◆ Monitoring committee: Prof. Mustaque and Prof. Sambuddho ◆ Peers: Dr. Paridhi Jain, Dr. Niharika Sachdeva, Dr. Siddhartha Asthana, Dr. Prateek Dewan, Anupama Aggarwal, Rishabh Kaushal ◆ Members of Precog ◆ My family 66
  67. 67. Peer-reviewed Publications 67
      Gupta, S., and Kumaraguru, P. Emerging phishing trends and effectiveness of the anti-phishing landing page. In Electronic Crime Research (eCrime), 2014 APWG Symposium on, pp. 36-47. IEEE, 2014.
      Gupta, S., Gupta, P., Ahamad, M., and Kumaraguru, P. Know your targets: Privacy and Security Implications in Instant Messaging Applications. Poster at 2nd NYUAD Annual Research Conference, Abu Dhabi, 2015.
      Gupta, S., Gupta, P., Ahamad, M., and Kumaraguru, P. Exploiting phone numbers and cross-application features in targeted mobile attacks. In Proceedings of the 6th Workshop on Security and Privacy in Smartphones and Mobile Devices, pp. 73-82. ACM, 2016.
      Gupta, S. Emerging Threats Abusing Phone Numbers Exploiting Cross-Platform Features. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). Ph.D. Forum.
      Gupta, S., Khattar, A., Gogia, A., Kumaraguru, P., and Chakraborty, T. Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach. In Proceedings of the 2018 World Wide Web Conference (WWW), pp. 529-538, 2018.
      Gupta, S., Kuchhal, D., Gupta, P., Ahamad, M., Gupta, M., and Kumaraguru, P. Under the Shadow of Sunshine: Characterizing Spam Campaigns Abusing Phone Numbers Across Online Social Networks. In Proceedings of the 10th ACM Conference on Web Science, pp. 67-76. ACM, 2018.
      Gupta, S., Bhatia, G., Suri, S., Kuchhal, D., Gupta, P., Ahamad, M., Gupta, M., and Kumaraguru, P. Angel or Demon? Characterizing Variations Across Twitter Timeline of Technical Support Campaigners. [under review in Journal of Web Science].
  68. 68. Thanks! srishtig@iiitd.ac.in http://precog.iiitd.edu.in/ @Srishti_Gupta14 68
