Social Network Analysis
Approach and Applications
Joshua S. White
PhD Candidate, Engineering Science
April 22, 2014
Dissertation defense entitled: Social Network Analysis Applications and Approach
Social Network Analysis Applications and Approach

  1. Social Network Analysis
     Approach and Applications
     Joshua S. White
     PhD Candidate, Engineering Science
     April 22, 2014
     Committee Members:
       Jeanna N. Matthews, PhD (Advisor)
       John S. Bay, PhD (External Examiner)
       Chris Lynch, PhD
       Chen Liu, PhD
       Stephanie C. Schuckers, PhD
     | Clarkson University 1/42
  2. Outline
     • Motivation (slide 3)
     • Problem Questions (slide 4)
     • Method & Publications (slide 5)
     • Coalmine (slide 6)
     • PySNAP (slide 7)
     • Established Dataset (slide 8)
     • Insights into the Data (slide 9)
     • Botnet Command & Control Detection (slide 10)
     • Phishing Website Detection (slide 12)
     • Phishing Website Detection Continuum: ML-based detection (slide 14)
     • Malware Infection Vector Detection (slide 15)
     • Actor Identification (slide 19)
     • Event Identification (slide 24)
     • Conclusions (slide 30)
     • Future Work (slide 31)
     • Acknowledgements (slide 32)
     • References (slide 33)
     • Contact (slide 34)
     • Questions (slide 35)
     • Supplemental Material (slide 36)
  3. Motivation
     Partially inspired by Gladwell’s book, The Tipping Point [1], in which he discusses how life can be thought of as an epidemic. Gladwell’s rigor has drawn some criticism; for our purposes, however, the book serves as inspiration and motivation rather than as a rigorous source.
     The Book’s Key Points (for our purposes)
     • Actors (Connectors, Mavens, Salesmen).
     • Information spreads like disease.
     • Ideas reach a tipping point (critical mass).
     Let’s Face It: Social Networks Are Fun
     • We are a social species that enjoys communicating and self-adulation.
  4. Problem Questions
     • Can we come up with a way of classifying users based on actor types?
     • Can we determine who the opinion leaders or influencers are?
     • Can we determine how information spreads on these networks?
     • Can we detect malicious social network use?
     • Are there information security applications for social network data-mining?
  5. Method & Publications
     • Establish a reliable collection mechanism.
     • Establish a large dataset that can be utilized to answer each question.
     • Use a case study approach, whereby each case feeds the next.
     • Produce each case study as an individual publication or presentation.
       – 3 x Published Proceedings
       – 2 x Pending Proceedings
       – 3 x Invited Presentations
  6. Coalmine
     • Scales well based on initial tests
     • Useful for both manual and automated detection
     • Allowed us to refine our data collection capabilities
     At the Time (Future Work)
     • Rebuild of the tool to fix scaling limitations
     • More extensible Map/Reduce method
     • Inclusion of native multi-threading capability
     • New storage and distribution method
     • New algorithms for automated opinion leader detection
  7. PySNAP
     • Fixes all of the previous issues with Coalmine
     • Completely reimplemented in Python with a few supporting Bash scripts
     • Utilizes the DISCO MapReduce framework, also built on Python
     • Includes a better method for data capture, which had previously been bolted on to Coalmine
     • Allowed us to establish a large dataset for future work
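To make the map/reduce shape of such a job concrete, here is a minimal sketch that counts hashtags. It deliberately does not use the DISCO API; it uses only the Python standard library, and the function names (`map_hashtags`, `reduce_counts`, `run_job`) and the simplified tweet dictionaries are hypothetical, not taken from PySNAP.

```python
from collections import defaultdict
from itertools import chain


def map_hashtags(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet's text."""
    for word in tweet["text"].split():
        if word.startswith("#") and len(word) > 1:
            # Normalize case and strip trailing punctuation so variants merge.
            yield word.lower().rstrip(".,!?:;"), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(tweets):
    """Run the map step over every tweet, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_hashtags(t) for t in tweets))


if __name__ == "__main__":
    sample = [{"text": "Stop #Kony2012 now"}, {"text": "#kony2012 #news"}]
    print(run_job(sample))
```

In a real MapReduce framework the map calls would run in parallel across workers and the pairs would be grouped by key before reduction; the single-process version only shows the data flow.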
  8. Established Dataset
     • Over the course of 2012 we collected 165 TB of Twitter data (uncompressed)
       – 175 days collected, 147 full days
         ∗ Estimated 45 billion tweets
       – Recently released estimates place total Twitter traffic at 175 million tweets per day in 2012
       – Thus our daily collection rates varied between 50% and 80% of total Twitter traffic.
       – We captured complete tweet data in JSON format using Twitter’s REST API.
         ∗ This data includes a large number of additional fields other than the message text, all of which can be taken into account when doing measurements.
  9. Insights into the Data [image-only slide]
  10. Botnet Command & Control Detection
      • Joshua S White, Jeanna N Matthews, and John L Stacy. Coalmine: an experience in building a system for social media analytics. In SPIE Defense, Security, and Sensing, pages 84080A-84080A. International Society for Optics and Photonics, 2012.
  11. Botnet Command & Control Detection Continued
      Date/Time | UID | Text MSG | Entropy | Source
      Sun Mar 20 15:27:02 +0000 2011 | 49492150668365824 | Shutdown -r now | 3.37355726227518 | http://twitter.com/Ebastos
      Sun Mar 20 01:25:20 +0000 2011 | 49280326475853825 | # shutdown -h now | 3.37355726227518 | http://twitter.com/ohdediku
      Sun Mar 20 21:40:53 +0000 2011 | 49586229964062720 | $ sudo shutdown -h now | 3.37355726227518 | http://twitter.com/souzabruno
      Sun Mar 20 19:38:41 +0000 2011 | 49555476769280000 | Text: sudo shutdown -h now | 3.37355726227518 | http://twitter.com/stormyblack
      Sun Mar 20 18:51:51 +0000 2011 | 49543693820116992 | shutdown -now | 3.37355726227518 | http://twitter.com/godzilla2k9
      Sun Mar 20 18:52:30 +0000 2011 | 49543856840126464 | shutdown -h now !: | 3.37355726227518 | http://twitter.com/ph3nagen
      Sun Mar 20 18:52:30 +0000 2011 | 49600582113177600 | shutdown -H now. | 3.37355726227518 | http://twitter.com/willybistuer
      Sun Mar 20 22:37:54 +0000 2011 | 49597117039251457 | elmenda: su shutdown -h now | 3.37355726227518 | http://twitter.com/NeoVasili
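The slides do not say how the Entropy column was computed. As an illustration only, the sketch below assumes character-level Shannon entropy over the message text; the function name `shannon_entropy` is hypothetical, and the values it produces are not expected to reproduce the figures in the table.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for message in ("shutdown -h now", "$ sudo shutdown -h now"):
        print(message, round(shannon_entropy(message), 4))
```

A score like this could be used to group or flag messages that look like terse machine commands rather than conversation, which is the kind of signal a command and control detector might look for.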
  12. Phishing Website Detection
      • Joshua S White, Jeanna N Matthews, and John L Stacy. A method for the automated detection phishing websites through both site characteristics and image analysis. In SPIE Defense, Security, and Sensing, pages 84080B-84080B. International Society for Optics and Photonics, 2012.
  13. Phishing Website Detection Continued
      Columns: (F)raud / (L)egit, URL, Structural Fingerprint, Page Title, pHash Value, Hamming Score
      • Paypal Fraudulent
        – URL: http://si4r.com/_paypal.co.uk/webscr.html?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=
        – Structural Fingerprint: 0,7,1,0,2
        – Page Title: RETURNED NOTHING
        – pHash Value: 16716169687489800000
        – Hamming Score: 1
      • Paypal Legitimate
        – URL: https://www.paypal.com/cgi-bin/webscr?cmd=_login-submit&dispatch=5885d80a13c0db1f8e263663d3faee8d1e83f46a36995b3856cef1e18897ad75
        – Structural Fingerprint: 27,3,0,0,2
        – Page Title: Redirecting - Paypal
        – pHash Value: 18439707190431800000
        – Hamming Score: 0
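The comparison step behind a hash-and-Hamming approach can be sketched as follows. This is not the authors' implementation: it uses a simple average hash as a stand-in for pHash (which is DCT-based), works on a tiny hand-written grayscale grid rather than a real screenshot, and all function names are hypothetical.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is at or above the mean.

    `pixels` is a list of rows of grayscale values. This is a simpler
    stand-in for pHash, used here only to produce comparable bit strings.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    value = 0
    for p in flat:
        value = (value << 1) | (1 if p >= mean else 0)
    return value


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def similarity(hash_a, hash_b, bits=64):
    """Fraction of matching bits, from 0.0 (all differ) to 1.0 (identical)."""
    return 1.0 - hamming_distance(hash_a, hash_b) / bits


if __name__ == "__main__":
    legit = average_hash([[0, 255], [255, 0]])
    suspect = average_hash([[0, 255], [255, 255]])
    print(hamming_distance(legit, suspect), similarity(legit, suspect, bits=4))
```

A small Hamming distance between a candidate page's hash and a known brand's hash suggests the two render almost identically, which is suspicious when the URLs belong to different domains.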
  14. Phishing Website Detection Continuum: ML-based detection
      • Title: An Image-based Feature Extraction Approach for Phishing Website Detection
      • Authors: Hao Jiang, Joshua White, Jeanna Matthews
      • Builds on our previous work in phishing website detection, specifically the image analysis approach
      • Utilizes a machine learning based approach to identify the most prominent images on a webpage, usually the site’s logo
      • Is able to detect phishing sites that the pHash/Hamming distance method judges to be dissimilar.
        – These are the “poor quality” phishing sites
  15. Malware Infection Vector Detection
      • BEK (the Blackhole Exploit Kit) was the predominant MaaS (Malware as a Service) in 2012.
      • It accounted for an estimated 29% of all malicious URLs.
      • BEK licenses went for around $1,500 USD
      • BEK used Twitter as its primary means of spreading infectious URLs
      • Our method detects these malicious URLs and infectious accounts on a large scale
  16. Malware Infection Vector Detection Continued
      • Joshua S. White and Jeanna N. Matthews, “It’s you on photo?: Automatic detection of Twitter accounts infected with the Blackhole Exploit Kit,” 2013 8th International Conference on Malicious and Unwanted Software: “The Americas” (MALWARE), pp. 51-58, 22-24 Oct. 2013. doi: 10.1109/MALWARE.2013.6703685
  17. Malware Infection Vector Detection Continued [image-only slide]
  18. Malware Infection Vector Detection Continued [image-only slide]
  19. Actor Identification
      • Title: Connectors, Mavens, Salesmen and More: Actor Based Online Social Network (OSN) Analysis Method Using Tensed Predicate Logic
      • Authors: Joshua White and Jeanna Matthews
      • Submitted to KDD2014 (Knowledge Discovery and Data Mining) Conference “Data Mining for Social Good”
      • Utilized multiple definitions of actor types to create tensed predicate logic descriptions
      • Translated these logics into semantic queries
      • Tested the queries against a known dataset
  20. Actor Identification Continued [image-only slide]
  21. Actor Identification Continued
      • Time is important
      • Previous methods did not take event sequence into account
      • Liaison Example:
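The liaison figure itself is not part of this transcript, so the following is only an illustration of why event order matters. It assumes a simple definition that is not taken from the paper: a liaison is an account outside two groups that first receives a message from a member of one group and only afterwards sends a message to a member of a different group. The names `Event`, `find_liaisons`, and `group_of` are hypothetical.

```python
from collections import namedtuple

# One directed message: `sender` contacted `receiver` at `time`.
Event = namedtuple("Event", "time sender receiver")


def find_liaisons(events, group_of):
    """Return accounts that relay between two groups in the right time order.

    `group_of` maps an account name to its group label; accounts without
    an entry are treated as belonging to no group.
    """
    earliest_in = {}  # account -> {source group: earliest time a message arrived}
    liaisons = set()
    for ev in sorted(events, key=lambda e: e.time):
        src_group = group_of.get(ev.sender)
        dst_group = group_of.get(ev.receiver)
        if dst_group is not None:
            for group, arrived in earliest_in.get(ev.sender, {}).items():
                outside_both = group_of.get(ev.sender) not in (group, dst_group)
                # The inbound message must strictly precede the outbound one.
                if group != dst_group and arrived < ev.time and outside_both:
                    liaisons.add(ev.sender)
        if src_group is not None:
            earliest_in.setdefault(ev.receiver, {}).setdefault(src_group, ev.time)
    return liaisons


if __name__ == "__main__":
    groups = {"alice": "A", "bob": "B"}
    in_order = [Event(1, "alice", "carol"), Event(2, "carol", "bob")]
    reversed_order = [Event(1, "carol", "bob"), Event(2, "alice", "carol")]
    print(find_liaisons(in_order, groups))
    print(find_liaisons(reversed_order, groups))
```

The two sample event lists contain the same edges, so a static graph would treat them identically; only the time-ordered check separates the case where information could actually have been passed along.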
  22. Actor Identification Continued [image-only slide]
  23. Actor Identification Continued [image-only slide]
  24. Event Identification
      • Still in the initial stages of this part of our work
      • Given a general topic (a search term or hashtag), we can identify most of the related content from the dataset
      • We have a means for alerting on all new posts regarding that term
      • We can dig historically through the data and trace the path that an idea took
      • We can identify the influential individuals (accounts) that played a part in the information spread
      • Our test case was the KONY2012 Event
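The filtering and path-tracing steps listed above can be sketched as below. This is an assumption-laden illustration, not the PySNAP code: it expects one tweet per JSON line with the standard `created_at`, `text`, `user.screen_name`, and `entities.hashtags` fields, and the function name `matching_tweets` is hypothetical.

```python
import json
from datetime import datetime

# Timestamp layout used in tweet JSON, e.g. "Sun Mar 20 15:27:02 +0000 2011".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def matching_tweets(json_lines, term):
    """Return (time, screen_name, text) for tweets mentioning `term`, oldest first."""
    term = term.lower().lstrip("#")
    hits = []
    for line in json_lines:
        tweet = json.loads(line)
        hashtags = (tweet.get("entities") or {}).get("hashtags", [])
        tags = [tag["text"].lower() for tag in hashtags]
        if term in tags or term in tweet.get("text", "").lower():
            when = datetime.strptime(tweet["created_at"], TWITTER_TIME)
            hits.append((when, tweet["user"]["screen_name"], tweet["text"]))
    # Sorting by timestamp gives the order in which the topic appeared.
    return sorted(hits)


if __name__ == "__main__":
    sample = [
        json.dumps({"created_at": "Wed Mar 07 12:00:00 +0000 2012",
                    "text": "Watch this #KONY2012",
                    "user": {"screen_name": "b"},
                    "entities": {"hashtags": [{"text": "KONY2012"}]}}),
        json.dumps({"created_at": "Tue Mar 06 09:30:00 +0000 2012",
                    "text": "kony2012 video is out",
                    "user": {"screen_name": "a"}}),
    ]
    for when, name, text in matching_tweets(sample, "#KONY2012"):
        print(when.isoformat(), name, text)
```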
  25. Event Identification Continued [image-only slide]
  26. Event Identification Continued
      • Top 10 Twitter Accounts, sending and receiving KONY2012 related Tweets
      Directed @ Account Names | In-Degree | Origin Account Names | Out-Degree
      tothekidswho | 625 | twittonpeace | 47
      Invisible | 125 | interhabernet | 44
      youtube | 118 | DailyisOut | 44
      helpspreadthis | 95 | MEDYA_TURK | 42
      justinbieber | 83 | haber_42 | 35
      prettypinkprobz | 48 | gundem_haber | 30
      ninadobrev | 48 | twittofpeace | 22
      MeekMill | 47 | korkmazhaber | 19
      ladygaga | 43 | tarafsiz_haber | 14
      KendallJenner | 39 | Son_DakikaHaber | 13
  27. Event Identification Continued
      • Top 10 Twitter Accounts, retweeting and being retweeted regarding KONY2012
      Retweeting Accounts | In-Degree | Message Source | Out-Degree
      MedyaKonya | 8 | Stop____Kony | 2642
      twittonpeace | 8 | tothekidswho | 753
      haber_42 | 7 | konyfamous2012 | 716
      gundem_haber | 7 | Kony2012Help | 615
      korkmazhaber | 7 | stop______kony | 353
      DailyisOut | 7 | WESTOPKONY | 225
      interhabernet | 6 | zaynmalik | 221
      KONYA_ZAMAN | 6 | iSayStopKony | 127
      konya_time | 6 | Stop_2012_Kony | 80
      konyagazetesi | 5 | Kony_Awareness | 72
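Degree rankings like the ones above can be derived from a list of directed edges. The sketch below is a hedged illustration: it assumes each @mention counts as one edge (the slides do not say whether repeated edges were counted once or many times), uses simplified tweet dictionaries with hypothetical `user` and `text` keys, and the function names are made up for this example.

```python
import re
from collections import Counter

# A Twitter-style @mention: "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Turn tweets into directed (sender, mentioned account) edges."""
    return [(t["user"], name) for t in tweets for name in MENTION.findall(t["text"])]


def degree_tables(edges, top_n=10):
    """Return the top accounts by in-degree and by out-degree."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree.most_common(top_n), out_degree.most_common(top_n)


if __name__ == "__main__":
    sample = [
        {"user": "a", "text": "@x have you seen this? @y"},
        {"user": "b", "text": "@x please watch"},
    ]
    top_in, top_out = degree_tables(mention_edges(sample))
    print(top_in)
    print(top_out)
```

The same function works for retweet edges if each edge is built as (retweeting account, original author) instead of (sender, mentioned account).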
  28. Event Identification Continued [image-only slide]
  29. Event Identification Continued [image-only slide]
  30. Conclusions
      • We aimed to answer the following questions when we started this work:
        – Can we come up with a way of classifying users based on actor types?
        – Can we determine who the opinion leaders or influencers are?
        – Can we determine how information spreads on these networks?
        – Can we detect malicious social network use?
        – Are there information security applications for social network data-mining?
      • I think we did a good job of providing at least some initial answers to these questions
  31. Future Work
      • We have applied for a data grant from Twitter
      • We are in the process of moving our entire dataset to the lab at Clarkson and building up a new capture/analysis system
      • I am planning on pursuing the semantic side of social network analysis
        – Currently only one SNA semantic ontology exists, and it exists only on paper.
        – I am planning on rolling both the actor and event analysis into one approach, which will be part of a new ontology
  32. Acknowledgements
      • I would like to thank:
        – Dr. Matthews
        – Dr. Bay
        – Dr. Lynch
        – Dr. Schuckers
        – Dr. Liu
  33. References
      [1] Gladwell, M. (2000). The Tipping Point. Boston: Little, Brown and Company.
  34. Contact
      whitejs@clarkson.edu
  35. Questions?
  36. Supplemental Material
  37. DDFS [image-only slide]
  39. Twitter JSON Key Fields
      profile_link_color, coordinates, verified, in_reply_to_screen_name, geo, time_zone,
      in_reply_to_status_id, text, statuses_count, in_reply_to_status_id_str, entities, contributors,
      in_reply_to_user_id, place, protected, profile_background_color, contributors_enabled, truncated,
      profile_background_tile, default_profile, retweeted, default_profile_image, description, is_translator,
      follow_request_sent, followers_count, location, friends_count, geo_enabled, favourites_count,
      profile_image_url_https, listed_count, following, profile_background_image_url, notifications, retweet_count,
      background_image_url_https, name, created_at, profile_image_url, lang, favorited,
      sidebar_border_color, use_background_image, id_str, sidebar_fill_color, screen_name, created_at,
      profile_text_color, show_all_inline_media, id, url, utc_offset
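As a small illustration of working with these fields, the sketch below pulls a handful of them out of one raw tweet and tolerates keys that are absent. The choice of keys, the `user.` prefix for nested account fields, and the function name `select_fields` are assumptions made for this example.

```python
import json

# Keys read from the tweet object itself and from its nested "user" object.
TWEET_KEYS = ("id_str", "created_at", "text", "retweet_count", "in_reply_to_screen_name")
USER_KEYS = ("screen_name", "followers_count", "friends_count", "verified")


def select_fields(raw_json):
    """Flatten one tweet's JSON into a dict of selected fields (None if missing)."""
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    record = {key: tweet.get(key) for key in TWEET_KEYS}
    record.update({"user." + key: user.get(key) for key in USER_KEYS})
    return record


if __name__ == "__main__":
    raw = json.dumps({"id_str": "1", "text": "hi",
                      "user": {"screen_name": "a", "verified": False}})
    print(select_fields(raw))
```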
  40. BEK Infectious Account Visualization [image-only slide]
  41. Tensed Predicate Logic Key [image-only slide]
  42. Coalmine User Interface [image-only slide]