Thesis Proposal:
A Quantitative Analysis of the Spread
of Information in Social Networks




Joshua S. White
Advisor: Jeanna N. Matthews, PhD


                                    03/05/13
Outline
 •   Problem
 •   To Date
 •   Recently Completed Work
 •   Current Work
 •   Inspiration
 •   Unanswered Questions
 •   Current Tool Kits
 •   Our Approach
 •   Schedule of Completion
Problem
 • Social media networks are the fastest-growing,
   and the largest, portion of Internet
   content today1

 • These networks have only recently
   (2010-present) been studied in any detail

 • Most work has sampled small portions of the
   network and tried to predict outcomes
   (e.g., predicting political outcomes)

                              1. Tom Pick. "102 Compelling Social Media and
                              Online Marketing Stats and Facts for 2012 (and
                              2013)." Business 2 Community. January 2, 2013
Problem (continued)
[Figure: ACM Digital Library Search Results (Sampled Dec. 2012, Total = 20,796).
 Topics charted: Social Networks and Political Analysis; Using Social Networks
 as Datasets for Machine Learning; Twitter; Actor Types in Any Network; Social
 Network Graphing; Malware and Social Networks; Social Network Memes; Botnets
 and Social Networks; Individuals' Influence on Social Networks; Social Network
 Analysis Tools; Actor Types in Twitter. Group totals shown: 20,132 and 666.]
To Date: Coalmine
• The basis for a Social Network Analysis Tool




                                      Coalmine: an experience in building a
                                      system for social media analytics
                                      JS White, JN Matthews, JL Stacy
                                      SPIE Defense, Security, and
                                      Sensing, 84080A-84080A-11
To Date: Coalmine
• Coalmine
  – Method scales well based on initial tests
  – Manual and automated detection
  – Configurable data collection capabilities
  – Trial-and-error filter design tool
• At the Time (Major Future Work)
  – Rebuild of the tool:
     • Fix scaling limitations
     • More extensible Map/Reduce method
         – Solve map-piping issue
     • Inclusion of multi-job support
     • New storage and distribution method
         – Solve replication and state issues
Coalmine: Data Set Overview
• Over the course of 2012 we collected 165 TB of
  Twitter data (uncompressed)
  – 147 “Full Days”, 100 “Partial Days”
     • Estimated 65 billion tweets
  – In 2012, Twitter traffic was an estimated 175 million tweets per day 1
  – Collection rates between 50% and 80% for “Full Days”
  – Data in JSON format, collected using Twitter's REST API




                                     1. Shea Bennett. "Just How Big Is twitter In 2012
                                     [INFOGRAPHIC]," All Twitter - The Unofficial Twitter
                                     Resource, February 2013
Coalmine: Data Set Overview
• Basic observable patterns
   – Twitter has a lot of outages
   – Posting rates follow predictable patterns
To Date: Phishing Analysis




A method for the automated detection of phishing websites through both site characteristics and image analysis
JS White, JN Matthews, JL Stacy
SPIE Defense, Security, and Sensing, 84080B-84080B-11
To Date: Phishing Analysis
• pHash Process:
   – Reduce the image size to 32 x 32 pixels
   – Reduce the color to greyscale
   – Calculate the DCT (creates frequency scalars)
   – Reduce the DCT to 8 x 8 (the lowest frequencies)
   – Second DCT reduction: set each bit to 1 or 0 depending on
     whether the value is above or below the average DCT value
   – Take the hash
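
The pHash steps listed on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the phishing work: it uses the standard library only, takes an image as a nested list of (r, g, b) tuples, substitutes nearest-neighbor sampling for the resize step, and (as an assumption borrowed from common pHash descriptions) leaves the DC term out of the average. The function names `resize_nearest`, `dct_2d`, `phash`, and `hamming` are hypothetical.

```python
import math

def resize_nearest(pixels, size=32):
    """Shrink (or stretch) an image to size x size by nearest-neighbor sampling."""
    height, width = len(pixels), len(pixels[0])
    return [[pixels[y * height // size][x * width // size]
             for x in range(size)] for y in range(size)]

def to_grey(pixel):
    """Convert an (r, g, b) tuple to a single luminance value."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def dct_1d(vec):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(vec)
    return [sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, v in enumerate(vec)) for k in range(n)]

def dct_2d(matrix):
    """2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def phash(rgb_image):
    """Return a 64-bit perceptual hash as a Python int."""
    small = resize_nearest(rgb_image, 32)                  # 32 x 32 pixels
    grey = [[to_grey(p) for p in row] for row in small]    # greyscale
    freq = dct_2d(grey)                                    # frequency scalars
    low = [v for row in freq[:8] for v in row[:8]]         # top-left 8 x 8
    # Assumption: average over the 63 non-DC terms, since the DC term
    # (overall brightness) would otherwise dominate the mean.
    average = sum(low[1:]) / (len(low) - 1)
    bits = 0
    for value in low:
        bits = (bits << 1) | (1 if value > average else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Synthetic 64 x 64 gradient image and its color-inverted copy.
image = [[(x * 4, y * 4, (x + y) * 2) for x in range(64)] for y in range(64)]
inverted = [[(255 - r, 255 - g, 255 - b) for (r, g, b) in row] for row in image]
h = phash(image)
print(f"{h:016x}", hamming(h, phash(inverted)))
```

Two images are then compared by the Hamming distance between their hashes; a small distance suggests visually similar pages, which is the property image similarity analysis relies on.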
To Date: Phishing Analysis

  • Two Methods:
     – Page characteristic analysis
     – Image similarity analysis


  • Proof of concept system


  • Need for a generically customizable filter
Recently Completed Work:
• BEK Infection Vector Analysis
  – Finished development of a filter for detection of suspect accounts
     • Submitted to IEEE CNS (Conference on Communications and Network Security)
         – “It's you on photo?: Automatic Detection of Twitter Accounts Infected
           With the Blackhole Exploit Kit”
Recently Completed Work:
[Figure: “Normal” vs. “Infectious”]
Current Work:
• KONY2012 Meme Analysis
  – Finished extraction of relevant data, identification of tag
    variants, directed graphs of information flow
     • Preparing for submission to ASONAM (Advances in Social Networks
       Analysis and Mining)
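
To illustrate the kind of processing involved, the sketch below matches hashtag variants and builds a directed graph of retweet flow. It is only a sketch under stated assumptions, not the pipeline used for the KONY2012 study: the variant pattern `TAG_PATTERN` and the helper names are hypothetical, and the input is assumed to be tweet objects in the Twitter REST API JSON layout (`entities.hashtags[].text`, `retweeted_status`, `user.screen_name`).

```python
import re
from collections import defaultdict

# Hypothetical variant pattern: matches tags such as "Kony2012",
# "KONY_2012", or "StopKony". The real variant list is not given here.
TAG_PATTERN = re.compile(r"^(stop)?kony[_\-]?(2012|12)?$", re.IGNORECASE)

def is_tag_variant(tag):
    """True if a hashtag (without '#') looks like a variant of the meme tag."""
    return bool(TAG_PATTERN.match(tag))

def build_flow_graph(tweets):
    """Return {(source_user, retweeting_user): count} for matching retweets."""
    edges = defaultdict(int)
    for tweet in tweets:
        tags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
        if not any(is_tag_variant(t) for t in tags):
            continue  # tweet is not part of the meme
        original = tweet.get("retweeted_status")
        if original is None:
            continue  # an original post adds no edge
        source = original["user"]["screen_name"]
        target = tweet["user"]["screen_name"]
        edges[(source, target)] += 1  # information flowed source -> target
    return dict(edges)

# Small made-up sample in the assumed JSON layout.
tweets = [
    {"user": {"screen_name": "bob"},
     "entities": {"hashtags": [{"text": "Kony2012"}]},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"user": {"screen_name": "carol"},
     "entities": {"hashtags": [{"text": "KONY_2012"}]},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"user": {"screen_name": "dave"},
     "entities": {"hashtags": [{"text": "StopKony"}]}},
    {"user": {"screen_name": "erin"},
     "entities": {"hashtags": [{"text": "lunch"}]},
     "retweeted_status": {"user": {"screen_name": "bob"}}},
]
graph = build_flow_graph(tweets)
print(graph)  # {('alice', 'bob'): 1, ('alice', 'carol'): 1}
```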
Current Work:
• Actor Types Analysis
  – Literature review completed, started identifying statistical and
    temporal characteristics of each type
     • Planned for submission to LEET'13 (USENIX Workshop on Large-Scale
       Exploits and Emergent Threats)
Inspiration
 • Our work was inspired in part by Malcolm
   Gladwell’s book, The Tipping Point 1
   – Life as an epidemic


 • Thinking this way led us to consider the
   spread of information and trends in terms of
   an outbreak for which key people (Mavens,
   Connectors, and Salesmen) are primarily
   responsible.


                           1. Gladwell, M. (2000). The tipping point.
                           Boston: Little, Brown and Company.
Some Unanswered Questions
• Automatic classification of actor types in social networks.
   – Do Gladwell's classifications apply?
       • Connectors, mavens and salesmen
   – Who are the opinion leaders?
• Privacy-related implications of social network analysis
• Do social networks have the level of impact on public
  opinion/mass media that some believe?
   – Can we predict changes in public or individual opinion using
     social network datasets as a base?
   – Can we predict how memes/news will spread?
   – Are individuals covertly manipulating mass media through social
     networks?
• Is there a generally applicable way to identify major events like
  natural disasters as they happen?
Current Tool Kits
 • Tool Kits and Methods:
    – Only one well developed tool kit:
       • NodeXL1
                 – Small Datasets (Under 5000 Nodes)
                 – Built In statistics and data collection
                   capabilities
                 – Built on MS Excel
                 – Allows exploration of group relationships
                 – Highest usage seems to be for politics-related
                   research

  1. Smith, M., Milic-Frayling, N., Shneiderman, B., Mendes Rodrigues, E., Leskovec, J., Dunne, C. (2010).
  NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010,
  http://nodexl.codeplex.com/ from the Social Media Research Foundation, http://www.smrfoundation.org
Approach
• Borrow from traditional “Social Network Analysis” as it
  relates to the study of Sociology


• Most tools can't handle extremely large datasets
   – We employ the MapReduce methodology as our core for data
     analysis


• Treat the analysis system like a filtering system and build
  “rules” for how the data should be processed
      • Each rule is essentially constrained to a single Mapper


• Use case studies based on available data to develop
  individual statistics and rules
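
The "rule as a single Mapper" idea can be sketched in plain Python. This is a minimal in-memory illustration under stated assumptions, not the Coalmine implementation: `hashtag_rule`, `run_rule`, and the simplified record layout are all hypothetical, and a real deployment would run the map and reduce phases on a MapReduce cluster rather than in one process.

```python
from collections import defaultdict

def hashtag_rule(record):
    """Hypothetical rule (one Mapper): emit (hashtag, 1) for each hashtag."""
    for tag in record.get("hashtags", []):
        yield tag.lower(), 1

def run_rule(rule, records):
    """Apply one rule to every record, then reduce by summing per key."""
    groups = defaultdict(list)
    # Map phase: the rule filters each record and emits key/value pairs.
    for record in records:
        for key, value in rule(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: collapse each key's values into a single statistic.
    return {key: sum(values) for key, values in groups.items()}

# Made-up, simplified records standing in for collected tweets.
tweets = [
    {"user": "a", "hashtags": ["KONY2012", "news"]},
    {"user": "b", "hashtags": ["kony2012"]},
    {"user": "c", "hashtags": []},
]
counts = run_rule(hashtag_rule, tweets)
print(counts)  # {'kony2012': 2, 'news': 1}
```

Keeping each rule to a single mapper function means a new case study only needs a new rule; the grouping and reduction code stays unchanged.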
Schedule of Completion:
Questions:

Clarkson - Joshua White - Research Proposal Presentation
