Coalmine:
An Experience in Building a System for Social
Media Analytics


Joshua S. White
Jeanna N. Matthews, PhD
Outline

 •   Problem
 •   Method Overview
 •   Data Collection
 •   Analysis
 •   Case Studies
 •   Conclusion / Future Work
Problem

 • Social Media Networks
   – A communications means for good and bad
      • Proven cases of malware / botnet use
      • SPAM medium
 • Our Goal
   – To provide a generalized tool for analysis of
     potential threats that use these networks for
     communications.
Method Overview
Data Collection
 • Initially (Spring 2011)
    – Twitter approved oAuth application
       • Firehose Subscription with white-listing
           – ~20% of all Tweets
           – (No longer available)
               » Twitter no longer allows researchers to share
                 datasets
               » We needed to develop a new collection method
               » Cannot violate terms of use
• Current
  – Distributed Data Collection Infrastructure
  – Geographically distributed IPs to simulate multiple users
  – Registered Application with Non-authenticated API access
      • ~80 – 100% of all Tweets (1 billion+ / week)
Data Collection
 • Storage
    – Collected as streaming, gzip-compressed Python dict
      records (10:1 compression ratio)
       • Converted to JSON on the fly when needed
          – Initially Stored in HDFS (Had Issues)
              » Recent work uses DDFS
    – Indexed using Lucene
       • New methods are being explored
           – Discodex w/ BSON Store
    – Storing 1.5 TB a Week
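
The storage bullets above can be sketched roughly as follows. This is a minimal illustration, not the Coalmine implementation: it assumes each record is written as one Python dict literal per line inside a gzip stream, and the helper names (iter_records, to_json_lines) are hypothetical.

```python
import ast
import gzip
import io
import json


def iter_records(stream):
    """Yield one dict per non-empty line of a gzip-compressed text stream.

    Assumption: each line holds the repr() of a Python dict.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # literal_eval only parses Python literals; it does not run code.
                yield ast.literal_eval(line)


def to_json_lines(stream):
    """Convert the stored dict records to JSON strings on the fly."""
    return [json.dumps(record, sort_keys=True) for record in iter_records(stream)]


# Build a tiny in-memory archive to stand in for a collected file.
sample = [
    {"id": 1, "text": "hello", "retweeted": False},
    {"id": 2, "text": "world", "retweeted": True},
]
buf = io.BytesIO()
with gzip.open(buf, mode="wt", encoding="utf-8") as out:
    for record in sample:
        out.write(repr(record) + "\n")
buf.seek(0)

json_lines = to_json_lines(buf)
for line in json_lines:
    print(line)
```

Reading line by line keeps memory use flat regardless of archive size, which matters at the volumes listed above.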
Analysis
 • Two Part Method
   – Manual Inspection
     • Query Panel Front-end




   – Automated Inspection
Example Analysis
  Field Name               Description                                         Example Data
  name                     User's real name                                    Text: "Robert Scoble"
  screen_name              User's Twitter username                             Text: "scobleizer"
  profile_image_url        Link to user's profile image                        Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
  url                      Link to user's non-Twitter site                     Link: "http://www.google.com/profiles/scobleizer"
  followers_count          Number of followers user has                        Number: "185496"
  friends_count            Number of people user follows                       Number: "31971"
  utc_offset               Offset from GMT (in seconds)                        Number: "-28800"
  geo_enabled              Whether user has enabled location                   Boolean: "True"
  statuses_count           Number of statuses user has posted                  Number: "53522"

  Tweet Specific Fields
  created_at               Tweet timestamp                                     Text: "Tue Jun 14 18:30:13 +0000 2011"
  id                       Tweet id (useful for URL creation)                  Number: "80703603437875201"
  text                     Contains the actual text + any embedded URLs        Whatever text the person chooses to enter; could be in any supported language
  source                   Links to Twitter client URL (not important)         HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
  in_reply_to_status_id    Number of status that user replied to               Number: "80671170374025220"
  in_reply_to_screen_name  Screen name of user the current status replies to   Text: "danharmon"
  retweet_count            Number of times this status is retweeted            Number: "0"
  retweeted                Whether or not the status has been retweeted        Boolean: "false"

  'geo' flag specific:
  georss:point             Lat. & Long. location                               Number: "43.21227199 -75.39866939"
  url                      Points to a JSON or XML file with further geo info  Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
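
As a rough illustration of how these fields might be consumed during analysis, the sketch below parses a few of them. It assumes the record is a flat dict keyed by the field names in the table; the summarize helper and the tweet URL pattern are assumptions for illustration, not part of the tool.

```python
from datetime import datetime, timedelta, timezone


def summarize(record):
    """Derive a few analysis-friendly values from one flat tweet record."""
    # created_at uses the layout shown in the table, e.g.
    # "Tue Jun 14 18:30:13 +0000 2011".
    created = datetime.strptime(record["created_at"], "%a %b %d %H:%M:%S %z %Y")

    # utc_offset is given in seconds, so it maps directly onto a timedelta.
    user_tz = timezone(timedelta(seconds=int(record["utc_offset"])))

    # georss:point holds latitude and longitude separated by whitespace.
    lat, lon = (float(part) for part in record["georss:point"].split())

    return {
        "created_utc": created,
        "local_time": created.astimezone(user_tz),
        # Assumed URL pattern built from screen_name and id.
        "tweet_url": "https://twitter.com/{}/status/{}".format(
            record["screen_name"], record["id"]
        ),
        "lat": lat,
        "lon": lon,
    }


sample = {
    "screen_name": "scobleizer",
    "utc_offset": "-28800",
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "id": "80703603437875201",
    "georss:point": "43.21227199 -75.39866939",
}
info = summarize(sample)
print(info["tweet_url"])
print(info["local_time"].isoformat())
```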
Case Study: Botnet C2
 • One well known case:
    – Arbor Networks detected first known incident
      in 2009
      • Base64-encoded control signals
    – Soon After:
      • A number of tools released to do the same:
         – ControlMyPC, KreosC2, etc.
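
One way to look for Base64-encoded control signals in tweet text is sketched below. This is an illustrative heuristic under stated assumptions, not the detection method used by Coalmine or Arbor Networks: the minimum token length of 16 and the "decodes to printable ASCII" check are hypothetical choices, and find_base64_payloads is a made-up helper name.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters, optionally followed by padding.
# Assumption: tokens shorter than 16 characters are ignored to reduce noise.
CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_base64_payloads(text):
    """Return decoded strings for tokens that look like Base64 text."""
    payloads = []
    for token in CANDIDATE.findall(text):
        if len(token) % 4 != 0:
            continue  # valid Base64 is always a multiple of four characters
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue
        # Keep only payloads made entirely of printable ASCII characters.
        if raw and all(32 <= byte < 127 for byte in raw):
            payloads.append(raw.decode("ascii"))
    return payloads


encoded = base64.b64encode(b"http://example.com/cmd").decode("ascii")
suspicious = find_base64_payloads("status update " + encoded)
benign = find_base64_payloads("Just had lunch with friends downtown")
print(suspicious)
print(benign)
```

A heuristic like this will miss encrypted or binary payloads and may flag harmless strings, so it is better suited to surfacing candidates for manual inspection than to making final decisions.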
Case Study: Botnet C2
 • Sample Manual Detection:
Case Study: SPAM
 • Twitter's number one problem: it artificially
   increases traffic and bothers legitimate users
 • Easily detected during manual analysis




 • Automated detection based on wording and
   rates at which messages are posted
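
The two signals named above, wording and posting rate, could be combined along the following lines. This is a minimal sketch under assumptions: the thresholds (5 posts per minute, 50% identical messages) and the spam_signals helper are hypothetical and are not taken from the tool.

```python
from collections import Counter
from datetime import datetime, timedelta


def spam_signals(posts, max_per_minute=5.0, max_duplicate_ratio=0.5):
    """Score one account from a list of (timestamp, text) pairs.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    if not posts:
        return {"rate_per_minute": 0.0, "duplicate_ratio": 0.0, "suspicious": False}

    times = sorted(stamp for stamp, _ in posts)
    # Clamp the observation window to at least one minute to avoid
    # inflated rates for very short bursts.
    span_minutes = max((times[-1] - times[0]).total_seconds() / 60.0, 1.0)
    rate = len(posts) / span_minutes

    # Wording signal: share of posts that repeat the most common message,
    # compared after lowercasing and collapsing whitespace.
    normalized = [" ".join(text.lower().split()) for _, text in posts]
    top_count = Counter(normalized).most_common(1)[0][1]
    duplicate_ratio = top_count / len(posts)

    return {
        "rate_per_minute": rate,
        "duplicate_ratio": duplicate_ratio,
        "suspicious": rate > max_per_minute or duplicate_ratio > max_duplicate_ratio,
    }


start = datetime(2011, 6, 14, 18, 30, 0)
burst = [(start + timedelta(seconds=2 * i), "Free  followers HERE") for i in range(20)]
normal = [
    (start, "Heading to the conference"),
    (start + timedelta(hours=1), "Great talk on analytics"),
    (start + timedelta(hours=2), "Dinner time"),
]
burst_result = spam_signals(burst)
normal_result = spam_signals(normal)
print(burst_result)
print(normal_result)
```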
Conclusion / Future Work
 • Coalmine - A tool for Social Media Analysis
   – Scales well based on initial tests
   – Useful for both manual and automated detection
 • Future (Current) Work
   – Rebuild of the tool to fix scaling limitations
      •   More extensible Map/Reduce method
      •   Inclusion of native multi-threading capability
      •   New storage and distribution method
      •   New algorithms for automated opinion leader detection
Questions




            ?
References

Coalmine spie 2012 presentation - jsw -d3
