Coalmine:
An Experience in Building a System for Social Media Analytics


Joshua S. White
Jeanna N. Matthews, PhD
Outline

 •   Problem
 •   Method Overview
 •   Data Collection
 •   Analysis
 •   Case Studies
 •   Conclusion / Future Work
Problem

 • Social Media Networks
   – A communications means for good and bad
      • Proven cases of malware / botnet use
      • SPAM medium
 • Our Goal
   – To provide a generalized tool for analysis of
     potential threats that use these networks for
     communications.
Method Overview
Data Collection
 • Initially (Spring 2011)
    – Twitter-approved OAuth application
       • Firehose Subscription with white-listing
           – ~20% of all Tweets
           – (No longer available)
               » Twitter no longer allows researchers to share
                 datasets
               » We needed to develop a new collection method
               » The new method cannot violate the terms of use
• Current
  – Distributed Data Collection Infrastructure
  – Geographically dissimilar IPs to simulate multiple users
  – Registered Application with Non-authenticated API access
      • ~80 – 100% of all Tweets (1 billion+ / week)
Data Collection
 • Storage
    – Collected as a gzip-compressed stream of Python dicts
      (10:1 compression ratio)
       • Converted to JSON on the fly when needed
          – Initially Stored in HDFS (Had Issues)
              » Recent work uses DDFS
    – Indexed using Lucene
       • New methods are being explored
           – Discodex w/ BSON Store
    – Storing 1.5 TB a Week
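
The storage bullets above can be sketched roughly as follows. This is a minimal illustration, not the Coalmine code: it assumes one Python dict literal per line inside the gzip stream, and the helper names `iter_records` and `to_json_lines` are hypothetical.

```python
import ast
import gzip
import io
import json


def iter_records(stream):
    """Yield one dict per non-empty line of a gzip-compressed stream.

    Assumption: each line holds a single Python dict literal.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # literal_eval parses Python literals without executing code
                yield ast.literal_eval(line)


def to_json_lines(stream):
    """Convert each stored dict to a JSON string only when it is requested."""
    for record in iter_records(stream):
        yield json.dumps(record, sort_keys=True)


# Build a tiny in-memory sample in the assumed layout.
sample = io.BytesIO()
with gzip.open(sample, mode="wt", encoding="utf-8") as out:
    out.write(repr({"screen_name": "scobleizer", "retweeted": False}) + "\n")
    out.write(repr({"screen_name": "danharmon", "retweeted": True}) + "\n")
sample.seek(0)

json_lines = list(to_json_lines(sample))
```

Because both helpers are generators, records are decompressed and converted one at a time, so a large archive never has to fit in memory.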
Analysis
 • Two Part Method
   – Manual Inspection
     • Query Panel Front-end




   – Automated Inspection
Example Analysis
  Field Name                Description                                        Example Data
  name                      User's real name                                   Text: "Robert Scoble"
  screen_name               User's Twitter username                            Text: "scobleizer"
  profile_image_url         Link to user's profile image                       Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
  url                       Link to user's non-Twitter site                    Link: "http://www.google.com/profiles/scobleizer"
  followers_count           Number of followers user has                       Number: "185496"
  friends_count             Number of people user follows                      Number: "31971"
  utc_offset                Offset from GMT (in seconds)                       Number: "-28800"
  geo_enabled               Whether user has enabled location                  Boolean: "True"
  statuses_count            Number of statuses user has posted                 Number: "53522"

  Tweet-specific fields
  created_at                Tweet timestamp                                    Text: "Tue Jun 14 18:30:13 +0000 2011"
  id                        Tweet id (useful for URL creation)                 Number: "80703603437875201"
  text                      Contains the actual text + any embedded URLs       Whatever text the person chooses to enter (could be any supported language)
  source                    Link to the Twitter client's URL (not important)   HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
  in_reply_to_status_id     Number of the status that the user replied to      Number: "80671170374025220"
  in_reply_to_screen_name   Screen name of user the current status replies to  Text: "danharmon"
  retweet_count             Number of times this status is retweeted           Number: "0"
  retweeted                 Whether or not the status has been retweeted       Boolean: "false"

  'geo' flag specific:
  georss:point              Lat. & Long. location                              Number: "43.21227199 -75.39866939"
  url                       Points to a JSON or XML file with further geo info Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
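
A short sketch of how a few of these fields might be used during analysis. It takes the example values from the table as a flat dict. The helper names are hypothetical, and the status URL layout (`twitter.com/<screen_name>/status/<id>`) is an assumption that the slides do not spell out.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the example table. The slide shows numbers as quoted
# strings, so they are converted explicitly below.
tweet = {
    "screen_name": "scobleizer",
    "utc_offset": "-28800",
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "id": "80703603437875201",
}


def parse_created_at(value):
    """Parse the created_at text into a timezone-aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def local_time(record):
    """Shift the tweet timestamp by the user's utc_offset (in seconds)."""
    offset = timezone(timedelta(seconds=int(record["utc_offset"])))
    return parse_created_at(record["created_at"]).astimezone(offset)


def status_url(record):
    """Build a link to the tweet (assumed URL layout, see lead-in)."""
    return "https://twitter.com/{}/status/{}".format(
        record["screen_name"], int(record["id"])
    )


created = parse_created_at(tweet["created_at"])
posted_local = local_time(tweet)
link = status_url(tweet)
```

With the table's values, an offset of -28800 seconds is eight hours behind GMT, so the 18:30 timestamp corresponds to 10:30 in the user's configured zone.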
Case Study: Botnet C2
 • One well-known case:
    – Arbor Networks detected the first known incident
      in 2009
      • Base 64 encoded control signals
    – Soon After:
      • A number of tools released to do the same:
         – ControlMyPC, KreosC2, etc.
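
As a rough illustration of what looking for Base64-encoded control signals could involve, the sketch below pulls Base64-looking tokens out of tweet text and keeps those that decode to mostly printable ASCII. The regular expression, the 16-character minimum and the 0.9 printable threshold are illustrative assumptions, not values from the slides.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters, optionally padded (assumed minimum: 16).
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(text, min_printable=0.9):
    """Return decoded strings for tokens that look like Base64 payloads."""
    hits = []
    for token in B64_TOKEN.findall(text):
        if len(token) % 4 != 0:
            continue  # valid padded Base64 is always a multiple of 4 long
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue
        printable = sum(32 <= byte < 127 for byte in raw)
        # Keep only payloads that decode to mostly readable text
        if raw and printable / len(raw) >= min_printable:
            hits.append(raw.decode("ascii", errors="replace"))
    return hits


# Hypothetical payload for demonstration only.
encoded = base64.b64encode(b"http://example.com/cmd").decode("ascii")
suspicious = decode_candidates("status update " + encoded)
benign = decode_candidates("Just had a great lunch with friends")
```

A filter like this only narrows the candidate set for manual inspection. Long URL path fragments or hashtags can also match the pattern, which is why the printable-text check is applied after decoding.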
Case Study: Botnet C2
 • Sample Manual Detection:
Case Study: SPAM
 • Twitter's number one problem: it artificially
   increases traffic and bothers legitimate users
 • Easily detected during manual analysis




 • Automated detection based on wording and
   rates at which messages are posted
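
One way the wording-plus-rate idea could be sketched is shown below. The word list and both thresholds are hypothetical placeholders, and the tweets are assumed to carry an already-parsed `created_at` datetime. The slides do not describe the actual detection rules.

```python
from collections import defaultdict
from datetime import datetime

SPAM_WORDS = {"free", "winner", "click"}  # hypothetical word list
MAX_PER_MINUTE = 5.0                      # hypothetical rate threshold
MIN_SPAM_FRACTION = 0.5                   # hypothetical wording threshold


def flag_spammers(tweets):
    """Flag users whose posting rate and wording both exceed the thresholds."""
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet["screen_name"]].append(tweet)

    flagged = set()
    for user, items in by_user.items():
        times = [t["created_at"] for t in items]
        # Observed window in minutes, floored at one minute to avoid
        # dividing by a near-zero span
        span = max((max(times) - min(times)).total_seconds() / 60.0, 1.0)
        rate = len(items) / span
        spammy = sum(
            1 for t in items if SPAM_WORDS & set(t["text"].lower().split())
        )
        if rate > MAX_PER_MINUTE and spammy / len(items) >= MIN_SPAM_FRACTION:
            flagged.add(user)
    return flagged


# Synthetic data: one account posting every 5 seconds, one posting hourly.
tweets = [
    {"screen_name": "bot", "text": "click here free prize",
     "created_at": datetime(2011, 6, 14, 18, 30, second)}
    for second in range(0, 60, 5)
]
tweets += [
    {"screen_name": "human", "text": "lunch was great",
     "created_at": datetime(2011, 6, 14, hour, 0, 0)}
    for hour in (12, 13)
]
flagged = flag_spammers(tweets)
```

Requiring both signals keeps a fast but legitimate poster, or a slow account that happens to use a flagged word, from being marked on one criterion alone.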
Conclusion / Future Work
 • Coalmine - A tool for Social Media Analysis
   – Scales well based on initial tests
   – Useful for both manual and automated detection
 • Future (Current) Work
   – Rebuild of the tool to fix scaling limitations
      •   More extensible Map/Reduce method
      •   Inclusion of native multi-threading capability
      •   New storage and distribution method
      •   New algorithms for automated opinion leader detection
Questions




            ?

Coalmine spie 2012 presentation - jsw -d3
