Coalmine spie 2012 presentation - jsw -d3



  1. Coalmine: An Experience in Building a System for Social Media Analytics
     Joshua S. White
     Jeanna N. Matthews, PhD
  2. Outline
     • Problem
     • Method Overview
     • Data Collection
     • Analysis
     • Case Studies
     • Conclusion / Future Work
  3. Problem
     • Social Media Networks
       – A means of communication, for good and bad
         • Proven cases of malware / botnet use
         • Spam medium
     • Our Goal
       – To provide a generalized tool for analyzing potential threats that use these networks for communication.
  4. Method Overview
  5. Data Collection
     • Initially (Spring 2011)
       – Twitter-approved OAuth application
         • Firehose subscription with white-listing
           – ~20% of all Tweets
           – (No longer available)
             » Twitter no longer allows researchers to share datasets
             » We needed to develop a new collection method
             » The new method cannot violate the terms of use
  6. (Data Collection, continued)
     • Current
       – Distributed data collection infrastructure
       – Geographically dissimilar IPs to simulate multiple users
       – Registered application with non-authenticated API access
         • ~80-100% of all Tweets (1 billion+ / week)
  7. Data Collection
     • Storage
       – Collected in streaming gzip Python dict format (10:1 compression ratio)
         • Converted to JSON on the fly when needed
       – Initially stored in HDFS (had issues)
         » Recent work uses DDFS
       – Indexed using Lucene
         • New methods are being explored
           – Discodex w/ BSON store
       – Storing 1.5 TB a week
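The storage slide describes records kept as gzip-compressed Python dict text and converted to JSON only when needed. A minimal sketch of that conversion follows. It assumes one dict literal per line, which the slides do not state, and the function name `dict_lines_to_json` is hypothetical rather than part of Coalmine.

```python
import ast
import gzip
import io
import json


def dict_lines_to_json(stream):
    """Yield one JSON string per Python-dict literal line in a gzip stream.

    Assumption: each line holds a single repr()-style dict.
    `stream` may be a file path or a binary file object.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # literal_eval parses Python literals without executing code.
            record = ast.literal_eval(line)
            yield json.dumps(record)


def demo():
    """Write one sample record to an in-memory gzip stream and convert it."""
    raw = io.BytesIO()
    with gzip.open(raw, mode="wt", encoding="utf-8") as out:
        out.write(repr({"id": 1, "geo_enabled": True}) + "\n")
    raw.seek(0)
    return list(dict_lines_to_json(raw))
```

Reading line by line keeps memory use flat regardless of file size, which matters at the stated volume of 1.5 TB a week.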
  8. Analysis
     • Two-part method
       – Manual inspection
         • Query panel front-end
       – Automated inspection
  9. Example Analysis
     Field Name | Description | Example Data
     name | User's real name | Text: "Robert Scoble"
     screen_name | User's Twitter username | Text: "scobleizer"
     profile_image_url | Link to user's profile image | Link: "…fanatiguy_normal.jpg"
     url | Link to user's non-Twitter site | Link: ""
     followers_count | Number of followers user has | Number: "185496"
     friends_count | Number of people user follows | Number: "31971"
     utc_offset | Offset from GMT (in seconds) | Number: "-28800"
     geo_enabled | Whether user has enabled location | Boolean: "True"
     statuses_count | Number of statuses user has posted | Number: "53522"
     Tweet Specific Fields
     created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
     id | Tweet id (useful for URL creation) | Number: "80703603437875201"
     text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter. <- Could be any language supported.
     source | Links to Twitter client URL <- not important | HTML code: "<a href="" rel="nofollow">Echofon</a>"
     in_reply_to_status_id | Number of status that user replied to | Number: "80671170374025220"
     in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
     retweet_count | Number of times this status is retweeted | Number: "0"
     retweeted | Whether or not the status has been retweeted | Boolean: "false"
     geo flag specific:
     georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
     url | Points to a JSON or XML file with further GEO info. | Link: ""
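Two of the fields listed on this slide usually need light processing before analysis: `created_at` is a text timestamp, and `id` is described as useful for URL creation. The sketch below shows one way to handle both. The helper names are hypothetical, and the `https://twitter.com/<screen_name>/status/<id>` pattern is an assumption about the URL the slide has in mind, not something the slides state.

```python
from datetime import datetime, timedelta, timezone


def parse_created_at(value):
    """Parse a created_at string such as 'Tue Jun 14 18:30:13 +0000 2011'.

    %a and %b are locale-dependent; this assumes an English locale.
    """
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def user_local_time(created_at, utc_offset):
    """Shift the tweet timestamp by the user's utc_offset (in seconds)."""
    user_zone = timezone(timedelta(seconds=utc_offset))
    return parse_created_at(created_at).astimezone(user_zone)


def tweet_url(screen_name, tweet_id):
    """Build a status URL from a screen name and tweet id (assumed pattern)."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"
```

With the example values from the slide, `user_local_time("Tue Jun 14 18:30:13 +0000 2011", -28800)` falls at 10:30:13 on the same day in the user's zone. The example rows may not all come from the same tweet, so pairing a screen name and an id from the slide is illustrative only.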
  10. Case Study: Botnet C2
      • One well-known case:
        – Arbor Networks detected the first known incident in 2009
          • Base64-encoded control signals
        – Soon after:
          • A number of tools were released to do the same:
            – ControlMyPC, KreiosC2, etc.
  11. Case Study: Botnet C2
      • Sample manual detection:
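The botnet case study rests on control signals posted as Base64 text. As a rough illustration of how such tweets could be surfaced for manual review, the sketch below looks for tokens that decode cleanly to printable ASCII. This is not the detection method from the presentation; the 16-character minimum and the function name `find_base64_payloads` are assumptions chosen for the example.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters, optionally followed by padding.
# The 16-character minimum is an arbitrary threshold to skip short words.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_base64_payloads(text):
    """Return (token, decoded_text) pairs for tokens that look like Base64."""
    hits = []
    for match in B64_TOKEN.finditer(text):
        token = match.group(0)
        if len(token) % 4 != 0:
            continue  # valid padded Base64 has a length divisible by 4
        try:
            decoded = base64.b64decode(token, validate=True)
            plain = decoded.decode("ascii")
        except (binascii.Error, ValueError):
            continue  # not Base64, or decodes to non-ASCII bytes
        if plain.isprintable():
            hits.append((token, plain))
    return hits
```

A filter like this only narrows the candidate set. Ordinary long tokens can pass it by chance, so flagged tweets would still need the manual inspection step described on the slide.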
  12. Case Study: Spam
      • Twitter's number one problem: it artificially increases traffic and bothers legitimate users
      • Easily detected during manual analysis
      • Automated detection based on wording and the rate at which messages are posted
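The slide names two signals for automated spam detection: wording and posting rate. A minimal sketch combining the two is shown here. The keyword set, the thresholds, and the `(user, timestamp, text)` tuple layout are all hypothetical; the slides do not describe the actual rules.

```python
from collections import defaultdict, deque

# Hypothetical keyword list; a real system would use a far richer model.
SPAM_WORDS = {"free", "winner", "click"}


def flag_spammers(tweets, max_posts=5, window_seconds=60, min_spam_words=2):
    """Flag users by posting rate or wording.

    `tweets` is an iterable of (user, unix_timestamp, text) tuples.
    A user is flagged after more than `max_posts` tweets inside a sliding
    `window_seconds` window, or after one tweet containing at least
    `min_spam_words` distinct keywords.
    """
    recent = defaultdict(deque)  # per-user timestamps inside the window
    flagged = set()
    for user, ts, text in sorted(tweets, key=lambda t: t[1]):
        times = recent[user]
        times.append(ts)
        while ts - times[0] > window_seconds:
            times.popleft()  # drop timestamps that left the window
        words = {w.strip(".,!?:;").lower() for w in text.split()}
        too_fast = len(times) > max_posts
        spammy = len(words & SPAM_WORDS) >= min_spam_words
        if too_fast or spammy:
            flagged.add(user)
    return flagged
```

Keeping only the timestamps inside the window bounds per-user memory, which suits a streaming collection pipeline.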
  13. Conclusion / Future Work
      • Coalmine - A tool for Social Media Analysis
        – Scales well based on initial tests
        – Useful for both manual and automated detection
      • Future (Current) Work
        – Rebuild of the tool to fix scaling limitations
          • More extensible Map/Reduce method
          • Inclusion of native multi-threading capability
          • New storage and distribution method
          • New algorithms for automated opinion leader detection
  14. Questions?
  15. References
  16. References
  17. References