Coalmine spie 2012 presentation - jsw -d3
Published in: Technology, Business
Transcript

  • 1. Coalmine: An Experience in Building a System for Social Media Analytics. Joshua S. White; Jeanna N. Matthews, PhD
  • 2. Outline • Problem • Method Overview • Data Collection • Analysis • Case Studies • Conclusion / Future Work
  • 3. Problem • Social Media Networks – A communications means for good and bad • Proven cases of malware / botnet use • SPAM medium • Our Goal – To provide a generalized tool for analysis of potential threats that use these networks for communications.
  • 4. Method Overview
  • 5. Data Collection • Initially (Spring 2011) – Twitter-approved OAuth application • Firehose subscription with white-listing – ~20% of all Tweets – (No longer available) » Twitter no longer allows researchers to share datasets » We needed to develop a new collection method » Cannot violate terms of use
  • 6. • Current – Distributed Data Collection Infrastructure – Geographically dissimilar IPs to simulate multiple users – Registered Application with Non-authenticated API access • ~80 – 100% of all Tweets (1 billion+ / week)
  • 7. Data Collection • Storage – Collection in streaming gzip Python dict format (10:1 compression ratio) • Converted to JSON on the fly when needed – Initially stored in HDFS (had issues) » Recent work uses DDFS – Indexed using Lucene • New methods are being explored – Discodex w/ BSON store – Storing 1.5 TB a week
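A minimal sketch of the storage format described on this slide, assuming each collected record is written as one Python dict literal per line inside a gzip stream. The one-record-per-line layout, the helper names `iter_records` and `to_json_lines`, and the sample records are illustrative assumptions, not details confirmed by the slides.

```python
import ast
import gzip
import io
import json


def iter_records(stream):
    """Yield one dict per non-empty line of a gzip stream of Python dict literals."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # literal_eval only parses Python literals; it does not execute code.
                yield ast.literal_eval(line)


def to_json_lines(stream):
    """Convert each stored record to a JSON string on the fly."""
    for record in iter_records(stream):
        yield json.dumps(record, ensure_ascii=False)


# Build a tiny in-memory archive standing in for a collected file (hypothetical data).
sample = [
    {"screen_name": "scobleizer", "geo_enabled": True, "retweet_count": 0},
    {"screen_name": "danharmon", "geo_enabled": False, "retweet_count": 3},
]
buffer = io.BytesIO()
with gzip.open(buffer, mode="wt", encoding="utf-8") as out:
    for record in sample:
        out.write(repr(record) + "\n")
buffer.seek(0)

json_lines = list(to_json_lines(buffer))
```

The conversion step matters because a Python dict literal is not valid JSON: `True` must become `true` and single-quoted strings must become double-quoted ones.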
  • 8. Analysis • Two Part Method – Manual Inspection • Query Panel Front-end – Automated Inspection
  • 9. Example Analysis
      Field Name | Description | Example Data
      name | User's real name | Text: "Robert Scoble"
      screen_name | User's Twitter username | Text: "scobleizer"
      profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
      url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
      followers_count | Number of followers user has | Number: "185496"
      friends_count | Number of people user follows | Number: "31971"
      utc_offset | Offset from GMT (in seconds) | Number: "-28800"
      geo_enabled | Whether user has enabled location | Boolean: "True"
      statuses_count | Number of statuses user has posted | Number: "53522"
      Tweet-specific fields:
      created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
      id | Tweet id (useful for URL creation) | Number: "80703603437875201"
      text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter (could be any supported language)
      source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
      in_reply_to_status_id | Number of status that user replied to | Number: "80671170374025220"
      in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
      retweet_count | Number of times this status is retweeted | Number: "0"
      retweeted | Whether or not the status has been retweeted | Boolean: "false"
      Geo-flag-specific fields:
      georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
      url | Points to a JSON or XML file with further GEO info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
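A short sketch of how a few of the fields listed above might be interpreted, using the example values from the slide. The flat record layout and the helper names `parse_created_at`, `tweet_url`, and `parse_point` are assumptions made for illustration; the slides do not show how Coalmine itself parses records.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout matching the slide's example: "Tue Jun 14 18:30:13 +0000 2011"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def tweet_url(screen_name, tweet_id):
    """Build a status URL from the username and tweet id."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"


def parse_point(value):
    """Split a georss:point string into (latitude, longitude) floats."""
    lat, lon = value.split()
    return float(lat), float(lon)


# Hypothetical flat record built from the example values on the slide.
record = {
    "screen_name": "scobleizer",
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "id": 80703603437875201,
    "utc_offset": -28800,
    "georss:point": "43.21227199 -75.39866939",
}

created = parse_created_at(record["created_at"])
# utc_offset is given in seconds, so it converts directly into a fixed-offset zone.
local = created.astimezone(timezone(timedelta(seconds=record["utc_offset"])))
url = tweet_url(record["screen_name"], record["id"])
point = parse_point(record["georss:point"])
```

Here `utc_offset` of -28800 seconds is eight hours behind GMT, so the 18:30:13 timestamp corresponds to 10:30:13 in the user's configured zone.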
  • 10. Case Study: Botnet C2 • One well-known case: – Arbor Networks detected the first known incident in 2009 • Base64-encoded control signals – Soon after: • A number of tools were released to do the same: – ControlMyPC, KreosC2, etc.
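Since the incident above relied on Base64-encoded control signals, one simple automated check is to look for long Base64-looking tokens in tweet text that decode to mostly printable bytes. The sketch below is a heuristic written for illustration only; the function name `decode_candidates`, the 16-character minimum, and the 0.9 printable-ratio threshold are all assumptions, and this is not presented as Coalmine's detection method.

```python
import base64
import binascii
import re
import string

# Runs of at least 16 Base64 alphabet characters, with optional padding (assumed minimum).
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
PRINTABLE = set(string.printable.encode("ascii"))


def decode_candidates(text, min_printable=0.9):
    """Return decoded strings for tokens that look like Base64-encoded text."""
    hits = []
    for token in BASE64_TOKEN.findall(text):
        if len(token) % 4 != 0:
            continue  # valid padded Base64 has a length divisible by 4
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue
        if not raw:
            continue
        # Fraction of decoded bytes that are printable ASCII.
        printable = sum(byte in PRINTABLE for byte in raw) / len(raw)
        if printable >= min_printable:
            hits.append(raw.decode("ascii", errors="replace"))
    return hits


# Hypothetical status text carrying an encoded URL.
encoded = base64.b64encode(b"http://example.com/cmd.txt").decode("ascii")
suspicious = decode_candidates(f"check this out {encoded}")
benign = decode_candidates("Just a normal status update about lunch")
```

A heuristic like this will produce false positives on long ordinary tokens, so it fits best as a filter that narrows what a person then inspects manually.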
  • 11. Case Study: Botnet C2 • Sample Manual Detection:
  • 12. Case Study: SPAM • Twitter's number one problem; it artificially increases traffic and bothers legitimate users • Easily detected during manual analysis • Automated detection based on wording and the rates at which messages are posted
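The slide names two signals for automated SPAM detection: wording and posting rate. A minimal sketch of combining them follows; the term list, both thresholds, and the function names are hypothetical choices made for illustration, not values taken from the presentation.

```python
from collections import defaultdict

# Hypothetical spam vocabulary; a real list would be derived from labelled data.
SPAM_TERMS = {"free", "winner", "click", "followers"}


def wording_score(text):
    """Fraction of words in the text that appear in SPAM_TERMS."""
    words = [w.strip(".,!?:;\"'").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in SPAM_TERMS for w in words) / len(words)


def posts_per_minute(timestamps):
    """Posting rate over the observed span; timestamps are in seconds."""
    if len(timestamps) < 2:
        return 0.0
    span = max(timestamps) - min(timestamps)
    if span <= 0:
        return float(len(timestamps))
    return len(timestamps) / (span / 60.0)


def flag_spammers(tweets, max_rate=5.0, min_score=0.2):
    """Flag users whose rate and average wording score both exceed thresholds."""
    by_user = defaultdict(list)
    for user, timestamp, text in tweets:
        by_user[user].append((timestamp, text))
    flagged = set()
    for user, items in by_user.items():
        rate = posts_per_minute([ts for ts, _ in items])
        score = sum(wording_score(text) for _, text in items) / len(items)
        if rate > max_rate and score >= min_score:
            flagged.add(user)
    return flagged


# Hypothetical sample: one account posting every 2 seconds, one posting rarely.
tweets = [("bot", i * 2, "Click here for free followers!") for i in range(10)]
tweets += [("human", 0, "Heading to the conference"), ("human", 600, "Great talk today")]
flagged = flag_spammers(tweets)
```

Requiring both signals keeps a fast but legitimate poster, or a slow account that happens to use a listed word, from being flagged on one signal alone.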
  • 13. Conclusion / Future Work • Coalmine - A tool for Social Media Analysis – Scales well based on initial tests – Useful for both manual and automated detection • Future (Current) Work – Rebuild of the tool to fix scaling limitations • More extensible Map/Reduce method • Inclusion of native multi-threading capability • New storage and distribution method • New algorithms for automated opinion leader detection
  • 14. Questions ?
  • 15. References
  • 16. References
  • 17. References
