Coalmine: An Experience in Building a System for Social Media Analytics
Joshua S. White, Jeanna N. Matthews, PhD
Outline • Problem • Method Overview • Data Collection • Analysis • Case Studies • Conclusion / Future Work
Problem • Social Media Networks – A means of communication for both good and bad • Proven cases of malware / botnet use • SPAM medium • Our Goal – To provide a generalized tool for analyzing potential threats that use these networks for communication.
• Current – Distributed Data Collection Infrastructure – Geographically dissimilar IPs to simulate multiple users – Registered Application with Non-authenticated API access • ~80 – 100% of all Tweets (1 billion+ / week)
Data Collection • Storage – Collection in Streaming Gzip Python Dict. Format (10:1 Compression Ratio) • Converted to JSON on the fly when needed – Initially Stored in HDFS (Had Issues) » Recent work uses DDFS – Indexed using Lucene • New methods are being explored – Discodex w/ BSON Store – Storing 1.5 TB a Week
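The storage bullets above (gzip-compressed Python dict records, converted to JSON on the fly) can be sketched roughly as follows. This is a minimal illustration, not Coalmine's actual code: it assumes one dict literal per line, and the function names `iter_records`, `to_json_lines`, and `make_sample` are hypothetical.

```python
import ast
import gzip
import io
import json


def iter_records(stream):
    """Yield one dict per line from a gzip stream of Python dict literals.

    Assumption: each record is stored as the repr() of a dict on its own line.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # literal_eval parses literals only; it never executes code.
                yield ast.literal_eval(line)


def to_json_lines(stream):
    """Convert records to JSON strings lazily, one record at a time."""
    for record in iter_records(stream):
        yield json.dumps(record, sort_keys=True)


def make_sample():
    """Build a tiny in-memory gzip archive with made-up records."""
    buf = io.BytesIO()
    with gzip.open(buf, mode="wt", encoding="utf-8") as out:
        out.write(repr({"id": 1, "text": "hello", "retweeted": False}) + "\n")
        out.write(repr({"id": 2, "text": "world", "retweeted": True}) + "\n")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for json_line in to_json_lines(make_sample()):
        print(json_line)
```

Because both functions are generators, only one record is held in memory at a time, which matters at the data volumes quoted on this slide.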
Analysis • Two Part Method – Manual Inspection • Query Panel Front-end – Automated Inspection
Example Analysis
Field Name | Description | Example Data
name | User's real name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"
Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter. <- Could be any language supported.
source | Links to Twitter client URL <- not important | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | Number of status that user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"
Geo Flag Specific Fields
georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further GEO info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
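As a rough illustration of how a few of the fields above might be interpreted during analysis, the sketch below parses `created_at`, applies `utc_offset`, builds a status link from `id`, and splits a `georss:point` value. The helper names are hypothetical, and the status URL pattern is an assumption based only on the "useful for URL creation" note, not something the slides specify.

```python
from datetime import datetime, timedelta, timezone

# Format of created_at as shown in the example data.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def local_time(created_at, utc_offset_seconds):
    """Shift a tweet timestamp by the user's utc_offset (given in seconds)."""
    tz = timezone(timedelta(seconds=int(utc_offset_seconds)))
    return parse_created_at(created_at).astimezone(tz)


def status_url(screen_name, tweet_id):
    """Build a link to a status. The URL pattern is an assumption."""
    return "http://twitter.com/{}/status/{}".format(screen_name, tweet_id)


def parse_point(value):
    """Split a georss:point string into (latitude, longitude) floats."""
    lat, lon = value.split()
    return float(lat), float(lon)


if __name__ == "__main__":
    print(local_time("Tue Jun 14 18:30:13 +0000 2011", "-28800"))
    print(status_url("scobleizer", "80703603437875201"))
    print(parse_point("43.21227199 -75.39866939"))
```

Note that the example data stores numbers as quoted strings, so the sketch converts them explicitly rather than assuming numeric types.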
Case Study: Botnet C2 • One well-known case: – Arbor Networks detected the first known incident in 2009 • Base64-encoded control signals – Soon after: • A number of tools were released to do the same: – ControlMyPC, KreiosC2, etc.
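One simple way to look for Base64-encoded control signals in tweet text is sketched below. This is an illustrative heuristic only, not the detection method used by Coalmine or Arbor Networks: the 16-character minimum, the printable-ASCII check, and the name `decode_candidates` are all assumptions.

```python
import base64
import binascii
import re

# Runs of 16+ Base64 alphabet characters with optional padding.
# The length threshold is an arbitrary assumption to skip ordinary words.
CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(text):
    """Return decoded payloads for Base64-looking tokens found in text.

    A token counts only if it decodes cleanly to printable ASCII, which
    filters out most long words and URLs that merely resemble Base64.
    """
    hits = []
    for token in CANDIDATE.findall(text):
        if len(token) % 4 != 0:
            continue  # valid padded Base64 has a length divisible by 4
        try:
            raw = base64.b64decode(token, validate=True)
            decoded = raw.decode("ascii")
        except (binascii.Error, ValueError):
            continue  # not Base64, or not ASCII once decoded
        if decoded.isprintable():
            hits.append(decoded)
    return hits


if __name__ == "__main__":
    # Made-up payload purely for demonstration.
    sample = base64.b64encode(b"download http://example.com/x").decode("ascii")
    print(decode_candidates("status " + sample))
```

A heuristic like this will still produce false positives and misses, so its output is better treated as a shortlist for manual inspection than as a verdict.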
Case Study: Botnet C2 • Sample Manual Detection:
Case Study: SPAM • Twitter's number one problem: it artificially increases traffic and bothers legitimate users • Easily detected during manual analysis • Automated detection is based on message wording and the rate at which messages are posted
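The two automated signals named above, wording and posting rate, could be combined along the following lines. This is a minimal sketch under stated assumptions: the phrase list, the threshold of 3 posts per one-minute window, and the function names are all hypothetical and are not taken from Coalmine.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SPAM_PHRASES = ("free followers", "click here", "work from home")  # hypothetical
MAX_POSTS_PER_WINDOW = 3          # hypothetical threshold
WINDOW = timedelta(minutes=1)     # hypothetical window size


def has_spam_wording(text):
    """Case-insensitive check for any listed phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SPAM_PHRASES)


def fast_posters(tweets):
    """Return users with more than MAX_POSTS_PER_WINDOW posts in any WINDOW.

    tweets is an iterable of (screen_name, posted_at, text) tuples.
    """
    by_user = defaultdict(list)
    for name, posted_at, _text in tweets:
        by_user[name].append(posted_at)
    flagged = set()
    for name, times in by_user.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Slide the window start forward until it spans at most WINDOW.
            while times[end] - times[start] > WINDOW:
                start += 1
            if end - start + 1 > MAX_POSTS_PER_WINDOW:
                flagged.add(name)
                break
    return flagged


def flag_spam(tweets):
    """Report users flagged by each signal separately."""
    tweets = list(tweets)
    wording = {name for name, _when, text in tweets if has_spam_wording(text)}
    return {"wording": wording, "rate": fast_posters(tweets)}


if __name__ == "__main__":
    t0 = datetime(2011, 6, 14, 18, 30, 0)
    demo = [("bot", t0 + timedelta(seconds=5 * i), "hello") for i in range(5)]
    demo.append(("carol", t0, "Click HERE for free followers"))
    print(flag_spam(demo))
```

Keeping the two signals separate in the result lets an analyst decide whether to require both or either before treating an account as a spammer.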
Conclusion / Future Work • Coalmine - A tool for Social Media Analysis – Scales well based on initial tests – Useful for both manual and automated detection • Future (Current) Work – Rebuild of the tool to fix scaling limitations • More extensible Map/Reduce method • Inclusion of native multi-threading capability • New storage and distribution method • New algorithms for automated opinion leader detection