This graph shows the Michael Jackson Effect, with a strong uptick in the half hour following the announcement of his death on Hollywood celebrity site TMZ.com (doctors at UCLA Hospital had announced the death 18 minutes before that). Twitter crashed temporarily under the load at 3:15 PM. Twitter has about 200 million users and is the 9th most popular site on the web. On a typical day Twitter handles 200 million tweets and 1.6 billion search queries.
Could also make a fancier PostScript plot with: post(fit, file = "tree.ps", title = "Classification Tree for Exploit Tweets")
Mining tweets for security information (rev 2)
Mining Tweets for Security Information with “R”
Jeff Stanton, School of Information Studies, Syracuse University
Twitter: Early Warning System?
@highfours: I just watched a plane crash into the hudson rive in manhattan
@ReallyVirtual: Helicopter hovering above Abbotabad at 1AM (is a rare event).
Anatomy of a Tweet
• 140 characters max
• @petridishes – Screen Name
• #blackberry – User-created hashtag
• @crozzledhearts – “Retweeter” who sent this tweet after receiving it from @petridishes
• 30 minutes ago via web – Each tweet encoded with UTC timecode
• No URLs here, but they are auto-shortened
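The parts of a tweet listed above can be pulled out of the raw text with regular expressions. A minimal sketch in base R, assuming a made-up tweet; the text and the helper name extract_all are invented for illustration and are not from the slides:

```r
# Hypothetical tweet text, invented for illustration only
tweet <- "RT @petridishes: my #blackberry just froze again"

# Helper (hypothetical name): return every match of a pattern in one string
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text))[[1]]
}

mentions   <- extract_all("@[[:alnum:]_]+", tweet)  # screen names
hashtags   <- extract_all("#[[:alnum:]_]+", tweet)  # user-created hashtags
is_retweet <- grepl("^RT @", tweet)                 # retweet marker at the start

print(mentions)
print(hashtags)
print(nchar(tweet) <= 140)  # within the 140-character limit?
```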
“R” Facts
• A GNU open source project
• An implementation of the “S” statistical language developed at Bell Labs
• Largely an interpreted, command-line interface with some GUI add-ons
• More than 4300 add-on packages developed by the user community
• Full-featured data management and matrix manipulation with performance comparable to Octave and MATLAB
• Extensive graphics for visualization
• Starting in 2010, used by more data miners (43%) than any other single tool
The “twitteR” package
• Developed by Jeff Gentry (Fidelity)
• Five classes and 11 functions to:
  ◦ Authenticate to Twitter with OAuth and check current rate limit
  ◦ Manipulate, send, and receive direct messages
  ◦ Update user status
  ◦ Search for tweets containing particular keywords or hashtags
  ◦ Examine topic trends
  ◦ Examine timelines
Getting Ready – Load Packages
• Use the R “Packages” menu to install the necessary packages: bitops, RJSONIO, RCurl, and twitteR
• Depending upon Mac/Win/Linux, you may need to retrieve a zipped file of RCurl from:
  ◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
• Then ready the packages for use in R with the library() command:
> library(bitops)
> library(RCurl)
> library(RJSONIO)
> library(twitteR)
Search Twitter for “#exploit”
> expTweets <- searchTwitter("#exploit", n=500)
> expDF <- do.call("rbind", lapply(expTweets, as.data.frame))
• The second command above takes the raw tweet data in expTweets – which starts as a list/collection of separate data objects – and binds it into a single data frame for ease of analysis
• lapply() applies a function to each element of a list
• as.data.frame is a type coercion that turns each tweet object into a one-row data frame
• rbind is the function that joins separate objects to become rows in a data frame
• do.call() calls rbind once, passing every element of the list as a separate argument
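The do.call/lapply idiom is easier to see on a small list that needs no Twitter connection. A sketch with three invented records standing in for the search results; the field values are hypothetical:

```r
# Stand-in for the list returned by a search: three hypothetical records
tweets <- list(
  list(text = "first tweet",  screenName = "userA"),
  list(text = "second tweet", screenName = "userB"),
  list(text = "third tweet",  screenName = "userA")
)

# Step 1: lapply() coerces each list element to a one-row data frame
rows <- lapply(tweets, as.data.frame, stringsAsFactors = FALSE)

# Step 2: do.call() calls rbind() once, with every element of `rows`
# as a separate argument: rbind(rows[[1]], rows[[2]], rows[[3]])
tweetDF <- do.call("rbind", rows)

print(dim(tweetDF))
print(tweetDF$screenName)
```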
A Preview of the Data
> head(expDF,1)
text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age Verification Plugin http://t.co/O8wVjKca #Exploit
favorited: FALSE
replyToSN: NA
created: 2012-01-10 18:19:11
truncated: FALSE
replyToSID: <NA>
id: 156802281747124224
replyToUID: NA
statusSource: <a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>
screenName: NotaThreat2u
Visualizing the Data: When Tweeted?
> head(expDF$created,1)
"2012-01-10 18:19:11 UTC"
• The created variable is conveniently coded as a POSIX time variable calibrated to UTC
> hist(expDF$created, breaks=15, freq=TRUE)
• Shows a frequency histogram (with about 15 break points)
• Nice spike at 18:00 UTC (about 1pm EST)
[Figure: Histogram of expDF$created. x-axis: expDF$created, from 13:50 to 10:40; y-axis: Frequency, from 0 to 20]
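The spike can also be located numerically by counting tweets per hour. A minimal sketch, assuming a handful of invented timestamps in place of expDF$created:

```r
# Hypothetical timestamps standing in for expDF$created
created <- as.POSIXct(c("2012-01-10 17:55:00", "2012-01-10 18:05:00",
                        "2012-01-10 18:19:11", "2012-01-10 18:40:00",
                        "2012-01-10 20:15:00"), tz = "UTC")

# Truncate each timestamp to the start of its hour, then count per hour
hourLabel <- format(created, "%Y-%m-%d %H:00", tz = "UTC")
perHour <- table(hourLabel)
print(perHour)

# The busiest hour is the name attached to the largest count
busiest <- names(perHour)[which.max(perHour)]
print(busiest)

# The same instant shown in US Eastern time (needs the system time zone database)
print(format(created[3], "%H:%M", tz = "America/New_York"))
```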
Prepare a Heatmap
# Total time between 1st and last tweet, in hours
elapsedTime = as.numeric(difftime(max(expDF$created), min(expDF$created), units="hours"))
timeBin = floor(elapsedTime/11) # Bin width in whole hours, for about 11 bins
# Add a new variable with the bin designators
expDF$slice = floor(as.numeric(difftime(expDF$created, min(expDF$created), units="secs"))/(timeBin*3600))
expSlices<-expDF[,c("screenName","slice")] # subset the data
expTable<-table(expSlices) # Count tweets in each slice
# Convert table data to matrix that heatmap() expects
expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
rownames(expMatrix)<-rownames(expTable)
# Label the columns; seq_len() matches however many slices actually occur
colnames(expMatrix)<-paste("Slice",seq_len(ncol(expMatrix)))
heatmap(expMatrix,Rowv=NA,Colv=NA, col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
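The binning step is the part most likely to go wrong, because subtracting POSIX times yields a difftime whose units R picks automatically. A sketch on invented data that states the units explicitly and fixes the number of bins; the data frame toy and the choice of four bins are assumptions for illustration:

```r
# Hypothetical data standing in for expDF (screen names and times invented)
toy <- data.frame(
  screenName = c("userA", "userB", "userA", "userC", "userB", "userA"),
  created = as.POSIXct("2012-01-10 12:00:00", tz = "UTC") +
    3600 * c(0, 1.5, 4, 6.2, 9, 11.9),
  stringsAsFactors = FALSE
)

nBins <- 4
# Elapsed hours since the first tweet, with the units stated explicitly
hoursIn <- as.numeric(difftime(toy$created, min(toy$created), units = "hours"))
binWidth <- max(hoursIn) / nBins

# Slice index from 1 to nBins; pmin() keeps the last tweet in the final bin
toy$slice <- pmin(floor(hoursIn / binWidth) + 1, nBins)

# Fixing the factor levels keeps empty slices as all-zero columns
counts <- table(toy$screenName, factor(toy$slice, levels = 1:nBins))
countMatrix <- matrix(counts, ncol = nBins,
                      dimnames = list(rownames(counts), paste("Slice", 1:nBins)))
print(countMatrix)

# The matrix can then be drawn with, for example:
# heatmap(countMatrix, Rowv = NA, Colv = NA)
```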
Do Some Parsing with Regex
library(stringr) # Provides easy string functions
str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line
Regular expression matching any number of alphanumeric characters or underscore: [[:alnum:]_]*
str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name
expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
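The same pattern also works without stringr. A base R sketch on three invented tweet texts; it swaps regexpr() and regmatches() in for str_match() and adds one step to strip the "RT @" prefix, so it is an alternative and not the code from the slide:

```r
# Hypothetical tweet texts; only the first two are retweets
txt <- c("RT @hacktalkblog: New Exploit [webapps]",
         "RT @some_user99 check this out",
         "Not a retweet, but mentions RT @nobody")

pattern <- "^RT @[[:alnum:]_]*"

# regexpr() finds the first match in each string (-1 when there is none)
m <- regexpr(pattern, txt)

# Fill in matches where found; leave NA elsewhere
rtSN <- rep(NA_character_, length(txt))
rtSN[m != -1] <- regmatches(txt, m)

# Strip the "RT @" prefix to keep just the screen name
rtSN <- sub("^RT @", "", rtSN)
print(rtSN)
```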
Most common keywords
[Figure: bar chart of keyword frequencies, axis from 0 to 30. Keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
Common keywords to explore
• #security – Another good hashtag to search on
• (SQL) Injection – Apparently one of the most common attacks
• cross (site) scripting – Another popular attack
• #cyber #cyberwar #ccureit #hacker – More hashtags?
• remote vulnerability, buffer (overflow), cms, wordpress, phpmydirectory
• Each/any of these keywords could provide a basis for a new tweet search term, or for keyword detection within a set of tweets obtained from another search, or for an alert dashboard with periodic updates
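The slides do not show how the keyword counts were produced. One possible way, sketched in base R on three invented tweet texts, is to split the text into tokens and tabulate them:

```r
# Hypothetical tweet texts standing in for expDF$text
txt <- c("New Exploit: SQL injection in CMS #security",
         "Remote SQL injection exploit #security #hacker",
         "Buffer overflow exploit published")

# Lower-case, split on whitespace, and flatten into one vector of tokens
tokens <- unlist(strsplit(tolower(txt), "[[:space:]]+"))

# Trim leading/trailing punctuation but keep a leading '#' on hashtags
tokens <- gsub("^[^[:alnum:]#]+|[^[:alnum:]]+$", "", tokens)
tokens <- tokens[tokens != ""]

# Tabulate and sort so the most frequent keywords come first
freq <- sort(table(tokens), decreasing = TRUE)
print(head(freq, 5))
```

Common words such as "and", "of", and "for" would normally be dropped with a stop-word list before reading the results.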
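Must Remove the Non-Tweets
@shitaesy Je me couche à 20h30 en ce moment.. Jai même lu ce soir :3 #exploit
(French; roughly: "I'm going to bed at 8:30 pm these days.. I even read tonight :3". Here "exploit" means a feat or achievement, not a software exploit.)
• Scanning across a sample of the tweets, some are spam or off-topic and should be filtered out
• Can we create a classifier that will get rid of the non-exploit tweets?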
Anatomy of a Classifier
[Diagram labels: Attribute 1, Attribute 2, Attribute 3]
Initial model developed with training data
Model accuracy checked on training data, but later cross-validated on new data
• Attributes can be boolean or numeric
• Most useful if independent of other attributes
• The fewer the attributes the better
Create and Add Training Data
write.table(expDF, sep=",", file="exploitData.csv")
# I looked at the tweets and added
# training data, using my judgment to code
# the non-tweets
truExp = read.table("truExp.txt")
# Add to the existing data
expDFtrue = cbind(expDF, truExp)
# Note: new variable name defaults to “V1”
Easy Predictors with Grepl()
# Create true/false values for each row,
# based on whether the string exists
expDFtrue$hassec = grepl("security",tolower(expDFtrue$text))
# Also count some punctuation to see if
# there are clues there
expDFtrue$numhash = sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
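A sketch of building several predictors at once on invented tweet texts. The column names follow the coefficient names used in the deck, but how the author computed numdot and twtlen is not shown, so those two lines are assumptions. Punctuation is counted here by deleting every other character, which also catches a character at the very end of the text (strsplit() drops a trailing match):

```r
# Hypothetical tweet texts standing in for expDFtrue$text
txt <- c("New #Security alert: SQL injection #exploit",
         "Je me couche tot ce soir #exploit",
         "Buffer overflow... patch now")

feat <- data.frame(
  hassec  = grepl("security", tolower(txt)),  # keyword present?
  hasbuf  = grepl("buffer", tolower(txt)),
  # Count '#' and '.' characters by deleting everything else
  numhash = nchar(gsub("[^#]", "", txt)),
  numdot  = nchar(gsub("[^.]", "", txt)),
  twtlen  = nchar(txt)                        # tweet length in characters
)
print(feat)
```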
Choose Best Attributes
Coefficients (output from logit analysis):
              Estimate    Std. Error  z value  Pr(>|z|)
(Intercept)   2.946e+00   1.904e+00    1.547   0.12187
hassecTRUE    2.302e+00   1.186e+00    1.941   0.05222 .
hassqlTRUE    1.770e+01   4.076e+03    0.004   0.99653
hasbufTRUE   -2.476e+00   1.124e+00   -2.203   0.02757 *
hasscrTRUE    1.832e+01   4.135e+03    0.004   0.99647
hasremTRUE   -3.075e-01   1.109e+00   -0.277   0.78164
hascybTRUE    1.202e+00   1.958e+00    0.614   0.53937
numhash      -2.046e+00   7.218e-01   -2.835   0.00458 **
numast       -2.182e+01   5.554e+03   -0.004   0.99687
numdot        6.306e-01   4.167e-01    1.513   0.13017
twtlen        6.548e-03   2.329e-02    0.281   0.77854
# security, buffer keywords are promising, as well as the
# number of hash marks and the number of dots/periods
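A coefficient table of this shape is what R's glm() reports for a logistic regression. The call behind the slide is not shown, so the following is a sketch on invented training data; the label column isExploit and all values are hypothetical:

```r
# Hypothetical training data: isExploit is the hand-coded label,
# hassec and numhash are predictors like those in the table above
train <- data.frame(
  isExploit = c(1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1),
  hassec    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
                TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  numhash   = c(1, 2, 1, 3, 1, 4, 3, 2, 2, 4, 1, 3, 3, 2)
)

# Logistic regression: family = binomial uses the logit link by default
fit <- glm(isExploit ~ hassec + numhash, data = train, family = binomial)

# Coefficient table: estimates, standard errors, z values, p values
coefTable <- summary(fit)$coefficients
print(round(coefTable, 3))

# Predicted probability of being an exploit tweet for each training row
train$pHat <- predict(fit, type = "response")
```

Very large standard errors, as on hassqlTRUE, hasscrTRUE, and numast in the table above, usually mean that the attribute separates the classes almost perfectly in the training sample, so its estimate is unstable.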
Twitter: Early Warning System?
• Conclusion 1: R is pretty handy for grabbing and manipulating tweet data
• Conclusion 2: Tweet data are messy and require a good deal of clean-up, parsing, and filtering
• Conclusion 3: As these two examples suggest, tweets can provide breaking news about vulnerabilities and exploits
  ◦ WordPress Age Verification plugin versions 0.4 and below open redirect vulnerability
    Exploit availability tweeted at 12:19 PM
    Blogged at SecurityBlog 10:24 PM
    Added to SiloBreaker two days later
  ◦ Pragyan CMS v 3.0 Remote File Disclosure
    Exploit availability tweeted at 11:07 AM
    Appeared on PacketStorm next day
    On RealHacker three days later
    On WebCriminal.ru eight days later