Mining tweets for security information (rev 2)


Published in: Education, Technology
  • This graph shows the Michael Jackson Effect: a strong uptick in tweets in the half hour following the announcement of his death on a Hollywood celebrity site (doctors at UCLA Hospital had announced the death 18 minutes before that). Twitter crashed temporarily under the load at 3:15 PM. Twitter has about 200 million users and is the 9th most popular site on the web. On a typical day Twitter handles 200 million tweets and 1.6 billion search queries.
  • Could also make a fancier PostScript plot with: post(fit, file = "", title = "Classification Tree for Exploit Tweets")

    1. 1. Mining Tweets for Security Information with “R”
          Jeff Stanton, School of Information Studies, Syracuse University
    2. 2. Twitter: Early Warning System?
          • @highfours: I just watched a plane crash into the hudson rive in manhattan
          • @ReallyVirtual: Helicopter hovering above Abbotabad at 1AM (is a rare event).
    3. 3. Twitter Facts
          [Graph: tweet volume over time, with markers at 2:26 PM, 2:44 PM, and 3:15 PM and the labels “UCLA” and “Twitter”]
    4. 4. Anatomy of a Tweet
          • 140 characters max
          • @petridishes – Screen Name
          • #blackberry – User-created hashtag
          • @crozzledhearts – “Retweeter” who sent this tweet after receiving it from @petridishes
          • 30 minutes ago via web – Each tweet encoded with UTC timecode
          • No URLs here, but they are auto-shortened
    5. 5. “R” – Open Source Analytics
    6. 6. “R” Facts
          • A GNU open source project
          • An implementation of the “S” statistical language developed at Bell Labs
          • Largely an interpreted, command-line interface with some GUI add-ons
          • More than 4300 add-on packages developed by the user community
          • Full-featured data management and matrix manipulation with performance comparable to Octave and MATLAB
          • Extensive graphics for visualization
          • Starting in 2010, used by more data miners (43%) than any other single tool
    7. 7. The “twitteR” package
          • Developed by Jeff Gentry (Fidelity)
          • Five classes and 11 functions to:
            ◦ Authenticate to Twitter with OAuth and check current rate limit
            ◦ Manipulate, send, and receive direct messages
            ◦ Update user status
            ◦ Search for tweets containing particular keywords or hashtags
            ◦ Examine topic trends
            ◦ Examine timelines
    8. 8. Getting Ready – Load Packages
          • Use the R “Packages” menu to install the necessary packages: bitops, RJSONIO, RCurl, and twitteR
          • Depending upon Mac/Win/Linux, you may need to retrieve a zipped file of RCurl from:
            ◦ b/2.14/
          • Then ready the packages for use in R with the library() command:
            > library(bitops)
            > library(RCurl)
            > library(RJSONIO)
            > library(twitteR)
    9. 9. Search Twitter for “#exploit”
          > expTweets <- searchTwitter("#exploit", n=500)
          > expDF <- do.call("rbind", lapply(expTweets, as.data.frame))
          • The second command above takes the raw tweet data in expTweets – which starts as a list/collection of separate data objects (frames) – and binds it into a single data frame for ease of analysis
          • lapply() applies a command to each element of a list
          • as.data.frame() is a type coercion
          • rbind() is the function that joins separate objects to become rows in a data frame
          • do.call() applies the rbind to all elements of the list
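
The rbind/lapply pattern on this slide can be tried without a Twitter connection. The following is a minimal sketch that assumes each tweet can be stood in for by a small named list; `toyTweets` and its fields are hypothetical, and real twitteR status objects carry more fields.

```r
# Hypothetical stand-ins for tweet objects: one small named list per tweet.
toyTweets <- list(
  list(text = "New exploit posted #exploit", screenName = "userA"),
  list(text = "RT @userA: New exploit posted #exploit", screenName = "userB"),
  list(text = "Patch released #security", screenName = "userC")
)

# lapply() coerces each element to a one-row data frame;
# do.call() then hands all of those data frames to a single rbind() call.
toyDF <- do.call("rbind", lapply(toyTweets, as.data.frame))

print(dim(toyDF))                      # rows = tweets, columns = fields
print(as.character(toyDF$screenName))
```

Calling do.call("rbind", ...) once is usually preferable to growing a data frame inside a loop, since the loop would copy the accumulated rows on every iteration.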
    10. 10. A Preview of the Data
            > head(expDF,1)
            text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age Verification Plugin #Exploit
            favorited: FALSE
            replyToSN: NA
            created: 2012-01-10 18:19:11
            truncated: FALSE
            replyToSID: <NA>
            id: 156802281747124224
            replyToUID: NA
            statusSource: &lt;a href=&quot;; rel=&quot;nofollow&quot;&gt;twitterfeed&lt;/a&gt;
            screenName: NotaThreat2u
    11. 11. Visualizing the Data: When Tweeted?
            > head(expDF$created,1)
            [1] "2012-01-10 18:19:11 UTC"
            • The created variable is conveniently coded as a POSIX time variable calibrated to UTC
            > hist(expDF$created, breaks=15, freq=TRUE)
            • Shows a frequency histogram (with about 15 break points)
            • Nice spike at 18:00 UTC (about 1pm EST)
            [Figure: “Histogram of expDF$created”; x-axis expDF$created from 13:50 to 10:40, y-axis Frequency from 0 to 20]
    12. 12. Prepare a Heatmap
            # Total time between 1st and last tweet, in hours
            elapsedTime = as.numeric(difftime(max(expDF$created), min(expDF$created), units="hours"))
            timeBin = floor(elapsedTime/11) # Make 11 bins
            # Add a new variable with the bin designators
            # (seconds since the first tweet divided by the bin width in seconds)
            expDF$slice = floor(as.numeric(difftime(expDF$created, min(expDF$created), units="secs"))/(as.integer(timeBin)*3600))
            expSlices<-expDF[,c("screenName","slice")] # subset the data
            expTable<-table(expSlices) # Count tweets in each slice
            # Convert table data to matrix that heatmap() expects
            expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
            rownames(expMatrix)<-rownames(expTable)
            colnames(expMatrix)<-paste("Slice",1:ncol(expMatrix)) # 12 slices in this data set
            heatmap(expMatrix,Rowv=NA,Colv=NA, col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
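
The time-slicing step can be checked on a handful of timestamps. This is a sketch of one equal-width binning variant, not the slide's exact arithmetic: the timestamps, `nBins`, and the clamping of the last tweet into the final bin are all assumptions made for illustration.

```r
# Hypothetical timestamps spanning roughly the range shown in the histogram.
created <- as.POSIXct(c("2012-01-10 13:50:00", "2012-01-10 18:19:11",
                        "2012-01-10 18:45:00", "2012-01-11 10:40:00"),
                      tz = "UTC")

nBins <- 4  # assumed number of slices for this toy example

# Seconds since the first tweet, with the unit stated explicitly so the
# result does not depend on difftime's automatic unit selection.
secs <- as.numeric(difftime(created, min(created), units = "secs"))

# Equal-width bins; pmin() keeps the very last tweet inside the final bin.
binWidth <- max(secs) / nBins
slice <- pmin(floor(secs / binWidth), nBins - 1) + 1

# Count tweets per slice, keeping empty slices visible.
print(table(factor(slice, levels = 1:nBins)))
```

Wrapping the slice in factor(..., levels = 1:nBins) keeps empty slices as zero-count columns, which matters when the column labels of a heatmap are assigned by position.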
    13. 13. [Figure: heatmap of tweet counts by screen name (rows) and time slice (columns Slice1 to Slice12). Row labels: xoMC_DDL, TheRomamane, TheKingNappy, lauura_5, Sara_Katelyn, kapitanluffy, Zf1r3, CyberCrimeNEWS, CcureIT, Brain_0verride, drb0n3z, sapo2025, packet_storm, cybfor, csec, Federico_II, cloeliae, manero94, Hamoud_Oz, belmontemartin, pretorienx, secwatched, cedricpernet, g4l4drim, unixfreaxjp, theBestRhiannon, bortzmeyer, macmark_de, CyberDomain, cinnamon_carter, binushacker, escan_sachin, shadowy47, iWorlds_it, hacktalkblog, NotaThreat2u]
    14. 14. Do Some Parsing with Regex
            library(stringr) # Provides easy string functions
            str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line
            • Regular expression matching any number of alphanumeric characters or underscore: [[:alnum:]_]*
            str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name
            expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
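
The same pattern can be exercised with base R alone. This is a minimal sketch that swaps in regexpr() and regmatches() for stringr's str_match(); the sample tweets are made up, and stripping the "RT @" prefix is an extra step not shown on the slide.

```r
# Hypothetical sample tweets: two retweets and one original tweet.
tweets <- c("RT @hacktalkblog: New Exploit [webapps] #Exploit",
            "Original tweet mentioning @packet_storm #exploit",
            "RT @packet_storm: Remote file disclosure")

# Same regular expression as on the slide: "RT @" at the start of the text,
# followed by any run of alphanumeric characters or underscores.
pattern <- "^RT @[[:alnum:]_]*"

hits <- regexpr(pattern, tweets)           # -1 where there is no match
rtSN <- rep(NA_character_, length(tweets))
rtSN[hits > 0] <- regmatches(tweets, hits) # fill in matched tweets only

# Extra step (not on the slide): drop the "RT @" prefix to keep the bare name.
rtSN <- sub("^RT @", "", rtSN)
print(rtSN)
```

Because the pattern is anchored with ^, a mention in the middle of a tweet (the second sample) is left as NA rather than counted as a retweet.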
    15. 15. plot(as.factor(expDF$rtSN),las=2)
            [Figure: bar plot of retweet counts, y-axis from 0 to 14, with bars for RT @_joviann_, RT @cedricpernet, RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, and RT @unixfreaxjp]
    16. 16. Make a Keyword List
            # Split each distinct tweet into words (as.character() so this also works when text is not a factor)
            exploitWords = strsplit(unique(as.character(expDF$text))," ")
            exploitWords = unlist(exploitWords)
            # Strip retweet markers, screen names, the search hashtag, URLs, and punctuation
            exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
            exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
            exploitWords = str_replace_all(exploitWords, "#Exploit","")
            exploitWords = str_replace_all(exploitWords, "#exploit","")
            exploitWords = str_replace_all(exploitWords, "^http.*","")
            exploitWords = str_replace_all(exploitWords, ":","")
            exploitWords = str_replace_all(exploitWords, "_","")
            exploitWords = str_replace_all(exploitWords, "-","")
            exploitWords = tolower(exploitWords)
            exploitWords = sort(exploitWords)
            # Count each word, then drop the most frequent entry and anything seen 4 times or fewer
            wordCount = summary(as.factor(exploitWords))
            wordCount = wordCount[wordCount<(max(wordCount)-1)]
            wordCount = wordCount[wordCount>4]
            barplot(wordCount,las=2)
    17. 17. Most common keywords
            [Figure: bar plot of keyword counts, y-axis from 0 to 30. Keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
    18. 18. Common keywords to explore
            • #security – Another good hashtag to search on
            • (SQL) Injection – Apparently one of the most common attacks
            • cross (site) scripting – Another popular attack
            • #cyber #cyberwar #ccureit #hacker – More hashtags?
            • remote vulnerability, buffer (overflow), cms, wordpress, phpmydirectory
            • Each/any of these keywords could provide a basis for a new tweet search term, or for keyword detection within a set of tweets obtained from another search, or for an alert dashboard with periodic updates
    19. 19. Must Remove the Non-Tweets
            • @shitaesy Je me couche à 20h30 en ce moment.. Jai même lu ce soir :3 #exploit (French: “I go to bed at 8:30 pm these days.. I even read tonight”)
            • Scanning across a sample of the tweets, some are spam and should be filtered out
            • Can we create a classifier that will get rid of the non-exploit tweets?
    20. 20. Anatomy of a Classifier
            [Diagram: Attribute 1, Attribute 2, Attribute 3]
            • Initial model developed with training data
            • Model accuracy checked on training data, but later cross-validated on new data
            • Attributes
              ◦ Attribute can be boolean or numeric
              ◦ Most useful if independent of other attributes
              ◦ The fewer the attributes the better
    21. 21. Create and Add Training Data
            write.table(expDF, sep=",", file="exploitData.csv")
            # I looked at the tweets and added
            # training data, using my judgment to code
            # the non-tweets
            truExp = read.table("truExp.txt")
            # Add to the existing data
            expDFtrue = cbind(expDF, truExp)
            # Note: new variable name defaults to “V1”
    22. 22. Easy Predictors with Grepl()
            # Create true/false values for each row,
            # based on whether the string exists
            expDFtrue$hassec = grepl("security",tolower(expDFtrue$text))
            # Also count some punctuation to see if
            # there are clues there
            expDFtrue$numhash = sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
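
Both predictors are easy to verify on a few made-up tweets. The sketch below uses hypothetical sample text; `numhashAlt` is an assumed alternative that counts the "#" characters directly, which avoids relying on how strsplit() treats a "#" at the very end of a string.

```r
# Hypothetical sample tweets.
text <- c("SQL injection alert #security #exploit",
          "Je me couche tot #exploit",
          "Buffer overflow in CMS #Security #cyber #exploit")

# TRUE/FALSE keyword flag, case-insensitive via tolower()
hassec <- grepl("security", tolower(text))

# The slide's approach: split on "#" and count the pieces, minus one
numhash <- sapply(strsplit(text, "#"), length) - 1

# Assumed alternative: count the literal "#" matches directly
numhashAlt <- lengths(regmatches(text, gregexpr("#", text, fixed = TRUE)))

print(data.frame(hassec, numhash, numhashAlt))
```

The two counts agree here; they can differ for a tweet that ends in "#", because strsplit() does not return a trailing empty piece.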
    23. 23. Choose Best Attributes
            Coefficients (output from Logit analysis):
                          Estimate    Std. Error  z value  Pr(>|z|)
            (Intercept)   2.946e+00   1.904e+00    1.547   0.12187
            hassecTRUE    2.302e+00   1.186e+00    1.941   0.05222 .
            hassqlTRUE    1.770e+01   4.076e+03    0.004   0.99653
            hasbufTRUE   -2.476e+00   1.124e+00   -2.203   0.02757 *
            hasscrTRUE    1.832e+01   4.135e+03    0.004   0.99647
            hasremTRUE   -3.075e-01   1.109e+00   -0.277   0.78164
            hascybTRUE    1.202e+00   1.958e+00    0.614   0.53937
            numhash      -2.046e+00   7.218e-01   -2.835   0.00458 **
            numast       -2.182e+01   5.554e+03   -0.004   0.99687
            numdot        6.306e-01   4.167e-01    1.513   0.13017
            twtlen        6.548e-03   2.329e-02    0.281   0.77854
            # security, buffer keywords are promising, as well as the
            # number of hash marks and the number of dots/periods
    24. 24. Classification Tree Works OK
            library(rpart)
            fit <- rpart(V1 ~ hassec + hasbuf + numhash + numdot, method="class", data=expDFtrue)
            summary(fit)
            plot(fit, uniform=TRUE, margin=0.1, branch=0.5, compress=TRUE)
            text(fit, use.n=TRUE, all=TRUE, cex=.8)
            [Figure: classification tree. Root node: split on numhash>=2.5, class 1, 26/71. Left leaf: class 0, 14/2. Right leaf: class 1, 12/69]
            • “numhash” only retained attribute, split at 2.5
            • Overall 14 errors (83/97 = 85.6% correct)
            • 12 false positives (12/97 = 12.4% FP)
            • 2 false negatives (2/97 = 2.1% FN)
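
The error rates on this slide follow directly from the two leaf counts in the tree plot (14/2 and 12/69). As a small worked check, assuming each leaf is read as "class 0 count / class 1 count" (the variable names here are illustrative only):

```r
# Leaf counts read off the tree plot: class-0 count / class-1 count per leaf.
leaves <- data.frame(predicted = c(0, 1),   # class assigned by each leaf
                     n0 = c(14, 12),        # tweets whose true class is 0
                     n1 = c(2, 69))         # tweets whose true class is 1

total    <- sum(leaves$n0, leaves$n1)
falseNeg <- leaves$n1[leaves$predicted == 0]  # true 1, predicted 0
falsePos <- leaves$n0[leaves$predicted == 1]  # true 0, predicted 1
accuracy <- (total - falseNeg - falsePos) / total

print(c(total = total, falsePos = falsePos, falseNeg = falseNeg))
print(round(100 * c(accuracy = accuracy,
                    fpRate = falsePos / total,
                    fnRate = falseNeg / total), 1))
```

These figures are computed on the training data, so they are likely optimistic until the tree is checked against new tweets.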
    25. 25. Twitter: Early Warning System?
            • Conclusion 1: R is pretty handy for grabbing and manipulating tweet data
            • Conclusion 2: Tweet data are messy and require a good deal of clean-up, parsing, and filtering
            • Conclusion 3: As these two examples suggest, tweets can provide breaking news about vulnerabilities and exploits
              ◦ WordPress Age Verification plugin versions 0.4 and below open redirect vulnerability
                • Exploit availability tweeted at 12:19 PM
                • Blogged at SecurityBlog 10:24 PM
                • Added to SiloBreaker two days later
              ◦ Pragyan CMS v 3.0 Remote File Disclosure
                • Exploit availability tweeted at 11:07 AM
                • Appeared on PacketStorm next day
                • On RealHacker three days later
                • On eight days later
    26. 26. Image from: