Mining tweets for security information (rev 2)

Mining Tweets for
Security Information
with “R”
Jeff Stanton, School of Information Studies
Syracuse University

@highfours: I just
watched a plane
crash into the
hudson rive in
manhattan

@ReallyVirtual:
Helicopter
hovering above
Abbotabad at 1AM
(is a rare event).

Twitter: Early Warning System?

2:26 PM  UCLA
2:44 PM  TMZ.com
3:15 PM  Twitter

Twitter Facts

 140 characters max
 @petridishes – Screen Name
 #blackberry – User-created hashtag
 @crozzledhearts – “Retweeter” who sent this
tweet after receiving it from @petridishes
 30 minutes ago via web – Each tweet
encoded with UTC timecode
 No URLs here, but they are auto-shortened

Anatomy of a Tweet

“R” – Open Source Analytics

 A GNU open source project
 An implementation of the “S” statistical
language developed at Bell Labs
 Largely an interpreted, command-line
interface with some GUI add-ons
 More than 4300 add-on packages developed
by the user community
 Full-featured data management and matrix
manipulation with performance comparable
to Octave and MATLAB
 Extensive graphics for visualization
 Starting in 2010, used by more data miners
(43%) than any other single tool

“R” Facts

 Developed by Jeff
Gentry (Fidelity)
 Five classes and 11
functions to:
◦ Authenticate to Twitter
with OAuth and check
current rate limit
◦ Manipulate, send, and
receive direct messages
◦ Update user status
◦ Search for tweets
containing particular
keywords or hashtags
◦ Examine topic trends
◦ Examine timelines

The “twitteR” package

 Use the R “Packages” menu to install the
necessary packages:
bitops, RJSONIO, RCurl, and twitteR
 Depending upon Mac/Win/Linux, you may need
to retrieve a zipped file of RCurl from:
◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
 Then ready the packages for use in R with the
library() command:
> library(bitops)
> library(RCurl)
> library(RJSONIO)
> library(twitteR)

Getting Ready – Load Packages

> expTweets <- searchTwitter('#exploit', n=500)
> expDF <- do.call("rbind", lapply(expTweets, as.data.frame))

 The second command above takes the raw tweet data in
expTweets – which starts as a list/collection of separate
data objects (frames) – and binds it into a single data
frame for ease of analysis
 lapply() applies a command to each element of a list
 as.data.frame is a type coercion
 rbind is the function that joins separate objects to become
rows in a dataframe
 do.call() makes a single call to rbind, passing every
element of the list as an argument
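
The binding step can be tried without a Twitter connection. The following is a minimal sketch, assuming a hypothetical list of three one-row data frames standing in for the converted tweets; the names fakeTweets and fakeDF are illustrative and do not come from twitteR.

```r
# Hypothetical stand-in for the output of lapply(expTweets, as.data.frame):
# a list of one-row data frames, one per tweet
fakeTweets <- list(
  data.frame(text="New exploit posted", screenName="userA",
             stringsAsFactors=FALSE),
  data.frame(text="RT @userA: New exploit posted", screenName="userB",
             stringsAsFactors=FALSE),
  data.frame(text="Patch released", screenName="userC",
             stringsAsFactors=FALSE)
)

# do.call() builds one call, equivalent to
# rbind(fakeTweets[[1]], fakeTweets[[2]], fakeTweets[[3]]),
# which stacks the rows into a single data frame
fakeDF <- do.call("rbind", fakeTweets)
print(dim(fakeDF))
```

Each list element becomes one row, so the result has as many rows as there were tweets.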

Search Twitter for “#exploit”

> head(expDF,1)
text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age
Verification Plugin http://t.co/O8wVjKca #Exploit
favorited: FALSE
replyToSN: NA
created: 2012-01-10 18:19:11
truncated: FALSE
replyToSID: <NA>
id: 156802281747124224
replyToUID: NA
statusSource: <a href="http://twitterfeed.com"
rel="nofollow">twitterfeed</a>
screenName: NotaThreat2u

A Preview of the Data

> head(expDF$created,1)
[1] "2012-01-10 18:19:11 UTC"

 The created variable is conveniently coded as a
POSIX time variable calibrated to UTC

> hist(expDF$created, breaks=15, freq=TRUE)

 Shows a frequency histogram (with about 15
break points)
 Nice spike at 18:00 UTC (about 1pm EST)

[Figure: Histogram of expDF$created. x-axis: expDF$created,
13:50 to 10:40 UTC; y-axis: Frequency, 0 to 20]

Visualizing the Data: When Tweeted?

# Total time between 1st and last tweet, in seconds
elapsedSecs = as.numeric(difftime(max(expDF$created),
min(expDF$created), units="secs"))
binSecs = floor(elapsedSecs/11) # 11 full bins; the last tweet starts a 12th
# Add a new variable with the bin designators (0 to 11)
expDF$slice = floor(as.numeric(difftime(expDF$created,
min(expDF$created), units="secs"))/binSecs)

expSlices<-expDF[,c("screenName","slice")] # subset the data
expTable<-table(expSlices) # Count tweets per user in each slice

# Convert table data to matrix that heatmap() expects
expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
rownames(expMatrix)<-rownames(expTable)
# Label columns from the slices actually present (slice 0 is 'Slice 1')
colnames(expMatrix)<-paste('Slice',as.integer(colnames(expTable))+1)

heatmap(expMatrix,Rowv=NA,Colv=NA,
col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
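
The slicing step depends on the units of the time differences, which R otherwise picks automatically. The following is a minimal sketch with hypothetical timestamps and screen names that converts everything to seconds before binning; the variable names are illustrative.

```r
# Hypothetical timestamps (UTC) standing in for expDF$created
created <- as.POSIXct(c("2012-01-10 00:00:00", "2012-01-10 05:30:00",
                        "2012-01-10 11:00:00", "2012-01-10 22:00:00"),
                      tz="UTC")
screenName <- c("userA", "userB", "userA", "userC")

# Seconds elapsed since the earliest tweet, with the units made explicit
secs <- as.numeric(difftime(created, min(created), units="secs"))
binSecs <- floor(max(secs)/11)   # width of one slice in seconds
slice <- floor(secs/binSecs)     # slice index, starting at 0

# Rows are users, columns are slices, cells are tweet counts
sliceTable <- table(screenName, slice)
print(sliceTable)
```

Only slices that contain at least one tweet appear as columns, so column labels are safer when derived from the table itself than from a fixed range.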

Prepare a Heatmap

[Figure: heatmap of tweet counts. Rows: 36 screen names (xoMC_DDL
through NotaThreat2u, including packet_storm, hacktalkblog, and
CyberCrimeNEWS). Columns: Slice1 to Slice12.]

library(stringr) # Provides easy string functions
# Find "RT @" at the beginning of each tweet
str_match(expDF$text, "^RT @")

 Regular expression matching any number of
alphanumeric characters or underscore:
[[:alnum:]_]*

# Matches the whole retweet screen name
str_match(expDF$text, "^RT @[[:alnum:]_]*")

# Adds a new variable; [,1] keeps the full match as a plain vector
expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*")[,1]
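
The same pattern can be tried without installing stringr. The following is a minimal sketch that swaps in base R's regexpr() and regmatches() for str_match(), using hypothetical tweet texts.

```r
# Hypothetical tweet texts standing in for expDF$text
tweets <- c("RT @hacktalkblog: New Exploit [webapps]",
            "New advisory posted #exploit",
            "RT @packet_storm: Remote File Disclosure")

# regexpr() locates the first match in each string (-1 if none)
hits <- regexpr("^RT @[[:alnum:]_]*", tweets)

# regmatches() returns only the matched substrings, so fill them into
# a vector that keeps NA for tweets that are not retweets
rtSN <- rep(NA_character_, length(tweets))
rtSN[hits != -1] <- regmatches(tweets, hits)
print(rtSN)
```

The colon after the screen name is left out of the match because it is neither alphanumeric nor an underscore.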

Do Some Parsing with Regex

plot(as.factor(expDF$rtSN),las=2)

[Figure: bar plot of retweet counts by retweeted screen name (y-axis
0 to 14). Labeled bars: RT @_joviann_, RT @cedricpernet,
RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, RT @unixfreaxjp]

# One entry per distinct tweet text, split into words
exploitWords = strsplit(unique(as.character(expDF$text))," ")
exploitWords = unlist(exploitWords)
# Blank out retweet markers, screen names, the search hashtag, and URLs
exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "#Exploit","")
exploitWords = str_replace_all(exploitWords, "#exploit","")
exploitWords = str_replace_all(exploitWords, "^http.*","")
# Strip leftover punctuation
exploitWords = str_replace_all(exploitWords, ":","")
exploitWords = str_replace_all(exploitWords, "_","")
exploitWords = str_replace_all(exploitWords, "-","")
exploitWords = tolower(exploitWords)
exploitWords = sort(exploitWords)
wordCount = summary(as.factor(exploitWords))
# Drop the largest count(s), then keep words seen more than 4 times
wordCount = wordCount[wordCount<(max(wordCount)-1)]
wordCount = wordCount[wordCount>4]
barplot(wordCount,las=2)
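
The word-counting idea can be checked on a few strings. The following is a minimal sketch with hypothetical tweet texts; it swaps in base R's gsub() and table() for str_replace_all() and summary(as.factor()), and it removes the blanked-out words explicitly instead of filtering them by count.

```r
# Hypothetical tweet texts standing in for expDF$text
tweets <- c("RT @userA: SQL injection in CMS #exploit http://t.co/abc",
            "New SQL injection alert #security",
            "Buffer overflow exploit #security #exploit")

words <- unlist(strsplit(tweets, " "))          # one element per word
words <- gsub("^@[[:alnum:]_]*:?$", "", words)  # blank out screen names
words <- gsub("^http.*", "", words)             # blank out URLs
words <- tolower(words)
words <- words[words != ""]                     # drop the blanks left behind

# table() counts every distinct word; sort() puts the most frequent first
wordCount <- sort(table(words), decreasing=TRUE)
print(head(wordCount, 5))
```

Unlike summary() on a factor, table() does not lump rare values into an "(Other)" bucket, which may make the later filtering easier to reason about.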

Make a Keyword List

[Figure: bar plot of keyword counts (y-axis 0 to 30). Bars: #security,
rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit,
new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms,
file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress,
/, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code,
command, en, for, information, phpmydirectory, with]

Most common keywords

 #security – Another good hashtag to search on
 (SQL) Injection – Apparently one of the most common
attacks
 cross (site) scripting – Another popular attack
 #cyber #cyberwar #ccureit #hacker – More
hashtags?
 remote vulnerability, buffer (overflow), cms,
wordpress, phpmydirectory

 Each/any of these keywords could provide a basis for a
new tweet search term, or for keyword detection within a
set of tweets obtained from another search, or for an alert
dashboard with periodic updates

Common keywords to explore

@shitaesy Je me couche à 20h30 en ce
moment.. J'ai même lu ce soir :3 #exploit
(French: “I'm going to bed at 8:30 pm these
days.. I even read tonight :3”)

 Scanning across a sample of the
tweets, some are spam and should be
filtered out
 Can we create a classifier that will get rid
of the non-exploit tweets?

Must Remove the Non-Tweets

[Diagram: Attribute 1, Attribute 2, Attribute 3]

 Initial model developed with training data
 Model accuracy checked on training data, but later
cross-validated on new data

• Attributes can be boolean or numeric
• Most useful if independent of other attributes
• The fewer attributes the better

Anatomy of a Classifier

write.table(expDF, sep=",",
file="exploitData.csv")

# I looked at the tweets and added
# training data, using my judgment to code
# the non-tweets
truExp = read.table("truExp.txt")

# Add to the existing data
expDFtrue = cbind(expDF, truExp)
# Note: new variable name defaults to “V1”

Create and Add Training Data

# Create true/false values for each row,
# based on whether the string exists
expDFtrue$hassec =
grepl("security",tolower(expDFtrue$text))

# Also count some punctuation to see if
# there are clues there
expDFtrue$numhash =
sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
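
Both kinds of predictor can be tried on a few strings. The following is a minimal sketch with hypothetical tweet texts; it counts hash marks with gregexpr() rather than strsplit(), a swapped-in technique that also counts a "#" at the very end of a tweet, which strsplit() would miss because it drops a trailing empty piece.

```r
# Hypothetical tweet texts standing in for expDFtrue$text
tweets <- c("New #Security alert: SQL injection #exploit",
            "Going to bed early #exploit",
            "no hashtags here")

# TRUE where the lowercased text contains "security"
hassec <- grepl("security", tolower(tweets))

# gregexpr() returns the match positions in each string, or -1 if none,
# so counting the positive positions gives the number of "#" characters
numhash <- sapply(gregexpr("#", tweets, fixed=TRUE),
                  function(m) sum(m > 0))
print(data.frame(hassec, numhash))
```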

Easy Predictors with Grepl()

Coefficients (output from Logit analysis):
             Estimate    Std. Error  z value  Pr(>|z|)
(Intercept)   2.946e+00  1.904e+00    1.547   0.12187
hassecTRUE    2.302e+00  1.186e+00    1.941   0.05222 .
hassqlTRUE    1.770e+01  4.076e+03    0.004   0.99653
hasbufTRUE   -2.476e+00  1.124e+00   -2.203   0.02757 *
hasscrTRUE    1.832e+01  4.135e+03    0.004   0.99647
hasremTRUE   -3.075e-01  1.109e+00   -0.277   0.78164
hascybTRUE    1.202e+00  1.958e+00    0.614   0.53937
numhash      -2.046e+00  7.218e-01   -2.835   0.00458 **
numast       -2.182e+01  5.554e+03   -0.004   0.99687
numdot        6.306e-01  4.167e-01    1.513   0.13017
twtlen        6.548e-03  2.329e-02    0.281   0.77854

# security, buffer keywords are promising, as well as the
# number of hash marks and the number of dots/periods
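
The slides show the coefficient table but not the call that produced it. The following is a minimal sketch of how such a table might be obtained, assuming glm() with a binomial family; the data frame toy and its synthetic values are hypothetical and only reuse the variable names hassec, numhash, and V1 for illustration.

```r
set.seed(42)  # reproducible synthetic data
n <- 200

# Hypothetical predictors standing in for hassec and numhash
toy <- data.frame(hassec  = runif(n) < 0.4,
                  numhash = rpois(n, 2))

# Synthetic outcome: more hash marks lower the odds that V1 is 1
p <- plogis(2 + 1.5*toy$hassec - 1.0*toy$numhash)
toy$V1 <- rbinom(n, 1, p)

# Logistic regression: binomial family with the default logit link
logit <- glm(V1 ~ hassec + numhash, family=binomial, data=toy)
print(summary(logit)$coefficients)
```

The printed matrix has the same four columns as the table above: Estimate, Std. Error, z value, and Pr(>|z|).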

Choose Best Attributes

library(rpart)
fit <- rpart(V1 ~ hassec +
hasbuf + numhash + numdot,
method="class",
data=expDFtrue)
summary(fit)
plot(fit, uniform=TRUE,
margin=0.1, branch=0.5,
compress=TRUE)
text(fit, use.n=TRUE, all=TRUE,
cex=.8)

[Figure: classification tree. Root node: class 1 (26/71), split on
numhash>=2.5. Left leaf: class 0 (14/2). Right leaf: class 1 (12/69).]

“numhash” is the only retained attribute, split at 2.5
Overall 14 errors (83/97 = 85.6% correct)
12 false positives (12/97 = 12.4% FP)
2 false negatives (2/97 = 2.1% FN)
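
The error counts follow from the leaf counts on the tree. The following is a minimal sketch of the arithmetic, assuming each leaf label reads as (true class 0)/(true class 1) and that class 1 marks a genuine exploit tweet; the variable names are illustrative.

```r
# Leaf counts read off the tree, as (true class 0)/(true class 1)
leafPred0 <- c(class0=14, class1=2)   # leaf predicted as 0
leafPred1 <- c(class0=12, class1=69)  # leaf predicted as 1

total    <- sum(leafPred0) + sum(leafPred1)                # all tweets
correct  <- leafPred0[["class0"]] + leafPred1[["class1"]]  # right calls
falsePos <- leafPred1[["class0"]]  # class 0 tweets predicted as 1
falseNeg <- leafPred0[["class1"]]  # class 1 tweets predicted as 0

# Percentages of all tweets, rounded to one decimal place
rates <- round(100 * c(correct=correct, falsePos=falsePos,
                       falseNeg=falseNeg) / total, 1)
print(rates)
```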

Classification Tree Works OK

 Conclusion 1: R is pretty handy for grabbing and manipulating
tweet data

 Conclusion 2: Tweet data are messy and require a good deal of
clean-up, parsing, and filtering

 Conclusion 3: As these two examples suggest, tweets can provide
breaking news about vulnerabilities and exploits
◦ WordPress Age Verification plugin versions 0.4 and below open redirect
vulnerability
 Exploit availability tweeted at 12:19 PM
 Blogged at SecurityBlog 10:24 PM
 Added to SiloBreaker two days later

◦ Pragyan CMS v 3.0 Remote File Disclosure
 Exploit availability tweeted at 11:07 AM
 Appeared on PacketStorm next day
 On RealHacker three days later
 On WebCriminal.ru eight days later

Twitter: Early Warning System?

Image from: http://www.vincegolangco.com
