Mining Tweets for
Security Information
            with “R”
 Jeff Stanton, School of Information Studies
                         Syracuse University
@highfours: I just
                   watched a plane
                   crash into the
                   hudson rive in
                   manhattan

                  @ReallyVirtual:
                   Helicopter
                   hovering above
                   Abbotabad at 1AM
                   (is a rare event).

Twitter: Early Warning System?
2:26 pm    2:44 PM   3:15 PM
  UCLA      TMZ.com    Twitter




Twitter Facts
   140 characters max
            @petridishes – Screen Name
            #blackberry – User-created hashtag
            @crozzledhearts – “Retweeter” who sent this
             tweet after receiving it from @petridishes
            30 minutes ago via web – Each tweet
             encoded with UTC timecode
            No URLs here, but they are auto-shortened

Anatomy of a Tweet
“R” – Open Source Analytics
   A GNU open source project
   An implementation of the “S” statistical
    language developed at Bell Labs
   Largely an interpreted, command-line
    interface with some GUI add-ons
   More than 4300 add-on packages developed
    by the user community
   Full-featured data management and matrix
    manipulation with performance comparable
    to Octave and MATLAB
   Extensive graphics for visualization
   Starting in 2010, used by more data miners
    (43%) than any other single tool


“R” Facts
   Developed by Jeff
    Gentry (Fidelity)
   Five classes and 11
    functions to:
    ◦ Authenticate to Twitter
      with OAuth and check
      current rate limit
    ◦ Manipulate, send, and
      receive direct messages
    ◦ Update user status
    ◦ Search for tweets
      containing particular
      keywords or hashtags
    ◦ Examine topic trends
    ◦ Examine timelines


The “twitteR” package
   Use the R “Packages” menu to install the
    necessary packages:
    bitops, RJSONIO, RCurl, and twitteR
   Depending upon Mac/Win/Linux, you may need
    to retrieve a zipped file of RCurl from:
    ◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
   Then ready the packages for use in R with the
    library() command:
    >   library(bitops)
    >   library(RCurl)
    >   library(RJSONIO)
    >   library(twitteR)




Getting Ready – Load Packages
> expTweets <- searchTwitter('#exploit', n=500)
> expDF <- do.call("rbind", lapply(expTweets, as.data.frame))

   The second command above takes the raw tweet data in
    expTweets – which starts as a list/collection of separate
    data objects (frames) – and binds it into a single data
    frame for ease of analysis
   lapply() applies a command to each element of a list
   as.data.frame is a type coercion
   rbind is the function that joins separate objects to become
    rows in a dataframe
   do.call() makes a single call to rbind, passing every element of the
    list as a separate argument
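
As a rough illustration of the list-to-data-frame idiom described above, the sketch below uses a small made-up list in place of real tweet objects. The names toyTweets, toyFrames, and toyDF, and all of the values, are assumptions for illustration and do not come from the search results.

```r
# Toy stand-ins for tweet records; the fields and values are invented
toyTweets <- list(
  list(text = "first tweet #exploit", screenName = "userA"),
  list(text = "second tweet #exploit", screenName = "userB"),
  list(text = "third tweet #exploit", screenName = "userC")
)

# lapply() coerces each list element to a one-row data frame
toyFrames <- lapply(toyTweets, as.data.frame, stringsAsFactors = FALSE)

# do.call() makes one call, equivalent to
# rbind(toyFrames[[1]], toyFrames[[2]], toyFrames[[3]])
toyDF <- do.call("rbind", toyFrames)

print(dim(toyDF))
```

The result has one row per list element and one column per field, which is the shape the later slides rely on.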




Search Twitter for “#exploit”
> head(expDF,1)
text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age
  Verification Plugin http://t.co/O8wVjKca #Exploit
favorited: FALSE
replyToSN: NA
created: 2012-01-10 18:19:11
truncated: FALSE
replyToSID: <NA>
id: 156802281747124224
replyToUID: NA
statusSource: &lt;a href=&quot;http://twitterfeed.com&quot;
  rel=&quot;nofollow&quot;&gt;twitterfeed&lt;/a&gt;
screenName: NotaThreat2u




A Preview of the Data
> head(expDF$created,1)
[1] "2012-01-10 18:19:11 UTC"

   The created variable is conveniently coded as a
    POSIX time variable calibrated to UTC
    > hist(expDF$created, breaks=15, freq=TRUE)
   Shows a frequency histogram (with about 15
    break points)
   Nice spike at 18:00 UTC (about 1pm EST)

[Figure: Histogram of expDF$created; x-axis expDF$created (13:50 to 10:40 UTC), y-axis Frequency (0 to 20)]
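
To check the UTC-to-Eastern conversion mentioned above, a minimal sketch using the timestamp from the data preview is shown here. The zone name "America/New_York" and the variable names are assumptions; the slides do not say how the conversion was done.

```r
# Example timestamp from the data preview, parsed as UTC
created <- as.POSIXct("2012-01-10 18:19:11", tz = "UTC")

# format() can render the same instant in another time zone;
# "America/New_York" is the assumed zone name for US Eastern time
eastern <- format(created, tz = "America/New_York", usetz = TRUE)
print(eastern)

# Hour of day in UTC, handy for tabulating when tweets arrive
utcHour <- as.integer(format(created, "%H", tz = "UTC"))
print(utcHour)
```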
Visualizing the Data: When Tweeted?
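# Total time between 1st and last tweet, explicitly in hours
elapsedTime = as.numeric(difftime(max(expDF$created),
  min(expDF$created), units="hours"))
timeBin = floor(elapsedTime/11) # Bin width in whole hours (about 11 bins)
# Add a new variable with the bin designators: seconds since the
# first tweet, divided by the bin width in seconds
expDF$slice = floor(as.numeric(difftime(expDF$created,
  min(expDF$created), units="secs"))/(timeBin*3600))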

expSlices<-expDF[,c("screenName","slice")] # subset the data
expTable<-table(expSlices) # Count tweets in each slice

# Convert table data to matrix that heatmap() expects
expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
rownames(expMatrix)<-rownames(expTable)
colnames(expMatrix)<-paste('Slice',1:ncol(expMatrix)) # One label per column

heatmap(expMatrix,Rowv=NA,Colv=NA,
  col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
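
The binning and table-to-matrix steps above can be tried on a handful of invented timestamps. This is a sketch under stated assumptions: the toy data, the 2-hour bin width, and the names toy, binHours, sliceLevels, and toyMatrix are all hypothetical, and using factor() with explicit levels is one possible way to keep empty slices as columns.

```r
# Invented timestamps (UTC) and screen names, for illustration only
toy <- data.frame(
  screenName = c("userA", "userA", "userB", "userC", "userB"),
  created = as.POSIXct(c("2012-01-10 12:00:00", "2012-01-10 12:30:00",
                         "2012-01-10 14:10:00", "2012-01-10 16:45:00",
                         "2012-01-10 17:59:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

binHours <- 2  # assumed bin width in hours
secsSinceFirst <- as.numeric(difftime(toy$created, min(toy$created),
                                      units = "secs"))
toy$slice <- floor(secsSinceFirst / (binHours * 3600))

# Cross-tabulate users against slices; explicit factor levels keep
# any empty slice as an all-zero column
sliceLevels <- 0:max(toy$slice)
toyTable <- table(toy$screenName, factor(toy$slice, levels = sliceLevels))

# Convert the table to a plain matrix with readable labels
toyMatrix <- matrix(toyTable, ncol = ncol(toyTable),
                    dimnames = list(rownames(toyTable),
                                    paste("Slice", sliceLevels + 1)))
print(toyMatrix)
```

A matrix built this way can be passed to heatmap() in the same manner as expMatrix.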




Prepare a Heatmap
[Figure: heatmap of tweet counts; rows are screen names (xoMC_DDL through NotaThreat2u), columns are time slices Slice1 to Slice12]
library(stringr) # Provides easy string functions
str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line

   Regular expression matching any number of
    alphanumeric characters or underscore:
    [[:alnum:]_]*

str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name

expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
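
The same pattern can be tried without any add-on package. The sketch below swaps in base R's regexpr() and regmatches() for str_match(), so it is an alternative rather than the approach used on the slide; the sample tweets and variable names are invented.

```r
# Invented sample tweets; only the first and third are retweets
tweets <- c("RT @packet_storm: New exploit posted",
            "Plain tweet #exploit",
            "RT @hacktalkblog: [webapps] advisory")

# Same regular expression as above: "RT @" at the start of the text,
# followed by any number of alphanumeric characters or underscores
pattern <- "^RT @[[:alnum:]_]*"

# regexpr() returns -1 where there is no match
hits <- regexpr(pattern, tweets)

# Start with NA everywhere, then fill in the matched prefixes
rtSN <- rep(NA_character_, length(tweets))
rtSN[hits > 0] <- regmatches(tweets, hits)

print(rtSN)
```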



Do Some Parsing with Regex
plot(as.factor(expDF$rtSN),las=2)

[Figure: bar plot of retweet counts by source (count axis 0 to 14); labeled bars include RT @_joviann_, RT @cedricpernet, RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, and RT @unixfreaxjp]
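# Split each distinct tweet into words; unique(as.character()) works
# whether text is stored as a factor or as character strings
exploitWords = strsplit(unique(as.character(expDF$text))," ")
exploitWords = unlist(exploitWords)
# Note: after splitting on spaces, "RT" and "@name" are separate
# tokens, so the next pattern no longer matches and "rt" stays in
exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "#Exploit","")
exploitWords = str_replace_all(exploitWords, "#exploit","")
exploitWords = str_replace_all(exploitWords, "^http.*","")
exploitWords = str_replace_all(exploitWords, ":","")
exploitWords = str_replace_all(exploitWords, "_","")
exploitWords = str_replace_all(exploitWords, "-","")
exploitWords = tolower(exploitWords)
exploitWords = sort(exploitWords)
wordCount = summary(as.factor(exploitWords)) # Count each word
# Drop the largest counts (likely the empty strings left by the
# replacements above), then keep words seen more than 4 times
wordCount = wordCount[wordCount<(max(wordCount)-1)]
wordCount = wordCount[wordCount>4]
barplot(wordCount,las=2)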
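
A compact way to see the tokenize, clean, and count sequence is to run it on a few invented tweets. This sketch uses base R (gsub() and table()) instead of stringr and summary(), so it is a simplified variant and not the slide's exact pipeline; the sample tweets and the decision to drop "rt" are assumptions.

```r
# Invented sample tweets
tweets <- c("RT @packet_storm: SQL injection exploit #exploit",
            "New SQL injection in cms http://t.co/abc #Exploit",
            "Buffer overflow exploit")

# Split on spaces, then clean each token
words <- unlist(strsplit(tweets, " ", fixed = TRUE))
words <- gsub("@[[:alnum:]_]*", "", words)   # drop screen names
words <- gsub("^http.*", "", words)          # drop URLs
words <- gsub(":", "", words, fixed = TRUE)  # drop colons
words <- tolower(words)

# Remove empty tokens, the search hashtag, and the retweet marker
words <- words[!(words %in% c("", "#exploit", "rt"))]

# table() counts every distinct word; sort puts the most common first
wordCount <- sort(table(words), decreasing = TRUE)
print(wordCount)
```

Using table() avoids the cap on the number of levels that summary() reports for a factor, which matters once the vocabulary grows beyond a few dozen words.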




Make a Keyword List
[Figure: bar plot of keyword counts (count axis 0 to 30); keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]

Most common keywords
    #security – Another good hashtag to search on
   (SQL) Injection – Apparently one of the most common
    attacks
   cross (site) scripting – Another popular attack
   #cyber #cyberwar #ccureit #hacker – More
    hashtags?
   remote vulnerability, buffer (overflow), cms,
    wordpress, phpmydirectory

   Each/any of these keywords could provide a basis for a
    new tweet search term, or for keyword detection within a
    set of tweets obtained from another search, or for an alert
    dashboard with periodic updates




Common keywords to explore
@shitaesy Je me couche à 20h30 en ce
 moment.. J'ai même lu ce soir :3 #exploit
 [French: "I go to bed at 8:30 pm these days.. I even read tonight :3"]

 Scanning across a sample of the
  tweets, some are spam and should be
  filtered out
 Can we create a classifier that will get rid
  of the non-exploit tweets?


Must Remove the Non-Tweets
Initial model developed with training data
Model accuracy checked on training data, but later
cross-validated on new data

Attribute 1, Attribute 2, Attribute 3
   Attributes can be boolean or numeric
   Most useful if independent of other attributes
   The fewer the attributes the better




Anatomy of a Classifier
write.table(expDF, sep=",",
 file="exploitData.csv")

# I looked at the tweets and added
# training data, using my judgment to code
# the non-tweets
truExp = read.table("truExp.txt")

# Add to the existing data
expDFtrue = cbind(expDF, truExp)
# Note: new variable name defaults to “V1”
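
The default column name mentioned in the note can be seen with a tiny example. Here read.table() reads the labels from an inline string instead of a file, and the toy data frame and label values are invented for illustration.

```r
# Invented stand-in for the tweet data frame
toyDF <- data.frame(text = c("tweet one", "tweet two", "tweet three"),
                    stringsAsFactors = FALSE)

# Hand-coded labels, one per row; read from inline text here
# (the slides read them from a file instead)
labels <- read.table(text = "1\n0\n1")

# cbind() attaches the labels as a new column; with no header in the
# input, read.table() names that column V1
toyTrue <- cbind(toyDF, labels)
print(names(toyTrue))
```

The labels must be in the same row order as the data frame, since cbind() pairs rows purely by position.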


Create and Add Training Data
# Create true/false values for each row,
# based on whether the string exists
expDFtrue$hassec =
 grepl("security",tolower(expDFtrue$text))

# Also count some punctuation to see if
# there are clues there
expDFtrue$numhash =
 sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
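
Both kinds of predictor can be checked on a few invented tweets. The hash count below uses gregexpr() with regmatches(), which is an alternative to the strsplit() approach above and also counts a "#" that ends the text; the sample tweets and variable names are assumptions.

```r
# Invented sample tweets
tweets <- c("New #security advisory #exploit",
            "Je me couche #exploit",
            "no tags here",
            "ends with hash #")

# TRUE/FALSE predictor: does the lowercased text contain "security"?
hassec <- grepl("security", tolower(tweets), fixed = TRUE)

# Numeric predictor: count every "#" by collecting all matches per tweet
hashMatches <- regmatches(tweets, gregexpr("#", tweets, fixed = TRUE))
numhash <- lengths(hashMatches)

print(data.frame(hassec, numhash))
```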



Easy Predictors with Grepl()
Coefficients (output from Logit analysis):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.946e+00 1.904e+00     1.547 0.12187
hassecTRUE   2.302e+00 1.186e+00    1.941 0.05222 .
hassqlTRUE   1.770e+01 4.076e+03    0.004 0.99653
hasbufTRUE -2.476e+00 1.124e+00 -2.203 0.02757 *
hasscrTRUE   1.832e+01 4.135e+03    0.004 0.99647
hasremTRUE -3.075e-01 1.109e+00 -0.277 0.78164
hascybTRUE   1.202e+00 1.958e+00    0.614 0.53937
numhash     -2.046e+00 7.218e-01 -2.835 0.00458 **
numast      -2.182e+01 5.554e+03 -0.004 0.99687
numdot       6.306e-01 4.167e-01    1.513 0.13017
twtlen       6.548e-03 2.329e-02    0.281 0.77854

# security, buffer keywords are promising, as well as the
# number of hash marks and the number of dots/periods
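
The slide shows only the coefficient table, not the call that produced it. One common way to fit a logit model in R is glm() with a binomial family, sketched below on simulated data. The data, the effect sizes, and the two-predictor formula are all assumptions chosen for illustration; they are not the presenter's data or results.

```r
set.seed(42)
n <- 200

# Simulated predictors: a TRUE/FALSE keyword flag and a hash count
toy <- data.frame(hassec = runif(n) < 0.4,
                  numhash = rpois(n, 2))

# Simulated 0/1 labels from an assumed logistic relationship
p <- plogis(1 + 1.5 * toy$hassec - 0.8 * toy$numhash)
toy$V1 <- rbinom(n, 1, p)

# Logistic regression: binomial family with a logit link
fit <- glm(V1 ~ hassec + numhash,
           family = binomial(link = "logit"), data = toy)

# Coefficient table with Estimate, Std. Error, z value, Pr(>|z|)
coefTable <- summary(fit)$coefficients
print(coefTable)
```

In a table like this, a standard error in the thousands next to a large estimate is often a sign that a predictor separates the classes almost perfectly in the sample, so its coefficient cannot be estimated reliably.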




Choose Best Attributes
library(rpart)
fit <- rpart(V1 ~ hassec +
  hasbuf + numhash + numdot,
  method="class",
  data=expDFtrue)
summary(fit)
plot(fit, uniform=TRUE,
  margin=0.1, branch=0.5,
  compress=TRUE)
text(fit, use.n=TRUE, all=TRUE,
  cex=.8)

[Figure: classification tree; root node (class 1, 26/71) splits on numhash>=2.5 into a leaf of class 0 (14/2) and a leaf of class 1 (12/69)]


      “numhash” only retained attribute, split at 2.5
      Overall 14 errors (83/97 = 85.6% correct)
      12 false positives (12/97 = 12.4% FP)
      2 false negatives (2/97 = 2.1% FN)
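
The percentages above follow directly from the leaf counts in the tree. As a worked check, the sketch below arranges those counts as a confusion matrix, assuming that "14/2" and "12/69" give the number of class 0 and class 1 tweets in each leaf; the variable names are hypothetical.

```r
# Rows are the hand-coded class, columns the class predicted by the tree
confusion <- matrix(c(14, 2, 12, 69), nrow = 2,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("0", "1")))

total <- sum(confusion)                      # 97 tweets
correct <- sum(diag(confusion))              # 14 + 69
falsePositives <- confusion["0", "1"]        # actual 0, predicted 1
falseNegatives <- confusion["1", "0"]        # actual 1, predicted 0

accuracyPct <- round(100 * correct / total, 1)
fpPct <- round(100 * falsePositives / total, 1)
fnPct <- round(100 * falseNegatives / total, 1)

print(c(accuracy = accuracyPct, FP = fpPct, FN = fnPct))
```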


Classification Tree Works OK
   Conclusion 1: R is pretty handy for grabbing and manipulating
    tweet data

   Conclusion 2: Tweet data are messy and require a good deal of
    clean-up, parsing, and filtering

   Conclusion 3: As these two examples suggest, tweets can provide
    breaking news about vulnerabilities and exploits
    ◦ WordPress Age Verification plugin versions 0.4 and below open redirect
      vulnerability
       Exploit availability tweeted at 12:19 PM
       Blogged at SecurityBlog 10:24 PM
       Added to SiloBreaker two days later

    ◦ Pragyan CMS v 3.0 Remote File Disclosure
       Exploit availability tweeted at 11:07 AM
       Appeared on PacketStorm next day
       On RealHacker three days later
       On WebCriminal.ru eight days later


Twitter: Early Warning System?
Image from: http://www.vincegolangco.com

Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 

Mining tweets for security information (rev 2)

  • 1. Mining Tweets for Security Information with “R” Jeff Stanton, School of Information Studies Syracuse University
  • 2. @highfours: I just watched a plane crash into the hudson rive in manhattan @ReallyVirtual: Helicopter hovering above Abbotabad at 1AM (is a rare event). Twitter: Early Warning System?
  • 3. 2:26 pm 2:44 PM 3:15 PM UCLA TMZ.com Twitter Twitter Facts
  • 4. 140 characters max  @petridishes – Screen Name  #blackberry – User-created hashtag  @crozzledhearts – “Retweeter” who sent this tweet after receiving it from @petridishes  30 minutes ago via web – Each tweet encoded with UTC timecode  No URLs here, but they are auto-shortened Anatomy of a Tweet
  • 5. “R” – Open Source Analytics
  • 6. A GNU open source project  An implementation of the “S” statistical language developed at Bell labs  Largely an interpreted, command-line interface with some GUI add-ons  More than 4300 add-on packages developed by the user community  Full-featured data management and matrix manipulation with performance comparable to Octave and MATLAB  Extensive graphics for visualization  Starting in 2010, used by more data miners (43%) than any other single tool “R” Facts
  • 7. Developed by Jeff Gentry (Fidelity)  Five classes and 11 functions to: ◦ Authenticate to Twitter with Oauth and check current rate limit ◦ Manipulate, send, and receive direct messages ◦ Update user status ◦ Search for tweets containing particular keywords or hashtags ◦ Examine topic trends ◦ Examine timelines The “twitteR” package
  • 8. Use the R “Packages” menu to install the necessary packages: bitops, RJSONIO, RCurl, and twitteR
    Depending upon Mac/Win/Linux, you may need to retrieve a zipped file of RCurl from:
      ◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
    Then ready the packages for use in R with the library() command:
      > library(bitops)
      > library(RCurl)
      > library(RJSONIO)
      > library(twitteR)
    Getting Ready – Load Packages
  • 9. > expTweets <- searchTwitter('#exploit', n=500)
      > expDF <- do.call("rbind", lapply(expTweets, as.data.frame))
    The second command above takes the raw tweet data in expTweets, which starts as a list of separate tweet objects, and binds it into a single data frame for ease of analysis
    lapply() applies a function to each element of a list
    as.data.frame is a type coercion that turns each tweet into a one-row data frame
    rbind is the function that joins separate objects to become rows in a data frame
    do.call() makes a single call to rbind, passing every element of the list as an argument
    Search Twitter for “#exploit”
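The binding step can be tried without a Twitter connection. This is a minimal sketch that assumes a toy list of two-field records in place of real tweet objects; the names toyTweets and toyDF are hypothetical and not part of the twitteR package.

```r
# Toy stand-in for the list returned by searchTwitter(); no network access needed
toyTweets <- list(
  list(text = "New exploit posted", screenName = "userA"),
  list(text = "RT @userA: New exploit posted", screenName = "userB"),
  list(text = "Patch released", screenName = "userC")
)

# lapply() converts each list element to a one-row data frame;
# do.call() then makes one call equivalent to rbind(df1, df2, df3)
toyDF <- do.call("rbind",
                 lapply(toyTweets, as.data.frame, stringsAsFactors = FALSE))

print(dim(toyDF))        # rows = tweets, columns = fields
print(toyDF$screenName)
```

Passing stringsAsFactors = FALSE keeps the text as character data, which matters for the string handling later in the deck.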
  • 10. > head(expDF,1)
      text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age Verification Plugin http://t.co/O8wVjKca #Exploit
      favorited: FALSE
      replyToSN: NA
      created: 2012-01-10 18:19:11
      truncated: FALSE
      replyToSID: <NA>
      id: 156802281747124224
      replyToUID: NA
      statusSource: &lt;a href=&quot;http://twitterfeed.com&quot; rel=&quot;nofollow&quot;&gt;twitterfeed&lt;/a&gt;
      screenName: NotaThreat2u
    A Preview of the Data
  • 11. > head(expDF$created,1)
      [1] "2012-01-10 18:19:11 UTC"
    The created variable is conveniently coded as a POSIX time variable calibrated to UTC
      > hist(expDF$created, breaks=15, freq=TRUE)
    Shows a frequency histogram (with about 15 break points)
    Nice spike at 18:00 UTC (about 1pm EST)
    [Figure: Histogram of expDF$created; x-axis expDF$created (13:50 through 10:40 the next day, UTC), y-axis Frequency (0 to 20)]
    Visualizing the Data: When Tweeted?
  • 12. # Total time between 1st and last tweet, in hours
      elapsedTime = as.numeric(difftime(max(expDF$created), min(expDF$created), units="hours"))
      # Bin width in whole hours; assumes the tweets span at least 11 hours
      timeBin = floor(elapsedTime/11)
      # Add a new variable with the (0-based) bin designators
      expDF$slice = floor(as.numeric(difftime(expDF$created, min(expDF$created), units="secs"))/(timeBin*3600))
      expSlices<-expDF[,c("screenName","slice")] # subset the data
      expTable<-table(expSlices) # Count tweets in each slice
      # Convert table data to matrix that heatmap() expects
      expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
      rownames(expMatrix)<-rownames(expTable)
      # Label columns from the slices actually present (about 12 here)
      colnames(expMatrix)<-paste('Slice',as.integer(colnames(expTable))+1)
      heatmap(expMatrix,Rowv=NA,Colv=NA, col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
    Prepare a Heatmap
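The binning logic is easier to check on a handful of timestamps. The sketch below uses synthetic data (the created and screenName vectors are made up) and, as an assumption that departs from the slide, fixes the number of bins at 12 and keeps empty bins as zero-count columns.

```r
# Synthetic timestamps (hypothetical data): six tweets spread over 24 hours
created <- as.POSIXct("2012-01-10 00:00:00", tz = "UTC") +
  3600 * c(0, 1, 5, 11, 18, 24)
screenName <- c("a", "b", "a", "c", "b", "a")

nBins <- 12
# Seconds since the first tweet, as a plain number (no difftime unit ambiguity)
offsetSecs <- as.numeric(difftime(created, min(created), units = "secs"))
binWidth <- max(offsetSecs) / nBins
# Bin index 1..nBins; pmin() keeps the last tweet inside the final bin
slice <- pmin(floor(offsetSecs / binWidth) + 1, nBins)

# factor() with fixed levels keeps empty bins as zero-count columns
counts <- table(screenName, factor(slice, levels = 1:nBins))
print(counts)
```

Because every bin is present in the table, a fixed set of 12 column labels can be assigned safely before calling heatmap().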
  • 13. [Figure: heatmap of tweet counts, one row per screen name (for example hacktalkblog, NotaThreat2u, packet_storm, CyberCrimeNEWS) and one column per time slice (Slice1 through Slice12)]
  • 14. library(stringr) # Provides easy string functions
      str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line
    Regular expression matching any number of alphanumeric characters or underscore: [[:alnum:]_]*
      str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name
      expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
    Do Some Parsing with Regex
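The same pattern also works with base R alone, which can help when stringr is not installed. This is a sketch on three made-up tweets; the variable names tweets, rtSN, and rtName are illustrative only.

```r
tweets <- c("RT @hacktalkblog: New Exploit [webapps]",
            "Plain tweet with no retweet prefix",
            "RT @packet_storm: Remote File Disclosure")

# regexpr() finds the first match in each string (-1 where there is none)
m <- regexpr("^RT @[[:alnum:]_]+", tweets)

# regmatches() returns only the matched pieces, so place them into an
# NA-filled vector at the positions that actually matched
rtSN <- rep(NA_character_, length(tweets))
rtSN[m != -1] <- regmatches(tweets, m)

# Strip the "RT @" prefix to leave the bare screen name
rtName <- sub("^RT @", "", rtSN)
print(rtName)
```

Tweets that are not retweets come out as NA, so they drop out naturally when the names are tabulated or plotted.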
  • 15. plot(as.factor(expDF$rtSN),las=2)
    [Figure: bar plot of retweet counts (0 to 14) by retweeted screen name: RT @_joviann_, RT @cedricpernet, RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, RT @unixfreaxjp]
  • 16. # Split the distinct tweet texts into words (works for factor or character text)
      exploitWords = strsplit(unique(as.character(expDF$text))," ")
      exploitWords = unlist(exploitWords)
      # Remove retweet prefixes, mentions, the search hashtag, and links
      exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
      exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
      exploitWords = str_replace_all(exploitWords, "#Exploit","")
      exploitWords = str_replace_all(exploitWords, "#exploit","")
      exploitWords = str_replace_all(exploitWords, "^http.*","")
      # Remove punctuation
      exploitWords = str_replace_all(exploitWords, ":","")
      exploitWords = str_replace_all(exploitWords, "_","")
      exploitWords = str_replace_all(exploitWords, "-","")
      exploitWords = tolower(exploitWords)
      exploitWords = sort(exploitWords)
      # Count each word, drop the single most frequent entry, keep words seen more than 4 times
      wordCount = summary(as.factor(exploitWords))
      wordCount = wordCount[wordCount<(max(wordCount)-1)]
      wordCount = wordCount[wordCount>4]
      barplot(wordCount,las=2)
    Make a Keyword List
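A compact variant of the keyword count, as a sketch using base R only. It swaps in table() for summary(as.factor()) and filters whole tokens with one pattern, which is an alternative to the slide's approach rather than a reproduction of it; toyText is made-up data.

```r
toyText <- c("RT @userA: SQL injection #exploit http://t.co/abc",
             "Remote SQL injection in CMS #Exploit",
             "sql injection alert")

words <- unlist(strsplit(toyText, " ", fixed = TRUE))
words <- tolower(words)
# Drop retweet markers, mentions, the search hashtag, and links
words <- words[!grepl("^(rt|@.*|#exploit|http.*)$", words)]
# Remove leftover punctuation, then discard anything left empty
words <- gsub("[:_-]", "", words)
words <- words[words != ""]

# table() counts every distinct word without a cap on the number of levels
wordCount <- sort(table(words), decreasing = TRUE)
print(wordCount[wordCount >= 2])
```

On real data the threshold would be higher (the slide keeps words seen more than 4 times).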
  • 17. [Figure: bar plot of word counts (0 to 30) for the most common keywords: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
    Most common keywords
  • 18. #security – Another good hashtag to search on
    (SQL) Injection – Apparently one of the most common attacks
    cross (site) scripting – Another popular attack
    #cyber #cyberwar #ccureit #hacker – More hashtags?
    remote vulnerability, buffer (overflow), cms, wordpress, phpmydirectory
    Each/any of these keywords could provide a basis for a new tweet search term, or for keyword detection within a set of tweets obtained from another search, or for an alert dashboard with periodic updates
    Common keywords to explore
  • 19. @shitaesy Je me couche à 20h30 en ce moment.. J'ai même lu ce soir :3 #exploit
    (Translation: “I'm going to bed at 8:30 pm these days.. I even read tonight :3 #exploit”)
    Scanning across a sample of the tweets, some are spam or off-topic and should be filtered out
    Can we create a classifier that will get rid of the non-exploit tweets?
    Must Remove the Non-Tweets
  • 20. [Diagram: Attribute 1, Attribute 2, and Attribute 3 feed a model]
    Initial model developed with training data
    Model accuracy checked on training data, but later cross-validated on new data
    • Attributes can be boolean or numeric
    • Most useful if independent of other attributes
    • The fewer the attributes the better
    Anatomy of a Classifier
  • 21. write.table(expDF, sep=",", file="exploitData.csv")
      # I looked at the tweets and added
      # training data, using my judgment to code
      # the non-tweets
      truExp = read.table("truExp.txt")
      # Add to the existing data
      expDFtrue = cbind(expDF, truExp)
      # Note: new variable name defaults to “V1”
    Create and Add Training Data
  • 22. # Create true/false values for each row,
      # based on whether the string exists
      expDFtrue$hassec = grepl("security",tolower(expDFtrue$text))
      # Also count some punctuation to see if
      # there are clues there
      expDFtrue$numhash = sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
    Easy Predictors with Grepl()
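A small sketch of the predictor step on made-up tweets. The column names hassec, hasbuf, numhash, and twtlen follow the slides; the toy data frame and the gregexpr()-based count are assumptions. Counting matches directly avoids one quirk of the strsplit() approach, which drops a trailing empty piece and so misses a "#" at the very end of a tweet.

```r
toy <- data.frame(
  text = c("New #security alert: SQL injection #exploit",
           "Quel #exploit ce soir",
           "Buffer overflow in CMS #Exploit #Security #cyber"),
  stringsAsFactors = FALSE
)

# Boolean keyword flags, case-insensitive via tolower()
toy$hassec <- grepl("security", tolower(toy$text))
toy$hasbuf <- grepl("buffer", tolower(toy$text))

# Count "#" characters; gregexpr() returns -1 when there is no match,
# so only positive positions are counted
toy$numhash <- sapply(gregexpr("#", toy$text, fixed = TRUE),
                      function(pos) sum(pos > 0))

# Tweet length in characters
toy$twtlen <- nchar(toy$text)
print(toy[, c("hassec", "hasbuf", "numhash", "twtlen")])
```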
  • 23. Coefficients (output from logit analysis):
                     Estimate  Std. Error  z value  Pr(>|z|)
      (Intercept)   2.946e+00   1.904e+00    1.547   0.12187
      hassecTRUE    2.302e+00   1.186e+00    1.941   0.05222 .
      hassqlTRUE    1.770e+01   4.076e+03    0.004   0.99653
      hasbufTRUE   -2.476e+00   1.124e+00   -2.203   0.02757 *
      hasscrTRUE    1.832e+01   4.135e+03    0.004   0.99647
      hasremTRUE   -3.075e-01   1.109e+00   -0.277   0.78164
      hascybTRUE    1.202e+00   1.958e+00    0.614   0.53937
      numhash      -2.046e+00   7.218e-01   -2.835   0.00458 **
      numast       -2.182e+01   5.554e+03   -0.004   0.99687
      numdot        6.306e-01   4.167e-01    1.513   0.13017
      twtlen        6.548e-03   2.329e-02    0.281   0.77854
      # security, buffer keywords are promising, as well as the
      # number of hash marks and the number of dots/periods
    Choose Best Attributes
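The slide shows only the output, not the call that produced it. A coefficient table of this shape is what glm() with a binomial family reports, so the sketch below assumes that was the method. It fits simulated data (not the tweets from the slides), and the names sim and logitFit are hypothetical.

```r
set.seed(42)
n <- 200
# Synthetic predictors (hypothetical data, not the tweets from the slides)
sim <- data.frame(hassec = runif(n) < 0.4,
                  numhash = rpois(n, 2))

# Simulate labels where more hash marks make a true exploit tweet less likely
p <- plogis(2 + 1.0 * sim$hassec - 0.9 * sim$numhash)
sim$V1 <- rbinom(n, 1, p)

# Logistic regression: binary outcome V1 modeled from the two predictors
logitFit <- glm(V1 ~ hassec + numhash, data = sim, family = binomial)
coefTable <- summary(logitFit)$coefficients
print(round(coefTable, 4))
```

Very large estimates paired with huge standard errors, as for hassql, hasscr, and numast on the slide, usually indicate that a predictor separates the classes almost perfectly in a small sample, so those rows carry little usable information.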
  • 24. library(rpart)
      fit <- rpart(V1 ~ hassec + hasbuf + numhash + numdot, method="class", data=expDFtrue)
      summary(fit)
      plot(fit, uniform=TRUE, margin=0.1, branch=0.5, compress=TRUE)
      text(fit, use.n=TRUE, all=TRUE, cex=.8)
    [Figure: classification tree; root node (class 1, 26/71) splits on numhash>=2.5 into a class 0 leaf (14/2) and a class 1 leaf (12/69)]
    “numhash” only retained attribute, split at 2.5
    Overall 14 errors (83/97 = 85.6% correct)
    12 false positives (12/97 = 12.4% FP)
    2 false negatives (2/97 = 2.1% FN)
    Classification Tree Works OK
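The error rates can be re-derived from the leaf counts in the tree. This sketch assumes the usual rpart plotting convention that the left branch is where the split condition holds, and that each leaf label lists class 0 before class 1.

```r
# Leaf counts read from the tree: (actual 0 / actual 1) in each leaf
leafPred0 <- c(actual0 = 14, actual1 = 2)   # numhash >= 2.5, predicted 0
leafPred1 <- c(actual0 = 12, actual1 = 69)  # numhash <  2.5, predicted 1

total <- sum(leafPred0) + sum(leafPred1)
correct <- leafPred0["actual0"] + leafPred1["actual1"]
falsePos <- leafPred1["actual0"]  # predicted 1, actually 0
falseNeg <- leafPred0["actual1"]  # predicted 0, actually 1

# Percentages of all 97 training tweets, rounded to one decimal place
rates <- round(100 * c(correct = unname(correct),
                       FP = unname(falsePos),
                       FN = unname(falseNeg)) / total, 1)
print(rates)
```

These figures are measured on the training data, so, as the classifier slide notes, they should be cross-validated on new tweets before being trusted.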
  • 25. Conclusion 1: R is pretty handy for grabbing and manipulating tweet data
    Conclusion 2: Tweet data are messy and require a good deal of clean-up, parsing, and filtering
    Conclusion 3: As these two examples suggest, tweets can provide breaking news about vulnerabilities and exploits
      ◦ WordPress Age Verification plugin versions 0.4 and below open redirect vulnerability
          Exploit availability tweeted at 12:19 PM
          Blogged at SecurityBlog 10:24 PM
          Added to SiloBreaker two days later
      ◦ Pragyan CMS v 3.0 Remote File Disclosure
          Exploit availability tweeted at 11:07 AM
          Appeared on PacketStorm next day
          On RealHacker three days later
          On WebCriminal.ru eight days later
    Twitter: Early Warning System?

Editor's Notes

  1. This graph shows the Michael Jackson Effect, with a strong uptick in the half hour following the announcement of his death on Hollywood celebrity site TMZ.com (Doctors at UCLA Hospital had announced the death 18 minutes before that). Twitter crashed temporarily under the load at 3:15 PM. Twitter has about 200 million users and is the 9th most popular site on the web. On a typical day Twitter handles 200 million tweets and 1.6 billion search queries.
  2. Could also make a fancier PostScript plot with: post(fit, file = "tree.ps", title = "Classification Tree for Exploit Tweets")