SMS Spam Filter Design Using R:
A Machine Learning Approach

Reza Rahimi,
Ph.D. Candidate,
School of Information and Computer Science,
University of California, Irvine.
Introduction
• In basic terms, Machine Learning (ML) is about the
  construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved
  using machine learning techniques, such as:
   – Classification (Prediction):
       • A collection of records is given as a training set.
       • Each record contains a set of attributes, one of which is
         called the class.
       • The problem is to find a model for the class attribute as a function
         of the other attributes.
            –   Example: Spam or Ham, Handwriting Recognition,…
   – Clustering (Description):
       • A set of data points with some attributes is given, together with a
         similarity measure (metric) among them.
       • The goal is to find clusters such that data points in one cluster are
         more similar to one another than to points in other clusters.
            –   Example: Document Clustering, people categorization,…

   – Association (Description):
       • A set of records is given, each containing some items from a given
         collection.
       • The goal is to produce dependency rules that predict the
         occurrence of an item based on occurrences of other items.
            –   Example: user habit pattern recognition,…

   – Regression (Prediction):
       • Predict the value of a given continuous variable based on the values
         of other variables.
       • The model of dependency can be linear or nonlinear.
            –   Example: Stock prediction
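As a small illustration of two of the problem classes above (regression and clustering), the following sketch uses only base R. The data points are invented for this example and are not part of the SMS data set.

```r
# Toy data invented for illustration only.
set.seed(1)

# Regression: recover the line y = 2x + 1 from noise-free points.
x <- 1:10
y <- 2 * x + 1
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])
intercept <- unname(coef(fit)["(Intercept)"])

# Clustering: two well-separated groups of one-dimensional points.
pts <- matrix(c(0.0, 0.1, 0.2, 10.0, 10.1, 10.2), ncol = 1)
clusters <- kmeans(pts, centers = 2, nstart = 5)$cluster

print(c(slope = slope, intercept = intercept))
print(clusters)
```

The cluster numbers returned by kmeans are arbitrary labels; only the grouping of the points is meaningful.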
Problem Solving Using the Machine
     Learning Framework
• ML is a mature and well-developed area.
• For each of the problem classes mentioned above, it
  offers a rich set of tools, techniques and
  algorithms.
• These tools are available in different languages and
  frameworks such as R, Matlab, Java, C++, Mahout,…
• The following procedure can be considered a
  general methodology for problem solving in this
  framework:
• Get a sense of data:
   – Feature extraction, dimension reduction, noise cancellation,…
• Problem modeling:
   – Classification, Clustering, Association, Regression,…
• Run standard ML algorithms:
   – Check the errors according to the standard ML metrics.
• Select the methods that satisfy your performance criteria and metrics.

• In the next section I will describe the design of an SMS spam filter in the
  R language based on this methodology.
SMS Spam Filter using R
# This file contains the SMS spam filter code, with different classifiers, in R
# Written by: Reza Rahimi
# Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
# loading required packages and libraries, and function declarations

# Required package for text mining
if(!require("tm"))
          install.packages("tm")

# Required package for SVM
if(!require("e1071"))
          install.packages("e1071")

# Required package for KNN
if(!require("RWeka"))
          install.packages("RWeka", dependencies = TRUE)

# Required package for Adaboost
if(!require("ada"))
          install.packages("ada")

library("tm")
library("e1071")
library("RWeka")
library("ada")
R Codes (Cont.)
# Initialize the random number generator
set.seed(1245)

# This function builds a vector (vector space model) from a text message
# by counting the occurrences of each highly repeated word
vsm<-function(message,highlyrepeatedwords){

          # Split the message on runs of whitespace
          tokenizedmessage<-strsplit(message, "\\s+")[[1]]

          # Making the vector; seq_along() also handles an empty message safely
          v<-rep(0, length(highlyrepeatedwords))
          for(i in seq_along(highlyrepeatedwords)){
                    for(j in seq_along(tokenizedmessage)){
                              if(highlyrepeatedwords[i]==tokenizedmessage[j]){
                                        v[i]<-v[i]+1
                              }
                    }
          }
          return (v)
}

# Loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# header = TRUE assumes the file starts with a line naming the columns "type" and "sms".
# quote = "" keeps quote characters inside messages from being read as field delimiters.
print("Uploading SMS Spams and Hams!")
smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", quote = "",
  colClasses=c("type"="character","sms"="character"))
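To make the vector space model concrete, here is a minimal sketch of the same word-counting idea applied to one message. The function name vsm_sketch, the vocabulary, and the message are hypothetical and chosen only for illustration.

```r
# Compact equivalent of the word-counting idea: for each vocabulary word,
# count how often it occurs among the whitespace-separated tokens.
vsm_sketch <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), integer(1), USE.NAMES = FALSE)
}

vocabulary <- c("free", "call", "now")   # hypothetical highly repeated words
message <- "free entry call now now"     # hypothetical SMS text
v <- vsm_sketch(message, vocabulary)
print(v)
```

Each position of the result corresponds to one vocabulary word, so every message maps to a vector of the same length regardless of how long the message is.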
R Codes (Cont.)
smstabletmp<-smstable

print("Extracting Ham and Spam Basic Statistics!")

# Encode the class as 1 = ham, 0 = spam
smstabletmp$type[smstabletmp$type=="ham"] <- 1
smstabletmp$type[smstabletmp$type=="spam"] <- 0

# Convert character data into numeric
tmp<-as.numeric(smstabletmp$type)

# Basic statistics: mean and variance of the ham indicator
# (the mean is the fraction of messages that are ham)
hamavg<-mean(tmp)
print("Average Ham is :");hamavg

hamvariance<-var(tmp)
print("Var of Ham is :");hamvariance

print("Extract average token of Hams and Spams!")

# Counters for the number of messages and tokens per class
nohamtokens<-0
noham<-0
nospamtokens<-0
nospam<-0
R Codes (Cont.)
# Count messages and whitespace-separated tokens for each class
for(i in 1:length(smstable$type)){
          if(smstable[i,1]=="ham"){
                    nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
                    noham<-noham+1
          }else{
                    nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
                    nospam<-nospam+1
          }
}

totaltokens<-nospamtokens+nohamtokens;
print("total number of tokens is:")
print(totaltokens)

avgtokenperham<-nohamtokens/noham
print("Average number of tokens per ham message")
print(avgtokenperham)

avgtokenperspam<-nospamtokens/nospam
print("Average number of tokens per spam message")
print(avgtokenperspam)

print("Make two different sets, training data and test data!")
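The token statistics above can also be computed without an explicit loop. The following is a minimal sketch on a small invented data frame with the same two columns (type and sms); the messages are hypothetical.

```r
# Hypothetical messages standing in for the SMS table.
sms <- data.frame(
  type = c("ham", "spam", "ham", "spam"),
  sms  = c("see you soon", "win a free prize now", "ok", "call now"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message.
tokens_per_message <- lengths(strsplit(sms$sms, "\\s+"))

# Total token count and average tokens per class.
total_tokens <- sum(tokens_per_message)
avg_tokens <- tapply(tokens_per_message, sms$type, mean)

print(total_tokens)
print(avg_tokens)
```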
R Codes (Cont.)
# Select the fraction of the data to use as the training set
trdatapercent<-0.3

# Training data set
trdata=NULL

# Test data set
tedata=NULL

# Each message goes to the training set with probability trdatapercent,
# otherwise to the test set; the text is lower-cased on the way
for(i in 1:length(smstable$type)){
          if(runif(1)<trdatapercent){
                    trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
          }
          else{
                    tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
          }
}

print("Training data size is:")
dim(trdata)

print("Test data size is:")
dim(tedata)
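The random split above can be sketched in vectorized form as follows. The data frame is invented for illustration, and train_fraction plays the role of trdatapercent. Because each message is assigned independently, the training share is 30 percent only in expectation.

```r
set.seed(1245)

# Hypothetical table with 100 messages.
sms <- data.frame(
  type = rep(c("ham", "spam"), 50),
  sms  = paste("Message", 1:100),
  stringsAsFactors = FALSE
)

train_fraction <- 0.3

# One uniform draw per message decides which set it belongs to.
in_train <- runif(nrow(sms)) < train_fraction

sms$sms <- tolower(sms$sms)
train <- sms[in_train, ]
test  <- sms[!in_train, ]

print(dim(train))
print(dim(test))
```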
R Codes (Cont.)
# Text feature extraction using the tm package

trsmses<-Corpus(VectorSource(trdata[,2]))
trsmses<-tm_map(trsmses, stripWhitespace)
# tolower is a base R function; recent tm versions expect it to be wrapped
# in content_transformer() so that the corpus structure is preserved
trsmses<-tm_map(trsmses, content_transformer(tolower))
trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(trsmses)

# Terms that occur at least 80 times in the training corpus
highlyrepeatedwords<-findFreqTerms(dtm, 80)

# These highly used words are used as an index to make the VSM
# (vector space model) for the training data and test data

# Vectorized training data set
vtrdata=NULL

# Vectorized test data set
vtedata=NULL
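The idea behind selecting highly repeated words can be sketched in base R as follows. This is only an approximation of what the tm pipeline does: tm applies its own tokenization and defaults, so its results can differ. The messages and the threshold min_count are hypothetical.

```r
# Hypothetical training messages.
messages <- c("free prize call now", "call me now", "free free entry")

# Split all messages into lower-cased, whitespace-separated tokens.
tokens <- unlist(strsplit(tolower(messages), "\\s+"))

# Count each distinct token and keep those at or above the threshold.
term_counts <- table(tokens)
min_count <- 2
frequent_terms <- sort(names(term_counts)[as.vector(term_counts) >= min_count])

print(frequent_terms)
```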
R Codes (Cont.)
# Build the vectorized training set: the first column is the class label
# (1 = ham, 0 = spam), the remaining columns are the word counts
for(i in 1:length(trdata[,2])){
          if(trdata[i,1]=="ham"){
                    vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
          }
          else{
                    vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
          }

}

# Build the vectorized test set in the same way
for(i in 1:length(tedata[,2])){
          if(tedata[i,1]=="ham"){
                    vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
          }
          else{
                    vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
          }

}
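The two loops above produce one row per message, with the label in the first column. A minimal sketch of the same layout on invented data is shown below; count_words, the vocabulary, and the messages are hypothetical names and values used only for this example.

```r
# Count vocabulary words among the whitespace-separated tokens of a message.
count_words <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), integer(1), USE.NAMES = FALSE)
}

vocabulary <- c("free", "call")   # hypothetical highly repeated words
train <- data.frame(
  type = c("ham", "spam", "spam"),
  sms  = c("call me later", "free free prize", "call now free"),
  stringsAsFactors = FALSE
)

# One row of word counts per message.
features <- do.call(rbind, lapply(train$sms, count_words, vocabulary = vocabulary))

# Label column: 1 = ham, 0 = spam, followed by the feature columns.
labels <- as.integer(train$type == "ham")
vectorized <- cbind(labels, features)

print(vectorized)
```

Building the list of rows first and binding once avoids growing the matrix with rbind inside a loop, which becomes slow for thousands of messages.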
R Codes (Cont.)
# Run different classification algorithms
# Different SVMs with different kernels
# The labels are passed as a factor and the type is set explicitly, so that
# every kernel trains a classifier rather than a regression model
print("----------------------------------SVM-----------------------------------------")
print("Linear Kernel")
svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=factor(vtrdata[,1]),type='C-classification', kernel='linear');
summary(svmlinmodel)
predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
# Fraction of correctly classified test messages (accuracy)
accuracylin<-sum(diag(tablinear))/sum(tablinear);
print("General Error using Linear SVM is (in percent):");(1-accuracylin)*100
# Rows are predictions and columns are true labels, both ordered 0 (spam), 1 (ham)
print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

print("Polynomial Kernel")
svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=factor(vtrdata[,1]),type='C-classification', kernel='polynomial', probability=FALSE)
summary(svmpolymodel)
predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

print("Radial Kernel")
svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=factor(vtrdata[,1]),type='C-classification', kernel = "radial", gamma = 0.09, cost = 1,
  probability=FALSE)
summary(svmradmodel)
predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
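The error figures printed for the linear kernel are read off a confusion table. The sketch below reproduces that arithmetic on invented labels (1 = ham, 0 = spam), without training any model; the truth and pred vectors are hypothetical.

```r
# Hypothetical true labels and predictions for ten test messages.
truth <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0), levels = c(0, 1))
pred  <- factor(c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are true labels.
tab <- table(pred = pred, true = truth)

# Overall accuracy: correct predictions lie on the diagonal.
accuracy <- sum(diag(tab)) / sum(tab)

# Ham error: ham messages predicted as spam, divided by all ham messages.
ham_error <- tab["0", "1"] / sum(tab[, "1"])

# Spam error: spam messages predicted as ham, divided by all spam messages.
spam_error <- tab["1", "0"] / sum(tab[, "0"])

print(tab)
print(c(accuracy = accuracy, ham_error = ham_error, spam_error = spam_error))
```

For a spam filter the two error types usually carry different costs: a ham message blocked as spam is typically worse than a spam message that slips through, which is a reason to report them separately rather than only the overall error.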
R Codes (Cont.)
print("----------------------------------KNN-----------------------------------------")
# IBk takes a formula and a data frame; the class column must be a factor
data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
summary(classifier)
evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))

print("---------------------------------Adaboost-------------------------------------")
adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])],
  test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
summary(adaptiveboost)
varplot(adaptiveboost)
Conclusions
•   In these slides I gave a broad overview of ML and the different
    problems that can be solved in this framework.
•   I reviewed in detail one way to implement an SMS spam filter
    using ML techniques in the R language.
•   ML provides a strong framework for solving problems in the Big Data
    domain.

More Related Content

What's hot

Final spam-e-mail-detection
Final  spam-e-mail-detectionFinal  spam-e-mail-detection
Final spam-e-mail-detectionPartnered Health
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filteringijtsrd
 
Emotion Detection in text
Emotion Detection in text Emotion Detection in text
Emotion Detection in text kashif kashif
 
Facial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approachFacial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approachAshwinRachha
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...IRJET Journal
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
Facial Expression Recognition System using Deep Convolutional Neural Networks.
Facial Expression Recognition  System using Deep Convolutional Neural Networks.Facial Expression Recognition  System using Deep Convolutional Neural Networks.
Facial Expression Recognition System using Deep Convolutional Neural Networks.Sandeep Wakchaure
 
Presentation2.pptx
Presentation2.pptxPresentation2.pptx
Presentation2.pptxWanderer20
 
Machine Learning
Machine LearningMachine Learning
Machine LearningRahul Kumar
 
Machine Learning
Machine LearningMachine Learning
Machine LearningGokulks007
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptxHchethankumar
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networksguestfee8698
 
Machine learning: Stock Price Prediction
Machine learning: Stock Price PredictionMachine learning: Stock Price Prediction
Machine learning: Stock Price Predictioneurosigdoc acm
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using mlPravin Katiyar
 
Python Summer Internship
Python Summer InternshipPython Summer Internship
Python Summer InternshipAtul Kumar
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 

What's hot (20)

Sms spam-detection
Sms spam-detectionSms spam-detection
Sms spam-detection
 
Spam Detection Using Natural Language processing
Spam Detection Using Natural Language processingSpam Detection Using Natural Language processing
Spam Detection Using Natural Language processing
 
Final spam-e-mail-detection
Final  spam-e-mail-detectionFinal  spam-e-mail-detection
Final spam-e-mail-detection
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filtering
 
Emotion Detection in text
Emotion Detection in text Emotion Detection in text
Emotion Detection in text
 
Facial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approachFacial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approach
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
Facial Expression Recognition System using Deep Convolutional Neural Networks.
Facial Expression Recognition  System using Deep Convolutional Neural Networks.Facial Expression Recognition  System using Deep Convolutional Neural Networks.
Facial Expression Recognition System using Deep Convolutional Neural Networks.
 
Presentation2.pptx
Presentation2.pptxPresentation2.pptx
Presentation2.pptx
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptx
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Housing price prediction
Housing price predictionHousing price prediction
Housing price prediction
 
Machine learning: Stock Price Prediction
Machine learning: Stock Price PredictionMachine learning: Stock Price Prediction
Machine learning: Stock Price Prediction
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using ml
 
Python Summer Internship
Python Summer InternshipPython Summer Internship
Python Summer Internship
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Emotion recognition
Emotion recognitionEmotion recognition
Emotion recognition
 

Viewers also liked

Algorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConAlgorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConmattthemathman
 
Machine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterMachine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterSergey A. Razin
 
Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Sergey A. Razin
 
Self-Tuning Data Centers
Self-Tuning Data CentersSelf-Tuning Data Centers
Self-Tuning Data CentersReza Rahimi
 
Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Jaebok Oh
 
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...Denim Group
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptbutest
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Michael Allen
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at ScaleRaffael Marty
 
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Alex Pinto
 
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedAI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedRaffael Marty
 
Ankus 제품소개서
Ankus 제품소개서Ankus 제품소개서
Ankus 제품소개서onycom1
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 

Viewers also liked (17)

Algorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConAlgorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozCon
 
Machine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterMachine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data Center
 
Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)
 
Self-Tuning Data Centers
Self-Tuning Data CentersSelf-Tuning Data Centers
Self-Tuning Data Centers
 
Shadow wall utm
Shadow wall utmShadow wall utm
Shadow wall utm
 
Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서
 
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.ppt
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at Scale
 
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
 
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedAI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
 
Ankus 제품소개서
Ankus 제품소개서Ankus 제품소개서
Ankus 제품소개서
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to SMS Spam Filter Design Using R: A Machine Learning Approach

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011Mandi Walls
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to pythonActiveState
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata londonkperi
 
Supervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptSupervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptVenneladonthireddy1
 
Supervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptSupervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptKush736264
 
Advanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAdvanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAnshika865276
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slidesaiclub_slides
 

Similar to SMS Spam Filter Design Using R: A Machine Learning Approach (20)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Aggregate.pptx
Aggregate.pptxAggregate.pptx
Aggregate.pptx
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
04 standard class library c#
04 standard class library c#04 standard class library c#
04 standard class library c#
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Supervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptSupervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.ppt
 
Supervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptSupervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.ppt
 
Advanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAdvanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.ppt
 
Language R
Language RLanguage R
Language R
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slides
 

More from Reza Rahimi

Boosting Personalization In SaaS Using Machine Learning.pdf
Boosting Personalization  In SaaS Using Machine Learning.pdfBoosting Personalization  In SaaS Using Machine Learning.pdf
Boosting Personalization In SaaS Using Machine Learning.pdfReza Rahimi
 
Self-Tuning and Managing Services
Self-Tuning and Managing ServicesSelf-Tuning and Managing Services
Self-Tuning and Managing ServicesReza Rahimi
 
Low Complexity Secure Code Design for Big Data in Cloud Storage Systems
Low Complexity Secure Code Design for Big Data in Cloud Storage SystemsLow Complexity Secure Code Design for Big Data in Cloud Storage Systems
Low Complexity Secure Code Design for Big Data in Cloud Storage SystemsReza Rahimi
 
Smart Connectivity
Smart ConnectivitySmart Connectivity
Smart ConnectivityReza Rahimi
 
The Next Big Thing in IT
The Next Big Thing in ITThe Next Big Thing in IT
The Next Big Thing in ITReza Rahimi
 
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud Computing
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud ComputingQoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud Computing
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud ComputingReza Rahimi
 
On Optimal and Fair Service Allocation in Mobile Cloud Computing
On Optimal and Fair Service Allocation in Mobile Cloud ComputingOn Optimal and Fair Service Allocation in Mobile Cloud Computing
On Optimal and Fair Service Allocation in Mobile Cloud ComputingReza Rahimi
 
Mobile Applications on an Elastic and Scalable 2-Tier Cloud Architecture
Mobile Applications on an Elastic and Scalable 2-Tier Cloud ArchitectureMobile Applications on an Elastic and Scalable 2-Tier Cloud Architecture
Mobile Applications on an Elastic and Scalable 2-Tier Cloud ArchitectureReza Rahimi
 
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile Applications
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile ApplicationsExploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile Applications
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile ApplicationsReza Rahimi
 
Fingerprint High Level Classification
Fingerprint High Level ClassificationFingerprint High Level Classification
Fingerprint High Level ClassificationReza Rahimi
 
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...Reza Rahimi
 
Optimizing Multicast Throughput in IP Network
Optimizing Multicast Throughput in IP NetworkOptimizing Multicast Throughput in IP Network
Optimizing Multicast Throughput in IP NetworkReza Rahimi
 
The Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemThe Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemReza Rahimi
 
Mobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureMobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureReza Rahimi
 
Network Information Processing
Network Information ProcessingNetwork Information Processing
Network Information ProcessingReza Rahimi
 
Pervasive Image Computation: A Mobile Phone Application for getting Informat...
Pervasive Image Computation: A Mobile  Phone Application for getting Informat...Pervasive Image Computation: A Mobile  Phone Application for getting Informat...
Pervasive Image Computation: A Mobile Phone Application for getting Informat...Reza Rahimi
 
Gaussian Integration
Gaussian IntegrationGaussian Integration
Gaussian IntegrationReza Rahimi
 
Interactive Proof Systems and An Introduction to PCP
Interactive Proof Systems and An Introduction to PCPInteractive Proof Systems and An Introduction to PCP
Interactive Proof Systems and An Introduction to PCPReza Rahimi
 
Quantum Computation and Algorithms
Quantum Computation and Algorithms Quantum Computation and Algorithms
Quantum Computation and Algorithms Reza Rahimi
 

SMS Spam Filter Design Using R: A Machine Learning Approach

  • 1. SMS Spam Filter Design Using R: A Machine Learning Approach Reza Rahimi, Ph.D Candidate, School of Information and Computer Science, University of California, Irvine.
  • 2. Introduction
      • In basic terms, Machine Learning (ML) is about the construction of systems that can learn from data.
      • It is used as a tool for knowledge discovery.
      • Several important classes of problems can be solved using machine learning techniques, such as:
          – Classification (Prediction):
              • Given a collection of records as a training set.
              • Each record contains a set of attributes, one of which is called the class.
              • The problem is to find a model for the class attribute as a function of the other attributes.
                  – Example: Spam or Ham, Handwriting Recognition,…
  • 3.
          – Clustering (Description):
              • Given a set of data points, with some attributes, and a similarity measure (metric) among them.
              • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
                  – Example: Document Clustering, people categorization,…
          – Association (Description):
              • Given a set of records, each of which contains some items from a given collection.
              • The goal is to produce dependency rules that predict the occurrence of an item based on occurrences of other items.
                  – Example: user habit pattern recognition,…
          – Regression (Prediction):
              • Predict the value of a given continuous variable based on the values of other variables.
              • The model of dependency can be linear or nonlinear.
                  – Example: Stock prediction
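To make the clustering and regression definitions above concrete, here is a minimal sketch in base R. The toy data, the variable names and the choice of `kmeans` and `lm` are assumptions made for illustration; they are not part of the original slides.

```r
# Toy data (hypothetical): two well-separated groups of 2-D points.
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# Clustering: k-means with k = 2 groups the points by Euclidean distance.
km <- kmeans(pts, centers = 2, nstart = 5)
cluster_sizes <- sort(as.vector(table(km$cluster)))

# Regression: fit y = a + b * x to noiseless toy data with a = 1, b = 2.
x <- 1:10
y <- 1 + 2 * x
fit <- lm(y ~ x)
coefs <- unname(coef(fit))
```

With groups this far apart, each cluster should recover one of the two groups, and the fitted coefficients should match the intercept and slope used to generate `y`.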
  • 4. Problem Solving Using Machine Learning Framework
      • ML is a mature and well-developed area.
      • For all of the problem classes mentioned above, it offers rich resources of tools, techniques and algorithms.
      • These tools are available in different languages and frameworks such as R, Matlab, Java, C++, Mahout,…
      • The following procedure can be considered a general methodology for problem solving in this framework:
  • 5.
      • Get a sense of data: feature extraction, dimension reduction, noise cancellation,…
      • Problem modeling: Classification, Clustering, Association, Regression,…
      • Run standard ML algorithms: check the errors according to the standard ML metrics.
      • Select the methods that satisfy your performance criteria and metrics.
      • In the next section I will describe the design of an SMS Spam Filter in the R language based on this methodology.
  • 6. SMS Spam Filter using R
      # This file contains SMS spam filter code with different classifiers in the R language
      # Written by: Reza Rahimi
      # Initialization: Raw Data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
      # loading required packages, libraries and function declaration

      # required package for text mining
      if(!require("tm"))
        install.packages("tm")

      # required package for SVM
      if(!require("e1071"))
        install.packages("e1071")

      # required package for KNN
      if(!require("RWeka"))
        install.packages("RWeka", dependencies = TRUE)

      # required package for Adaboost
      if(!require("ada"))
        install.packages("ada")

      library("tm")
      library("e1071")
      library(RWeka)
      library("ada")
  • 7. R Codes (Cont.)
      # Initialize random generator
      set.seed(1245)

      # This function makes a vector (Vector Space Model) from a text message
      # by counting occurrences of each highly repeated word
      vsm<-function(message,highlyrepeatedwords){

        # split the message on runs of whitespace
        tokenizedmessage<-strsplit(message, "\\s+")[[1]]

        # making vector (seq_along avoids the 1:0 trap on empty input)
        v<-rep(0, length(highlyrepeatedwords))
        for(i in seq_along(highlyrepeatedwords)){
          for(j in seq_along(tokenizedmessage)){
            if(highlyrepeatedwords[i]==tokenizedmessage[j]){
              v[i]<-v[i]+1
            }
          }
        }
        return (v)
      }

      # loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
      # (the file is tab-separated and is expected to have a header row naming the columns "type" and "sms")
      print("Uploading SMS Spams and Hams!\n")
      smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", colClasses=c("type"="character","sms"="character"))
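The nested loop in `vsm` compares every vocabulary word with every token. As a sketch of an alternative (not the author's code), the same counts can be obtained with `table` over a factor whose levels are the vocabulary. The function name `vsm_fast` and the sample message are hypothetical.

```r
# Hypothetical vectorized variant of vsm(); same inputs and output shape.
vsm_fast <- function(message, highlyrepeatedwords) {
  tokens <- strsplit(message, "\\s+")[[1]]
  # factor() with fixed levels keeps a slot for every vocabulary word;
  # tokens outside the vocabulary become NA and are dropped by table().
  counts <- table(factor(tokens, levels = highlyrepeatedwords))
  as.numeric(counts)
}

vocab <- c("free", "call", "now")
v <- vsm_fast("call now for a free free prize", vocab)
```

Here `v` holds one count per vocabulary word, in vocabulary order: two for "free", one for "call" and one for "now".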
  • 8. R Codes (Cont.)
      smstabletmp<-smstable

      print("Extracting Ham and Spam Basic Statistics!")

      # encode ham as 1 and spam as 0
      smstabletmp$type[smstabletmp$type=="ham"] <- 1
      smstabletmp$type[smstabletmp$type=="spam"] <- 0

      # Convert character data into numeric
      tmp<-as.numeric(smstabletmp$type)

      # Basic statistics: mean and variance of the ham indicator
      # (the mean is the fraction of messages that are ham)
      hamavg<-mean(tmp)
      print("Average Ham is :");hamavg

      hamvariance<-var(tmp)
      print("Var of Ham is :");hamvariance

      print("Extract average token of Hams and Spams!")

      nohamtokens<-0
      noham<-0
      nospamtokens<-0
      nospam<-0
  • 9. R Codes (Cont.)
      # count messages and whitespace-separated tokens per class
      for(i in 1:length(smstable$type)){
        if(smstable[i,1]=="ham"){
          nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
          noham<-noham+1
        }else{
          nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
          nospam<-nospam+1
        }
      }

      totaltokens<-nospamtokens+nohamtokens;
      print("total number of tokens is:")
      print(totaltokens)

      avgtokenperham<-nohamtokens/noham
      print("Average number of tokens per ham message")
      print(avgtokenperham)

      avgtokenperspam<-nospamtokens/nospam
      print("Average number of tokens per spam message")
      print(avgtokenperspam)

      print(" Make two different sets, training data and test data!")
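The per-class token statistics above can also be computed without an explicit loop. The following is a sketch under the assumption of a data frame with the same `type` and `sms` columns; the `toy` data frame is a hypothetical stand-in for `smstable`.

```r
# Hypothetical stand-in for smstable with the same two columns.
toy <- data.frame(type = c("ham", "spam", "ham"),
                  sms  = c("see you at noon", "win a free prize now", "ok"),
                  stringsAsFactors = FALSE)

# Number of whitespace-separated tokens in each message.
ntokens <- lengths(strsplit(toy$sms, "\\s+"))

# Mean token count per class.
avgtokens <- tapply(ntokens, toy$type, mean)
```

`sum(ntokens)` plays the role of `totaltokens`, and `avgtokens` holds the two per-class averages, indexed by class name.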
  • 10. R Codes (Cont.)
      # select the fraction of data that you want to use as training set
      trdatapercent<-0.3

      # training data set
      trdata=NULL

      # test data set
      tedata=NULL

      # assign each message at random to training or test data, lowercasing the text
      for(i in 1:length(smstable$type)){
        if(runif(1)<trdatapercent){
          trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
        }
        else{
          tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
        }
      }

      print("Training data size is:")
      dim(trdata)

      print("Test data size is:")
      dim(tedata)
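Growing a matrix with `rbind` inside a loop copies the accumulated rows on every iteration. A possible alternative, sketched here on a hypothetical `toy` data frame rather than the real `smstable`, draws one random number per row and uses the resulting logical mask to split the data.

```r
set.seed(1245)

# Hypothetical stand-in for smstable.
toy <- data.frame(type = rep(c("ham", "spam"), times = 50),
                  sms  = paste("Message", 1:100),
                  stringsAsFactors = FALSE)
toy$sms <- tolower(toy$sms)

# Each row goes to the training set with probability trdatapercent.
trdatapercent <- 0.3
intrain <- runif(nrow(toy)) < trdatapercent

trtoy <- toy[intrain, ]
tetoy <- toy[!intrain, ]
```

As in the loop version, the training set size is random and only about 30% of the rows on average. If an exact size is needed, `sample` over the row indices would be the usual choice.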
  • 11. R Codes (Cont.)
      # Text feature extraction using tm package

      trsmses<-Corpus(VectorSource(trdata[,2]))
      trsmses<-tm_map(trsmses, stripWhitespace)
      # recent tm versions expect base functions to be wrapped in content_transformer()
      trsmses<-tm_map(trsmses, content_transformer(tolower))
      trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

      dtm <- DocumentTermMatrix(trsmses)

      # terms that occur at least 80 times in the training messages
      highlyrepeatedwords<-findFreqTerms(dtm, 80)

      # These highly used words are used as an index to make VSM
      # (vector space model) for training data and test data

      # vectorized training data set
      vtrdata=NULL

      # vectorized test data set
      vtedata=NULL
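The idea behind selecting highly repeated words can be sketched in base R as counting tokens over all messages and keeping those above a threshold. This is only an illustration on hypothetical messages; tm applies its own tokenization and filtering, so its output on real data may differ from this simplified version.

```r
# Hypothetical messages standing in for the training SMS texts.
msgs <- c("free prize call now", "call me now", "free free call")

# Count every whitespace-separated token across all messages.
tokens <- unlist(strsplit(tolower(msgs), "\\s+"))
freq <- table(tokens)

# Keep the words that occur at least `lowfreq` times in total.
lowfreq <- 2
frequent <- sort(names(freq)[freq >= lowfreq])
```

The resulting `frequent` vector plays the role of `highlyrepeatedwords`: it fixes the order and length of every message vector built afterwards.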
  • 12. R Codes (Cont.)
      # build the training matrix: first column is the label (ham = 1, spam = 0),
      # remaining columns are the word counts
      for(i in 1:length(trdata[,2])){
        if(trdata[i,1]=="ham"){
          vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
        }
        else{
          vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
        }
      }

      # build the test matrix in the same way
      for(i in 1:length(tedata[,2])){
        if(tedata[i,1]=="ham"){
          vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
        }
        else{
          vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
        }
      }
  • 13. R Codes (Cont.)
      # Run different classification algorithms
      # different SVMs with different kernels
      print("----------------------------------SVM-----------------------------------------")
      print("Linear Kernel")
      svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C-classification', kernel='linear');
      summary(svmlinmodel)
      predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
      # confusion table: rows are predictions, columns are true labels (0 = spam, 1 = ham)
      tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
      # fraction of correctly classified messages (overall accuracy)
      precisionlin<-sum(diag(tablinear))/sum(tablinear);
      print("General Error using Linear SVM is (in percent):");(1-precisionlin)*100
      print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
      print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

      # y is numeric, so the type is set explicitly; otherwise svm() fits a regression
      print("Polynomial Kernel")
      svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification', kernel='polynomial', probability=FALSE)
      summary(svmpolymodel)
      predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
      tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

      print("Radial Kernel")
      svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification', kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
      summary(svmradmodel)
      predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
      tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
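The three error figures printed for the linear kernel can be wrapped in a small helper so that the same numbers are available for the polynomial and radial tables. This is a sketch, not part of the original code; the name `class_errors` and the toy predictions are hypothetical, and it assumes the same table layout (rows are predictions, columns are true labels, 0 = spam, 1 = ham).

```r
# Hypothetical helper: overall and per-class error from a confusion table
# whose rows are predictions and columns are true labels (0 = spam, 1 = ham).
class_errors <- function(tab) {
  c(overall = 1 - sum(diag(tab)) / sum(tab),
    spam    = tab["1", "0"] / sum(tab[, "0"]),  # spam predicted as ham
    ham     = tab["0", "1"] / sum(tab[, "1"]))  # ham predicted as spam
}

# Toy predictions and labels; fixed levels keep the table 2 x 2.
pred <- factor(c(0, 0, 1, 1, 1, 1), levels = c(0, 1))
true <- factor(c(0, 1, 1, 1, 0, 1), levels = c(0, 1))
errs <- class_errors(table(pred = pred, true = true))
```

Indexing the table by label name rather than by position keeps the helper correct even if the factor levels are ordered differently. For a spam filter the ham error is usually the more costly one, since it corresponds to legitimate messages being filtered out.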
  • 14. R Codes (Cont.)
      print("----------------------------------KNN-----------------------------------------")
      # IBk takes a formula and a data frame; the class is made a factor so that
      # Weka treats the task as classification
      data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
      classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
      summary(classifier)
      evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))

      print("---------------------------------Adaboost-------------------------------------")
      adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
      summary(adaptiveboost)
      varplot(adaptiveboost)
  • 15. Conclusions
      • In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.
      • I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.
      • ML provides a strong framework for solving problems in the Big Data domain.