
SMS Spam Filter Design Using R: A Machine Learning Approach



Published in: Technology


  1. SMS Spam Filter Design Using R: A Machine Learning Approach. Reza Rahimi, Ph.D. Candidate, School of Information and Computer Science, University of California, Irvine.
  2. Introduction
     • In basic terms, machine learning (ML) is about constructing systems that can learn from data.
     • It is used as a tool for knowledge discovery.
     • Several important classes of problems can be solved with machine learning techniques:
       – Classification (prediction):
         • Given a collection of records as a training set.
         • Each record contains a set of attributes, one of which is called the class.
         • The problem is to find a model for the class attribute as a function of the other attributes.
         • Examples: spam or ham, handwriting recognition, …
  3.   – Clustering (description):
         • Given a set of data points with some attributes and a similarity measure (metric) among them.
         • The goal is to find clusters such that data points within a cluster are more similar to one another than to points in other clusters.
         • Examples: document clustering, people categorization, …
       – Association (description):
         • Given a set of records, each containing some items from a given collection.
         • The goal is to produce dependency rules that predict the occurrence of an item from the occurrences of other items.
         • Example: user habit pattern recognition, …
       – Regression (prediction):
         • Predict the value of a continuous variable from the values of other variables.
         • The dependency model can be linear or nonlinear.
         • Example: stock prediction
  4. Problem Solving Using the Machine Learning Framework
     • ML is a mature, well-developed area.
     • For each of the problem classes mentioned, it offers rich resources of tools, techniques, and algorithms.
     • These tools are available in different languages and frameworks, such as R, Matlab, Java, C++, Mahout, …
     • The following procedure can be considered a general methodology for problem solving in this framework:
  5. • Get a sense of the data: feature extraction, dimension reduction, noise cancellation, …
     • Problem modeling: classification, clustering, association, regression, …
     • Run standard ML algorithms: check the errors according to the standard ML metrics.
     • Select the methods that satisfy your performance criteria and metrics.

     In the next section I describe the design of an SMS spam filter in R based on this methodology.
  6. SMS Spam Filter using R
     # This file contains SMS spam filter code with different classifiers in R
     # Written by: Reza Rahimi
     # Initialization: Raw Data (,
     # Loading required packages, libraries and function declaration

     # required package for text mining
     if(!require("tm"))
       install.packages("tm")

     # required package for SVM
     if(!require("e1071"))
       install.packages("e1071")

     # required package for KNN
     if(!require("RWeka"))
       install.packages("RWeka", dependencies = TRUE)

     # required package for Adaboost
     if(!require("ada"))
       install.packages("ada")

     library("tm")
     library("e1071")
     library(RWeka)
     library("ada")
  7. R Codes (Cont.)
     # Initialize random generator
     set.seed(1245)

     # This function makes a vector (Vector Space Model) from a text message
     # by counting occurrences of the highly repeated words
     vsm<-function(message,highlyrepeatedwords){

       # split the message on runs of whitespace
       tokenizedmessage<-strsplit(message, "\\s+")[[1]]

       # making vector
       v<-rep(0, length(highlyrepeatedwords))
       for(i in seq_along(highlyrepeatedwords)){
         for(j in seq_along(tokenizedmessage)){
           if(highlyrepeatedwords[i]==tokenizedmessage[j]){
             v[i]<-v[i]+1
           }
         }
       }
       return (v)
     }

     # loading data. Original data is from
     print("Uploading SMS Spams and Hams!")
     # tab-separated file with columns "type" and "sms"
     smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", colClasses=c("type"="character","sms"="character"))
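The counting done by `vsm` can also be written without explicit loops. The following is a minimal sketch rather than the author's code: `vsm_vectorized` is a hypothetical name, and it assumes the same whitespace tokenization as the original function.

```r
# Hypothetical loop-free variant of vsm(): counts how often each
# vocabulary word occurs in a whitespace-tokenized message.
vsm_vectorized <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  # factor() with fixed levels keeps the vocabulary order and gives 0
  # for vocabulary words that do not occur; other tokens are ignored
  as.vector(table(factor(tokens, levels = vocabulary)))
}

# Toy usage: "free" occurs twice, "call" once, "txt" never
v <- vsm_vectorized("free call now free", c("free", "call", "txt"))
print(v)
```

Each position of the result corresponds to one vocabulary word, so every message maps to a vector of the same length regardless of how long the message is.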
  8. R Codes (Cont.)
     smstabletmp<-smstable

     print("Extracting Ham and Spam Basic Statistics!")

     # recode the class labels: ham = 1, spam = 0
     smstabletmp$type[smstabletmp$type=="ham"] <- 1
     smstabletmp$type[smstabletmp$type=="spam"] <- 0

     # Convert character data into numeric
     tmp<-as.numeric(smstabletmp$type)

     # Basic statistics like mean and variance of spam and hams
     hamavg<-mean(tmp)
     print("Average Ham is :");hamavg

     hamvariance<-var(tmp)
     print("Var of Ham is :");hamvariance

     print("Extract average token of Hams and Spams!")

     nohamtokens<-0
     noham<-0
     nospamtokens<-0
     nospam<-0
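Because ham is coded as 1 and spam as 0, the mean of the recoded column is simply the fraction of ham messages p, and `var()` returns the sample variance n/(n-1) · p(1-p). The sketch below checks this on made-up labels; the toy data and variable names are assumptions for illustration only.

```r
# Toy labels (not from the SMS data set)
labels <- c("ham", "ham", "spam", "ham")

# 1 for ham, 0 for spam, as in the slide
ind <- as.numeric(labels == "ham")

n <- length(ind)
p <- mean(ind)          # fraction of ham messages
sample_var <- var(ind)  # sample variance (divides by n - 1)

# For a 0/1 indicator the sample variance equals n/(n-1) * p * (1 - p)
closed_form <- n / (n - 1) * p * (1 - p)
print(c(p = p, var = sample_var, closed_form = closed_form))
```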
  9. R Codes (Cont.)
     # count whitespace-separated tokens and messages per class
     for(i in 1:length(smstable$type)){
       if(smstable[i,1]=="ham"){
         nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
         noham<-noham+1
       }else{
         nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
         nospam<-nospam+1
       }
     }

     totaltokens<-nospamtokens+nohamtokens
     print("total number of tokens is:")
     print(totaltokens)

     avgtokenperham<-nohamtokens/noham
     print("Average number of tokens per ham message")
     print(avgtokenperham)

     avgtokenperspam<-nospamtokens/nospam
     print("Average number of tokens per spam message")
     print(avgtokenperspam)

     print(" Make two different sets, training data and test data!")
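The per-class token averages can be computed in a few vectorized calls instead of a loop. This is a sketch on a made-up three-message table (the messages and the name `toy_sms` are assumptions); it uses the same whitespace tokenization as the slide.

```r
# Toy stand-in for smstable with the same two columns
toy_sms <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at noon", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message
ntokens <- lengths(strsplit(toy_sms$sms, "\\s+"))

# Average token count per class
avg_tokens <- tapply(ntokens, toy_sms$type, mean)
print(avg_tokens)
```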
  10. R Codes (Cont.)
      # select the fraction of data that you want to use as training set
      trdatapercent<-0.3

      # training data set
      trdata=NULL

      # test data set
      tedata=NULL

      # assign each message at random to the training or test set
      for(i in 1:length(smstable$type)){
        if(runif(1)<trdatapercent){
          trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
        }
        else{
          tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
        }
      }

      print("Training data size is!")
      dim(trdata)

      print("Test data size is!")
      dim(tedata)
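Growing a matrix with `rbind` inside a loop copies the data on every iteration. The same random assignment rule (a uniform draw below the training fraction) can be applied to all rows at once. The sketch below uses a made-up ten-row table; `toy` and `in_train` are hypothetical names.

```r
set.seed(1245)

# Toy stand-in for smstable
toy <- data.frame(
  type = rep(c("ham", "spam"), times = c(8, 2)),
  sms  = paste("Message", 1:10),
  stringsAsFactors = FALSE
)

trdatapercent <- 0.3

# One uniform draw per row; TRUE means the row goes to the training set
in_train <- runif(nrow(toy)) < trdatapercent

train <- transform(toy[in_train, ], sms = tolower(sms))
test  <- transform(toy[!in_train, ], sms = tolower(sms))

print(c(train = nrow(train), test = nrow(test)))
```

Note that the training fraction is only met on average with this rule; the exact split sizes vary with the random draws.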
  11. R Codes (Cont.)
      # Text feature extraction using tm package

      trsmses<-Corpus(VectorSource(trdata[,2]))
      trsmses<-tm_map(trsmses, stripWhitespace)
      # recent tm versions need base functions wrapped in content_transformer()
      trsmses<-tm_map(trsmses, content_transformer(tolower))
      trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

      dtm <- DocumentTermMatrix(trsmses)

      # terms that occur at least 80 times in the training corpus
      highlyrepeatedwords<-findFreqTerms(dtm, 80)

      # These highly used words are used as an index to make VSM
      # (vector space model) for training data and test data

      # vectorized training data set
      vtrdata=NULL

      # vectorized test data set
      vtedata=NULL
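The idea behind selecting highly repeated words is a frequency threshold over the whole training corpus. The following base-R sketch illustrates that idea on three made-up messages; it is not a reimplementation of `findFreqTerms`, and tm's own tokenization and default filters differ, so counts on real data may not match exactly. The names `docs` and `min_freq` are assumptions.

```r
# Toy corpus (not from the SMS data set)
docs <- c("free prize call now", "call me later", "free free call")

# Pool all whitespace-separated tokens and count them
tokens <- unlist(strsplit(tolower(docs), "\\s+"))
freq <- table(tokens)

# Keep the words that occur at least min_freq times in total
min_freq <- 3
frequent <- names(freq)[freq >= min_freq]
print(frequent)
```

The slides use a threshold of 80 on the real training set; the threshold controls the length of every feature vector, since each retained word becomes one dimension.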
  12. R Codes (Cont.)
      # each row is the class label (ham = 1, spam = 0) followed by the word counts
      for(i in 1:length(trdata[,2])){
        if(trdata[i,1]=="ham"){
          vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
        }
        else{
          vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
        }
      }

      for(i in 1:length(tedata[,2])){
        if(tedata[i,1]=="ham"){
          vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
        }
        else{
          vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
        }
      }
  13. R Codes (Cont.)
      # Run different classification algorithms
      # different SVMs with different kernels
      # type is set explicitly: with a numeric y, svm() would otherwise fit a regression
      print("----------------------------------SVM-----------------------------------------")
      print("Linear Kernel")
      svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type="C-classification", kernel="linear")
      summary(svmlinmodel)
      predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
      # confusion table: rows are predictions, columns are true labels (0 = spam, 1 = ham)
      tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
      # fraction of correctly classified test messages
      precisionlin<-sum(diag(tablinear))/sum(tablinear)
      print("General Error using Linear SVM is (in percent):");(1-precisionlin)*100
      print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
      print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

      print("Polynomial Kernel")
      svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type="C-classification", kernel="polynomial", probability=FALSE)
      summary(svmpolymodel)
      predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
      tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

      print("Radial Kernel")
      svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type="C-classification", kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
      summary(svmradmodel)
      predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
      tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
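The three error figures printed for the linear kernel can be wrapped in a small helper and reused for any classifier. This is a sketch with a hypothetical function name, `error_rates`, and made-up predictions; it assumes the slide's coding of 1 for ham and 0 for spam.

```r
# Hypothetical helper: overall, ham and spam error rates in percent
error_rates <- function(pred, true) {
  lv <- c(0, 1)
  # fixing the levels keeps the table 2 x 2 even if a class is never predicted
  tab <- table(pred = factor(pred, levels = lv),
               true = factor(true, levels = lv))
  c(overall = 1 - sum(diag(tab)) / sum(tab),
    ham     = tab["0", "1"] / sum(tab[, "1"]),   # ham classified as spam
    spam    = tab["1", "0"] / sum(tab[, "0"])) * 100  # spam classified as ham
}

# Toy labels: four ham and two spam messages
true <- c(1, 1, 1, 1, 0, 0)
pred <- c(1, 1, 1, 0, 0, 1)

rates <- error_rates(pred, true)
print(round(rates, 2))
```

Reporting the ham and spam errors separately matters for a spam filter, because blocking a legitimate message is usually more costly than letting a spam message through.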
  14. R Codes (Cont.)
      print("----------------------------------KNN-----------------------------------------")
      # IBk uses RWeka's formula interface; the class column is a factor for classification
      data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1]))
      classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
      summary(classifier)
      evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1])))

      print("---------------------------------Adaboost-------------------------------------")
      adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
      summary(adaptiveboost)
      varplot(adaptiveboost)
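`IBk` is Weka's k-nearest-neighbour classifier. To show the underlying idea, here is a minimal base-R sketch that is not Weka's implementation: the function name `knn_predict`, the Euclidean distance, the plain majority vote, and the toy points are all assumptions made for illustration.

```r
# Hypothetical k-nearest-neighbour classifier with majority vote
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # labels of the k closest training points
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy two-dimensional feature vectors forming two well-separated groups
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0),
                 c(3, 3), c(3, 4), c(4, 3))
train_y <- c("ham", "ham", "ham", "spam", "spam", "spam")

near_spam <- knn_predict(train_x, train_y, c(3.2, 3.1))
near_ham  <- knn_predict(train_x, train_y, c(0.2, 0.1))
print(c(near_spam, near_ham))
```

In the spam filter, each row of `train_x` would be the word-count vector of one training message rather than a two-dimensional point.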
  15. Conclusions
      • In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.
      • I reviewed in detail one way to implement an SMS spam filter using ML techniques in R.
      • ML provides a strong framework for solving problems in the Big Data domain.