SMS Spam Filter Design Using R: A Machine Learning Approach
Transcript

  • 1. SMS Spam Filter Design Using R: A Machine Learning Approach
    Reza Rahimi, Ph.D. Candidate, School of Information and Computer Science, University of California, Irvine.
  • 2. Introduction
    • In basic terms, machine learning (ML) is about constructing systems that can learn from data.
    • It is used as a tool for knowledge discovery.
    • Several important classes of problems can be solved with machine learning techniques:
      – Classification (prediction):
        • Given a collection of records as a training set.
        • Each record contains a set of attributes, one of which is called the class.
        • The problem is to find a model for the class attribute as a function of the other attributes.
        – Example: spam or ham, handwriting recognition,…
  • 3.
      – Clustering (description):
        • Given a set of data points with some attributes, and a similarity measure (metric) among them.
        • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
        – Example: document clustering, people categorization,…
      – Association (description):
        • Given a set of records, each containing some items from a given collection.
        • The goal is to produce dependency rules that predict the occurrence of an item from the occurrences of other items.
        – Example: user habit pattern recognition,…
      – Regression (prediction):
        • Predict the value of a continuous variable from the values of other variables.
        • The dependency model can be linear or nonlinear.
        – Example: stock prediction
  • 4. Problem Solving Using Machine Learning Framework
    • ML is a mature and well-developed area.
    • For each of the problem classes mentioned, it offers a rich set of tools, techniques and algorithms.
    • These tools are available in different languages and frameworks such as R, Matlab, Java, C++, Mahout,…
    • The following procedure can be considered the general methodology for problem solving in this framework:
  • 5.
    • Get a sense of data: feature extraction, dimension reduction, noise cancellation,…
    • Problem modeling: classification, clustering, association, regression,…
    • Run standard ML algorithms: check the errors according to the standard ML metrics.
    • Select the methods that satisfy your performance criteria and metrics.
    • In the next section I describe the design of an SMS spam filter in the R language based on this methodology.
  • 6. SMS Spam Filter using R
      # This file contains the SMS spam filter code with different classifiers in R
      # Written by: Reza Rahimi
      # Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
      # loading required packages, libraries and function declaration

      # required package for text mining
      if(!require("tm"))
        install.packages("tm")

      # required package for SVM
      if(!require("e1071"))
        install.packages("e1071")

      # required package for KNN
      if(!require("RWeka"))
        install.packages("RWeka", dependencies = TRUE)

      # required package for Adaboost
      if(!require("ada"))
        install.packages("ada")

      library("tm")
      library("e1071")
      library("RWeka")
      library("ada")
  • 7. R Codes (Cont.)
      # Initialize random generator
      set.seed(1245)

      # This function makes a vector (vector space model) from a text message
      # by counting how often each highly repeated word occurs in it
      vsm<-function(message,highlyrepeatedwords){

        # split the message on runs of whitespace
        tokenizedmessage<-strsplit(message, "\\s+")[[1]]

        # making vector
        v<-rep(0, length(highlyrepeatedwords))
        for(i in seq_along(highlyrepeatedwords)){
          for(j in seq_along(tokenizedmessage)){
            if(highlyrepeatedwords[i]==tokenizedmessage[j]){
              v[i]<-v[i]+1
            }
          }
        }
        return (v)
      }

      # loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
      # (tab-separated file with a header row naming the columns "type" and "sms")
      print("Uploading SMS Spams and Hams!")
      smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", colClasses=c("type"="character","sms"="character"))
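The counting done by `vsm` on slide 7 can be illustrated on a toy message. The sketch below is a vectorized equivalent written for illustration only; the name `vsm_counts`, the vocabulary and the sample message are assumptions, not part of the original slides.

```r
# Vectorized sketch of the vector space model used on slide 7.
# vsm_counts is a hypothetical name; it counts how many times each
# vocabulary word appears among the whitespace-separated tokens.
vsm_counts <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

# Toy vocabulary and message (made up for this example)
vocab <- c("free", "call", "now")
v <- vsm_counts("call now to claim your free free prize", vocab)
print(v)
```

Each position of the result corresponds to one vocabulary word, so messages of any length map to vectors of the same length.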
  • 8. R Codes (Cont.)
      smstabletmp<-smstable

      print("Extracting Ham and Spam Basic Statistics!")

      # encode ham as 1 and spam as 0
      smstabletmp$type[smstabletmp$type=="ham"] <- 1
      smstabletmp$type[smstabletmp$type=="spam"] <- 0

      # Convert character data into numeric
      tmp<-as.numeric(smstabletmp$type)

      # Basic statistics: mean and variance of the ham indicator
      hamavg<-mean(tmp)
      print("Average Ham is :");hamavg

      hamvariance<-var(tmp)
      print("Var of Ham is :");hamvariance

      print("Extract average token of Hams and Spams!")

      nohamtokens<-0
      noham<-0
      nospamtokens<-0
      nospam<-0
  • 9. R Codes (Cont.)
      # count messages and whitespace-separated tokens per class
      for(i in 1:length(smstable$type)){
        if(smstable[i,1]=="ham"){
          nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
          noham<-noham+1
        }else{
          nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
          nospam<-nospam+1
        }
      }

      totaltokens<-nospamtokens+nohamtokens
      print("total number of tokens is:")
      print(totaltokens)

      avgtokenperham<-nohamtokens/noham
      print("Average number of tokens per ham message")
      print(avgtokenperham)

      avgtokenperspam<-nospamtokens/nospam
      print("Average number of tokens per spam message")
      print(avgtokenperspam)

      print(" Make two different sets, training data and test data!")
  • 10. R Codes (Cont.)
      # select the percent of data that you want to use as training set
      trdatapercent<-0.3

      # training data set
      trdata=NULL

      # test data set
      tedata=NULL

      # each message goes to the training set with probability trdatapercent
      for(i in 1:length(smstable$type)){
        if(runif(1)<trdatapercent){
          trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
        }
        else{
          tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
        }
      }

      print("Training data size is!")
      dim(trdata)

      print("Test data size is!")
      dim(tedata)
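The random split on slide 10 can also be written without a loop. This is a minimal sketch on made-up data: the data frame `sms` is a hypothetical stand-in for `smstable`, and `intrain` is a name introduced here. As in the slide, each row is assigned to the training set independently with probability `trdatapercent`, so the training set holds about 30% of the rows rather than exactly 30%.

```r
set.seed(1245)

# Toy stand-in for smstable (hypothetical data, same column names)
sms <- data.frame(type = rep(c("ham", "spam"), times = c(8, 2)),
                  sms  = paste("message", 1:10),
                  stringsAsFactors = FALSE)

trdatapercent <- 0.3

# One uniform draw per row; TRUE means the row goes to the training set
intrain <- runif(nrow(sms)) < trdatapercent

trdata <- sms[intrain, ]
tedata <- sms[!intrain, ]
tedata$sms <- tolower(tedata$sms)
trdata$sms <- tolower(trdata$sms)

print(dim(trdata))
print(dim(tedata))
```

Building the logical mask once avoids the repeated `rbind` calls, which copy the growing matrix on every iteration.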
  • 11. R Codes (Cont.)
      # Text feature extraction using tm package

      trsmses<-Corpus(VectorSource(trdata[,2]))
      trsmses<-tm_map(trsmses, stripWhitespace)
      # recent tm versions require plain functions to be wrapped in content_transformer()
      trsmses<-tm_map(trsmses, content_transformer(tolower))
      trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

      dtm <- DocumentTermMatrix(trsmses)

      # terms that occur at least 80 times in the training corpus
      highlyrepeatedwords<-findFreqTerms(dtm, 80)

      # These highly used words are used as an index to make VSM
      # (vector space model) for training data and test data

      # vectorized training data set
      vtrdata=NULL

      # vectorized test data set
      vtedata=NULL
  • 12. R Codes (Cont.)
      # first column is the label (1 = ham, 0 = spam), the rest are word counts
      for(i in 1:length(trdata[,2])){
        if(trdata[i,1]=="ham"){
          vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
        }
        else{
          vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
        }
      }

      for(i in 1:length(tedata[,2])){
        if(tedata[i,1]=="ham"){
          vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
        }
        else{
          vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
        }
      }
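The idea behind selecting highly repeated words in slides 11 and 12 can be sketched in base R, without the tm package. This is an illustration under stated assumptions, not the slides' method: the messages, the threshold `minfreq` and the variable names are made up, and stop-word removal is omitted.

```r
# Toy training messages (hypothetical)
msgs <- c("free entry call now",
          "call me later",
          "free prize call now",
          "see you later")

# Tokenize all messages and count how often each word occurs overall
tokens <- unlist(strsplit(tolower(msgs), "\\s+"))
freq <- table(tokens)

# Keep the words that occur at least minfreq times
# (the slides use a threshold of 80 on the real corpus)
minfreq <- 2
frequent <- names(freq)[freq >= minfreq]
print(frequent)

# Count matrix: one row per message, one column per frequent word
counts <- t(sapply(strsplit(tolower(msgs), "\\s+"),
                   function(tk) sapply(frequent, function(w) sum(tk == w))))
print(counts)
```

The resulting matrix plays the same role as the word-count columns of `vtrdata` and `vtedata`.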
  • 13. R Codes (Cont.)
      # Run different classification algorithms
      # different SVMs with different kernels
      # type is set explicitly: with a numeric y, svm() would otherwise fit a regression
      print("----------------------------------SVM-----------------------------------------")
      print("Linear Kernel")
      svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type="C-classification", kernel="linear")
      summary(svmlinmodel)
      predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
      # confusion table: rows are predictions, columns are true labels (0 = spam, 1 = ham)
      tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
      # fraction of correctly classified messages
      precisionlin<-sum(diag(tablinear))/sum(tablinear)
      print("General Error using Linear SVM is (in percent):");(1-precisionlin)*100
      print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
      print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

      print("Polynomial Kernel")
      svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type="C-classification", kernel="polynomial", probability=FALSE)
      summary(svmpolymodel)
      predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
      tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

      print("Radial Kernel")
      svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type="C-classification", kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
      summary(svmradmodel)
      predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
      tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
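The three error figures printed on slide 13 can be checked by hand on a small confusion table. The sketch below uses made-up prediction and label vectors (1 = ham, 0 = spam) and indexes the table by label name instead of by position; the variable names are assumptions introduced for this example.

```r
# Hypothetical predictions and true labels (1 = ham, 0 = spam)
pred <- c(1, 1, 0, 1, 0, 0, 1, 1)
true <- c(1, 1, 1, 1, 0, 0, 0, 1)

# Rows are predictions, columns are true labels
tab <- table(pred = pred, true = true)
print(tab)

# Overall error: share of messages off the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
general_error <- (1 - accuracy) * 100

# Ham error: hams predicted as spam, relative to all hams
ham_error <- tab["0", "1"] / sum(tab[, "1"]) * 100

# Spam error: spams predicted as ham, relative to all spams
spam_error <- tab["1", "0"] / sum(tab[, "0"]) * 100

print(c(general = general_error, ham = ham_error, spam = spam_error))
```

Here 6 of 8 messages are classified correctly, 1 of 5 hams is flagged as spam and 1 of 3 spams gets through. Note that positional indexing such as `tab[1,2]` only works when both classes appear among the predictions; otherwise the table has fewer rows.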
  • 14. R Codes (Cont.)
      print("----------------------------------KNN-----------------------------------------")
      # IBk expects a formula and a data frame; the class column must be a factor
      data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
      classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
      summary(classifier)
      evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))

      print("---------------------------------Adaboost-------------------------------------")
      adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
      summary(adaptiveboost)
      varplot(adaptiveboost)
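To make the KNN step on slide 14 more concrete, here is a minimal base R sketch of nearest-neighbor classification by majority vote. It is not the RWeka `IBk` implementation used in the slides: it uses a fixed `k`, plain Euclidean distance, and does no cross-validation. The function name `knn_predict` and the toy count vectors are assumptions.

```r
# Hypothetical helper: classify one count vector by majority vote
# among its k nearest training rows (Euclidean distance).
knn_predict <- function(trainx, trainy, newx, k = 3) {
  # squared differences per feature, summed per training row
  d <- sqrt(rowSums(sweep(trainx, 2, newx)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(trainy[nearest])
  # ties are resolved in favor of the first label in the table
  as.numeric(names(votes)[which.max(votes)])
}

# Toy word-count vectors: rows 1-3 are spam (0), rows 4-6 are ham (1)
trainx <- rbind(c(2, 0), c(3, 1), c(2, 1),
                c(0, 2), c(0, 3), c(1, 2))
trainy <- c(0, 0, 0, 1, 1, 1)

p1 <- knn_predict(trainx, trainy, c(3, 0), k = 3)
p2 <- knn_predict(trainx, trainy, c(0, 2), k = 3)
print(c(p1, p2))
```

With word-count features, messages that share the same frequent words end up close together, which is what lets a nearest-neighbor vote separate spam from ham.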
  • 15. Conclusions
    • In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.
    • I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.
    • ML provides a strong framework for solving problems in the Big Data domain.