SMS Spam Filter Design Using R: A Machine Learning Approach

Reza Rahimi,
Ph.D Candidate,
School of Information and Computer Science,
University of California, Irvine.

Introduction
• In basic terms, Machine Learning (ML) is about the
construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved
using machine learning techniques, such as:
– Classification (Prediction):
• Given a collection of records as a training set.
• Each record contains a set of attributes, one of which is
called the class.
• The problem is to find a model for the class attribute as a function
of the other attributes.
– Example: Spam or Ham, Handwriting Recognition,…

– Clustering (Description):
• Given a set of data points, with some attributes, and a similarity
measure (metric) among them.
• The goal is to find clusters such that data points in one cluster are
more similar to one another than to data points in other clusters.
– Example: Document Clustering, people categorization,…

– Association (Description):
• Given a set of records, each of which contains some items from a given
collection.
• The goal is to produce dependency rules that predict the
occurrence of an item based on occurrences of other items.
– Example: user habit pattern recognition,…

– Regression (Prediction):
• Predict the value of a given continuous variable based on the values
of other variables.
• The dependency model can be linear or nonlinear.
– Example: Stock prediction
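
As a minimal sketch of two of the problem classes above, the following base R snippet fits a linear regression and clusters a handful of points. The data are made up purely for illustration and do not come from the slides; `lm`, `hclust`, and `cutree` are standard functions from the `stats` package.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1                       # exact linear relationship

# Regression (prediction): model y as a function of x
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# Clustering (description): group points by Euclidean distance
pts <- matrix(c(0, 0,   0, 1,   1, 0,
                10, 10, 10, 11, 11, 10), ncol = 2, byrow = TRUE)
clusters <- cutree(hclust(dist(pts)), k = 2)
```

Here the regression recovers the slope used to generate `y`, and the clustering separates the three points near the origin from the three points near (10, 10).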

Problem Solving Using Machine Learning Framework
• ML is a mature and well-developed area.
• For each of the problem classes mentioned above, it
offers a rich set of tools, techniques and
algorithms.
• These tools are available in different languages and
frameworks such as R, Matlab, Java, C++, Mahout,…
• The following procedure can be considered the
general methodology for problem solving in this
framework:

– Get a sense of data: feature extraction, dimension reduction, noise cancellation,…
– Problem modeling: Classification, Clustering, Association, Regression,…
– Run standard ML algorithms: check the errors according to the standard ML metrics.
– Select the methods that satisfy your performance criteria and metrics.

• In the next section I will describe the design
of an SMS spam filter in the R language, based
on the methodology above.

SMS Spam Filter using R
# This file contains the SMS spam filter code, using different classifiers in R.
# Written by: Reza Rahimi
# Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
# loading required packages and libraries, and function declarations

# required package for text mining
if (!require("tm"))
  install.packages("tm")

# required package for SVM
if (!require("e1071"))
  install.packages("e1071")

# required package for KNN
if (!require("RWeka"))
  install.packages("RWeka", dependencies = TRUE)

# required package for AdaBoost
if (!require("ada"))
  install.packages("ada")

library("tm")
library("e1071")
library("RWeka")
library("ada")

R Codes (Cont.)
# Initialize the random number generator
set.seed(1245)

# This function builds a vector (Vector Space Model) from a text message
# by counting the occurrences of each highly repeated word
vsm <- function(message, highlyrepeatedwords) {

  # split the message on runs of whitespace
  tokenizedmessage <- strsplit(message, "\\s+")[[1]]

  # making vector
  v <- rep(0, length(highlyrepeatedwords))
  for (i in seq_along(highlyrepeatedwords)) {
    for (j in seq_along(tokenizedmessage)) {
      if (highlyrepeatedwords[i] == tokenizedmessage[j]) {
        v[i] <- v[i] + 1
      }
    }
  }
  return(v)
}

# Loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# The file is read as tab-separated, with a header row naming the columns "type" and "sms".
print("Loading SMS spams and hams!")
smstable <- read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t",
                     colClasses = c("type" = "character", "sms" = "character"))
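
The nested loops in `vsm` can also be written in vectorized form. The sketch below is one possible equivalent, not the author's code; the name `vsm_fast`, the vocabulary, and the sample message are hypothetical.

```r
# Count how often each vocabulary word occurs in a message
# (vectorized alternative to the nested loops in vsm)
vsm_fast <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

vocab <- c("free", "call", "now")            # hypothetical vocabulary
v <- vsm_fast("call now to claim your free free prize", vocab)
```

Each position of `v` holds the count of the corresponding word in `vocab`, so messages of any length map to vectors of the same dimension.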

R Codes (Cont.)
smstabletmp <- smstable

print("Extracting Ham and Spam Basic Statistics!")

# Encode the class labels: ham = 1, spam = 0
smstabletmp$type[smstabletmp$type == "ham"] <- 1
smstabletmp$type[smstabletmp$type == "spam"] <- 0

# Convert character data into numeric
tmp <- as.numeric(smstabletmp$type)

# Basic statistics: mean and variance of the ham indicator
# (the mean is the fraction of messages that are ham)
hamavg <- mean(tmp)
print("Average Ham is :"); hamavg

hamvariance <- var(tmp)
print("Var of Ham is :"); hamvariance

print("Extract average token of Hams and Spams!")

nohamtokens <- 0
noham <- 0
nospamtokens <- 0
nospam <- 0

R Codes (Cont.)
for (i in seq_along(smstable$type)) {
  if (smstable[i, 1] == "ham") {
    nohamtokens <- length(strsplit(smstable[i, 2], "\\s+")[[1]]) + nohamtokens
    noham <- noham + 1
  } else {
    nospamtokens <- length(strsplit(smstable[i, 2], "\\s+")[[1]]) + nospamtokens
    nospam <- nospam + 1
  }
}

totaltokens <- nospamtokens + nohamtokens
print("Total number of tokens is:")
print(totaltokens)

avgtokenperham <- nohamtokens / noham
print("Average number of tokens per ham message")
print(avgtokenperham)

avgtokenperspam <- nospamtokens / nospam
print("Average number of tokens per spam message")
print(avgtokenperspam)

print("Make two different sets, training data and test data!")
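
The per-class token statistics computed by the loop above can also be obtained without an explicit loop. The following is a minimal sketch on a made-up three-row table that only mimics the `type` and `sms` columns of `smstable`; the rows are hypothetical.

```r
# Hypothetical stand-in for smstable (same column names as in the slides)
sms <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at lunch", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message
ntokens <- lengths(strsplit(sms$sms, "\\s+"))

# Average tokens per class, and the overall total
avg_tokens   <- tapply(ntokens, sms$type, mean)
total_tokens <- sum(ntokens)
```

`avg_tokens` is indexed by class name, so `avg_tokens[["ham"]]` and `avg_tokens[["spam"]]` play the roles of `avgtokenperham` and `avgtokenperspam`.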

R Codes (Cont.)
# select the fraction of the data that you want to use as the training set
trdatapercent <- 0.3

# training data set
trdata <- NULL

# test data set
tedata <- NULL

# assign each message at random to the training or the test set
for (i in seq_along(smstable$type)) {
  if (runif(1) < trdatapercent) {
    trdata <- rbind(trdata, c(smstable[i, 1], tolower(smstable[i, 2])))
  } else {
    tedata <- rbind(tedata, c(smstable[i, 1], tolower(smstable[i, 2])))
  }
}

print("Training data size is:")
dim(trdata)

print("Test data size is:")
dim(tedata)
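
Growing `trdata` and `tedata` with `rbind` inside a loop copies the accumulated matrix on every iteration. One possible alternative, sketched here with a hypothetical row count `n` standing in for the size of `smstable`, is to draw all the random numbers at once and index by the result.

```r
set.seed(1245)
n <- 1000                         # hypothetical number of messages
trdatapercent <- 0.3

# One uniform draw per row, mirroring the per-row runif(1) test in the loop
in_train <- runif(n) < trdatapercent

train_idx <- which(in_train)
test_idx  <- which(!in_train)
# With a real table one could then write, for example:
#   trdata <- smstable[train_idx, ]
#   tedata <- smstable[test_idx, ]
```

As in the loop version, the training set size is random: each row lands in the training set with probability `trdatapercent`, so the split is only approximately 30/70.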

R Codes (Cont.)
# Text feature extraction using the tm package

trsmses <- Corpus(VectorSource(trdata[, 2]))
trsmses <- tm_map(trsmses, stripWhitespace)
# recent tm versions require base functions to be wrapped in content_transformer()
trsmses <- tm_map(trsmses, content_transformer(tolower))
trsmses <- tm_map(trsmses, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(trsmses)

# terms that occur at least 80 times in the training corpus
highlyrepeatedwords <- findFreqTerms(dtm, 80)

# These highly used words are used as an index to make the VSM
# (vector space model) for the training data and the test data

# vectorized training data set
vtrdata <- NULL

# vectorized test data set
vtedata <- NULL
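
To make the role of the frequent-term step concrete, here is a base R approximation that counts term occurrences across messages and keeps the terms at or above a threshold. It is a simplified stand-in and not the `tm` implementation (no stop-word removal, no document-term matrix); the messages and the threshold of 2 are made up for illustration.

```r
# Hypothetical training messages
msgs <- c("free prize call now", "call me later", "free entry call now")

# Tokenize on whitespace and count each term over the whole collection
tokens <- unlist(strsplit(tolower(msgs), "\\s+"))
term_counts <- table(tokens)

# Keep the terms that occur at least min_count times
min_count <- 2
frequent <- sort(names(term_counts)[as.vector(term_counts) >= min_count])
```

The resulting `frequent` vector plays the same role as `highlyrepeatedwords`: it fixes the vocabulary, and therefore the dimension, of every message vector.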

R Codes (Cont.)
# Each row is: class label (1 = ham, 0 = spam) followed by the VSM vector
for (i in seq_len(nrow(trdata))) {
  if (trdata[i, 1] == "ham") {
    vtrdata <- rbind(vtrdata, c(1, vsm(trdata[i, 2], highlyrepeatedwords)))
  } else {
    vtrdata <- rbind(vtrdata, c(0, vsm(trdata[i, 2], highlyrepeatedwords)))
  }
}

for (i in seq_len(nrow(tedata))) {
  if (tedata[i, 1] == "ham") {
    vtedata <- rbind(vtedata, c(1, vsm(tedata[i, 2], highlyrepeatedwords)))
  } else {
    vtedata <- rbind(vtedata, c(0, vsm(tedata[i, 2], highlyrepeatedwords)))
  }
}

R Codes (Cont.)
# Run different classification algorithms
# Different SVMs with different kernels
# Note: y is numeric, so type = 'C-classification' is set explicitly for every
# kernel; otherwise svm() would fit a regression model.
print("----------------------------------SVM-----------------------------------------")
print("Linear Kernel")
svmlinmodel <- svm(x = vtrdata[, 2:length(vtrdata[1, ])], y = vtrdata[, 1],
                   type = 'C-classification', kernel = 'linear')
summary(svmlinmodel)
predictionlin <- predict(svmlinmodel, vtedata[, 2:length(vtedata[1, ])])
tablinear <- table(pred = predictionlin, true = vtedata[, 1]); tablinear
# fraction of correctly classified messages (overall accuracy)
precisionlin <- sum(diag(tablinear)) / sum(tablinear)
print("General Error using Linear SVM is (in percent):"); (1 - precisionlin) * 100
print("Ham Error using Linear SVM is (in percent):"); (tablinear[1, 2] / sum(tablinear[, 2])) * 100
print("Spam Error using Linear SVM is (in percent):"); (tablinear[2, 1] / sum(tablinear[, 1])) * 100

print("Polynomial Kernel")
svmpolymodel <- svm(x = vtrdata[, 2:length(vtrdata[1, ])], y = vtrdata[, 1],
                    type = 'C-classification', kernel = 'polynomial', probability = FALSE)
summary(svmpolymodel)
predictionpoly <- predict(svmpolymodel, vtedata[, 2:length(vtedata[1, ])])
tabpoly <- table(pred = predictionpoly, true = vtedata[, 1]); tabpoly

print("Radial Kernel")
svmradmodel <- svm(x = vtrdata[, 2:length(vtrdata[1, ])], y = vtrdata[, 1],
                   type = 'C-classification', kernel = "radial", gamma = 0.09, cost = 1,
                   probability = FALSE)
summary(svmradmodel)
predictionrad <- predict(svmradmodel, vtedata[, 2:length(vtedata[1, ])])
tabrad <- table(pred = predictionrad, true = vtedata[, 1]); tabrad
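
The three error figures printed for the linear kernel can be wrapped in a small helper and reused for the other kernels. This is a sketch only: the function name `error_rates` and the label vectors are hypothetical, and it assumes the same encoding as the slides (1 = ham, 0 = spam).

```r
# Error rates (in percent) from predicted and true labels (1 = ham, 0 = spam)
error_rates <- function(pred, truth) {
  # fixed levels keep the table 2 x 2 even if one class is never predicted
  pred  <- factor(pred,  levels = c(0, 1))
  truth <- factor(truth, levels = c(0, 1))
  tab <- table(pred = pred, true = truth)
  list(
    table   = tab,
    general = 100 * (1 - sum(diag(tab)) / sum(tab)),
    ham     = 100 * tab["0", "1"] / sum(tab[, "1"]),   # ham predicted as spam
    spam    = 100 * tab["1", "0"] / sum(tab[, "0"])    # spam predicted as ham
  )
}

truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1)   # hypothetical true labels
pred  <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)   # hypothetical predictions
res <- error_rates(pred, truth)
```

For a spam filter the ham error is usually the more costly one, since it corresponds to legitimate messages being blocked, which is why it is worth reporting separately from the general error.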

R Codes (Cont.)
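print("----------------------------------KNN-----------------------------------------")
# IBk takes a formula and a data frame; the class is converted to a factor
# so that the model is fitted as a classifier
data <- data.frame(sms = vtrdata[, 2:length(vtrdata[1, ])],
                   type = factor(vtrdata[, 1], levels = c(0, 1)))
classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
summary(classifier)
evaluate_Weka_classifier(classifier,
                         newdata = data.frame(sms = vtedata[, 2:length(vtedata[1, ])],
                                              type = factor(vtedata[, 1], levels = c(0, 1))))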
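
To illustrate what the KNN step above does conceptually, here is a minimal nearest-neighbour classifier in base R. It is a sketch, not the Weka `IBk` implementation: it uses plain Euclidean distance and a majority vote with a fixed `k`, and does not select `k` by cross-validation. The function name `knn_predict` and the toy data are hypothetical.

```r
# Minimal k-nearest-neighbour classifier (Euclidean distance, majority vote)
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  apply(test_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))   # distance to every training row
    nearest <- order(d)[seq_len(k)]                # indices of the k closest rows
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]                 # ties go to the first label
  })
}

# Hypothetical two-feature training set with two well-separated classes
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- c("ham", "ham", "ham", "spam", "spam", "spam")
test_x  <- matrix(c(0.5, 0.5,  5.5, 5.5), ncol = 2, byrow = TRUE)

pred <- knn_predict(train_x, train_y, test_x, k = 3)
```

In the spam filter the rows would be the term-count vectors produced by `vsm`, so two messages are neighbours when they use the frequent words in similar proportions.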

print("---------------------------------Adaboost-------------------------------------")
adaptiveboost <- ada(x = vtrdata[, 2:length(vtrdata[1, ])], y = vtrdata[, 1],
                     test.x = vtedata[, 2:length(vtedata[1, ])], test.y = vtedata[, 1],
                     loss = "logistic", type = "gentle", iter = 100)
summary(adaptiveboost)
varplot(adaptiveboost)

Conclusions
• In these slides I gave a broad overview of ML and the different
problems that can be solved in this framework.
• I reviewed in detail one way of implementing an SMS spam filter
using ML techniques in the R language.
• ML provides a strong framework for solving problems in the Big Data
domain.
