SMS Spam Filter Design Using R: A Machine Learning Approach
1. SMS Spam Filter Design Using R:
A Machine Learning Approach
Reza Rahimi,
Ph.D. Candidate,
School of Information and Computer Science,
University of California, Irvine.
2. Introduction
• In basic terms, Machine Learning (ML) is about the
construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved
using machine learning techniques, such as:
– Classification (Prediction):
• Given a collection of records as a training set.
• Each record contains a set of attributes, one of which is
called the class.
• The problem is to find a model for the class attribute as a function
of the other attributes.
– Example: Spam or Ham, Handwriting Recognition,…
3. – Clustering (Description):
• Given a set of data points, with some attributes, and a similarity
measure (metric) among them.
• The goal is to find clusters such that data points in one cluster are
more similar to one another than to data points in other clusters.
– Example: Document Clustering, people categorization,…
– Association (Description):
• Given a set of records, each of which contains some items from a given
collection.
• The goal is to produce dependency rules which predict the
occurrence of an item based on occurrences of other items.
– Example: user habit pattern recognition,…
– Regression (Prediction):
• Predict the value of a given continuous variable based on the values
of other variables.
• The model of dependency can be linear or nonlinear.
– Example: Stock prediction
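As a minimal illustration of two of the problem classes above, the following sketch runs a clustering and a regression in R. It assumes R's built-in `iris` and `cars` data sets, which are not part of these slides and are used only because they ship with every R installation.

```r
# Illustrative sketch only: iris and cars are built-in R data sets,
# not the data used in these slides.

# Clustering (description): group flowers by their four numeric attributes.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)
print(table(clusters$cluster, iris$Species))

# Regression (prediction): model stopping distance as a linear function of speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))

# Predict the continuous target for a new speed value.
predicted <- predict(fit, newdata = data.frame(speed = 20))
print(predicted)
```

The clustering step never sees the `Species` column; the table only compares the discovered clusters with the known species afterwards.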
4. Problem Solving Using Machine
Learning Framework
• ML is a mature and well-developed area.
• For each of the problem classes mentioned above, it
offers a rich set of tools, techniques and
algorithms.
• These tools are available in different languages and
frameworks such as R, MATLAB, Java, C++, Mahout,…
• The following procedure can be considered a
general methodology for problem solving in this
framework:
5. • Get a sense of data: feature extraction, dimension reduction, noise cancellation,…
• Problem modeling: Classification, Clustering, Association, Regression,…
• Run standard ML algorithms: check the errors according to the standard ML metrics.
• Select the methods that satisfy your performance criteria and metrics.
• In the next section I describe the design
of an SMS spam filter in R, based
on the methodology above.
6. SMS Spam Filter using R
• #This file contains SMS spam filter code using different classifiers in R
• #Written by: Reza Rahimi
• #Initialization: Raw Data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
• #loading required packages, libraries and function declaration
• #required package for text mining
• if(!require("tm"))
• install.packages("tm")
•
• #required package for SVM
• if(!require("e1071"))
• install.packages("e1071")
•
• #required package for KNN
• if(!require("RWeka"))
• install.packages("RWeka", dependencies = TRUE)
•
• #required package for Adaboost
• if(!require("ada"))
• install.packages("ada")
• library("tm")
• library("e1071")
• library(RWeka)
• library("ada")
7. R Codes (Cont.)
• #Initialize random generator
• set.seed(1245)
•
• #This function makes vector (Vector Space Model) from text message using highly repeated words
• vsm<-function(message,highlyrepeatedwords){
•
• tokenizedmessage<-strsplit(message, "\\s+")[[1]]
•
• #making vector
• v<-rep(0, length(highlyrepeatedwords))
• for(i in seq_along(highlyrepeatedwords)){
• for(j in seq_along(tokenizedmessage)){
• if(highlyrepeatedwords[i]==tokenizedmessage[j]){
• v[i]<-v[i]+1
• }
• }
• }
• return (v)
• }
• #loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
• print("Uploading SMS Spams and Hams!")
• smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t",
colClasses=c("type"="character","sms"="character"))
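The vector space model built by the `vsm` function on this slide can be tried on a toy message. The sketch below is a compact variant written for illustration. The name `vsm_sketch` and the three-word vocabulary are hypothetical, and it assumes that tokens are separated by whitespace.

```r
# Sketch of the vector space model idea; vsm_sketch and the toy vocabulary
# are hypothetical names used only for this illustration.
vsm_sketch <- function(message, vocabulary) {
  # Split the message on runs of whitespace.
  tokens <- strsplit(message, "\\s+")[[1]]
  # Count how many times each vocabulary word occurs among the tokens.
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

vocabulary <- c("free", "call", "win")
v <- vsm_sketch("free prize call now free", vocabulary)
print(v)  # 2 1 0
```

Each position of the result corresponds to one vocabulary word, so every message maps to a vector of the same length regardless of how long the message is.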
8. R Codes (Cont.)
• smstabletmp<-smstable
•
• print("Extracting Ham and Spam Basic Statistics!")
•
• smstabletmp$type[smstabletmp$type=="ham"] <- 1
• smstabletmp$type[smstabletmp$type=="spam"] <- 0
•
• #Convert character data into numeric
• tmp<-as.numeric(smstabletmp$type)
•
• #Basic statistics like mean and variance of spam and hams
• hamavg<-mean(tmp)
• print("Average Ham is :");hamavg
•
• hamvariance<-var(tmp)
• print("Var of Ham is :");hamvariance
•
• print("Extract average token of Hams and Spams!")
•
• nohamtokens<-0
• noham<-0
• nospamtokens<-0
• nospam<-0
9. R Codes (Cont.)
• for(i in 1:length(smstable$type)){
• if(smstable[i,1]=="ham"){
• nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
• noham<-noham+1
• }else{
• nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
• nospam<-nospam+1
• }
• }
•
• totaltokens<-nospamtokens+nohamtokens;
• print("total number of tokens is:")
• print(totaltokens)
•
• avgtokenperham<-nohamtokens/noham
• print("Average number of tokens per ham message")
• print(avgtokenperham)
•
• avgtokenperspam<-nospamtokens/nospam
• print("Average number of tokens per spam message")
• print(avgtokenperspam)
•
• print(" Make two different sets, training data and test data!")
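The per-class token counts computed by the loop above can also be obtained without an explicit loop. The following is a minimal sketch. It assumes a small made-up data frame `sms` in place of `smstable`, with the same `type` and `sms` columns.

```r
# Toy stand-in for smstable; the messages below are made up for illustration.
sms <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at lunch", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message.
ntokens <- lengths(strsplit(sms$sms, "\\s+"))

# Total tokens and average tokens per message, grouped by class.
total_tokens <- sum(ntokens)
avg_tokens   <- tapply(ntokens, sms$type, mean)

print(total_tokens)
print(avg_tokens)
```

`tapply` groups the token counts by the `type` column, which replaces the separate ham and spam counters.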
10. R Codes (Cont.)
• #select the percent of data that you want to use as training set
• trdatapercent<-0.3
•
• #training data set
• trdata=NULL
•
• #test data set
• tedata=NULL
•
• for(i in 1:length(smstable$type)){
• if(runif(1)<trdatapercent){
• trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
• }
• else{
• tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
• }
• }
•
• print("Training data size is!")
• dim(trdata)
•
• print("Test data size is!")
• dim(tedata)
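The loop above assigns each message to the training set independently with probability `trdatapercent`, so the training set size varies from run to run. A minimal alternative sketch, assuming an exact split size is acceptable, draws the training indices with `sample`. The data frame `sms` is a made-up stand-in for `smstable`.

```r
# Toy stand-in for smstable; contents are made up for illustration.
sms <- data.frame(
  type = rep(c("ham", "spam"), times = c(8, 2)),
  sms  = paste("message", 1:10),
  stringsAsFactors = FALSE
)

set.seed(1245)
trdatapercent <- 0.3

# Draw the row indices of the training set without replacement.
n      <- nrow(sms)
tr_idx <- sample(n, size = round(trdatapercent * n))

# Rows in tr_idx form the training set; all remaining rows form the test set.
tr <- sms[tr_idx, ]
te <- sms[-tr_idx, ]

print(dim(tr))
print(dim(te))
```

Indexing once avoids growing the two tables row by row with `rbind`, which becomes slow for several thousand messages.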
11. R Codes (Cont.)
• # Text feature extraction using tm package
•
• trsmses<-Corpus(VectorSource(trdata[,2]))
• trsmses<-tm_map(trsmses, stripWhitespace)
• #content_transformer keeps the corpus structure intact in tm 0.6 and later
• trsmses<-tm_map(trsmses, content_transformer(tolower))
• trsmses<-tm_map(trsmses, removeWords, stopwords("english"))
•
• dtm <- DocumentTermMatrix(trsmses)
•
• highlyrepeatedwords<-findFreqTerms(dtm, 80)
•
• #These highly used words are used as an index to make VSM
• #(vector space model) for trained data and test data
•
• #vectorized training data set
• vtrdata=NULL
•
• #vectorized test data set
• vtedata=NULL
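One way the empty `vtrdata` and `vtedata` placeholders might be filled is to count the highly repeated words in every message and stack the counts into a table with the class label. The sketch below is an assumption about that step, not the code from the slides. `count_words`, the toy vocabulary and the toy messages are all hypothetical.

```r
# Hypothetical sketch: none of the names or data below come from the slides,
# apart from the idea of counting highly repeated words per message.
count_words <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

highlyrepeatedwords <- c("free", "call", "now")   # toy vocabulary
messages <- c("call me now", "free free prize", "see you later")
labels   <- c("ham", "spam", "ham")

# One row per message, one column per vocabulary word.
counts <- t(vapply(messages, count_words, numeric(length(highlyrepeatedwords)),
                   vocabulary = highlyrepeatedwords, USE.NAMES = FALSE))
colnames(counts) <- highlyrepeatedwords

# Attach the class label so the table can be passed to a classifier.
vdata <- data.frame(type = factor(labels), counts)
print(vdata)
```

A table in this shape, with a factor label and numeric feature columns, is the usual input for formula-based classifiers in R.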
15. Conclusions
• In these slides I gave a broad overview of ML and the different
problems that can be solved in this framework.
• I reviewed in detail one way to implement an SMS spam filter
using ML techniques in R.
• ML provides a strong framework for solving problems in the Big Data
domain.