Behaviour Analysis of SVM Based Spam Filtering
Using Various parameter values and accuracy
comparison
Shashank Mishra
Computer Science and Engineering (3rd year student)
SRM University, Chennai, India
24shashankm@gmail.com
Dr D. Malathi
Professor, SRM University, Chennai, India
malathi.d@ktr.srmuniv.ac.in
Abstract
The increasing use of email has created a need for spam filters. Machine learning algorithms offer a promising way to classify email with a high success rate. In this paper we use an SVM classifier to classify emails and observe how training and test accuracy change with the parameter C. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A description of the algorithm is presented together with graphs comparing different values of C, leading to a conclusion about high bias and high variance.
Keywords
Spam, Email classification, Support Vector Machine, Kernel, Machine learning.
Introduction
“Spam” is a term for unsolicited bulk email messages. Spam encompasses everything from money scams, ads for products and services, drugs and stock market pump-and-dump schemes to pornographic content, malware and phishing. Rather than solving spam, companies are forced to develop better spam filters to block it. Services like Gmail, Outlook.com and Yahoo! Mail provide much better spam filters than they did a decade ago. It is impossible to fix spam without changing the way email works, so the problem will never be completely solved. Hence a spam filter trained by a machine learning algorithm is an effective way to reduce the problem of spam.
The two basic approaches used in spam filtering are machine learning and knowledge engineering. In the latter, emails are categorized based on a set of rules. These rules are created by an authority, e.g. a software company, or by the user. Since the rules need to be updated constantly, this approach fails to provide accurate results and also consumes plenty of time. The machine learning approach is therefore better and more efficient because it does not require specifying any rules. It uses a training dataset of pre-classified emails, and a specific algorithm such as Support Vector Machine [1], [6], [12], Naïve Bayes [2] or Neural Networks is then used to classify email as spam or ham. Experiments on spam filtering data sets (TREC 2005 and TREC 2006) [10] show that SVM indeed gives excellent classification performance compared to other classifiers like Naïve Bayes or neural networks [4], [7], [8]. This technology of artificial intelligence has greatly reduced the burden of reading and manually classifying hundreds of emails individually, a task that consumes a lot of time.
Chengwang Xie, Lixin Ding and Xin Du [6] used linear, polynomial and RBF kernels to classify emails and found that SVM kernel techniques yield high accuracy, and are therefore considered among the best methods for classifying large datasets such as email. Mikko Siponen and Carl Stucke's study of effective anti-spam strategies presents the best methods for dealing with spam [9]. Ray Hunt and James Carpinter discuss past and future ideas as well as techniques for classifying spam [14].
Support Vector Machine
SVM is a supervised learning method that is mainly used for classification problems. Much of the theory behind SVM's strong classification properties is discussed by Durgesh K. Srivastava and Lekha Bhambhu [13]. In this algorithm, the data points, described by their features, are plotted as coordinates, and training finds a plane that separates the two classes. Qiu Shubo, Gu Shuai and Zhang Tongxing [15] use SVM for defect recognition instead of neural networks or other algorithms, because SVM kernels play a major role in producing accurate results for classification and defect detection. The SVM training algorithm builds a model that forms a hyperplane to differentiate the two classes. A hyperplane is used because the training examples should be divided by a clear gap, so it is essential to find a suitable plane by choosing a proper value of the parameter C.
Fig.1 Hyper plane classification
Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication
(ICCMC)
978-1-5090-4890-8/17/$31.00 ©2017 IEEE 27
SVM is one of the best algorithms for text analysis and prediction. It is generally regarded as a large-margin classifier, which makes it well suited to spam classification, where a huge number of emails is needed for training.
Hypothesis used
(1)
Decision boundary
$y = 0$ if $\theta^{T} x \le -1$
$y = 1$ if $\theta^{T} x \ge 1$
Here $\theta$ is the parameter (weight) vector of the hypothesis.
$x$ is an example from the training dataset $X$, which is stored as an array.
Fig. 2 decision plane
For kernels and prediction we use the following.
Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, \ldots, l^{(i)}$:
(2)
where $f$ is the similarity function.
Table 1 Effect of x on f
x                         f
x = l                     1
x is very far from l      0
Hence the prediction with a kernel is:
Predict = 1 if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \ldots \ge 0$
Predict = 0 otherwise
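The kernel features and the prediction rule can be sketched in a few lines. This is a minimal illustration in Python rather than the MATLAB used in the paper, and it assumes the standard Gaussian form exp(-||x - l||^2 / (2σ^2)), which agrees with Table 1 (f = 1 when x = l, f close to 0 when x is far from l). The function names `gaussian_kernel`, `kernel_features` and `predict` are hypothetical.

```python
import math


def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l (1 when equal, near 0 when far)."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """Map x to the feature vector [f1, f2, ...], one entry per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]


def predict(theta, x, landmarks, sigma):
    """Predict 1 if theta0 + theta1*f1 + theta2*f2 + ... >= 0, otherwise 0."""
    f = kernel_features(x, landmarks, sigma)
    score = theta[0] + sum(t * fi for t, fi in zip(theta[1:], f))
    return 1 if score >= 0 else 0
```

Here `theta[0]` plays the role of the intercept term, and `sigma` controls how quickly the similarity falls off with distance from a landmark.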
Tool used for Classification
-Matlab
Algorithm
Input:
Dataset is loaded.
Parameter is chosen, i.e. the value of C.
Model is created using the Support Vector Machine train function with the chosen kernel, i.e. linear or RBF:
    function [model] = svmTrain(X, Y, C, kernelFunction, ...
                                tol, max_passes)
Output:
The predict function returns the predicted labels, from which the accuracy is computed:
    p = svmPredict(model, X);
    accuracy = mean(double(p == y)) * 100;  % accuracy in percent
The dataset is trained with a Support Vector Machine, which then classifies any email based on the vocabulary set provided to it.
Algorithm in detail
Step 1: Pre-processing of each email
• Lower-casing: The email is converted to lower case so that capitalization is ignored (e.g. MaiL is converted to mail).
• Stripping HTML: HTML tags are removed so that only the content remains and can be processed easily. This step is necessary because some emails come with HTML tags.
• Normalizing URLs: All URLs are replaced with a single text, e.g. “httpurl”.
• Normalizing numbers: All numbers are replaced by the text “numbers”.
• Normalizing email addresses: All email addresses are replaced by a single text, e.g. “emailtext”.
• Normalizing dollar signs: All dollar symbols ($) are replaced by the text “dollars”.
• Word stemming: Words are reduced to their stem. For example, “discount”, “discounts” and “discounted” are all replaced by “discount”. The stemmer sometimes strips additional characters from the end, e.g. “include”, “includes” and “included” become “includ”.
• Elimination of non-words: White space such as tabs, newlines and spaces is trimmed to a single space character. Punctuation and non-words are eliminated.
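The pre-processing steps above can be sketched with regular expressions. This is a minimal Python illustration, not the paper's MATLAB code: the function name `preprocess_email` and the exact patterns are assumptions, the replacement tokens are the ones named in the list, and word stemming is left out because it needs a separate stemmer (for example a Porter stemmer).

```python
import re


def preprocess_email(text):
    """Return the list of normalized word tokens of one email (no stemming)."""
    text = text.lower()                                  # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+", " httpurl ", text)    # normalize URLs
    text = re.sub(r"\S+@\S+", " emailtext ", text)       # normalize email addresses
    text = re.sub(r"\d+", " numbers ", text)             # normalize numbers
    text = re.sub(r"\$+", " dollars ", text)             # normalize dollar signs
    text = re.sub(r"[^a-z]+", " ", text)                 # drop punctuation and non-words
    return text.split()                                  # split on single spaces
```

URLs and email addresses are replaced before numbers so that digits inside them do not produce extra “numbers” tokens. The replacements are padded with spaces so that adjacent symbols such as “$100” yield two separate tokens.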
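In the hypothesis (1), C is a parameter that controls the A part, i.e. how much we want to minimize A versus how much we care about B, and $\theta$ is the weight (parameter) vector. The similarity function is the Gaussian kernel:
$f_i = \mathrm{Kernel}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^{2}}{2\sigma^{2}}\right)$ (3)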
Step 2: Feature Extraction
This process converts each email into a vector in $\mathbb{R}^{n}$. Each feature $x_i \in \{0, 1\}$ indicates whether the i-th word in the dictionary occurs in the email: $x_i = 1$ if the word occurs and $x_i = 0$ if it does not.
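As a small illustration of this mapping, the following Python sketch builds the binary feature vector for one email. The function name `email_to_features` and the tiny vocabulary in the usage note are hypothetical. The real dictionary is much larger.

```python
def email_to_features(tokens, vocabulary):
    """Return x with x[i] = 1 if the i-th vocabulary word occurs in the email, else 0."""
    present = set(tokens)  # repeated words count only once
    return [1 if word in present else 0 for word in vocabulary]
```

For example, with the vocabulary `["now", "cheap", "buy"]`, an email whose tokens are `["buy", "now", "buy"]` is mapped to a vector with ones in the first and third positions and a zero in the second.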
Fig 3. Pre-processed training dataset that will be used to train an SVM classifier. Each original email was processed and converted into a vector $x^{(i)} \in \mathbb{R}^{1899}$.
The dataset is collected from the Spam Assassin Public Corpus and extracted. Each file is then processed and its features are extracted, i.e. it is converted into vector form. This allows us to build a dataset (X, y) of examples. The dataset is then divided into a training set and a test set in the ratio 80:20.
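An 80:20 division of this kind can be sketched as follows. The paper does not say whether the examples were shuffled before splitting, so the seeded shuffle here is an assumption, as is the function name `train_test_split`.

```python
import random


def train_test_split(examples, labels, train_fraction=0.8, seed=0):
    """Shuffle the examples reproducibly, then cut them into training and test parts."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)              # assumed: shuffle before splitting
    cut = int(round(train_fraction * len(indices)))   # e.g. 4000 of 5000 examples
    train_idx, test_idx = indices[:cut], indices[cut:]
    X_train = [examples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [examples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test
```

With 5000 examples and the default fraction, this gives 4000 training and 1000 test examples, and each example keeps its own label.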
Model of Spam Classifier
The following diagram illustrates the approach used for spam classification.
Fig 4
Data collected from: Spam Assassin Public Corpus[3]
Table 2 dataset division
Experimental Analysis
After loading the dataset, the SVM classifies emails as spam (y = 1) or non-spam (y = 0). Training is repeated with different values of the parameter C, and the resulting accuracy on the training set and the test set is recorded:
d = Train accuracy - Test accuracy
Here d is the difference used to choose a value of C that works well on both the training and the test set. If C is very large, d is large: the algorithm works well on the training set but may not work as well on the test set because of overfitting. If C is very small or negative, d is again large, this time because of underfitting. Somewhere in between, as shown in the graphs, there is an optimal value of C for which d is least, which shows that the algorithm works well on both the training and the test set, so there is neither a high-bias nor a high-variance problem.
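The experiment loop can be sketched as below. This is not the paper's MATLAB svmTrain: it is a small pure-Python linear SVM trained by subgradient descent on the objective C * (sum of hinge losses) + 0.5 * ||w||^2, used only as a stand-in. The names `train_linear_svm`, `predict`, `accuracy` and `sweep_C`, the learning rate and the number of epochs are all assumptions.

```python
def train_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Subgradient descent on C * sum(hinge loss) + 0.5 * ||w||^2 (labels are 0/1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)                      # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            s = 1.0 if yi == 1 else -1.0      # map labels {0, 1} to {-1, +1}
            margin = s * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                  # example violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * s * xj
                grad_b -= C * s
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, X):
    """Predict 1 when w.x + b >= 0, otherwise 0."""
    w, b = model
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0 for xi in X]


def accuracy(p, y):
    """Percentage of predictions that match the labels."""
    return 100.0 * sum(pi == yi for pi, yi in zip(p, y)) / len(y)


def sweep_C(X_train, y_train, X_test, y_test, C_values):
    """Return one row (C, train accuracy, test accuracy, d) per value of C."""
    rows = []
    for C in C_values:
        model = train_linear_svm(X_train, y_train, C)
        train_acc = accuracy(predict(model, X_train), y_train)
        test_acc = accuracy(predict(model, X_test), y_test)
        rows.append((C, train_acc, test_acc, train_acc - test_acc))
    return rows
```

In this sketch, C = 0 removes the misclassification penalty entirely, so the weights never leave zero and every example is predicted as class 1. That is an extreme case of underfitting.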
Table 3 Accuracy for different values of C
Value of C    Training accuracy (%)    Testing accuracy (%)    Difference (d)
-1            31.2                     30                      1.92
0             31.9                     30.8                    1.1
0.01          98.22                    98                      0.22
0.1           99.8                     98.9                    0.9
0.5           99.97                    97.6                    2.37
1             99.97                    97.7                    2.27
10            100                      97.5                    2.5
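The difference d can be recomputed from the accuracies reported in Table 3, and the value of C with the smallest d selected. The Python sketch below does this; the names `TABLE_3`, `accuracy_gaps` and `best_C` are hypothetical. Recomputing from the reported accuracies gives d = 1.2 for C = -1, whereas the table lists 1.92. The other rows agree with the table.

```python
# (C, training accuracy in %, testing accuracy in %) as reported in Table 3
TABLE_3 = [
    (-1, 31.2, 30.0),
    (0, 31.9, 30.8),
    (0.01, 98.22, 98.0),
    (0.1, 99.8, 98.9),
    (0.5, 99.97, 97.6),
    (1, 99.97, 97.7),
    (10, 100.0, 97.5),
]


def accuracy_gaps(rows):
    """Return (C, d) pairs with d = training accuracy - testing accuracy."""
    return [(c, round(train - test, 2)) for c, train, test in rows]


def best_C(rows):
    """Return the value of C whose accuracy gap d is smallest."""
    return min(accuracy_gaps(rows), key=lambda pair: pair[1])[0]
```

Applied to `TABLE_3`, `best_C` selects C = 0.01, the value with the smallest gap between training and testing accuracy.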
Observation
Table 4 Effect of C on bias and variance
C value              Problem          d value
C value very low     High bias        High d value
C value very high    High variance    High d value
Dataset                        Training samples    Testing samples    Total
Spam Assassin Public Corpus    4000                1000               5000
• When C is very large, the difference d is large, showing that an algorithm that works well on the training set may not work as well on the test set because of overfitting.
• When C is very small or negative, d is also large, this time because of underfitting. Somewhere in between, as shown in graph 1, there is an optimal value of C, here 0.01, for which d is least. This shows that the algorithm works well on both the training and the test set, and hence we have found the optimal value of C.
• Graph 2 is a magnified form of graph 1 that displays close values accurately.
• In graph 3, when C = 10 the training accuracy is 100% but the test accuracy is less than 100%, i.e. overfitting occurs. For the other positive values of C, the training accuracy is close to 100%, as shown in graph 3.
• In graph 4, the test accuracy is extremely good for values of C greater than 0 and less than 1.
Hence with these observations it is straightforward to reach a valid conclusion about the value of C. Below are graphs plotted to show how accuracy changes with the experimental values of C.
Graphs
Graph 1 d vs. C
Graph 2 d vs. C (magnified)
Graph 3 Train accuracy vs. C
Graph 4 Test accuracy vs. C
Result: From the accuracies above, we note that if the value of C is increased by a large amount, e.g. to 10, the model overfits: training accuracy is 100% and the model has high variance, so test accuracy is slightly lower at 97.5%. When the value of C is too low, the model tends to underfit and has high bias. It is therefore essential to choose the optimal value of C for the training algorithm in order to optimize the model.
Conclusion
A large number of emails is received daily, and many of them are spam. Over time it becomes difficult to handle such emails by hand, so a spam classifier based on a kernel model provides a good solution, tuned by changing the parameter C and observing the behaviour of accuracy through plotted graphs.
In future, this is a suitable method for finding the value of C that gives the best fit, thereby improving prediction so that the algorithm works well on both training and test data.
References
[1] Shrawan Kumar Trivedi and Shubhamoy Dey, “A Combining Classifiers Approach for Detecting Email Spams,” 2016 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA).
[2] Ion Androutsopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” 18 Sep 2000.
[3] http://www.csmining.org/index.php/spam-assassin-datasets.html, downloaded on 3-04-2017, file name “20021010_easy_ham.tar”.
[4] D. Sculley and G. Wachman, “Relaxed online SVMs for spam filtering,” in The Thirtieth Annual ACM SIGIR Conference Proceedings, 2007.
[5] Nitin Indurkhya and Fred J. Damerau, Handbook of Natural Language Processing, Second Edition.
[6] C. Xie, L. Ding and X. Du, “Anti-spam Filters Based on Support Vector Machines,” in Z. Cai, Z. Li, Z. Kang and Y. Liu (eds), Advances in Computation and Intelligence, ISICA 2009, Lecture Notes in Computer Science, vol. 5821, Springer, Berlin, Heidelberg, 2009.
[7] W.A. Awad and S.M. ELseuofi, “Machine Learning Methods for Spam E-mail Classification,” Vol. 3, No. 1, Feb 2011.
[8] R. Malarvizhi and K. Saraswathi, “Content-Based Spam Filtering and Detection Algorithms - An Efficient Analysis & Comparison,” International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013.
[9] Mikko Siponen and Carl Stucke, “Effective anti-spam strategies in companies: An international study,” in Proceedings of HICSS '06, vol. 6, 2006.
[10] D. Sculley and Gabriel M. Wachman, “Relaxed Online SVMs in the TREC Spam Filtering Track.”
[11] Trevor Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition.
[12] Sushama Chouhan, “Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations,” Vol. 2, Issue 9, September 2013.
[13] Durgesh K. Srivastava and Lekha Bhambhu, “Data Classification using support vector machine,” Journal of Theoretical and Applied Information Technology, vol. 12, No. 1.
[14] Ray Hunt and James Carpinter, “Current and New Developments in Spam Filtering,” 14th IEEE International Conference on Networks (ICON '06), 2006, February 2007.
[15] Qiu Shubo, Gu Shuai and Zhang Tongxing, “Paper Defects Recognition Based on SVM,” 2010 WASE International Conference, INSPEC Accession Number: 11529535, August 2010.
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

IEEE

  • 1. Behaviour Analysis of SVM Based Spam Filtering Using Various Parameter Values and Accuracy Comparison

Shashank Mishra (1)
Computer Science and Engineering (3rd year), student
SRM University, Chennai, India
24shashankm@gmail.com

Dr D. Malathi (2)
Professor
SRM University, Chennai, India
malathi.d@ktr.srmuniv.ac.in

Abstract
The increased use of email has created a need for spam filters. Machine learning algorithms offer a way to classify email with a high success rate. In this paper we use an SVM classifier to classify emails and observe how training and test accuracy behave as the parameter C changes. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A description of the algorithm is presented, with comparison graphs for different values of C, to reach a conclusion about high bias and high variance.

Keywords
Spam, Email classification, Support Vector Machine, Kernel, Machine learning.

Introduction
“Spam” is a term for unsolicited bulk email messages. Spam encompasses everything from money scams, ads for products and services, drugs, and stock market pump-and-dump schemes to pornographic content, malware, and phishing. Rather than solving spam, companies are forced to develop better spam filters to block it. Services like Gmail, Outlook.com, or Yahoo! Mail provide much better spam filters than they did a decade ago. It is impossible to fix spam without changing the way email works, so the problem will never be completely solved. Hence a spam filter trained by a machine learning algorithm is an effective way to reduce the problem of spam.

The two basic approaches used in spam filtering are machine learning and knowledge engineering. In the latter, emails are categorized based on a set of rules. These rules are written by an authority, e.g. a software company, or by the user. Since the rules need to be updated constantly, this approach fails to provide accurate results and also consumes plenty of time.

The machine learning approach is therefore better and more efficient because it does not require specifying any rules. It uses training datasets, which are samples of pre-classified email, and then a specific algorithm such as Support Vector Machine [1],[6],[12], Naïve Bayes [2], or Neural Networks is used to classify email as spam or ham. Experiments on spam filtering datasets (TREC 2005 and TREC 2006) [10] show that SVM gives excellent classification performance compared to other classifiers such as Naïve Bayes or neural networks [4],[7],[8]. This application of artificial intelligence has greatly reduced the burden of reading and manually classifying hundreds of emails individually, a task that consumes a lot of time.

Chengwang Xie, Lixin Ding, and Xin Du [6] used linear, polynomial, and RBF kernels to classify emails and found that SVM kernel techniques produce high accuracy; they are hence considered among the best methods for classifying large datasets such as email. Mikko Siponen and Carl Stucke's study of effective anti-spam strategies describes methods to deal with spam [9]. Ray Hunt and James Carpinter discuss past and future ideas as well as techniques to classify spam [14].

Support Vector Machine
SVM is a supervised learning method that is mainly used for classification problems. The theory supporting the classification power of SVM is discussed by Durgesh K. Srivastava and Lekha Bhambhu [13]. In this algorithm, the features of each data point are plotted as coordinates, and a plane that separates the two classes is found by training on the dataset.

Qiu Shubo, Gu Shuai, and Zhang Tongxing [15] use SVM for defect recognition instead of neural networks or any other algorithm, because for classification or defect detection SVM kernels play a major role in providing accurate results. The SVM training algorithm builds a model that forms a hyperplane to differentiate two classes. A hyperplane is used because the training examples are to be divided with a clear gap; hence it is essential to find a suitable plane by choosing a proper value of the parameter C.

Fig. 1 Hyperplane classification

Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication (ICCMC) 978-1-5090-4890-8/17/$31.00 ©2017 IEEE
  • 2. SVM is one of the best algorithms for text analysis and prediction. It is generally regarded as a large margin classifier and is hence considered well suited to spam classification, where a huge number of mails are required for training.

Hypothesis used (1)
where C is a parameter that controls the A part, i.e. how much we want to minimize A vs. how much we care about B, and θ is the weight or parameter.

Decision boundary:
y = 1 if $\theta^T x \geq 1$
y = 0 if $\theta^T x \leq -1$
Here θ is the parameter (weight) vector of the hypothesis and x is a training example in array form.

Fig. 2 Decision plane

For kernels and prediction we use the following. Given x, compute new features depending on proximity to the landmarks $l^{(1)}, l^{(2)}, \dots, l^{(i)}$: (2)
where f is the similarity function:

$f_i = \mathrm{Kernel}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$ (3)

Table 1 Effect of x on f
x                      f
x = l                  1
x is very far from l   0

Hence the prediction for the kernel is:
Predict 1 if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \dots \geq 0$
Predict 0 otherwise

Tool used for classification: Matlab

Algorithm
Input: The dataset is loaded. The parameter is chosen, i.e. the value of C. The model is created using the Support Vector Machine train function with either a linear or an RBF kernel.
    function [model] = svmTrain(X, Y, C, kernelFunction, … tol, max_passes)
Output: The predict function gives the predictions from which accuracy is computed.
    p = svmPredict(model, X);
    Accuracy (%) = mean(p == y) * 100
The dataset is trained on the Support Vector Machine, which then classifies any email based on the vocabulary set provided to it.

Algorithm in detail
Step 1: Pre-processing of each email
• Lower-casing: The email is converted to lower case so that capitalization is ignored (e.g. MaiL is converted to mail).
• Peeling HTML: HTML tags are stripped so that only the content remains and can be processed easily. Since some emails come with HTML tags, this is a necessary step.
• Normalizing URLs: All URLs are replaced with a single text, e.g. “httpurl”.
• Normalizing numbers: All numbers are replaced by the text “numbers”.
• Normalizing email addresses: All email addresses are substituted by a single text, e.g. “emailtext”.
• Normalizing dollar signs: All dollar symbols ($) are substituted by the text “dollars”.
• Word stemming: Words are reduced to their stem. For example, “discount”, “discounts”, and “discounted” are all replaced by “discount”. The stemmer sometimes peels additional characters from the end, e.g. “include”, “includes”, and “included” become “includ”.
• Elimination of non-words: White space such as tabs, newlines, and spaces is trimmed to a single space character. Punctuation and non-words are eliminated.
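The pre-processing steps listed above can be sketched in a few lines. This is a minimal Python illustration rather than the Matlab code used in the paper: the function name `preprocess_email` and the exact regular expressions are assumptions, the replacement tokens ("httpurl", "emailtext", "numbers", "dollars") are the ones named in the text, and word stemming is left out because it needs a separate stemmer.

```python
import re


def preprocess_email(text):
    """Normalize a raw email body into a list of lowercase word tokens.

    Hypothetical helper; stemming is intentionally not included.
    """
    text = text.lower()                                   # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                 # peel HTML tags
    text = re.sub(r"https?://\S+", " httpurl ", text)     # normalize URLs
    text = re.sub(r"\S+@\S+", " emailtext ", text)        # normalize email addresses
    text = re.sub(r"[0-9]+", " numbers ", text)           # normalize numbers
    text = re.sub(r"[$]+", " dollars ", text)             # normalize dollar signs
    # Anything that is not a letter (white space, punctuation) separates tokens.
    tokens = re.split(r"[^a-z]+", text)
    return [t for t in tokens if t]


if __name__ == "__main__":
    sample = "<p>Buy NOW for $100 at http://example.com/deal or mail Bob@Example.com!</p>"
    print(preprocess_email(sample))
```

URLs and email addresses are replaced before numbers so that digits inside them are not rewritten first; the padding spaces around each replacement keep adjacent tokens such as "$100" from fusing into one word.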
  • 3. Step 2: Feature Extraction
This process converts each email into a feature vector $x \in \mathbb{R}^n$. Each feature $x_i \in \{0, 1\}$ indicates whether the i-th word in the dictionary occurs in the email: $x_i = 1$ if it does and $x_i = 0$ if it does not.

Fig. 3 Pre-processed training dataset that will be used to train an SVM classifier. Each original email was processed and converted into a vector $x^{(i)} \in \mathbb{R}^{1899}$.

The dataset is collected from the Spam Assassin Public Corpus and extracted. After extraction, each file is processed and its features are extracted, i.e. it is converted into vector form. This allows us to build a dataset (X, y) of examples. The dataset is then divided into a training set and a test set in the ratio 80:20.

Model of Spam Classifier
The following diagram illustrates the approach used for spam classification.

Fig. 4 Data collected from: Spam Assassin Public Corpus [3]

Table 2 Dataset division
Dataset                        Training samples   Testing samples   Total
Spam Assassin Public Corpus    4000               1000              5000

Experimental Analysis
After loading the dataset, the SVM classifies emails as spam (y = 1) or non-spam (y = 0). Once training completes, we vary the value of C and note the change in accuracy on the training and test datasets:

d = Train accuracy - Test accuracy

Here d is the difference used to choose a better value of C for both the training and the testing dataset. If C is very large, the difference is large, showing that even if the algorithm works well on the training set it may not work on the test set because of overfitting. Similarly, if C is very small or negative, the difference is also large because of underfitting. Somewhere in between, as shown in the graphs, there is an optimal value of C for which d is least. At that value the algorithm works well for the training set as well as the test set, and there is neither a high bias problem nor a high variance problem.

Table 3 Accuracy for different values of C
Value of C   Training accuracy   Testing accuracy   Difference (d)
-1           31.2%               30%                1.92
0            31.9%               30.8%              1.1
0.01         98.22%              98%                0.22
0.1          99.8%               98.9%              0.9
0.5          99.97%              97.6%              2.37
1            99.97%              97.7%              2.27
10           100%                97.5%              2.5

Observation

Table 4 Effect of C on bias and variance
C value             Problem         d value
C value very low    High bias       High d value
C value very high   High variance   High d value
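The feature extraction step and the accuracy difference d described above can be sketched as follows. This is a small Python illustration under stated assumptions: `email_features`, `accuracy`, and the five-word vocabulary are hypothetical names and data chosen for the example (the paper's dictionary has 1899 words), and `accuracy` mirrors the paper's `mean(p == y) * 100` expression.

```python
def email_features(tokens, vocabulary):
    """Return a binary vector x where x[i] = 1 if vocabulary[i] occurs in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def accuracy(predicted, actual):
    """Percentage of predictions that equal the true labels."""
    matches = sum(1 for p, y in zip(predicted, actual) if p == y)
    return 100.0 * matches / len(actual)


def accuracy_gap(train_accuracy, test_accuracy):
    """d = train accuracy - test accuracy, both given in percent."""
    return train_accuracy - test_accuracy


if __name__ == "__main__":
    # Hypothetical toy vocabulary, for illustration only.
    vocabulary = ["buy", "discount", "meeting", "numbers", "httpurl"]
    tokens = ["buy", "numbers", "httpurl", "now"]
    print(email_features(tokens, vocabulary))

    predictions = [1, 0, 1, 1]
    labels = [1, 0, 0, 1]
    print(accuracy(predictions, labels))
```

Words that are not in the vocabulary (here "now") simply do not contribute to the vector, which is why the vector length always equals the dictionary size.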
  • 4. When C is very large, the difference is large, showing that even if the algorithm works well on the training set it may not work on the test set because of overfitting. When C is very small or negative we observe the same, i.e. the difference is large, because of underfitting. Somewhere in between, as shown in Graph 1, there is an optimal value of C, which is 0.01, for which d is least. This shows that the algorithm works well for the training set as well as the test set, and hence we have found the optimal value of C.

Graph 2 is a magnified form of Graph 1 that displays close values accurately. In Graph 3, when C = 10 the training accuracy is 100% but the test accuracy is less than 100%, i.e. overfitting occurs. For the other values of C the training accuracy is close to 100%, as shown in Graph 3. In Graph 4 the test accuracy is extremely good for values of C greater than 0 and less than 1. With these observations it is straightforward to reach a conclusion about the value of C. The graphs below display how accuracy changes with the experimental values of C.

Graphs
Graph 1 d vs. C
Graph 2 d vs. C (magnified)
Graph 3 Train accuracy vs. C
Graph 4 Test accuracy vs. C

Result
From the accuracies above we note that if the value of C is increased by a large amount, i.e. to 10, the model overfits: the training accuracy is 100%, so the model has high variance, and when tested it gives a somewhat lower accuracy of 97.5%. When the value of C is too low, the model tends to underfit and hence has high bias. It is therefore essential to choose the optimal value of C for the training algorithm in order to optimize the model.

Conclusion
A large number of emails are received daily, and many of them are spam. After a certain amount of time it becomes difficult to handle such emails. A spam classifier using a kernel model provides a good solution when the parameter C is varied and the behaviour of the accuracy is noted by plotting graphs. In future, this is an apt method for finding the value of C that gives the best fit, which increases the probability of correct prediction and makes the algorithm work well on both training and testing data.
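The selection rule described above (pick the value of C with the smallest difference d) can be sketched with the accuracies reported in Table 3. This is an illustrative Python snippet, not the paper's Matlab workflow: the names `TABLE_3` and `best_c` are assumptions, and d is recomputed here as training accuracy minus test accuracy rather than copied from the table's last column.

```python
# (C, training accuracy %, testing accuracy %) as reported in Table 3.
TABLE_3 = [
    (-1, 31.2, 30.0),
    (0, 31.9, 30.8),
    (0.01, 98.22, 98.0),
    (0.1, 99.8, 98.9),
    (0.5, 99.97, 97.6),
    (1, 99.97, 97.7),
    (10, 100.0, 97.5),
]


def best_c(rows):
    """Return the C whose train/test accuracy gap d is smallest."""
    c, _train, _test = min(rows, key=lambda row: row[1] - row[2])
    return c


if __name__ == "__main__":
    for c, train, test in TABLE_3:
        print(f"C={c}: d={train - test:.2f}")
    print("Selected C:", best_c(TABLE_3))
```

Using d alone works for these values because the low-accuracy settings also have larger gaps; on other data it may be safer to check that the test accuracy at the chosen C is itself high.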
  • 5. References
[1] Shrawan Kumar Trivedi and Shubhamoy Dey, “A Combining Classifiers Approach for Detecting Email Spams,” 2016 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA).
[2] Ion Androutsopoulos, Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach, 18 Sep 2000.
[3] http://www.csmining.org/index.php/spam-assassin-datasets.html, downloaded on 3-04-2017, file name “20021010_easy_ham.tar”.
[4] D. Sculley and G. Wachman, Relaxed online SVMs for spam filtering. In The Thirtieth Annual ACM SIGIR Conference Proceedings, 2007.
[5] Nitin Indurkhya, Fred J. Damerau, Handbook of Natural Language Processing, Second Edition.
[6] Xie C., Ding L., Du X. (2009) Anti-spam Filters Based on Support Vector Machines. In: Cai Z., Li Z., Kang Z., Liu Y. (eds) Advances in Computation and Intelligence. ISICA 2009. Lecture Notes in Computer Science, vol. 5821. Springer, Berlin, Heidelberg.
[7] W.A. Awad and S.M. ELseuofi, Machine Learning Methods for Spam E-Mail Classification, Vol. 3, No 1, Feb 2011.
[8] R. Malarvizhi, K. Saraswathi, Content-Based Spam Filtering and Detection Algorithms - An Efficient Analysis & Comparison, International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013.
[9] Mikko Siponen and Carl Stucke, Effective anti-spam strategies in companies: An international study. In Proceedings of HICSS '06, vol 6, 2006.
[10] D. Sculley and Gabriel M. Wachman, Relaxed Online SVMs in the TREC Spam Filtering Track.
[11] Trevor Hastie, R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition.
[12] Sushama Chouhan, Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations, Vol. 2, Issue 9, September 2013.
[13] Durgesh K. Srivastava, Lekha Bhambhu, Data Classification using support vector machine, Journal of Theoretical and Applied Information Technology, vol 12, No 1.
[14] Ray Hunt, James Carpinter, Current and New Developments in Spam Filtering, Networks, 2006. ICON '06. 14th IEEE International Conference on, February 2007.
[15] Qiu Shubo, Gu Shuai, Zhang Tongxing, Paper Defects Recognition Based on SVM, 2010 WASE International Conference, INSPEC Accession Number: 11529535, August 2010.