IEEE

Behaviour Analysis of SVM Based Spam Filtering Using Various Parameter Values and Accuracy Comparison
Shashank Mishra
Student (3rd year), Computer Science and Engineering
SRM University, Chennai, India
24shashankm@gmail.com
Dr D. Malathi
Professor, SRM University, Chennai, India
malathi.d@ktr.srmuniv.ac.in

Abstract
The increasing use of email has created a need for spam filters. Machine learning algorithms offer a promising way to classify email with a high success rate. In this paper we use an SVM classifier to classify emails and note how training and test accuracy behave as the parameter C changes. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A description of the algorithm is presented together with comparison graphs for different values of C, from which we draw conclusions about high bias and high variance.
Keywords
Spam, Email classification, Support Vector Machine, Kernel, Machine learning.
Introduction
“Spam” is a term for unsolicited bulk email messages. It encompasses everything from money scams, advertisements for products and services, drugs and stock market pump-and-dump schemes to pornographic content, malware and phishing. Rather than solving spam, companies are forced to develop better spam filters to block it. Services such as Gmail, Outlook.com and Yahoo! Mail provide much better spam filters than they did a decade ago. It is impossible to fix spam without changing the way email works, so the problem will never be completely solved. Hence a spam filter trained by a machine learning algorithm is an effective way to reduce the problem of spam.
The two basic approaches used in spam filtering are machine learning and knowledge engineering. In knowledge engineering, emails are categorized based on a set of rules written by an authority, e.g. a software company, or by the user. Since the rules need to be updated constantly, this approach fails to provide accurate results and also consumes plenty of time. The machine learning approach is therefore better and more efficient because it does not require any rules to be specified. It uses a training dataset of pre-classified emails, and a specific algorithm such as Support Vector Machine [1], [6], [12], Naïve Bayes [2] or a neural network is then used to classify each email as spam or ham. Experiments on spam filtering datasets (TREC 2005 and TREC 2006) [10] show that SVM gives excellent classification performance compared with other classifiers such as Naïve Bayes or neural networks [4], [7], [8]. This artificial intelligence technology has greatly reduced the burden of reading and manually classifying hundreds of individual emails, which would otherwise consume a lot of time.
Chengwang Xie, Lixin Ding and Xin Du [6] used linear, polynomial and RBF kernels to classify emails and found that SVM kernel techniques give high accuracy; SVM is therefore considered one of the best methods for classifying large datasets such as email. Mikko Siponen and Carl Stucke present effective anti-spam strategies for dealing with spam [9]. Ray Hunt and James Carpinter discuss past and future ideas as well as techniques for classifying spam [14].
Support Vector Machine
SVM is a supervised learning method that is mainly used for classification problems. The theory behind SVM's strong classification properties is discussed by Durgesh K. Srivastava and Lekha Bhambhu [13]. In this algorithm, the features of each example in the dataset are plotted as the coordinates of a point, and a plane that separates the two classes is found by training on the dataset. Qiu Shubo, Gu Shuai and Zhang Tongxing [15] use SVM for defect recognition instead of neural networks or other algorithms because, for classification and defect detection, SVM kernels play a major role in providing accurate results. The SVM training algorithm builds a model that forms a hyperplane to differentiate the two classes. A hyperplane is used because the training examples are to be divided by a clear gap; it is therefore essential to find a suitable plane by choosing a proper value of the parameter C.
Fig.1 Hyper plane classification
Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication
(ICCMC)
978-1-5090-4890-8/17/$31.00 ©2017 IEEE 27

SVM is one of the best algorithms for text analysis and prediction. It is generally regarded as a large-margin classifier and is therefore well suited to spam classification, where a huge number of emails is required for training.
Hypothesis used
(1)
Decision boundary
$y = 0$ if $\theta^{T} x \le -1$
$y = 1$ if $\theta^{T} x \ge 1$
Here $\theta$ is the parameter (weight) vector of the hypothesis and $x$ is a training example in the form of an array.
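As a hedged sketch, one widely used soft-margin SVM objective that is consistent with the paper's description of C as trading off a part A against a part B is shown below. The symbols m (number of training examples), n (number of features), and cost_1 and cost_0 (hinge-style costs for positive and negative examples) are assumptions introduced for illustration and are not defined in the paper.

```latex
\min_{\theta}\; C \underbrace{\sum_{i=1}^{m}\Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^{T}x^{(i)}\big) + \big(1-y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^{T}x^{(i)}\big) \Big]}_{A} \;+\; \underbrace{\frac{1}{2}\sum_{j=1}^{n}\theta_j^{2}}_{B}
```

Under this form, a large C puts more weight on fitting the training examples (A), while a small C puts more weight on keeping the weights small (B).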
Fig. 2 decision plane
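For kernels and prediction we use the following. Given $x$, compute new features depending on its proximity to the landmarks $l^{(1)}, l^{(2)}, \dots, l^{(i)}$:
$f_i = \mathrm{similarity}(x, l^{(i)})$ (2)
where $f$ is the similarity function.
Table 1 Effects of x on f
x                        f
x = l                    1
x is very far from l     0
Hence the prediction for the kernel is:
Predict = 1 if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \dots \ge 0$
Predict = 0 otherwise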
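The kernel features and the prediction rule can be sketched in Python as follows (Python is used here for illustration; the paper's own experiments use Matlab). The sketch assumes the similarity function is the Gaussian (RBF) kernel; the landmark positions, weights and sigma are hypothetical values chosen only to show the behaviour described in Table 1.

```python
import math

def gaussian_kernel(x, landmark, sigma):
    """Similarity of x to a landmark: 1 when they coincide, near 0 when far apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """One feature f_i per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]

def predict(theta, x, landmarks, sigma):
    """theta[0] is the bias term; theta[1:] weight the kernel features."""
    f = kernel_features(x, landmarks, sigma)
    score = theta[0] + sum(t * fi for t, fi in zip(theta[1:], f))
    return 1 if score >= 0 else 0

# Hypothetical landmarks and weights, for illustration only.
landmarks = [[0.0, 0.0], [5.0, 5.0]]
theta = [-0.5, 1.0, 0.0]
near = predict(theta, [0.1, 0.0], landmarks, sigma=1.0)  # close to the first landmark
far = predict(theta, [5.0, 5.0], landmarks, sigma=1.0)   # far from the first landmark
```

With these illustrative weights, a point close to the first landmark is predicted as 1 and a point far from it as 0.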
Tool used for Classification
Matlab
Algorithm
Input:
• The dataset is loaded.
• The parameter, i.e. the value of C, is chosen.
• The model is created using the Support Vector Machine train function with the chosen kernel, i.e. linear or RBF:
    function [model] = svmTrain(X, Y, C, kernelFunction, ...
                                tol, max_passes)
Output:
• The predict function returns the predicted labels, from which the accuracy (in %) is computed:
    p = svmPredict(model, X);
    accuracy = mean(double(p == y)) * 100;
Training the Support Vector Machine on the dataset gives a model that classifies any email based on the vocabulary set provided to it.
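The accuracy computation above can be sketched in Python as follows. The function name and the example labels are hypothetical; only the formula, the percentage of predictions that match the true labels, comes from the paper.

```python
def accuracy_percent(predictions, labels):
    """Percentage of predictions that equal the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * matches / len(labels)

# Hypothetical predictions and true labels (1 = spam, 0 = non-spam).
p = [1, 0, 1, 1, 0]
y = [1, 0, 0, 1, 0]
acc = accuracy_percent(p, y)  # 4 of 5 predictions match
```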
Algorithm in detail
Step 1: Pre-processing of each email
• Lower-casing: The email is converted to lower case so that capitalization is ignored (e.g. MaiL is converted to mail).
• Stripping HTML: HTML tags are removed so that only the content remains and can be processed easily. This step is necessary because some emails come with HTML tags.
• Normalizing URLs: All URLs are replaced with a single text, e.g. “httpurl”.
• Normalizing numbers: All numbers are replaced by the text “numbers”.
• Normalizing email addresses: All email addresses are replaced by a single text, e.g. “emailtext”.
• Normalizing dollar signs: All dollar symbols ($) are replaced by the text “dollars”.
• Word stemming: Words are reduced to their stem. For example, “discount”, “discounts” and “discounted” are all replaced with “discount”. The stemmer sometimes strips additional characters from the end, e.g. “include”, “includes” and “included” all become “includ”.
• Elimination of non-words: White space such as tabs, newlines and spaces is trimmed to a single space character. Punctuation and non-words are eliminated.
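The pre-processing steps above can be sketched in Python with regular expressions. This is a minimal illustration rather than the authors' Matlab code: the function name and the exact patterns are assumptions, and word stemming is left out because the Python standard library has no stemmer.

```python
import re

def preprocess_email(text):
    """Return the list of normalized tokens for one raw email body."""
    text = text.lower()                                # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)              # strip HTML tags
    text = re.sub(r"https?://\S*", " httpurl ", text)  # normalize URLs
    text = re.sub(r"\S+@\S+", " emailtext ", text)     # normalize email addresses
    text = re.sub(r"[0-9]+", " numbers ", text)        # normalize numbers
    text = re.sub(r"[$]+", " dollars ", text)          # normalize dollar signs
    text = re.sub(r"[^a-z\s]", " ", text)              # drop punctuation and non-words
    return text.split()                                # collapse white space

tokens = preprocess_email(
    "<b>Visit</b> http://example.com NOW, pay $100 to Bob@Example.com!"
)
```

URLs and email addresses are replaced before numbers so that digits inside them are not rewritten separately.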
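In hypothesis (1), C is a parameter that controls the A part, i.e. how much we want to minimize A versus how much we care about B, and $\theta$ is the weight (parameter) vector. The similarity function is the Gaussian kernel:
$f_i = \mathrm{Kernel}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^{2}}{2\sigma^{2}}\right)$ (3)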

Step 2: Feature Extraction
This step converts each email into a feature vector $x \in \mathbb{R}^{n}$. Each feature $x_i \in \{0, 1\}$ indicates whether the i-th word of the dictionary occurs in the email: $x_i = 1$ if the word occurs and $x_i = 0$ if it does not.
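The binary feature vector described above can be sketched in Python as follows. The function name and the four-word vocabulary are hypothetical and are used only to keep the example small.

```python
def email_to_features(tokens, vocabulary):
    """x[i] = 1 if the i-th vocabulary word occurs in the email, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical four-word dictionary, for illustration only.
vocabulary = ["discount", "httpurl", "meeting", "dollars"]
x = email_to_features(["claim", "discount", "httpurl", "discount"], vocabulary)
```

Repeated words do not change the result, since each feature only records presence or absence.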
Fig. 3 Pre-processed training dataset that will be used to train an SVM classifier. Each original email was processed and converted into a vector $x^{(i)} \in \mathbb{R}^{1899}$.
The dataset is collected from the Spam Assassin Public Corpus and extracted. Each file is then processed and its features are extracted, i.e. it is converted into vector form. This allows us to build a dataset (X, y) of examples. The dataset is then divided into a training set and a test set in the ratio 80:20.
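An 80:20 split of this kind can be sketched in Python as follows. Shuffling the examples with a fixed seed before splitting is an assumption made here for illustration; the paper does not say how the examples were assigned to the two sets.

```python
import random

def train_test_split(examples, labels, train_fraction=0.8, seed=0):
    """Split (examples, labels) into a training part and a test part."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    cut = int(round(train_fraction * len(indices)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [examples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [examples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

# Ten placeholder examples with alternating labels.
examples = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, y_train, x_test, y_test = train_test_split(examples, labels)
```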
Model of Spam Classifier
The following diagram illustrates the approach used for spam classification.
Fig 4
Data collected from: Spam Assassin Public Corpus[3]
Table 2 dataset division
Experimental Analysis
After the dataset is loaded, the SVM classifies emails as spam (y = 1) or non-spam (y = 0). Once training completes, we vary the parameter C and note the change in accuracy on the training dataset and the test dataset:
d = Train accuracy - Test accuracy
Here d is the difference used to choose a good value of C for both the training and the testing dataset. If C is very large, d is large, showing that although the algorithm works well on the training set it may not work on the test set because of overfitting. Similarly, if C is very small or negative, d is again large, in this case because of underfitting. Somewhere in between, as shown in the graphs, there is an optimal value of C for which d is smallest, which shows that the algorithm works well on both the training and the test set and that there is neither a high bias nor a high variance problem.
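The selection rule can be sketched in Python as follows. The accuracy values are copied from Table 3 of the paper; d is recomputed here from the two accuracy columns, so individual values may differ slightly from the printed d column. The function name is hypothetical.

```python
# (C, training accuracy %, testing accuracy %) as reported in Table 3.
results = [
    (-1, 31.2, 30.0),
    (0, 31.9, 30.8),
    (0.01, 98.22, 98.0),
    (0.1, 99.8, 98.9),
    (0.5, 99.97, 97.6),
    (1, 99.97, 97.7),
    (10, 100.0, 97.5),
]

def best_c(rows):
    """Return the C whose train/test accuracy difference d is smallest."""
    c, _train_acc, _test_acc = min(rows, key=lambda r: r[1] - r[2])
    return c

chosen = best_c(results)
```

On these values the smallest difference is obtained at C = 0.01, matching the optimum reported in the paper.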
Table 3 Accuracy for different values of C
Value of C    Training accuracy (%)    Testing accuracy (%)    Difference (d)
-1            31.2                     30                      1.92
0             31.9                     30.8                    1.1
0.01          98.22                    98                      0.22
0.1           99.8                     98.9                    0.9
0.5           99.97                    97.6                    2.37
1             99.97                    97.7                    2.27
10            100                      97.5                    2.5
Observation
Table 4 Effect of C on bias and variance
C value              Problem          d value
C value very low     High bias        High d value
C value very high    High variance    High d value
Dataset division (Table 2):
Dataset                         Training samples    Testing samples    Total
Spam Assassin Public Corpus     4000                1000               5000

• When C is very large, the difference d is large, showing that an algorithm that works well on the training set may not work on the test set because of overfitting.
• When C is very small or negative, d is again large, this time because of underfitting. Somewhere in between, as shown in Graph 1, there is an optimal value of C, here 0.01, for which d is smallest. This shows that the algorithm works well on both the training and the test set, and hence we have found the optimal value of C.
• Graph 2 is a magnified form of Graph 1 that displays closely spaced values accurately.
• In Graph 3, when C = 10 the training accuracy is 100% but the test accuracy is below 100%, i.e. overfitting occurs. For the other positive values of C the training accuracy is close to 100%, as shown in Graph 3.
• In Graph 4, the test accuracy is extremely good for values of C greater than 0 and less than 1.
With these observations it is straightforward to reach a conclusion about the value of C. The graphs below show how accuracy changes with the experimental values of C.
Graphs
Graph 1 d vs. C
Graph 2 d vs. C (magnified)
Graph 3 Train accuracy vs. C
Graph 4 Test accuracy vs. C
Result: From the accuracies above we note that if C is increased greatly, e.g. to 10, the model overfits: the training accuracy is 100%, so the model has high variance and, when tested, gives a slightly lower accuracy of 97.5%. When C is too low the model tends to underfit and therefore has high bias. It is therefore essential to choose the optimal value of C for the training algorithm in order to optimize the model.
Conclusion
A large number of emails are received daily and many of them are spam. Over time it becomes difficult to handle such emails, and a spam classifier using a kernel model provides a good solution when the parameter C is varied and the behaviour of the accuracy is noted by plotting graphs.
In future, this is a suitable method for finding the value of C that gives the best fit, and hence for improving prediction so that the algorithm works well on both training and testing data.

References
[1] Shrawan Kumar Trivedi and Shubhamoy Dey, “A Combining Classifiers Approach for Detecting Email Spams,” 2016 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA).
[2] Ion Androutsopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” 18 Sep 2000.
[3] http://www.csmining.org/index.php/spam-assassin-datasets.html, downloaded on 3-04-2017, file name “20021010_easy_ham.tar”.
[4] D. Sculley and G. Wachman, “Relaxed Online SVMs for Spam Filtering,” in The Thirtieth Annual ACM SIGIR Conference Proceedings, 2007.
[5] Nitin Indurkhya and Fred J. Damerau, Handbook of Natural Language Processing, Second Edition.
[6] C. Xie, L. Ding and X. Du, “Anti-spam Filters Based on Support Vector Machines,” in Z. Cai, Z. Li, Z. Kang and Y. Liu (eds), Advances in Computation and Intelligence, ISICA 2009, Lecture Notes in Computer Science, vol. 5821, Springer, Berlin, Heidelberg, 2009.
[7] W. A. Awad and S. M. ELseuofi, “Machine Learning Methods for Spam E-Mail Classification,” Vol. 3, No. 1, Feb 2011.
[8] R. Malarvizhi and K. Saraswathi, “Content-Based Spam Filtering and Detection Algorithms: An Efficient Analysis & Comparison,” International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013.
[9] Mikko Siponen and Carl Stucke, “Effective Anti-Spam Strategies in Companies: An International Study,” in Proceedings of HICSS '06, vol. 6, 2006.
[10] D. Sculley and Gabriel M. Wachman, “Relaxed Online SVMs in the TREC Spam Filtering Track.”
[11] Trevor Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition.
[12] Sushama Chouhan, “Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations,” Vol. 2, Issue 9, September 2013.
[13] Durgesh K. Srivastava and Lekha Bhambhu, “Data Classification Using Support Vector Machine,” Journal of Theoretical and Applied Information Technology, vol. 12, no. 1.
[14] Ray Hunt and James Carpinter, “Current and New Developments in Spam Filtering,” 14th IEEE International Conference on Networks (ICON '06), 2006, February 2007.
[15] Qiu Shubo, Gu Shuai and Zhang Tongxing, “Paper Defects Recognition Based on SVM,” 2010 WASE International Conference, INSPEC Accession Number: 11529535, August 2010.