Proposed efficient algorithm to
filter spam using machine
learning techniques
Rozeena Saleha
MS in Information Security
rozeenasaleha15@gmail.com
Abstract
 Electronic spam is the most troublesome
Internet phenomenon challenging large global
companies, including AOL, Google, Yahoo and
Microsoft. Spam causes various problems that
may, in turn, cause economic losses: it creates
traffic problems and bottlenecks that limit
memory space, computing power and speed,
and it forces users to spend time removing it.
Problem it addresses
Spam causes traffic problems and bottlenecks
that
 limit memory space,
 computing power and
 speed.
 Spam also causes users to spend time removing it.
Architecture of spam filtering rules and existing methods
 This study describes three machine-learning
algorithms to filter spam from valid emails with
low error rates and high efficiency using a
multilayer perceptron model.
 Black list/white list,
 Bayesian classifying algorithm,
 Keyword matching and header information
analysis
Methodology
 Decision-based systems
 Bayesian classifiers
 Support vector machine
 Neural networks
 Sample-based methods
 Decision tree classifier, multilayer perceptron
and naïve Bayes classifier are provided in the
proposed model.
Multilayer perceptron (MLP)
The model delivers information by activating the input neurons with the values
labelled on them. The activation of neurons in the middle (hidden) and output
layers is calculated as a weighted sum of the incoming values passed through an
activation function.
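The activation formula itself did not survive in these slides; a minimal sketch of the standard MLP forward computation, with illustrative weights and inputs (not the paper's actual network parameters), would be:

```python
import math

def neuron_activation(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values for one email (e.g. rule scores)
x = [1.0, 0.0, 1.0]

# Middle (hidden) layer: each neuron has its own weight vector and bias
hidden = [
    neuron_activation(x, [0.2, -0.4, 0.1], 0.0),
    neuron_activation(x, [0.5, 0.3, -0.2], 0.1),
]

# Output layer: a single neuron over the hidden activations
y = neuron_activation(hidden, [0.7, -0.6], 0.05)
```

The sigmoid keeps every activation in (0, 1), so the output can be read as a spam score.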
A C4.5 tree is modelled as follows. A training
set is a set of base tuples used to determine
the classes related to these tuples. A tuple X is
represented by an attribute vector
X = (x1, x2, …, xn). Assume that a tuple belongs
to a predefined class that is determined by an
attribute called the class label. The training set
is randomly selected from the base; this step
is called the learning step. This technique is
very efficient and extensively used for
classification.
Decision Tree Classifier
 A node of the tree represents a test on an attribute;
 a branch exiting from a node represents the possible
outcomes of that test;
 a leaf represents a class label.
 A decision tree encodes a rule set by which objective
functions can be predicted. The J48 algorithm is an
optimized version of C4.5.
 The algorithm used for this model applies greedy
techniques to classify a sample based on a decision
tree.
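The attribute test at each node is chosen greedily by how much a split reduces class uncertainty. A sketch of that information-gain computation on a toy email dataset (the feature names and data are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on one attribute (C4.5-style)."""
    base = entropy(labels)
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical attributes: (contains_keyword, sender_blacklisted)
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
labels = ["S", "S", "L", "L"]          # S = spam, L = legitimate

gain0 = information_gain(rows, labels, 0)  # splitting on contains_keyword
```

Here the first attribute separates the classes perfectly, so the gain equals the full base entropy; C4.5 would pick it for the root test.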
Naïve Bayesian classifier
The classifier stores the probability
distribution in compressed form.
An important issue with Bayes theory is the
assumed independence of the random
variables; when this independence does not
hold, the probability formula is difficult to
calculate.
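A minimal, self-contained sketch of a naïve Bayes text classifier under the independence assumption (the toy vocabulary and counts are invented, not the paper's data):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate P(class) and word counts per class from tokenized docs."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    return priors, word_counts, vocab

def classify(doc, priors, word_counts, vocab):
    """Pick the class with the highest log-posterior, assuming independence."""
    best, best_lp = None, float("-inf")
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        lp = math.log(prior)
        for w in doc:
            # Laplace smoothing keeps unseen words from zeroing the product
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [["free", "money"], ["free", "offer"],
        ["meeting", "today"], ["project", "meeting"]]
labels = ["S", "S", "L", "L"]

model = train_nb(docs, labels)
pred = classify(["free", "offer", "money"], *model)
```

Working in log space avoids the underflow that multiplying many small word probabilities would otherwise cause.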
Extraction of features and implementation of the considered algorithm
The work here was based on rules scored according to
their efficiency. The considered rules were provided in three
forms:
1) email header information analysis,
2) keyword matching,
3) the main body of the message.
A score was finally obtained from these rules.
Table
 Table 1
 A series of rules used to assign a score to the received email.
 Via meaning of the name
 Via domain names
 Via blocked IP number
 By detecting apostrophes
 Via addresses in the black list
 Via addresses in the white list
 Via type of content
 Via bounded content
 Via content of the name
 Via undeclared addresses of recipients
 Via the main header
 Receiving from an address and sending to a similar address
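Table 1's actual scores are not reproduced in the slides; a hypothetical sketch of rule-based scoring (rule names loosely follow the table, while the weights and threshold are invented) might look like:

```python
# Hypothetical rule weights; negative values push toward "legitimate"
RULES = {
    "blocked_ip": 3.0,
    "blacklisted_sender": 4.0,
    "whitelisted_sender": -5.0,
    "keyword_match": 2.0,
    "undeclared_recipients": 1.5,
}

def score_email(triggered_rules, threshold=4.0):
    """Sum the scores of the triggered rules; at or above threshold -> spam."""
    score = sum(RULES[r] for r in triggered_rules)
    return score, ("S" if score >= threshold else "L")

score, label = score_email(["blocked_ip", "keyword_match"])
```

A whitelisted sender can outweigh several spam indicators, which matches the intent of white-list rules in the table.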
Accuracy
 The primary dataset included 750 valid emails and 750
spam messages. To extract the feature vector of an email, the
following methods were used:
1) email header review,
2) keyword review,
3) black list and white list.
 The class labels in the proposed model were L and S; the
former represented Legitimate and the latter Spam.
 Three classifiers, naïve Bayes, the C4.5 decision tree and an
MLP neural network, were run on the training data using the
WEKA software.
 The training data were tested in terms of message header
information, black and white lists and keyword review. The
efficiency and accuracy of the training data were evaluated with
10-fold cross-validation on reliably valid data, and an accuracy
factor was calculated.
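Assuming the usual definition over confusion-matrix counts, the accuracy factor can be sketched as:

```python
def accuracy(tp, tn, fp, fn):
    """Share of all messages (spam and legitimate) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts only, not the paper's reported results
acc = accuracy(tp=700, tn=720, fp=30, fn=50)
```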
Conclusion
Two other basic concepts are used in spam filtering operations:
the false positive and the false negative. The former refers to valid
emails wrongly classified as spam, and the latter to spam messages
classified as valid email. The false positive rate of a classifier affects
its efficiency. Table 1 shows the efficiency of the above classifiers.
The efficiency of these three techniques depends on the following
factors:
1) valid recognition rate,
2) data training time and
3) false positive rate.
The efficiency of the MLP neural network was better than that of the
other models, although MLP requires more time to develop its model;
J48 and the naïve Bayes algorithm require less time to learn the
experimental data.
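Assuming the usual convention that spam is the positive class (a false positive is then a legitimate message flagged as spam), the two error rates can be sketched as follows, with illustrative counts:

```python
def error_rates(fp, fn, n_legit, n_spam):
    """False positive rate over legitimate mail; false negative rate over spam."""
    return fp / n_legit, fn / n_spam

# Illustrative counts for a 750/750 dataset, not the paper's results
fpr, fnr = error_rates(fp=15, fn=45, n_legit=750, n_spam=750)
```

Keeping the false positive rate low matters most in practice, since losing a legitimate email is usually costlier than letting one spam through.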
