Proposed efficient algorithm to filter spam using machine learning techniques
Ali Shafigh Aski a, *, Navid Khalilzadeh Sourati b
a Department of Computer Engineering, Islamic Azad University, Sari Branch, Islamic Republic of Iran
b Islamic Azad University of Amol, Ayatollah Amoli Branch, Islamic Republic of Iran
Machine learning
2. Proposed efficient algorithm to filter spam using machine learning techniques
Rozeena Saleha
MS in Information Security
rozeenasaleha15@gmail.com
3. Abstract
Electronic spam is the most troublesome Internet phenomenon
challenging large global companies, including AOL, Google, Yahoo
and Microsoft. Spam causes various problems that may, in turn,
cause economic losses. It creates traffic problems and bottlenecks
that limit memory space, computing power and speed, and it costs
users time to remove.
4. Problem it addresses
Spam causes traffic problems and bottlenecks that limit:
memory space,
computing power,
speed.
Spam causes users to spend time removing it.
5. Architecture of spam filtering rules and existing methods
This study describes three machine-learning algorithms to filter
spam from valid emails with low error rates and high efficiency
using a multilayer perceptron model.
Existing methods include:
Black list/white list,
Bayesian classifying algorithm,
Keyword matching and header information analysis
6. Methodology
Decision-based systems
Bayesian classifiers
Support vector machine
Neural networks
Sample-based methods
The proposed model uses a decision tree classifier, a multilayer
perceptron and a naïve Bayes classifier.
7. Multilayer perceptron (MLP)
The model propagates information by activating the input neurons with the
values assigned to them. The activation of each neuron in a middle (hidden)
or output layer is then calculated from the activations of the preceding layer.
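The activation formula itself does not survive in this text. As a minimal sketch, assuming the common form in which each neuron applies a logistic (sigmoid) transfer function to the weighted sum of the previous layer's outputs, the calculation could look like this. The sigmoid, the bias terms, the layer sizes and all weight values are assumptions chosen for illustration, not values from the study.

```python
import math


def sigmoid(x):
    """Logistic transfer function (an assumption; the source does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_activations(prev_outputs, weights, biases):
    """Return the activation of every neuron in one layer.

    weights[i][j] is the weight of the connection from neuron j of the
    previous layer to neuron i of this layer.
    """
    return [
        sigmoid(sum(w * o for w, o in zip(row, prev_outputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical network: 3 input features, 2 hidden neurons, 1 output neuron.
inputs = [1.0, 0.0, 1.0]
hidden = layer_activations(inputs, [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output = layer_activations(hidden, [[1.2, -0.7]], [0.0])

# Map the output activation to the class labels used in this work (S = Spam, L = Legitimate).
label = "S" if output[0] >= 0.5 else "L"
```

In a trained network the weights would be learned from the labelled emails rather than written by hand.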
9. Decision Tree Classifier
A node of the tree represents a test on an attribute.
A branch exiting from a node represents a possible outcome of the test.
A leaf represents a class label.
A decision tree encodes a set of rules by which the target
(objective) function can be predicted. The J48 algorithm is WEKA's
implementation of C4.5.
The algorithm used for this model builds the tree with a greedy
technique; a sample is then classified by following the tree's tests
from the root down to a leaf.
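To illustrate how a sample is classified once a tree exists, here is a minimal sketch with a hand-built tree. The attribute names (`sender_blacklisted`, `has_spam_keyword`) and the tree shape are hypothetical; the sketch shows only the traversal, not how J48 or C4.5 chooses the tests.

```python
# Internal nodes test one attribute; leaves carry a class label
# (S = Spam, L = Legitimate). The tree below is hand-written for illustration.
tree = {
    "attribute": "sender_blacklisted",
    "branches": {
        True: {"label": "S"},
        False: {
            "attribute": "has_spam_keyword",
            "branches": {
                True: {"label": "S"},
                False: {"label": "L"},
            },
        },
    },
}


def classify(node, sample):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "label" not in node:
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node["label"]


sample = {"sender_blacklisted": False, "has_spam_keyword": True}
predicted = classify(tree, sample)
```

Each root-to-leaf path reads as one rule, for example "sender not blacklisted and no spam keyword, so Legitimate".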
11. Extraction of features and implementation of the considered algorithm
The work was based on rules, each scored according to its efficiency.
The considered rules took three forms:
1) email header information analysis,
2) keyword matching,
3) main body of the message.
A final score for the email was obtained from these rules.
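The rule-scoring idea can be sketched as follows, assuming each rule is a yes/no check with a fixed weight and the email's score is the sum of the weights of the rules that match. The rule set, the weights, the black list and the keywords are all hypothetical; the source does not give the actual values.

```python
# Hypothetical lookup data for illustration only.
BLACK_LIST = {"bad@example.com"}
SPAM_KEYWORDS = {"lottery", "winner"}

# Each rule is (check, weight): one header rule, one keyword rule, one body rule.
RULES = [
    (lambda e: e["from"] in BLACK_LIST, 3.0),
    (lambda e: any(k in e["subject"].lower() for k in SPAM_KEYWORDS), 2.0),
    (lambda e: any(k in e["body"].lower() for k in SPAM_KEYWORDS), 1.0),
]


def score_email(email, rules):
    """Sum the weights of all rules whose check matches the email."""
    return sum(weight for check, weight in rules if check(email))


email = {
    "from": "bad@example.com",
    "subject": "You are a WINNER",
    "body": "Claim your prize today.",
}
total = score_email(email, RULES)
```

The resulting score, or the individual rule outcomes, could then serve as features for the classifiers.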
12. Table
Table 1
Series of rules to assign a score to the received email.
Via meaning of the name
Via domain names
Via blocked IP number
By detecting apostrophe
Via addresses in the black list
Via addresses in the white list
Via type of content
Via bounded content
Via content of name
Via undeclared addresses of recipients
Via main header
Receiving from an address and sending to a similar address
13. Accuracy
The primary dataset included 750 valid emails and 750 spam emails.
To extract the feature vector of an email, the following methods
were used:
1) email header review,
2) keyword review,
3) black list and white list.
The class labels in the proposed model were L and S; the former
represents Legitimate and the latter represents Spam.
Three classifiers (naïve Bayes, the C4.5 decision tree and the MLP
neural network) were run on the training data using the WEKA
software.
The training data were tested in terms of message header
information, black and white lists and keyword review. The efficiency
and accuracy on the training data were evaluated with 10-fold
cross-validation. An accuracy factor was calculated from the results.
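The accuracy formula is not reproduced in this text. A minimal sketch, assuming the conventional definition of accuracy as the share of messages whose predicted label matches the true label, is shown below. The label lists are made-up examples, not data from the study.

```python
def accuracy(actual, predicted):
    """Fraction of messages whose predicted label equals the actual label."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels: S = Spam, L = Legitimate.
actual = ["S", "S", "L", "L", "L"]
predicted = ["S", "L", "L", "L", "S"]
acc = accuracy(actual, predicted)  # 3 of 5 labels match
```

Under cross-validation, this figure would be computed on each held-out fold and then averaged.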
14. Conclusion
Two other basic concepts are used in spam filtering: false positives
and false negatives. The former refers to valid emails wrongly
classified as spam, and the latter refers to spam classified as valid
email. The false positive rate of a classifier is used as a measure of
its efficiency. Table 1 shows the efficiency of the above classifiers.
Efficiency of these three techniques depends on the following
factors:
1) valid recognition rate,
2) data training time and
3) false positive rate.
The efficiency of the MLP neural network was better than that of the other
models. MLP requires more time to develop the model. The J48 and naïve
Bayes algorithms require more time to learn the experimental data.