SlideShare a Scribd company logo
1 of 5
Download to read offline
International Journal of Computer Applications Technology and Research
Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656
www.ijcat.com 839
Spam filtering by using Genetic based Feature Selection
Sorayya mirzapour kalaibar
Department of Computer, Shabestar Branch,
Islamic Azad University,
Shabestar, Iran
Seyed Naser Razavi
Computer Engineering Department,
Faulty of Electrical and Computer Engineering,
University of Tabriz, Iran
Abstract:
Spam is defined as redundant and unwanted electronica letters, and nowadays, it has created many problems in business life such as
occupying networks bandwidth and the space of user’s mailbox. Due to these problems, much research has been carried out in this
regard by using classification technique. The resent research show that feature selection can have positive effect on the efficiency of
machine learning algorithm. Most algorithms try to present a data model depending on certain detection of small set of features.
Unrelated features in the process of making model result in weak estimation and more computations. In this research it has been tried
to evaluate spam detection in legal electronica letters, and their effect on several Machin learning algorithms through presenting a
feature selection method based on genetic algorithm. Bayesian network and KNN classifiers have been taken into account in
classification phase and spam base dataset is used.
Keywords: Email spam, feature selection, genetic algorithm, classification.
1. INTRODUCTION
Nowadays, e-mail is widely becoming one of the fastest and
most economical forms of communication .Thus, the e-mail is
prone to be misused. One such misuse is the posting of
unsolicited, unwanted e-mails known as spam or junk e-
mails[1]. Spam is becoming an increasingly large problem.
Many Internet Service Providers (ISPs) receive over a billion
spam messages per day. Much of these e-mails are filtered
before they reach end users. Content-Based filtering is a key
technological method to e-mail filtering. The spam e-mail
contents usually contain common words called features.
Frequency of occurrence of these features inside an e-mail
gives an indication that the e-mail is a spam or legitimate
[2,3,4]. There are various purposes in sending spams such as
economical purposes. Some of the spams are unwanted
advertising and commercial message, while others deceive the
users to use their private information (phishing), or they
temporarily destroy the mail server by sending malicious
software to the user’s computer. Also, they create traffic, or
distribute immoral messages. Therefore, it is necessary to find
some ways to filter these troublesome and annoying emails
automatically. In order to detect spams, some methods such as
parameter optimization and feature selection have been
proposed in order to reduce processing overhead and to
guarantee high detection rate [16].The spam filtering is high
sensitive application of text classification (TC) task. A main
problem in text classification tasks which is more serious in
email filtering is existence of large number of features. For
solving the issue, various feature selection methods are
considered, which extract a lower dimensional feature space
from original one and offer it as input to classifier[5]. In this
paper, we incorporate genetic algorithm to find an optimal
subset of features of the spam base data set. The selected
features are used for classification of the spam base.
2. LITERATURE REVIEW
Features selection approaches are usually employed to reduce
the size of the feature set, and to select a subset of the original
features. Over the past years, the following methods have
been considered to select effective features such as the
algorithms based on population to select important features
and to remove irrelevant and redundant features such as
genetic algorithm (GA), particle swarm optimization (PSO),
and ant colony algorithm (ACO). Some algorithms are
developed to classify and filter e-mails. The RIPPER
algorithm [6] is an algorithm that employs rule-based to
filtering e-mails. Drucker, et. al. [7] proposed an SVM
algorithm for spam categorization. Sahami, et. al. [8]
proposed Bayesian junk E-mail filter using bag-of-words
representation and Naïve Bayes algorithm. Clark, et. al. [9]
used the bag-of-words representation and ANN for automated
spam filtering system. Branke, J. [10] discussed how the
genetic algorithm can be used to assist in designing and
training. Riley. J. [11] described a method of utilizing genetic
algorithms to train fixed architecture feed-forward and
recurrent neural networks. Yao. X. and Liu. Y. [12] reviewed
the different combinations between ANN and GA, and used
GA to evolve ANN connection weights, architectures,
learning rules, and input features. Wang and et al. presented
feature selection incorporation based on genetic algorithm and
support vector machine based on SRM to detect spam and
legitimate emails. The presented method had better results
than main SVM [13]. Zhu developed a new method based on
rough set and SVM in order to improve the level of
classification. Rough set was used as a feature selection to
decrease the number of feature and SVM as a classifier[14].
Fagboula and et al. considered GA to select an appropriate
subset of features, and they used SVM as a classifier. In order
to improve the classification accuracy and computation time,
some experiments were carried out in terms of data set of
Spam assassin [15]. Patwadhan and Ozarkar presented
random forest algorithm and partial decision trees for spam
classification. Some feature selection methods have been used
as a preprocessing stage such as Correlation based feature
selection, Chi-square, Entropy, Information Gain, Gain Ratio,
Mutual Information, Symmetrical Uncertainty, One R and
Relief. Using above mentioned methods resulting in selecting
International Journal of Computer Applications Technology and Research
Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656
www.ijcat.com 840
more efficient and useful features decrease time complexity
and increase accuracy [17].
3. GENETIC ALGORITHMS
A genetic algorithm (GA) is one of a number of heuristic
techniques that are based on natural selection from the
population members attempt to find high-quality solutions to
large and complex optimization problems. This algorithm can
identify and exploit regularities in the environment, and
converges on solutions (can also be regarded as locating the
local maxima) that were globally optimal [18]. This method is
very effective and widely used to find-out optimal or near
optimal solutions to a wide variety of problems. The genetic
algorithm repeatedly modifies a population of individual
solutions. At each step the genetic algorithm tries to select the
best individuals. From the current “parent” population genetic
algorithm creates “children”, who constitute next generation.
Over successive generations the population evolves toward an
optimal solution. The genetic algorithm uses three main rules
at each step to create the next generation. Select the
individuals, called parents that contribute to the population at
the next generation. Crossover rules that combine two parents
to form children for the next generation. Mutation rules, apply
random changes to individual parents to form children.
4. FEATURE SELECTION
Features selection approaches are usually employed to
reduce the size of the feature set, and to select a subset
of the original features. We use the proposed genetic
algorithms to optimize the features that contribute
significantly to the classification.
4.1. Feature Selection Using Proposed
Genetic Algorithm
In this section, the method of feature selection by using
the proposed genetic Algorithm has been presented. The
procedure of the proposed method has been stated in
details in the following section.
4.1.1. Initialize population
In the genetic algorithm, each solution to the feature
selection problem is a string of binary numbers, called
chromosome. In this algorithm initial population is
generated randomly. IN feature representation as a
chromosome, if the value of chromosome [i] is 1, the ith
feature is selected for classification, while if it is 0, then
these features will be removed [19,20]. Figure 1 shows
feature presentation as a chromosome.
In this research, we used weighted F-score for calculate
the fitness value of each chromosome. The algorithm
starts by randomly initializing a population of N
number of initial chromosome.
4.1.2 Cross over
The crossover is the most important operation in GA.
Crossover as name suggests is a process of
recombination of bit strings via an exchange of
segments between pairs of chromosomes. There are
various kinds of crossover. In one point cross-over, a bit
position is randomly selected that need to change. In
this process, a random number is generated which is a
number (less than or equal to the chromosome length)
as the crossover position [21]. Here one crossover point
is selected, binary string from beginning of
chromosome to the crossover point is copied from one
parent, the rest is copied from the second parent[22].
4.1.3. Proposed mutation
Mutation has the effect of ensuring that all possible
chromosomes can maintain good gene in the newly
generated chromosomes. In our approach, Mutation
operator is a two-steps process, and is a combination of
random and substitution mutation operator. Also is
occurs on the basis of two various mutation rates.
Mutation operator firstly events substitution step with
the probability of 0.03. In each generation, the best
chromosome involving better features and higher fitness
is selected, and it substitutes for the weakest
chromosome having lesser fitness than others. ( َ‫هزحل‬ ‫ایي‬
َ‫ک‬ ‫ضْد‬ ‫هی‬ ‫بعذی‬ ‫ًسل‬ َ‫ب‬ ‫جاری‬ ‫ًسل‬ ‫بزتز‬ ‫کزّهْسّم‬ ‫اًتقال‬ ‫باعث‬
‫داضت‬ ‫خْاُذ‬ ‫دًبال‬ َ‫ب‬ ‫ُن‬ ‫را‬ ‫الگْریتن‬ ‫سزیع‬ ‫)ُوگزایی‬Otherwise, it
enters the second mutation step with probability of 0.02.
This step changes some gens of chromosome randomly
by inverting their binary cells. In fact the second is
considered to prevent reducing exploration capability of
search space to keep diversity in other chromosomes.
Generally mutation probability is equal to 0.05.
5. RESULTS SIMULATION
In order to investigate the impact of our approach on
email spam classification, spam base data set that
downloaded from the UCI Machine Learning
Repository are used [23]. Data set of Spam base
involving 4601 emails was proposed by Mark Hopkins,
and his colleagues. In This data set that is divided into
two parts, 1 shows spam, and zero indicates non-spam.
This data set involves 57 features with continuous
values. In simulation of the proposed method, training
set involving 70% of the main data set and two
experimental sets have been separately considered for
feature selection and classification. Each one involves
15% of the main data set. After performing feature
selection using the training set, the test set was used to
evaluate the selected subset of features. The evaluation
of the overall process was based on weighted f-score
which is a suitable measure for the spam classification
...
01...101
Chromosome:
Figure 1. Feature Subset: {F1,F3 , …, Fn-1 }
International Journal of Computer Applications Technology and Research
Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656
www.ijcat.com 841
problem. The performance of spam filtering techniques
is determined by two well-known measures used in text
classification. These measures are precision and recall
[24, 25]. Here four metric have been used for evaluating
the performance of proposed method such as precision,
accuracy, recall and F1 score. These metrics are
computed as follows:
(1)i
i
i i
TP
TP FP
 

(2)
i
i
i i
TP
TP FN
 

(3)
2
1
( )
F
 
 
 


(4)
iiii
ii
FNTNFPTP
TNTP
Accuracy



Where:
TPi = the number of test samples that have been
properly classified in ci class.
FPi = the number of test samples that have been
incorrectly classified in ci class.
TNi = the number of test samples belonging to ci class,
and have been correctly classified in other classes.
FNi = the number of test samples belonging to ci class,
and have been incorrectly classified in other classes.
The methods of Bayesian network and K nearest
neighbors algorithm (KNN) have been used for
classification. The executed program and the obtained
average have been compared 8 times to investigate the
performance of each classifier. The results obtained
from the proposed method of feature selection have
been compared without considering feature selection.
The obtained results show that when the parameters are
presented in tables 1 the best performance is observed
in terms of GAFS.
Table 1: the parameters of feature selection by using
genetic algorithm
80
Initial population
0.03
Mutation rate1
0.02
Mutation rate2
0.7
Crossover
100
Generations
6. RESULT EVALUATION
In this section, the results of experiments have been
presented to evaluate the efficiency of proposed
method. ( ‫بٌذ‬ َ‫طبق‬ ّ‫د‬ َ‫هقایس‬ ‫ًتایج‬1ّ2‫دقت‬ ‫ًظز‬ ‫اس‬‫جذّل‬ ‫در‬2
‫ضکل‬ ‫ُوچٌیي‬ ‫است‬ ٍ‫ضذ‬ ٍ‫آّرد‬2‫پیطٌِادی‬ ‫رّش‬ ‫تاثیز‬ ‫گزافیکی‬ ‫ًوْدار‬
‫بذست‬ ‫ًتایج‬ ‫دُذ.طبق‬ ‫هی‬ ‫ًطاى‬ ‫را‬ ‫سائذ‬ ‫ُای‬ ‫ّیژگی‬ ‫کاُص‬ ‫هیشاى‬ ‫بز‬
‫است‬ َ‫تْاًست‬ ‫پیطٌِادی‬ ‫رّش‬ ‫ّرک‬ ‫ًت‬ ‫بٌذبیشیي‬ َ‫طبق‬ ‫هْرد‬ ‫در‬ ٍ‫آهذ‬
ُ ‫ّیژگی‬ ‫اس‬ ‫تْجِی‬ ‫قابل‬ ‫تعذاد‬ ‫حذف‬ ‫بز‬ ٍّ‫عال‬‫ا‬ ‫ًیش‬ ‫را‬ ‫بٌذی‬ َ‫طبق‬ ‫دقت‬
ُ . ‫دُذ‬ ‫افشایص‬‫ّیژگی‬ ‫حذف‬ ‫ّجْد‬ ‫با‬ ‫اى‬ ‫اى‬ ‫کا‬ ‫بٌذ‬ َ‫طبق‬ ‫هْرد‬ ‫در‬ ‫وچٌیي‬
.‫است‬ ٍ‫رسیذ‬ ‫ّیژگی‬ ‫اًتخاب‬ ‫اس‬ ‫قبل‬ ‫دقت‬ ‫ُواى‬ َ‫ب‬ ‫ًشدیک‬ ‫دقتی‬ َ‫ب‬ ‫ُا‬
‫جذّل‬ ‫در‬ ‫دیگز‬ ‫هعیار‬ َ‫س‬ ‫بزای‬ ٍ‫آهذ‬ ‫بذست‬ ‫ًتایج‬3.‫است‬ ٍ‫ضذ‬ ٍ‫داد‬ ‫ًطاى‬
َ‫ب‬ ‫ارسیابی‬ ‫هعیار‬ َ‫س‬ ‫ُز‬ ‫تواهی‬ ‫در‬ .. ‫بیشیي‬ ‫بٌذ‬ َ‫طبق‬ ‫جذّل‬ ‫ایي‬ ‫طبق‬
‫ا‬ َ‫هالحظ‬ ‫قابل‬ ‫بِبْد‬‫اى‬ ‫اى‬ ‫کا‬ ‫بٌذ‬ َ‫طبق‬ ّ ‫است‬ ٍ‫رسیذ‬ ‫ی‬‫اختالف‬ ‫با‬
‫ًاچیشی‬.‫است‬ ٍ‫رسیذ‬ ‫قبلی‬ ‫دقت‬ َ‫ب‬ ‫ًشدیک‬ ‫دقتی‬ َ‫ب‬ )Evaluation
results obtained for Bayesian Network and KNN
classifiers are shown in table 2. These results indicate
that feature selection by GA technique improves email
spam classification. GA FS and all features by using
mentioned classifiers have been compared in terms of
Accuracy, number of selected feature, recall, precision
and F score of spam class. As it is observed in table 2,
all evaluation measure of proposed GAFS in Bayesian
network, is more than the All feature while the number
of selected features is lesser. In addition, in comparing
two classifiers, Bayesian network algorithm, better
results were presented in comparison to KNN.
Table 2: comparing feature selection methods in
terms of accuracy
Algorithms
classifier
All Feature GA FS
Bayesian network 0.891 0.918
KNN (N=1) 0.9 0.891
International Journal of Computer Applications Technology and Research
Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656
www.ijcat.com 842
Figure 3: column graph of comparing the number
of selected features
7. CONCLUTION
In this paper, the proposed GA based feature selection
method has been presented and evaluated by using data
set of Spam Base. The results obtained from proposed
method were compared with position without feature
selection. The obtained results show that( ‫تعذاد‬ َ‫ب‬ َ‫تْج‬ ‫با‬
ٍ‫ضذ‬ ‫حذف‬ ‫ُای‬ ‫)ّیژگی‬ the proposed method has accuracy
comparable with without feature selection methods. In
addition, in Bayesian network classifier ‫ًسبت‬ ‫بِتزی‬ ‫(ًتایج‬
) ّ َ‫داضت‬ ‫دیگز‬ ‫بٌذ‬ َ‫طبق‬ َ‫ب‬
all evaluation criteria have been considerably improved.
ّ ‫ُا‬ ‫ّیژگی‬ ‫کاُص‬ ‫بز‬ ‫ای‬ َ‫هالحظ‬ ‫قابل‬ ‫تاثیز‬ ‫پیطٌِادی‬ ‫رّش‬ ‫(پس‬
‫افشایص‬‫/بِبْد‬).‫است‬ َ‫داضت‬ ‫دقت‬ We can use of parameter
optimization in this work also the proposed algorithm
can be combined with other classification algorithms in
the future.
REFERENCE
[1] GOWEDER, A. M., RASHED, T., ELBEKAIE, A., &
ALHAMMI, H. A. (2008). An Anti-Spam System Using
Artificial Neural Networks and Genetic Algorithms.
Paper presented at the Proceedings of the 2008
International Arab Conference on Information
Technology.
[2] Bruening, P.(2004). Technological Responses to the
Problem of Spam: Preserving Free Speech and Open
Internet Values. First Conference on E-mail and Anti-
Spam.
[3] Graham, P.(2003). A Plan for Spam. MIT Conference on
Spam.
[4] William, S., et. al. (2005). A Unified Model of Spam
Filtration, MIT Spam Conference, Cambridge.
[5] GOWEDER, A. M., RASHED, T., ELBEKAIE, A., &
ALHAMMI, H. A. (2008). An Anti-Spam System Using
Artificial Neural Networks and Genetic Algorithms.
Paper presented at the Proceedings of the 2008
International Arab Conference on Information
Technology.
[6] Cohen, W. (1996). Learning Rules that Classify E-mail,
In AAAI Spring Symposium on Machine Learning in
Information Access, California.
[7] Drucker, H., et. al.(1999) Support Vector Machines for
Spam Categorization, In IEEE Transactions on Neural
Networks.
Sahami, M., et. al.,(1998). A Bayesian Approach to
Filtering Junk E-Mail, In Learning for Text
Categorization, AAAI Technical Report, U.S.A.
[8] Riley. J. (2002). An evolutionary approach to training
Feed-Forward and Recurrent Neural Networks", Master
thesis of Applied Science in Information Technology,
Department of Computer Science, Royal Melbourne
Institute of Technology, Australia.
[9] Clark, et. al. (2003). A Neural Network Based Approach
to Automated E-Mail Classification, IEEE/WIC
International Conference on Web Intelligence.
[10] Branke, J. (1995). Evolutionary algorithms for neural
network design and training, In Proceedings 1st Nordic
Workshop on Genetic Algorithms and its Applications,
Finland.
[11] Yao. X., Liu. Y. (1997). A new evolutionary system for
evolving artificial neural networks", IEEE Transactions
on Neural Networks.
[12] Wang, H.-b., Y. Yu, and Z. Liu. (2005) SVM classifier
incorporating feature selection using GA for spam
detection, in Embedded and Ubiquitous Computing–
EUC 2005., Springer. p. 1147-1154.
classifier
measures
KNN(N=1) Bayesian network
All Feature GA FS All Feature GA FS
precision 0.892 0.886 0.89 0.935
recall 0.871 0.860 0.851 0.869
F1 score 0.882 0.871 0.87 0.900
Table 3: comparing feature selection methods
International Journal of Computer Applications Technology and Research
Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656
www.ijcat.com 843
[13] Zhu, Z. (2008). An email classification model based on
rough set and support vector machine. in Fuzzy Systems
and Knowledge Discovery.
[14] .Temitayo, F., O. Stephen, and A. Abimbola. (2012).
Hybrid GA-SVM for efficient feature selection in e-mail
classification. Computer Engineering and Intelligent
Systems. 3(3): p. 17-28.
[15] Stern, H. (2008) A Survey of Modern Spam Tools. in
CEAS. Citeseer.
[16] Ozarkar, P. and M. Patwardhan. (2013).
INTERNATIONAL JOURNAL OF COMPUTER
ENGINEERING & TECHNOLOGY (IJCET). Journal
Impact Factor. 4(3): p. 123-139.
[17] Zhang, L., Zhu, J., & Yao, T. (2004). An evaluation of
statistical spam filtering techniques. ACM Transactions
on Asian Language Information Processing (TALIP),
3(4), 243-269.
[18] Vafaie H, De Jong K. (1992). Genetic algorithms as a
tool for feature selection in machine learning. In
Proceedings of Fourth International Conference on Tools
with Artificial Intelligence (TAI '92). 200-203.
[19] Yang J, Honavar V. (1998). Feature subset selection
using a genetic algorithm. Intelligent Systems and their
Applications, IEEE, 13(2):44-49.
[20] Shrivastava, J. N., & Bindu, M. H. (2014). E-mail Spam
Filtering Using Adaptive Genetic Algorithm.
International Journal of Intelligent Systems &
Applications, 6(2).
[21] Karimpour, J., A.A. Noroozi, and A. Abadi. (2012). The
Impact of Feature Selection on Web Spam Detection.
International Journal of Intelligent Systems and
Applications (IJISA), 4(9): p. 61.
[22] UCI repository of Machine learning Databases. (1998).
Department of Information and Computer Science,
University of California, Irvine, CA,
http://www.ics.uci.edu/~mlearn/MLRepository.html,
Hettich, S., Blake, C. L., and Merz, C. J.
[23] Liao, C., Alpha, S., Dixon.P. (2004). Feature
Preparation in Text Categorization, Oracle Corporation.
[24] Clark, et. al. (2003). A Neural Network Based Approach
to Automated E-Mail Classification, IEEE/WIC
International Conference on Web Intelligence

More Related Content

What's hot

A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning ApproachesA Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approachesijtsrd
 
IRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET Journal
 
Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...IJECEIAES
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...Rudradityo Saha
 
Identifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeIdentifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeEditor IJCATR
 
03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septianIAESIJEECS
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmjournalBEEI
 
An incremental learning based framework for image spam filtering
An incremental learning based framework for image spam filteringAn incremental learning based framework for image spam filtering
An incremental learning based framework for image spam filteringIJCSEA Journal
 
A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clusteringidescitation
 
An Improved Leader Election Algorithm for Distributed Systems
An Improved Leader Election Algorithm for Distributed SystemsAn Improved Leader Election Algorithm for Distributed Systems
An Improved Leader Election Algorithm for Distributed Systemsijngnjournal
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Cross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesCross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesIJSRED
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Miningiosrjce
 
A Survey on: Sound Source Separation Methods
A Survey on: Sound Source Separation MethodsA Survey on: Sound Source Separation Methods
A Survey on: Sound Source Separation MethodsIJCERT
 
Efficient Filtering Algorithms for Location- Aware Publish/subscribe
Efficient Filtering Algorithms for Location- Aware Publish/subscribeEfficient Filtering Algorithms for Location- Aware Publish/subscribe
Efficient Filtering Algorithms for Location- Aware Publish/subscribeIJSRD
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 

What's hot (20)

A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning ApproachesA Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches
 
IRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection System
 
Spam Email identification
Spam Email identificationSpam Email identification
Spam Email identification
 
Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...
 
Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
 
Identifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeIdentifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision Tree
 
03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian
 
Aj35198205
Aj35198205Aj35198205
Aj35198205
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
 
An incremental learning based framework for image spam filtering
An incremental learning based framework for image spam filteringAn incremental learning based framework for image spam filtering
An incremental learning based framework for image spam filtering
 
A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clustering
 
An Improved Leader Election Algorithm for Distributed Systems
An Improved Leader Election Algorithm for Distributed SystemsAn Improved Leader Election Algorithm for Distributed Systems
An Improved Leader Election Algorithm for Distributed Systems
 
Ijecet 06 09_010
Ijecet 06 09_010Ijecet 06 09_010
Ijecet 06 09_010
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Cross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesCross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning Techniques
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Mining
 
A Survey on: Sound Source Separation Methods
A Survey on: Sound Source Separation MethodsA Survey on: Sound Source Separation Methods
A Survey on: Sound Source Separation Methods
 
Efficient Filtering Algorithms for Location- Aware Publish/subscribe
Efficient Filtering Algorithms for Location- Aware Publish/subscribeEfficient Filtering Algorithms for Location- Aware Publish/subscribe
Efficient Filtering Algorithms for Location- Aware Publish/subscribe
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 

Similar to Genetic algorithm feature selection for spam filtering

Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square IJECEIAES
 
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boostImproved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boosteSAT Journals
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...IJNSA Journal
 
Congestion Control in Wireless Sensor Networks Using Genetic Algorithm
Congestion Control in Wireless Sensor Networks Using Genetic AlgorithmCongestion Control in Wireless Sensor Networks Using Genetic Algorithm
Congestion Control in Wireless Sensor Networks Using Genetic AlgorithmEditor IJCATR
 
An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
 
An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsIRJET Journal
 
research paper
research paperresearch paper
research paperKalyan Ram
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 
Software Testing Using Genetic Algorithms
Software Testing Using Genetic AlgorithmsSoftware Testing Using Genetic Algorithms
Software Testing Using Genetic AlgorithmsIJCSES Journal
 
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMEMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMIRJET Journal
 
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...IJNSA Journal
 
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...IJERD Editor
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTIJDKP
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection:  Comparative StudyA Threshold Fuzzy Entropy Based Feature Selection:  Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection: Comparative StudyIJMER
 

Similar to Genetic algorithm feature selection for spam filtering (20)

Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square Feature Selection Approach based on Firefly Algorithm and Chi-square
Feature Selection Approach based on Firefly Algorithm and Chi-square
 
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boostImproved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
EMAIL SPAM CLASSIFICATION USING HYBRID APPROACH OF RBF NEURAL NETWORK AND PAR...
 
Congestion Control in Wireless Sensor Networks Using Genetic Algorithm
Congestion Control in Wireless Sensor Networks Using Genetic AlgorithmCongestion Control in Wireless Sensor Networks Using Genetic Algorithm
Congestion Control in Wireless Sensor Networks Using Genetic Algorithm
 
An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification
 
An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification An Ensemble of Filters and Wrappers for Microarray Data Classification
An Ensemble of Filters and Wrappers for Microarray Data Classification
 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming Assignments
 
research paper
research paperresearch paper
research paper
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 
Software Testing Using Genetic Algorithms
Software Testing Using Genetic AlgorithmsSoftware Testing Using Genetic Algorithms
Software Testing Using Genetic Algorithms
 
my IEEE
my IEEEmy IEEE
my IEEE
 
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHMEMAIL SPAM DETECTION USING HYBRID ALGORITHM
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
 
JUNE-77.pdf
JUNE-77.pdfJUNE-77.pdf
JUNE-77.pdf
 
A hybrid wrapper spider monkey optimization-simulated annealing model for opt...
A hybrid wrapper spider monkey optimization-simulated annealing model for opt...A hybrid wrapper spider monkey optimization-simulated annealing model for opt...
A hybrid wrapper spider monkey optimization-simulated annealing model for opt...
 
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
 
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
A Novel Framework For Numerical Character Recognition With Zoning Distance Fe...
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection:  Comparative StudyA Threshold Fuzzy Entropy Based Feature Selection:  Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
 

More from Editor IJCATR

Text Mining in Digital Libraries using OKAPI BM25 Model
 Text Mining in Digital Libraries using OKAPI BM25 Model Text Mining in Digital Libraries using OKAPI BM25 Model
Text Mining in Digital Libraries using OKAPI BM25 ModelEditor IJCATR
 
Green Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendlyGreen Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendlyEditor IJCATR
 
Policies for Green Computing and E-Waste in Nigeria
 Policies for Green Computing and E-Waste in Nigeria Policies for Green Computing and E-Waste in Nigeria
Policies for Green Computing and E-Waste in NigeriaEditor IJCATR
 
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...Editor IJCATR
 
Optimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation ConditionsOptimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation ConditionsEditor IJCATR
 
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...Editor IJCATR
 
Web Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source SiteWeb Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source SiteEditor IJCATR
 
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
 Evaluating Semantic Similarity between Biomedical Concepts/Classes through S... Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...Editor IJCATR
 
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
 Semantic Similarity Measures between Terms in the Biomedical Domain within f... Semantic Similarity Measures between Terms in the Biomedical Domain within f...
Semantic Similarity Measures between Terms in the Biomedical Domain within f...Editor IJCATR
 
A Strategy for Improving the Performance of Small Files in Openstack Swift
 A Strategy for Improving the Performance of Small Files in Openstack Swift  A Strategy for Improving the Performance of Small Files in Openstack Swift
A Strategy for Improving the Performance of Small Files in Openstack Swift Editor IJCATR
 
Integrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and RegistrationIntegrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and RegistrationEditor IJCATR
 
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
 Assessment of the Efficiency of Customer Order Management System: A Case Stu... Assessment of the Efficiency of Customer Order Management System: A Case Stu...
Assessment of the Efficiency of Customer Order Management System: A Case Stu...Editor IJCATR
 
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Editor IJCATR
 
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...Editor IJCATR
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
 
Hangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector MachineHangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector MachineEditor IJCATR
 
Application of 3D Printing in Education
Application of 3D Printing in EducationApplication of 3D Printing in Education
Application of 3D Printing in EducationEditor IJCATR
 
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...Editor IJCATR
 
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...Editor IJCATR
 
Decay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable CoefficientsDecay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable CoefficientsEditor IJCATR
 

More from Editor IJCATR (20)

Text Mining in Digital Libraries using OKAPI BM25 Model
 Text Mining in Digital Libraries using OKAPI BM25 Model Text Mining in Digital Libraries using OKAPI BM25 Model
Text Mining in Digital Libraries using OKAPI BM25 Model
 
Green Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendlyGreen Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendly
 
Policies for Green Computing and E-Waste in Nigeria
 Policies for Green Computing and E-Waste in Nigeria Policies for Green Computing and E-Waste in Nigeria
Policies for Green Computing and E-Waste in Nigeria
 
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
 
Optimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation ConditionsOptimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation Conditions
 
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
 
Web Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source SiteWeb Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source Site
 
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
 Evaluating Semantic Similarity between Biomedical Concepts/Classes through S... Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
 
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
 Semantic Similarity Measures between Terms in the Biomedical Domain within f... Semantic Similarity Measures between Terms in the Biomedical Domain within f...
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
 
A Strategy for Improving the Performance of Small Files in Openstack Swift
 A Strategy for Improving the Performance of Small Files in Openstack Swift  A Strategy for Improving the Performance of Small Files in Openstack Swift
A Strategy for Improving the Performance of Small Files in Openstack Swift
 
Integrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and RegistrationIntegrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and Registration
 
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
 Assessment of the Efficiency of Customer Order Management System: A Case Stu... Assessment of the Efficiency of Customer Order Management System: A Case Stu...
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
 
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
 
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
 
Hangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector MachineHangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector Machine
 
Application of 3D Printing in Education
Application of 3D Printing in EducationApplication of 3D Printing in Education
Application of 3D Printing in Education
 
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
 
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
 
Decay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable CoefficientsDecay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable Coefficients
 

Genetic algorithm feature selection for spam filtering

  • 1. International Journal of Computer Applications Technology and Research Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656 www.ijcat.com 839 Spam filtering by using Genetic based Feature Selection Sorayya mirzapour kalaibar Department of Computer, Shabestar Branch, Islamic Azad University, Shabestar, Iran Seyed Naser Razavi Computer Engineering Department, Faulty of Electrical and Computer Engineering, University of Tabriz, Iran Abstract: Spam is defined as redundant and unwanted electronica letters, and nowadays, it has created many problems in business life such as occupying networks bandwidth and the space of user’s mailbox. Due to these problems, much research has been carried out in this regard by using classification technique. The resent research show that feature selection can have positive effect on the efficiency of machine learning algorithm. Most algorithms try to present a data model depending on certain detection of small set of features. Unrelated features in the process of making model result in weak estimation and more computations. In this research it has been tried to evaluate spam detection in legal electronica letters, and their effect on several Machin learning algorithms through presenting a feature selection method based on genetic algorithm. Bayesian network and KNN classifiers have been taken into account in classification phase and spam base dataset is used. Keywords: Email spam, feature selection, genetic algorithm, classification. 1. INTRODUCTION Nowadays, e-mail is widely becoming one of the fastest and most economical forms of communication .Thus, the e-mail is prone to be misused. One such misuse is the posting of unsolicited, unwanted e-mails known as spam or junk e- mails[1]. Spam is becoming an increasingly large problem. Many Internet Service Providers (ISPs) receive over a billion spam messages per day. Much of these e-mails are filtered before they reach end users. Content-Based filtering is a key technological method to e-mail filtering. The spam e-mail contents usually contain common words called features. Frequency of occurrence of these features inside an e-mail gives an indication that the e-mail is a spam or legitimate [2,3,4]. There are various purposes in sending spams such as economical purposes. Some of the spams are unwanted advertising and commercial message, while others deceive the users to use their private information (phishing), or they temporarily destroy the mail server by sending malicious software to the user’s computer. Also, they create traffic, or distribute immoral messages. Therefore, it is necessary to find some ways to filter these troublesome and annoying emails automatically. In order to detect spams, some methods such as parameter optimization and feature selection have been proposed in order to reduce processing overhead and to guarantee high detection rate [16].The spam filtering is high sensitive application of text classification (TC) task. A main problem in text classification tasks which is more serious in email filtering is existence of large number of features. For solving the issue, various feature selection methods are considered, which extract a lower dimensional feature space from original one and offer it as input to classifier[5]. In this paper, we incorporate genetic algorithm to find an optimal subset of features of the spam base data set. The selected features are used for classification of the spam base. 2. LITERATURE REVIEW Features selection approaches are usually employed to reduce the size of the feature set, and to select a subset of the original features. Over the past years, the following methods have been considered to select effective features such as the algorithms based on population to select important features and to remove irrelevant and redundant features such as genetic algorithm (GA), particle swarm optimization (PSO), and ant colony algorithm (ACO). Some algorithms are developed to classify and filter e-mails. The RIPPER algorithm [6] is an algorithm that employs rule-based to filtering e-mails. Drucker, et. al. [7] proposed an SVM algorithm for spam categorization. Sahami, et. al. [8] proposed Bayesian junk E-mail filter using bag-of-words representation and Naïve Bayes algorithm. Clark, et. al. [9] used the bag-of-words representation and ANN for automated spam filtering system. Branke, J. [10] discussed how the genetic algorithm can be used to assist in designing and training. Riley. J. [11] described a method of utilizing genetic algorithms to train fixed architecture feed-forward and recurrent neural networks. Yao. X. and Liu. Y. [12] reviewed the different combinations between ANN and GA, and used GA to evolve ANN connection weights, architectures, learning rules, and input features. Wang and et al. presented feature selection incorporation based on genetic algorithm and support vector machine based on SRM to detect spam and legitimate emails. The presented method had better results than main SVM [13]. Zhu developed a new method based on rough set and SVM in order to improve the level of classification. Rough set was used as a feature selection to decrease the number of feature and SVM as a classifier[14]. Fagboula and et al. considered GA to select an appropriate subset of features, and they used SVM as a classifier. In order to improve the classification accuracy and computation time, some experiments were carried out in terms of data set of Spam assassin [15]. Patwadhan and Ozarkar presented random forest algorithm and partial decision trees for spam classification. Some feature selection methods have been used as a preprocessing stage such as Correlation based feature selection, Chi-square, Entropy, Information Gain, Gain Ratio, Mutual Information, Symmetrical Uncertainty, One R and Relief. Using above mentioned methods resulting in selecting
  • 2. International Journal of Computer Applications Technology and Research Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656 www.ijcat.com 840 more efficient and useful features decrease time complexity and increase accuracy [17]. 3. GENETIC ALGORITHMS A genetic algorithm (GA) is one of a number of heuristic techniques that are based on natural selection from the population members attempt to find high-quality solutions to large and complex optimization problems. This algorithm can identify and exploit regularities in the environment, and converges on solutions (can also be regarded as locating the local maxima) that were globally optimal [18]. This method is very effective and widely used to find-out optimal or near optimal solutions to a wide variety of problems. The genetic algorithm repeatedly modifies a population of individual solutions. At each step the genetic algorithm tries to select the best individuals. From the current “parent” population genetic algorithm creates “children”, who constitute next generation. Over successive generations the population evolves toward an optimal solution. The genetic algorithm uses three main rules at each step to create the next generation. Select the individuals, called parents that contribute to the population at the next generation. Crossover rules that combine two parents to form children for the next generation. Mutation rules, apply random changes to individual parents to form children. 4. FEATURE SELECTION Features selection approaches are usually employed to reduce the size of the feature set, and to select a subset of the original features. We use the proposed genetic algorithms to optimize the features that contribute significantly to the classification. 4.1. Feature Selection Using Proposed Genetic Algorithm In this section, the method of feature selection by using the proposed genetic Algorithm has been presented. The procedure of the proposed method has been stated in details in the following section. 4.1.1. Initialize population In the genetic algorithm, each solution to the feature selection problem is a string of binary numbers, called chromosome. In this algorithm initial population is generated randomly. IN feature representation as a chromosome, if the value of chromosome [i] is 1, the ith feature is selected for classification, while if it is 0, then these features will be removed [19,20]. Figure 1 shows feature presentation as a chromosome. In this research, we used weighted F-score for calculate the fitness value of each chromosome. The algorithm starts by randomly initializing a population of N number of initial chromosome. 4.1.2 Cross over The crossover is the most important operation in GA. Crossover as name suggests is a process of recombination of bit strings via an exchange of segments between pairs of chromosomes. There are various kinds of crossover. In one point cross-over, a bit position is randomly selected that need to change. In this process, a random number is generated which is a number (less than or equal to the chromosome length) as the crossover position [21]. Here one crossover point is selected, binary string from beginning of chromosome to the crossover point is copied from one parent, the rest is copied from the second parent[22]. 4.1.3. Proposed mutation Mutation has the effect of ensuring that all possible chromosomes can maintain good gene in the newly generated chromosomes. In our approach, Mutation operator is a two-steps process, and is a combination of random and substitution mutation operator. Also is occurs on the basis of two various mutation rates. Mutation operator firstly events substitution step with the probability of 0.03. In each generation, the best chromosome involving better features and higher fitness is selected, and it substitutes for the weakest chromosome having lesser fitness than others. ( َ‫هزحل‬ ‫ایي‬ َ‫ک‬ ‫ضْد‬ ‫هی‬ ‫بعذی‬ ‫ًسل‬ َ‫ب‬ ‫جاری‬ ‫ًسل‬ ‫بزتز‬ ‫کزّهْسّم‬ ‫اًتقال‬ ‫باعث‬ ‫داضت‬ ‫خْاُذ‬ ‫دًبال‬ َ‫ب‬ ‫ُن‬ ‫را‬ ‫الگْریتن‬ ‫سزیع‬ ‫)ُوگزایی‬Otherwise, it enters the second mutation step with probability of 0.02. This step changes some gens of chromosome randomly by inverting their binary cells. In fact the second is considered to prevent reducing exploration capability of search space to keep diversity in other chromosomes. Generally mutation probability is equal to 0.05. 5. RESULTS SIMULATION In order to investigate the impact of our approach on email spam classification, spam base data set that downloaded from the UCI Machine Learning Repository are used [23]. Data set of Spam base involving 4601 emails was proposed by Mark Hopkins, and his colleagues. In This data set that is divided into two parts, 1 shows spam, and zero indicates non-spam. This data set involves 57 features with continuous values. In simulation of the proposed method, training set involving 70% of the main data set and two experimental sets have been separately considered for feature selection and classification. Each one involves 15% of the main data set. After performing feature selection using the training set, the test set was used to evaluate the selected subset of features. The evaluation of the overall process was based on weighted f-score which is a suitable measure for the spam classification ... 01...101 Chromosome: Figure 1. Feature Subset: {F1,F3 , …, Fn-1 }
  • 3. International Journal of Computer Applications Technology and Research Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656 www.ijcat.com 841 problem. The performance of spam filtering techniques is determined by two well-known measures used in text classification. These measures are precision and recall [24, 25]. Here four metric have been used for evaluating the performance of proposed method such as precision, accuracy, recall and F1 score. These metrics are computed as follows: (1)i i i i TP TP FP    (2) i i i i TP TP FN    (3) 2 1 ( ) F         (4) iiii ii FNTNFPTP TNTP Accuracy    Where: TPi = the number of test samples that have been properly classified in ci class. FPi = the number of test samples that have been incorrectly classified in ci class. TNi = the number of test samples belonging to ci class, and have been correctly classified in other classes. FNi = the number of test samples belonging to ci class, and have been incorrectly classified in other classes. The methods of Bayesian network and K nearest neighbors algorithm (KNN) have been used for classification. The executed program and the obtained average have been compared 8 times to investigate the performance of each classifier. The results obtained from the proposed method of feature selection have been compared without considering feature selection. The obtained results show that when the parameters are presented in tables 1 the best performance is observed in terms of GAFS. Table 1: the parameters of feature selection by using genetic algorithm 80 Initial population 0.03 Mutation rate1 0.02 Mutation rate2 0.7 Crossover 100 Generations 6. RESULT EVALUATION In this section, the results of experiments have been presented to evaluate the efficiency of proposed method. ( ‫بٌذ‬ َ‫طبق‬ ّ‫د‬ َ‫هقایس‬ ‫ًتایج‬1ّ2‫دقت‬ ‫ًظز‬ ‫اس‬‫جذّل‬ ‫در‬2 ‫ضکل‬ ‫ُوچٌیي‬ ‫است‬ ٍ‫ضذ‬ ٍ‫آّرد‬2‫پیطٌِادی‬ ‫رّش‬ ‫تاثیز‬ ‫گزافیکی‬ ‫ًوْدار‬ ‫بذست‬ ‫ًتایج‬ ‫دُذ.طبق‬ ‫هی‬ ‫ًطاى‬ ‫را‬ ‫سائذ‬ ‫ُای‬ ‫ّیژگی‬ ‫کاُص‬ ‫هیشاى‬ ‫بز‬ ‫است‬ َ‫تْاًست‬ ‫پیطٌِادی‬ ‫رّش‬ ‫ّرک‬ ‫ًت‬ ‫بٌذبیشیي‬ َ‫طبق‬ ‫هْرد‬ ‫در‬ ٍ‫آهذ‬ ُ ‫ّیژگی‬ ‫اس‬ ‫تْجِی‬ ‫قابل‬ ‫تعذاد‬ ‫حذف‬ ‫بز‬ ٍّ‫عال‬‫ا‬ ‫ًیش‬ ‫را‬ ‫بٌذی‬ َ‫طبق‬ ‫دقت‬ ُ . ‫دُذ‬ ‫افشایص‬‫ّیژگی‬ ‫حذف‬ ‫ّجْد‬ ‫با‬ ‫اى‬ ‫اى‬ ‫کا‬ ‫بٌذ‬ َ‫طبق‬ ‫هْرد‬ ‫در‬ ‫وچٌیي‬ .‫است‬ ٍ‫رسیذ‬ ‫ّیژگی‬ ‫اًتخاب‬ ‫اس‬ ‫قبل‬ ‫دقت‬ ‫ُواى‬ َ‫ب‬ ‫ًشدیک‬ ‫دقتی‬ َ‫ب‬ ‫ُا‬ ‫جذّل‬ ‫در‬ ‫دیگز‬ ‫هعیار‬ َ‫س‬ ‫بزای‬ ٍ‫آهذ‬ ‫بذست‬ ‫ًتایج‬3.‫است‬ ٍ‫ضذ‬ ٍ‫داد‬ ‫ًطاى‬ َ‫ب‬ ‫ارسیابی‬ ‫هعیار‬ َ‫س‬ ‫ُز‬ ‫تواهی‬ ‫در‬ .. ‫بیشیي‬ ‫بٌذ‬ َ‫طبق‬ ‫جذّل‬ ‫ایي‬ ‫طبق‬ ‫ا‬ َ‫هالحظ‬ ‫قابل‬ ‫بِبْد‬‫اى‬ ‫اى‬ ‫کا‬ ‫بٌذ‬ َ‫طبق‬ ّ ‫است‬ ٍ‫رسیذ‬ ‫ی‬‫اختالف‬ ‫با‬ ‫ًاچیشی‬.‫است‬ ٍ‫رسیذ‬ ‫قبلی‬ ‫دقت‬ َ‫ب‬ ‫ًشدیک‬ ‫دقتی‬ َ‫ب‬ )Evaluation results obtained for Bayesian Network and KNN classifiers are shown in table 2. These results indicate that feature selection by GA technique improves email spam classification. GA FS and all features by using mentioned classifiers have been compared in terms of Accuracy, number of selected feature, recall, precision and F score of spam class. As it is observed in table 2, all evaluation measure of proposed GAFS in Bayesian network, is more than the All feature while the number of selected features is lesser. In addition, in comparing two classifiers, Bayesian network algorithm, better results were presented in comparison to KNN. Table 2: comparing feature selection methods in terms of accuracy Algorithms classifier All Feature GA FS Bayesian network 0.891 0.918 KNN (N=1) 0.9 0.891
  • 4. International Journal of Computer Applications Technology and Research Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656 www.ijcat.com 842 Figure 3: column graph of comparing the number of selected features 7. CONCLUTION In this paper, the proposed GA based feature selection method has been presented and evaluated by using data set of Spam Base. The results obtained from proposed method were compared with position without feature selection. The obtained results show that( ‫تعذاد‬ َ‫ب‬ َ‫تْج‬ ‫با‬ ٍ‫ضذ‬ ‫حذف‬ ‫ُای‬ ‫)ّیژگی‬ the proposed method has accuracy comparable with without feature selection methods. In addition, in Bayesian network classifier ‫ًسبت‬ ‫بِتزی‬ ‫(ًتایج‬ ) ّ َ‫داضت‬ ‫دیگز‬ ‫بٌذ‬ َ‫طبق‬ َ‫ب‬ all evaluation criteria have been considerably improved. ّ ‫ُا‬ ‫ّیژگی‬ ‫کاُص‬ ‫بز‬ ‫ای‬ َ‫هالحظ‬ ‫قابل‬ ‫تاثیز‬ ‫پیطٌِادی‬ ‫رّش‬ ‫(پس‬ ‫افشایص‬‫/بِبْد‬).‫است‬ َ‫داضت‬ ‫دقت‬ We can use of parameter optimization in this work also the proposed algorithm can be combined with other classification algorithms in the future. REFERENCE [1] GOWEDER, A. M., RASHED, T., ELBEKAIE, A., & ALHAMMI, H. A. (2008). An Anti-Spam System Using Artificial Neural Networks and Genetic Algorithms. Paper presented at the Proceedings of the 2008 International Arab Conference on Information Technology. [2] Bruening, P.(2004). Technological Responses to the Problem of Spam: Preserving Free Speech and Open Internet Values. First Conference on E-mail and Anti- Spam. [3] Graham, P.(2003). A Plan for Spam. MIT Conference on Spam. [4] William, S., et. al. (2005). A Unified Model of Spam Filtration, MIT Spam Conference, Cambridge. [5] GOWEDER, A. M., RASHED, T., ELBEKAIE, A., & ALHAMMI, H. A. (2008). An Anti-Spam System Using Artificial Neural Networks and Genetic Algorithms. Paper presented at the Proceedings of the 2008 International Arab Conference on Information Technology. [6] Cohen, W. (1996). Learning Rules that Classify E-mail, In AAAI Spring Symposium on Machine Learning in Information Access, California. [7] Drucker, H., et. al.(1999) Support Vector Machines for Spam Categorization, In IEEE Transactions on Neural Networks. Sahami, M., et. al.,(1998). A Bayesian Approach to Filtering Junk E-Mail, In Learning for Text Categorization, AAAI Technical Report, U.S.A. [8] Riley. J. (2002). An evolutionary approach to training Feed-Forward and Recurrent Neural Networks", Master thesis of Applied Science in Information Technology, Department of Computer Science, Royal Melbourne Institute of Technology, Australia. [9] Clark, et. al. (2003). A Neural Network Based Approach to Automated E-Mail Classification, IEEE/WIC International Conference on Web Intelligence. [10] Branke, J. (1995). Evolutionary algorithms for neural network design and training, In Proceedings 1st Nordic Workshop on Genetic Algorithms and its Applications, Finland. [11] Yao. X., Liu. Y. (1997). A new evolutionary system for evolving artificial neural networks", IEEE Transactions on Neural Networks. [12] Wang, H.-b., Y. Yu, and Z. Liu. (2005) SVM classifier incorporating feature selection using GA for spam detection, in Embedded and Ubiquitous Computing– EUC 2005., Springer. p. 1147-1154. classifier measures KNN(N=1) Bayesian network All Feature GA FS All Feature GA FS precision 0.892 0.886 0.89 0.935 recall 0.871 0.860 0.851 0.869 F1 score 0.882 0.871 0.87 0.900 Table 3: comparing feature selection methods
  • 5. International Journal of Computer Applications Technology and Research Volume 3– Issue 12, 839 - 843, 2014, ISSN:- 2319–8656 www.ijcat.com 843 [13] Zhu, Z. (2008). An email classification model based on rough set and support vector machine. in Fuzzy Systems and Knowledge Discovery. [14] .Temitayo, F., O. Stephen, and A. Abimbola. (2012). Hybrid GA-SVM for efficient feature selection in e-mail classification. Computer Engineering and Intelligent Systems. 3(3): p. 17-28. [15] Stern, H. (2008) A Survey of Modern Spam Tools. in CEAS. Citeseer. [16] Ozarkar, P. and M. Patwardhan. (2013). INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET). Journal Impact Factor. 4(3): p. 123-139. [17] Zhang, L., Zhu, J., & Yao, T. (2004). An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4), 243-269. [18] Vafaie H, De Jong K. (1992). Genetic algorithms as a tool for feature selection in machine learning. In Proceedings of Fourth International Conference on Tools with Artificial Intelligence (TAI '92). 200-203. [19] Yang J, Honavar V. (1998). Feature subset selection using a genetic algorithm. Intelligent Systems and their Applications, IEEE, 13(2):44-49. [20] Shrivastava, J. N., & Bindu, M. H. (2014). E-mail Spam Filtering Using Adaptive Genetic Algorithm. International Journal of Intelligent Systems & Applications, 6(2). [21] Karimpour, J., A.A. Noroozi, and A. Abadi. (2012). The Impact of Feature Selection on Web Spam Detection. International Journal of Intelligent Systems and Applications (IJISA), 4(9): p. 61. [22] UCI repository of Machine learning Databases. (1998). Department of Information and Computer Science, University of California, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html, Hettich, S., Blake, C. L., and Merz, C. J. [23] Liao, C., Alpha, S., Dixon.P. (2004). Feature Preparation in Text Categorization, Oracle Corporation. [24] Clark, et. al. (2003). A Neural Network Based Approach to Automated E-Mail Classification, IEEE/WIC International Conference on Web Intelligence