IJCSMS International Journal of Computer Science & Management Studies, Vol. 14, Issue 03, March 2014
An Indexed and Referred Journal
ISSN (Online): 2231-5268
www.ijcsms.com
Hybrid Approach for Feature Subset Selection
Rahul Kaushik¹, Bright Keswani²
¹M.Tech. Scholar, Gyan Vihar University, Jaipur (Rahul.naresh108@gmail.com)
²Associate Professor, Gyan Vihar University, Jaipur (kbright@rediffmail.com)
Abstract
Rapid advances in computer technologies for data processing, collection, and storage have provided unparalleled opportunities to expand capabilities in production, services, communications, and research. However, immense quantities of high-dimensional data renew the challenges to state-of-the-art data mining techniques. Artificial Bee Colony (ABC) is a popular meta-heuristic search algorithm used to solve numerous combinatorial optimization problems. Feature Selection (FS) helps to speed up classification by extracting the relevant and useful information from the dataset. FS is treated as an optimization problem because selecting the appropriate feature subset is critical. A classifier ensemble is a common remedy for the accuracy limitations of a single classifier. We first briefly introduce the concept of data mining and the key components of feature selection. This paper proposes a hybrid approach to feature selection using artificial bee colony and particle swarm optimization. The proposed technique is simulated using WEKA, and the results show the better performance of the proposed technique.
Keywords: Data Mining, Feature Subset Selection,
Swarm Intelligence, ABC, PSO.
1. Introduction
Data mining is the process of finding patterns and relations in large databases. It is especially advantageous for high-volume, frequently changing data, such as in financial application areas. The primary purpose of data mining is to extract information from huge amounts of raw data. Statistical methods as well as machine learning methods, such as induced decision trees and neural networks, have been used for this purpose with good results. The actual mining or extraction of patterns requires the data to be clean, since input data are the primary, if not the only, source of knowledge in these systems. Cleaning and preprocessing data involves several steps, including procedures for handling incomplete, noisy, or missing data; sampling of appropriate data; feature selection; feature construction; and formatting the data as per the representational requirements of the methods (e.g., decision trees, neural networks) used to extract knowledge from the data [1].
2. Feature Subset Selection
Feature selection is one of the most important factors influencing the classification accuracy rate. If the dataset contains a large number of features, the dimension of the feature space will be large and noisy, degrading the classification accuracy rate. An efficient and robust feature selection method can eliminate noisy, irrelevant and redundant data [2].
Feature subset selection algorithms can be categorized into two types: filter algorithms and wrapper algorithms. Filter algorithms select the feature subset before the application of any classification algorithm and remove the less important features from the subset. Wrapper methods define the learning algorithm, the performance criteria and the search strategy; the learning algorithm searches for the subset using the training data and the performance of the current subset [3].
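The wrapper strategy can be illustrated with a minimal sketch. All names here are ours, and the `evaluate` callback merely stands in for training and scoring a real classifier on each candidate subset:

```python
import itertools

def wrapper_select(features, evaluate):
    """Exhaustive wrapper search: score every non-empty subset with the
    learning algorithm and keep the best-scoring one."""
    best_subset, best_score = None, float("-inf")
    for r in range(1, len(features) + 1):
        for subset in itertools.combinations(features, r):
            score = evaluate(subset)  # train/score the classifier on this subset
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy scorer: pretend features "b" and "d" are the informative ones,
# with a small penalty for subset size (mimicking a merit criterion).
def toy_merit(subset):
    return 10 * sum(f in ("b", "d") for f in subset) - len(subset)

print(wrapper_select(["a", "b", "c", "d"], toy_merit))  # → (('b', 'd'), 18)
```

Real wrapper methods replace the exhaustive loop with a heuristic search, precisely because the subset space explodes with the number of features.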
In feature subset selection problem, the prediction
accuracy of the selected subset depends on the size of
the subset as well as the features selected.
Unfortunately, the prediction accuracy is not a
monotonic function of the feature subset with respect
to the set inclusion. Furthermore, in many practical
applications, the number of features in the original set
ranges from medium size (in hundreds) to large-scale
instances (in tens or hundreds of thousands).
Accordingly, the subset selection problem is an NP-hard combinatorial problem and requires efficient solution algorithms.
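The combinatorial blow-up is easy to quantify: an N-feature dataset has 2^N − 1 non-empty candidate subsets, which already rules out exhaustive evaluation for moderately large N:

```python
# A dataset with N features has 2**N - 1 non-empty candidate subsets,
# so exhaustive evaluation quickly becomes infeasible as N grows.
for n in (6, 9, 20):
    print(n, 2 ** n - 1)  # → 6 63, 9 511, 20 1048575
```

The 6- and 9-attribute counts correspond to the car and diabetes datasets used later in this paper; at 20 features, over a million subsets would need to be scored.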
3. Swarm Intelligence
The term swarm is used in a general manner to refer to any loosely structured collection of interacting agents or individuals. The classical example of a swarm is bees swarming around their hive; nevertheless, the metaphor can easily be extended to other systems with a similar architecture. An ant colony can be thought of as a swarm whose individual agents are ants; similarly, a flock of birds is a swarm of birds, an immune system is a swarm of cells and molecules, and a crowd is a swarm of people. The Particle Swarm Optimization (PSO) algorithm models the social behaviour of bird flocking or fish schooling.
3.1 Artificial Bee Colony (ABC) Algorithm
In ABC algorithm, the colony of artificial bees
contains three groups of bees: employed bees,
onlookers and scouts. First half of the colony consists
of the employed artificial bees and the second half
includes the onlookers. For every food source, there
is only one employed bee. In other words, the number
of employed bees is equal to the number of food
sources. The employed bee of an abandoned food
source becomes a scout. The search carried out by the
artificial bees can be summarized as follows [4]:
• Employed bees determine a food source
within the neighbourhood of the food source
in their memory.
• Employed bees share their information with
onlookers within the hive and then the
onlookers select one of the food sources.
• Onlookers select a food source within the
neighbourhood of the food sources chosen
by them.
An employed bee whose source has been abandoned becomes a scout and starts to search for a new food source randomly.
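The employed/onlooker/scout cycle can be sketched for a simple continuous minimization problem. This is a toy illustration under our own assumptions (function, parameter names, and the 1/(1+f) fitness for non-negative f are ours), not the paper's implementation:

```python
import random

def abc_minimize(f, bounds, n_sources=10, limit=5, iters=50, seed=0):
    """Minimal ABC sketch: employed bees perturb their source, onlookers
    pick sources by fitness-proportional roulette, scouts replace sources
    that failed to improve for `limit` trials. Assumes f(x) >= 0."""
    rng = random.Random(seed)
    lo, hi = bounds
    sources = [rng.uniform(lo, hi) for _ in range(n_sources)]
    trials = [0] * n_sources

    def try_neighbour(i):
        k = rng.randrange(n_sources)                     # random partner source
        cand = sources[i] + rng.uniform(-1, 1) * (sources[i] - sources[k])
        cand = min(max(cand, lo), hi)
        if f(cand) < f(sources[i]):                      # greedy selection
            sources[i], trials[i] = cand, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_sources):                       # employed-bee phase
            try_neighbour(i)
        fits = [1.0 / (1.0 + f(s)) for s in sources]     # fitness for roulette
        total = sum(fits)
        for _ in range(n_sources):                       # onlooker-bee phase
            r, acc, chosen = rng.uniform(0, total), 0.0, 0
            for j, ft in enumerate(fits):
                acc += ft
                if acc >= r:
                    chosen = j
                    break
            try_neighbour(chosen)
        for i in range(n_sources):                       # scout-bee phase
            if trials[i] > limit:
                sources[i], trials[i] = rng.uniform(lo, hi), 0

    return min(sources, key=f)

best = abc_minimize(lambda x: (x - 3) ** 2, (-10, 10))  # minimum is at x = 3
```

The three phases map directly onto the bullet points above: employed bees exploit known sources, onlookers reinforce the better ones, and scouts reintroduce diversity.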
3.1.1 ABC Feature Selection
Unlike continuous optimization problems, where candidate solutions can be represented by real-valued vectors, candidate solutions to the feature selection problem are represented by bit vectors. Each food source is associated with a bit vector of size N, where N is the total number of features, and each position in the vector corresponds to one feature. If the value at a position is 1, the corresponding feature is part of the subset to be evaluated; if the value is 0, the feature is not part of the subset to be assessed [5].
Additionally, each food source stores its quality
(fitness), which is given by the accuracy of the
classifier using the feature subset indicated by the bit
vector.
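The bit-vector encoding and its fitness are straightforward to express. In this sketch the `accuracy_of` callback is hypothetical; in practice it would train and score a classifier on the decoded subset:

```python
# Each food source is a bit vector of length N plus a stored fitness.
def subset_from_bits(bits):
    """Indices of the features included in this candidate subset."""
    return [i for i, b in enumerate(bits) if b == 1]

def fitness(bits, accuracy_of):
    """Fitness of a food source: classifier accuracy on the decoded subset."""
    return accuracy_of(subset_from_bits(bits))

source = [1, 0, 0, 1, 0, 1]              # N = 6 features; 0, 3 and 5 selected
print(subset_from_bits(source))          # → [0, 3, 5]
print(fitness(source, lambda s: len(s) / 6))  # → 0.5 (toy scorer)
```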
4. Proposed Work
In the proposed work, we have developed a hybrid approach to feature selection using artificial bee colony and particle swarm optimization. The food sources in ABC are the attributes of the dataset. The subset selection procedure is carried out by PSO, while the merit of each subset is calculated by ABC. In other words, PSO selects a subset from the features available in the dataset, and ABC checks the accuracy of that subset. The process is explained by the following algorithm and flowchart.
4.1 Proposed Algorithm
1. The algorithm is initialized with N resources, where N is the total number of attributes in the dataset. Each resource is initialized with N particles. The resource (i.e., the attribute) with the largest fitness will be selected in the result.
2. The resources are evaluated using classification accuracy as fitness. The particle of the selected feature is used for classification.
3. Employed bees determine the neighbours of the chosen resources, and the velocity of the corresponding particle within each resource is updated. Each employed bee visits a resource and explores its neighbourhood. The attribute selection process is driven by the particle velocity of the attribute:
v_i = v_i + c * r * (pbest_i - x_i)
where v_i is the velocity of particle i, x_i its current position, pbest_i its best position found so far, r a uniform random number in [0, 1], and c an acceleration coefficient.
The attribute whose particle has the highest velocity is taken as the selected attribute. If this value is lower than the threshold value, the attribute is rejected.
4. Calculate the accuracy of the selected subset of features. If the new resource has greater accuracy than the existing one, accept it as the current resource and update the memory.
5. Onlooker bees collect information about the fitness (i.e., accuracy) using the velocity of the particle of each resource. The onlooker bees then select a resource, become employed bees, and the process returns to step 3.
6. Update the velocity of each particle within each resource.
7. Find abandoned food sources and produce new scout bees: for each abandoned food source, a scout bee is created and a new food source is generated. The process restarts until all resources are explored.
The algorithm updates each particle and its velocity through the ABC search cycle. The proposed algorithm is implemented using the WEKA tool.
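The steps above can be sketched as a binary PSO with an ABC-style greedy acceptance and scout restart. This is a toy sketch under our own assumptions (the sigmoid transfer function, parameter names, and the toy merit are ours), not the authors' WEKA implementation:

```python
import math
import random

def hybrid_select(n_features, accuracy_of, n_particles=8, iters=40,
                  limit=6, seed=1):
    """Hybrid sketch: binary PSO proposes subsets via a sigmoid on particle
    velocities; an ABC-style greedy test keeps a candidate subset only if it
    improves accuracy, and stagnant particles are re-scouted at random."""
    rng = random.Random(seed)
    rand_bits = lambda: [rng.randint(0, 1) for _ in range(n_features)]

    pos = [rand_bits() for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # per-particle memory
    trials = [0] * n_particles
    gbest = max(pos, key=accuracy_of)[:]         # best subset seen so far

    for _ in range(iters):
        for i in range(n_particles):
            cand = []
            for d in range(n_features):          # PSO velocity update per bit
                vel[i][d] += (rng.random() * (pbest[i][d] - pos[i][d])
                              + rng.random() * (gbest[d] - pos[i][d]))
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))   # sigmoid -> bit
                cand.append(1 if rng.random() < prob else 0)
            if accuracy_of(cand) > accuracy_of(pos[i]):     # ABC greedy accept
                pos[i], trials[i] = cand, 0
                if accuracy_of(cand) > accuracy_of(pbest[i]):
                    pbest[i] = cand[:]
                if accuracy_of(cand) > accuracy_of(gbest):
                    gbest = cand[:]
            else:
                trials[i] += 1
            if trials[i] > limit:                # scout: random restart
                pos[i], vel[i], trials[i] = rand_bits(), [0.0] * n_features, 0
    return gbest

# Toy merit: reward features 1 and 3, penalize subset size.
toy = lambda bits: 5 * (bits[1] + bits[3]) - sum(bits)
print(hybrid_select(5, toy))
```

The division of labour mirrors the paper's description: the velocity update proposes subsets (PSO), while the greedy accuracy comparison and abandonment counter decide their fate (ABC).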
5. Data Set Description
In this research, the "Car Evaluation Data Set" from the UCI repository is used [6]. Table 1 describes the characteristics of the dataset. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX. Table 2 lists the attribute information of the cars dataset, and Table 3 shows the results of applying various algorithms to these attributes.
Table 1: Various characteristics of the data set [6]
Data Set Characteristics: Multivariate
Number of instances: 1728
Number of attributes: 6
Attribute characteristics: Categorical
Missing Values: No
Table 2: Attribute information of Cars Dataset
Buying: vhigh, high, med, low. (Buying price)
Maint: vhigh, high, med, low. (Price of the maintenance)
Doors: 2, 3, 4, 5more. (Number of doors)
Persons: 2, 4, more. (Persons capacity)
Lug_boot: small, med, big. (Size of the luggage boot)
Safety: low, med, high. (Estimated safety of the car)
Table 3: Results of Various Algorithms on Car Dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 1 (6) | 0.187
ABC Search | 2 (4, 6) | 0.172
Random Search | 3 (2, 4, 6) | 0.13
Next, we take the diabetes dataset, described in Table 4; the results of applying various algorithms to it are shown in Table 5. After that, we take the glass dataset, described in Table 6, with results shown in Table 7.
Table 4: Dataset Description of Diabetes
Dataset Name: Diabetes
Dataset Source: UCI Repository [6]
Total number of instances: 768
Number of Attributes: 9
Attribute Names:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Table 5: Results of Various Algorithms on Diabetes Dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 4 (2, 6, 7, 8) | 0.164
ABC Search | 4 (2, 6, 7, 8) | 0.1642
Random Search | 5 (2, 5, 6, 7, 8) | 0.154
Table 6: Dataset Description of Glass
Dataset Name: Glass
Dataset Source: UCI Repository [6]
Total number of instances: 214
Number of Attributes: 10
Attribute Information:
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass (class attribute):
    1 = building_windows_float_processed
    2 = building_windows_non_float_processed
    3 = vehicle_windows_float_processed
    4 = vehicle_windows_non_float_processed (none in this database)
    5 = containers
    6 = tableware
    7 = headlamps
Table 7: Results of Various Algorithms on Glass Dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 8 (1, 2, 3, 4, 6, 7, 8, 9) | 0.5113
ABC Search | 7 (1, 3, 4, 6, 7, 8, 9) | 0.5087
Random Search | 8 (1, 2, 3, 4, 6, 7, 8, 9) | 0.511
Two graphs are shown below. Figure 1 compares the merit of the different algorithms on the different datasets, and Figure 2 shows the number of attributes in each selected subset.
Figure 1: Merit comparison of different algorithms on different datasets

Figure 2: Number of attributes in the selected subset for different algorithms on different datasets
6. Conclusion
Data mining, as a multidisciplinary joint effort drawing on databases, machine learning, and statistics, is turning mountains of data into nuggets of knowledge. Feature selection is one of the important and frequently used techniques in data preprocessing for data mining. It reduces the number of features; removes irrelevant, redundant, or noisy data; and brings immediate benefits for applications: speeding up a data mining algorithm and improving mining performance such as predictive accuracy and result comprehensibility. In this research, we proposed a hybrid approach to feature selection using artificial bee colony and particle swarm optimization. PSO carries out the subset selection procedure, while the merit of each subset is calculated using ABC. The simulation is performed using three datasets taken from the UCI repository, and the merit obtained by the proposed algorithm compares favourably with the existing algorithms. The comparison is performed between the ABC search, the random search, and the proposed hybrid search. In future, the following work can be done: other swarm intelligence techniques can be applied to improve the merit, and the algorithm can be extended to rank the attributes.
References
[1] Selwyn Piramuthu et al., "Evaluating feature selection methods for learning in data mining applications", European Journal of Operational Research, 156 (2004), 483-494.
[2] Yuanning Liu et al., "An improved particle swarm optimization for feature selection", Journal of Bionic Engineering, 8(2), 191-200, 2011.
[3] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection", Journal of Machine Learning Research, 3, 1157-1182, 2003.
[4] D. Karaboga et al., "On the performance of artificial bee colony (ABC) algorithm", Applied Soft Computing, 8, 687-697, 2008.
[5] Mauricio Schiezaro et al., "Data feature selection based on Artificial Bee Colony algorithm", EURASIP Journal on Image and Video Processing, 2013.
[6] K. Bache and M. Lichman (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

More Related Content

What's hot

IRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET Journal
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestIJDKP
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
 
Feature Selection Algorithm for Supervised and Semisupervised Clustering
Feature Selection Algorithm for Supervised and Semisupervised ClusteringFeature Selection Algorithm for Supervised and Semisupervised Clustering
Feature Selection Algorithm for Supervised and Semisupervised ClusteringEditor IJCATR
 
Modified position update in spider monkey optimization algorithm
Modified position update in spider monkey optimization algorithmModified position update in spider monkey optimization algorithm
Modified position update in spider monkey optimization algorithmDr Sandeep Kumar Poonia
 
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodGene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
 
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
C LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHmC LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHm
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHmIJCI JOURNAL
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
A novel hybrid crossover based abc algorithm
A novel hybrid crossover based abc algorithmA novel hybrid crossover based abc algorithm
A novel hybrid crossover based abc algorithmDr Sandeep Kumar Poonia
 
Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...IOSR Journals
 
PPT file
PPT filePPT file
PPT filebutest
 
A New Extraction Optimization Approach to Frequent 2 Item sets
A New Extraction Optimization Approach to Frequent 2 Item setsA New Extraction Optimization Approach to Frequent 2 Item sets
A New Extraction Optimization Approach to Frequent 2 Item setsijcsa
 
research paper
research paperresearch paper
research paperKalyan Ram
 
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAEFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAIJCI JOURNAL
 
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUES
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUESGI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUES
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUESAM Publications
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 

What's hot (19)

RMABC
RMABCRMABC
RMABC
 
IRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence Chain
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forest
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 
Feature Selection Algorithm for Supervised and Semisupervised Clustering
Feature Selection Algorithm for Supervised and Semisupervised ClusteringFeature Selection Algorithm for Supervised and Semisupervised Clustering
Feature Selection Algorithm for Supervised and Semisupervised Clustering
 
G046024851
G046024851G046024851
G046024851
 
Modified position update in spider monkey optimization algorithm
Modified position update in spider monkey optimization algorithmModified position update in spider monkey optimization algorithm
Modified position update in spider monkey optimization algorithm
 
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based MethodGene Selection for Sample Classification in Microarray: Clustering Based Method
Gene Selection for Sample Classification in Microarray: Clustering Based Method
 
T180203125133
T180203125133T180203125133
T180203125133
 
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
C LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHmC LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHm
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
A novel hybrid crossover based abc algorithm
A novel hybrid crossover based abc algorithmA novel hybrid crossover based abc algorithm
A novel hybrid crossover based abc algorithm
 
Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...
 
PPT file
PPT filePPT file
PPT file
 
A New Extraction Optimization Approach to Frequent 2 Item sets
A New Extraction Optimization Approach to Frequent 2 Item setsA New Extraction Optimization Approach to Frequent 2 Item sets
A New Extraction Optimization Approach to Frequent 2 Item sets
 
research paper
research paperresearch paper
research paper
 
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAEFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
 
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUES
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUESGI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUES
GI-ANFIS APPROACH FOR ENVISAGE HEART ATTACK DISEASE USING DATA MINING TECHNIQUES
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 

Similar to Volume 14 issue 03 march 2014_ijcsms_march14_10_14_rahul

A hybrid approach from ant colony optimization and K-neare.pdf
A hybrid approach from ant colony optimization and K-neare.pdfA hybrid approach from ant colony optimization and K-neare.pdf
A hybrid approach from ant colony optimization and K-neare.pdfRayhanaKarar
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
 
Frequent Item Set Mining - A Review
Frequent Item Set Mining - A ReviewFrequent Item Set Mining - A Review
Frequent Item Set Mining - A Reviewijsrd.com
 
Comparative study of_hybrids_of_artificial_bee_colony_algorithm
Comparative study of_hybrids_of_artificial_bee_colony_algorithmComparative study of_hybrids_of_artificial_bee_colony_algorithm
Comparative study of_hybrids_of_artificial_bee_colony_algorithmDr Sandeep Kumar Poonia
 
Trust Enhanced Role Based Access Control Using Genetic Algorithm
Trust Enhanced Role Based Access Control Using Genetic Algorithm Trust Enhanced Role Based Access Control Using Genetic Algorithm
Trust Enhanced Role Based Access Control Using Genetic Algorithm IJECEIAES
 
Artificial bee colony with fcm for data clustering
Artificial bee colony with fcm for data clusteringArtificial bee colony with fcm for data clustering
Artificial bee colony with fcm for data clusteringAlie Banyuripan
 
Demand Value Identification using Improved Vector Analysis
Demand Value Identification using Improved Vector AnalysisDemand Value Identification using Improved Vector Analysis
Demand Value Identification using Improved Vector AnalysisIRJET Journal
 
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...ijaia
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection:  Comparative StudyA Threshold Fuzzy Entropy Based Feature Selection:  Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection: Comparative StudyIJMER
 
Classification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining TechniquesClassification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining Techniquesinventionjournals
 
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u...
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u...SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u...
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u...cscpconf
 
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimizationAlexander Decker
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
 
Booster in High Dimensional Data Classification
Booster in High Dimensional Data ClassificationBooster in High Dimensional Data Classification
Booster in High Dimensional Data Classificationrahulmonikasharma
 
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
Unsupervised Feature Selection Based on the Distribution of Features Attribut...Unsupervised Feature Selection Based on the Distribution of Features Attribut...
Unsupervised Feature Selection Based on the Distribution of Features Attribut...Waqas Tariq
 

Similar to Volume 14 issue 03 march 2014_ijcsms_march14_10_14_rahul (20)

A hybrid approach from ant colony optimization and K-neare.pdf
A hybrid approach from ant colony optimization and K-neare.pdfA hybrid approach from ant colony optimization and K-neare.pdf
A hybrid approach from ant colony optimization and K-neare.pdf
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 
T0 numt qxodc=
T0 numt qxodc=T0 numt qxodc=
T0 numt qxodc=
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...
 
Frequent Item Set Mining - A Review
Frequent Item Set Mining - A ReviewFrequent Item Set Mining - A Review
Frequent Item Set Mining - A Review
 
Comparative study of_hybrids_of_artificial_bee_colony_algorithm
Comparative study of_hybrids_of_artificial_bee_colony_algorithmComparative study of_hybrids_of_artificial_bee_colony_algorithm
Comparative study of_hybrids_of_artificial_bee_colony_algorithm
 
Trust Enhanced Role Based Access Control Using Genetic Algorithm
IJCSMS International Journal of Computer Science & Management Studies, Vol. 14, Issue 03, March 2014
An Indexed and Referred Journal
ISSN (Online): 2231-5268
www.ijcsms.com

Hybrid Approach for Feature Subset Selection

Rahul Kaushik1, Bright Keswani2

1 M.Tech. Scholar, Gyan Vihar University, Jaipur
Rahul.naresh108@gmail.com

2 Associate Professor, Gyan Vihar University, Jaipur
kbright@rediffmail.com

Abstract

The rapid advance of computer technologies in data processing, collection, and storage has provided unparalleled opportunities to expand capabilities in production, services, communications, and research. However, immense quantities of high-dimensional data renew the challenges to state-of-the-art data mining techniques. Artificial Bee Colony (ABC) is a popular meta-heuristic search algorithm used to solve numerous combinatorial optimization problems. Feature Selection (FS) helps to speed up classification by extracting the relevant and useful information from the dataset; because choosing the appropriate feature subset is critical, FS is treated as an optimization problem. A classifier ensemble is a common remedy for the accuracy shortfall of a single classifier. We first briefly introduce the concept of data mining and the key components of feature selection. This paper then proposes a hybrid approach to feature selection using Artificial Bee Colony and Particle Swarm Optimization. The proposed technique is simulated in WEKA, and the results show its improved performance.

Keywords: Data Mining, Feature Subset Selection, Swarm Intelligence, ABC, PSO.

1. Introduction

Data mining is the process of finding patterns and relations in large databases.
Data mining is especially advantageous in high-volume, frequently changing data, such as in financial application areas. Its primary purpose is to extract information from huge amounts of raw data. Statistical methods as well as machine learning methods, such as induced decision trees and neural networks, have been used for this purpose with good results. The actual mining, or extraction of patterns from the data, requires the data to be clean, since the input data are the primary, if not the only, source of knowledge in these systems. Cleaning and preprocessing data involves several steps, including procedures for handling incomplete, noisy, or missing data; sampling of appropriate data; feature selection; feature construction; and formatting the data as per the representational requirements of the methods (e.g., decision trees, neural networks) used to extract knowledge from these data [1].

2. Feature Subset Selection

Feature selection is one of the most important factors influencing the classification accuracy rate. If the dataset contains a large number of features, the dimension of the feature space will be large and noisy, degrading the classification accuracy rate. An efficient and robust feature selection method can eliminate noisy, irrelevant, and redundant data [2].

Feature subset selection algorithms can be categorized into two types: filter algorithms and wrapper algorithms. Filter algorithms select the feature subset before the application of any classification algorithm, removing the less important features from the subset. Wrapper methods define the learning algorithm, the performance criteria, and the search strategy; the learning algorithm searches for the subset using the training data and the performance of the current subset [3].

In the feature subset selection problem, the prediction accuracy of the selected subset depends on the size of the subset as well as on which features are selected.
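The filter/wrapper distinction above can be sketched on a toy problem. In this illustrative example (the dataset, the mean-difference filter score, and the 1-NN wrapper classifier are our own choices, not methods from the paper), the filter scores each feature without any classifier, while the wrapper evaluates whole subsets with the classifier itself:

```python
import itertools
import random

random.seed(0)

# Toy dataset: feature 0 separates the two classes, features 1 and 2 are noise.
X = [[random.gauss(cls, 0.5), random.random(), random.random()]
     for cls in (0, 1) for _ in range(30)]
y = [cls for cls in (0, 1) for _ in range(30)]

def loo_accuracy(cols, X, y):
    """Wrapper criterion: leave-one-out 1-NN accuracy on the chosen columns."""
    correct = 0
    for i in range(len(X)):
        dists = [(sum((X[i][c] - X[j][c]) ** 2 for c in cols), y[j])
                 for j in range(len(X)) if j != i]
        correct += min(dists)[1] == y[i]
    return correct / len(X)

def filter_scores(X, y):
    """Filter criterion: score each feature alone, with no classifier involved
    (here, the absolute difference of the per-class means)."""
    scores = []
    for c in range(len(X[0])):
        m0 = sum(x[c] for x, t in zip(X, y) if t == 0) / y.count(0)
        m1 = sum(x[c] for x, t in zip(X, y) if t == 1) / y.count(1)
        scores.append(abs(m1 - m0))
    return scores

def wrapper_best(X, y):
    """Wrapper search: evaluate every non-empty subset with the classifier itself."""
    cols = range(len(X[0]))
    subsets = [s for r in range(1, len(X[0]) + 1)
               for s in itertools.combinations(cols, r)]
    return max(subsets, key=lambda s: loo_accuracy(s, X, y))

print(filter_scores(X, y))  # feature 0 scores far above the noise features
print(wrapper_best(X, y))   # the informative feature 0 appears in the best subset
```

Note that the wrapper enumerates all 2^N - 1 subsets, which is exactly why, for realistic N, the NP-hard search discussed next needs heuristics such as ABC and PSO.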
Unfortunately, the prediction accuracy is not a monotonic function of the feature subset with respect to set inclusion. Furthermore, in many practical applications, the number of features in the original set ranges from medium size (in the hundreds) to large-scale instances (in the tens or hundreds of thousands). Accordingly, the subset selection problem is an NP-hard combinatorial problem and requires efficient solution algorithms.

3. Swarm Intelligence

The term swarm is used in a general manner to refer to any restrained collection of interacting agents or individuals. The classical example of a swarm is bees swarming around their hive; nevertheless, the metaphor extends easily to other systems with a similar architecture: an ant colony can be thought of as a swarm whose individual agents are ants, a flock of birds is a swarm of birds, an immune system is a swarm of cells and molecules, and a crowd is a swarm of people. The Particle Swarm Optimization (PSO) algorithm models the social behaviour of bird flocking or fish schooling.

3.1 Artificial Bee Colony (ABC) Algorithm

In the ABC algorithm, the colony of artificial bees contains three groups of bees: employed bees, onlookers, and scouts. The first half of the colony consists of the employed artificial bees and the second half of the onlookers. For every food source there is exactly one employed bee; in other words, the number of employed bees equals the number of food sources. The employed bee of an abandoned food source becomes a scout. The search carried out by the artificial bees can be summarized as follows [4]:

• Employed bees determine a food source within the neighbourhood of the food source in their memory.
• Employed bees share their information with onlookers within the hive, and the onlookers then select one of the food sources.
• Onlookers select a food source within the neighbourhood of the food sources chosen by them.
• An employed bee whose source has been abandoned becomes a scout and starts to search for a new food source randomly.
3.1.1 ABC Feature Selection

Unlike optimization problems whose candidate solutions can be represented by vectors of real values, candidate solutions to the feature selection problem are represented by bit vectors. Each food source is associated with a bit vector of size N, where N is the total number of features, and each position in the vector corresponds to one feature. If the value at a position is 1, the feature is part of the subset to be evaluated; if the value is 0, it is not [5]. Additionally, each food source stores its quality (fitness), which is given by the accuracy of the classifier using the feature subset indicated by the bit vector.

4. Proposed Work

In the proposed work, we have developed a hybrid approach to feature selection using Artificial Bee Colony and Particle Swarm Optimization. The food sources in ABC are the attributes of the dataset. The subset selection procedure is carried out by PSO, while the merit of that subset is calculated by ABC. In other words, PSO selects a subset from the features available in the dataset, and ABC checks the accuracy of that subset. The process is explained by the following algorithm and flowchart.

4.1 Proposed Algorithm

1. The algorithm is initialized with N resources, where N is the total number of attributes in the dataset. Each resource is initialized with N particles. The resource (i.e., attribute) with the largest fitness will be elected in the result.
2. The resources are classified using accuracy as fitness. The particle of the selected feature is used for classification.
3. Employed bees determine the neighbours of the chosen resources, and the velocity of the corresponding particle within the resources is updated. Each employed bee visits a resource and explores its neighbourhood.
The attribute selection process is carried out by the particle velocity of the attribute:

v_i = v_i + c · r · (p_i − x_i)

where v_i is the velocity of the particle for attribute i, x_i its current position, p_i its best-known position, c an acceleration constant, and r a random number in [0, 1].
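A minimal sketch of this update and the threshold test that follows it (the acceleration constant, the threshold of 0.5, and the random initialization are illustrative assumptions, not values given in the paper): thresholding the updated velocities yields the 0/1 feature mask of Section 3.1.1.

```python
import random

random.seed(2)

N = 6                     # number of attributes in the dataset
C = 2.0                   # acceleration constant (assumed value)
THRESHOLD = 0.5           # rejection threshold (assumed value)

velocity = [random.uniform(0, 1) for _ in range(N)]
position = [random.uniform(0, 1) for _ in range(N)]
best_pos = [random.uniform(0, 1) for _ in range(N)]

# v = v + c * r * (p - x): pull each attribute's velocity
# toward its best-known position, as in the update rule above.
velocity = [v + C * random.random() * (p - x)
            for v, x, p in zip(velocity, position, best_pos)]

# Threshold into the bit vector of Section 3.1.1:
# 1 = attribute kept, 0 = attribute rejected.
subset = [1 if v >= THRESHOLD else 0 for v in velocity]
print(subset)
```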
The attribute with the highest velocity is selected; if this value is lower than the threshold value, the attribute is rejected.
4. Calculate the accuracy of the elected subset of features. If the new resource has greater accuracy than the existing one, consider it as a resource and update the memory.
5. Onlooker bees collect information about the fitness (i.e., accuracy) using the velocity of the particle of that resource. An onlooker bee then selects a resource and becomes an employed bee. Go to step 3.
6. Update the velocity of each particle within each resource.
7. Find abandoned food sources and produce new scout bees: for each abandoned food source, a scout bee is created and a new food source is generated. The process restarts until all resources are explored.

The algorithm uses the particle and its velocity, updated by the ABC algorithm. The proposed algorithm is implemented using the WEKA tool.

5. Data Set Description

In this research, the "Car Evaluation Data Set" is used; it is taken from the UCI repository [6]. Table 1 describes the characteristics of the dataset. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX. Table 2 explains the attribute information of the Cars dataset, and Table 3 shows the results of applying various algorithms to the attributes of Table 2.

Table 1: Various characteristics of the data set [6]

Data Set Characteristics: Multivariate
Number of instances: 1728
Number of attributes: 6
Attribute characteristics: Categorical
Missing values? No

Table 2: Attribute information of the Cars dataset

Buying (buying price): vhigh, high, med, low
Maint (price of the maintenance): vhigh, high, med, low
Doors (number of doors): 2, 3, 4, 5more
Persons (persons capacity): 2, 4, more
Lug_boot (size of the luggage boot): small, med, big
Safety (estimated safety of the car): low, med, high

Table 3: Results of various algorithms on the Cars dataset

Algorithm Name | Selected Attributes | Merit
Hybrid Search | 1 (6) | 0.187
ABC Search | 2 (4,6) | 0.172
Random Search | 3 (2,4,6) | 0.13

Next we take the Diabetes dataset, described in Table 4; the results of applying various algorithms to it are shown in Table 5. After that we take the Glass dataset, described in Table 6, with results in Table 7.

Table 4: Dataset description of the Diabetes dataset

Dataset Name: Diabetes
Dataset Source: UCI Repository [6]
Total number of instances: 768
Number of attributes: 9
Attribute names:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

Table 5: Results of various algorithms on the Diabetes dataset

Algorithm Name | Selected Attributes | Merit
Hybrid Search | 4 (2,6,7,8) | 0.164
ABC Search | 4 (2,6,7,8) | 0.1642
Random Search | 5 (2,5,6,7,8) | 0.154

Table 6: Dataset description of the Glass dataset

Dataset Name: Glass
Dataset Source: UCI Repository [6]
Total number of instances: 214
Number of attributes: 10

Attribute information:
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass (class attribute): 1 building_windows_float_processed; 2 building_windows_non_float_processed; 3 vehicle_windows_float_processed; 4 vehicle_windows_non_float_processed (none in this database); 5 containers; 6 tableware; 7 headlamps

Table 7: Results of various algorithms on the Glass dataset

Algorithm Name | Selected Attributes | Merit
Hybrid Search | 8 (1,2,3,4,6,7,8,9) | 0.5113
ABC Search | 7 (1,3,4,6,7,8,9) | 0.5087
Random Search | 8 (1,2,3,4,6,7,8,9) | 0.511

Two graphs are shown below: Figure 1 compares the merit of the different algorithms on the different datasets, and Figure 2 shows the number of attributes in each selected subset.
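The "Merit" values reported in these tables are subset quality scores produced by WEKA's attribute evaluation. Assuming they come from the standard correlation-based (CFS) subset evaluator, a point the paper does not state explicitly, the score for a k-feature subset can be sketched as follows (the correlation values passed in are illustrative):

```python
from math import sqrt

def cfs_merit(feat_class_corrs, mean_feat_feat_corr):
    """Correlation-based subset merit (assumed CFS formula):
    merit = k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff)),
    where r_cf are feature-class correlations and r_ff is the
    average feature-feature inter-correlation of the subset."""
    k = len(feat_class_corrs)
    r_cf = sum(feat_class_corrs) / k
    return k * r_cf / sqrt(k + k * (k - 1) * mean_feat_feat_corr)

# A single-feature subset: merit equals that feature's class correlation.
print(cfs_merit([0.187], 0.0))  # 0.187

# A redundant pair (high inter-correlation) scores lower than an
# uncorrelated pair with the same class correlations.
print(cfs_merit([0.4, 0.4], 0.9))
print(cfs_merit([0.4, 0.4], 0.0))
```

This explains why a small subset such as Hybrid Search's single attribute on the Cars dataset can out-score larger subsets: adding correlated features inflates the denominator faster than the numerator.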
Figure 1: Merit comparison of the different algorithms on the different datasets

Figure 2: Number of attributes in the selected subset for each algorithm on each dataset

6. Conclusion

Data mining, a multidisciplinary joint effort from databases, machine learning, and statistics, is championing the turning of mountains of data into nuggets. Feature selection is one of the important and frequently used techniques in data preprocessing for data mining. It reduces the number of features; removes irrelevant, redundant, or noisy data; and brings immediate benefits for applications: speeding up a data mining algorithm and improving mining performance such as predictive accuracy and result comprehensibility. In this research, we proposed a hybrid approach to feature selection using Artificial Bee Colony and Particle Swarm Optimization: PSO carries out the subset selection procedure, while the subset is evaluated by ABC. The simulation is performed using three datasets taken from the UCI repository, comparing the proposed Hybrid search against ABC search and Random search. On all datasets, the merit of the proposed algorithm is better than that of the existing algorithms. In future work, other swarm intelligence techniques can be applied to improve the merit, and the algorithm can be extended to rank the attributes.

References

[1] Selwyn Piramuthu et al., "Evaluating feature selection methods for learning in data mining applications", European Journal of Operational Research, 156 (2004), 483-494.
[2] Yuanning Liu et al., "An improved particle swarm optimization for feature selection", Journal of Bionic Engineering, 8(2), 191-200, 2011.
[3] Guyon, I. and Elisseeff, A., "An introduction to variable and feature selection", Journal of Machine Learning Research, 3, 1157-1182, 2003.
[4] D. Karaboga et al., "On the performance of artificial bee colony (ABC) algorithm", Applied Soft Computing, 8, 687-697, 2008.
[5] Schiezaro, M. and Pedrini, H., "Data feature selection based on Artificial Bee Colony algorithm", EURASIP Journal on Image and Video Processing, 2013.
[6] Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.