IJCSMS International Journal of Computer Science & Management Studies, Vol. 14, Issue 03, March 2014
An Indexed and Referred Journal, ISSN (Online): 2231-5268, www.ijcsms.com
Hybrid Approach for Feature Subset Selection
Rahul Kaushik¹, Bright Keswani²
¹ M.Tech. Scholar, Gyan Vihar University, Jaipur, Rahul.naresh108@gmail.com
² Associate Professor, Gyan Vihar University, Jaipur, kbright@rediffmail.com
Abstract
Rapid advances in computer technologies for data processing, collection, and storage have provided unparalleled opportunities to expand capabilities in production, services, communications, and research. However, immense quantities of high-dimensional data renew the challenges for state-of-the-art data mining techniques. Artificial Bee Colony (ABC) is a popular meta-heuristic search algorithm used to solve numerous combinatorial optimization problems. Feature Selection (FS) helps to speed up classification by extracting the relevant and useful information from the dataset. FS is treated as an optimization problem because selecting an appropriate feature subset is critical. Classifier ensembles can compensate for the accuracy limitations of a single classifier. We first briefly introduce the concept of data mining and the key components of feature selection. This paper proposes a hybrid approach to feature selection using artificial bee colony and particle swarm optimization. The proposed technique is simulated using WEKA, and the results show the better performance of the proposed technique.
Keywords: Data Mining, Feature Subset Selection,
Swarm Intelligence, ABC, PSO.
1. Introduction
Data mining is the process of finding patterns and relations in large databases. It is especially advantageous for high-volume, frequently changing data, such as in financial application areas. The primary purpose of data mining is to extract information from huge amounts of raw data. Statistical methods as well as machine learning methods, such as induced decision trees and neural networks, have been used for this purpose with good results. The actual mining or extraction of patterns requires the data to be clean, since input data are the primary, if not the only, source of knowledge in these systems. Cleaning and preprocessing data involves several steps, including procedures for handling incomplete, noisy, or missing data; sampling of appropriate data; feature selection; feature construction; and formatting the data as per the representational requirements of the methods (e.g., decision trees, neural networks) used to extract knowledge from these data [1].
2. Feature Subset Selection
Feature selection is one of the most important factors influencing the classification accuracy rate. If the dataset contains a large number of features, the dimension of the feature space will be large and noisy, degrading the classification accuracy rate. An efficient and robust feature selection method can eliminate noisy, irrelevant, and redundant data [2]. Feature subset selection algorithms can be categorized into two types: filter algorithms and wrapper algorithms. Filter algorithms select the feature subset before the application of any classification algorithm and remove the less important features from the subset. Wrapper methods define the learning algorithm, the performance criteria, and the search strategy; the learning algorithm searches for the subset using the training data and the performance of the current subset [3].
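To make the wrapper idea concrete, here is a minimal sketch in Python of how a wrapper scores one candidate subset by the cross-validated accuracy of the learning algorithm; the use of scikit-learn and a decision tree is our assumption for illustration (the experiments in this paper use WEKA):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, subset):
    """Wrapper-style evaluation: score a candidate feature subset by the
    cross-validated accuracy of the chosen learning algorithm."""
    if not subset:                        # an empty subset cannot be scored
        return 0.0
    X_sub = X[:, sorted(subset)]          # keep only the selected columns
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X_sub, y, cv=5).mean()

# Toy usage: only feature 5 carries the class signal, so the wrapper
# scores the subset {5} far above the irrelevant subset {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 5] > 0).astype(int)
print(evaluate_subset(X, y, {5}))       # close to 1.0
print(evaluate_subset(X, y, {0, 1}))    # close to 0.5 (chance level)
```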
In the feature subset selection problem, the prediction accuracy of the selected subset depends on the size of the subset as well as on which features are selected. Unfortunately, prediction accuracy is not a monotonic function of the feature subset with respect to set inclusion. Furthermore, in many practical applications, the number of features in the original set ranges from medium-sized (hundreds) to large-scale instances (tens or hundreds of thousands).
Accordingly, since a set of N features has 2^N possible subsets, exhaustive search quickly becomes infeasible, and the subset selection problem is an NP-hard combinatorial problem that requires efficient solution algorithms.
3. Swarm Intelligence
The term swarm is used in a general manner to refer to any restrained collection of interacting agents or individuals. The classical example of a swarm is bees swarming around their hive; nevertheless, the metaphor can easily be extended to other systems with a similar architecture. An ant colony can be thought of as a swarm whose individual agents are ants; similarly, a flock of birds is a swarm of birds. An immune system is a swarm of cells and molecules, just as a crowd is a swarm of people. The Particle Swarm Optimization (PSO) algorithm models the social behaviour of bird flocking or fish schooling.
3.1 Artificial Bee Colony (ABC) Algorithm
In the ABC algorithm, the colony of artificial bees contains three groups of bees: employed bees, onlookers, and scouts. The first half of the colony consists of the employed artificial bees and the second half consists of the onlookers. For every food source, there is only one employed bee; in other words, the number of employed bees is equal to the number of food sources. The employed bee of an abandoned food source becomes a scout. The search carried out by the artificial bees can be summarized as follows [4]:
• Employed bees determine a food source
within the neighbourhood of the food source
in their memory.
• Employed bees share their information with
onlookers within the hive and then the
onlookers select one of the food sources.
• Onlookers select a food source within the
neighbourhood of the food sources chosen
by them.
An employed bee whose food source has been abandoned becomes a scout and starts to search for a new food source randomly.
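As an illustration only (not this paper's implementation), the three phases can be sketched in Python for a generic fitness function, assumed here to be non-negative so that onlookers can select sources with probability proportional to fitness:

```python
import random

def abc_search(fitness, dim, n_sources=10, limit=20, max_iter=100):
    """Skeleton of the ABC phases: employed bees, onlooker bees, scouts.
    `fitness` maps a real-valued vector to a non-negative score to maximise."""
    new_source = lambda: [random.uniform(-1, 1) for _ in range(dim)]
    sources = [new_source() for _ in range(n_sources)]
    trials = [0] * n_sources                    # stagnation counters

    def try_improve(i):
        # Probe a neighbour of source i by perturbing one dimension
        # relative to another randomly chosen source.
        j = random.randrange(dim)
        cand = sources[i][:]
        cand[j] += random.uniform(-1, 1) * (cand[j] - random.choice(sources)[j])
        if fitness(cand) > fitness(sources[i]): # greedy replacement
            sources[i], trials[i] = cand, 0
        else:
            trials[i] += 1

    for _ in range(max_iter):
        for i in range(n_sources):              # employed-bee phase
            try_improve(i)
        total = sum(fitness(s) for s in sources)
        if total > 0:
            for _ in range(n_sources):          # onlooker-bee phase
                r, acc = random.uniform(0, total), 0.0
                for i, s in enumerate(sources):
                    acc += fitness(s)
                    if acc >= r:                # roulette-wheel choice
                        try_improve(i)
                        break
        for i in range(n_sources):              # scout phase
            if trials[i] > limit:               # source is abandoned
                sources[i], trials[i] = new_source(), 0
    return max(sources, key=fitness)
```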
3.1.1 ABC Feature Selection
Unlike optimization problems in which the possible solutions can be represented by vectors of real values, candidate solutions to the feature selection problem are represented by bit vectors. Each food source is associated with a bit vector of size N, where N is the total number of features. Each position in the vector corresponds to a feature to be evaluated: if the value at that position is 1, the feature is part of the subset to be evaluated; if the value is 0, the feature is not part of the subset [5]. Additionally, each food source stores its quality (fitness), which is given by the accuracy of the classifier using the feature subset indicated by the bit vector.
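A minimal sketch of this encoding, assuming a scikit-learn classifier as the fitness oracle (our assumption; the paper's experiments use WEKA), could read:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def random_food_source(n_features, rng):
    """A food source is a bit vector of size N: position i is 1 iff
    feature i belongs to the candidate subset."""
    return rng.integers(0, 2, size=n_features)

def source_fitness(bits, X, y):
    """Fitness of a food source: cross-validated accuracy of the
    classifier trained on the features whose bits are set to 1."""
    selected = np.flatnonzero(bits)
    if selected.size == 0:                # an empty subset has no fitness
        return 0.0
    return cross_val_score(GaussianNB(), X[:, selected], y, cv=5).mean()
```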
4. Proposed Work
In the proposed work, we have developed a hybrid approach to feature selection using artificial bee colony and particle swarm optimization. The food sources in ABC are the attributes of the dataset. The subset selection procedure is carried out by PSO, while the merit of each subset is calculated by ABC. In other words, PSO selects a subset from the features available in the dataset, and ABC checks the accuracy of that subset. The process is explained by the following algorithm.
4.1 Proposed Algorithm
1. The algorithm is initialized with N resources, where N is the total number of attributes in the dataset. Each resource is initialized with N particles. The resource, i.e. the attribute, with the largest fitness will be elected in the result.
2. The resources are classified using accuracy as fitness. The particle of the selected feature is used for classification.
3. Employed bees determine the neighbours of the chosen resources, and the velocity of the corresponding particle within the resource is updated. Each employed bee visits a resource and explores its neighbourhood. The attribute selection process is driven by the particle velocity of the attribute, updated as
$$v_i = v_i + c \cdot r \cdot (p_i - x_i)$$
where $v_i$ and $x_i$ are the velocity and position of the particle associated with resource $i$, $p_i$ is the best position stored in the resource's memory, $c$ is an acceleration coefficient, and $r$ is a random number in $[0, 1]$.
The attribute whose particle has the highest velocity is taken as the selected attribute; if this value is lower than the threshold value, the attribute is rejected.
4. Calculate the accuracy of the elected subset of features. If the new resource has greater accuracy than the existing one, accept it as the resource and update the memory.
5. Onlooker bees collect information about the fitness, i.e. the accuracy, using the velocity of the particle of that resource. The onlooker bees then select a resource and become employed bees. Go to step 3.
6. Update the velocity of each particle within each resource.
7. Find abandoned food sources and produce new scout bees: for each abandoned food source, a scout bee is created and a new food source is generated. Start the process again until all resources are explored.
The algorithm uses PSO particles whose positions and velocities are updated within the ABC framework. The proposed algorithm is implemented using the WEKA tool.
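Since the paper gives only the outline above, the following is our own rough sketch of how steps 1 through 7 could fit together; the sigmoid mapping from velocities to bits, the threshold value, and all variable names are assumptions made for illustration:

```python
import numpy as np

def hybrid_abc_pso(fitness, n_features, threshold=0.5,
                   c=2.0, limit=10, max_iter=50, seed=0):
    """Sketch of the hybrid: PSO velocity updates propose feature subsets
    (step 3), while ABC-style greedy replacement (step 4) and scouts
    (step 7) manage the resources. `fitness` maps a 0/1 feature mask to
    classification accuracy, e.g. source_fitness from Section 3.1.1."""
    rng = np.random.default_rng(seed)
    n_sources = n_features                     # step 1: N resources
    pos = rng.random((n_sources, n_features))  # particle positions
    vel = rng.random((n_sources, n_features))  # particle velocities
    best_pos = pos.copy()                      # per-resource memory
    trials = np.zeros(n_sources, dtype=int)

    def mask(v):
        # An attribute is selected when the sigmoid of its particle's
        # velocity exceeds the threshold; otherwise it is rejected.
        return (1.0 / (1.0 + np.exp(-v)) > threshold).astype(int)

    fit = np.array([fitness(mask(vel[i])) for i in range(n_sources)])

    for _ in range(max_iter):
        for i in range(n_sources):             # employed/onlooker phases
            r = rng.random(n_features)
            vel[i] = vel[i] + c * r * (best_pos[i] - pos[i])  # step 3
            pos[i] = pos[i] + vel[i]
            new_fit = fitness(mask(vel[i]))
            if new_fit > fit[i]:               # step 4: keep the better source
                fit[i], best_pos[i], trials[i] = new_fit, pos[i].copy(), 0
            else:
                trials[i] += 1
        for i in range(n_sources):             # step 7: scouts replace
            if trials[i] > limit:              # abandoned resources
                pos[i] = rng.random(n_features)
                vel[i] = rng.random(n_features)
                fit[i], trials[i] = fitness(mask(vel[i])), 0

    best = int(np.argmax(fit))                 # the subset with the
    return mask(vel[best]), float(fit[best])   # largest fitness (step 1)
```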
5. Data Set Description
In this research, the “Car Evaluation Data Set” from the UCI repository is used [6]. Table 1 describes the characteristics of the dataset. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX. Table 2 explains the attribute information of the cars dataset, and Table 3 shows the results of applying various algorithms to the attributes of Table 2.
Table 1: Various characteristics of the data set [6]
Data Set Characteristics: Multivariate
Number of Instances: 1728
Number of Attributes: 6
Attribute Characteristics: Categorical
Missing Values: No
Table 2: Attribute information of the Cars Dataset
Buying (vhigh, high, med, low): Buying price
Maint (vhigh, high, med, low): Price of the maintenance
Doors (2, 3, 4, 5more): Number of doors
Persons (2, 4, more): Person capacity
Lug_boot (small, med, big): Size of the luggage boot
Safety (low, med, high): Estimated safety of the car
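As a convenience (our addition, not part of the paper), the dataset can be loaded directly from the repository with pandas; the column names follow Table 2, and the URL is the usual location of the car data file, which we assume is still current:

```python
import pandas as pd

# Column names follow Table 2; the last column is the evaluation class.
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
cars = pd.read_csv(url, header=None, names=cols)
print(cars.shape)   # expected (1728, 7), matching Table 1
```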
Table 3: Results of various algorithms on the Car dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 1 (6) | 0.187
ABC Search | 2 (4, 6) | 0.172
Random Search | 3 (2, 4, 6) | 0.13
Next we take the diabetes dataset, which is described in Table 4; the results of applying various algorithms to it are shown in Table 5. We then take the glass dataset, described in Table 6, with the corresponding results in Table 7.
Table 4: Dataset description of Diabetes
Dataset Name: Diabetes
Dataset Source: UCI Repository [6]
Total Number of Instances: 768
Number of Attributes: 9
Attribute Names:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg / (height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Table 5: Results of various algorithms on the Diabetes dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 4 (2, 6, 7, 8) | 0.164
ABC Search | 4 (2, 6, 7, 8) | 0.1642
Random Search | 5 (2, 5, 6, 7, 8) | 0.154
Table 6: Dataset description of Glass
Dataset Name: Glass
Dataset Source: UCI Repository [6]
Total Number of Instances: 214
Number of Attributes: 10
Attribute Information:
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit of measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass (class attribute):
  1 = building_windows_float_processed
  2 = building_windows_non_float_processed
  3 = vehicle_windows_float_processed
  4 = vehicle_windows_non_float_processed (none in this database)
  5 = containers
  6 = tableware
  7 = headlamps
Table 7: Results of various algorithms on the Glass dataset
Algorithm Name | Selected Attributes | Merit
Hybrid Search | 8 (1, 2, 3, 4, 6, 7, 8, 9) | 0.5113
ABC Search | 7 (1, 3, 4, 6, 7, 8, 9) | 0.5087
Random Search | 8 (1, 2, 3, 4, 6, 7, 8, 9) | 0.511
Two graphs are shown below. Figure 1 compares the merit of the different algorithms on the different datasets, and Figure 2 shows the number of attributes in the selected subsets.
Figure 1: Merit comparison of different algorithms on different datasets
Figure 2: Number of attributes in the selected subset of different algorithms on different datasets
6. Conclusion
Data mining, as a multidisciplinary joint effort of databases, machine learning, and statistics, is championing the effort to turn mountains of data into nuggets of knowledge. Feature selection is one of the important and frequently used techniques in data preprocessing for data mining. It reduces the number of features; removes irrelevant, redundant, or noisy data; and brings immediate benefits for applications: speeding up a data mining algorithm and improving mining performance, such as predictive accuracy and result comprehensibility. In this research, we proposed a hybrid approach to feature selection using the concepts of artificial bee colony and particle swarm optimization: PSO carries out the subset selection procedure, while the merit of each subset is calculated by ABC. The simulation was performed using three datasets taken from the UCI repository. The merit obtained by the proposed algorithm is better than that of the existing algorithms for all datasets; the comparison was performed between the ABC search, the random search, and the proposed hybrid search. In future, the following work can be done: other swarm intelligence techniques can be applied to improve the merit, and the algorithm can be extended for ranking the attributes.
References
[1] Selwyn Piramuthu et al., “Evaluating feature selection methods for learning in data mining applications”, European Journal of Operational Research, 156 (2004), 483–494.
[2] Yuanning Liu et al., “An improved particle swarm optimization for feature selection”, Journal of Bionic Engineering, 8(2), 191–200, 2011.
[3] Guyon I., Elisseeff A., “An introduction to variable and feature selection”, Journal of Machine Learning Research, 3, 1157–1182, 2003.
[4] D. Karaboga et al., “On the performance of artificial bee colony (ABC) algorithm”, Applied Soft Computing, 8, 687–697, 2008.
[5] Mauricio Schiezaro and Helio Pedrini, “Data feature selection based on Artificial Bee Colony algorithm”, EURASIP Journal on Image and Video Processing, 2013.
[6] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.