ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C... (ijcsit)
Data mining is indispensable for business organizations: it extracts useful information from huge volumes of stored data, which can be used in managerial decision making to survive the competition. Due to day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional, while data mining works better on small datasets than on high-dimensional ones. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations of feature selection should be the same, or similar, even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic for the research community. Selection stability has been quantified by various measures. This paper analyses the selection of a suitable search method and stability measure for feature selection algorithms, and also the influence of the characteristics of the dataset, since the choice of the best approach is highly problem dependent.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana... (IJRES Journal)
Data clustering is a common technique for statistical data analysis; it is defined as a class of statistical techniques for classifying a set of observations into distinct groups. Cluster analysis seeks to minimize within-group variance and maximize between-group variance. In this study we formulate a mathematical programming model that chooses the most important variables in cluster analysis. A nonlinear binary model is suggested to select the most important variables in clustering a set of data. The idea of the suggested model is to cluster data by minimizing the distance between observations within groups. Indicator variables are used to select the most important variables in the cluster analysis.
Prediction model of algal blooms using logistic regression and confusion matrix (IJECEIAES)
Algal blooms data are collected and refined as experimental data for algal bloom prediction. The refined algal blooms dataset is analyzed by logistic regression, and statistical tests and regularization are performed to find the marine environmental factors affecting algal blooms. The predicted value of an algal bloom is obtained through logistic regression analysis using those factors. The actual and predicted values of the algal blooms dataset are applied to a confusion matrix. By improving the decision boundary of the existing logistic regression, the accuracy, sensitivity, and precision of algal bloom prediction are improved. In this paper, the algal bloom prediction model is established by an ensemble method using logistic regression and a confusion matrix. Algal bloom prediction is improved, and this is verified through big data analysis.
Constructing a classification model is important in machine learning for a particular task. A classification process involves assigning objects to predefined groups or classes based on a number of observed attributes related to those objects. The artificial neural network is one of the classification algorithms that can be used in many application areas. This paper investigates the potential of applying the feed-forward neural network architecture to the classification of medical datasets. The migration-based differential evolution algorithm (MBDE) is chosen and applied to the feed-forward neural network to enhance the learning process, and the network learning is validated in terms of convergence rate and classification accuracy. In this paper, the MBDE algorithm with various migration policies is proposed for classification problems in medical diagnosis.
Feature selection on boolean symbolic objects (ijcsity)
With the boom in IT technology, the datasets used in applications are growing ever larger and are described by a huge number of attributes; feature selection has therefore become an important discipline in knowledge discovery and data mining, allowing experts to select the most relevant features to improve the quality of their studies and to reduce the processing time of their algorithms. In addition, the data used by applications are becoming richer: they are now represented by sets of complex and structured objects instead of simple numerical matrices. The purpose of our algorithm is to perform feature selection on such rich data, called Boolean Symbolic Objects (BSOs), which are described by multivalued features. BSOs are considered higher-level units that can model complex data, such as clusters of individuals, aggregated data, or taxonomies. In this paper we introduce a new feature selection criterion for BSOs and explain how we improved its complexity.
The solution of problem of parameterization of the proximity function in ace ... (eSAT Journals)
Abstract
In this work, a new approach for defining the value of the proximity function, which is computed in the second step of the Algorithms for Calculating Estimates (ACE) in the area of pattern recognition, is presented. The value of the proximity function is defined as the proportion of corresponding features of two objects. The main attention is paid to the essential features of the polytypic objects in a given training set. One of the important problems of the ACE is comparing the values of fuzzy attributes. The main idea of this approach is to consider the proximity of the corresponding quantitative and qualitative features together. The complexity of comparing qualitative features, and an approach to overcoming that complexity, are considered; such features include features with fuzzy values. The membership function of fuzzy set theory is used to determine the membership degrees of feature values described with linguistic values, in order to improve the quality of the ACE. The steps of the algorithm for transferring the results obtained from the comparison of two values of a fuzzy feature, via the membership function, to the proximity function are given. A membership function with two parameters (b and c) is used. To define optimal values of these parameters, evolutionary algorithms for solving optimization problems are used, one of them being the genetic algorithm. Using the genetic algorithm, initial values of the membership function's parameters are generated and transmitted to the proximity function. The ACE is run, and the value of the quality functional is determined during training on the given training set. If the value of the quality functional is not sufficiently high, the values obtained by the genetic algorithm are regenerated using its special operators (selection, crossover, mutation). The algorithm for selecting optimal values of the parameters of the membership function using the genetic algorithm is given.
Key Words: ACE, proximity function, genetic algorithm, membership function, parameters, operators.
Understanding the Applicability of Linear & Non-Linear Models Using a Case-Ba... (ijaia)
This paper uses a case-based study, "product sales estimation", on real-time data to help us understand the applicability of linear and non-linear models in machine learning and data mining. A systematic approach has been used to address the given problem statement of sales estimation for a particular set of products in multiple categories, by applying both linear and non-linear machine learning techniques to a data set of features selected from the original data set. Feature selection is a process that reduces the dimensionality of the data set by excluding those features which contribute minimally to the prediction of the dependent variable. The next step in this process is training the model, which is done using multiple techniques from the linear and non-linear domains, among the best in their respective areas. Data remodeling has then been done to extract new features from the data set by changing its structure, and the performance of the models is checked again. Data remodeling often plays a crucial role in boosting classifier accuracy by changing the properties of the given dataset. We then explore and analyze the various reasons why one model performs better than the other, and hence try to develop an understanding of the applicability of linear and non-linear machine learning models. That being our primary goal, we also aim to find the classifier with the best possible accuracy for product sales estimation in the given scenario.
Optimal Feature Selection from VMware ESXi 5.1 Feature Set (ijccmsjournal)
A study of a VMware ESXi 5.1 server has been carried out to find the optimal set of parameters which suggest usage of the server's different resources. Feature selection algorithms have been used to extract the optimal set of parameters from the data obtained from the VMware ESXi 5.1 server using the esxtop command. Multiple virtual machines (VMs) run on the server. The k-means algorithm is used for clustering the VMs. The goodness of each cluster is determined by the Davies-Bouldin index and the Dunn index. The best cluster is then identified by these indices, and its features are taken as the set of optimal parameters.
An introduction to variable and feature selection (Marco Meoni)
A presentation of an influential 2003 paper by Isabelle Guyon (Clopinet) and André Elisseeff (Max Planck Institute), which outlines the main techniques for feature selection and model validation in machine learning systems.
Metasem: An R Package For Meta-Analysis Using Structural Equation Modelling: ... (Pubrica)
This presentation explains the metaSEM R package for meta-analysis using structural equation modelling:
1. SEM is used as the meta-analytical model for conducting a meta-analysis and analysing structural relationships. SEM-based meta-analysis can be univariate, multivariate, or three-level.
2. The structural equation model (SEM) is generally optimized and fitted using the OpenMx package.
3. Routine analyses can be run with the R package in batch mode, either interactively or non-interactively; a graphical interface such as RStudio is a convenient way for users to interact with the analysis.
Optimization is considered one of the pillars of statistical learning, and it also plays a major role in the design and development of intelligent systems such as search engines, recommender systems, and speech and image recognition software. Machine learning is the study that gives computers the ability to learn, and the ability to think, without being explicitly programmed: a computer is said to learn from experience with respect to a specified task and its performance on that task. Machine learning algorithms are applied to problems to reduce effort; they are used to manipulate data and to predict outputs for new data with high precision and low uncertainty. Optimization algorithms are used to make rational decisions in an environment of uncertainty and imprecision. In this paper a methodology is presented for using an efficient optimization algorithm as an alternative to the gradient descent algorithm as the optimizer in machine learning.
Performance Analysis of a Gaussian Mixture based Feature Selection Algorithm (rahulmonikasharma)
Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. The work reported in this paper includes the implementation of the unsupervised feature saliency algorithm (UFSA) for ranking different features. The algorithm uses the concept of feature saliency and the expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. In addition to feature ranking, the algorithm returns an effective model for the given dataset. The results (ranks) obtained from UFSA have been compared with the ranks obtained by Relief-F and Representation Entropy, using four clustering techniques: EM, Simple K-Means, Farthest-First, and Cobweb. For the experimental study, benchmark datasets from the UCI Machine Learning Repository have been used.
A Novel Hybrid Voter Using Genetic Algorithm and Performance History (Waqas Tariq)
Triple Modular Redundancy (TMR) is generally used to increase the reliability of real-time systems: three similar modules are used in parallel and the final output is arrived at using voting methods. Numerous majority voting techniques have been proposed in the literature; however, their performance is compromised for some typical sets of module output values. Here we propose a new voting scheme for analog systems that retains the advantages of previously reported schemes while reducing their disadvantages. The scheme utilizes a genetic algorithm and the modules' past performance history to calculate the final output. The scheme has been simulated using MATLAB, and the performance of the voter has been compared with that of the fuzzy voter proposed by Shabgahi et al. [4]. The performance of the voter proposed here is better than that of the existing voters.
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase... (ijaia)
Feature selection and classification are essential processes in dealing with large data sets that comprise numerous input attributes. Many search methods and classifiers have been used to find the optimal set of attributes. The aim of this paper is to find the optimal set of attributes and improve classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and using the ensemble classifiers method for the classification task. Results are reported in terms of percentage accuracy, the number of selected attributes, and the number of rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes together with the ensemble rule classifiers method. The experimental results, conducted on public real datasets, demonstrate that the ensemble rule classifiers method consistently improves classification accuracy on the selected datasets. Significant improvement in accuracy and in the optimal set of selected attributes is achieved by adopting the ensemble rule classifiers method.
QUANTUM CLUSTERING-BASED FEATURE SUBSET SELECTION FOR MAMMOGRAPHIC I... (ijcsit)
In this paper, we present an algorithm for feature selection. This algorithm, labeled QC-FS (Quantum Clustering for Feature Selection), performs the selection in two steps. First, the original feature space is partitioned using the Quantum Clustering algorithm in order to group similar features. Then a representative of each cluster is selected, using similarity measures such as the correlation coefficient (CC) and mutual information (MI): the feature which maximizes this information is chosen by the algorithm.
A Combined Approach for Feature Subset Selection and Size Reduction for High ... (IJERA Editor)
Selection of relevant features from a given set of features is one of the important issues in the fields of data mining and classification. In general, a dataset may contain many features, but it is not necessary that the whole feature set is important for a particular analysis or decision, because features may share common information and can also be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation, or because of improper availability of information about the observed system. In both cases, the data will contain features that merely increase the processing burden, which may ultimately cause improper outcomes when used for analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach that not only removes unimportant features but also reduces the size of the complete dataset. The proposed algorithm utilizes information theory to measure the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
Performance Comparison of Machine Learning Algorithms (Dinusha Dilanka)
In this paper we compare the performance of two classification algorithms. It is useful to differentiate algorithms based on computational performance rather than classification accuracy alone: although classification accuracy between the algorithms is similar, computational performance can differ significantly, and it can affect the final results. The objective of this paper is therefore to perform a comparative analysis of two machine learning algorithms, namely k-nearest-neighbor classification and logistic regression. We considered a large dataset of 7981 data points and 112 features and examined the performance of the above-mentioned machine learning algorithms. The processing time and accuracy of the different machine learning techniques are estimated on the collected data set, using 60% for training and the remaining 40% for testing. The paper is organized as follows. Section I contains the introduction and background analysis of the research, and Section II the problem statement. Section III briefly describes our application and data analysis process, the testing environment, and the methodology of our analysis. Section IV comprises the results of the two algorithms. Finally, the paper concludes with a discussion of future directions for research that eliminate the problems existing in the current research methodology.
Abstract: The common challenge for machine learning and data mining tasks is the curse of high dimensionality. Feature selection reduces the dimensionality by selecting the relevant and optimal features from a huge dataset. In this research work, a clustering and genetic algorithm based feature selection (CLUST-GA-FS) is proposed that has three stages, namely irrelevant feature removal, redundant feature removal, and optimal feature generation. The performance of the feature selection algorithms is analyzed using parameters like classification accuracy, precision, recall, and error rate. Recently, increasing attention has been given to the stability of feature selection algorithms, an indicator that requires that similar subsets of features be selected every time the algorithm is executed on the same dataset. This work analyzes the stability of the algorithm on four publicly available datasets using the stability measures average normal Hamming distance (ANHD), Dice's coefficient, Tanimoto distance, Jaccard's index, and Kuncheva's index.
An Influence of Measurement Scale of Predictor Variable on Logistic Regressio... (IJECEIAES)
Much real-world decision making is based on binary categories of information: agree or disagree, accept or reject, succeed or fail, and so on. Information of this kind is the output of a classification method, the domain of statistical field studies (e.g., the logistic regression method) and machine learning (e.g., Learning Vector Quantization (LVQ)). The input of a classification method plays a crucial role in the resulting output. This paper investigated the influence of various types of input data measurement (interval, ratio, and nominal) on the performance of the logistic regression method and LVQ in classifying an object. Logistic regression modeling is done in several stages until a model that passes the model suitability test is obtained. LVQ modeling was tested on several codebook sizes, and the most optimal LVQ model was selected. The best model of each method is then compared on object classification performance using the hit-ratio indicator. In logistic regression modeling, two models that pass the model suitability test were obtained, with interval-scaled and nominal-scaled predictor variables; in LVQ modeling, the three most optimal models, each with a different codebook, were obtained. On data with interval-scaled predictor variables, the performance of both methods is the same. The performance of both models is equally poor when the data have nominal-scale predictor variables. On data with ratio-scale predictor variables, the LVQ method is able to produce moderate performance, while logistic regression modeling did not yield a model that passes the model suitability test. Thus, if the input dataset has interval- or ratio-scale predictor variables, it is preferable to use the LVQ method for modeling the object classification.
Feature selection is one of the most fundamental steps in machine learning and is closely related to dimensionality reduction. A commonly used approach in feature selection is to rank the individual features according to some criteria and then search for an optimal feature subset based on an evaluation criterion that tests optimality. The objective of this work is to predict more accurately the presence of Learning Disability (LD) in school-aged children with a reduced number of symptoms. For this purpose, a novel hybrid feature selection approach is proposed by integrating a popular rough set based feature ranking process with a modified backward feature elimination algorithm. The feature ranking process calculates the significance, or priority, of each symptom of LD according to its contribution in representing the knowledge contained in the dataset; each symptom's significance value reflects its relative importance in predicting LD among the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage of the process, an optimal feature subset is generated. For comparative analysis, and to establish the importance of rough set theory in feature selection, the backward feature elimination algorithm is also combined with two state-of-the-art filter based feature ranking techniques, viz. information gain and gain ratio. The experimental results show the proposed feature selection approach outperforms the other two in terms of data reduction. Also, the proposed method eliminates all the redundant attributes efficiently from the LD dataset without sacrificing classification performance.
Automatic Feature Subset Selection using Genetic Algorithm for Clustering (idescitation)
Feature subset selection is a process of selecting a minimal subset of relevant features and is a preprocessing technique for a wide variety of applications. High-dimensional data clustering is a challenging task in data mining, and a reduced set of features helps to make the patterns easier to understand. Reduced feature sets are more significant if they are application specific, yet almost all existing feature subset selection algorithms are neither automatic nor application specific. This paper attempts to find the feature subset that yields optimal clusters during clustering. The proposed Automatic Feature Subset Selection using Genetic Algorithm (AFSGA) identifies the required features automatically and reduces the computational cost of determining good clusters. The performance of AFSGA is tested using public and synthetic datasets with varying dimensionality. Experimental results show the improved efficacy of the algorithm in terms of cluster optimality and computational cost.
A Review on Feature Selection Methods For Classification Tasks (Editor IJCATR)
In recent years, the application of feature selection methods to medical datasets has greatly increased. The challenging task in feature selection is how to obtain an optimal subset of relevant and non-redundant features which will give an optimal solution without increasing the complexity of the modeling task. Thus, there is a need to make practitioners aware of feature selection methods that have been successfully applied to medical data sets and to highlight future trends in this area. The findings indicate that most existing feature selection methods depend on univariate ranking, which does not take into account interactions between variables; overlook the stability of the selection algorithms; and, where they produce good accuracy, employ larger numbers of features. Developing a universal method that achieves the best classification accuracy with fewer features is still an open research area.
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 13, Issue 3 (Jul.-Aug. 2013), PP 73-79
www.iosrjournals.org
A Modified KS-test for Feature Selection
K Subrahmanyam1, Nalla Shiva Sankar1, Sai Praveen Baggam2, Raghavendra Rao S3
Abstract: A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. Feature selection, as a preprocessing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. In this paper a fast redundancy-removal filter is proposed based on a modified Kolmogorov-Smirnov statistic that utilizes class label information when comparing feature pairs. Results obtained from this algorithm are compared with those of two other algorithms capable of removing irrelevancy and redundancy: the Correlation Feature Selection algorithm (CFS) and the simple Kolmogorov-Smirnov Correlation-Based Filter (KS-CBF). The efficiency and effectiveness of the various methods are tested with two standard classifiers, the decision-tree classifier and the k-NN classifier. In most cases, classification accuracy using the reduced feature set produced by the proposed approach equaled or bettered the accuracy obtained using the complete feature set and the other two algorithms.
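As a rough illustration of the kind of class-conditional Kolmogorov-Smirnov comparison the abstract describes (not the authors' exact procedure; the min-max scaling, function names, and the "small score suggests redundancy" reading are assumptions made here for illustration), a redundancy score for a feature pair might be sketched as:

```python
# Hypothetical sketch of a class-conditional KS redundancy check.
# NOT the paper's exact algorithm; scaling and naming are assumed.
import numpy as np
from scipy.stats import ks_2samp

def ks_redundancy(f, g, y):
    """Average KS distance between two min-max scaled features,
    computed separately within each class label, then averaged."""
    def scale(v):
        v = np.asarray(v, dtype=float)
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    f, g, y = scale(f), scale(g), np.asarray(y)
    # One two-sample KS statistic per class, comparing the two
    # features' empirical distributions restricted to that class.
    stats = [ks_2samp(f[y == c], g[y == c]).statistic for c in np.unique(y)]
    return float(np.mean(stats))  # near 0: similar distributions, candidate redundancy
```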
I. Introduction
Feature selection, as a preprocessing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. It is the process of choosing a subset of the original features so that the feature space is optimally reduced according to a certain evaluation criterion. In recent years, data have become increasingly large in both the number of instances and the number of features in many applications, such as genome projects, text categorization, image retrieval, and customer relationship management. This enormity may cause serious problems for many machine learning algorithms with respect to scalability and learning performance. For example, high dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information, which may greatly degrade the performance of learning algorithms. Feature selection therefore becomes very necessary for machine learning tasks facing high dimensional data.
Feature subset selection is the process of identifying and removing as much irrelevant and redundant information as possible. This reduces the dimensionality of the data and may allow learning algorithms to operate faster and more effectively. Feature selection evaluation methods fall into two broad categories, the filter model and the wrapper model [2]. The filter model relies on general characteristics of the training data to select features without involving any learning algorithm. The wrapper model requires one predetermined learning algorithm for feature selection and uses its performance to evaluate and determine which features are selected. For each new subset of features, the wrapper model needs to learn a hypothesis (or a classifier). It tends to find features better suited to the predetermined learning algorithm, resulting in superior learning performance, but it also tends to be more computationally expensive and less general than the filter model. When the number of features becomes very large, the filter model is usually chosen due to its computational efficiency: filters have the advantages of faster execution and greater generality across a large family of classifiers, compared with wrappers [13].
Figure 1 depicts a simple classification process that involves a filter-based feature selection stage. The training and testing datasets produced by the dimensionality reduction process are fed to the ML (machine learning) algorithm. In some cases, accuracy on future classification can be improved; in others, the result is a more compact, easily interpreted representation of the target concept. In this work, we have employed a filter model for the evaluation of the selected features.
Figure 1: Classification Process that involves Feature Selection stage with Filter approach
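For concreteness, the pipeline of Figure 1 might be sketched as follows. This is a minimal illustration assuming scikit-learn; SelectKBest with a mutual-information score stands in for the paper's filter (the modified KS test is described later), and k=10 is an arbitrary choice.

```python
# Minimal sketch of a filter-based classification pipeline, assuming
# scikit-learn. The filter here is a stand-in, not the paper's method.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

def filter_pipeline(X, y, k=10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    # The filter is fit on training data only, then the same reduced
    # feature space is applied to both splits, as in Figure 1.
    model = make_pipeline(SelectKBest(mutual_info_classif, k=k),
                          KNeighborsClassifier())
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)  # accuracy on the held-out split
```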
In the following sections, we describe the various approaches used in our work along with our proposed approach. Other approaches exist, such as the Rank wrapper algorithm [6] and the Relief algorithm. Following these, the datasets used and the results obtained are presented.
II. Theoretical Framework
2.1. Correlation Based Feature Selection:
This algorithm [1] is based on information theory and uses symmetrical uncertainty (SU) as the filter for evaluating the selected feature set. It involves concepts such as mutual information [3], entropy, information gain, and symmetrical uncertainty. The process used here for finding correlations between attributes differs from that used in FCBF [11]. In the first step, the algorithm processes the given training dataset and initial feature set and removes all irrelevant features by measuring the feature-to-class strength of prediction. In the second step, it uses this relevant feature set and the training dataset to remove all redundant features, and finally presents the significant feature set, which is well supervised and uncorrelated with other features.
Entropy: Entropy, as given by Shannon, is a measure of the amount of uncertainty about a source of messages [5]. The entropy of a variable Y before and after observing values of another variable X is given by:
H(Y) = - ∑i p(yi) log p(yi)
and
H(Y|X) = - ∑j p(xj) ∑i p(yi|xj) log p(yi|xj)
Here p(yi) is the prior probability of each value of the random variable Y and p(yi|xj) is the conditional probability of yi given xj. Treating Y as the class and X as a feature in a data set, the entropy is 0 (no uncertainty at all) if all members of a feature belong to the same class; conversely, members of a feature are totally random with respect to the class if the entropy is 1. The range of (normalized) entropy is between 0 and 1.
Information Gain:
The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called information gain, given by
I(Y; X) = H(Y) − H(Y|X)
        = H(X) − H(X|Y)
        = H(Y) + H(X) − H(X, Y).
However, information gain is biased in favor of features with more values [4]: features with greater numbers of values will appear to gain more information than those with fewer values, even if the former are actually less informative than the latter. Also, the range of information gain is not 0 to 1, so its values should be normalized to ensure they are comparable and have the same effect.
Symmetrical Uncertainty:
Because of this limitation of information gain, we use another heuristic called symmetrical uncertainty, given by:
SU(Y; X) = 2 [I(Y; X) / (H(X) + H(Y))]
It averages the two uncertainty values, compensates for information gain's bias toward features with more values, and normalizes its values to the range [0, 1]. A value of 1 indicates that knowing the value of either variable completely predicts the value of the other, and a value of 0 indicates that X and Y are independent of each other.
Algorithm 1:
Table 1: Correlation Feature Selection Algorithm
1. // Remove irrelevant features
2. Input original data set D that includes features X and target class Y
3. For each feature Xi
       Calculate mutual information SU(Y; Xi)
4. Sort SU(Y; Xi) in descending order
5. Put each Xi whose SU(Y; Xi) > 0 into the relevant feature set Rxy
6. // Remove redundant features
7. Input relevant feature set Rxy
8. For each feature Xj
       Calculate pairwise mutual information SU(Xj; Xk) for all k ≠ j
9. Sxy = ∑ SU(Xj; Xk)
10. Calculate means µR and µS of Rxy and Sxy, respectively
    W = µS / µR
11. R = W · Rxy − Sxy
12. Select each Xj whose R > 0 into the final set F
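Under one reading of the pseudocode above (the per-feature redundancy sum and the W · Rxy − Sxy selection rule are somewhat ambiguous in Table 1), a hedged Python sketch might look like this:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def su(a, b):
    """Symmetrical uncertainty between two integer-coded arrays."""
    h = entropy(np.bincount(a) / len(a)) + entropy(np.bincount(b) / len(b))
    return 2.0 * mutual_info_score(a, b) / h if h > 0 else 0.0

def cfs_filter(X, y):
    """X: (n_samples, n_features), integer-coded; y: class labels."""
    # Steps 1-5: keep features positively relevant to the class, most relevant first
    rel = np.array([su(y, X[:, i]) for i in range(X.shape[1])])
    relevant = [i for i in np.argsort(rel)[::-1] if rel[i] > 0]
    # Steps 6-9: per-feature redundancy as the sum of pairwise SU values
    s = np.array([sum(su(X[:, j], X[:, k]) for k in relevant if k != j)
                  for j in relevant])
    # Steps 10-12: weight relevance by W = muS / muR, keep features with R > 0
    W = s.mean() / rel[relevant].mean()
    return [j for j, sj in zip(relevant, s) if W * rel[j] - sj > 0]
```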
2.2. Kolmogorov-Smirnov Test:
Equivalence of two random variables may be evaluated using the Kolmogorov-Smirnov (KS) test [12].
In statistics, the Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous,
one-dimensional probability distributions that can be used to compare a sample with a reference probability
distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of
the sample and the cumulative distribution function of the reference distribution, or between the empirical
distribution functions of two samples. The null distribution of this statistic is calculated under the null
hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is
drawn from the reference distribution (in the one-sample case).
The empirical distribution function Fn for n independent and identically distributed observations Xi is defined as:
Fn(x) = (1/n) ∑i=1..n I(Xi ≤ x)
where I(Xi ≤ x) is the indicator function, equal to 1 if Xi ≤ x and equal to 0 otherwise.
The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is
Dn = sup x |Fn(x) − F(x)|
where sup x is the supremum of the set of distances.
The cumulative distribution function of the Kolmogorov distribution K is given by
Pr(K ≤ x) = 1 − 2 ∑k=1..∞ (−1)^(k−1) exp(−2 k² x²)
Under the null hypothesis that the sample comes from the hypothesized distribution F(x),
√n Dn → sup t |B(F(t))|
in distribution, where B(t) is the Brownian bridge.
If F is continuous, then under the null hypothesis √n Dn converges to the Kolmogorov distribution, which does not depend on F. This result may also be known as the Kolmogorov theorem; the goodness-of-fit test or the Kolmogorov–Smirnov test is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level α if
√n Dn > Kα
where Kα is found from
Pr(K ≤ Kα) = 1 − α.
The asymptotic power of this test is 1. The various test parameters for the KS-test and minimum estimations are given in [9] and [10].
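As a small worked check (our own, not from the paper), the critical value Kα for α = 0.05 can be computed from the inverse of the Kolmogorov survival function available in scipy:

```python
from scipy.special import kolmogi, kolmogorov

alpha = 0.05
K_alpha = kolmogi(alpha)              # solves Pr(K > K_alpha) = alpha; ~1.358
print(K_alpha, kolmogorov(K_alpha))   # survival at K_alpha recovers alpha
```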
We use the KS two-sample test to analyze the redundancy of two features, so a brief description of it follows.
Two-Sample Kolmogorov-Smirnov Test:
The Kolmogorov–Smirnov test may also be used to test whether two underlying one-dimensional probability distributions differ. In this case, the Kolmogorov–Smirnov statistic is
Dn,n' = sup x |F1,n(x) − F2,n'(x)|
where F1,n and F2,n' are the empirical distribution functions of the first and the second sample, respectively.
The null hypothesis is rejected at level α if
√(n n' / (n + n')) Dn,n' > Kα
Note that the two-sample test only checks whether the two data samples come from the same distribution; it does not specify what that common distribution is (e.g., normal or not normal).
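As a quick illustration (our own toy example), scipy's ks_2samp implements this two-sample test directly: samples from the same distribution should not be rejected, while samples from different distributions should.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)
b = rng.normal(0, 1, 500)      # same distribution as a
c = rng.normal(1, 2, 500)      # different location and scale

print(ks_2samp(a, b).pvalue)   # large p-value: do not reject H0
print(ks_2samp(a, c).pvalue)   # tiny p-value: reject H0
```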
KS test
The steps taken to calculate the KS statistic are:
1. Discretize both features Fi, Fj into k bins.
2. Estimate the probability in each bin.
3. Calculate the cumulative probability distributions of both features.
4. Calculate the KS statistic.
Table 2: KS test for two features Fi, Fj
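Below is a minimal Python sketch of these binned steps; the function name and the choice of shared bin edges over the joint range are our own assumptions.

```python
import numpy as np

def binned_ks_statistic(fi, fj, k=10):
    """KS distance between two features after discretization into k shared bins."""
    lo = min(fi.min(), fj.min())
    hi = max(fi.max(), fj.max())
    edges = np.linspace(lo, hi, k + 1)        # k bins over the joint range
    pi, _ = np.histogram(fi, bins=edges)      # counts per bin for each feature
    pj, _ = np.histogram(fj, bins=edges)
    cdf_i = np.cumsum(pi) / len(fi)           # cumulative probability per bin
    cdf_j = np.cumsum(pj) / len(fj)
    return np.max(np.abs(cdf_i - cdf_j))      # KS statistic
```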
Algorithm 2:
Algorithm K-S CBF
Relevance analysis
1. Calculate the SU(X, C) relevance indices and create an ordered list S of features according to the decreasing value of their relevance.
Redundancy analysis
2. Take as the feature X the first feature from the S list.
3. Find and remove all features to which X is approximately equivalent according to the K-S test.
4. Set the next remaining feature in the list as X and repeat step 3 for all features that follow it in the S list.
Table 3: The two-step Kolmogorov-Smirnov Correlation Based Filter (K-S CBF) algorithm.
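A compact Python sketch of this loop follows, assuming precomputed SU relevance scores and using scipy's two-sample test as the equivalence check; the significance threshold alpha is an assumed parameter.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_cbf(X, su_scores, alpha=0.05):
    """X: (n_samples, n_features); su_scores: SU(Xi, C) per feature."""
    order = list(np.argsort(su_scores)[::-1])   # S list, most relevant first
    kept = []
    while order:
        f = order.pop(0)                        # take the first feature in S
        kept.append(f)
        # drop every remaining feature the KS test cannot distinguish from f
        order = [g for g in order
                 if ks_2samp(X[:, f], X[:, g]).pvalue < alpha]
    return kept
```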
III. Modified KS-CBF Test:
The proposed algorithm introduces the concept of binning the input dataset into n bins. Redundancy is
calculated in each of the bins individually and the set of redundant features for each bin are stored. Finally, the
set of features that are common in all the bins (here, it can also be taken as an input parameter to decide at run-
time as required) are considered as redundant for the input dataset and they can be eliminated.
In each bin, again we divide the available set of records into a particular number of partitions for which
the actual KS-test is applied. The set of redundant features in each of these partitions is mixed up with those for
the other partitions in that bin. So, now we can expect a good redundant subset to be produced from each bin.
The union operation ensures that the possibility of redundancy has been checked for every feature in its entirety.
The intersection operation ensures that a feature which is actually non-redundant will not be claimed as being
redundant. Now, since we perform an intersection of the redundant features obtained from all the bins, the final
5. A Modified KS-test for Feature Selection
www.iosrjournals.org 77 | Page
feature subset produced will not contain these features and could sufficiently represent a significant subset of
features that can be used for the classification process.
Algorithm 3:
Modified K-S CBF Algorithm:
Relevance analysis
1. Order the features by decreasing value of the SU(f, C) index, which reflects their decreasing relevance.
Redundancy analysis
2. Pass the dataset with the relevant features to the KS-test measure.
3. Discretize the dataset into n bins, each containing approximately the same number of records.
4. For each bin Bi
   4.1. Form k data partitions, each containing approximately the same number of records.
   4.2. For each of the k partitions P1, P2, ..., Pk
        4.2.1. Initialize Fi with the first feature in the F-list.
        4.2.2. Find all features for which Fi forms an approximate redundant cover using the K-S test.
        4.2.3. Set the next remaining feature in the list as Fi and repeat the above step for all features that follow it in the F-list.
   4.3. Take the union of all these redundant features into Bi.
5. Get the common features that are redundant in all bins.
6. Remove those features to obtain the significant subset of features.
Table 4: The modified Kolmogorov-Smirnov Correlation Based Filter algorithm.
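The following is a minimal Python sketch of Table 4 as we read it; the row partitioning via np.array_split, the ks_redundant helper, and the alpha threshold are our own stand-ins rather than the authors' exact implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_redundant(Xp, features, alpha=0.05):
    """Features judged redundant within one partition by the two-sample KS test."""
    redundant, order = set(), list(features)
    while order:
        f = order.pop(0)
        # features the KS test cannot distinguish from f are covered by f
        same = [g for g in order if ks_2samp(Xp[:, f], Xp[:, g]).pvalue >= alpha]
        redundant.update(same)
        order = [g for g in order if g not in same]
    return redundant

def modified_ks_cbf(X, features, n_bins=4, k_parts=2, alpha=0.05):
    per_bin = []
    for bin_rows in np.array_split(np.arange(len(X)), n_bins):    # n bins
        bin_red = set()
        for part in np.array_split(bin_rows, k_parts):            # k partitions
            bin_red |= ks_redundant(X[part], features, alpha)     # union within bin
        per_bin.append(bin_red)
    common = set.intersection(*per_bin)                           # common to all bins
    return [f for f in features if f not in common]               # significant subset
```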
In the first step, a symmetrical uncertainty filter is applied to remove the irrelevant features. The dataset containing this filtered subset is then passed to the second step, which performs the redundancy analysis and obtains the set of redundant features. These redundant features are removed from the remaining feature set, and only the dataset with the significant features is considered for further analysis.
In most cases, classification accuracy using this reduced feature set equaled or bettered the accuracy obtained using the complete feature set. Moreover, the dimensionality of the feature set is reduced further than with the simple KS-CBF algorithm, which can give a good computational gain over the simple KS-CBF when testing with a classifier, as the new feature set contains fewer dimensions. In a few cases, however, the proposed algorithm produces results that are somewhat less accurate when tested with a classifier than the simple KS-CBF results; this can still be handled and improved by varying the number of bins, which is taken as an input parameter. Feature selection can sometimes degrade machine learning performance when it eliminates features that were highly predictive of small areas of the instance space. The size of the dataset also plays a role: as we divide the data into smaller bins, the process is sometimes affected when a bin does not contain enough records to perform the test. On the other hand, when the information in the dataset is highly random, with a large range of distinct feature values, this algorithm can produce better results than the others.
It has been observed that, as the number of bins used during the test increases, the number of redundant features increases and the final feature set produced gets smaller. In some cases this leads to a slight decrease in detection rates; however, the performance gain obtained is very high, so the slight decrease in detection rates can be traded off against the large gain in performance.
Dataset        No. Features   No. Instances   No. Classes   Class Distribution   Balanced Dataset
Ionosphere     34             351             2             126/225              No
Pima           8              768             2             500/268              No
Wdbc           31             569             2             212/357              No
Wine           13             178             2             59/71                No
Dos-Normal     41             4765            2             2371/2394            No
Probe-Normal   41             4346            2             1996/2350            No
U2R-Normal     41             326             2             52/274               No
R2L-Normal     41             2137            2             1062/1075            No
IV. Datasets and Results:
The feature subsets obtained above are tested with two standard classifiers: the Decision Tree and the K-NN classifier.
Table 8.1: Selected features for the KDD 99 dataset

Data Set       No. of features   Correlation Feature Selection    Simple KS-test             Modified KS-test
Normal-DoS     41                1,5,10,23,24,25,31,33,35,38,39   2-6,12,23,24,31-37         2-6,12,23,32,36
Normal-Probe   41                4,24,27-32,40,41                 3-6,12,23,24,27,32-37,40   3-6,12,23,32,33,37
Normal-U2R     41                1,10,13,14,17,23,35-37           1,3,5,6,24,32-34,36,37     1,3,5,6,24,32-34,36
Normal-R2L     41                1,10,22,23,31,32,34-37           1,3,5,6,10,22-24,31-37     2-6,24,32,33,36,37
Table 8.2: Selected feature sets for various UCI datasets

Dataset      No. of features   Correlation Feature Selection   Simple KS-test                   Modified KS-test
Ionosphere   34                6,12,22,24,25,27,29,32-34       4,25,28                          4,25,28
Pima         8                 2,5,6,8                         2-8                              2,6,7
Wdbc         31                8,9,11,18,21,27-31              1,4,5,8,9,12,14-18,21,24,25,28   1,4,5,8,9,14-16,18,19,21-23,28
Wine         14                1,2,6,9,10,12                   1,2,7,10,12,13                   1,2,7,10,12,13
Table 8.3: Detection rates (%) with various feature subsets for the KDD 99 dataset (CFS = Correlation Feature Selection)

                           K-NN Classifier                              Decision Tree
Data Set       Full Set   CFS     Simple KS   Modified KS   Full Set   CFS     Simple KS   Modified KS
Normal-DoS     99.92      99.55   99.73       99.39         99.40      99.73   99.41       99.41
Normal-Probe   97.37      91.5    97.28       97.28         100        90.44   100         100
Normal-U2R     96.80      92.0    93.89       99.20         100        95.2    100         100
Normal-R2L     99.28      83.73   98.40       98.92         97.72      94.2    97.37       98.95
Table 8.4: Detection rates (%) with various feature subsets for the UCI datasets (CFS = Correlation Feature Selection)

                          K-NN Classifier                              Decision Tree
Data Set     Full Set   CFS     Simple KS   Modified KS   Full Set   CFS     Simple KS   Modified KS
Ionosphere   80.82      87.67   84.24       84.24         86.98      84.24   69.18       69.36
Pima         74.02      75.32   71.42       75.65         59.7       54.5    50.0        56.49
Wdbc         67.36      92.05   67.36       67.78         85.77      85.35   93.31       93.61
Wine         95.55      99.77   96.65       97.77         99.33      100     97.78       97.78
It can be seen that in many cases the proposed approach selects a smaller significant feature subset than the other two algorithms. Its accuracy and efficiency were also better in most cases. In a few cases the accuracy is slightly reduced, but not far below that of the other methods, and this method outperformed both CFS and the simple KS-test in terms of efficiency, i.e., execution performance. Compared to the results in [4] and [6], the results obtained in our experiments are good and encouraging.
V. Conclusion:
A modified KS-test that can efficiently select good feature subsets has been proposed in this paper. The proposed algorithm has computational demands very similar to those of the traditional KS-test, proportional to the total number of bins. A comparative test with some widely used feature selection algorithms showed better performance in terms of both efficiency and accuracy. A statistical significance level of 0.05 was used in the test, which can be varied. The number of bins and the number of partitions should be chosen according to the number of records in the incoming dataset; according to our observations, good results are obtained if each bin contains at least 40 records.
Since a filter approach is used, the results are well suited to any classifier. The results in our experiments have been tested with the C4.5 and K-NN classifiers. Various variants of the Kolmogorov-Smirnov test exist, and the algorithm may be used with other indices for relevance indication.
Future Work:
In future work we would like to compare and contrast the ability of the modified KS-test against the features obtained using rough set theory, Support Vector Machines, and Decision Trees. We have identified the problem of Intrusion Detection as our subject matter, and we will apply these techniques using the GANN weight extraction algorithm [14] to measure the accuracy of Intrusion Detection Systems based on these techniques. Further, we intend to develop an extension for SNORT using the best of these four techniques according to our findings.
References:
[1] Te-Shun Chou, Kang K. Yen, Jun Luo, Niki Pissinou and Kia Makki, "Correlation Based Feature Selection for Intrusion Detection Design."
[2] Mark A. Hall and Lloyd A. Smith, "Feature Subset Selection: A Correlation Based Filter Approach," Department of Computer Science, University of Waikato, Hamilton, New Zealand.
[3] Boyan Bonev, Francisco Escolano and Miguel Angel, "A Novel Information Theory for Filter Feature Selection."
[4] Te-Shun Chou, "Ensemble Fuzzy Belief Intrusion Detection Design," Florida International University, Miami, Florida.
[5] Iain Bancarz, "Conditional Entropy Metrics for Feature Selection."
[6] Marcin Blachnik, Włodzisław Duch, Adam Kachel and Jacek Biesiada, "Feature Selection for Supervised Classification: A Kolmogorov-Smirnov Class Correlation-Based Filter."
[7] J. Biesiada and W. Duch, "Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution," Proceedings of the 4th International Conference on Computer Recognition Systems.
[8] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition.
[9] Hovhannes Keutelian, "The Kolmogorov-Smirnov Test When Parameters are Estimated from Data," Fermilab.
[10] Michael D. Weber, Lawrence M. Leemis and Rex K. Kincaid, "Minimum Kolmogorov-Smirnov Test Statistic Parameter Estimates."
[11] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proceedings of the Twentieth International Conference on Machine Learning, pp. 856-863, Washington, D.C.
[12] Jacek Biesiada and Włodzisław Duch, "Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter."
[13] Mark A. Hall, "Correlation Based Feature Selection for Machine Learning," PhD dissertation, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
[14] "Artificial Neural Network Classifier for Intrusion Detection in Computer Networks," ICN-2010, Karnataka, India.