This document provides a comprehensive overview and categorization of feature selection methods for machine learning classification problems. It identifies the four main steps in typical feature selection methods: generation procedure, evaluation function, stopping criterion, and validation procedure. Generation procedures are categorized as complete, heuristic, or random. Evaluation functions are categorized as distance measures, information measures, dependence measures, consistency measures, or classifier error rates. The document then surveys 32 existing feature selection methods and categorizes them based on the type of generation procedure and evaluation function used. It provides representative examples and discusses strengths and weaknesses of different approaches.
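The four-step loop described above can be sketched in code. This is a minimal illustration, not any surveyed method: generation is a heuristic greedy forward search, the evaluation function is a hypothetical stand-in (any distance, information, dependence, or consistency measure could be substituted), and the stopping criterion is "no further improvement".

```python
def forward_select(features, evaluate):
    """Greedy forward selection: grow the subset one feature at a time."""
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Generation: propose each one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, best_f = max(scored)
        # Stopping criterion: halt when no extension improves the score.
        if score <= best_score:
            break
        selected.append(best_f)
        best_score = score
    return selected, best_score

# Toy evaluation function: reward subsets containing "informative" features,
# penalize subset size so the criterion eventually stops improving.
informative = {"f1", "f3"}
score = lambda subset: sum(f in informative for f in subset) - 0.1 * len(subset)
subset, s = forward_select(["f1", "f2", "f3", "f4"], score)
print(subset)  # -> ['f3', 'f1']
```

Validation against held-out data, the fourth step, would wrap this whole loop rather than appear inside it.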
Experimental comparison of ranking techniques (jleyvlop)
The document describes an experimental comparison of ranking techniques for multi-criteria decision analysis (MCDA). It conducted a simulation experiment to compare the performance of two MCDA ranking methods: the NFR (Normalized Frequency Ratio) method and a new MOEA (Multi-Objective Evolutionary Algorithm) procedure. The experiment varied the ranking procedure and size of ranking problems. It found that the MOEA procedure significantly outperformed NFR across all conditions, producing lower error rates and more accurate rankings. The MOEA procedure was also less affected by increases in problem size.
This document discusses using feature selection techniques to address the curse of dimensionality in microarray data analysis. It presents the problem of having many more features than samples in bioinformatics tasks like cancer classification and network inference. It describes filter, wrapper and embedded feature selection approaches and proposes a blocking strategy that uses multiple learning algorithms to evaluate feature subsets in order to improve selection robustness when samples are limited. Finally, it lists several microarray gene expression datasets that are commonly used to evaluate feature selection methods.
Integrated bio-search approaches with multi-objective algorithms for optimiza... (TELKOMNIKA JOURNAL)
Optimal feature selection is difficult but crucial, particularly for the task of classification. Traditional methods select features independently and can produce collections of irrelevant features, which degrades classification accuracy. The goal of this paper is to leverage bio-inspired search algorithms, together with a wrapper approach, to optimize the multi-objective algorithms ENORA and NSGA-II so that they generate an optimal feature set. The main steps are to combine ENORA and NSGA-II with suitable bio-inspired search algorithms in which multiple subset generation is implemented, and then to validate the optimal feature set through subset evaluation. Eight (8) comparison datasets of various sizes were deliberately selected for testing. Results show that the combination of the multi-objective algorithms ENORA and NSGA-II with the selected bio-inspired search algorithm is promising for reaching a better optimal solution (i.e., the best features with higher classification accuracy) on the selected datasets. This finding implies that bio-inspired wrapper/filter algorithms can boost the efficiency of ENORA and NSGA-II for feature selection and classification.
This document discusses feature selection techniques for classification problems. It begins by outlining class separability measures like divergence, Bhattacharyya distance, and scatter matrices. It then discusses feature subset selection approaches, including scalar feature selection which treats features individually, and feature vector selection which considers feature sets and correlations. Examples are provided to demonstrate calculating class separability measures for different feature combinations on sample datasets. Exhaustive search and suboptimal techniques like forward, backward, and floating selection are discussed for choosing optimal feature subsets. The goal of feature selection is to select a subset of features that maximizes class separation.
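One of the separability-based scalar selection ideas above can be made concrete. The sketch below, on invented toy data, ranks two features by the Fisher discriminant ratio (squared mean separation over summed within-class variance), a simple two-class separability measure.

```python
# Scalar feature selection: score each feature individually by a class
# separability measure (Fisher discriminant ratio), then rank. Toy data
# is invented: feature 0 separates the classes well, feature 1 barely.

def fdr(x1, x2):
    """Fisher discriminant ratio for one feature across two classes."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    v1 = sum((x - m1) ** 2 for x in x1) / len(x1)
    v2 = sum((x - m2) ** 2 for x in x2) / len(x2)
    return (m1 - m2) ** 2 / (v1 + v2)

class_a = [(1.0, 5.0), (1.2, 5.1), (0.9, 4.9)]
class_b = [(3.0, 5.0), (3.1, 5.2), (2.9, 4.8)]

scores = [fdr([s[j] for s in class_a], [s[j] for s in class_b])
          for j in range(2)]
ranked = sorted(range(2), key=lambda j: scores[j], reverse=True)
print(ranked)  # feature 0 ranks first
```

Feature vector selection would instead score whole subsets, since two individually strong features can be redundant together.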
Graph-based algorithm for checking wrong indirect relationships in non-free c... (TELKOMNIKA JOURNAL)
In this context, this paper proposes a combination of parameterised decision mining and relation sequences to detect wrong indirect relationships in non-free choice constructs. Existing decision mining without parameters can only detect the direction of a relationship, not its correctness; this paper aims to identify both using decision mining with parameters. The paper discovers a graph process model from the event log, then analyses it to obtain decision points. Each decision point is processed with parameterised decision mining to form decision rules, and the derived rules are used as parameters for checking wrong indirect relationships in the non-free choice. The evaluation shows that checking wrong indirect relationships in non-free choice with parameterised decision mining achieves 100% accuracy, whereas the existing decision mining achieves 90.7% accuracy.
A Software Measurement Using Artificial Neural Network and Support Vector Mac... (ijseajournal)
Today, software measurement is based on various techniques such as neural networks, genetic algorithms, and fuzzy logic. This study investigates the efficiency of applying a support vector machine with a Gaussian radial basis kernel function to the software measurement problem to increase performance and accuracy. Support vector machines (SVMs) are an innovative approach to constructing learning machines that minimize generalization error. There is a close relationship between SVMs and radial basis function (RBF) classifiers; both have found numerous applications such as optical character recognition, object detection, face verification, and text categorization. The results demonstrate that the accuracy and generalization performance of the SVM with a Gaussian radial basis kernel function is better than that of the RBF network (RBFN). We also examine and summarize several points on which the SVM is superior to the RBFN.
IRJET - Analyzing Voting Results using Influence Matrix (IRJET Journal)
This document discusses analyzing voting results using an influence matrix. It proposes modeling voting outcomes as the result of an opinion dynamics process in which opinions evolve according to social influence, and formulates the estimation of the maximum a posteriori opinions and influence matrix from voting data. It demonstrates vote prediction and visualization of the dynamics based on the estimated influence matrix. Future work could explore modeling stubborn agents and topic-dependent beliefs rather than treating topics independently.
Applications of operations research in business (raaz kumar)
1) Operations research is a quantitative approach to decision making based on the scientific method of problem solving. It involves modeling real-life situations as mathematical problems to arrive at optimal or near-optimal solutions.
2) The key steps in operations research problem solving are defining the problem, determining alternative solutions, evaluating alternatives using criteria, choosing the best alternative, implementing the chosen alternative, and evaluating the results.
3) Common techniques used in operations research include linear programming, transportation modeling, assignment modeling, and network scheduling methods like PERT/CPM. These techniques help optimize objectives while satisfying constraints.
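As a concrete instance of the linear programming technique named above, the sketch below solves a small two-variable LP by enumerating constraint-intersection vertices and keeping the feasible one with the best objective value. The problem data is a textbook-style example invented here, not from the document.

```python
from itertools import combinations

# Constraints a*x + b*y <= c; x >= 0 and y >= 0 are written as -x <= 0, -y <= 0.
cons = [(1, 0, 4), (0, 2, 12), (3, 2, 18), (-1, 0, 0), (0, -1, 0)]
objective = lambda x, y: 3 * x + 5 * y

best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if det == 0:
        continue  # parallel constraint boundaries: no unique intersection
    # Cramer's rule for the intersection of the two constraint boundaries.
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    # Keep the vertex only if it satisfies every constraint (feasible).
    if all(a * x + b * y <= c + 1e-9 for a, b, c in cons):
        value = objective(x, y)
        if best is None or value > best[0]:
            best = (value, x, y)

print(best)  # -> (36.0, 2.0, 6.0): maximum of 3x+5y at x=2, y=6
```

Real solvers use the simplex method or interior-point methods rather than vertex enumeration, which only scales to tiny problems.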
This document discusses sampling techniques and procedures. It provides an overview of sample vs census studies, outlines the sampling design process, and classifies sampling techniques into nonprobability and probability methods. Specific techniques covered include convenience, judgmental, quota, snowball, simple random, systematic, stratified, and cluster sampling. The document also discusses choosing between nonprobability and probability sampling, and provides examples of procedures for drawing probability samples.
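Two of the probability sampling procedures listed above can be sketched directly; the population and strata below are invented for illustration.

```python
import random

def systematic_sample(population, n):
    """Systematic sampling: every k-th element after a random start."""
    k = len(population) // n          # sampling interval
    start = random.randrange(k)
    return population[start::k][:n]

def stratified_sample(strata, n):
    """Proportionate stratified sampling: each stratum contributes in
    proportion to its share of the population."""
    total = sum(len(s) for s in strata)
    sample = []
    for s in strata:
        take = round(n * len(s) / total)
        sample.extend(random.sample(s, take))
    return sample

random.seed(0)
pop = list(range(100))
print(len(systematic_sample(pop, 10)))            # 10 elements
strata = [list(range(60)), list(range(60, 100))]  # 60/40 population split
print(len(stratified_sample(strata, 10)))         # 6 + 4 = 10 elements
```

Cluster sampling differs from stratification in that whole groups are sampled at random rather than a few members from every group.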
Software Cost Estimation Using Clustering and Ranking Scheme (Editor IJMTER)
Software cost estimation is an important task in the software design and development process. Planning and budgeting are carried out with reference to the software cost values. A variety of software properties are used in the cost estimation process, including hardware, product, technology, and methodology factors. The quality of a software cost estimate is measured with reference to its accuracy.
Software cost estimation is carried out using three types of techniques: regression-based models, analogy-based models, and machine learning models. Each model has a set of techniques for the cost estimation process; 11 cost estimation techniques under these 3 categories are used in the system. The Attribute-Relation File Format (ARFF) is used to maintain the software product property values, and the ARFF file serves as the main input to the system.
The proposed system is designed to perform clustering and ranking of software cost estimation methods. A non-overlapping clustering technique is enhanced with an optimal centroid estimation mechanism. The system improves the accuracy of the clustering and ranking process and produces efficient ranking results on software cost estimation methods.
Boost model accuracy of imbalanced COVID-19 mortality prediction (BindhuBhargaviTalasi)
The document proposes using a GAN-based oversampling technique to boost the accuracy of an imbalanced COVID-19 mortality prediction model. It analyzes patient data to find correlations between mortality and features such as gender and age. A GAN is used to generate synthetic minority-class data to address the data imbalance. The GAN-enhanced random forest model is evaluated using metrics such as recall, precision, and F1 score, and shows improved performance over the base model at predicting COVID-19 mortality.
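The GAN itself is beyond a short sketch, but the evaluation metrics named above (precision, recall, F1) can be stated precisely. The labels below are invented toy data, with 1 as the minority class.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (minority) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented labels: 3 minority positives out of 8 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # -> 0.667 0.667 0.667
```

These metrics, unlike plain accuracy, are not inflated by correctly predicting the majority class, which is why they suit imbalanced problems.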
Threshold benchmarking for feature ranking techniques (journalBEEI)
In prediction modeling, the choice of features taken from the original feature set is crucial for accuracy and model interpretability. Feature ranking techniques rank features by their importance, but there is no consensus on how many features to keep. It therefore becomes important to identify a threshold value or range for removing the redundant features. In this work, an empirical study is conducted to identify a threshold benchmark for feature ranking algorithms. Experiments are conducted on the Apache Click dataset with six popular ranker techniques and six machine learning techniques, to deduce a relationship between the total number of input features (N) and the threshold range. The area-under-the-curve analysis shows that roughly 33-50% of the features are necessary and sufficient to yield a reasonable performance measure, with a variance of 2%, in defect prediction models. Further, we find that log2(N) as the ranker threshold value represents the lower limit of the range.
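The reported thresholds can be computed directly for any feature count N: log2(N) as the lower limit and roughly 33-50% of N as the suggested range. This helper is an illustration of the paper's numbers, not code from the paper.

```python
import math

def threshold_range(n_features):
    """Return (log2(N) lower limit, 33% of N, 50% of N), rounded to
    whole feature counts."""
    lower = math.ceil(math.log2(n_features))
    return lower, round(0.33 * n_features), round(0.50 * n_features)

print(threshold_range(64))  # -> (6, 21, 32)
```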
Text mining for the Vaccine Adverse Event Reporting System: medical text class... (PC LO)
Botsis, T., Nguyen, M. D., Woo, E. J., Markatou, M., & Ball, R. (2011). Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association, 18(5), 631-638.
Taguchi design of experiments, Nov 24 2013 (Charlton Inao)
This document provides an overview of Taguchi design of experiments. It defines Taguchi methods, which use orthogonal arrays to conduct a minimal number of experiments that can provide full information on factors that affect performance. The assumptions of Taguchi methods include additivity of main effects. The key steps in an experiment are selecting variables and their levels, choosing an orthogonal array, assigning variables to columns, conducting experiments, and analyzing data through sensitivity analysis and ANOVA.
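The Taguchi steps above can be sketched end to end on a toy scale: assign three two-level factors to the columns of an L4(2^3) orthogonal array, run the four experiments, and estimate main effects as differences of mean responses (relying on the additivity assumption noted above). The response values are invented.

```python
L4 = [  # rows = experimental runs, columns = factors A, B, C at levels 0/1
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
response = [20.0, 24.0, 30.0, 26.0]  # hypothetical measurements per run

def main_effect(col):
    """Mean response at level 1 minus mean response at level 0."""
    lo = [r for row, r in zip(L4, response) if row[col] == 0]
    hi = [r for row, r in zip(L4, response) if row[col] == 1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

effects = [main_effect(c) for c in range(3)]
print(effects)  # -> [6.0, 0.0, 4.0]: factor A matters most, B not at all
```

Four runs suffice here because each column of the array is balanced: every level of every factor appears equally often against every level of the others.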
Using evolutionary testing to improve efficiency and quality (Faysal Ahmed)
Evolutionary testing uses genetic algorithms to automatically generate test cases that improve software testing efficiency and quality. It transforms testing goals into optimization problems solved through evolutionary algorithms. Experiments show evolutionary testing outperforms manual, random, and static testing by generating more effective test cases. It provides a systematic, automated approach for testing temporal behavior, safety, and structural coverage.
This document discusses factorial design, which is an experimental design technique used for optimization. It involves studying the effects of two or more factors simultaneously. There are two main types: full factorial design, which tests all possible combinations of factors and levels, and fractional factorial design, which reduces the number of runs when there are many factors. Factorial designs allow evaluation of both main effects and interaction effects. They are useful for formulation development and method optimization in chromatography. Software is used to analyze the results of factorial experiments.
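The run-count difference between full and fractional designs follows directly from enumeration. The sketch below builds a 2^3 full factorial and a half-fraction defined by the relation ABC = +1 in the usual -1/+1 level coding; the choice of defining relation is illustrative.

```python
from itertools import product

levels = [-1, 1]
full = list(product(levels, repeat=3))  # all combinations: 2^3 = 8 runs
# Half-fraction: keep only runs whose levels satisfy a*b*c = +1.
half = [run for run in full if run[0] * run[1] * run[2] == 1]
print(len(full), len(half))  # -> 8 4
```

The fraction halves the experimental effort at the cost of confounding some interaction effects with main effects.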
The document discusses various black box testing techniques including:
1) Equivalence partitioning which creates input/output partitions to test valid and invalid values.
2) Boundary analysis which tests partition boundaries and values just beyond limits.
3) State transition testing which uses a state diagram to test all transitions and states.
It also briefly covers decision tables, which present conditions and the corresponding actions as rules in a table. Examples are provided to illustrate each technique.
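The first two techniques can be sketched against a hypothetical validator (the 18-65 age range below is invented): one representative value per equivalence partition, plus the values at and just beyond each boundary.

```python
def is_valid_age(age):
    """Hypothetical function under test: accepts ages 18-65 inclusive."""
    return 18 <= age <= 65

# Equivalence partitioning: one value from each partition
# (below range, in range, above range).
assert not is_valid_age(10)
assert is_valid_age(40)
assert not is_valid_age(70)

# Boundary analysis: each limit and the value just beyond it.
for age, expected in [(17, False), (18, True), (65, True), (66, False)]:
    assert is_valid_age(age) == expected

print("all checks passed")
```

Both techniques are black-box: the test values come from the specification of the input range, not from the function's implementation.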
1. The document provides an introduction to regression models and panel data, outlining key concepts such as the definition of panel data, benefits of using panel data including controlling for individual heterogeneity, and limitations of panel data including problems with data collection.
2. Panel data involves observing the same cross-section of individuals, countries, firms etc. over multiple time periods, allowing analysis of both time and individual variability.
3. Using panel data offers advantages over cross-sectional or time series data alone, such as better accounting for unobserved heterogeneity and enabling analysis of dynamic adjustments over time.
Choice models are statistical models that attempt to capture the rational decision-making process by which individuals choose between options. There are several types of choice models including multinomial logit models, conditional fixed-effects logit models, alternative specific conditional (McFadden) models, ordered logit models, stereotype logistic models, and nested logit models. All choice models are estimated using maximum likelihood and allow the effects of independent variables to differ across outcome categories.
The document discusses operations research (OR), including its origins during WWII to optimize resource allocation, its goal of applying scientific principles to optimize complex business and organizational problems, and its use of quantitative modeling and analysis. OR aims to find the global optimum solution by analyzing relationships between system components. It uses interdisciplinary teams and scientific methods to develop mathematical and other models of real-world problems, which are then solved using techniques like linear programming. The models represent important variables and constraints. OR has wide applications in areas like the military, production, transportation, and resource allocation.
Statistics deals with the collection, organization, presentation, analysis, and interpretation of numerical data. Descriptive statistics summarize and describe data without generalizing, while inferential statistics makes generalizations from data. There are various methods for collecting data, including direct interviews, indirect questionnaires, observation, experimentation, and registration from agencies. Data can be presented textually, in tables, or graphically in charts and graphs. Population refers to all data for a phenomenon, while a sample is a subset that aims to represent the population.
The document introduces the Matrix Data Analysis Diagram, one of the Seven New Quality Control Tools. The Matrix Data Analysis Diagram is a tool that uses numerical data to present and analyze relationships between two sets of factors in a matrix format. It is useful for studying parameters of production, market information, and new product planning by allowing the quantification of relationships. The document provides an example of how to construct a Matrix Data Analysis Diagram to prioritize design options based on criteria.
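The prioritization use mentioned above reduces to a weighted-sum matrix computation. The options, criteria, weights, and ratings below are invented for illustration.

```python
# Matrix data: rows are design options, columns are numeric ratings
# against three criteria; weights express each criterion's importance.
weights = [0.5, 0.3, 0.2]  # e.g. cost, durability, ease of use
ratings = {
    "option A": [3, 5, 4],
    "option B": [5, 2, 3],
    "option C": [4, 4, 4],
}

# Score each option as the weighted sum of its ratings.
scores = {name: sum(w * r for w, r in zip(weights, row))
          for name, row in ratings.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # -> option C 4.0
```

Quantifying the matrix this way makes the trade-off explicit: option B wins on the heaviest criterion but loses overall to the more balanced option C.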
Object-Oriented Methodology (OOM) is a system development approach that encourages and facilitates the re-use of software components. We focus here on the re-usability of existing components using the Java language.
The document introduces Object Process Methodology (OPM) as a holistic modeling methodology that uses a single model to represent both the structure and behavior of systems via object-process diagrams and an object-process language. It discusses key OPM concepts such as objects, processes, and structural and procedural relationships. The document also presents a case study of modeling an e-paper project in OPM and compares OPM to UML, finding that OPM better supports understanding of system dynamics. Tools such as OPCAT for OPM modeling and collaboration are also summarized.
Survey on semi-supervised classification methods and feature selection (eSAT Journals)
Abstract: Data mining, also called knowledge discovery, is the process of analyzing data from several perspectives and summarizing it into useful information. It has tremendous application in classification areas such as pattern recognition, disease-type discovery, medical image analysis, speech recognition, biometric identification, and drug discovery. This is a survey of several semi-supervised classification methods used by classifiers, in which both labeled and unlabeled data can be used for classification; this is less expensive than other classification methods. The techniques surveyed in this paper are the low-density separation approach, transductive SVM, a semi-supervised logistic discriminant procedure, and the self-training nearest neighbour rule using cut edges. Along with classification methods, a review of various feature selection methods is also given. Feature selection is performed to reduce the dimension of large datasets; after reducing the attributes, the data is given to the classifier, so the accuracy and performance of the classification system can be improved. The feature selection methods covered include consistency-based feature selection, fuzzy entropy measure feature selection with a similarity classifier, signal-to-noise ratio, and positive approximation; each method has its own benefits. Index Terms: Semi-supervised classification, Transductive support vector machine, Feature selection, Unlabeled samples
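One of the surveyed ideas, self-training, can be sketched on invented 1-D toy data: fit a simple classifier (here 1-nearest-neighbour) on the labeled points, label the unlabeled point it is most confident about, add it to the training set, and repeat.

```python
def nn_label(x, labeled):
    """1-nearest-neighbour prediction from (position, label) pairs."""
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

labeled = [(0.0, "a"), (10.0, "b")]   # two labeled seed points
unlabeled = [1.0, 9.0, 4.0, 6.0]      # points to be self-labeled

while unlabeled:
    # Confidence proxy: distance to the nearest labeled point
    # (smaller is more confident). Take one point per pass.
    x = min(unlabeled, key=lambda u: min(abs(p[0] - u) for p in labeled))
    labeled.append((x, nn_label(x, labeled)))
    unlabeled.remove(x)

print(sorted(labeled))  # labels propagate outward from the seeds
```

The order matters: labeling confident points first lets their labels inform the harder points later, which is the essence of self-training (and also its risk, since early mistakes propagate).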
This document discusses different methods for simultaneous test assembly. It describes three main methods: 1) Simultaneous selection of items and sets using decision variables to model constraints, 2) Simultaneous selection using a "pivot item" to reduce variables, 3) Selection of all items per set by removing item decision variables when bounds are met. The goal of simultaneous assembly is to avoid problems of sequential assembly by solving for all tests at once.
Network Based Intrusion Detection System using Filter Based Feature Selection... (IRJET Journal)
This document proposes a mutual information-based feature selection algorithm to select optimal features for network intrusion detection classification. The algorithm aims to handle dependent data features better than previous methods. It evaluates the effectiveness of the algorithm on network intrusion detection cases. Most previous methods suffer from low detection rates and high false alarm rates. The proposed approach uses feature selection, filtering, clustering, and clustering ensemble techniques in a hybrid data mining method to achieve high accuracy for intrusion detection systems.
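The filter criterion named above, mutual information between a discrete feature and the class label, can be estimated from empirical counts. The toy feature and label vectors below are invented.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

label   = [0, 0, 1, 1, 0, 1, 0, 1]
feature = [0, 0, 1, 1, 0, 1, 0, 1]  # identical to the label: fully informative
noise   = [0, 1, 0, 1, 0, 1, 0, 1]  # nearly independent of the label

print(round(mutual_information(feature, label), 3))  # -> 1.0 (one full bit)
print(round(mutual_information(noise, label), 3))    # -> 0.189
```

A filter method would rank features by this score before any classifier is trained, which is what distinguishes it from wrapper approaches.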
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document discusses sampling techniques and procedures. It provides an overview of sample vs census studies, outlines the sampling design process, and classifies sampling techniques into nonprobability and probability methods. Specific techniques covered include convenience, judgmental, quota, snowball, simple random, systematic, stratified, and cluster sampling. The document also discusses choosing between nonprobability and probability sampling, and provides examples of procedures for drawing probability samples.
Software Cost Estimation Using Clustering and Ranking SchemeEditor IJMTER
Software cost estimation is an important task in the software design and development process.
Planning and budgeting tasks are carried out with reference to the software cost values. A variety of
software properties are used in the cost estimation process. Hardware, products, technology and
methodology factors are used in the cost estimation process. The software cost estimation quality is
measured with reference to the accuracy levels.
Software cost estimation is carried out using three types of techniques. They are regression based
model, anology based model and machine learning model. Each model has a set of technique for the
software cost estimation process. 11 cost estimation techniques fewer than 3 different categories are
used in the system. The Attribute Relational File Format (ARFF) is used maintain the software product
property values. The ARFF file is used as the main input for the system.
The proposed system is designed to perform the clustering and ranking of software cost
estimation methods. Non overlapped clustering technique is enhanced with optimal centroid estimation
mechanism. The system improves the clustering and ranking process accuracy. The system produces
efficient ranking results on software cost estimation methods.
Boost model accuracy of imbalanced covid 19 mortality predictionBindhuBhargaviTalasi
The document proposes using a GAN-based oversampling technique to boost the accuracy of an imbalanced COVID-19 mortality prediction model. It analyzes patient data to find correlations between features like gender and age with mortality. A GAN is used to generate synthetic minority class data to address data imbalance. The GAN-enhanced random forest model is evaluated using metrics like recall, precision and F1 score, and shows improved performance over the base model at predicting COVID-19 mortality.
Threshold benchmarking for feature ranking techniquesjournalBEEI
In prediction modeling, the choice of features chosen from the original feature set is crucial for accuracy and model interpretability. Feature ranking techniques rank the features by its importance but there is no consensus on the number of features to be cut-off. Thus, it becomes important to identify a threshold value or range, so as to remove the redundant features. In this work, an empirical study is conducted for identification of the threshold benchmark for feature ranking algorithms. Experiments are conducted on Apache Click dataset with six popularly used ranker techniques and six machine learning techniques, to deduce a relationship between the total number of input features (N) to the threshold range. The area under the curve analysis shows that ≃ 33-50% of the features are necessary and sufficient to yield a reasonable performance measure, with a variance of 2%, in defect prediction models. Further, we also find that the log2(N) as the ranker threshold value represents the lower limit of the range.
Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection (PC LO)
Botsis, T., Nguyen, M. D., Woo, E. J., Markatou, M., & Ball, R. (2011). Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association, 18(5), 631-638.
Taguchi design of experiments, Nov 24 2013 (Charlton Inao)
This document provides an overview of Taguchi design of experiments. It defines Taguchi methods, which use orthogonal arrays to conduct a minimal number of experiments that can provide full information on factors that affect performance. The assumptions of Taguchi methods include additivity of main effects. The key steps in an experiment are selecting variables and their levels, choosing an orthogonal array, assigning variables to columns, conducting experiments, and analyzing data through sensitivity analysis and ANOVA.
Using evolutionary testing to improve efficiency and quality (Faysal Ahmed)
Evolutionary testing uses genetic algorithms to automatically generate test cases that improve software testing efficiency and quality. It transforms testing goals into optimization problems solved through evolutionary algorithms. Experiments show evolutionary testing outperforms manual, random, and static testing by generating more effective test cases. It provides a systematic, automated approach for testing temporal behavior, safety, and structural coverage.
This document discusses factorial design, which is an experimental design technique used for optimization. It involves studying the effects of two or more factors simultaneously. There are two main types: full factorial design, which tests all possible combinations of factors and levels, and fractional factorial design, which reduces the number of runs when there are many factors. Factorial designs allow evaluation of both main effects and interaction effects. They are useful for formulation development and method optimization in chromatography. Software is used to analyze the results of factorial experiments.
The document discusses various black box testing techniques including:
1) Equivalence partitioning which creates input/output partitions to test valid and invalid values.
2) Boundary analysis which tests partition boundaries and values just beyond limits.
3) State transition testing which uses a state diagram to test all transitions and states.
It also briefly covers decision tables which present conditions and corresponding actions through rules in a table. Examples are provided to illustrate the techniques.
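The first two techniques above can be sketched in code. The example below assumes a simple integer input range (a hypothetical field accepting 18..65, not an example from the document):

```python
def boundary_values(low, high):
    """Boundary value analysis for an integer input range [low, high]:
    test each boundary, the values just inside, and just beyond the limits."""
    return sorted({low - 1, low, low + 1, high - 1, high, high + 1})

def equivalence_classes(low, high):
    """Equivalence partitioning: one representative value per partition
    (below range = invalid, inside range = valid, above range = invalid)."""
    return {"invalid_low": low - 1, "valid": (low + high) // 2, "invalid_high": high + 1}

# e.g. an age field accepting 18..65
print(boundary_values(18, 65))       # [17, 18, 19, 64, 65, 66]
print(equivalence_classes(18, 65))   # {'invalid_low': 17, 'valid': 41, 'invalid_high': 66}
```

Together these give a small, systematic test set instead of ad hoc input values.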
1. The document provides an introduction to regression models and panel data, outlining key concepts such as the definition of panel data, benefits of using panel data including controlling for individual heterogeneity, and limitations of panel data including problems with data collection.
2. Panel data involves observing the same cross-section of individuals, countries, firms etc. over multiple time periods, allowing analysis of both time and individual variability.
3. Using panel data offers advantages over cross-sectional or time series data alone, such as better accounting for unobserved heterogeneity and enabling analysis of dynamic adjustments over time.
Choice models are statistical models that attempt to capture the rational decision-making process by which individuals choose between options. There are several types of choice models including multinomial logit models, conditional fixed-effects logit models, alternative specific conditional (McFadden) models, ordered logit models, stereotype logistic models, and nested logit models. All choice models are estimated using maximum likelihood and allow the effects of independent variables to differ across outcome categories.
The document discusses operations research (OR), including its origins during WWII to optimize resource allocation, its goal of applying scientific principles to optimize complex business and organizational problems, and its use of quantitative modeling and analysis. OR aims to find the global optimum solution by analyzing relationships between system components. It uses interdisciplinary teams and scientific methods to develop mathematical and other models of real-world problems, which are then solved using techniques like linear programming. The models represent important variables and constraints. OR has wide applications in areas like the military, production, transportation, and resource allocation.
Statistics deals with the collection, organization, presentation, analysis, and interpretation of numerical data. Descriptive statistics summarize and describe data without generalizing, while inferential statistics makes generalizations from data. There are various methods for collecting data, including direct interviews, indirect questionnaires, observation, experimentation, and registration from agencies. Data can be presented textually, in tables, or graphically in charts and graphs. Population refers to all data for a phenomenon, while a sample is a subset that aims to represent the population.
IJRET: International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The document introduces the Matrix Data Analysis Diagram, one of the Seven New Quality Control Tools. The Matrix Data Analysis Diagram is a tool that uses numerical data to present and analyze relationships between two sets of factors in a matrix format. It is useful for studying parameters of production, market information, and new product planning by allowing the quantification of relationships. The document provides an example of how to construct a Matrix Data Analysis Diagram to prioritize design options based on criteria.
Object Oriented Methodology (OOM) is a system development approach encouraging and facilitating the re-use of software components. We focus on the re-usability of existing components using the Java language.
The document introduces Object Process Methodology (OPM) as a holistic modeling methodology that uses a single model to represent both the structure and behavior of systems using object-process diagrams and an object-process language. It discusses key OPM concepts like objects, processes, structural and procedural relationships. The document also presents a case study of modeling an epaper project in OPM and compares OPM to UML, finding OPM better supported understanding of system dynamics. Tools like OPCAT for OPM modeling and collaboration are also summarized.
Survey on semi supervised classification methods and feature selection (eSAT Journals)
Abstract: Data mining, also called knowledge discovery, is a process of analyzing data from several perspectives and summarizing it into useful information. It has tremendous application in the area of classification, such as pattern recognition, discovering disease types, analysis of medical images, speech recognition, biometric identification, drug discovery, etc. This is a survey of several semi-supervised classification methods used by classifiers, in which both labeled and unlabeled data can be used for classification; it is less expensive than other classification methods. The techniques surveyed in this paper are the low density separation approach, transductive SVM, a semi-supervised logistic discriminant procedure, and the self-training nearest neighbour rule using cut edges. Along with classification methods, a review of various feature selection methods is also included. Feature selection is performed to reduce the dimension of large datasets; after reducing the attributes, the data is given to the classifier, so the accuracy and performance of the classification system can be improved. Feature selection methods covered include consistency based feature selection, a fuzzy entropy measure with a similarity classifier, signal to noise ratio, and positive approximation; each method has its own benefits. Index Terms: Semi-supervised classification, Transductive support vector machine, Feature selection, Unlabeled samples
This document discusses different methods for simultaneous test assembly. It describes three main methods: 1) Simultaneous selection of items and sets using decision variables to model constraints, 2) Simultaneous selection using a "pivot item" to reduce variables, 3) Selection of all items per set by removing item decision variables when bounds are met. The goal of simultaneous assembly is to avoid problems of sequential assembly by solving for all tests at once.
Network Based Intrusion Detection System using Filter Based Feature Selection... (IRJET Journal)
This document proposes a mutual information-based feature selection algorithm to select optimal features for network intrusion detection classification. The algorithm aims to handle dependent data features better than previous methods. It evaluates the effectiveness of the algorithm on network intrusion detection cases. Most previous methods suffer from low detection rates and high false alarm rates. The proposed approach uses feature selection, filtering, clustering, and clustering ensemble techniques in a hybrid data mining method to achieve high accuracy for intrusion detection systems.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
Optimization Technique for Feature Selection and Classification Using Support... (IJTET Journal)
Abstract: Classification problems often have a large number of features in the data sets, but only some of them are useful for classification. Data mining performance is reduced by irrelevant and redundant features. Feature selection aims to choose a small number of relevant features to achieve similar or even better classification performance than using all features. It has two main objectives: maximizing the classification performance and minimizing the number of features. Moreover, the existing feature selection algorithms treat the task as a single objective problem. Attribute selection is done by a combination of attribute evaluator and search method using the WEKA machine learning tool. We compare the SVM classification algorithm, automatically classifying the data using the selected features, on different standard datasets.
Feature selection is one of the most fundamental steps in machine learning. It is closely related to
dimensionality reduction. A commonly used approach in feature selection is ranking the individual
features according to some criteria and then search for an optimal feature subset based on an evaluation
criterion to test the optimality. The objective of this work is to predict more accurately the presence of
Learning Disability (LD) in school-aged children with a reduced number of symptoms. For this purpose, a
novel hybrid feature selection approach is proposed by integrating a popular Rough Set based feature
ranking process with a modified backward feature elimination algorithm. The process of feature ranking
follows a method of calculating the significance or priority of each symptom of LD according to its
contribution in representing the knowledge contained in the dataset. Each symptom's significance or
priority value reflects its relative importance for predicting LD among the various cases. Then, by eliminating
least significant features one by one and evaluating the feature subset at each stage of the process, an
optimal feature subset is generated. For comparative analysis and to establish the importance of rough set
theory in feature selection, the backward feature elimination algorithm is combined with two state-of-the-art
filter-based feature ranking techniques, viz. information gain and gain ratio. The experimental results
show that the proposed feature selection approach outperforms the other two in terms of data reduction.
Also, the proposed method eliminates all the redundant attributes efficiently from the LD dataset without
sacrificing the classification performance.
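The ranking-then-elimination loop described above can be sketched roughly as follows. The significance scores and the evaluation function here are placeholders, not the paper's rough-set computation:

```python
def backward_elimination(features, significance, evaluate):
    """Eliminate the least significant feature one at a time, keeping each
    removal only if the evaluation score does not degrade.
    `significance` maps feature -> priority score (higher = more important);
    `evaluate` scores a feature subset (placeholder for classifier accuracy)."""
    subset = sorted(features, key=lambda f: significance[f])  # least significant first
    best = evaluate(subset)
    i = 0
    while i < len(subset) and len(subset) > 1:
        candidate = subset[:i] + subset[i + 1:]
        score = evaluate(candidate)
        if score >= best:          # removal did not hurt: accept it
            subset, best = candidate, score
        else:
            i += 1                 # keep this feature, try the next one
    return subset, best

# toy example: only features 'a' and 'b' matter to the (hypothetical) evaluator
sig = {"a": 0.9, "b": 0.7, "c": 0.1, "d": 0.05}
eval_fn = lambda s: 1.0 if {"a", "b"} <= set(s) else 0.5
print(backward_elimination(list(sig), sig, eval_fn))  # keeps only 'a' and 'b'
```

In the paper's setting, `significance` would come from the rough-set ranking and `evaluate` from the classifier's performance on the LD dataset.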
Classification problems in high dimensional data with a small number of observations are becoming common, in particular for microarray data. In the last two decades, many efficient classification models and feature selection (FS) algorithms, also referred to as FS techniques, have been proposed for higher prediction accuracy. However, the output of an FS algorithm in terms of prediction accuracy can be unstable over variations in the training set, especially in high dimensional data. In this paper we present a new evaluation measure, the Q-statistic, that incorporates the stability of the selected feature subset in addition to prediction accuracy. We then propose Booster, a standard wrapper over an FS algorithm that boosts the value of the Q-statistic of the algorithm applied. A study on synthetic data and 14 microarray data sets shows that Booster improves not only the value of the Q-statistic but also the prediction accuracy of the applied algorithm.
Unsupervised Feature Selection Based on the Distribution of Features Attribut... (Waqas Tariq)
Since dealing with high dimensional data is computationally complex and sometimes even intractable, several feature reduction methods have recently been developed to reduce the dimensionality of the data and simplify the analysis in applications such as text categorization, signal processing, image retrieval, and gene expression. Among feature reduction techniques, feature selection is one of the most popular methods due to its preservation of the original features. However, most current feature selection methods do not perform well when fed imbalanced data sets, which are pervasive in real world applications. In this paper, we propose a new unsupervised feature selection method suited to imbalanced data sets, which removes redundant features from the original feature space based on the distribution of features. To show the effectiveness of the proposed method, popular feature selection methods have been implemented and compared. Experimental results on several imbalanced data sets, derived from the UCI repository, illustrate the effectiveness of our proposed method in comparison with the other compared methods in terms of both accuracy and the number of selected features.
This document is a research proposal on attribute selection and representation for software defect prediction. The proposal discusses limitations in existing attribute selection methods and the importance of pre-processing data. It aims to propose a new attribute selection method that improves accuracy by addressing shortcomings, and to study appropriate classifiers. The methodology involves a literature review on pre-processing, attribute selection and classification methods. It will then propose and implement a new attribute selection process, compare it using different classifiers and pre-processing, and evaluate it against existing techniques in a technical report.
A Review on Feature Selection Methods For Classification Tasks (Editor IJCATR)
In recent years, application of feature selection methods in medical datasets has greatly increased. The challenging task in
feature selection is how to obtain an optimal subset of relevant and non redundant features which will give an optimal solution without
increasing the complexity of the modeling task. Thus, there is a need to make practitioners aware of feature selection methods that have
been successfully applied in medical data sets and highlight future trends in this area. The findings indicate that most existing feature selection methods depend on univariate ranking that does not take into account interactions between variables, overlook the stability of the selection algorithms, and that the methods which produce good accuracy employ a larger number of features. However, developing a universal method that achieves the best classification accuracy with fewer features is still an open research area.
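The univariate-ranking limitation noted above is easy to demonstrate: scoring each feature in isolation misses interactions. The sketch below ranks features by absolute correlation with the label on a hypothetical XOR-style toy dataset (names `f1`..`f3` are made up for illustration):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def univariate_rank(X, y):
    """Rank features by |correlation| with the label, each in isolation.
    This is exactly the weakness noted above: interactions are ignored."""
    scores = {f: abs(pearson(col, y)) for f, col in X.items()}
    return sorted(scores, key=scores.get, reverse=True)

# f1 and f2 jointly determine y (y = f1 XOR f2) but are individually useless
X = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "f3": [0, 1, 1, 1]}
y = [0, 1, 1, 0]
print(univariate_rank(X, y))  # f3 ranks first despite the XOR pair carrying the signal
```

Multivariate methods (wrappers, embedded methods) exist precisely to catch such interactions.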
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase... (ijaia)
Feature selection and classification are essential processes in dealing with large data sets that comprise a large number of input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve the classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and the ensemble classifiers method for the classification task. Results are reported in terms of percentage accuracy, number of selected attributes, and rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes with the ensemble rule classifiers method. The experimental results, conducted on public real datasets, demonstrate that the ensemble rule classifiers method consistently improves classification accuracy on the selected datasets. Significant improvement in accuracy and an optimal set of selected attributes are achieved by adopting the ensemble rule classifiers method.
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI... (csandit)
Attribute reduction and classification are essential processes in dealing with large data sets that comprise a large number of input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve the classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and the ensemble classifiers method for the classification task. Results are reported in terms of percentage accuracy, number of selected attributes, and rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes with the ensemble rule classifiers method. The experimental results, conducted on public real datasets, demonstrate that the ensemble rule classifiers method consistently improves classification accuracy on the selected datasets. Significant improvement in accuracy and an optimal set of selected attributes are achieved by adopting the ensemble rule classifiers method.
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES (cscpconf)
In this paper we focus on an efficient feature selection method for the classification of audio files. The main objective is feature selection and extraction. We select a set of features for further analysis, which represent the elements of the feature vector. By the extraction method we compute a numerical representation that can be used to characterize the audio using an existing toolbox. In this study, Gain Ratio (GR) is used as the feature selection measure. GR is used to select the splitting attribute that separates the tuples into different classes. Pulse clarity is considered as a subjective measure and is used to calculate the gain of features of audio files. The splitting criterion is employed in the application to identify the class, or music genre, of a specific audio file from the testing database. Experimental results indicate that by using GR the application can produce a satisfactory result for music genre classification. After dimensionality reduction, the best three features are selected out of the various features of an audio file, and with this technique we obtain more than 90% classification accuracy.
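Gain ratio as a splitting measure can be sketched as follows. This is a generic implementation of the standard formula (information gain divided by split information), not the toolbox code the paper uses, and the tiny genre example is made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Gain ratio = information gain / split information, where the split
    partitions the samples by the (discrete) feature's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)  # entropy of the partition itself
    return gain / split_info if split_info else 0.0

# a feature that splits two genres perfectly
x = ["low", "low", "high", "high"]
y = ["rock", "rock", "jazz", "jazz"]
print(gain_ratio(x, y))  # 1.0: gain = 1 bit, split information = 1 bit
```

Dividing by split information penalizes attributes with many distinct values, which plain information gain would favor.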
Effective Feature Selection for Feature Possessing Group Structure (rahulmonikasharma)
This document proposes a new method called efficient group variable selection (EGVS) for feature selection when features have a group structure. EGVS has two stages: 1) within-group variable selection evaluates each feature individually to select discriminative features within each group. 2) Between-group variable selection re-evaluates all features to remove redundancy and obtain an optimal subset by considering relationships between groups. The method is demonstrated on benchmark datasets, showing it increases classification accuracy by leveraging the group structure during feature selection.
A Threshold fuzzy entropy based feature selection method applied in various b... (IJMER)
Large amounts of data have been stored and manipulated using various database technologies, and processing all the attributes for a particular task is difficult. To avoid such difficulties, a feature selection process is applied. In this paper, we collect eight benchmark datasets from the UCI repository. Feature selection is carried out using a fuzzy entropy based relevance measure algorithm and follows three selection strategies: mean selection, half selection, and neural network based threshold selection. After the features are selected, they are evaluated using Radial Basis Function (RBF) network, Stacking, Bagging, AdaBoostM1, and Ant-miner classification methodologies. The test results show that the neural network threshold selection strategy works well in selecting features, and that the Ant-miner methodology works best in producing better accuracy with the selected features than processing the original dataset. The results of this experiment clearly show that Ant-miner is superior to the other classifiers. Thus, the proposed Ant-miner algorithm could be a more suitable method for producing good results with fewer features than the original datasets.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Study on Relevance Feature Selection Methods (IRJET Journal)
This document summarizes research on feature selection methods. It discusses how feature selection is used to reduce dimensionality when working with large datasets that have thousands of variables. Several feature selection algorithms are examined, including ant colony optimization, quadratic programming, variable ranking using filter, wrapper and embedded methods, and fast correlation-based filtering with sequential forward selection. Feature selection can improve classification efficiency and understanding of data by identifying the most meaningful features.
A Survey on Classification of Feature Selection Strategies (ijtsrd)
Feature selection is an important part of machine learning. Feature selection refers to the process of reducing the inputs for processing and analysis, or of finding the most meaningful inputs. A related term, feature engineering (or feature extraction), refers to the process of extracting useful information or features from existing data. Mining of particular information related to a concept is done on the basis of the features of the data; accessing these features for data retrieval can be termed the feature extraction mechanism. Different types of feature extraction methods are in use. In this paper, the different feature selection methodologies are examined in terms of the need for and the method adopted for feature selection. Three types of methods are mainly available: Shannon's Entropy, Bayesian with K2 prior, and Bayesian Dirichlet with uniform prior (default). The objective of this survey paper is to identify the existing contributions made using the above mentioned algorithms and the results obtained. R. Pradeepa | K. Palanivel, "A Survey on Classification of Feature Selection Strategies", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2, Issue-2, February 2018. URL: http://www.ijtsrd.com/papers/ijtsrd9632.pdf http://www.ijtsrd.com/computer-science/data-miining/9632/a-survey-on-classification-of-feature-selection-strategies/r-pradeepa
The document discusses test case optimization using genetic and tabu search algorithms for structural testing. It proposes a hybrid algorithm that combines genetic algorithm and tabu search algorithm. Genetic algorithm is initially used to generate test cases, but it can get stuck in local optima. The hybrid approach uses tabu search to improve the mutation step of genetic algorithm. This helps guide the search away from previously visited solutions and avoid local optima, improving test case optimization. An experiment on a voter validation form showed the hybrid approach produced a more efficient test suite compared to genetic algorithm alone.
This document presents an evaluation scheme for hierarchical information browsing structures. The scheme evaluates comprehensiveness, coherence, and correctness. It was tested in a case study using surveys of domain experts to evaluate an automatic category generation algorithm. The study found the algorithm's strengths and weaknesses with good reliability. The evaluation scheme provides a low-cost way to gather specific feedback for improving browsing structures.
Similar to Feature selection for classification:
ACEP Magazine, 4th edition, launched on 05.06.2024 (Rahul)
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS (IJNSA Journal)
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threats and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system.
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
132 M. Dash, H. Liu / Intelligent Data Analysis 1 (1997) 131–156
algorithm and yields a more general concept. This helps in getting a better insight into the underlying
concept of a real-world classification problem [23,24]. Feature selection methods try to pick a subset of
features that are relevant to the target concept.
Feature selection is defined by many authors by looking at it from various angles. But as expected,
many of those are similar in intuition and/or content. The following lists those that are conceptually
different and cover a range of definitions.
1. Idealized: find the minimally sized feature subset that is necessary and sufficient to the target
concept [22].
2. Classical: select a subset of M features from a set of N features, M < N, such that the value of a
criterion function is optimized over all subsets of size M [34].
3. Improving prediction accuracy: the aim of feature selection is to choose a subset of features for
improving prediction accuracy or decreasing the size of the structure without significantly decreasing
prediction accuracy of the classifier built using only the selected features [24].
4. Approximating original class distribution: the goal of feature selection is to select a small subset
such that the resulting class distribution, given only the values for the selected features, is as close as
possible to the original class distribution given all feature values [24].
Notice that the third definition emphasizes the prediction accuracy of a classifier, built using only the
selected features, whereas the last definition emphasizes the class distribution given the training dataset.
These two are quite different conceptually. Hence, our definition considers both factors.
Feature selection attempts to select the minimally sized subset of features according to the following criteria:
1. the classification accuracy does not significantly decrease; and
2. the resulting class distribution, given only the values for the selected features, is as close as possible
to the original class distribution, given all features.
Ideally, feature selection methods search through the subsets of features and try to find the best one among the competing 2^N candidate subsets according to some evaluation function. However, this procedure is exhaustive, as it tries to find only the best one. It may be too costly and practically prohibitive, even for a medium-sized feature set size (N). Other methods based on heuristic or random search
methods attempt to reduce computational complexity by compromising performance. These methods
need a stopping criterion to prevent an exhaustive search of subsets. In our opinion, there are four basic
steps in a typical feature selection method (see Figure 1):
1. a generation procedure to generate the next candidate subset;
2. an evaluation function to evaluate the subset under examination;
3. a stopping criterion to decide when to stop; and
4. a validation procedure to check whether the subset is valid.
The generation procedure is a search procedure [46,26]. Basically, it generates subsets of features for
evaluation. The generation procedure can start: (i) with no features, (ii) with all features, or (iii) with
a random subset of features. In the first two cases, features are iteratively added or removed, whereas
in the last case, features are either iteratively added or removed or produced randomly thereafter [26].
An evaluation function measures the goodness of a subset produced by some generation procedure, and
this value is compared with the previous best. If it is found to be better, then it replaces the previous best subset.

Fig. 1. Feature selection process with validation.

Without a suitable stopping criterion the feature selection process may run exhaustively or forever
through the space of subsets. Generation procedures and evaluation functions can influence the choice for
a stopping criterion. Stopping criteria based on a generation procedure include: (i) whether a predefined
number of features are selected, and (ii) whether a predefined number of iterations is reached. Stopping
criteria based on an evaluation function can be: (i) whether addition (or deletion) of any feature does
not produce a better subset; and (ii) whether an optimal subset according to some evaluation function
is obtained. The loop continues until some stopping criterion is satisfied. The feature selection process
halts by outputting a selected subset of features to a validation procedure. There are many variations
to this three-step feature selection process, which are discussed in Section 3. The validation procedure
is not a part of the feature selection process itself, but a feature selection method (in practice) must be
validated. It tries to test the validity of the selected subset by carrying out different tests, and comparing
the results with previously established results, or with the results of competing feature selection methods
using artificial datasets, real-world datasets, or both.
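The four basic steps above can be sketched in code. The following Python fragment is illustrative only: the subset enumeration, the scoring function, and the iteration cap stand in for whichever generation procedure, evaluation function, and stopping criterion a concrete method uses (validation would follow on the returned subset).

```python
import itertools

def feature_selection(features, evaluate, max_iterations=100):
    """Generic loop: generation, evaluation, stopping criterion."""
    best_subset, best_score = None, float("-inf")
    # Generation procedure: here, enumerate subsets by increasing size.
    candidates = itertools.chain.from_iterable(
        itertools.combinations(features, k)
        for k in range(1, len(features) + 1))
    for iteration, subset in enumerate(candidates):
        if iteration >= max_iterations:      # stopping criterion
            break
        score = evaluate(subset)             # evaluation function
        if score > best_score:               # compare with the previous best
            best_subset, best_score = set(subset), score
    return best_subset, best_score

# Toy evaluation: reward subsets containing "A", penalize subset size.
subset, score = feature_selection(
    ["A", "B", "C"], lambda s: ("A" in s) * 10 - len(s))
```

With this toy evaluation function the loop returns the singleton {"A"}, since adding any further feature only lowers the score.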
There have been quite a few attempts to study feature selection methods based on some framework or
structure. Prominent among these are Doak’s [13] and Siedlecki and Sklansky’s [46] surveys. Siedlecki
and Sklansky discussed the evolution of feature selection methods and grouped the methods into
past, present, and future categories. Their main focus was the branch and bound methods [34] and
its variants [16]. No experimental study was conducted in that survey, which was published in
the year 1987, and since then many new and efficient methods have been introduced (e.g., Focus [2],
Relief [22], LVF [28]). Doak followed a similar approach to Siedlecki and Sklansky’s survey and grouped
the different search algorithms and evaluation functions used in feature selection methods independently,
and ran experiments using some combinations of evaluation functions and search procedures.
In this article, a survey is conducted for feature selection methods starting from the early 1970’s [33]
to the most recent methods [28]. In the next section, the two major steps of feature selection (generation
procedure and evaluation function) are divided into different groups, and 32 different feature selection
methods are categorized based on the type of generation procedure and evaluation function that is used.
This framework helps in finding the unexplored combinations of generation procedures and evaluation
functions. In Section 3 we briefly discuss the methods under each category, and select a representative
method for a detailed description using a simple dataset. Section 4 describes an empirical comparison
of the representative methods using three artificial datasets suitably chosen to highlight their benefits
and limitations. Section 5 consists of discussions on various data set characteristics that influence the
choice of a suitable feature selection method, and some guidelines, regarding how to choose a feature
selection method for an application at hand, are given based on a number of criteria extracted from these
characteristics of data. This paper concludes in Section 6 with further discussions on future research
based on the findings of Section 2. Our objective is that this article will assist in finding better feature
selection methods for given applications.
2. Study of Feature Selection Methods
In this section we categorize the two major steps of feature selection: generation procedure and
evaluation function. The different types of evaluation functions are compared based on a number of
criteria. A framework is presented in which a total of 32 methods are grouped based on the types of
generation procedure and evaluation function used in them.
2.1. Generation Procedures
If the original feature set contains N features, then the total number of competing candidate subsets to be generated is 2^N. This is a huge number even for a medium-sized N. There are different approaches for solving this problem, namely: complete, heuristic, and random.
2.1.1. Complete
This generation procedure does a complete search for the optimal subset according to the evaluation
function used. An exhaustive search is complete. However, Schlimmer [43] argues that “just because the
search must be complete does not mean that it must be exhaustive.” Different heuristic functions are used
to reduce the search without jeopardizing the chances of finding the optimal subset. Hence, although the order of the search space is O(2^N), fewer subsets are evaluated. The optimality of the feature subset,
according to the evaluation function, is guaranteed because the procedure can backtrack. Backtracking
can be done using various techniques, such as: branch and bound, best first search, and beam search.
2.1.2. Heuristic
In each iteration of this generation procedure, all remaining features yet to be selected (rejected) are
considered for selection (rejection). There are many variations to this simple process, but generation of
subsets is basically incremental (either increasing or decreasing). The order of the search space is O(N^2) or less; some exceptions are Relief [22] and DTM [9], which are discussed in detail in the next section. These
procedures are very simple to implement and very fast in producing results, because the search space is
only quadratic in terms of the number of features.
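This incremental scheme is easiest to see in sequential forward selection; the following Python sketch (ours, not a particular surveyed method) adds, at each iteration, the remaining feature whose inclusion scores best.

```python
def sequential_forward_selection(features, evaluate, m):
    """Heuristic generation: grow the subset one best feature at a time."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < m:
        # Consider every remaining feature for selection.
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluation: subset goodness is the sum of per-feature scores.
scores = {"A": 3, "B": 1, "C": 2}
result = sequential_forward_selection(
    scores, lambda s: sum(scores[f] for f in s), 2)
```

Each of the m iterations evaluates at most N candidate subsets, which gives the quadratic bound mentioned above.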
2.1.3. Random
This generation procedure is rather new in its use in feature selection methods compared to the other two categories. Although the search space is O(2^N), these methods typically search fewer than 2^N subsets by setting a maximum number of possible iterations. Optimality of the selected subset
depends on the resources available. Each random generation procedure would require values of some
parameters. Assignment of suitable values to these parameters is an important task for achieving good
results.
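A random generation procedure can be sketched as follows; this Las-Vegas-style fragment is our own illustration, with the maximum number of iterations and the random seed as the parameters that must be assigned.

```python
import random

def random_generation(features, evaluate, max_iter=100, seed=0):
    """Random generation: sample subsets for a bounded number of iterations."""
    rng = random.Random(seed)
    best_subset, best_score = set(features), evaluate(set(features))
    for _ in range(max_iter):
        # Each candidate subset is produced randomly.
        subset = {f for f in features if rng.random() < 0.5}
        if not subset:
            continue
        score = evaluate(subset)
        # Prefer better scores; break ties toward smaller subsets.
        if score > best_score or (score == best_score
                                  and len(subset) < len(best_subset)):
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy evaluation: the score depends only on whether "A" is included.
subset, score = random_generation(["A", "B", "C"], lambda s: int("A" in s))
```

The quality of the result depends on the resources spent: with more iterations, smaller subsets achieving the best score are more likely to be sampled.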
2.2. Evaluation Functions
An optimal subset is always relative to a certain evaluation function (i.e., an optimal subset chosen using one evaluation function may not be the same as one chosen using another).
Typically, an evaluation function tries to measure the discriminating ability of a feature or a subset
to distinguish the different class labels. Langley [26] grouped different feature selection methods into
two broad groups (i.e., filter and wrapper) based on their dependence on the inductive algorithm that
will finally use the selected subset. Filter methods are independent of the inductive algorithm, whereas
wrapper methods use the inductive algorithm as the evaluation function. Ben-Bassat [4] grouped the
evaluation functions existing until 1982 into three categories: information or uncertainty measures, distance measures, and dependence measures, and suggested that the dependence measures can be divided between the first two categories. He did not consider the classification error rate as an
evaluation function, as no wrapper method existed in 1982. Doak [13] divided the evaluation functions
into three categories: data intrinsic measures, classification error rate, and estimated or incremental
error rate, where the third category is basically a variation of the second category. The data intrinsic
category includes distance, entropy, and dependence measures. Considering these divisions and the
latest developments, we divide the evaluation functions into five categories: distance, information (or
uncertainty), dependence, consistency, and classifier error rate. In the following subsections we briefly
discuss each of these types of evaluation functions.
2.2.1. Distance Measures
Distance measures are also known as separability, divergence, or discrimination measures. For a two-class problem, a
feature X is preferred to another feature Y if X induces a greater difference between the two-class
conditional probabilities than Y ; if the difference is zero, then X and Y are indistinguishable. An example
is the Euclidean distance measure.
2.2.2. Information Measures
These measures typically determine the information gain from a feature. The information gain from a
feature X is defined as the difference between the prior uncertainty and expected posterior uncertainty
using X. Feature X is preferred to feature Y if the information gain from feature X is greater than that
from feature Y (e.g., entropy measure) [4].
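For illustration (our own example, using the entropy measure), the information gain of a feature over a binary-class sample can be computed as the prior uncertainty minus the expected posterior uncertainty:

```python
from math import log2

def entropy(labels):
    """Class entropy of a list of 0/1 labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * log2(p)
    return h

def information_gain(feature_values, labels):
    """Prior uncertainty minus expected posterior uncertainty given the feature."""
    n = len(labels)
    posterior = 0.0
    for v in set(feature_values):
        group = [l for f, l in zip(feature_values, labels) if f == v]
        posterior += len(group) / n * entropy(group)
    return entropy(labels) - posterior

# A perfectly informative feature gains the full prior entropy (1 bit here);
# an irrelevant one gains nothing.
g1 = information_gain([0, 0, 1, 1], [0, 0, 1, 1])
g2 = information_gain([0, 1, 0, 1], [0, 0, 1, 1])
```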
2.2.3. Dependence Measures
Dependence measures or correlation measures quantify the ability to predict the value of one variable from the value of another. The correlation coefficient is a classical dependence measure and can be used to find the
correlation between a feature and a class. If the correlation of feature X with class C is higher than the
correlation of feature Y with C, then feature X is preferred to Y . A slight variation of this is to determine
the dependence of a feature on other features; this value indicates the degree of redundancy of the feature.
All evaluation functions based on dependence measures can be divided between distance and information measures. But these are still kept as a separate category because, conceptually, they represent a different viewpoint [4]. More about the above three measures can be found in Ben-Bassat's [4] survey.

Table 1
A Comparison of Evaluation Functions

Evaluation Function     Generality   Time Complexity   Accuracy
Distance Measure        Yes          Low               –
Information Measure     Yes          Low               –
Dependence Measure      Yes          Low               –
Consistency Measure     Yes          Moderate          –
Classifier Error Rate   No           High              Very High
2.2.4. Consistency Measures
These measures are rather new and have received much attention recently. These are characteristically
different from other measures, because of their heavy reliance on the training dataset and the use of the
Min-Features bias in selecting a subset of features [3]. Min-Features bias prefers consistent hypotheses
definable over as few features as possible. More about this bias is discussed in Section 3.6. These
measures find the minimally sized subset that satisfies an acceptable inconsistency rate, which is usually set by the user.
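One common way to compute such an inconsistency rate, shown here as our own illustration, counts, within every group of instances that agree on the selected features, the instances outside the group's majority class:

```python
from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, subset):
    """Fraction of instances that disagree with the majority class of the
    group of instances sharing their projection onto `subset`."""
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        groups[tuple(inst[f] for f in subset)].append(label)
    inconsistent = sum(len(g) - max(Counter(g).values())
                       for g in groups.values())
    return inconsistent / len(instances)

data = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 0}, {0: 1, 1: 1}]
labels = [0, 1, 1, 1]
r_full = inconsistency_rate(data, labels, [0, 1])  # fully consistent
r_one = inconsistency_rate(data, labels, [0])      # one clash in pattern (0,)
```

A search under the Min-Features bias would then prefer the smallest subset whose rate does not exceed the user-set limit.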
2.2.5. Classifier Error Rate Measures
The methods using this type of evaluation function are called “wrapper methods”, (i.e., the classifier is
the evaluation function). As the features are selected using the classifier that later on uses these selected
features in predicting the class labels of unseen instances, the accuracy level is very high although
computationally quite costly [21].
Table 1 shows a comparison of various evaluation functions irrespective of the type of generation
procedure used. The different parameters used for the comparison are:
1. generality: how suitable is the selected subset for different classifiers (not just for one classifier);
2. time complexity: time taken for selecting the subset of features; and
3. accuracy: how accurate is the prediction using the selected subset.
The ‘–’ in the last column means that nothing can be concluded about the accuracy of the
corresponding evaluation function. Except ‘classifier error rate,’ the accuracy of all other evaluation
functions depends on the data set and the classifier (for classification after feature selection) used. The
results in Table 1 show an unsurprising trend (i.e., the more time spent, the higher the accuracy). The table also tells us which measure should be used under different circumstances; for example, under time constraints, or when there are several classifiers to choose from, the classifier error rate should not be selected as the evaluation function.
2.3. The Framework
In this subsection, we suggest a framework on which the feature selection methods are categorized.
Generation procedures and evaluation functions are considered as two dimensions, and each method is
grouped depending on the type of generation procedure and evaluation function used. To our knowledge, there has not been any attempt to group the methods considering both steps. First, we have chosen a total of 32 methods from the literature, and then these are grouped according to the combination of generation procedure and evaluation function used (see Table 2).

Table 2
Two-Dimensional Categorization of Feature Selection Methods

Distance measure:
  I    (Heuristic, Sec 3.1):  Relief, Relief-F, Sege84
  II   (Complete,  Sec 3.2):  Branch and Bound, BFF, Bobr88
  III  (Random):              none
Information measure:
  IV   (Heuristic, Sec 3.3):  DTM, Koll-Saha96
  V    (Complete,  Sec 3.4):  MDLM
  VI   (Random):              none
Dependence measure:
  VII  (Heuristic, Sec 3.5):  POE+ACC, PRESET
  VIII (Complete):            none
  IX   (Random):              none
Consistency measure:
  X    (Heuristic):           none
  XI   (Complete,  Sec 3.6):  Focus, Schl93, MIFES-1
  XII  (Random,    Sec 3.7):  LVF
Classifier error rate:
  XIII (Heuristic, Sec 3.8):  SBS, SFS, SBS-SLASH, PQSS, BDS, Moor-Lee94, RC, Quei-Gels84
  XIV  (Complete,  Sec 3.8):  Ichi-Skla84a, Ichi-Skla84b, AMB&B, BS
  XV   (Random,    Sec 3.8):  LVW, GA, SA, RGSS, RMHC-PF1
A distinct achievement of this framework is the finding of a number of combinations of generation
procedure and evaluation function (the empty boxes in the table) that do not appear in any existing
method yet (to the best of our knowledge).
In the framework, a column stands for a type of generation procedure and a row stands for a type
of evaluation function. The assignment of evaluation functions within the categories may be equivocal,
because several evaluation functions may be placed in different categories when considered from different perspectives, and one evaluation function may be obtained as a mathematical transformation
of another evaluation function [4]. We have tried to resolve this as naturally as possible. In the next
two sections we explain each category and methods, and choose a method in each category, for a detailed
discussion using pseudo code and a mini-dataset (Section 3), and for an empirical comparison (Section 4).
3. Categories and Their Differences
There are a total of 15 combinations of generation procedures and evaluation functions in Table 2 (i.e., 5 types of evaluation functions and 3 types of generation procedures). The empty entries in the table
signify that no method exists for these combinations. These combinations are the potential future work.
In this section, we discuss each category, by briefly describing the methods under it, and then, choosing
a representative method for each category. We explain it in detail with the help of its pseudo code and
a hand-run over a prototypical dataset. The methods in the last row represent the “wrapper” methods,
where the evaluation function is the “classifier error rate”. A typical wrapper method can use different kinds of classifiers for evaluation; hence no representative method is chosen for the categories under this evaluation function. Instead, they are discussed briefly as a whole towards the end of this section. The prototypical dataset used for the hand-run of the representative methods is shown in Table 3; it consists of 16 instances (of the original 32) of the CorrAL dataset. This mini-dataset has binary classes and six boolean features (A0, A1, B0, B1, I, C), where feature I is irrelevant, feature C is correlated to the class label 75% of the time, and the other four features are relevant to the boolean target concept: (A0 ∧ A1) ∨ (B0 ∧ B1). In all the pseudo codes, D denotes the training set, S is the original feature set, N is the number of features, T is the selected subset, and M is the number of selected (or required) features.

Table 3
Sixteen Instances of CorrAL

 #  A0  A1  B0  B1  I  C  Class     #  A0  A1  B0  B1  I  C  Class
 1   0   0   0   0  0  1    0       9   1   0   0   0  1  1    0
 2   0   0   0   1  1  1    0      10   1   0   0   1  1  0    0
 3   0   0   1   0  0  1    0      11   1   0   1   0  0  1    0
 4   0   0   1   1  0  0    1      12   1   0   1   1  0  0    1
 5   0   1   0   0  0  1    0      13   1   1   0   0  0  0    1
 6   0   1   0   1  1  1    0      14   1   1   0   1  0  1    1
 7   0   1   1   0  1  0    0      15   1   1   1   0  1  0    1
 8   0   1   1   1  0  0    1      16   1   1   1   1  0  0    1
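Table 3 can be checked programmatically; the short Python sketch below (ours, not from the paper) encodes the 16 instances and confirms that the class column equals the boolean target concept (A0 ∧ A1) ∨ (B0 ∧ B1), with I and C playing no part:

```python
# Rows of Table 3: (A0, A1, B0, B1, I, C, Class).
corral = [
    (0, 0, 0, 0, 0, 1, 0), (0, 0, 0, 1, 1, 1, 0),
    (0, 0, 1, 0, 0, 1, 0), (0, 0, 1, 1, 0, 0, 1),
    (0, 1, 0, 0, 0, 1, 0), (0, 1, 0, 1, 1, 1, 0),
    (0, 1, 1, 0, 1, 0, 0), (0, 1, 1, 1, 0, 0, 1),
    (1, 0, 0, 0, 1, 1, 0), (1, 0, 0, 1, 1, 0, 0),
    (1, 0, 1, 0, 0, 1, 0), (1, 0, 1, 1, 0, 0, 1),
    (1, 1, 0, 0, 0, 0, 1), (1, 1, 0, 1, 0, 1, 1),
    (1, 1, 1, 0, 1, 0, 1), (1, 1, 1, 1, 0, 0, 1),
]

def target_concept(a0, a1, b0, b1):
    """The boolean target concept: (A0 and A1) or (B0 and B1)."""
    return int((a0 and a1) or (b0 and b1))

# The class column is fully determined by A0, A1, B0, B1; I and C are unused.
all_match = all(target_concept(a0, a1, b0, b1) == cls
                for a0, a1, b0, b1, i, c, cls in corral)
```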
3.1. Category I: Generation—Heuristic, Evaluation—Distance
3.1.1. Brief Description of Various Methods
As seen from Table 2, the most prominent method in this category is Relief [22]. We first discuss Relief
and its variant followed by a brief discussion on the other method.
Relief uses a statistical method to select the relevant features. It is a feature weight-based algorithm
inspired by instance-based learning algorithms [1,8]. From the set of training instances, it first chooses a sample of instances; the user must provide the number of instances (NoSample) in this sample. Relief randomly picks this sample, and for each instance in it finds the near Hit and near Miss instances based on a Euclidean distance measure. The near Hit is the instance having the minimum Euclidean distance among all instances of the same class as the chosen instance; the near Miss is the instance having the minimum Euclidean distance among all instances of a different class.
It updates the feature weights, which are initialized to zero, based on the intuitive idea that a feature is more relevant if it distinguishes between an instance and its near Miss, and less
relevant if it distinguishes between an instance and its near Hit. After exhausting all instances in the
sample, it chooses all features having weight greater than or equal to a threshold. This threshold can
be automatically evaluated using a function that uses the number of instances in the sample; it can also
be determined by inspection (all features with positive weights are selected). Relief works for noisy
and correlated features, and requires only linear time in the number of given features and NoSample. It
works both for nominal and continuous data. One major limitation is that it does not help with redundant features and, hence, generates a non-optimal feature set size in the presence of redundant features. This can be overcome by a subsequent exhaustive search over the subsets of all features selected by Relief. Relief works only for binary classes. This limitation is overcome by Relief-F [25], which also tackles the problem of incomplete data. Insufficient training instances can fool Relief. Another limitation is that the user may find it difficult to choose a proper NoSample. More about this is discussed in Section 4.3.

Fig. 2. Relief.
Jakub Segen’s [44] method uses an evaluation function that minimizes the sum of a statistical
discrepancy measure and the feature complexity measure (in bits). It finds the first feature that best
distinguishes the classes, and iteratively looks for additional features which in combination with
the previously chosen features improve class discrimination. This process stops once the minimal
representation criterion is achieved.
3.1.2. Hand-Run of CorrAL Dataset (see Figure 2)
Relief randomly chooses one instance from the sample. Let us assume instance #5 (i.e., [0,1,0,0,0,1]) is chosen. It finds the near Hit and the near Miss of this instance using the Euclidean distance measure. The difference between two discrete feature values is one if their values do not match and zero otherwise. For the chosen instance #5, instance #1 is the near Hit (difference = 1), and instances #13 and #14 are the near Misses (difference = 2). Let us choose instance #13 as the near Miss. Next, it updates each feature weight Wj. This is iterated NoSample times, as specified by the user. Features having weights greater than or equal to the threshold are selected. Usually, the weights are negative for irrelevant features, and positive for relevant and redundant features. For the CorrAL dataset it selects {A0, B0, B1, C} (more about this in Section 4.3).
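The weight update just described can be sketched as follows; this is our own illustrative simplification (a single pass over a fixed sample, with a difference of one for mismatched discrete values), not Kira and Rendell's exact algorithm:

```python
def diff(a, b, j):
    """Difference between two discrete feature values: 1 on mismatch, else 0."""
    return int(a[j] != b[j])

def distance(a, b, n_features):
    """Distance between instances: sum of per-feature differences."""
    return sum(diff(a, b, j) for j in range(n_features))

def relief_weights(instances, labels, sample):
    """Reward features that separate an instance from its near Miss and
    punish those that separate it from its near Hit."""
    n = len(instances[0])
    weights = [0.0] * n
    for i in sample:
        same = [k for k, l in enumerate(labels) if l == labels[i] and k != i]
        other = [k for k, l in enumerate(labels) if l != labels[i]]
        hit = min(same, key=lambda k: distance(instances[i], instances[k], n))
        miss = min(other, key=lambda k: distance(instances[i], instances[k], n))
        for j in range(n):
            weights[j] += diff(instances[i], instances[miss], j) \
                - diff(instances[i], instances[hit], j)
    return weights

# Feature 0 determines the class; feature 1 is irrelevant noise.
insts = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
w = relief_weights(insts, labels, [0, 1, 2, 3])
```

After the pass, features with weights above a threshold would be selected; here the relevant feature ends with a positive weight and the irrelevant one with a negative weight.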
3.2. Category II: Generation—Complete, Evaluation—Distance
3.2.1. Brief Description of Various Methods
This combination is found in old methods such as branch and bound [34]. Other methods in this
category are variations of branch and bound (B & B) method considering the generation procedure used
(BFF [50]), or the evaluation function used (Bobrowski’s method [5]). We first discuss the branch and
bound method, followed by a brief discussion of the other two methods.
Narendra and Fukunaga have defined feature selection in a classical way (see definition 2 in Section 1)
that requires evaluation functions be monotonic (i.e., a subset of features should not be better than any
larger set that contains the subset). This definition has a severe drawback for real-world problems because
the appropriate size of the target feature subset is generally unknown. But this definition can be slightly
modified to make it applicable for general problems as well, by saying that B & B attempts to satisfy two
criteria: (i) the selected subset be as small as possible; and (ii) a bound be placed on the value calculated
by the evaluation function [13]. As per this modification, B & B starts searching from the original feature
set and it proceeds by removing features from it. A bound is placed on the value of the evaluation
function to create a rapid search. As the evaluation function obeys the monotonicity principle, any
subset for which the value is less than the bound is removed from the search tree (i.e., all subsets of it are
discarded from the search space). Evaluation functions generally used are: Mahalanobis distance [15],
the discriminant function, the Fisher criterion, the Bhattacharya distance, and the divergence [34]. Xu,
Yan, and Chang [50] proposed a similar algorithm (BFF), where the search procedure is modified to
solve the problem of searching an optimal path in a weighted tree with the informed best first search
strategy in artificial intelligence. This algorithm guarantees the globally best subset without exhaustive enumeration, for any criterion satisfying the monotonicity principle.
Bobrowski [5] proves that the homogeneity coefficient f*_Ik can be used to measure the degree of linear dependence among some measurements, and shows that it is applicable to the feature selection problem due to its monotonicity principle (i.e., if S1 ⊂ S2 then f*_Ik(S1) > f*_Ik(S2)). Hence, it can be suitably converted to a feature selection method by implementing it as an evaluation function for branch and bound with backtracking or a best first generation procedure.
The fact that an evaluation function must be monotonic to be applicable to these methods prevents the use of many common evaluation functions. This problem is partially solved by relaxing the monotonicity
criterion and introducing the approximate monotonicity concept [16].
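Under a monotonic evaluation function, the pruning idea can be sketched as follows; this recursive fragment is our own illustration, not Narendra and Fukunaga's implementation, and it assumes `evaluate` is monotonic so that a pruned branch cannot contain a better subset:

```python
def branch_and_bound(features, evaluate, m):
    """Find the best subset of size m under a monotonic evaluation function,
    pruning any branch whose value falls to the current bound or below."""
    best = {"subset": None, "value": float("-inf")}

    def search(subset, start):
        if len(subset) == m:
            value = evaluate(subset)
            if value > best["value"]:
                best["subset"], best["value"] = list(subset), value
            return
        # Prune: by monotonicity, no subset of this branch can beat the
        # bound if the current candidate already fails it.
        if evaluate(subset) <= best["value"]:
            return
        for i in range(start, len(features)):
            # Remove features in increasing index order to avoid duplicates.
            reduced = [f for f in subset if f != features[i]]
            if len(reduced) >= m:
                search(reduced, i + 1)

    search(list(features), 0)
    return best["subset"], best["value"]

# Monotonic toy evaluation: sum of per-feature scores (a subset never
# scores higher than any superset containing it).
scores = {"A": 5, "B": 3, "C": 1, "D": 2}
subset, value = branch_and_bound(
    list(scores), lambda s: sum(scores[f] for f in s), 2)
```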
3.2.2. Hand-run of CorrAL Dataset (see Figure 3)
The authors use Mahalanobis distance measure as the evaluation function. The algorithm needs input
of the required number of features (M) and it attempts to find out the best subset of N 2M features to
Fig. 3. Branch and bound.
11. M. Dash, H. Liu / Intelligent Data Analysis 1 (1997) 131–156 141
Fig. 4. Decision tree method (DTM).
reject. It begins with the full set of features S0 and removes one feature from S_j^(l−1) in turn to
generate subsets S_j^l, where l is the current level and j indexes the different subsets at the lth level.
If U(S_j^l) falls below the current bound, S_j^l stops growing (its branch is pruned); otherwise, it
grows to level l + 1, i.e., one more feature will be removed. If, for the CorrAL dataset, M is set to 4,
then the subset (A0, A1, B0, I) is selected as the best subset of 4 features.
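The search-and-prune scheme just described can be sketched compactly. The following is a minimal Python sketch, not the authors' implementation; the criterion is a toy additive score over hypothetical per-feature weights, which is monotone because every weight is positive:

```python
def branch_and_bound(features, J, M):
    """Find the size-M subset maximizing the monotone criterion J.

    Monotonicity (S1 within S2 implies J(S1) <= J(S2)) means a branch whose
    value already falls below the best size-M value found so far cannot
    contain a better size-M subset, so it is pruned.
    """
    best = {"value": float("-inf"), "subset": None}

    def search(subset, start):
        value = J(subset)
        if value <= best["value"]:
            return                      # prune this branch
        if len(subset) == M:
            best["value"], best["subset"] = value, tuple(subset)
            return
        # remove one more feature; 'start' ensures each subset is visited once
        for i in range(start, len(subset)):
            search(subset[:i] + subset[i + 1:], i)

    search(tuple(features), 0)
    return best["subset"]

# Toy monotone criterion: sum of hypothetical per-feature merit weights.
weights = {"A0": 5, "A1": 4, "B0": 3, "B1": 2, "I": 1, "C": 6}
J = lambda S: sum(weights[f] for f in S)
print(branch_and_bound(list(weights), J, 4))  # ('A0', 'A1', 'B0', 'C')
```

With M = 4 the search keeps the four heaviest features, pruning any branch whose value has already dropped below the running bound.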
3.3. Category IV: Generation—Heuristic, Evaluation—Information
3.3.1. Brief Description of Various Methods
We have found two methods under this category: decision tree method (DTM) [9], and Koller and
Sahami’s [24] method. DTM shows that the use of feature selection can improve case-based learning. In
this method feature selection is used in an application on natural language processing. C4.5 [39] is run
over the training set, and the features that appear in the pruned decision tree are selected. In other words,
the union of the subsets of features, appearing in the paths to any leaf node in the pruned tree, is the
selected subset. The second method, which is very recent, is based on the intuition that any feature
carrying little or no additional information beyond that subsumed by the remaining features is either
irrelevant or redundant, and should be eliminated. To realize this, Koller and Sahami try to approximate the Markov
blanket where a subset T is a Markov blanket for feature fi if, given T , fi is conditionally independent
both of the class label and of all features not in T (including fi itself). The implementation of the Markov
blanket is suboptimal in many ways, particularly due to many naive approximations.
3.3.2. Hand-Run of CorrAL Dataset (see Figure 4)
C4.5 uses an information-based heuristic, a simple form of which for a two-class problem (as in our
example) is

    I(p, n) = −(p/(p + n)) log2(p/(p + n)) − (n/(p + n)) log2(n/(p + n)),
where p is the number of instances with class label one and n is the number of instances with class label
zero. Assume that using an attribute F1 as the root of the tree partitions the training set into T0 and T1
(because each feature takes binary values only). The entropy of feature F1 is

    E(F1) = ((p0 + n0)/(p + n)) I(p0, n0) + ((p1 + n1)/(p + n)) I(p1, n1).

Considering the CorrAL dataset, let us assume that feature C is evaluated. If it takes value '0', then one
instance is positive (class = 1) and seven instances are negative (class = 0); for value '1', six instances
are positive (class = 1) and two are negative (class = 0). Hence
    E(C) = ((1 + 7)/16) I(1, 7) + ((6 + 2)/16) I(6, 2)

is 0.552421. In fact this value is the minimum among all features, and C is selected as the root of the
decision tree. The original training set of sixteen instances is divided into two nodes, each representing
eight instances using feature C. For these two nodes, again features having the least entropy are selected,
and so on. This process halts when each partition contains instances of a single class, or when no test
offers any improvement. The decision tree thus constructed is then pruned, basically to avoid over-fitting.
DTM chooses all features appearing in the pruned decision tree (i.e., A0, A1, B0, B1, C).
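The two formulas above can be evaluated directly. A minimal Python sketch (the function names are ours), fed with the feature-C counts quoted in the text:

```python
import math

def info(p, n):
    """I(p, n): class entropy of a node with p positive and n negative instances."""
    total = p + n
    entropy = 0.0
    for count in (p, n):
        if count:                       # treat 0 * log2(0) as 0
            frac = count / total
            entropy -= frac * math.log2(frac)
    return entropy

def split_entropy(partitions):
    """E(F): entropy after splitting on F, given a (p, n) pair per feature value."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

# Feature C on the 16-instance training set: value 0 -> (1 positive, 7 negative),
# value 1 -> (6 positive, 2 negative), as described in the text.
print(round(split_entropy([(1, 7), (6, 2)]), 6))
```

The feature with the smallest E value is chosen as the split at each node.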
3.4. Category V: Generation—Complete, Evaluation—Information
Under this category we found Minimum Description Length Method (MDLM) [45]. In this method,
the authors have attempted to eliminate all useless (irrelevant and/or redundant) features. According to
the authors, if the features in a subset V can be expressed as a fixed non-class-dependent function F of
the features in another subset U, then once the values of the features in subset U are known, the feature
subset V becomes useless. For feature selection, U and V together make up the complete feature set
and one has the task of separating them. To solve this, the authors use the minimum description length
criterion (MDLC) introduced by Rissanen [40]. They formulated an expression that can be interpreted
as the number of bits required to transmit the classes of the instances, the optimal parameters, the useful
features, and (finally) the useless features. The algorithm exhaustively searches all the possible subsets
(2^N) and outputs the subset satisfying MDLC. This method can find all the useful features, and only
those, if the observations are Gaussian. For non-Gaussian cases, it may not be able to find the useful
features (more about this in Section 4.3).
3.4.1. Hand-Run of CorrAL Dataset
As seen in Figure 5, MDLM basically evaluates an equation that represents the description length
given the candidate subset of features. The actual implementation suggested by the authors is as
follows. Calculate the covariance matrices of the whole feature vectors for all classes together and for
each class separately. The covariance matrices for useful subsets are obtained as sub-matrices of these.
Determine the determinants of the sub-matrices, compute DL(i) and DL using the equation shown in
Figure 5, and find the subset having the minimum description length. For the CorrAL dataset the chosen
feature subset is {C}, with a minimum description length of 119.582.
3.5. Category VII: Generation—Heuristic, Evaluation—Dependence
3.5.1. Brief Description of Various Methods
We found the POE+ACC (Probability of Error & Average Correlation Coefficient) method [33] and
PRESET [31] under this combination. As many as seven feature selection techniques are presented
in POE+ACC, but we choose the last (i.e., seventh) method, because the authors consider it the
most important of the seven.
In this method, the first feature selected is the feature with the smallest probability of error (Pe).
The next feature selected is the feature that produces the minimum weighted sum of Pe and average
correlation coefficient (ACC), and so on. ACC is the mean of the correlation coefficients of the candidate
feature with the features previously selected at that point. This method can rank all the features based on
Fig. 5. Minimum description length method (MDLM).
Fig. 6. POE + ACC.
the weighted sum, and a required number of features (M) is used as the stopping criterion. PRESET uses
the concept of rough sets; it first finds a reduct (a reduct R of a set P classifies instances as well
as P does) and removes all features not appearing in the reduct. Then it ranks the features based on their
significance. The significance of a feature is a measure expressing how important the feature is for
classification. This measure is based on the dependence of attributes.
3.5.2. Hand-Run of CorrAL Dataset (see Figure 6)
The first feature chosen is the one having minimum Pe, and in each iteration thereafter, the feature
having minimum w1(Pe) + w2(ACC) is selected. In our experiments, we take w1 to be 0.1 and w2 to be
0.9 (the authors suggest these values for their case study). To calculate Pe, compute the a priori
probabilities of the different classes. For the mini-dataset, these are 9/16 for class 0 and 7/16 for
class 1. Next, for each feature, calculate the class-conditional probabilities given the class label. We
find that for class label 0, the class-conditional probability of feature C having the value 0 is 2/9 and
that of value 1 is 7/9; for class label 1, the class-conditional probability of feature C having the
value 0 is 6/7 and that of value 1 is 1/7. For
each feature value (e.g., feature C taking value 0), we find the class label for which the product of the
a priori class probability and the class-conditional probability is maximum. When feature C takes the
value 0, the prediction is class 1, and when feature C takes the value 1, the prediction is class 0.
For all instances, we count the number of mismatches between the actual and the predicted class values.
Feature C has a 3/16 fraction of mismatches (Pe). In fact, among all features, C has the least Pe and
hence is selected as the first feature. In the second step, the correlations of all remaining features
(A0, A1, B0, B1, I) with feature C are calculated. Using the expression w1(Pe) + w2(ACC), we find that
feature A0 has the least value among the five features and is chosen as the second feature. If M (the
required number) is 4, then the subset (C, A0, B0, I) is selected.
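The Pe computation described above can be reproduced from joint counts alone. A small Python sketch (the count table is ours, derived from the probabilities quoted in the text):

```python
def error_probability(joint_counts):
    """Pe of a single feature via the maximum-a-posteriori rule described above.

    `joint_counts[(value, label)]` is the number of training instances with
    that feature value and class label.
    """
    total = sum(joint_counts.values())
    labels = {label for _, label in joint_counts}
    values = {value for value, _ in joint_counts}
    mismatches = 0
    for v in values:
        # The product of prior and class-conditional probability is
        # proportional to the joint count, so the prediction for value v
        # is simply its majority class label.
        counts = {l: joint_counts.get((v, l), 0) for l in labels}
        predicted = max(counts, key=counts.get)
        mismatches += sum(c for l, c in counts.items() if l != predicted)
    return mismatches / total

# Feature C on the 16-instance CorrAL training set (counts from the text):
# class 0: two instances with C=0, seven with C=1;
# class 1: six instances with C=0, one with C=1.
pe_C = error_probability({(0, 0): 2, (1, 0): 7, (0, 1): 6, (1, 1): 1})
print(pe_C)  # 3/16 = 0.1875
```

Running the same function over every feature and taking the minimum reproduces the choice of C as the first feature.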
3.6. Category XI: Generation—Complete, Evaluation—Consistency
3.6.1. Brief Description of Various Methods
The methods under this combination have been developed in recent years. We discuss Focus [2],
Schlimmer's [43] method, and MIFES 1 [35], and select Focus as the representative method for detailed
discussion and empirical comparison.
Focus implements the Min-Features bias that prefers consistent hypotheses definable over as few
features as possible. In the simplest implementation, it does a breadth-first search and checks for any
inconsistency considering only the candidate subset of features. Strictly speaking, Focus is unable to
handle noise, but a simple modification that allows a certain percentage of inconsistency will enable it to
find the minimally sized subset satisfying the permissible inconsistency.
The other two methods can be seen as variants. Schlimmer's method uses a systematic enumeration
scheme as the generation procedure and the inconsistency criterion as the evaluation function. It uses a
heuristic function that makes the search for the optimal subset faster. This heuristic function is a
reliability measure, based on the intuition that the probability that an inconsistency will be observed
is proportional to the percentage of values that have been observed very infrequently considering the
subset of features. All supersets of an unreliable subset are also unreliable. MIFES 1 is very similar
to Focus in its selection of features. It represents the set of instances in the form of a matrix¹,
each element of which stands for a unique combination of a positive instance (class = 1) and a negative
instance (class = 0). A feature f is said to cover an element of the matrix if it assumes opposite
values for the positive and negative instances associated with the element. It searches for a cover
with N − 1 features starting from one with all N features, and iterates until no further reduction in
the size of the cover is attained.
3.6.2. Hand-Run of CorrAL Dataset (see Figure 7)
As Focus uses breadth-first generation procedure, it first generates all subsets of size one (i.e., {A0},
{A1}, {B0}, {B1}, {I}, {C}), followed by all subsets of size two, and so on. For each subset thus
generated, it checks whether there are at least two instances in the dataset having equal values for all
features under examination but different class labels (i.e., an inconsistency). If such a case arises,
it rejects the subset as inconsistent and moves on to test the next subset. This continues until it
finds a subset having no inconsistency or the search is complete (i.e., all possible subsets are found
inconsistent). For subset {A0}, instances #1 and #4 have the same A0 value (i.e., 0) but different
class labels (i.e., 0 and 1
1 MIFES 1 can only handle binary classes and boolean features.
Fig. 7. Focus.
respectively). Hence it rejects {A0} and moves on to the next subset {A1}. It evaluates a total of 41
subsets (C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20) before selecting the subset (A0, A1, B0, B1).
3.7. Category XII: Generation—Random, Evaluation—Consistency
3.7.1. Brief Description of Various Methods
This category is rather new and its representative is LVF [28]. LVF randomly searches the space
of subsets using a Las Vegas algorithm [6], which makes probabilistic choices to guide the search more
quickly toward an optimal solution, and uses a consistency measure that is different from that of Focus.
For each candidate subset, it calculates an inconsistency count based on the intuition that the most
frequent class label among those instances matching this subset of features is the most probable class
label. An inconsistency threshold is fixed in the beginning (0 by default), and any subset, having an
inconsistency rate greater than the threshold is rejected. This method can find the optimal subset even
for datasets with noise, if the approximate noise level is specified in advance. An advantage is that the
user does not have to wait too long for a good subset, because the algorithm outputs any subset found to
be better than the previous best (both in subset size and inconsistency rate). The algorithm is
efficient, as only subsets with no more features than the current best subset are checked for
inconsistency. It is simple to implement and guaranteed to find the optimal subset if resources permit.
One drawback is that, for certain problems, it may take more time to find the optimal subset than
algorithms using a heuristic generation procedure, since it cannot take advantage of prior knowledge.
3.7.2. Hand-Run of CorrAL Dataset (see Figure 8)
LVF [28] chooses a feature subset randomly and calculates its cardinality. The best subset is initialized
to the complete feature set. If the randomly chosen subset has cardinality less than or equal to that of the
current best subset, it evaluates the inconsistency rate of the new subset. If the rate is less than or equal
to the threshold value (default value is zero), the new subset becomes the current best subset. Let us say
that the subset {A0, B0, C} is randomly chosen. Patterns (0,1,0), (1,0,0), and (1,0,1) have mixed class
labels, with class distributions (1,2), (1,1), and (1,1) for classes 0 and 1 respectively. The
inconsistency count is the sum, over patterns, of the number of matching instances minus the number of
matching instances with the most frequent class label. Hence for the subset {A0, B0, C} the inconsistency
count is 3 (= (3 − 2) + (2 − 1) + (2 − 1)) and the inconsistency rate is 3/16. As we have not specified
any threshold inconsistency rate, and it is zero by default, this subset is rejected. But if the randomly
chosen subset is {A0, A1, B0, B1}, then the inconsistency rate is zero and it becomes the current best
subset. No subset of smaller size has an inconsistency rate of zero, leading to the selection of
{A0, A1, B0, B1}.
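LVF's inconsistency count can be computed mechanically. A minimal Python sketch, with 16 synthetic instances constructed by us to match the pattern distributions quoted above (they are not the actual CorrAL rows):

```python
from collections import Counter

def inconsistency_rate(instances, subset):
    """Sum over patterns of (matching instances minus those carrying the most
    frequent class label), divided by the total number of instances."""
    groups = {}
    for features, label in instances:
        key = tuple(features[f] for f in subset)
        groups.setdefault(key, Counter())[label] += 1
    count = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return count / len(instances)

row = lambda a0, b0, c: {"A0": a0, "B0": b0, "C": c}
data = ([(row(0, 1, 0), 0)] + [(row(0, 1, 0), 1)] * 2 +     # pattern (0,1,0): (1,2)
        [(row(1, 0, 0), 0)] + [(row(1, 0, 0), 1)] +         # pattern (1,0,0): (1,1)
        [(row(1, 0, 1), 0)] + [(row(1, 0, 1), 1)] +         # pattern (1,0,1): (1,1)
        [(row(0, 0, 0), 0)] * 9)                            # 9 consistent fillers
print(inconsistency_rate(data, ("A0", "B0", "C")))  # 3/16 = 0.1875
```

A subset is accepted only when this rate does not exceed the user-specified threshold.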
Fig. 8. LVF.
3.8. Category XIII-XV: Evaluation—Classifier Error Rate
In this section, we briefly discuss and give suitable references for the methods under the classifier
error rate evaluation function, commonly known as "wrapper" methods.
3.8.1. Heuristic
Under the heuristic generation function, we find popular methods such as sequential forward selection
(SFS) and sequential backward selection (SBS) [12]. Sequential backward selection-SLASH (SBS-SLASH) [10],
(p,q) sequential search (PQSS), bi-directional search (BDS) [13], Schemata Search [32], relevance in
context (RC) [14], and Queiros and Gelsema's [37] method are variants of SFS, SBS, or both.
or both. SFS starts from the empty set, and in each iteration generates new subsets by adding a feature
selected by some evaluation function. SBS starts from the complete feature set, and in each iteration
generates new subsets by discarding a feature selected by some evaluation function. SBS-SLASH is
based on the observation that when there is a large number of features, some classifiers (e.g., ID3/C4.5)
frequently do not use many of them. It starts with the full feature set (like SBS), but after taking a step,
eliminates (slashes) any feature not used in what was learned at that step. BDS allows the search to
proceed from both ends, whereas PQSS provides some degree of backtracking by allowing both addition and
deletion for each subset. If PQSS starts from the empty set, it adds more features than it discards in
each step; if it starts from the complete feature set, it discards more features than it adds in each
step. Schemata search starts either from the empty set or the complete set, and in each iteration,
finds the best subset by either removing or adding only one feature to the subset. It evaluates each subset
using leave-one-out cross validation (LOOCV), and in each iteration, selects the subset having the least
LOOCV error. It continues this way until no single-feature change improves it. RC considers the fact that
some features will be relevant only in some parts of the space. It is similar to SBS, but with the crucial
difference that it makes local, instance-specific decisions on feature relevance, as opposed to global ones.
Queiros and Gelsema's method is similar to SFS, but it suggests that, at each iteration, each feature
should be evaluated in various settings by considering different interactions with the set of features
previously selected. Two simple settings are: (i) always assume independence of features (do not consider
the previously selected features), and (ii) never assume independence (consider the previously selected
features). The error rate of a Bayesian classifier was used as the evaluation function.
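SFS, the simplest of these variants, is easy to sketch. A minimal Python version, with a hypothetical additive score standing in for the wrapper's evaluation function (a real wrapper would train and test a classifier here):

```python
def sfs(all_features, evaluate, k):
    """Sequential forward selection: greedily add the feature that most
    improves `evaluate` until k features are selected."""
    selected = []
    remaining = list(all_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical evaluation: each feature contributes a fixed gain, standing in
# for a wrapper's (1 - classifier error rate).
gains = {"f1": 0.30, "f2": 0.25, "f3": 0.10, "f4": 0.05}
score = lambda subset: sum(gains[f] for f in subset)
print(sfs(gains, score, 2))  # ['f1', 'f2']
```

SBS is the mirror image: start from the full set and greedily remove the feature whose removal hurts the score least.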
3.8.2. Complete
Under complete generation procedure, we find four wrapper methods. Ichino and Sklansky have
devised two methods based on two different classifiers, namely: linear classifier [19] and box
classifier [20]. In both methods, they solve the feature selection problem using zero-one integer
programming [17]. Approximate monotonic branch and bound (AMB&B) was introduced to combat the
disadvantage of B&B by permitting evaluation functions that are not monotonic [16]. In it, the bound of
B&B is relaxed to generate subsets that appear below some subset violating the bound; but the selected
subset should not violate the bound. Beam Search (BS) [13] is a type of best-first search that uses a
bounded queue to limit the scope of the search. The queue is ordered from best to worst with the best
subsets at the front of the queue. The generation procedure proceeds by taking the subset at the front of
the queue, and producing all possible subsets by adding a feature to it, and so on. Each subset is placed
in its proper sorted position in the queue. If there is no limit on the size of the queue, BS is an exhaustive
search; if the limit on the size of the queue is one, it is equivalent to SFS.
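The bounded-queue behavior can be sketched as follows. This is our own minimal Python construction, with a hypothetical scoring function that rewards the pair {f1, f2} only jointly:

```python
import heapq

def beam_search(all_features, evaluate, beam_width):
    """Forward beam search over feature subsets: keep only the beam_width
    best subsets at each level. An unbounded queue gives exhaustive search;
    width 1 degenerates to an SFS-like greedy search."""
    beam = [()]
    best_subset, best_score = (), evaluate(())
    while beam:
        # expand every subset in the beam by one feature
        children = {tuple(sorted(s + (f,)))
                    for s in beam for f in all_features if f not in s}
        if not children:
            break
        beam = heapq.nlargest(beam_width, children, key=evaluate)
        if evaluate(beam[0]) > best_score:
            best_subset, best_score = beam[0], evaluate(beam[0])
    return best_subset

# Hypothetical score: f1 and f2 are useful only together; small size penalty.
def score(subset):
    s = set(subset)
    return (0.5 if {"f1", "f2"} <= s else 0.0) \
        + (0.1 if "f3" in s else 0.0) - 0.01 * len(s)

print(beam_search(["f1", "f2", "f3"], score, beam_width=2))
```

Keeping more than one subset per level lets the search retain a singleton like {f1} that looks mediocre on its own but leads to the high-scoring pair later.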
3.8.3. Random
The five methods found under random generation procedure are: LVW [27], genetic algorithm [49],
simulated annealing [13], random generation plus sequential selection (RGSS) [13], and RMHC-PF1 [47].
LVW generates subsets in a perfectly random fashion (it uses a Las Vegas algorithm [6]); genetic
algorithms (GA) and simulated annealing (SA), although there is some element of randomness in their
generation, follow specific procedures for generating subsets continuously. RGSS injects randomness into
SFS and SBS: it generates a random subset, and runs SFS and SBS starting from that subset. Random
mutation hill climbing-prototype and feature selection 1 (RMHC-PF1) selects prototypes (instances) and
features simultaneously for the nearest neighbor classification problem, using a bit vector that encodes
both the prototypes and the features. It uses the error rate of a 1-nearest neighbor classifier as the
evaluation function. In each iteration it randomly mutates a bit in the vector to produce the next
subset. All these
methods need proper assignment of values to different parameters. In addition to the maximum number of
iterations, other parameters may be required, such as: for LVW, the threshold inconsistency rate; for GA,
the initial population size, crossover rate, and mutation rate; for SA, the annealing schedule, number of
times to loop, initial temperature, and mutation probability.
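The random mutation hill climbing loop is short enough to sketch. This is our own minimal Python version, with a hypothetical evaluation function standing in for the 1-NN error rate used by RMHC-PF1:

```python
import random

def rmhc(num_features, evaluate, iterations, seed=0):
    """Random mutation hill climbing over a feature bit vector: flip one
    random bit per iteration and keep the change only if the evaluation
    does not get worse."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(num_features)]
    for _ in range(iterations):
        candidate = current[:]
        candidate[rng.randrange(num_features)] ^= 1   # mutate one random bit
        if evaluate(candidate) >= evaluate(current):
            current = candidate
    return current

# Hypothetical score: features 0 and 2 are useful; every selected bit costs 1.
score = lambda bits: 2 * bits[0] + 2 * bits[2] - sum(bits)
print(rmhc(num_features=5, evaluate=score, iterations=200, seed=1))
```

In RMHC-PF1 the same vector also carries one bit per training instance, so prototypes and features are mutated by the same mechanism.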
3.9. Summary
Figure 9 shows a summary of the feature selection methods based on the three types of generation
procedures. Complete generation procedures are further subdivided into 'exhaustive' and 'non-exhaustive';
under the 'exhaustive' category, a method may evaluate all 2^N subsets, or it may do a 'breadth-first'
search that stops as soon as an optimal subset is found; under the 'non-exhaustive' category we find
'branch & bound', 'best first', and 'beam search' as different search techniques. Heuristic generation
procedures are subdivided into 'forward selection', 'backward selection', 'combined forward/backward',
and 'instance-based' categories. Similarly, random generation procedures are grouped into 'type I' and
'type II'. In 'type I' methods the probability of a subset being generated remains constant and is the
same for all subsets, whereas in 'type II' methods this probability changes as the program runs.
Fig. 9. Summary of feature selection methods.
The underlined methods represent their categories in Table 2, and are implemented for an empirical
comparison among the categorical methods in the next section.
4. An Empirical Comparison
In this section, we first discuss the different issues involved in the validation procedure commonly
used in the feature selection methods. Next, we briefly discuss three artificial datasets chosen for the
experiments. At the end, we compare and analyze the results of running the categorical methods over the
chosen datasets.
4.1. Validation
There are two commonly used validation procedures for feature selection methods: (a) using artificial
datasets and (b) using real-world datasets. Artificial datasets are constructed keeping in mind a certain
target concept; so all the actual relevant features for this concept are known. Validation procedures
check whether the output (selected subset) is the same as the actual subset. A training dataset with
known relevant features as well as irrelevant/redundant features or noise is taken. The feature selection
method is run over this dataset, and the result is compared with the known relevant features. In the
second procedure, real-world datasets are chosen, which may be benchmark datasets. As the actual subset is
unknown in this case, the selected subset is tested for its accuracy with the help of any classifier
suitable to the task. Typical classifiers used are: the naive Bayesian classifier, CART [7], ID3 [38],
FRINGE [36], AQ15 [30], CN2 [11], and C4.5 [39]. The achieved accuracy may be compared with that of some
well-known methods, and its efficacy analyzed.
In this section, we try to show an empirical comparison of representative feature selection methods. In
our opinion, the second method of validation is not suitable to our task because particular classifiers
favor specific datasets; a test with such a combination of classifier and dataset may wrongly show a high
accuracy rate unless a variety of classifiers are chosen² and many statistically different datasets are
used³. Also, as different feature selection methods have different biases in selecting features, similar
to those of different classifiers, it is not fair to use certain combinations of methods and classifiers
and then to generalize from the results that some feature selection methods are better than others
without considering the classifier. Fortunately, many recent papers that introduce new feature selection
methods report extensive comparative tests. Hence, a reader who wants to probe further is referred by
this survey to the original work. To give some intuition regarding the wrapper methods, we include one
wrapper method (LVW [27]) in the empirical comparison along with the representative methods in
categories I, II, IV, V, VII, XI, and XII (the other categories are empty).
We choose three artificial datasets with combinations of relevant, irrelevant, and redundant features,
and noise, to perform the comparison. They are simple but have characteristics useful for testing. For
each dataset, a training set commonly used in the literature for comparing results is chosen for
checking whether a feature selection method can find the relevant features.
4.2. Datasets
Three datasets are chosen: the first has irrelevant and correlated features, the second has irrelevant
and redundant features, and the third has irrelevant features and is noisy (misclassification). Our aim
is to find the strengths and weaknesses of the methods in these settings.
4.2.1. CorrAL [21]
This dataset has 32 instances, binary classes, and six boolean features (A0, A1, B0, B1, I, C), of
which feature I is irrelevant, feature C is correlated with the class label 75% of the time, and the
other four features are relevant to the boolean target concept: (A0 ∧ A1) ∨ (B0 ∧ B1).
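The target concept can be written down directly. A trivial Python check (the function name is ours) showing that neither I nor C enters the label:

```python
def corral_label(a0, a1, b0, b1):
    """CorrAL target concept: (A0 and A1) or (B0 and B1)."""
    return int((a0 and a1) or (b0 and b1))

print(corral_label(1, 1, 0, 0))  # 1: A0 and A1 both set
print(corral_label(0, 1, 1, 0))  # 0: neither pair fully set
```

Feature C nevertheless matches this label on 75% of the instances, which is exactly what misleads correlation-driven methods.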
4.2.2. Modified Par3+3
Par3+3 is a modified version of the original Parity3 data, which has no redundant features. The dataset
contains binary classes and twelve boolean features, of which three features (A1, A2, A3) are relevant,
(A7, A8, A9) are redundant (copies of the relevant ones), and (A4, A5, A6) and (A10, A11, A12) are
irrelevant (randomly chosen). The training set contains 64 instances, and the target concept is the odd
parity of the first three features.
4.2.3. Monk3 [48]
It has binary classes and six discrete features (A1, ..., A6), only the second, fourth, and fifth of
which are relevant to the target concept: (A5 = 3 ∧ A4 = 1) ∨ (A5 ≠ 4 ∧ A2 ≠ 3). The training set
contains 122 instances, 5% of which are noise (misclassification).
4.3. Results and Analyses
Table 4 shows the results of running the eight categorical methods over the three artificial datasets. In
the following paragraphs we analyze these results, and compare the methods based on the datasets used.
4.3.1. Relief [22]
For CorrAL dataset, Relief selects (A0,B0,B1,C) for the range [14] of NoSample. For all values of
NoSample, Relief prefers the correlated feature C over the relevant feature A1. The experiment over
Table 4
Table Showing the Features Selected for the Datasets

Method   | CorrAL            | Modified Par3+3                | Monk3
---------|-------------------|--------------------------------|------------------------------
(RA)     | (A0, A1, B0, B1)  | ({A1, A7}, {A2, A8}, {A3, A9}) | (A2, A4, A5)
Relief   | A0, B0, B1, C     | A1, A2, A3, A7, A8, A9         | A2, A5 always & one or both of A3, A4
B&B      | A0, A1, B0, I     | A1, A2, A3                     | A1, A3, A4
DTM      | A0, A1, B0, B1, C | A6, A4                         | A2, A5
MDLM     | C                 | A2, A3, A8, A10, A12           | A2, A3, A5
POE+ACC  | C, A0, B0, I      | A5, A1, A11, A2, A12, A3       | A2, A5, A1, A3
Focus    | A0, A1, B0, B1    | A1, A2, A3                     | A1, A2, A4, A5
LVF      | A0, A1, B0, B1    | A3, A7, A8*                    | A2, A4, A5 at 5% I-T; A1, A2, A4, A5 at 0% I-T
LVW      | A0, A1, B0, B1    | A2, A3, A7*                    | A2, A4, A5

RA: relevant attributes; I-T: inconsistency threshold; *: this is one of many found.
modified Par3+3 dataset clearly shows that Relief is unable to detect redundant features (features A7,
A8, A9 are the same as A1, A2, A3 respectively). But it was partially successful (depending on the value
of NoSample) for the Monk3 dataset, which contains 5% noise.
Experimentally, we found that for the range [15] of NoSample, it selects only the relevant features;
beyond this range it either chooses some irrelevant features or does not select some relevant features.
Hence, for a dataset with no a priori information, the user may find it difficult to choose an
appropriate value for NoSample. We suggest not using a value that is too large or too small.
4.3.2. B & B [34]
For the CorrAL dataset it rejects feature B1 first, and in the next iteration it rejects feature C. In
the case of Par3+3 it works well: when asked to select three features, it selects (A1, A2, A3), and when
asked to select six, it selects (A1, A2, A3, A7, A8, A9). It failed for the Monk3 dataset, which has
noise.
4.3.3. DTM [9]
As explained in Section 3.3, DTM prefers the features that are highly correlated with the class label,
and in the case of the CorrAL dataset the correlated feature C is selected as the root. Par3+3 is a
difficult task for the C4.5 induction algorithm, and it fails to find the relevant features. For the
Monk3 dataset, which contains noise, DTM was partially successful in choosing A2, A5 after pruning.
4.3.4. MDLM [45]
MDLM works well when all the observations are Gaussian. But none of the three datasets is perfectly
Gaussian, which is why MDLM largely failed. For CorrAL, it selects the subset consisting of
only the correlated feature C, which indicates that it prefers correlated features. For the other two
datasets the results are also not good.
4.3.5. POE+ACC [33]
The results for this method show the rankings of the features. As seen, it was not very successful for
any of the datasets. For CorrAL, it chooses feature C as the first feature, as explained in Section 3.5.
When we manually input some other first feature (e.g., A0), it correctly ranks the other three relevant
features in the top four positions (A0, A1, B0, B1). This interesting finding points out the influence
of the first feature on all the remaining features in POE+ACC. For Par3+3 and Monk3 it does not work
well.
4.3.6. Focus [2]
Focus works well when the dataset has no noise (as in the case of CorrAL and Par3+3), but it selects an
extra feature A1 for Monk3, which has 5% noise, since the actual subset (A2, A4, A5) does not satisfy
the default inconsistency rate of zero. Focus is also very fast in producing the results for CorrAL and
Par3+3, because the actual subsets appear early in the search, which suits breadth-first search.
4.3.7. LVF [28]
LVF works well for all the datasets in our experiment. For CorrAL it correctly selects the actual
subset. It is particularly efficient for Par3+3, which has three redundant features, because it can
generate at least one of the eight possible actual subsets quite early while randomly generating
subsets. For the Monk3 dataset, it needs an approximate noise level to produce the actual subset;
otherwise (the default case of zero inconsistency rate) it selects an extra feature A1.
5. Discussions and Guidelines
Feature selection is used in many application areas as a tool to remove irrelevant and/or redundant
features. As is well known, there is no single feature selection method that can be applied to all
applications. The choice of a feature selection method depends on various dataset characteristics:
(i) data types, (ii) data size, and (iii) noise. Based on different criteria drawn from these
characteristics, we give some guidelines to a potential user of feature selection as to which method to
select for a particular application.
5.1. Data Types
Various data types are found in practice for features and class labels.
Features: Feature values can be continuous (C), discrete (D), or nominal including boolean (N). The
nominal data type requires special handling because it is not easy to assign real values to it. Hence we
partition the methods based on their ability to handle different data types (C, D, N).
Class Label: Some feature selection methods can handle only binary classes, and others can deal with
multiple (more than two) classes. Hence, the methods are separated based on their ability to handle
multiple classes.
5.2. Data Size
This aspect of feature selection deals with: (i) whether a method can perform well for a small training
set, and (ii) whether it can handle a large data size. In this information age, datasets are commonly
large in size; hence the second criterion is more practical and of interest to current as well as future
researchers and practitioners in this area [28], so we separate the methods based on their ability to
handle large datasets.
5.3. Noise
Typical noise encountered during the feature selection process is of two kinds: (i) misclassification
and (ii) conflicting data ([29,41]). We partition the methods based on their ability to handle noise.
5.4. Some Guidelines
In this subsection we give some guidelines based on the criteria extracted from these characteristics.
The criteria are as follows:
• ability to handle different data types (C: (Y/N), D: (Y/N), N: (Y/N)), where C, D, and N denote
continuous, discrete, and nominal;
• ability to handle multiple (more than two) classes (Y/N);
• ability to handle large dataset (Y/N);
• ability to handle noise (Y/N); and
• ability to produce optimal subset if data is not noisy (Y/N).
Table 5 lists the capabilities regarding these five characteristics of sixteen feature selection methods that
appear in the framework in Section 2. The methods having “classifier error rate” as an evaluation function
are not considered because their capabilities depend on the particular classifier used. A method can be
implemented in different ways, so the entries in the table are based on the presentation of the methods in
the papers. A “–” entry indicates that the method does not discuss that particular characteristic. A user may
employ his/her prior knowledge regarding these five characteristics to find an appropriate method for a
particular application. For example, if the data contains noise, then the methods marked “Y” under the
column “Noise” should be considered. If more than one method satisfies the criteria specified by the user,
then the user needs to take a closer look at the methods to find the most suitable among them,
or design a new method that suits the application.
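As an aid to applying these guidelines, the lookup described above can be sketched in code: one capability record per method (transcribed from a few rows of Table 5, with None standing for an entry the table marks as not discussed) and a filter over the user's requirements. The dictionary layout and function name are our own illustration, not part of the original survey.

```python
# Sketch: encode a few rows of Table 5 and filter methods by required
# capabilities. None marks a capability the table does not discuss ("-").
FIELDS = ("C", "D", "N", "multiclass", "large", "noise", "optimal")

CAPABILITIES = {
    "Relief":   (True,  True, True,  False, True,  True,  False),
    "Relief-F": (True,  True, True,  True,  True,  True,  False),
    "B & B":    (True,  True, False, True,  None,  None,  True),
    "DTM":      (True,  True, True,  True,  True,  None,  False),
    "Focus":    (False, True, True,  True,  False, False, True),
    "LVF":      (False, True, True,  True,  True,  True,  True),
}

def candidates(*required):
    """Return methods whose capability is True for every required field."""
    return [m for m, caps in CAPABILITIES.items()
            if all(dict(zip(FIELDS, caps))[f] is True for f in required)]

# E.g. noisy data with more than two classes:
print(candidates("noise", "multiclass"))  # → ['Relief-F', 'LVF']
```

Note that, as in Table 5, a noise entry of Y for LVF still assumes the user supplies the noise level.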
6. Conclusion and Further Work
We present a definition of feature selection after discussing many existing definitions. Four steps
of a typical feature selection process are recognized: generation procedure, evaluation function,
stopping criterion, and validation procedure. We group the generation procedures into three categories:
complete, heuristic, and random, and the evaluation functions into five categories: distance, information,
dependence, consistency, and classifier error rate measures. Thirty-two existing feature selection methods
are categorized based on the combinations of generation procedure and evaluation function used in
Table 5
Feature Selection Methods and their Capabilities

                          Ability to handle/produce
                        Data Types    Multiple  Large    Noise  Optimal
Category  Method        C    D    N   Classes   Dataset         Subset
I         Relief        Y    Y    Y   N         Y        Y      N
          Relief-F      Y    Y    Y   Y         Y        Y      N
          Sege84        N    Y    Y   N         –        –      N
          Quei-Gels84   N    Y    Y   Y         –        –      N
II        B & B         Y    Y    N   Y         –        –      Y††
          BFF           Y    Y    N   Y         –        –      Y††
          Bobr88        Y    Y    N   Y         –        –      Y††
IV        DTM           Y    Y    Y   Y         Y        –      N
          Koll-Saha96   N    Y    Y   Y         Y        –      N
          MDLM          Y    Y    N   Y         –        –      N
VII       POE + ACC     Y    Y    Y   Y         –        –      N
          PRESET        Y    Y    Y   Y         Y        –      N
XI        Focus         N    Y    Y   Y         N        N      Y
          Schli93       N    Y    Y   Y         N        N      Y††
          MIFES-1†      N    N    N   Y         N        N      Y
XII       LVF           N    Y    Y   Y         Y        Y*     Y††

†: it can handle only boolean features; ††: if certain assumptions are valid; *: user is
required to provide the noise level; **: provided there are enough resources.
them. The methods in each category are described briefly, and a representative method from each
category is described in detail using pseudo code and a hand-run over a dataset (a version
of CorrAL). These representative methods are compared using three artificial datasets with different
properties, and the results are analyzed. We discuss data set characteristics, and give guidelines regarding
how to choose a particular method if the user has prior knowledge of the problem domain.
This article is a survey of feature selection methods. Our finding is that a number of combinations of
generation procedure and evaluation function have not yet been attempted (to the best of our knowledge).
This is quite interesting, considering the fact that feature selection has been the focus of interest of
various groups of researchers for the last few decades. Our comparison of the representative methods reveals
many interesting facts regarding their advantages and disadvantages in handling different characteristics
of the data. The guidelines in Section 5 should be useful to the users of feature selection who would
have been otherwise baffled by the sheer number of methods. Properly used, these guidelines can be of
practical value in choosing a suitable method.
Although feature selection is a well-developed research area, researchers still try to find better methods
to make their classifiers more efficient. In this scenario, the framework that exposes the unattempted
combinations of generation procedures and evaluation functions will be helpful. By adapting particular
implementations of evaluation functions and generation procedures already used in some
categories, we can design new and efficient methods for the unattempted combinations. It is obvious
that not all combinations will be efficient. The combination of consistency measure and heuristic
generation procedure in category ‘X’ may give some good results. One may also test new
implementations of combinations that already exist, such as the branch and bound generation procedure
with a consistency measure in category XI.
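A minimal sketch of such a combination, written by us for illustration rather than taken from any surveyed method, pairs an inconsistency-rate evaluation function with a heuristic sequential-forward generation procedure. The toy dataset below is hypothetical.

```python
# Sketch: consistency (inconsistency-rate) evaluation function driven by
# a heuristic sequential-forward generation procedure.
from collections import Counter

def inconsistency_rate(data, labels, subset):
    """Fraction of instances that disagree with the majority class among
    instances sharing the same values on the selected features."""
    groups = {}
    for row, y in zip(data, labels):
        key = tuple(row[i] for i in subset)
        groups.setdefault(key, []).append(y)
    inconsistent = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return inconsistent / len(data)

def forward_select(data, labels, n_features):
    """Greedily add the feature that lowers the inconsistency rate most."""
    selected = []
    remaining = set(range(n_features))
    while remaining:
        best = min(remaining,
                   key=lambda f: inconsistency_rate(data, labels, selected + [f]))
        if inconsistency_rate(data, labels, selected + [best]) >= \
           inconsistency_rate(data, labels, selected):
            break  # no improvement: stopping criterion reached
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 0 determines the class, feature 1 is irrelevant.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(forward_select(data, labels, 2))  # → [0]
```

The greedy loop stops as soon as adding a feature no longer reduces the inconsistency rate, so the irrelevant feature 1 is never selected.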
Acknowledgments
The authors wish to thank the two referees and Professor Hiroshi Motoda of Osaka University for their
valuable comments, constructive suggestions, and thorough reviews of an earlier version of this article.
Thanks also to Lee Leslie Leon, Sng Woei Tyng, and Wong Siew Kuen for implementing some of the
feature selection methods and analyzing their results.
References
[1] Aha, D.W., Kibler, D. and Albert, M.K., Instance-Based Learning Algorithms. Machine Learning, 6:37–66, 1991.
[2] Almuallim, H., and Dietterich, T.G., Learning with many irrelevant features. In: Proceedings of Ninth National Conference
on Artificial Intelligence, MIT Press, Cambridge, Massachusetts, 547–552, 1992.
[3] Almuallim, H., and Dietterich, T.G., Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial
Intelligence, 69(1–2):279–305, November, 1994.
[4] Ben-Bassat, M., Pattern recognition and reduction of dimensionality. In: Handbook of Statistics, (P. R. Krishnaiah and
L. N. Kanal, eds.), North Holland, 773–791, 1982.
[5] Bobrowski, L., Feature selection based on some homogeneity coefficient. In: Proceedings of Ninth International
Conference on Pattern Recognition, 544–546, 1988.
[6] Brassard, G., and Bratley, P., Fundamentals of Algorithms. Prentice Hall, New Jersey, 1996.
[7] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., Classification and Regression Trees. Wadsworth International
Group, Belmont, California, 1984.
[8] Callan, J.P., Fawcett, T.E. and Rissland, E.L., An adaptive approach to case-based search. In: Proceedings of the Twelfth
International Joint Conference on Artificial Intelligence, 803–808, 1991.
[9] Cardie, C., Using decision trees to improve case-based learning. In: Proceedings of Tenth International Conference on
Machine Learning, 25–32, 1993.
[10] Caruana, R. and Freitag, D., Greedy attribute selection. In: Proceedings of Eleventh International Conference on Machine
Learning, Morgan Kaufmann, New Brunswick, New Jersey, 28–36, 1994.
[11] Clark, P. and Niblett, T., The CN2 Induction Algorithm. Machine Learning, 3:261–283, 1989.
[12] Devijver, P.A. and Kittler, J., Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.
[13] Doak, J., An evaluation of feature selection methods and their application to computer security. Technical report, Davis,
CA: University of California, Department of Computer Science, 1992.
[14] Domingos, P., Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 1996.
[15] Duran, B.S. and Odell, P.K., Cluster Analysis—A Survey. Springer-Verlag, 1974.
[16] Foroutan, I. and Sklansky, J., Feature selection for automatic classification of non-gaussian data. IEEE Transactions on
Systems, Man, and Cybernetics, SMC-17(2):187–198, 1987.
[17] Geoffrion, A.M., Integer programming by implicit enumeration and Balas’ method. SIAM Review, 9:178–190, 1967.
[18] Holte, R.C., Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–
90, 1993.
[19] Ichino, M. and Sklansky, J., Feature selection for linear classifier. In: Proceedings of the Seventh International Conference
on Pattern Recognition, volume 1, 124–127, July–Aug 1984.
[20] Ichino, M. and Sklansky, J., Optimum feature selection by zero-one programming. IEEE Trans. on Systems, Man and
Cybernetics, SMC-14(5):737–746, September/October 1984.
[21] John, G.H., Kohavi, R. and Pfleger, K., Irrelevant features and the subset selection problem. In: Proceedings of the Eleventh
International Conference on Machine Learning, 121–129, 1994.
[22] Kira, K. and Rendell, L.A., The feature selection problem: Traditional methods and a new algorithm. In: Proceedings of
Ninth National Conference on Artificial Intelligence, 129–134, 1992.
[23] Kohavi, R. and Sommerfield, D., Feature subset selection using the wrapper method: Overfitting and dynamic search
space topology. In: Proceedings of First International Conference on Knowledge Discovery and Data Mining, Morgan
Kaufmann, 192–197, 1995.
[24] Koller, D. and Sahami, M., Toward optimal feature selection. In: Proceedings of International Conference on Machine
Learning, 1996.
[25] Kononenko, I., Estimating attributes: Analysis and extension of RELIEF. In: Proceedings of European Conference on
Machine Learning, 171–182, 1994.
[26] Langley, P., Selection of relevant features in machine learning. In: Proceedings of the AAAI Fall Symposium on Relevance,
1–5, 1994.
[27] Liu, H. and Setiono, R., Feature selection and classification—a probabilistic wrapper approach. In: Proceedings of Ninth
International Conference on Industrial and Engineering Applications of AI and ES, 284–292, 1996.
[28] Liu, H. and Setiono, R., A probabilistic approach to feature selection—a filter solution. In: Proceedings of International
Conference on Machine Learning, 319–327, 1996.
[29] Michalski, R.S., Carbonell, J.G. and Mitchell, T.M (eds.), The effect of noise on concept learning, In: Machine Learning:
An Artificial Intelligence Approach (vol. II), Morgan Kaufmann, San Mateo, CA, 419–424, 1986.
[30] Michalski, R.S., Mozetic, I., Hong, J. and Lavrac, N., The aq15 inductive learning system: An overview and experiments.
Technical Report UIUCDCS-R-86-1260, University of Illinois, July 1986.
[31] Modrzejewski, M., Feature selection using rough sets theory. In: Proceedings of the European Conference on Machine
Learning (P. B. Brazdil, ed.), 213–226, 1993.
[32] Moore, A.W. and Lee, M.S., Efficient algorithms for minimizing cross validation error. In: Proceedings of Eleventh
International Conference on Machine Learning, Morgan Kaufmann, New Brunswick, New Jersey, 190–198, 1994.
[33] Mucciardi, A.N. and Gose, E.E., A comparison of seven techniques for choosing subsets of pattern recognition. IEEE
Transactions on Computers, C-20:1023–1031, September 1971.
[34] Narendra, P.M. and Fukunaga, K., A branch and bound algorithm for feature selection. IEEE Transactions on Computers,
C-26(9):917–922, September 1977.
[35] Oliveira, A.L. and Vincentelli, A.S., Constructive induction using a non-greedy strategy for feature selection. In:
Proceedings of Ninth International Conference on Machine Learning, 355–360, Morgan Kaufmann, Aberdeen, Scotland,
1992.
[36] Pagallo, G. and Haussler, D., Boolean feature discovery in empirical learning. Machine Learning, 1(1):81–106, 1986.
[37] Queiros, C.E. and Gelsema, E.S., On feature selection. In: Proceedings of Seventh International Conference on Pattern
Recognition, 1:128–130, July-Aug 1984.
[38] Quinlan, J., Induction of decision trees. In: Machine Learning, Morgan Kaufmann, 81–106. 1986.
[39] Quinlan, J.R., C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
[40] Rissanen, J., Modelling by shortest data description. Automatica, 14:465–471, 1978.
[41] Schaffer, C., Overfitting avoidance as bias. Machine Learning, 10(2):153–178, 1993.
[42] Schaffer, C., A conservation law for generalization performance. In: Proceedings of Eleventh International Conference on
Machine Learning, Morgan Kaufmann, New Brunswick, NJ, 259–265, 1994.
[43] Schlimmer, J.C., Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal
pruning. In: Proceedings of Tenth International Conference on Machine Learning, 284–290, (1993).
[44] Segen, J., Feature selection and constructive inference. In: Proceedings of Seventh International Conference on Pattern
Recognition, 1344–1346, 1984.
[45] Sheinvald, J., Dom, B. and Niblack, W., A modelling approach to feature selection. In: Proceedings of Tenth International
Conference on Pattern Recognition, 1:535–539, June 1990.
[46] Siedlecki, W. and Sklansky, J., On automatic feature selection. International Journal of Pattern Recognition and Artificial
Intelligence, 2:197–220, 1988.
[47] Skalak, D.B., Prototype and feature selection by sampling and random mutation hill-climbing algorithms. In: Proceedings
of Eleventh International Conference on Machine Learning, Morgan Kaufmann, New Brunswick, 293–301, 1994.
[48] Thrun, S.B., et al., The MONK’s problems: A performance comparison of different learning algorithms. Technical Report
CMU-CS-91-197, Carnegie Mellon University, December 1991.
[49] Vafaie, H. and Imam, I.F., Feature selection methods: genetic algorithms vs. greedy-like search. In: Proceedings of
International Conference on Fuzzy and Intelligent Control Systems, 1994.
[50] Xu, L., Yan, P. and Chang, T., Best first strategy for feature selection. In: Proceedings of Ninth International Conference
on Pattern Recognition, 706–708, 1988.