ABSTRACT
Classifier ensembles have been used successfully to improve accuracy rates of the underlying classification mechanisms. Through the use of aggregated classifications, it becomes possible to achieve lower error rates in classification than by using a single classifier instance. Ensembles are most often used with collections of decision trees or neural networks owing to their higher rates of error when used individually. In this paper, we will consider a unique implementation of a classifier ensemble which utilizes kNN classifiers. Each classifier is tailored to detecting membership in a specific class using a best subset selection process for variables. This provides the diversity needed to successfully implement an ensemble. An aggregating mechanism for determining the final classification from the ensemble is presented and tested against several well known datasets.
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...ijseajournal
- The document presents VINAYAKA, a semi-supervised projected clustering method using differential evolution.
- VINAYAKA uses a hybrid cluster validation index combining the Subspace Clustering Quality Estimate index (for internal validation) and Gini index gain (for external validation). Differential evolution optimizes this index to find optimal subspace cluster centers.
- The method is tested on the Wisconsin breast cancer dataset, and synthetic datasets are used to demonstrate that the hybrid index can identify the correct number of clusters more accurately than an internal index alone.
Multilevel techniques for the clustering problemcsandit
Data Mining is concerned with the discovery of interesting patterns and knowledge in data
repositories. Cluster Analysis which belongs to the core methods of data mining is the process
of discovering homogeneous groups called clusters. Given a data-set and some measure of
similarity between data objects, the goal in most clustering algorithms is maximizing both the
homogeneity within each cluster and the heterogeneity between different clusters. In this work,
two multilevel algorithms for the clustering problem are introduced. The multilevel
paradigm suggests looking at the clustering problem as a hierarchical optimization process
going through different levels evolving from a coarse grain to fine grain strategy. The clustering
problem is solved by first reducing the problem level by level to a coarser problem where an
initial clustering is computed. The clustering of the coarser problem is mapped back level-bylevel
to obtain a better clustering of the original problem by refining the intermediate different
clustering obtained at various levels. A benchmark using a number of data sets collected from a
variety of domains is used to compare the effectiveness of the hierarchical approach against its
single-level counterpart.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
1. The document discusses using a hybrid genetic algorithm-support vector machine (GA-SVM) approach for feature selection in email classification to improve SVM performance.
2. SVM has been shown to be inefficient and consume a lot of computational resources when classifying large email datasets with many features.
3. The hybrid GA-SVM approach uses a genetic algorithm to optimize feature selection for SVM in order to improve classification accuracy and reduce computation time for email spam detection.
11.hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
The document summarizes a study that develops a hybrid genetic algorithm-support vector machine (GA-SVM) technique for feature selection in email classification. The technique uses a genetic algorithm to optimize the feature selection and parameters of an SVM classifier. The goal is to improve the SVM's classification accuracy and reduce computation time for large email datasets. The study tests the hybrid GA-SVM approach on a spam email dataset. The results show improvements in classification accuracy and computation time over using SVM alone.
K-means and bayesian networks to determine building damage levelsTELKOMNIKA JOURNAL
Many troubles in life require decision-making with convoluted processes because they are caused by uncertainty about the process of relationships that appear in the system. This problem leads to the creation of a model called the Bayesian Network. Bayesian Network is a Bayesian supported development supported by computing advancements. The Bayesian network has also been developed in various fields. At this time, information can implement Bayesian Networks in determining the extent of damage to buildings using individual building data. In practice, there is mixed data which is a combination of continuous and discrete variables. Therefore, to simplify the study it is assumed that all variables are discrete in order to solve practical problems in the implementation of theory. Discretization method used is the K-Means clustering because the percentage of validity obtained by this method is greater than the binning method.
The Influence of Age Assignments on the Performance of Immune AlgorithmsMario Pavone
How long a B cell remains, evolves and matures inside a population plays a crucial role on the capability for an immune algorithm to jump out from local optima, and find the global optimum. Assigning the right age to each clone (or offspring, in general) means to find the proper balancing between the exploration and exploitation. In this research work we present an experimental study conducted on an immune algorithm, based on the clonal selection principle, and performed on eleven different age assignments, with the main aim to verify if at least one, or two, of the top 4 in the previous efficiency ranking produced on the one-max problem, still appear among the top 4 in the new efficiency ranking obtained on a different complex problem. Thus, the NK landscape model has been considered as the test problem, which is a mathematical model formulated for the study of tunably rugged fitness landscape. From the many experiments performed is possible to assert that in the elitism variant of the immune algorithm, two of the best age assignments previously discovered, still continue to appear among the top 3 of the new rankings produced; whilst they become three in the no elitism version. Further, in the first variant none of the 4 top previous ones ranks ever in the first position, unlike on the no elitism variant, where the previous best one continues to appear in 1st position more than the others. Finally, this study confirms that the idea to assign the same age of the parent to the cloned B cell is not a good strategy since it continues to be as the worst also in the new
efficiency ranking.
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...ijseajournal
- The document presents VINAYAKA, a semi-supervised projected clustering method using differential evolution.
- VINAYAKA uses a hybrid cluster validation index combining the Subspace Clustering Quality Estimate index (for internal validation) and Gini index gain (for external validation). Differential evolution optimizes this index to find optimal subspace cluster centers.
- The method is tested on the Wisconsin breast cancer dataset, and synthetic datasets are used to demonstrate that the hybrid index can identify the correct number of clusters more accurately than an internal index alone.
Multilevel techniques for the clustering problemcsandit
Data Mining is concerned with the discovery of interesting patterns and knowledge in data
repositories. Cluster Analysis which belongs to the core methods of data mining is the process
of discovering homogeneous groups called clusters. Given a data-set and some measure of
similarity between data objects, the goal in most clustering algorithms is maximizing both the
homogeneity within each cluster and the heterogeneity between different clusters. In this work,
two multilevel algorithms for the clustering problem are introduced. The multilevel
paradigm suggests looking at the clustering problem as a hierarchical optimization process
going through different levels evolving from a coarse grain to fine grain strategy. The clustering
problem is solved by first reducing the problem level by level to a coarser problem where an
initial clustering is computed. The clustering of the coarser problem is mapped back level-bylevel
to obtain a better clustering of the original problem by refining the intermediate different
clustering obtained at various levels. A benchmark using a number of data sets collected from a
variety of domains is used to compare the effectiveness of the hierarchical approach against its
single-level counterpart.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
1. The document discusses using a hybrid genetic algorithm-support vector machine (GA-SVM) approach for feature selection in email classification to improve SVM performance.
2. SVM has been shown to be inefficient and consume a lot of computational resources when classifying large email datasets with many features.
3. The hybrid GA-SVM approach uses a genetic algorithm to optimize feature selection for SVM in order to improve classification accuracy and reduce computation time for email spam detection.
11.hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
The document summarizes a study that develops a hybrid genetic algorithm-support vector machine (GA-SVM) technique for feature selection in email classification. The technique uses a genetic algorithm to optimize the feature selection and parameters of an SVM classifier. The goal is to improve the SVM's classification accuracy and reduce computation time for large email datasets. The study tests the hybrid GA-SVM approach on a spam email dataset. The results show improvements in classification accuracy and computation time over using SVM alone.
K-means and bayesian networks to determine building damage levelsTELKOMNIKA JOURNAL
Many troubles in life require decision-making with convoluted processes because they are caused by uncertainty about the process of relationships that appear in the system. This problem leads to the creation of a model called the Bayesian Network. Bayesian Network is a Bayesian supported development supported by computing advancements. The Bayesian network has also been developed in various fields. At this time, information can implement Bayesian Networks in determining the extent of damage to buildings using individual building data. In practice, there is mixed data which is a combination of continuous and discrete variables. Therefore, to simplify the study it is assumed that all variables are discrete in order to solve practical problems in the implementation of theory. Discretization method used is the K-Means clustering because the percentage of validity obtained by this method is greater than the binning method.
The Influence of Age Assignments on the Performance of Immune AlgorithmsMario Pavone
How long a B cell remains, evolves and matures inside a population plays a crucial role on the capability for an immune algorithm to jump out from local optima, and find the global optimum. Assigning the right age to each clone (or offspring, in general) means to find the proper balancing between the exploration and exploitation. In this research work we present an experimental study conducted on an immune algorithm, based on the clonal selection principle, and performed on eleven different age assignments, with the main aim to verify if at least one, or two, of the top 4 in the previous efficiency ranking produced on the one-max problem, still appear among the top 4 in the new efficiency ranking obtained on a different complex problem. Thus, the NK landscape model has been considered as the test problem, which is a mathematical model formulated for the study of tunably rugged fitness landscape. From the many experiments performed is possible to assert that in the elitism variant of the immune algorithm, two of the best age assignments previously discovered, still continue to appear among the top 3 of the new rankings produced; whilst they become three in the no elitism version. Further, in the first variant none of the 4 top previous ones ranks ever in the first position, unlike on the no elitism variant, where the previous best one continues to appear in 1st position more than the others. Finally, this study confirms that the idea to assign the same age of the parent to the cloned B cell is not a good strategy since it continues to be as the worst also in the new
efficiency ranking.
This document presents the results of a cluster analysis study conducted on student supermarket shoppers. The analysis used factor scores measuring the importance students place on supermarket features like price, quality, staff, and accessibility. Three clusters were identified. ANOVA testing showed significant differences between clusters. Cluster profiles were developed based on average factor scores and interpretations identified economically-minded, quality-focused, and convenience-prioritizing shopper segments. The results provide insights to better target marketing strategies at different student shopper groups.
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...IJECEIAES
The variable selection is an important technique the reducing dimensionality of data frequently used in data preprocessing for performing data mining. This paper presents a new variable selection algorithm uses the heuristic variable selection (HVS) and Minimum Redundancy Maximum Relevance (MRMR). We enhance the HVS method for variab le selection by incorporating (MRMR) filter. Our algorithm is based on wrapper approach using multi-layer perceptron. We called this algorithm a HVS-MRMR Wrapper for variables selection. The relevance of a set of variables is measured by a convex combination of the relevance given by HVS criterion and the MRMR criterion. This approach selects new relevant variables; we evaluate the performance of HVS-MRMR on eight benchmark classification problems. The experimental results show that HVS-MRMR selected a less number of variables with high classification accuracy compared to MRMR and HVS and without variables selection on most datasets. HVS-MRMR can be applied to various classification problems that require high classification accuracy.
RANKING BASED ON COLLABORATIVE FEATURE-WEIGHTING APPLIED TO THE RECOMMENDATIO...ijaia
Current research on recommendation systems focuses on optimization and evaluation of the quality
of ranked recommended results. One of the most common approaches used in digital paper
libraries to present and recommend relevant search results, is ranking the papers based on their
features. However, feature utility or relevance varies greatly from highly relevant to less relevant,
and redundant. Departing from the existing recommendation systems, in which all item features
are considered to be equally important, this study presents the initial development of an approach
to feature weighting with the goal of obtaining a novel recommendation method in which features
which are more effective have a higher contribution/weight to the ranking process. Furthermore,
it focuses on obtaining ranking of results returned by a query through a collaborative weighting
procedure carried out by human users. The collaborative feature-weighting procedure is shown to
be incremental, which in turn leads to an incremental approach to feature-based similarity evaluation.
The obtained system is then evaluated using Normalized Discounted Cumulative Gain
(NDCG) with respect to a crowd-sourced ranked results. Comparison between the performance of
the proposed and Ranking SVM methods shows that the overall ranking accuracy of the proposed
approach outperforms the ranking accuracy of Ranking SVM method.
Images as Occlusions of Textures: A Framework for Segmentationjohn236zaq
The document proposes a new framework for unsupervised image segmentation based on modeling images as occlusions of random textures. It describes segmenting images by using local histograms to identify textures in synthetic texture mosaics and real histology images. The framework draws on existing work in nonnegative matrix factorization and image deconvolution. Results show promise in segmenting histopathology images, which are difficult to segment due to subtle boundaries between regions.
Regularized Weighted Ensemble of Deep Classifiers ijcsa
Ensemble of classifiers increases the performance of the classification since the decision of many experts
are fused together to generate the resultant decision for prediction making. Deep learning is a classification algorithm where along with the basic learning technique, fine tuning learning is done for improved precision of learning. Deep classifier ensemble learning is having a good scope of research.Feature subset selection is another for creating individual classifiers to be fused for ensemble learning. All these ensemble techniques faces ill posed problem of overfitting. Regularized weighted ensemble of deep support vector machine performs the prediction analysis on the three UCI repository problems IRIS,Ionosphere and Seed data set, thereby increasing the generalization of the boundary plot between the
classes of the data set. The singular value decomposition reduced norm 2 regularization with the two level
deep classifier ensemble gives the best result in our experiments.
Mammogram image segmentation using rough clusteringeSAT Journals
This document discusses using rough clustering algorithms for mammogram image segmentation. It proposes using Rough K-Means clustering on Haralick texture features extracted from mammogram images. The Rough K-Means algorithm is compared to traditional K-Means and Fuzzy C-Means using metrics like mean square error and root mean square error. Preliminary results found that Rough K-Means produced better segmentation results than the other methods. The document provides background on rough set theory, image segmentation, feature extraction, and different clustering algorithms that can be used.
A comparative study of clustering and biclustering of microarray dataijcsit
There are subsets of genes that have similar behavior under subsets of conditions, so we say that they
coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can
be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of
utmost importance to make a simultaneous clustering of genes and conditions to identify clusters of genes
that are coexpressed under clusters of conditions. This type of clustering is called biclustering.
Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate
this problem by finding suboptimal solutions. In this paper, we make a new survey on clustering and
biclustering of gene expression data, also called microarray data.
This document discusses using feature selection techniques to address the curse of dimensionality in microarray data analysis. It presents the problem of having many more features than samples in bioinformatics tasks like cancer classification and network inference. It describes filter, wrapper and embedded feature selection approaches and proposes a blocking strategy that uses multiple learning algorithms to evaluate feature subsets in order to improve selection robustness when samples are limited. Finally, it lists several microarray gene expression datasets that are commonly used to evaluate feature selection methods.
Fuzzy Max-Min Composition in Quality Specifications of Multi-Agent Systemijcoa
In this paper a new technique named Fuzzy Max- Min Composition is used to obtain prioritization of quality specifications that assist quality engineer in achieving the desired level of quality for Multi-agent systems. Intuitionistic fuzzy Set has been used to capture the uncertainties associated with stakeholder’s recommendation.
Classification of Breast Cancer Diseases using Data Mining Techniquesinventionjournals
Medical data mining has great deal for exploring new knowledge from large amount of data. Classification is one of the important data mining techniques for classification of data. In this research work, we have used various data mining based classification techniques for classification of cancer diseases patient or not. We applied the Breast Cancer-Wisconsin (Original) data set into different data mining techniques and compared the accuracy of models with two different data partitions. BayesNet achieved highest accuracy as 97.13% in case of 10-fold data partitions. We have also applied the info gain feature selection technique on BayesNet and Support Vector Machine (SVM) and achieved best accuracy 97.28% accuracy with BayesNet in case of 6 feature subset.
ROLE OF CERTAINTY FACTOR IN GENERATING ROUGH-FUZZY RULEIJCSEA Journal
The generation of effective feature-based rules is essential to the development of any intelligent system. This paper presents an approach that integrates a powerful fuzzy rule generation algorithm with a rough set-assisted feature reduction method to generate diagnostic rule with a certainty factor. Certainty factor of each rule is calculated by considering both the membership value of each linguistic term introduced at time of fuzzyfication of data as well as possibility values, due to inconsistent data, generated by rough set theory at time of rule generation. In time of knowledge inferencing in an intelligent system, certainty factor of each rule will play an important role to find out the appropriate rule to be selected. Experimental results demonstrate the superiority of our approach.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
This document describes a clustering-based method for gene selection to classify samples in microarray data. It involves calculating the relevance of each gene to class labels and the redundancy between genes using mutual information. Genes are clustered based on their relevance, with the most relevant gene selected as the cluster representative. Min-hash clustering is then applied to reduce redundant genes and cluster size. The goal is to select a minimal set of non-redundant genes that can accurately classify samples by reducing noise from irrelevant genes.
An Influence of Measurement Scale of Predictor Variable on Logistic Regressio...IJECEIAES
Much real world decision making is based on binary categories of information that agree or disagree, accept or reject, succeed or fail and so on. Information of this category is the output of a classification method that is the domain of statistical field studies (eg Logistic Regression method) and machine learning (eg Learning Vector Quantization (LVQ)). The input argument of a classification method has a very crucial role to the resulting output condition. This paper investigated the influence of various types of input data measurement (interval, ratio, and nominal) to the performance of logistic regression method and LVQ in classifying an object. Logistic regression modeling is done in several stages until a model that meets the suitability model test is obtained. Modeling on LVQ was tested on several codebook sizes and selected the most optimal LVQ model. The best model of each method compared to its performance on object classification based on Hit Ratio indicator. In logistic regression model obtained 2 models that meet the model suitability test is a model with predictive variables scaled interval and nominal, while in LVQ modeling obtained 3 pieces of the most optimal model with a different codebook. In the data with interval-scale predictor variable, the performance of both methods is the same. The performance of both models is just as bad when the data have the predictor variables of the nominal scale. In the data with predictor variable has ratio scale, the LVQ method able to produce moderate enough performance, while on logistic regression modeling is not obtained the model that meet model suitability test. Thus if the input dataset has interval or ratio-scale predictor variables than it is preferable to use the LVQ method for modeling the object classification.
International Journal of Research in Engineering and Science is an open access peer-reviewed international forum for scientists involved in research to publish quality and refereed papers. Papers reporting original research or experimentally proved review work are welcome. Papers for publication are selected through peer review to ensure originality, relevance, and readability.
This document discusses measuring the importance and weight of decision makers in group decision making processes. It introduces a method to calculate weights for decision makers based on the number of iterations their pairwise comparison matrices take to reach convergence when using the eigenvector method for criteria weighting. A case study applies this method to determine the weights of three decision makers (A, B, and C) who provide pairwise comparison matrices for weighting four criteria related to selecting a form. The number of iterations for each decision maker's matrix to converge determines their absolute and relative weights in the group decision making process.
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
Cluster analysis is a technique used to group objects based on characteristics they possess. It involves measuring the distance or similarity between objects and grouping those that are most similar together. There are two main types: hierarchical cluster analysis, which groups objects sequentially into clusters; and nonhierarchical cluster analysis, which directly assigns objects to pre-specified clusters. The choice of method depends on factors like sample size and research objectives.
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
Abstract— Text categorization is the process of identifying and assigning predefined class to which a document belongs. A wide variety of algorithms are currently available to perform the text categorization. Among them, K-Nearest Neighbor text classifier is the most commonly used one. It is used to test the degree of similarity between documents and k training data, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, the text is categorized into different classes based on K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing the text. This improves the efficiency of K-Nearest Neighbor algorithm by generating the classification model. The text classification using K-Nearest Neighbor algorithm has a wide variety of text mining applications.
Assessing the compactness and isolation of individual clustersperfj
This document summarizes a new cluster validation method called the NEN/FIN method. The method assesses the compactness and isolation of individual clusters by calculating average nearest-exterior-neighbor versus furthest-interior-neighbor (NEN/FIN) ratios for clusters and comparing them to reference data sets to determine how unusual they are. Higher and more unusual NEN/FIN ratios indicate more compact and isolated, or "real", clusters. The method provides an absolute assessment of cluster quality without requiring parameter estimates.
Analysis On Classification Techniques In Mammographic Mass Data SetIJERA Editor
Data mining, the extraction of hidden information from large databases, is to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-Mining classification techniques deals with determining to which group each data instances are associated with. It can deal with a wide variety of data so that large amount of data can be involved in processing. This paper deals with analysis on various data mining classification techniques such as Decision Tree Induction, Naïve Bayes , k-Nearest Neighbour (KNN) classifiers in mammographic mass dataset.
This document presents the results of a cluster analysis study conducted on student supermarket shoppers. The analysis used factor scores measuring the importance students place on supermarket features like price, quality, staff, and accessibility. Three clusters were identified. ANOVA testing showed significant differences between clusters. Cluster profiles were developed based on average factor scores and interpretations identified economically-minded, quality-focused, and convenience-prioritizing shopper segments. The results provide insights to better target marketing strategies at different student shopper groups.
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...IJECEIAES
The variable selection is an important technique the reducing dimensionality of data frequently used in data preprocessing for performing data mining. This paper presents a new variable selection algorithm uses the heuristic variable selection (HVS) and Minimum Redundancy Maximum Relevance (MRMR). We enhance the HVS method for variab le selection by incorporating (MRMR) filter. Our algorithm is based on wrapper approach using multi-layer perceptron. We called this algorithm a HVS-MRMR Wrapper for variables selection. The relevance of a set of variables is measured by a convex combination of the relevance given by HVS criterion and the MRMR criterion. This approach selects new relevant variables; we evaluate the performance of HVS-MRMR on eight benchmark classification problems. The experimental results show that HVS-MRMR selected a less number of variables with high classification accuracy compared to MRMR and HVS and without variables selection on most datasets. HVS-MRMR can be applied to various classification problems that require high classification accuracy.
RANKING BASED ON COLLABORATIVE FEATURE-WEIGHTING APPLIED TO THE RECOMMENDATIO...ijaia
Current research on recommendation systems focuses on optimization and evaluation of the quality
of ranked recommended results. One of the most common approaches used in digital paper
libraries to present and recommend relevant search results, is ranking the papers based on their
features. However, feature utility or relevance varies greatly from highly relevant to less relevant,
and redundant. Departing from the existing recommendation systems, in which all item features
are considered to be equally important, this study presents the initial development of an approach
to feature weighting with the goal of obtaining a novel recommendation method in which features
which are more effective have a higher contribution/weight to the ranking process. Furthermore,
it focuses on obtaining ranking of results returned by a query through a collaborative weighting
procedure carried out by human users. The collaborative feature-weighting procedure is shown to
be incremental, which in turn leads to an incremental approach to feature-based similarity evaluation.
The obtained system is then evaluated using Normalized Discounted Cumulative Gain
(NDCG) with respect to a crowd-sourced ranked results. Comparison between the performance of
the proposed and Ranking SVM methods shows that the overall ranking accuracy of the proposed
approach outperforms the ranking accuracy of Ranking SVM method.
Images as Occlusions of Textures: A Framework for Segmentationjohn236zaq
The document proposes a new framework for unsupervised image segmentation based on modeling images as occlusions of random textures. It describes segmenting images by using local histograms to identify textures in synthetic texture mosaics and real histology images. The framework draws on existing work in nonnegative matrix factorization and image deconvolution. Results show promise in segmenting histopathology images, which are difficult to segment due to subtle boundaries between regions.
Regularized Weighted Ensemble of Deep Classifiers ijcsa
Ensemble of classifiers increases the performance of the classification since the decision of many experts
are fused together to generate the resultant decision for prediction making. Deep learning is a classification algorithm where along with the basic learning technique, fine tuning learning is done for improved precision of learning. Deep classifier ensemble learning is having a good scope of research.Feature subset selection is another for creating individual classifiers to be fused for ensemble learning. All these ensemble techniques faces ill posed problem of overfitting. Regularized weighted ensemble of deep support vector machine performs the prediction analysis on the three UCI repository problems IRIS,Ionosphere and Seed data set, thereby increasing the generalization of the boundary plot between the
classes of the data set. The singular value decomposition reduced norm 2 regularization with the two level
deep classifier ensemble gives the best result in our experiments.
Mammogram image segmentation using rough clusteringeSAT Journals
This document discusses using rough clustering algorithms for mammogram image segmentation. It proposes using Rough K-Means clustering on Haralick texture features extracted from mammogram images. The Rough K-Means algorithm is compared to traditional K-Means and Fuzzy C-Means using metrics like mean square error and root mean square error. Preliminary results found that Rough K-Means produced better segmentation results than the other methods. The document provides background on rough set theory, image segmentation, feature extraction, and different clustering algorithms that can be used.
A comparative study of clustering and biclustering of microarray dataijcsit
There are subsets of genes that have similar behavior under subsets of conditions, so we say that they
coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can
be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of
utmost importance to make a simultaneous clustering of genes and conditions to identify clusters of genes
that are coexpressed under clusters of conditions. This type of clustering is called biclustering.
Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate
this problem by finding suboptimal solutions. In this paper, we make a new survey on clustering and
biclustering of gene expression data, also called microarray data.
This document discusses using feature selection techniques to address the curse of dimensionality in microarray data analysis. It presents the problem of having many more features than samples in bioinformatics tasks like cancer classification and network inference. It describes filter, wrapper and embedded feature selection approaches and proposes a blocking strategy that uses multiple learning algorithms to evaluate feature subsets in order to improve selection robustness when samples are limited. Finally, it lists several microarray gene expression datasets that are commonly used to evaluate feature selection methods.
Fuzzy Max-Min Composition in Quality Specifications of Multi-Agent Systemijcoa
In this paper a new technique named Fuzzy Max- Min Composition is used to obtain prioritization of quality specifications that assist quality engineer in achieving the desired level of quality for Multi-agent systems. Intuitionistic fuzzy Set has been used to capture the uncertainties associated with stakeholder’s recommendation.
Classification of Breast Cancer Diseases using Data Mining Techniquesinventionjournals
Medical data mining has great deal for exploring new knowledge from large amount of data. Classification is one of the important data mining techniques for classification of data. In this research work, we have used various data mining based classification techniques for classification of cancer diseases patient or not. We applied the Breast Cancer-Wisconsin (Original) data set into different data mining techniques and compared the accuracy of models with two different data partitions. BayesNet achieved highest accuracy as 97.13% in case of 10-fold data partitions. We have also applied the info gain feature selection technique on BayesNet and Support Vector Machine (SVM) and achieved best accuracy 97.28% accuracy with BayesNet in case of 6 feature subset.
ROLE OF CERTAINTY FACTOR IN GENERATING ROUGH-FUZZY RULEIJCSEA Journal
The generation of effective feature-based rules is essential to the development of any intelligent system. This paper presents an approach that integrates a powerful fuzzy rule generation algorithm with a rough set-assisted feature reduction method to generate diagnostic rule with a certainty factor. Certainty factor of each rule is calculated by considering both the membership value of each linguistic term introduced at time of fuzzyfication of data as well as possibility values, due to inconsistent data, generated by rough set theory at time of rule generation. In time of knowledge inferencing in an intelligent system, certainty factor of each rule will play an important role to find out the appropriate rule to be selected. Experimental results demonstrate the superiority of our approach.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
This document describes a clustering-based method for gene selection to classify samples in microarray data. It involves calculating the relevance of each gene to class labels and the redundancy between genes using mutual information. Genes are clustered based on their relevance, with the most relevant gene selected as the cluster representative. Min-hash clustering is then applied to reduce redundant genes and cluster size. The goal is to select a minimal set of non-redundant genes that can accurately classify samples by reducing noise from irrelevant genes.
An Influence of Measurement Scale of Predictor Variable on Logistic Regressio...IJECEIAES
Much real world decision making is based on binary categories of information that agree or disagree, accept or reject, succeed or fail and so on. Information of this category is the output of a classification method that is the domain of statistical field studies (eg Logistic Regression method) and machine learning (eg Learning Vector Quantization (LVQ)). The input argument of a classification method has a very crucial role to the resulting output condition. This paper investigated the influence of various types of input data measurement (interval, ratio, and nominal) to the performance of logistic regression method and LVQ in classifying an object. Logistic regression modeling is done in several stages until a model that meets the suitability model test is obtained. Modeling on LVQ was tested on several codebook sizes and selected the most optimal LVQ model. The best model of each method compared to its performance on object classification based on Hit Ratio indicator. In logistic regression model obtained 2 models that meet the model suitability test is a model with predictive variables scaled interval and nominal, while in LVQ modeling obtained 3 pieces of the most optimal model with a different codebook. In the data with interval-scale predictor variable, the performance of both methods is the same. The performance of both models is just as bad when the data have the predictor variables of the nominal scale. In the data with predictor variable has ratio scale, the LVQ method able to produce moderate enough performance, while on logistic regression modeling is not obtained the model that meet model suitability test. Thus if the input dataset has interval or ratio-scale predictor variables than it is preferable to use the LVQ method for modeling the object classification.
International Journal of Research in Engineering and Science is an open access peer-reviewed international forum for scientists involved in research to publish quality and refereed papers. Papers reporting original research or experimentally proved review work are welcome. Papers for publication are selected through peer review to ensure originality, relevance, and readability.
This document discusses measuring the importance and weight of decision makers in group decision making processes. It introduces a method to calculate weights for decision makers based on the number of iterations their pairwise comparison matrices take to reach convergence when using the eigenvector method for criteria weighting. A case study applies this method to determine the weights of three decision makers (A, B, and C) who provide pairwise comparison matrices for weighting four criteria related to selecting a form. The number of iterations for each decision maker's matrix to converge determines their absolute and relative weights in the group decision making process.
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
Cluster analysis is a technique used to group objects based on characteristics they possess. It involves measuring the distance or similarity between objects and grouping those that are most similar together. There are two main types: hierarchical cluster analysis, which groups objects sequentially into clusters; and nonhierarchical cluster analysis, which directly assigns objects to pre-specified clusters. The choice of method depends on factors like sample size and research objectives.
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
Abstract— Text categorization is the process of identifying and assigning predefined class to which a document belongs. A wide variety of algorithms are currently available to perform the text categorization. Among them, K-Nearest Neighbor text classifier is the most commonly used one. It is used to test the degree of similarity between documents and k training data, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, the text is categorized into different classes based on K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing the text. This improves the efficiency of K-Nearest Neighbor algorithm by generating the classification model. The text classification using K-Nearest Neighbor algorithm has a wide variety of text mining applications.
Assessing the compactness and isolation of individual clustersperfj
This document summarizes a new cluster validation method called the NEN/FIN method. The method assesses the compactness and isolation of individual clusters by calculating average nearest-exterior-neighbor versus furthest-interior-neighbor (NEN/FIN) ratios for clusters and comparing them to reference data sets to determine how unusual they are. Higher and more unusual NEN/FIN ratios indicate more compact and isolated, or "real", clusters. The method provides an absolute assessment of cluster quality without requiring parameter estimates.
Analysis On Classification Techniques In Mammographic Mass Data SetIJERA Editor
Data mining, the extraction of hidden information from large databases, is to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-Mining classification techniques deals with determining to which group each data instances are associated with. It can deal with a wide variety of data so that large amount of data can be involved in processing. This paper deals with analysis on various data mining classification techniques such as Decision Tree Induction, Naïve Bayes , k-Nearest Neighbour (KNN) classifiers in mammographic mass dataset.
This document summarizes a research paper that evaluates cluster quality using a modified density subspace clustering approach. It discusses how density subspace clustering can be used to identify clusters in high-dimensional datasets by detecting density-connected clusters in all subspaces. The proposed approach uses a density subspace clustering algorithm to select attribute subsets and identify the best clusters. It then calculates intra-cluster and inter-cluster distances to evaluate cluster quality and compares the results to other clustering algorithms in terms of accuracy and runtime. Experimental results showed that the proposed method improves clustering quality and performs faster than existing techniques.
This document presents a novel fuzzy k-nearest neighbor equality (FK-NNE) algorithm for classifying masses in mammograms as benign or malignant. The algorithm assigns membership values to different classes based on distances to k-nearest neighbors. It achieved 94.46% sensitivity, 96.81% specificity, and 96.52% accuracy, outperforming k-nearest neighbors, fuzzy k-nearest neighbors, and k-nearest neighbor equality algorithms. The algorithm considers relative importance of neighbors and assigns partial membership to classes, addressing issues with insufficient knowledge faced by other techniques. Experimental results demonstrated FK-NNE had the best performance with an area under the ROC curve of 0.9734, indicating high diagnostic accuracy.
Clustering heterogeneous categorical data using enhanced mini batch K-means ...IJECEIAES
This document presents a proposed framework called MBKEM (Mini Batch K-means with Entropy Measure) for clustering heterogeneous categorical data. MBKEM uses an entropy distance measure within a mini batch k-means algorithm. The framework is evaluated using secondary data from a public survey. Evaluation metrics show MBKEM outperforms other clustering algorithms with high accuracy, v-measure, adjusted rand index, and Fowlkes-Mallow's index. MBKEM also has faster average cluster generation time than other methods. The proposed framework provides an improved solution for clustering heterogeneous categorical data.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
This document provides a survey of optimization approaches that have been applied to text document clustering. It discusses several clustering algorithms and categorizes them as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, frequent pattern-based clustering, and constraint-based clustering. It then describes several soft computing techniques that have been used as optimization approaches for text document clustering, including genetic algorithms, bees algorithms, particle swarm optimization, and ant colony optimization. These optimization techniques perform a global search to improve the quality and efficiency of document clustering algorithms.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...IJERA Editor
This document presents a new hybrid model for feature selection and classification that combines genetic algorithm and naive Bayes. The proposed method first uses a binary coded genetic algorithm to select a reduced subset of important features from datasets. It then applies a naive Bayes classification method to evaluate the selected features and identify the subset that achieves the highest classification accuracy. The performance of the proposed hybrid model is evaluated on eight datasets and compared to recent publications, finding it achieves satisfactory or higher classification accuracy using fewer features.
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)IJCSEA Journal
Feature selection is an effective method used in text categorization for sorting a set of documents into certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing irredundant terms from the corpus. Genome contains the total amount of genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper a Clustering technique called Hierarchical Techniques is used tocategories the Features from the Genome documents. A framework is proposed for Genomic Feature set Selection. A Filter based Feature Selection Method like
2 statistics, CHIR statistics are used to select the Feature set. The Selected Feature set is verified by using F-measure and it is biologically validated for Biological relevance using the BLAST tool.
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)IJCSEA Journal
Feature selection is an effective method used in text categorization for sorting a set of documents into certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing irredundant terms from the corpus. Genome contains the total amount of genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper a Clustering technique called Hierarchical Techniques is used to categories the Features from the Genome documents. A framework is proposed for Genomic Feature set Selection. A Filter based Feature Selection Method like statistics, CHIR statistics are used to select the Feature set. The Selected Feature set is verified by using F-measure and it is biologically validated for Biological relevance using the BLAST tool.
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery.
Earlier studies on cancer classification have limited diagnostic ability. The recent development of DNA
microarray technology has made monitoring of thousands of gene expression simultaneously. By using this
abundance of gene expression data researchers are exploring the possibilities of cancer classification.
There are number of methods proposed with good results, but lot of issues still need to be addressed. This
paper present an overview of various cancer classification methods and evaluate these proposed methods
based on their classification accuracy, computational time and ability to reveal gene information. We have
also evaluated and introduced various proposed gene selection method. In this paper, several issues
related to cancer classification have also been discussed.
Max stable set problem to found the initial centroids in clustering problemnooriasukmaningtyas
In this paper, we propose a new approach to solve the document-clustering using the K-Means algorithm. The latter is sensitive to the random selection of the k cluster centroids in the initialization phase. To evaluate the quality of K-Means clustering we propose to model the text document clustering problem as the max stable set problem (MSSP) and use continuous Hopfield network to solve the MSSP problem to have initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle, MSSP consists to find the largest set of nodes completely disconnected in a graph, and in clustering, all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient of large data sets, is much optimized in terms of time, and provides better quality of clustering than other methods.
Clustering of high dimensionality data which can be seen in almost all fields these days is becoming
very tedious process. The key disadvantage of high dimensional data which we can pen down is curse
of dimensionality. As the magnitude of datasets grows the data points become sparse and density of
area becomes less making it difficult to cluster that data which further reduces the performance of
traditional algorithms used for clustering. Semi-supervised clustering algorithms aim to improve
clustering results using limited supervision. The supervision is generally given as pair wise
constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are
designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based
approaches. We first show that a recently-proposed objective function for semi-supervised clustering
based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of
constraint penalty functions, can be expressed as a special case of the global kernel k-means objective
[3]. A recent theoretical connection between global kernel k-means and several graph clustering
objectives enables us to perform semi-supervised clustering of data. In particular, some methods have
been proposed for semi supervised clustering based on pair wise similarity or dissimilarity
information. In this paper, we propose a kernel approach for semi supervised clustering and present in
detail two special cases of this kernel approach.
Literature Survey On Clustering TechniquesIOSR Journals
This document provides a literature review of different clustering techniques. It begins by defining clustering and describing the main categories of clustering methods: hierarchical, partitioning, density-based, grid-based, and model-based. It then summarizes some examples of algorithms for each category in 1-2 sentences. For hierarchical methods, it discusses BIRCH, CURE, and CHAMELEON. For partitioning methods, it mentions k-means clustering and k-medoids. For density-based methods, it lists DBSCAN, OPTICS, DENCLUE. For grid-based methods, it lists CLIQUE, STING, MAFIA, WAVE CLUSTER, O-CLUSTER, ASGC, and
Cancer data partitioning with data structure and difficulty independent clust...IRJET Journal
This document discusses cancer data partitioning using clustering techniques. It begins with an introduction to clustering concepts and different clustering methods like k-means, hierarchical agglomerative clustering, and partitioning methods. It then reviews literature on clustering algorithms and ensemble methods applied to problems like speaker diarization and tumor clustering from gene expression data. The document analyzes issues with existing clustering methodology and proposes a new dynamic ensemble membership selection scheme to support data structure and complexity independent clustering for cancer data partitioning. The method combines partition around medoids clustering with an incremental semi-supervised cluster ensemble framework to improve healthcare data partitioning accuracy.
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...ijcsity
Machine learning for text classification is the
underpinning
of document
cataloging
, news filtering,
document
steering
and
exemplif
ication
. In text mining realm, effective feature selection is significant to
make the learning task more accurate and competent. One of the
traditional
lazy
text classifier
k
-
Nearest
Neighborhood (
k
NN) has
a
major pitfall in calculating the similarity between
all
the
objects in training and
testing se
t
s,
there by leads to exaggeration of
both
computational complexity
of the algorithm
and
massive
consumption
of
main memory
. To diminish these shortcomings
in
viewpoint
of a
data
-
mining
practitioner
a
n
amalgamati
ve technique is proposed in this paper using
a novel restructured version of
k
NN called
Augmented
k
NN
(AkNN)
and
k
-
Medoids
(kMdd)
clustering.
The proposed work
comprises
preprocesses
on
the
initial training
set
by
imposing
attribute feature selection
for reduc
tion of high dimensionality, also it
detects and excludes the high
-
fliers
samples
in t
he
initial
training set
and
re
structure
s
a
constricted
training
set
.
The kMdd clustering algorithm generates the cluster centers (as interior objects) for each category
and
restructures
the constricted training set
with centroids
. This technique
is
amalgamated with
AkNN
classifier
that
was prearranged with
text mining similarity measure
s.
Eventually, s
ignifican
tweights
and ranks were
assigned to each object in the new
training set based upon the
ir
accessory towards the
object in testing set
.
Experiments
conducted
on Reuters
-
21578 a
UCI benchmark
text mining
data
set
, and
comparisons with
traditional
k
NN
classifier designates
the
referred
method
yield
spreeminentrecital
in b
oth clustering and
classification
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...ijsc
This summary provides the high-level information from the document in 3 sentences:
The document proposes a Particle Swarm Optimization (PSO) based ensemble classification model to improve classification of high-dimensional biomedical datasets. It develops an optimized PSO technique to select optimal features and initialize weights for base classifiers in the ensemble model. Experimental results on microarray datasets show the proposed model achieves higher accuracy, true positive rate, and lower error rate compared to traditional feature selection based classification models.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the
availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection
based ensemble learning models is to classify the high dimensional data with high computational efficiency
and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray
datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
Clustering Approach Recommendation System using Agglomerative AlgorithmIRJET Journal
The document discusses clustering approaches and agglomerative algorithms for recommendation systems. It proposes a new clustering approach based on agglomerative algorithms that uses simple calculations to identify clusters. Clustering algorithms aim to group similar objects into clusters while maximizing dissimilarity between objects in different clusters. The document reviews various hierarchical and partitioning clustering methods and algorithms presented in previous literature. It then proposes using an agglomerative clustering approach for recommendation systems that identifies clusters through simple calculations.
Similar to INCREASING ACCURACY THROUGH CLASS DETECTION: ENSEMBLE CREATION USING OPTIMIZED BINARY KNN CLASSIFIERS (20)
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
INCREASING ACCURACY THROUGH CLASS DETECTION: ENSEMBLE CREATION USING OPTIMIZED BINARY KNN CLASSIFIERS
1. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
DOI : 10.5121/ijcsea.2011.1201 1
INCREASING ACCURACY THROUGH CLASS
DETECTION:
ENSEMBLE CREATION USING OPTIMIZED BINARY
KNN CLASSIFIERS
Benjamin Thirey1
and Christopher Eastburg2
1
Department of Mathematical Sciences, United States Military Academy, West Point,
New York, USA
benjamin.thirey@usma.edu
2
Department of Mathematical Sciences, United States Military Academy, West Point,
New York, USA
christopher.eastburg@usma.edu
ABSTRACT
Classifier ensembles have been used successfully to improve accuracy rates of the underlying
classification mechanisms. Through the use of aggregated classifications, it becomes possible to achieve
lower error rates in classification than by using a single classifier instance. Ensembles are most often
used with collections of decision trees or neural networks owing to their higher rates of error when used
individually. In this paper, we will consider a unique implementation of a classifier ensemble which
utilizes kNN classifiers. Each classifier is tailored to detecting membership in a specific class using a
best subset selection process for variables. This provides the diversity needed to successfully implement
an ensemble. An aggregating mechanism for determining the final classification from the ensemble is
presented and tested against several well known datasets.
KEYWORDS
k Nearest Neighbor, Classifier Ensembles, Forward Subset Selection
1. INTRODUCTION
1.1. k-Nearest Neighbor Algorithm
The k-Nearest Neighbors, or kNN algorithm is well known to the data mining community, and
is one of the top algorithms in the field [1]. The algorithm achieves classification between m
different classes. Each instance to be classified is an item that contains a collection of r
different attributes in set A={a1, a2, ..., ar} where aj corresponds to the jth
attribute. Therefore, an
instance is a vector p = <p1, p2, …, pr> of attribute values. For some predetermined value of k,
the nearest k neighbors are determined through the use of a distance metric which is calculated
using the difference in distances between each of the attributes of the instance in question and
its neighbors. Euclidean distance is by far the most popular metric for calculating proximity.
An instance’s membership within a given class can be computed either as a probability or by
simple majority of the class with the most representation in the closest k neighbors. At the
simplest level, this is a problem of binary classification, where data is classified as being in a
certain class or not.
Due to different units of measurement, there is also a need for normalization across attribute
variables in order to prevent one variable from dominating the classification mechanism [2].
One of the problems with kNN is that without some sort of weighting scheme for variables,
each of the variables is treated as being equally important toward determining similarity
2. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
2
between instances. Combining different scales of measurement across attributes when
computing the distance metric between instances can cause severe distortions in the calculations
for determining nearest neighbors. Several different variable weighting schemes and selection
methods to overcome this are discussed by Wettschereck, Aha, Mohri [3]. Given the means by
which neighbors in kNN are calculated, irrelevant variables can have a large effect on final
classification. This becomes especially problematic in cases where a large number of predictor
variables are present [4]. Closely related to this problem is the curse of dimensionality whereby
the average distance between points becomes larger as the number of predictor variables
increases. One of the benefits of proper variable selection is that it has the potential to help
mitigate the curse of dimensionality.
It is generally held that kNN implementations are sensitive to the selection of variables, so
choice of the appropriate subset of variables for use in classification plays a critical role [5].
One of the methods is through the use of forward subset selection (FSS) with the kNN
algorithm [6]. FSS begins by identifying the variable which leads to the highest amount of
accuracy with regards to classifying an instance. That attribute is then selected for inclusion in
the subset of best variables. The remaining variables are then paired up with the set, and the
next variable for inclusion is again calculated by determining which one leads to the greatest
increase in classifier accuracy. This process of variable inclusion continues until no further
gains can be made in accuracy. Clearly, this is a greedy method of determining attributes for
inclusion since the variable selected at each step is the one providing the biggest gains in
accuracy. Therefore, the subset selected at the conclusion of the algorithm will not necessarily
be the most optimal since not all potential combinations of variables were considered.
Additionally, this algorithm is quite processor intensive.
Backward subset selection (BSS) operates in a similar manner, except that all variables are
initially included and then a variable is discarded during each pass through the attributes until
no further improvements in accuracy are achieved. Work by Aha and Bankert [6] found that
FSS of variables led to higher classification rates than BSS. They also conjectured that BSS
does not perform as well with large numbers of variables.
kNN relies on forming a classification based on clusters of data points. There are a variety of
ways to consider kNN clusters for final classification. Simple majority rule is the most
common, but there are other ways of weighting the data [1]. Wettshereck, Aha, and Mohri [5]
provide a comprehensive overview of various selection and weighting schemes used in lazy
learning algorithms, such as kNN, where computation is postponed until classification. These
modifications to the weighting calculations of the algorithm include not only global settings, but
local adjustments to the weights of individual variables. The weights are adjustable depending
on the composition of the underlying data. This allows for greater accuracy and adaptability in
certain portions of the data without imposing global variable weightings.
1.2. Classifier Ensembles
Classifier ensembles render classifications using the collective output of several different
machine learning algorithms instead of only one. Much of the initial development of ensemble
methods came through the use of trees and neural nets to perform classification tasks. It was
recognized that the use of aggregated output from multiple trees or neural nets could achieve
lower error rates than the classification from a single instance of a classifier. The majority of
the research in the area of ensembles uses either decision tree or neural net classifiers. Work
regarding ensemble selection from a collection of various classifiers of different types has been
successful in generating ensembles with higher rates of classification [7].
There are a number of methods for generating classifiers in the ensemble. In order to be
effective, there must be diversity between each of the classifiers. This is usually achieved
through the utilization of an element of randomness when constructing the various classifiers.
According to the distinctions made by Brown, et al. [8] diversity can be either explicit through
3. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
3
the use of deterministic selection of the variables with individual classifiers, or it can be implicit
since diversity is randomly generated. For example, implicit methods achieve diversity through
the initialization of the weights of a neural net at random or using a randomized subset of
features when node splitting in trees.
The development of individual classifiers for use by decision tree or neural net ensembles is
usually performed with a random subset of predictor variables. This is to provide diversity and
ensure that errors are more likely to occur in different areas of the data. This process is repeated
numerous times so that a wide variety of classifiers is produced, and the necessary diversity
amongst individual classifiers is established. Recent research compares how the various means
of generating classifiers compares with the output of their respective ensembles [9]. Techniques
such as bagging and boosting are used to generate different classifiers that make independent
classifications of instances. Bagging is a technique where the underlying dataset is repeatedly
sampled during the training phase, whereas boosting changes the distribution of the training
data by focusing on those items which present difficulties in classification [10].
Researchers examined the use of kNN classifiers as members of an ensemble [11].
Madabhushi, et al. found that using kNN ensembles on Prostate Cancer datasets resulted in
higher accuracy rates than other methods which required extensive training [12]. Work by Bay
considered an ensemble of kNN classifiers which were developed from random subsets of
variables [13]. This method resulted in increased classification accuracy. The objective of
developing these different classifiers is to ensure that their respective errors in classification
occur in different clusters of data. Domeniconi and Yan [14] proposed a method whereby
different subsets of variables were randomly generated and used to construct members of kNN
ensembles. Their approach continued by adding only those classifiers to the ensemble which
improved ensemble classification performance.
Use of the classifier ensemble is straightforward. Consider an ensemble C*
= {c1, c2, …, cm} of
m individual classifiers, with each as a binary classifier. The instance to be classified is passed
through the group of classifiers C* and their corresponding individual classifications are then
aggregated as discussed above in order to determine what the final classification should be.
The final step in developing an ensemble classifier is to determine how each of the votes from
the individual classifiers in the ensemble will be transformed into a final classification. The
most common method is to use a simple majority rule, but it is not difficult to see how various
weighting schemes could be implemented in this process. Perhaps the occurrence of the
classification as membership of a particular class is enough to override all other votes. The
underlying data and application are the primary decision criteria regarding how votes should be
tallied.
1.3. Related and Recent Work
The fundamental strategy of ensemble network classification is to generally isolate errors within
different segments of a population. Oliveira et al, [15], used genetic algorithms to generate
ensembles for classification models of handwriting recognition. Their methodology uses genetic
programming to continually create new networks, search for the best features, and keep the set
of networks that are both accurate but disagree with each other as much as possible. Error rates
in final classification will be less when ensembles use only a subset of the best features for
classification. A supervised and an unsupervised approach were used to extract the best features
relevant for subset selection and ensemble creation. They found that both techniques were
successful and also concluded there are still many open problems with regard to optimal feature
extraction.
K means clustering is a popular classification clustering algorithm by where each observation is
a member of the cluster with the nearest mean. K medoids is a similar approach which uses
actual data points for cluster centers. [2] K means does not work well with data clusters which
4. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
4
are non spherical and of different sizes. There are many techniques in literature to improve the
k means algorithm. For example, fuzzy k means clustering often improves results by
incorporating a probabilistic component into membership classification. Weng et al [16]
effectively used ensembles of k means clustering to improve the classification rate of intrusion
detection for computer network security. Their approach successfully improves classification
with clusters of “anomalistic shapes.” Work by Bharti et al [17] used a decision tree algorithm,
known as J48, built with fuzzy K-means clustering to very accurately map clusters of data to
classification for intrusion detection.
Awad et al [18] recently applied six different machine learning algorithms to spam
classification: Naïve Bayes, Support Vector Machines (SVM), Neural Networks, k-Nearest
Neighbor, Rough Sets, and the Artificial Immune System. While performing well in the
categories of spam recall and overall accuracy, kNN showed a marked decrease in precision (the
ability to correctly filter out noise) compared to the other algorithms. Used here, the kNN
routine produced too many false positives. Perhaps using an ensemble of kNN classifiers would
have significantly improved results.
It is recognized that kNN is very sensitive to outliers and noise within observations. Jiang and
Zhou, [19] built four kNN classification techniques involving editing and ensemble creation. In
order to manage the error induced by outliers, they developed differing editing routines that
effectively removed the most problematic training data and therefore increased the accuracy of
classification. They also created a fourth neural network ensemble mechanism using the
Bagging technique, which generally performed better than the editing routines. An approach
used by Subbulakshmi et al [20] also used several different classifier types (neural nets and
SVMs) to enhance overall classification. Each of the individual classifiers of the ensemble
possessed different threshold values for activation based on the ensemble member’s accuracy.
They found that the ensemble approach had higher classification rates than any of the individual
underlying classifiers.
2. OUR APPROACH
Our approach begins with the production of an ensemble of kNN classifiers. We chose to use
kNN classifiers because of their ability to adapt to highly nonlinear data, they are a fairly mature
technique, and there are a number of methods available for optimizing instances of kNN
classifiers.
Each instance or object to be classified p is a vector of values for r different attributes so
ൌ ۃଵ, ଶ, … , .ۄ We follow the one-per-class method of decomposing classification
amongst m different classes into a set of binary classification problems [21]. These classifiers
determine class membership using a set כ
ൌ ሼܿଵ, ܿଶ, … , ܿሽ, where classifier ܿ א כ
determines if a given instance is member of the ith
class. Each classifier takes a vector of
attributes for an item to be classified and performs the following function:
ܿሺሻ ൌ ൜
1 ݂݅ ݅ݏ ܽ݊ ݈݁݁݉݁݊ݐ ݂ ݈ܿܽݏݏ ݅
0 ݂݅ ݅ݏ ܱܰܶ ܽ݊ ݈݁݁݉݁݊ݐ ݂ ݈ܿܽݏݏ ݅
This method works best for algorithms such as kNN that produces activation as an output to
determine class membership [22]. Essentially each binary kNN classifier is the analogue of a
classification “stump”, which is a decision tree that makes a single classification of whether or
not a given instance is a member of a specific class. Classifiers which discriminate between all
classes, such as a single model to determine membership, have an error rate determined by the
number of misclassifications from the entire dataset. This is because the classifier is tailored for
and optimized over the collection of m different classes. As a result, the parameters are adjusted
5. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
5
so that the error rate across all classifications is as low as possible without deference to any
particular class.
The subset of variables which leads to the lowest error rates when determining membership in a
specific class are likely to be entirely different from the subset of variables which are most
effective in determining membership in another class. The use of the FSS algorithm allows
each individual binary classifier to tailor itself around the variables it deems most important for
determining membership of an instance. As a result, diversity amongst the kNN classifiers is
achieved deterministically. The necessary diversity is achieved by each individual classifier
selecting the set of variables which are deemed most important for identifying specific class
membership. This is slightly different from the traditional definition of diversity which stresses
errors being made on different instances of data.
Since we use an ensemble of individual kNN classifiers which are responsible for determining
membership in a specific class, each individual classifier can have the parameters for variable
weights adjusted to achieve the highest classification rate for the specific class being analyzed.
When using a single classifier to differentiate between multiple classes, the differences in which
variables are most important to clustering for identification of various classes becomes
overshadowed.
Our approach also differs from previous approaches in that we use the specialized kNN
technique of FSS for each of the binary classifiers in the ensemble. We elected to use FSS since
the final collection of predictor variables selected for classification is usually smaller [6]. This
is especially noticeable in datasets with a large number of variables for subset selection. We
also chose FSS as opposed to BSS since it requires significantly less processor time, especially
given the large amount of processing time which must be devoted if there are many variables.
Furthermore, the models are often substantially simpler. As discussed previously, a successful
ensemble implementation requires diversity between the individual classifiers being used. The
diversity here is achieved through the inclusion of different variables which are selected by the
FSS-kNN algorithm as being the most important towards determining membership in an
instance of a particular class.
By building different classifiers for determining membership in each class, we are choosing the
subset of variables that work best with the kNN algorithm to classify members of the specific
class. This provides the algorithm with greater accuracy than a single implementation of the
kNN algorithm differentiating amongst all classes. Through an individual classifier tailored to
determine membership for a particular class, we allow the isolation of those variables that
contribute the most toward the clustering of the class members. Clearly, the subsets of variables
selected between different binary classifiers will be different. Furthermore, by using a kNN
variant collection which has been optimized, the ensemble itself should have a higher resultant
classification rate. There are many other implementations of the kNN process that we could
have relied on. We believe that the use of any of these others would lead to similar results.
One of the benefits of this method is that it overcomes the curse of dimensionality. For a given
class, there might only be a handful of variables which are critical to classification and could be
completely different from other classes. A classifier differentiating between all m classes would
have to potentially consider all attributes. However, our approach relies on the fact that the
classifier of the ith
class needs only to determine membership through the use of variables that
are most important to determining distance to its nearest neighbors.
In order to combine the respective votes of each member within the ensemble, we have three
cases: one of the individual classifiers identifies membership within the group, no membership
is selected, or there is a conflict regarding classification with two or more classifiers presenting
conflicting classifications. Where classification is straightforward with one classification
emerging from the ensemble, we use that classification. In the latter two cases discussed above,
there must be some way of achieving an output. There are two possible approaches. The first is
to rely on a single overall kNN classifier which determines identification in the event of
6. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
6
conflict. Therefore, if the ensemble is unsuccessful, the classification scheme reverts back to a
single instance (master) classifier. The second approach is to use the classifier with the highest
accuracy which selected the instance for membership. Figure 1 presents an overview of this
process.
A master classifier uses the same methodology but provides for classification between all
possible classes in the dataset as opposed to simply determining membership in a single class.
This master classifier is used to assign classification in the event that none of the members of
the ensemble identifies an item for class membership.
Ensemble of classifiers to
determine membership in
each of the classes
No class
selected for
membership
One class
selected for
membership
More than one
class selected
for membership
Three possible outcomes
from ensemble output
Instance to
be classified
Revert to
master
classifier
Use class
selected
Use classifier
with highest
accuracy rate
Figure 1. Determining class membership of an instance
3. EXPERIMENTAL RESULTS
3.1. Datasets
The datasets which we utilized were from the UCI Machine Learning Repository with the
exception of the IRIS dataset which is available in the R software package [23, 24, 25]. The
statistics regarding each data set are presented in Table 1. We began with the IRIS data since it
is one of the most used datasets in classification problems. Furthermore, it is a straightforward
dataset with four predictors and provided a good benchmark for initial results. We also selected
the Low Resolution Spectrometer (LRS) data since it contained a large number of variables and
the data required no scaling prior to using the algorithm. The dataset itself consists of header
information for each entry, followed by intensity measurements at various spectral lengths.
Finally, the ARRYTHMIA dataset was selected due to the large number of predictor variables
which it offered. We were curious to see how well the FSS-kNN algorithm performed at
reducing the number of variables needed to determine class membership. There were several
instances in the ARRYTHMIA data set where missing data was problematic. These attributes
were removed from the dataset so that classification could continue.
Table 1. Dataset Statistics
Dataset: Iris LRS Arrythmia
Number of classes: 3 10 16
Number of variables: 4 93 263
Number of data points: 150 532 442
7. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
7
3.2. Model Generation
Our methodology follows the pseudocode in Figure 2. We began by building the best
classification model for each class in the dataset. The individual models were constructed using
FSS-kNN to determine the best subset of variables to use for determining membership in each
class. Every subset of variables was then tested using n-fold cross validation, where each
element was predicted using the remaining ݊ െ 1 elements in the kNN model over various k-
values to determine the most accurate models for each class. This required a modest amount of
processor time, but enabled us to use all of the available data for both training and testing which
is one of the benefits of n-fold cross validation. Following the generation of the individual
classifiers, we built the master classifier.
After building our classifiers, we processed the data with the ensemble. The majority of
instances were selected for membership by one of the classifiers. In the event that more than
one classifier categorized the instance as being a member of the class that it represented, we
reverted to the model accuracies of the individual classifiers, and assigned the item to the most
accurate classifier which identified the item for class membership. Instances which were not
selected for membership in a class by any of the individual classifiers were processed by the
master classifier.
We conducted n-fold cross-validation testing to determine the overall accuracy of the ensemble.
The k-value and subset of variables selected for an individual kNN classification model were the
only factors remaining the same between the classifications of instances.
Figure 2. Pseudocode for classifier construction and usage
4. DISCUSSION OF RESULTS
The statistics regarding the accuracy rates and numbers of predictor variables used by the
individual classifiers are presented in Table 2. By using individual classifiers to determine set
inclusion, we were able to achieve high rates of classification. Only predictor variables useful
in the classification of instances of a given class with the kNN algorithm were used in the
models. In the LRS and ARRYTHMIA datasets, the largest model for membership
classification in the ensemble uses only about 5% of the available predictor variables. The
average number of predictor variables used for classification is significantly less than that. In
each of the datasets, there were classification models that needed only one variable with which
to determine membership in a particular class. We believe that the high rates of classification
CONSTRUCTION PHASE:
for each class in the data set
build classifier ci which determines membership in class i using the Forward Subset Selection
Algorithm
compute the accuracy of this classifier
next class
build a master classifier which considers membership amongst all classes
CLASSIFICATION PHASE:
for each item to be classified
the item is evaluated by each classifier so seek membership in respective classes
if only one classifier identified the item for membership
then assign the item to that class
if more than one classifier identified the item for membership
then assign class membership to the most accurate classifier
if no classifiers identified the item for membership
then use the master classifier to assign a classification
next item
8. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
8
for the individual classifiers are closely related to the reduction in the number of dimensions
being used for model construction, thereby overcoming the curse of dimensionality. This has
some rather interesting implications. The first is that this process can be used as a discovery
mechanism for determining the most important variables for use in determining membership in
a specific class. It also implies that accurate models can be constructed using small subsets of
available predictor variables, thereby greatly reducing the number of dimensions in which
classifications are performed.
The results of building the master models that incorporate all of the classes are depicted in Table
2. These represent use of a single model constructed using the FSS-kNN algorithm to
determine classifications of data. Note that the classification accuracy rates of the models that
determine classification between all classes are at or below the minimum accuracy rates of the
individual classifiers that determine membership in a specific class. This is not surprising given
that the master classifier is now attempting to discriminate between all of the various classes.
Note also that the numbers of variables selected by the master models are significantly higher
than the mean number of variables selected in the individual classifiers. These represent the
best accuracy rates available if only one classifier was being constructed using the FSS-kNN
model.
Table 2. FSS-kNN Statistics for Classifiers of the Ensemble
Dataset: Iris LRS Arrythmia
Max Accuracy Achieved: 1.000 0.998 1.000
Mean Accuracy of Classifiers in Ensemble: 0.979 0.985 0.977
Standard Deviation of Classifiers in Ensemble: 0.018 0.015 0.043
Minimum Accuracy of Classifiers in Ensemble: 0.973 0.957 0.827
Maximum Variables Selected by a Classifier: 2 5 13
Mean Number of Variables Selected
by Classifiers in Ensemble: 1.667 2.900 2.750
Standard Deviation of
Number of Variables Selected: 0.577 1.524 3.066
Minimum Variables Selected by a Classifier 1 1 1
Accuracy Rate of the Master Model: 0.973 0.891 0.732
Number of Variables Utilized in Master Model: 2 6 6
When classifying instances, there were three distinct cases that could occur. An instance could
be selected for membership by none, one, or more than one of the ensemble classifiers. Table 3
presents the statistics for various parts of our method.
The accuracy of the master classifier used in cases where no ensemble classifier identified
membership demonstrated a significant degradation in classification accuracy. This probably
represents classification of difficult cases which were largely responsible for the errors in the
master classification. Instances which are not selected for classification by any of the individual
ensemble members are passed to the master classifier. This classifier is hindered by the same
difficulties that individual classifiers face when determining membership of an object in a
specific class. Here though, we are forcing a classification to take place.
9. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
9
Table 3. Ensemble Statistics and Accuracy Rates
Dataset: Iris LRS Arrythmia
Instances Classified by 0 Members
of the Ensemble: 0 (0%) 34 (6.4%) 56 (12.42%)
Accuracy of Master Model in
Determining Class Membership: N/A 0.4705882 0.4464286
Instances Classified by 1 Member
of Ensemble: 145 (96.66%) 483 (90.96%) 375 (83.15%)
Accuracy: 0.9862068 0.9689441 0.8773333
Instances Classified by 2 Members
of the Ensemble: 5 (3.33%) 14 (2.63%) 20 (4.43%)
Accuracy: 0.8 0.7142857 0.8
Instances Classified by 3 Members
of the Ensemble: 0 0 0
Overall Accuracy of Method: 0.98 0.930255 0.820399
Instances classified by only one of the ensemble members comprised the majority of cases in
classification and were characterized by their large degree of accuracy. Instances selected for
membership in a class by two or more of the ensemble members comprised a small minority of
classification cases. By reverting to classifier accuracy to determine the final classification, we
were able to achieve fairly high classification rates, given that blind chance would have resulted
in a 50% accuracy rate. There were no cases in any of our data sets where more than two
classifiers competed for a given instance.
The overall accuracy obtained by the ensemble method presented in this paper is greater than
the single classifier attempting to classify amongst all classes. Consequently, using ensembles
increases accuracy when compared to the case of using a single classifier.
5. CONCLUSIONS AND FUTURE WORK
Our approach has demonstrated that an ensemble of classifiers trained to detect membership in a
given class can achieve high rates of classification. We have shown that we can achieve greater
classification rates by combining a series of classifiers optimized to detect class membership,
than by using single instances of classifiers. Our model is best adapted towards classification
problems involving three or more classes since a two class model can be readily handled by a
single classifier instance.
We have not adjusted the importance of individual variables during the process of constructing
individual classifiers for the ensemble. We have simply included or excluded variables as being
equally weighted without scaling. While variable selection is helpful in addressing some of the
problems outlined, additional improvements can be made to the kNN algorithm by weighting
the variables which have been selected for inclusion into the model to account for differences in
variable importance. Another weakness which needs to be addressed is the consideration of
incomplete datasets.
Future work will focus on developing additional classifiers to distinguish between instances that
are selected for class membership by more than one classifier within the ensemble rather than
reverting to the highest accuracy rate. An element of conditional probability might be of
10. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
10
considerable importance in biomedical classifications. In larger datasets, there could be a
number of cases where discerning membership amongst instances becomes difficult. Often the
determination occurs between two classes which are very similar. In such cases where FSS-
KNN results in classifiers with relatively low rates of classification, it might be necessary to
examine the data to determine whether the class in question is really composed of several
subclasses which would benefit from their own respective binary classifiers within the
ensemble. Finally, there remains the possibility that we can use the predictor variables selected
as most important for clustering by FSS to improve classification rates of other methods such as
neural nets and decision trees.
REFERENCES
[1] Xindong, W., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng,
A., Liu, B., Yu, P. S., Zhou, Z. H., Steinbach, M., Hand, D. J. and Steinberg, D. (2008) Top 10
Algorithms in Data Mining. Knowledge and Information Systems, 14 1: 1-37.
[2] Xu, R., Wunsch, D., “Clustering”, 2009, John Wiley & Sons
[3] Wettschereck, D., Aha, D., & Mohri, T. (1995). A Review and Empirical Evaluation of Feature
Weighting Methods for a Class of Lazy Learning Algorithms. Tech. rept. AIC95-012. Naval
Research Laboratory, Navy Center for Applied Research in Artificial Intelligence, Washington,
D.C.
[4] Hand, D. , Mannila, H., Smyth, P., Principles of Data Mining, 2001
[5] Kotsiantis, S. “Supervised machine learning: a review of classification techniques”, Informatica
Journal (31), 2007, pp.249-268.
[6] Aha, D. W., & Bankert, R. L. (1996). A Comparative Evaluation of Sequential Feature
Selection Algorithms. In D. Fisher & H. H. Lenz (Eds.), Artificial Intelligence and Statistics V.
New York: Springer – Verlag.
[7] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. “Ensemble selection from libraries of
models.” in Proceedings of the International Conference on Machine Learning (ICML), 2004.
[8] G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: A survey and
categorisation. Journal of Information Fusion, 6(1):5–20, 2005.
[9] D. Opitz, R. Maclin, Popular ensemble methods: An empirical study, Journal of Articial
Intelligence Research 11 (1999), pp 169 - 198.
[10] Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin, Introduction to Data Mining, Addison
Wesley, 2006, pp 283-285.
[11] N. El Gayar, An Experimental Study of a Self-Supervised Classifier Ensemble, International
Journal of Information Technology, Vol. 1, No. 1, 2004.
[12] A. Madabhushi, J. Shi, M.D. Feldman, M. Rosen, and J. Tomaszewski, “Comparing Ensembles
of Learners: Detecting Prostate Cancer from High Resolution MRI,” Proc. Second Int'l
Workshop Computer Vision Approaches to Medical Image Analysis (CVAMIA '06), pp. 25-36,
2006.
[13] S. D. Bay. Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. Proc.
17th Intl. Conference on Machine Learning, pp. 37 – 45, Madison, WI, 1998.
[14] C. Domeniconi and B. Yan. Nearest Neighbor Ensemble. In Proceedings of the 17th
International Conference on Pattern Recognition, Cambridge, UK, pages 23–26, 2004.
[15] L. S. Oliveira, M. Morita, R. Sabourin, and F. Bortolozzi. Multi-Objective Genetic Algoritms to
Create Ansemble of Classifiers. Proceedings of the Third International Conference on
Evolutionary Multi-Criterion Optimization, Vol. 87, pp 592-606. 2005.
[16] Fangfei Weng, Quingshan Jiang, Liang Shi, and Nannan Wu., An Intrusion Detection System
Based on the Clustering Ensemble, IEEE International Workshop on 16-18 April 2007, pp. 121-
124. 2007.
11. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.1, No.2, April 2011
11
[17] K. Bharti, S. Jain, S. Shukla, Fuzzy K-mean Clustering Via J48 For Intrusion Detection System.
International Journal of Computer Science and Information Technologies, Vol 1(4), pp 315-318,
2010.
[18] W. A. Awad and S. M. ELseuofi, Machine Learning Methods for Spam E-Mail Classification,
International Journal of Computer Science & Information Technology, Vol 3, No 1, pp 173 –
184. 2011.
[19] Y. Jiang and Z-H. Zhou. Editing Training Data for kNN Classifiers with Neural Network
Ensemble. Lecture Notes in Computer Science 3173, pp 356 – 361, 2004.
[20] T. Subbulakshmi, A. Ramamoorthi, and S. M. Shalinie, Ensemble Design for Intrusion Detection
Systems. International Journal of Computer Science & Information Technology, Vol. 1, No. 1,
August 2009, pp 1 – 9.
[21] Nilsson, Nils J., Learning Machines: Foundations of Trainable Pattern-Classifying Systems.,
McGraw-Hill, 1965
[22] Kong, E. B., & Dietterich, T. G., Error-Correcting Output Coding Corrects Bias and Variance, in
A. Prieditis & S. Russell, eds, Machine Learning: Proceedings of the Twelfth International
Conference, San Francisco, CA: Morgan Kaufmann, pp. 313 – 321. 1995.
[23] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml/datasets/Arrhythmia]. Irvine, CA: University of California, School
of Information and Computer Science.
[24] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml/datasets/Low+Resolution+Spectrometer]. Irvine, CA: University of
California, School of Information and Computer Science.
[25] R: A Language and Environment for Statistical Computing, R Development Core Team, Vienna,
Austria, Version 2.11.1.