Feature selection is an effective method used in text categorization for sorting a set of documents into certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing irredundant terms from the corpus. Genome contains the total amount of genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper a Clustering technique called Hierarchical Techniques is used tocategories the Features from the Genome documents. A framework is proposed for Genomic Feature set Selection. A Filter based Feature Selection Method like
2 statistics, CHIR statistics are used to select the Feature set. The Selected Feature set is verified by using F-measure and it is biologically validated for Biological relevance using the BLAST tool.
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has suppli
ed a large volume of data to many fields. The gene
microarray analysis and classification have demonst
rated an effective way for the effective diagnosis
of
diseases and cancers. In as much as the data achiev
ing from microarray technology is very noisy and al
so
has thousands of features, feature selection plays
an important role in removing irrelevant and redund
ant
features and also reducing computational complexity
. There are two important approaches for gene
selection in microarray data analysis, the filters
and the wrappers. To select a concise subset of inf
ormative
genes, we introduce a hybrid feature selection whic
h combines two approaches. The fact of the matter i
s
that candidate’s features are first selected from t
he original set via several effective filters. The
candidate
feature set is further refined by more accurate wra
ppers. Thus, we can take advantage of both the filt
ers
and wrappers. Experimental results based on 11 micr
oarray datasets show that our mechanism can be
effected with a smaller feature set. Moreover, thes
e feature subsets can be obtained in a reasonable t
ime
Evaluating the efficiency of rule techniques for file classificationeSAT Journals
Abstract Text mining refers to the process of deriving high quality information from text. It is also known as knowledge discovery from text (KDT), deals with the machine supported analysis of text. It is used in various areas such as information retrieval, marketing, information extraction, natural language processing, document similarity, and so on. Document Similarity is one of the important techniques in text mining. In document similarity, the first and foremost step is to classify the files based on their category. In this research work, various classification rule techniques are used to classify the computer files based on their extensions. For example, the extension of computer files may be pdf, doc, ppt, xls, and so on. There are several algorithms for rule classifier such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this research work, three classification algorithms namely decision table, DTNB and OneR classifiers are used for performing classification of computer files based on their extension. The results produced by these algorithms are analyzed by using the performance factors classification accuracy and error rate. From the experimental results, DTNB proves to be more efficient than other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Sequencing projects arising from high throughput technologies including those of sequencing DNA microarrays allowed to simultaneously measure the expression levels of millions of genes of a biological sample as well as annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information,
bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the hypothesis of researchers throughout the experimental cycle. In this context, this article describes and discusses some of techniques used for the functional analysis of gene expression data.
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has suppli
ed a large volume of data to many fields. The gene
microarray analysis and classification have demonst
rated an effective way for the effective diagnosis
of
diseases and cancers. In as much as the data achiev
ing from microarray technology is very noisy and al
so
has thousands of features, feature selection plays
an important role in removing irrelevant and redund
ant
features and also reducing computational complexity
. There are two important approaches for gene
selection in microarray data analysis, the filters
and the wrappers. To select a concise subset of inf
ormative
genes, we introduce a hybrid feature selection whic
h combines two approaches. The fact of the matter i
s
that candidate’s features are first selected from t
he original set via several effective filters. The
candidate
feature set is further refined by more accurate wra
ppers. Thus, we can take advantage of both the filt
ers
and wrappers. Experimental results based on 11 micr
oarray datasets show that our mechanism can be
effected with a smaller feature set. Moreover, thes
e feature subsets can be obtained in a reasonable t
ime
Evaluating the efficiency of rule techniques for file classificationeSAT Journals
Abstract Text mining refers to the process of deriving high quality information from text. It is also known as knowledge discovery from text (KDT), deals with the machine supported analysis of text. It is used in various areas such as information retrieval, marketing, information extraction, natural language processing, document similarity, and so on. Document Similarity is one of the important techniques in text mining. In document similarity, the first and foremost step is to classify the files based on their category. In this research work, various classification rule techniques are used to classify the computer files based on their extensions. For example, the extension of computer files may be pdf, doc, ppt, xls, and so on. There are several algorithms for rule classifier such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this research work, three classification algorithms namely decision table, DTNB and OneR classifiers are used for performing classification of computer files based on their extension. The results produced by these algorithms are analyzed by using the performance factors classification accuracy and error rate. From the experimental results, DTNB proves to be more efficient than other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Sequencing projects arising from high throughput technologies including those of sequencing DNA microarrays allowed to simultaneously measure the expression levels of millions of genes of a biological sample as well as annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information,
bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the hypothesis of researchers throughout the experimental cycle. In this context, this article describes and discusses some of techniques used for the functional analysis of gene expression data.
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...CSCJournals
Subcellular localization is a key functional characteristic of proteins. As an interesting ``bio-image informatics\'\' application, an automatic, reliable and efficient prediction system for protein subcellular localization can be used for establishing knowledge of the spatial distribution of proteins within living cells and permits to screen systems for drug discovery or for early diagnosis of a disease. In this paper, we propose a two-stage multiple classifier system to improve classification reliability by introducing rejection option. The system is built as a cascade of two classifier ensembles. The first ensemble consists of set of binary SVMs which generalizes to learn a general classification rule and the second ensemble, which also include three distinct classifiers, focus on the exceptions rejected by the rule. A new way to induce diversity for the classifier ensembles is proposed by designing classifiers that are based on descriptions of different feature patterns. In addition to the Subcellular Location Features (SLF) generally adopted in earlier researches, three well-known texture feature descriptions have been applied to cell phenotype images, which are the local binary patterns (LBP), Gabor filtering and Gray Level Coocurrence Matrix (GLCM). The different texture feature sets can provide sufficient diversity among base classifiers, which is known as a necessary condition for improvement in ensemble performance. Using the public benchmark 2D HeLa cell images, a high classification accuracy 96% is obtained with rejection rate $21\\%$ from the proposed system by taking advantages of the complementary strengths of feature construction and majority-voting based classifiers\' decision fusions.
Decision Support System for Bat Identification using Random Forest and C5.0TELKOMNIKA JOURNAL
Morphometric and morphological bat identification are a conventional method of identification and
requires precision, significant experience, and encyclopedic knowledge. Morphological features of a
species may sometimes similar to that of another species and this causes several problems for the
beginners working with bat taxonomy. The purpose of the study was to implement and conduct the random
forest and C5.0 algorithm analysis in order to decide characteristics and carry out identification of bat
species. It also aims at developing supporting decision-making system based on the model to find out the
characteristics and identification of the bat species. The study showed that C5.0 algorithm prevailed and
was selected with the mean score of accuracy of 98.98%, while the mean score of accuracy for the
random forest was 97.26%. As many 50 rules were implemented in the DSS to identify common and rare
bat species with morphometric and morphological attributes.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
Sequencing projects arising from high-throughput technologies including those of sequencing DNA microarray allowed measuring simultaneously the expression levels of millions of genes of a biological sample as well as to annotate and to identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the researchers’ hypothesis. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...ijaia
Feature selection and classification task are an essential process in dealing with large data sets that
comprise numerous number of input attributes. There are many search methods and classifiers that have
been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of
attributes and improve the classification accuracy by adopting ensemble rule classifiers method. Research
process involves 2 phases; finding the optimal set of attributes and ensemble classifiers method for
classification task. Results are in terms of percentage of accuracy and number of selected attributes and
rules generated. 6 datasets were used for the experiment. The final output is an optimal set of attributes
with ensemble rule classifiers method. The experimental results conducted on public real dataset
demonstrate that the ensemble rule classifiers methods consistently show improve classification accuracy
on the selected dataset. Significant improvement in accuracy and optimal set of attribute selected is
achieved by adopting ensemble rule classifiers method.
Unsupervised Feature Selection Based on the Distribution of Features Attribut...Waqas Tariq
Since dealing with high dimensional data is computationally complex and sometimes even intractable, recently several feature reductions methods have been developed to reduce the dimensionality of the data in order to simplify the calculation analysis in various applications such as text categorization, signal processing, image retrieval, gene expressions and etc. Among feature reduction techniques, feature selection is one the most popular methods due to the preservation of the original features. However, most of the current feature selection methods do not have a good performance when fed on imbalanced data sets which are pervasive in real world applications. In this paper, we propose a new unsupervised feature selection method attributed to imbalanced data sets, which will remove redundant features from the original feature space based on the distribution of features. To show the effectiveness of the proposed method, popular feature selection methods have been implemented and compared. Experimental results on the several imbalanced data sets, derived from UCI repository database, illustrate the effectiveness of our proposed methods in comparison with the other compared methods in terms of both accuracy and the number of selected features.
Feature selection for multiple water quality status: integrated bootstrapping...IJECEIAES
STORET is one method to determine the river water quality, and to classify them into four classes (very good, good, medium and bad) based on the data of water for each attribute or feature. The success of the formation of pattern recognition model much depends on the quality of data. There are two issues as the concern of this research as follows, the data having disproportionate amount among the classes (imbalance class) and the finding of noise on its attribute. Therefore, this research integrates the SMOTE Technique and bootstrapping to handle the problem of imbalance class. While an experiment is conducted to eliminate the noise on the attribute by using some feature selection algorithms with filter approach (information gain, rule, derivation, correlation and chi square). This research has some stages as follows: data understanding, pre-processing, imbalance class, feature selection, classification and performance evaluation. Based on the result of testing using 10-fold cross validation, it shows that the use of the SMOTE-bootstrapping technique is able to increase the accuracy from 83.3% to be 98.8%. While the process of noise elimination onthe data attribute is also able to increase the accuracy to be 99.5% (the use of feature subset produced by the information gain algorithm and the decision tree classification algorithm).
Predicting students' performance using id3 and c4.5 classification algorithmsIJDKP
An educational institution needs to have an approximate prior knowledge of enrolled students to predict
their performance in future academics. This helps them to identify promising students and also provides
them an opportunity to pay attention to and improve those who would probably get lower grades. As a
solution, we have developed a system which can predict the performance of students from their previous
performances using concepts of data mining techniques under Classification. We have analyzed the data
set containing information about students, such as gender, marks scored in the board examinations of
classes X and XII, marks and rank in entrance examinations and results in first year of the previous batch
of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms on this data,
we have predicted the general and individual performance of freshly admitted students in future
examinations.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Weighted Ensemble Classifier for Plant Leaf IdentificationTELKOMNIKA JOURNAL
Plant leaf identification using image can be constructed by ensemble classifier. Ensemble
classifier executes classification of various features independently. This experiment utilized texture feature
and geometry feature of plant leaf to find out which features are more powerful. Each classifier trained by
specific feature produced different accuracy rate. To integrate ensemble classifier the results of the
classification were weighted, so as the score obtained from better features contributed greater to the final
results. Weighted classification results were combined to get the final result. The proposed method was
evaluated using dataset comprises of 156 variety of plants with 4559 images. Weighting and combining
classifier used in this study were Weighted Majority Vote (WMV) and Naïve Bayes Combination. Both of
those method result showed better accuracy than using single classifier. The average accuracy of single
classifier was 61.2% for geometry classifier and 70.3% for texture classifier, while WMV method was
77.8% and Naïve Bayes Comb ination was 94.6%. The calculation of classifier’s weight b y using WMV
method produces a weight value of 0.54 for texture feature classifier and 0.46 for geometry feature
classifier.
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...ijsc
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from
large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an
established, proven structure from a voluminous collection of facts. A dominant area of modern-day
research in the field of medical investigations includes disease prediction and malady categorization. In
this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering
techniques and compare the performance of classification algorithms on the clinical data. Feature
selection is a supervised method that attempts to select a subset of the predictor features based on the
information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with
the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms
in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen
classification algorithms on the Lymphography dataset that enables the classifier to accurately perform
multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and
the Quinlan’s C4.5 algorithm give 100 percent classification accuracy with all the predictor features and
also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated
here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm
offers increased clustering accuracy in less computation time.
TWO LEVEL SELF-SUPERVISED RELATION EXTRACTION FROM MEDLINE USING UMLSIJDKP
The biomedical research literature is one among many other domains that hides a precious knowledge, and
the biomedical community made an extensive use of this scientific literature to discover the facts of
biomedical entities, such as disease, drugs,etc.MEDLINE is a huge database of biomedical research
papers which remain a significantly underutilized source of biological information. Discovering the useful
knowledge from such huge corpus leads to various problems related to the type of information such as the
concepts related to the domain of texts and the semantic relationship associated with them. In this paper,
we propose a Two-level model for Self-supervised relation extraction from MEDLINE using Unified
Medical Language System (UMLS) Knowledge base. The model uses a Self-supervised Approach for
Relation Extraction (RE) by constructing enhanced training examples using information from UMLS. The
model shows a better result in comparison with current state of the art and naïve approaches
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUEIJDKP
A high prediction accuracy of the students’ performance is more helpful to identify the low performance students at the beginning of the learning process. Data mining is used to attain this objective. Data mining techniques are used to discover models or patterns of data, and it is much helpful in the decision-making.Boosting technique is the most popular techniques for constructing ensembles of classifier to improve the classification accuracy. Adaptive Boosting (AdaBoost) is a generation of boosting algorithm. It is used for
the binary classification and not applicable to multiclass classification directly. SAMME boosting
technique extends AdaBoost to a multiclass classification without reduce it to a set of sub-binaryclassification.In this paper, students’ performance prediction system usingMulti Agent Data Mining is proposed to predict the performance of the students based on their data with high prediction accuracy and provide helpto the low students by optimization rules.The proposed system has been implemented and evaluated by investigate the prediction accuracy ofAdaboost.M1 and LogitBoost ensemble classifiers methods and with C4.5 single classifier method. The results show that using SAMME Boosting technique improves the prediction accuracy and outperformed
C4.5 single classifier and LogitBoost.
Survey on semi supervised classification methods and feature selectioneSAT Journals
Abstract Data mining also called knowledge discovery is a process of analyzing data from several perspective and summarize it into useful information. It has tremendous application in the area of classification like pattern recognition, discovering several disease type, analysis of medical image, recognizing speech, for identifying biometric, drug discovery etc. This is a survey based on several semisupervised classification method used by classifiers , in this both labeled and unlabeled data can be used for classification purpose.It is less expensive than other classification methods . Different techniques surveyed in this paper are low density separation approach, transductive SVM, semi-supervised based logistic discriminate procedure, self training nearest neighbour rule using cut edges, self training nearest neighbour rule using cut edges. Along with classification methods a review about various feature selection methods is also mentioned in this paper. Feature selection is performed to reduce the dimension of large dataset. After reducing attribute the data is given for classification hence the accuracy and performance of classification system can be improved .Several feature selection method include consistency based feature selection, fuzzy entropy measure feature selection with similarity classifier, Signal to noise ratio,Positive approximation. So each method has several benefits. Index Terms: Semisupervised classification, Transductive support vector machine, Feature selection, unlabeled samples
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the
availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection
based ensemble learning models is to classify the high dimensional data with high computational efficiency
and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray
datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILEScscpconf
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTIJDKP
The development in Information Technology (IT) has encouraged the use of Igbo Language in text creation, online news reporting, online searching and articles publications. As the information stored in text format of this language is increasing, there is need for an intelligent text-based system for proper management of the data. The selection of optimal set of features for processing plays vital roles in text-based system. This paper analyzed the structure of Igbo text and designed an efficient feature selection model for an intelligent Igbo text-based system. It adopted Mean TF-IDF measure to select most relevant features on Igbo text documents represented with two word-based n-gram text representation (unigram and bigram) models. The model is designed with Object-Oriented Methodology and implemented with Python programming language with tools from Natural Language Toolkits (NLTK). The result shows that bigram represented text gives more relevant features based on the language semantics.
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has supplied a large volume of data to many fields. The gene microarray analysis and classification have demonstrated an effective way for the effective diagnosis of diseases and cancers. In as much as the data achieving from microarray technology is very noisy and also has thousands of features, feature selection plays an important role in removing irrelevant and redundant features and also reducing computational complexity. There are two important approaches for gene selection in microarray data analysis, the filters and the wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection which combines two approaches. The fact of the matter is that candidate’s features are first selected from the original set via several effective filters. The candidate feature set is further refined by more accurate wrappers. Thus, we can take advantage of both the filters and wrappers. Experimental results based on 11 microarray datasets show that our mechanism can be effected with a smaller feature set. Moreover, these feature subsets can be obtained in a reasonable time.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
-In this study a comprehensive evaluation of two
supervised feature selection methods for dimensionality
reduction is performed - Latent Semantic Indexing (LSI) and
Principal Component Analysis (PCA). This is gauged against
unsupervised techniques like fuzzy feature clustering using
hard fuzzy C-means (FCM) . The main objective of the study is
to estimate the relative efficiency of two supervised techniques
against unsupervised fuzzy techniques while reducing the
feature space. It is found that clustering using FCM leads to
better accuracy in classifying documents in the face of
evolutionary algorithms like LSI and PCA. Results show that
the clustering of features improves the accuracy of document
classification
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...CSCJournals
Subcellular localization is a key functional characteristic of proteins. As an interesting ``bio-image informatics\'\' application, an automatic, reliable and efficient prediction system for protein subcellular localization can be used for establishing knowledge of the spatial distribution of proteins within living cells and permits to screen systems for drug discovery or for early diagnosis of a disease. In this paper, we propose a two-stage multiple classifier system to improve classification reliability by introducing rejection option. The system is built as a cascade of two classifier ensembles. The first ensemble consists of set of binary SVMs which generalizes to learn a general classification rule and the second ensemble, which also include three distinct classifiers, focus on the exceptions rejected by the rule. A new way to induce diversity for the classifier ensembles is proposed by designing classifiers that are based on descriptions of different feature patterns. In addition to the Subcellular Location Features (SLF) generally adopted in earlier researches, three well-known texture feature descriptions have been applied to cell phenotype images, which are the local binary patterns (LBP), Gabor filtering and Gray Level Coocurrence Matrix (GLCM). The different texture feature sets can provide sufficient diversity among base classifiers, which is known as a necessary condition for improvement in ensemble performance. Using the public benchmark 2D HeLa cell images, a high classification accuracy 96% is obtained with rejection rate $21\\%$ from the proposed system by taking advantages of the complementary strengths of feature construction and majority-voting based classifiers\' decision fusions.
Decision Support System for Bat Identification using Random Forest and C5.0TELKOMNIKA JOURNAL
Morphometric and morphological bat identification are a conventional method of identification and
requires precision, significant experience, and encyclopedic knowledge. Morphological features of a
species may sometimes similar to that of another species and this causes several problems for the
beginners working with bat taxonomy. The purpose of the study was to implement and conduct the random
forest and C5.0 algorithm analysis in order to decide characteristics and carry out identification of bat
species. It also aims at developing supporting decision-making system based on the model to find out the
characteristics and identification of the bat species. The study showed that C5.0 algorithm prevailed and
was selected with the mean score of accuracy of 98.98%, while the mean score of accuracy for the
random forest was 97.26%. As many 50 rules were implemented in the DSS to identify common and rare
bat species with morphometric and morphological attributes.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
Sequencing projects arising from high-throughput technologies including those of sequencing DNA microarray allowed measuring simultaneously the expression levels of millions of genes of a biological sample as well as to annotate and to identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the researchers’ hypothesis. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...ijaia
Feature selection and classification task are an essential process in dealing with large data sets that
comprise numerous number of input attributes. There are many search methods and classifiers that have
been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of
attributes and improve the classification accuracy by adopting ensemble rule classifiers method. Research
process involves 2 phases; finding the optimal set of attributes and ensemble classifiers method for
classification task. Results are in terms of percentage of accuracy and number of selected attributes and
rules generated. 6 datasets were used for the experiment. The final output is an optimal set of attributes
with ensemble rule classifiers method. The experimental results conducted on public real dataset
demonstrate that the ensemble rule classifiers methods consistently show improve classification accuracy
on the selected dataset. Significant improvement in accuracy and optimal set of attribute selected is
achieved by adopting ensemble rule classifiers method.
Unsupervised Feature Selection Based on the Distribution of Features Attribut...Waqas Tariq
Since dealing with high dimensional data is computationally complex and sometimes even intractable, recently several feature reductions methods have been developed to reduce the dimensionality of the data in order to simplify the calculation analysis in various applications such as text categorization, signal processing, image retrieval, gene expressions and etc. Among feature reduction techniques, feature selection is one the most popular methods due to the preservation of the original features. However, most of the current feature selection methods do not have a good performance when fed on imbalanced data sets which are pervasive in real world applications. In this paper, we propose a new unsupervised feature selection method attributed to imbalanced data sets, which will remove redundant features from the original feature space based on the distribution of features. To show the effectiveness of the proposed method, popular feature selection methods have been implemented and compared. Experimental results on the several imbalanced data sets, derived from UCI repository database, illustrate the effectiveness of our proposed methods in comparison with the other compared methods in terms of both accuracy and the number of selected features.
Feature selection for multiple water quality status: integrated bootstrapping...IJECEIAES
STORET is one method to determine the river water quality, and to classify them into four classes (very good, good, medium and bad) based on the data of water for each attribute or feature. The success of the formation of pattern recognition model much depends on the quality of data. There are two issues as the concern of this research as follows, the data having disproportionate amount among the classes (imbalance class) and the finding of noise on its attribute. Therefore, this research integrates the SMOTE Technique and bootstrapping to handle the problem of imbalance class. While an experiment is conducted to eliminate the noise on the attribute by using some feature selection algorithms with filter approach (information gain, rule, derivation, correlation and chi square). This research has some stages as follows: data understanding, pre-processing, imbalance class, feature selection, classification and performance evaluation. Based on the result of testing using 10-fold cross validation, it shows that the use of the SMOTE-bootstrapping technique is able to increase the accuracy from 83.3% to be 98.8%. While the process of noise elimination onthe data attribute is also able to increase the accuracy to be 99.5% (the use of feature subset produced by the information gain algorithm and the decision tree classification algorithm).
Predicting students' performance using id3 and c4.5 classification algorithmsIJDKP
An educational institution needs to have an approximate prior knowledge of enrolled students to predict
their performance in future academics. This helps them to identify promising students and also provides
them an opportunity to pay attention to and improve those who would probably get lower grades. As a
solution, we have developed a system which can predict the performance of students from their previous
performances using concepts of data mining techniques under Classification. We have analyzed the data
set containing information about students, such as gender, marks scored in the board examinations of
classes X and XII, marks and rank in entrance examinations and results in first year of the previous batch
of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms on this data,
we have predicted the general and individual performance of freshly admitted students in future
examinations.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Weighted Ensemble Classifier for Plant Leaf IdentificationTELKOMNIKA JOURNAL
Plant leaf identification using image can be constructed by ensemble classifier. Ensemble
classifier executes classification of various features independently. This experiment utilized texture feature
and geometry feature of plant leaf to find out which features are more powerful. Each classifier trained by
specific feature produced different accuracy rate. To integrate ensemble classifier the results of the
classification were weighted, so as the score obtained from better features contributed greater to the final
results. Weighted classification results were combined to get the final result. The proposed method was
evaluated using dataset comprises of 156 variety of plants with 4559 images. Weighting and combining
classifier used in this study were Weighted Majority Vote (WMV) and Naïve Bayes Combination. Both of
those method result showed better accuracy than using single classifier. The average accuracy of single
classifier was 61.2% for geometry classifier and 70.3% for texture classifier, while WMV method was
77.8% and Naïve Bayes Comb ination was 94.6%. The calculation of classifier’s weight b y using WMV
method produces a weight value of 0.54 for texture feature classifier and 0.46 for geometry feature
classifier.
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...ijsc
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from
large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an
established, proven structure from a voluminous collection of facts. A dominant area of modern-day
research in the field of medical investigations includes disease prediction and malady categorization. In
this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering
techniques and compare the performance of classification algorithms on the clinical data. Feature
selection is a supervised method that attempts to select a subset of the predictor features based on the
information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with
the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms
in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen
classification algorithms on the Lymphography dataset that enables the classifier to accurately perform
multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and
the Quinlan’s C4.5 algorithm give 100 percent classification accuracy with all the predictor features and
also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated
here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm
offers increased clustering accuracy in less computation time.
TWO LEVEL SELF-SUPERVISED RELATION EXTRACTION FROM MEDLINE USING UMLSIJDKP
The biomedical research literature is one among many other domains that hides a precious knowledge, and
the biomedical community made an extensive use of this scientific literature to discover the facts of
biomedical entities, such as disease, drugs,etc.MEDLINE is a huge database of biomedical research
papers which remain a significantly underutilized source of biological information. Discovering the useful
knowledge from such huge corpus leads to various problems related to the type of information such as the
concepts related to the domain of texts and the semantic relationship associated with them. In this paper,
we propose a Two-level model for Self-supervised relation extraction from MEDLINE using Unified
Medical Language System (UMLS) Knowledge base. The model uses a Self-supervised Approach for
Relation Extraction (RE) by constructing enhanced training examples using information from UMLS. The
model shows a better result in comparison with current state of the art and naïve approaches
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUEIJDKP
A high prediction accuracy of the students’ performance is more helpful to identify the low performance students at the beginning of the learning process. Data mining is used to attain this objective. Data mining techniques are used to discover models or patterns of data, and it is much helpful in the decision-making.Boosting technique is the most popular techniques for constructing ensembles of classifier to improve the classification accuracy. Adaptive Boosting (AdaBoost) is a generation of boosting algorithm. It is used for
the binary classification and not applicable to multiclass classification directly. SAMME boosting
technique extends AdaBoost to a multiclass classification without reduce it to a set of sub-binaryclassification.In this paper, students’ performance prediction system usingMulti Agent Data Mining is proposed to predict the performance of the students based on their data with high prediction accuracy and provide helpto the low students by optimization rules.The proposed system has been implemented and evaluated by investigate the prediction accuracy ofAdaboost.M1 and LogitBoost ensemble classifiers methods and with C4.5 single classifier method. The results show that using SAMME Boosting technique improves the prediction accuracy and outperformed
C4.5 single classifier and LogitBoost.
Survey on semi supervised classification methods and feature selectioneSAT Journals
Abstract Data mining also called knowledge discovery is a process of analyzing data from several perspective and summarize it into useful information. It has tremendous application in the area of classification like pattern recognition, discovering several disease type, analysis of medical image, recognizing speech, for identifying biometric, drug discovery etc. This is a survey based on several semisupervised classification method used by classifiers , in this both labeled and unlabeled data can be used for classification purpose.It is less expensive than other classification methods . Different techniques surveyed in this paper are low density separation approach, transductive SVM, semi-supervised based logistic discriminate procedure, self training nearest neighbour rule using cut edges, self training nearest neighbour rule using cut edges. Along with classification methods a review about various feature selection methods is also mentioned in this paper. Feature selection is performed to reduce the dimension of large dataset. After reducing attribute the data is given for classification hence the accuracy and performance of classification system can be improved .Several feature selection method include consistency based feature selection, fuzzy entropy measure feature selection with similarity classifier, Signal to noise ratio,Positive approximation. So each method has several benefits. Index Terms: Semisupervised classification, Transductive support vector machine, Feature selection, unlabeled samples
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the
availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection
based ensemble learning models is to classify the high dimensional data with high computational efficiency
and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray
datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILEScscpconf
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTIJDKP
The development in Information Technology (IT) has encouraged the use of Igbo Language in text creation, online news reporting, online searching and articles publications. As the information stored in text format of this language is increasing, there is need for an intelligent text-based system for proper management of the data. The selection of optimal set of features for processing plays vital roles in text-based system. This paper analyzed the structure of Igbo text and designed an efficient feature selection model for an intelligent Igbo text-based system. It adopted Mean TF-IDF measure to select most relevant features on Igbo text documents represented with two word-based n-gram text representation (unigram and bigram) models. The model is designed with Object-Oriented Methodology and implemented with Python programming language with tools from Natural Language Toolkits (NLTK). The result shows that bigram represented text gives more relevant features based on the language semantics.
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has supplied a large volume of data to many fields. The gene microarray analysis and classification have demonstrated an effective way for the effective diagnosis of diseases and cancers. In as much as the data achieving from microarray technology is very noisy and also has thousands of features, feature selection plays an important role in removing irrelevant and redundant features and also reducing computational complexity. There are two important approaches for gene selection in microarray data analysis, the filters and the wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection which combines two approaches. The fact of the matter is that candidate’s features are first selected from the original set via several effective filters. The candidate feature set is further refined by more accurate wrappers. Thus, we can take advantage of both the filters and wrappers. Experimental results based on 11 microarray datasets show that our mechanism can be effected with a smaller feature set. Moreover, these feature subsets can be obtained in a reasonable time.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
-In this study a comprehensive evaluation of two
supervised feature selection methods for dimensionality
reduction is performed - Latent Semantic Indexing (LSI) and
Principal Component Analysis (PCA). This is gauged against
unsupervised techniques like fuzzy feature clustering using
hard fuzzy C-means (FCM) . The main objective of the study is
to estimate the relative efficiency of two supervised techniques
against unsupervised fuzzy techniques while reducing the
feature space. It is found that clustering using FCM leads to
better accuracy in classifying documents in the face of
evolutionary algorithms like LSI and PCA. Results show that
the clustering of features improves the accuracy of document
classification
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAIJCI JOURNAL
This paper proposes a new method that intends on reducing the size of high dimensional dataset by
identifying and removing irrelevant and redundant features. Dataset reduction is important in the case of
machine learning and data mining. The measure of dependence is used to evaluate the relationship
between feature and target concept and or between features for irrelevant and redundant feature removal.
The proposed work initially removes all the irrelevant features and then a minimum spanning tree of
relevant features is constructed using Prim’s algorithm. Splitting the minimum spanning tree based on the
dependency between features leads to the generation of forests. A representative feature from each of the
forests is taken to form the final feature subset
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text Document Clustering is one of the fastest growing research areas because of availability of huge amount of information in an electronic form. There are several number of techniques launched for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search in effectively navigating, summarizing, and organizing information. A global optimal solution can be obtained by applying high-speed and high-quality optimization algorithms. The optimization technique performs a globalized search in the entire solution space. In this paper, a brief survey on optimization approaches to text document clustering is turned out.
Automatic Feature Subset Selection using Genetic Algorithm for Clusteringidescitation
Feature subset selection is a process of selecting a
subset of minimal, relevant features and is a pre processing
technique for a wide variety of applications. High dimensional
data clustering is a challenging task in data mining. Reduced
set of features helps to make the patterns easier to understand.
Reduced set of features are more significant if they are
application specific. Almost all existing feature subset
selection algorithms are not automatic and are not application
specific. This paper made an attempt to find the feature subset
for optimal clusters while clustering. The proposed Automatic
Feature Subset Selection using Genetic Algorithm (AFSGA)
identifies the required features automatically and reduces
the computational cost in determining good clusters. The
performance of AFSGA is tested using public and synthetic
datasets with varying dimensionality. Experimental results
have shown the improved efficacy of the algorithm with optimal
clusters and computational cost.
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...ijsc
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an established, proven structure from a voluminous collection of facts. A dominant area of modern-day research in the field of medical investigations includes disease prediction and malady categorization. In this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering techniques and compare the performance of classification algorithms on the clinical data. Feature selection is a supervised method that attempts to select a subset of the predictor features based on the information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen classification algorithms on the Lymphography dataset that enables the classifier to accurately perform multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and the Quinlan’s C4.5 algorithm give 100 percent classification accuracy with all the predictor features and also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm offers increased clustering accuracy in less computation time.
Classification problems specified in high dimensional data with smallnumber of observation are generally becoming common in specific microarray data. In the time of last two periods of years, manyefficient classification standard models and also Feature Selection (FS) algorithm which isalso referred as FS technique have basically been proposed for higher prediction accuracies. Although, the outcome of FS algorithm related to predicting accuracy is going to be unstable over the variations in considered trainingset, in high dimensional data. In this paperwe present a latest evaluation measure Q-statistic that includes the stability of the selected feature subset in inclusion to prediction accuracy. Then we are going to propose the standard Booster of a FS algorithm that boosts the basic value of the preferred Q-statistic of the algorithm applied. Therefore study on synthetic data and 14 microarray data sets shows that Booster boosts not only the value of Q-statistics but also the prediction accuracy of the algorithm applied.
Effective Feature Selection for Feature Possessing Group Structurerahulmonikasharma
Feature selection has become an interesting research topic in recent years. It is an effective method to tackle the data with high dimension. The underlying structure has been ignored by the previous feature selection method and it determines the feature individually. Considering this we focus on the problem where feature possess some group structure. To solve this problem we present group feature selection method at group level to execute feature selection. Its objective is to execute the feature selection in within the group and between the group of features that select discriminative features and remove redundant features to obtain optimal subset. We demonstrate our method on data sets and perform the task to achieve classification accuracy.
Classification of Microarray Gene Expression Data by Gene Combinations using ...IJCSEA Journal
Feature selection has attracted a huge amount of interest in both research and application communities of data mining. Among the large amount of genes presented in gene expression data, only a small fraction of them is effective for performing a certain diagnostic test. Hence, one of the major tasks with the gene expression data is to find groups of co regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify gene combinations belonging to its relevant subtype by using fuzzy logic. The genes are ranked based on their statistical scores and highly informative genes are filtered. Such genes are fuzzified to identify 2-gene and 3-gene combinations and the intermediate value for each gene is calculated to select top gene combinations to further classify gene lymphoma subtypes by using fuzzy rules. Finally the accuracy of top gene combinations is compared with clustering results. The classification is done using the gene combinations and it is analyzed to predict the accuracy of the results. The work is implemented using java language.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
Attribute Selection is an important topic in Data Mining, because it is the effective way for reducing dimensionality, removing irrelevant data, removing redundant data, & increasing accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces compatible results as the original entire set of attribute. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group called a cluster are more similar in some sense or another to each other than to those in other groups (Clusters). There are various approaches & techniques for attribute subset selection namely Wrapper approach, Filter Approach, Relief Algorithm, Distributional clustering etc. But each of one having some disadvantages like unable to handle large volumes of data, computational complexity, accuracy is not guaranteed, difficult to evaluate and redundancy detection etc. To get the upper hand on some of these issues in attribute selection this paper proposes a technique that aims to design an effective clustering based attribute selection method for high dimensional data. Initially, attributes are divided into clusters by using graph-based clustering method like minimum spanning tree (MST). In the second step, the most representative attribute that is strongly related to target classes is selected from each cluster to form a subset of attributes. The purpose is to increase the level of accuracy, reduce dimensionality; shorter training time and improves generalization by reducing over fitting.
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to
analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection based ensemble learning models is to classify the high dimensional data with high computational efficiency and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
Large amount of data have been stored and manipulated using various database
technologies. Processing all the attributes for the particular means is the difficult task. To avoid such
difficulties, feature selection process is processed.In this paper,we are collect a eight various benchmark
datasets from UCI repository.Feature selection process is carried out using fuzzy entropy based
relevance measure algorithm and follows three selection strategies like Mean selection strategy,Half
selection strategy and Neural network for threshold selection strategy. After the features are selected,
they are evaluated using Radial Basis Function (RBF) network,Stacking,Bagging,AdaBoostM1 and Antminer
classification methodologies.The test results depicts that Neural network for threshold selection
strategy works well in selecting features and Ant-miner methodology works best in bringing out better
accuracy with selected feature than processing with original dataset.The obtained result of this
experiment shows that clearly the Ant-miner is superiority than other classifiers.Thus, this proposed Antminer
algorithm could be a more suitable method for producing good results with fewer features than
the original datasets.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance. It is combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization is also obtained.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...IJERA Editor
This paper presents a new approach to select reduced number of features in databases. Every database has a
given number of features but it is observed that some of these features can be redundant and can be harmful as
well as confuse the process of classification. The proposed method first applies a binary coded genetic algorithm
to select a small subset of features. The importance of these features is judged by applying Naïve Bayes (NB)
method of classification. The best reduced subset of features which has high classification accuracy on given
databases is adopted. The classification accuracy obtained by proposed method is compared with that reported
recently in publications on eight databases. It is noted that proposed method performs satisfactory on these
databases and achieves higher classification accuracy but with smaller number of feature
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Forklift Classes Overview by Intella PartsIntella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
Event Management System Vb Net Project Report.pdfKamal Acharya
In present era, the scopes of information technology growing with a very fast .We do not see any are untouched from this industry. The scope of information technology has become wider includes: Business and industry. Household Business, Communication, Education, Entertainment, Science, Medicine, Engineering, Distance Learning, Weather Forecasting. Carrier Searching and so on.
My project named “Event Management System” is software that store and maintained all events coordinated in college. It also helpful to print related reports. My project will help to record the events coordinated by faculties with their Name, Event subject, date & details in an efficient & effective ways.
In my system we have to make a system by which a user can record all events coordinated by a particular faculty. In our proposed system some more featured are added which differs it from the existing system such as security.
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
1. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
DOI : 10.5121/ijcsea.2012.2104 31
Filter Based Approach for Genomic Feature Set
Selection (FBA-GFS)
V.Bhuvaneswari1
and K.Poongodi2
1
Assistant Professor, Department of Computer Application, Bharathiar University,
Coimbatore, India
bhuvanes_v@yahoo.com
1
M.Phil Research Scholar,, Department of Computer Application, Bharathiar University,
Coimbatore, India
poongodikmca@gmail.com
Abstract
Feature selection is an effective method used in text categorization for sorting a set of documents into
certain number of predefined categories. It is an important method for improving the efficiency and
accuracy of text categorization algorithms by removing irredundant terms from the corpus. Genome
contains the total amount of genetic information in the chromosomes of an organism, including its genes
and DNA sequences. In this paper a Clustering technique called Hierarchical Techniques is used to
categories the Features from the Genome documents. A framework is proposed for Genomic Feature set
Selection. A Filter based Feature Selection Method like ݔ 2
statistics, CHIR statistics are used to select the
Feature set. The Selected Feature set is verified by using F-measure and it is biologically validated for
Biological relevance using the BLAST tool.
Keywords:
Feature selection, Clustering, Genome
1. INTRODUCTION
Data Mining is also known as Knowledge Discovery in Database (KDD). Knowledge Discovery
in Databases (KDD) is the non-trivial process of identifying valid, potentially useful and
ultimately understandable patterns in data. It is the process of extracting novel information and
knowledge from large databases. This process consists of many interacting stages performing
specific data manipulation and transformation operations with an information flow from one stage
onto the next [4].
Document clustering is an automatic grouping of text documents into clusters so that documents
within a cluster have high similarity in comparison to one another, but are dissimilar to
documents in other clusters.Hierarchical techniques produce a nested sequence of partitions, with
2. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
32
a single, all inclusive cluster at the top and singleton clusters of individual points at the bottom.
Each intermediate level can be viewed as combining two clusters from the next lower level (or
splitting a cluster from the next higher level). The result of a hierarchical clustering algorithm can
be graphically displayed as tree, called a dendogram. This tree graphically displays the merging
process and the intermediate clusters. The dendogram at the right shows how four points can be
merged into a single cluster. For document clustering, this dendogram provides a taxonomy, or
hierarchical index.
The task of feature selection is generally divided into two aspects eliminating irrelevant features
and redundant ones. Irrelevant features usually disturb the learner and degrade the accuracy,
while redundant features add to computational cost without bringing in new information. Feature
Selection techniques can be divided into three main approaches as filter, wrapper and embedded
approaches. In embedded approaches, where feature selection is part of the classification
algorithm, i.e. decision tree. In Filter approaches, the features are selected before the
classification algorithm and in the Wrapper approaches the classification algorithm is used as a
black box to find the best subset of attributes [14].
In filter based approach, the features are selected according to data intrinsic values, such as
information dependency or consistency measures. In bioinformatics, datasets are often very large.
Therefore, the filter approach is mostly used to select the features. The advantage of this approach
is that we can use any classifier to evaluate the accuracy of the test set. In the wrapper approach,
we have to use a different classifier for the test set and training set, results may be negatively
affected. In the proposed work, the Filter based Feature Selection approach is studied and
analyzed for selecting features from genomic databases for clustering.
The paper is organized as follows. Section 2 provides the literature study of the various Feature
selection methods for Bio- logical database. Section 3 explores the methodology for Genomic
Feature set Selection (GFS). In section 4 the implemented results are verified and validated. The
final section draws the conclusion of the paper.
2. REVIEW OF LITERATURE
Feature selection methods have been successfully applied to text categorization but seldom
applied to text clustering due to the unavailability of class label information [13].The goal of
feature selection for unsupervised learning is to find the smallest feature subset that best uncovers
clusters form data according to the preferred criterion [1].Feature selection techniques do not alter
the original representation of the variables, but merely select a subset of them [14].
In[3] , Bassam Al-Salemi (2011) Used Feature Selection techniques such as Mutual Information
(MI), Chi-Square Statistic (CHI), Information Gain (IG), GSS Coefficient (GSS) and Odds Ratio
(OR) to reduce the dimensionality of feature space by eliminating the features that are considered
irrelevant for the category [3].
Feature selection can help improve the efficiency of training. In information retrieval, especially
in web search, usually the data size is very large and thus training of ranking models is
computationally costly[15]. In [2] , Balaji Krishnapuram et al., (2004) have adopted a Bayesian
approach for both an optimal nonlinear classifier and a subset of predictor variables (or features)
3. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
33
that are most relevant to the classification task. The Bayesian approach uses heavy-tailed priors to
promote sparsity in the utilization of both basis functions and features.
Feng Tan et al.,(2006) have proposed a hybrid approach to combine useful outcomes from
different feature selection methods through a genetic algorithm. Their experimental results
demonstrate that the approach can achieve better classification accuracy with a smaller gene
subset than each individual feature selection algorithm [6].
In[9],Gouchol Pok etal., (2010) have done a effective feature selection framework suitable for
two-dimensional microarray data. The correlation-based on parametric approach allows compact
representation of class-specific properties with a small number of genes. In[11],Lin Sun et
al(2011) have discussed a new rough entropy to measure the roughness of knowledge and its
properties were proposed in decision information system, and concluded that rough entropy
decreased monotonously as the information granularities became finer was obtained.
In[15],Yiming Yang et al., (1992) have reported a controlled study with statistical significance
test on five text categorization methods : Support Vector Machine(SVM), a K-Nearest Neighbor
(KNN) Classifier,a Neural Network(NNet) approach, the Linear Least Square Fit(LLSF) mapping
and a Naïve-Bayes(NB) Classifier.
In[8], George H.John et al., (1994) have described a method for feature subset selection using
cross-validation that is applicable to any induction algorithm. Discussed the experiments
conducted with ID3 and C4.5 on artificial and real datasets. Pabitra Mitra et al., (2002) have
described an unsupervised feature selection algorithm suitable for data sets, large in both
dimension and size. The method is based on measuring similarity between features whereby
redundancy therein is removed [12]. Spectral feature selection identifies relevant features by
measuring their capability of preserving sample similarity [21].
3. METHODOLOGY
The Genomic sequence data are stored in public databases like NCBI, Uniport in various formats.
The information related to sequence is represented as attributes. Finding the important attributes
for comparing the genomic sequence data based on annotation, becomes the challenging task.
Feature selection methods can be used to analyze and study the best features used for representing
sequence information for association and clustering of documents using supervised and
techniques. The proposed framework given in Figure 1 is used for Genomic Feature set Selection
(GFS).
4. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
34
Figure 1. Genomic Feature set Selection (GFS).
The proposed framework consists of four phases which includes preprocessing, Feature set
Analysis, Feature set Identification and verification and validation phase.
3.1 Dataset
The scorpio-venom dataset is used in the proposed work. It is downloaded from the NCBI which
is in the XML format. The NCBI dataset is the integrated, text-based search and retrieval system
used at the major databases, including PubMed, Nucleotide and Protein Sequences, Protein
Structures, Complete Genomes, Taxanomy, and others. The scorpio -venom dataset contains
sequence information in XML format.
The genomic data in XML format has more than 3500 tags to represent the functional
descriptions about the sequences like accession no, taxonomy, organism, lineage, sequence title,
sequence descriptions, alternate name,gene name, author details, and identifiers related to other
databases like GO, KEGG, PUBMED. The scorpio-venom specis in XML Format is shown in the
Figure 2.The dataset consist of 107 documents in XML format for the scorpio -venom species.
Scorpio-venom dataset
from NCBI Preprocessing
Feature set
Analysis
Feature set
Identification
Verification and Validation
5. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
35
Figure 2.scorpio-venomspecis in XML Format
3.2 Preprocessing
The preprocessing phase of the proposed work consists of two main process which includes
extraction of keyword from XML document and construction of text matrix for feature
selection.The dataset in XML format is stored into DB2 data base, where each XML transaction
data file is uniquely identified with transaction identifiers. The XML documents are stored as xml
files in the db2 database and the features are extracted from the DB2 database using XQuery
features which offers easy access to the entire XML data or part of XML data without converting
to any conventional format. The key filter interface developed in DB2 XQuery provides the way
to extract the necessary fields like gene name and associated go terms, keywords in every
biological XML file. The 358 keywords are extracted from original Dataset which contain 511
keywords.
The document-term matrix contains rows corresponding to the documents and columns
corresponding to the terms. After extraction of the keywords from the XML document, the term
Matrix is constructed which is the input for further processing. The term matrix is represented as
binary encoded format. The values are encoded as zero and one, in the presence of keyword one
is entered and in absence of a keyword zero is entered into the corresponding place, since in the
xml database all the elements have the same domain values so no string edit measures are needed.
Table 1 shows the snapshot of the binary coded term matrix for the dataset chosen.
Table 1. Binary coded Term Matrix
6. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
36
3.3 Feature Set Analysis
The phase 2 consists of two main processes which includes clustering of the documents to assign
class label and analyzing the features using Filter based feature selection approaches using
supervised methods like ࢞2
statistics, CHIR statistics . Document clustering is the act of collecting
similar documents into bins, where similarity is some function on a document. The term matrix
constructed is given as input for the clustering phase. The documents are initially clustered for
analyzing the features using hierarchical clustering algorithm. The proposed work we have
considered 107 documents with 358 extracted keywords. On clustering the 107 documents 30
clusters are generated. From the generated cluster it is found that single document is found in
many clusters and maximum documents are found in 9 clusters. So we have taken the cluster
which contains highest number of documents to analyze the feature attributes and find the term
relevance using filter based approach.
Feature Analysis Based On ࢞ 2
Statistics
The ࢞ 2
Statistics can be used to measure the independence between the keyword and the
category[14]. This can be done by comparing observed frequency in the 2-way contingency table
with the expected frequency when they are assumed to be independent. For the ࢞ 2
keyword-
category independency, we consider the null hypothesis and alternative hypothesis. The null
hypothesis is that the keyword and category are independent. On the other hand, the alternative
hypothesis is that the keyword and category are not independent. To test the null hypothesis, we
compare the observed frequency with the expected frequency calculated under the assumption
that the null hypothesis is true. The expected frequency E(i,j) can be calculated as :
ܧሺ݅, ݆ሻ =
∑ ைሺ,ሻ ∑ ைሺ,ሻ್ ∈ሼ,షሽೌ ∈ሼഘ,షഘሽ
… eq (1)
The ࢞ 2
statistics value is defined as:
ݔଶ
=௪,
൫ܱሺ݅, ݆ሻ − ܧሺ݅, ݆ሻ൯
ଶ
ܧሺ݅, ݆ሻ
… ݁ݍሺ2ሻ
∈ሼ.ିሽ ∈ሼఠ,ିఠሽ
The degrees of freedom for the ݔ 2
Statistics is calculated by using the equation 3.
ܨܦ = ሺݎ − 1ሻ ∗ ሺܿ − 1ሻ … … … ݁ݍሺ3ሻ
Here, r and c are the number of row and column. In the fifth step, Look up the ݔଶ
Critical value
and Conclude the result .Looking up the ݔଶ
distribution table, if the critical value is much
smaller than the ݔ 2
Statistics values, then the null hypothesis is rejected. This can be explained
that it is significant and there is some dependency between the keyword and the category. The
statistics e relationship is analyzed between the keyword and category for the 9 Clusters which
contain maximum number of documents. Among 358 keywords, 156 keywords are found to be
dependent to the category.
7. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
37
Feature Analysis Based On ࡴࡵࡾ Statistics ሺܴఠ,ሻ
CHIR is a supervised learning algorithm based on ࢞
statistics, which determines the dependency
between a keyword and a category and also the type of dependency [13]. Type of dependency
indicates whether the feature is a positive or negative dependency for the category. There are two
steps to evaluate the dependency of a keyword w, to category c. First step to build a table of the
Observed Frequency and the Expected Frequency. Second step to calculate the ܴఠ, by using the
equation 4.
ܴ௪, =
௪,
ܧ௪,
… … … ݁ݍሺ4ሻ
If there is no dependency between the term w and the category c, then the value of ܴఠ,, is close
to 1. If there is a positive dependency then the observed frequency is larger than the expected
frequency, hence value of ܴఠ,,is larger than 1 and when there is a negative dependency ܴఠ,, is
smaller than 1. By calculating the ܴఠ,, it resulted all the 358 keywords are positively dependent
to atleast one cluster because the values are greater than one.
3.4 Feature Set Identification
The Third phase is the Feature set Identification Phase. In this phase the Features are Ranked
based on ݔ୫ୟ୶
ଶ
, ݔୟ୴
ଶ
and ݔݎ2
Statistics. The highly ranked Features are used for analyzing the
term relevance. The Feature sets are identified from ranking are Clustered with respect to
documents.
Ranking Based On ࢞ܠ܉ܕ
, ࢞ܞ܉
And ࢘࢞2
Statistics
The best keyword are selected for finding the keyword-goodness .The Supervised feature
selection method uses ࢞୫ୟ୶
ଶ
or ࢞ୟ୴
ଶ
to select the best keywords from m categories. The ࢞2
max and
࢞ୟ୴
ଶ
are calculated by using the equation 8 and equation 9.
࢞୫ୟ୶
ଶ ሺ߱ሻ = ୨
୫ୟ୶
ቄ࢞ ன,ୡౠ
ଶ
ቅ … ݁ݍሺ8ሻ
࢞ୟ୴
ଶ ሺ߱ሻ = pሺc୨
୫
୨ୀଵ
ሻ࢞ ன,ୡౠ
ଶ
… ݁ݍሺ9ሻ
Here, pሺc୨ሻ is the probability of the documents to be in the category c୨,then keyword whose
keyword-goodness measure is lower than a certain threshold would be removed from the feature
space. In other words, ݔ2
selects terms having strong dependency on categories. The supervised
feature selection method CHIR uses ࢘࢞2
to measure the Keyword -goodness and prove that ࢘࢞2
values represent only the positive keyword-category dependency. The following are the steps to
select the n keywords. There are three steps to calculate the ݔݎ2
Statistic value. In the first step the
࢘࢞2
Statistic value is calculated by using the equation 10.
࢘࢞
= ሺ
ୀ
ࡾ࣓,ࢉ
ሻ ࢞
࣓ࢉ, ࢝]࢚ࢎ ࡾ࣓,ࢉ
> 1 … ࢋሺሻ
8. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
38
Here, ሺܴఠ,ೕ
ሻ is the weight of ݔଶ
ఠೕ, in the corpus. In terms of ܴఠೕ, , ሺܴఠ,ೕ
ሻ is defined as
ሺࡾ࣓,ࢉ
ሻ =
ࡾ࣓,ࢉ
∑ ࡾ࣓,ࢉ
ୀ
࢚࢝]ࢎ ࡾ࣓,ࢉ
> 1 … ࢋሺሻ
In the second step the keywords are sorted in descending order of their ݔݎ2
Statistic. In the third
step the top n keywords from the list are selected. The largest values of ࢘࢞2
indicates that the
keyword ߱ is more relevant to the category c. Keyword with top ࢘࢞2
values are chosen as
features. Let us assume the threshold value to remove the redundant and irrelevant features.
Minimum threshold value 10 is taken 156 keywords are retrieved. For an average 50 is taken as
threshold value 77 keywords are retrieved. For Maximum 85 is taken as threshold value 58
keywords are retrieved. Among 358 keywords 58 keywords are ranked high and these keywords
are said to be more relevant and strongly dependent to the category.
Table 2. Relevant Keywords Retrieved
Threshold Value No. of Keywords Retrieved
10 156
50 77
85 58
Clustering Based On Selected Feature Set
The documents are clustered based on the new feature set with {156, 77, 58} keywords. On
clustering the 107 documents with respect to these feature set, 30 clusters are generated for each.
On clustering the 107 documents with respect to 156 keywords13 clusters are obtained with more
than one documents.
Among the 107 documents , it is founded that 12 documents are empty. Remaining 95 documents
are judged to be of 156 keywords in entire hierarchy. On clustering the 107 documents for feature
set with 77 keywords we obtained 6 clusters which contain more than one document. Among the
107 documents, it is founded that 82 documents are empty. Remaining 25 documents are judged
to be of 77 keywords in entire hierarchy.
On clustering the 107 documents for feature set with 58 keywords we obtained 2 clusters which
contain more than one document. Among the 107 documents, it is founded that 12 documents are
judged to be of 58 keywords in entire hierarchy. Remaining documents are found to be empty.
3.5 Verification and Validation
The fourth phase is the verification and validation phase. The clusters are generated using the
features set are evaluated by using the validation metrics. F-measure used to validate the clusters
of Original Feature set, Feature sets with 156, 77 and 58 keywords. The clusters are validated for
biological relevance by our experimental results is compared with existing BLAST tool.
9. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
39
3.5.1 Verifying the Terms Using F-measure
In our proposed work the keywords are verified using the F-Measure. F-Measure which is the
harmonic mean of the precision and recall. The formula for the corresponding Precision, Recall
and F-measure is shown in the eq.12, eq.13, and eq.14 respectively. For any Topic T and cluster
X: N1: Number of documents judged to be of topic T in cluster X.N2: Number of documents in
cluster X.N3: Number of documents to be judged to be of Topic T in entire hierarchy.
ܲ݊݅ݏ݅ܿ݁ݎ =
ܰଵ
ܰଶ
… ݁ݍሺ12ሻ
ܴ݈݈݁ܿܽ =
ܰଵ
ܰଷ
… ݁ݍሺ13ሻ
ܨ − ݉݁ܽ݁ݎݑݏ =
2 ∗ ሺܲ݊݅ݏ݅ܿ݁ݎ ∗ ܴ݈݈݁ܿܽሻ
ሺܲ݊݅ݏ݅ܿ݁ݎ + ܴ݈݈݁ܿܽሻ
… ݁ݍሺ14ሻ
Verification of Original Feature Set
On clustering the 107 documents for original feature set with 358 keywords we obtain 9 clusters
which contain more than one document. Among 9 clusters, 5 clusters {c3, c8, c11, c12, c24}
contain maximum number of documents. It is found that 98 documents are found to be judged to
be of Topic T in entire hierarchy. Remaining 9 documents are found to be empty. The clusters of
feature set with 358 keywords are evaluated with three measures (Precision, Recall and F-
measure) and the obtained result is given in Table 3.
Table 3.Evaluation Metrics for the Original Dataset
Cluster (%) Precision (%) Recall (%) F-measure (%)
c3 85 50 63
c8 80 45 58
c11 83 25 38
c12 77 20 32
c24 79 20 32
Verification of Feature set with 156 keywords
On clustering the 107 documents for feature set with 156 keywords we obtained 13 clusters {c1,
c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c14, c30} which contain more than one document. Among
13 clusters, 4 clusters {c1, c4, c5, c9} contain maximum number of documents. It is founded that
95 documents are found to be judged to be of Topic T in entire hierarchy. Remaining 12
10. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
40
documents are found to be empty. The clusters of feature set with 156 keywords are evaluated
with three measures (Precision, Recall and F-measure) and the obtained result is given in Table 4.
Table 4.Evaluation Metrics for the Feature set with 156 keywords.
Cluster(%) Precision(%) Recall(%) F-measure(%)
c1 80 60 69
c4 83 43 57
c5 42 42 57
c9 87 45 59
Verification of Feature set with 77 keywords
On clustering the 107 documents for feature set with 77 keywords we obtained 6 clusters {c1, c3,
c5, c10, c11, c30} which contain more than one document. Among 6 clusters, 2 clusters {c3, c11} contain
maximum number of documents. It is founded that 25 documents are found to be judged to be of Topic T in
entire hierarchy. Remaining documents are found to be empty. The clusters of feature set with 77 keywords
are evaluated with three measures (Precision, Recall and F-measure) and the obtained result is given in
Table 5.
Table 5.Evaluation Metrics for the Feature set with 77 keywords.
Cluster(%) Precision(%) Recall(%) F-measure(%)
C3 82 75 78
C11 83 78 80
Verification of Feature set with 58 keywords
On clustering the 107 documents for feature set with 58 keywords we obtained 2 clusters {c1, c11} which
contain more than one document. Among 2 clusters, clusters {c11} contain maximum number of
documents. It is founded that 12 documents are found to be judged to be of Topic T in entire hierarchy.
Remaining documents are found to be empty.
The clusters of feature set with 77 keywords are evaluated with three measures (Precision, Recall and F-
measure) and the obtained result is given in Table 6.
Table 6.Evaluation Metrics for the Feature set with 58 keywords.
Cluster (%) Precision (%) Recall (%) F-measure (%)
C11 84 95 89
11. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
41
Analyzing the Feature sets
In this section, we have compared the result obtained above. Using the Filter based Feature
selection methods the feature set extracted from scorpio-venom dataset are evaluated using
precision, recall and F-measure. The cross matrix for metrics are displayed in the Table 7.
Table 7. Evaluation Metrics for Feature set
Feature Set Keywords Precision (%) Recall (%) F-measure (%)
Original Feature set 358 80 32 45
Extracted Feature
set
156 83 48 60
77 82 77 80
58 84 95 89
3.5.2 Validating with BLAST
In our proposed work the feature sets are extracted using Filter based Feature selection methods.
The extracted feature set is compared with the Standard BLAST tool. The Clustering result
produced by the BLAT tool is accurate and it is already proven. We have compared our result
with BLAST to verify the biological relevance of the documents grouped based on the feature set.
Here, we have taken three clusters {c11, c10, c12} from BLAST and compare it with our
approach. The following Table 8 shows the results of Biological Validation Using BLAST.
Table 8. Biological Validation Using BLAST
The documents clustered using the feature set is verified with the documents clustered using the
BLAST. We have found that the feature set with 77 keywords and 58 keywords has grouped 82%
of similar documents that are grouped by existing BLAST tool. In this study we have found that
the feature set with 77 keywords and the feature set with 58 keywords are biologically relevant
for grouping the documents. From the experimental result we have consider the Feature set with
12. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
42
77 keywords and Feature set with 58 keywords extracted based on supervised and unsupervised
Filter based approach can be used for Clustering and Classification.
4. CONCLUSION
Mining biological data is an emerging area of intersection between bioinformatics and data
mining. It is difficult to explore and utilize the features set from huge amount of data. Feature
selection is a method for improving the efficiency and accuracy of text categorization algorithms
by removing redundant and irrelevant terms from the corpus. The proposed work is to study and
analyze the Filter based Feature Selection Approach for identifying Feature from Genomic
Sequence Database. A framework “Genomic Feature set selection (GFS)” is designed to
implement the proposed approach. The framework is processed using four different phases. The
experimental results it is found that the Analysis of Feature set with 58 keywords is 89% accurate
for grouping the documents which is evaluated and verified by using F-Measure. The
implemented work is also validated with existing Standard BLAST tool. On validating, we have
identified that 82% of documents grouped by Feature set with 77 and 58 keywords are similar to
the documents grouped by the BLAST tool.
REFERENCE
[1] Asha Gowda Karegowda, M.A.Jayaram, A.S. Manjunath, “Feature Subset Selection Problem using
Wrapper Approach in Supervised Learning”, 2010 International Journal of Computer Applications
(0975 – 8887) Volume 1 – No. 7.
[2] Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, Mario A.T. Figueiredo, “A
Bayesian Approach to Joint Feature Selection and Classifier Design ” , IEEE Transactions On
Pattern Analysis And Machine Intelligence, Vol. 26, NO. 9, SEPTEMBER 2004.
[3] Bassam Al-Salemi ., Mohd. Juzaiddin Ab Aziz , “Statistical Bayesian Learning For Automatic
Arabic Text Categorization” , In Journal of Computer Science 7 (1): 39-45, 2011.
[4] Bharati M. Ramageri , “Data Mining Techniques And Applications”, In Indian Journal of Computer
Science and Engineering, Vol. 1 No. 4 , 2010.
[5] Daniel T. Larose, “ Discovering Knowledge in Data: an Introduction to Data Mining, ISBN 0-471-
66657-2, 2005.
[6] Feng Tan, Xuezheng Fu, Hao Wang, Yanqing Zhang, and Anu Bourgeois, “A Hybrid Feature
Selection Approach for Microarray Gene Expression Data”, V.N. Alexandrov et al. (Eds.): ICCS
2006, Part II, LNCS 3992, 2006.
[7] Feng Tan, Xuezheng Fu, Hao Wang, Yanqing Zhang, and Anu Bourgeois, “A Hybrid Feature
Selection Approach for Microarray Gene Expression Data”, V.N. Alexandrov et al. (Eds.): ICCS
2006, Part II, LNCS 3992, 2006.
[8] George H.John, Ron Kohavi, Karl Pfleger,“Irrelevant Features And The Subset Selection
Problem”,Machine Learning : Proceedings of the Eleventh International Conference,Morgan
Kaufmann Publishers,San Francisco,CA,1994.
[9] Gouchol Pok, Jyh-Charn Steve Liu, Keun Ho Ryu, “Effective feature selection framework for cluster
analysis of microarray data”, ISSN 0973-2063 (online) 0973-8894 ,Bioinformation 4(8): 385-
389,2010.
[10] Guyon, Isabelle, and Andr´e ELISSEEFF, “An introduction to variable and feature selection”.
Journal of Machine Learning Research, 2003 1157–
[11] Lin Sun, Jiucheng Xu, Zhan’ao Xue, Lingjun Zhang, “Rough Entropy-based Feature Selection and
Its Application”, Journal of Information & Computational Science 8: 9 (2011) 1525–1532.
13. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.1, February 2012
43
[12] Pabitra Mitra C.A.,Murthy, SanKar K.Pal, “Unsupervised Feature Selection Using Feature
Similarity”, IEEE Transactions On Pattern Analysis And Machine Intelligence, Vol. 24, NO. 4,
April 2002.
[13] Tao Liu, Shengping Liu, Zheng Chen, “An Evaluation on Feature Selection for Text Clustering”,
Proceedings of the Twentieth International Conference On Machine Learning(ICML-2003).
[14] Xiubo Geng, Tie-Yan Liu, Tao Qin, Hang Li, “ Feature Selection for Ranking”,2007.
[15] Yiming Yang and Xiu Liu, “A re-examinations of text categorization methods”,1992.
[16] Yiming Yang, “A Comparitive study of Feature selection in Text Categorization”,1997.
[17] Yiming Yang, “Noise Reduction in a Statistical Approach to Text Categorization”,1995.
[18] YongSeog Kim, W. Nick Street and Filippo Menczer, “Feature Selection in Unsupervised Learning
via Evolutionary Search”, 2000.
[19] YongSeog Kim, W. Nick Street, and Filippo Menczer, “Feature Selection in Data Mining”.
[20] Yvan Saeys, Inaki Inza2 and Pedro Larranaga, “A review of feature selection techniques in
bioinformatics”,Vol.23 no. 19 2007, pages 2507–2517 doi:10.1093/bioinformatics/btm344.
[21] Zheng Zhao, Lei Wang and Huan Liu, “Efficient Spectral Feature Selection with Minimum
Redundancy”,2010.
Authors
Ms V Bhuvaneswari received her Bachelors Degree (B.Sc.) in Computer technology
from Bharathiar University, India 1997 , Masters Degree (MCA) in Computer
Applications from IGNOU, India . and M.Phil in Computer Science in 2003 from
Bharathiar University, India . She has qualified JRF , UGC-NET, for Lectureship in the
year 2003 She is currently pursuing her doctoral research in School of Computer Science
and Engineering at Bharathiar University in the area of Data mining . Her research
interests include Bioinformatics, Soft computing and Databases. She is currently
working as Assistant Professor in the School of Computer Science and Engineering, Bharathiar University,
India. She has for her credit publications in journals, International/ National Conferences.
Ms. K. Poongodi received her B.Sc degree in Mathematics with Computer Applications
from Bharathiar University, Master Degree(MCA) in Computer Applications from Anna
University, M.Phil in Computer Science in 2007, 2010, and 2011 respectively, from
Bharathiar University, Coimbatore, India. She has attended National conferences and
Presented a paper. Her area of interest includes Data Mining, Data Structures and
Bioinformatics.