Feature selection attracts researchers in machine learning and data mining. It consists of selecting the variables that have the greatest impact on dataset classification and discarding the rest. This dimensionality reduction makes classifiers faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets that are pre-processed with feature selection methods. More than 9% improvement in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.
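As a rough illustration of the effect this abstract describes (on synthetic data, not the paper's three real datasets), the sketch below compares a k-NN classifier with and without a simple filter-style feature selection step using scikit-learn's `SelectKBest`; the dataset shape and the choice of k-NN are assumptions for the example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 5 informative features buried among mostly noisy ones.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
acc_all = cross_val_score(clf, X, y, cv=5).mean()      # all 50 features

# Keep only the 10 highest-scoring features under the ANOVA F-test.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
acc_sel = cross_val_score(clf, X_sel, y, cv=5).mean()  # reduced feature set
```

On data like this, where most features are noise, the reduced set typically yields a clearly higher cross-validated accuracy, which is the behaviour the abstract reports.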
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus... - AM Publications
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data ('gene selection'). Numerous gene selection algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like small sample sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. This paper presents a combination of Analysis of Variance (ANOVA), Principal Component Analysis (PCA), and Recursive Cluster Elimination (RCE) with a classification algorithm, employing an innovative method for gene selection. It reduces the gene expression data to a minimal gene subset. This new feature selection method uses the ANOVA statistical test, principal component analysis, KNN classification, and RCE. At each step, redundant and irrelevant features are eliminated. Classification accuracy reaches up to 99.10%, with less classification time compared to other conventional techniques.
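A simplified sketch of the kind of filter-then-project-then-classify chain this abstract describes can be written as a scikit-learn pipeline; note this omits the recursive cluster elimination step and uses synthetic stand-in data with assumed dimensions, so it illustrates only the ANOVA + PCA + KNN portion, not the authors' full method.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Stand-in for microarray data: many features, few samples.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("anova", SelectKBest(f_classif, k=50)),   # ANOVA filter on genes
    ("pca", PCA(n_components=10)),             # compress the survivors
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
acc = cross_val_score(pipe, X, y, cv=5).mean()
```

Wrapping the steps in a `Pipeline` ensures the ANOVA filter and PCA are refit inside each cross-validation fold, avoiding selection bias in the reported accuracy.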
Classification of Microarray Gene Expression Data by Gene Combinations using ... - IJCSEA Journal
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify gene combinations into their relevant subtypes using fuzzy logic. The genes are ranked based on their statistical scores and highly informative genes are filtered. These genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select top gene combinations, which are then used to classify lymphoma subtypes using fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and analyzed to assess the accuracy of the results. The work is implemented in the Java language.
This article identifies the best classifier for breast cancer classification; surveys show that breast cancer is one of the major diseases in India. Breast cancer classification is a multivariate data analysis task, and classifying multivariate data is a big challenge for researchers because more than one decision variable is present. The proposed work determines which classifier is best among 48 different categories of classifier and finds that the logistic function is best among them all, with an accuracy of 80.9524%. The logistic function, also called logistic regression, is a statistical method for analysing the breast cancer dataset.
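The logistic regression approach this abstract settles on is straightforward to reproduce in outline; the sketch below uses scikit-learn's bundled Wisconsin breast cancer dataset as a stand-in (the article's Indian survey dataset is not available here), so the accuracy will differ from the 80.9524% reported.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 569 samples, 30 features, binary malignant/benign labels.
X, y = load_breast_cancer(return_X_y=True)

# Logistic regression with feature scaling, scored by 10-fold CV.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
acc = cross_val_score(model, X, y, cv=10).mean()
```

Scaling matters here: logistic regression's solver converges far more reliably when the 30 features, which span very different ranges, are standardized first.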
Outlier Modification and Gene Selection for Binary Cancer Classification usin... - CSCJournals
The Gaussian linear Bayes classifier is one of the most popular approaches for classification. However, it is not so popular for cancer classification using gene expression data due to the inverse problem of its covariance matrix in the presence of a large number of gene variables and a small number of cancer patients/samples in the training dataset. To overcome these problems, we propose selecting a few top differentially expressed (DE) genes from both the upregulated and downregulated groups for binary cancer classification using the Gaussian linear Bayes classifier. Usually, top DE genes are selected by ranking the p-values of the t-test procedure. However, both the t-test statistic and the Gaussian linear Bayes classifier are sensitive to outliers. Therefore, we also propose outlier modification for the gene expression dataset before applying the proposed methods, since gene expression datasets are often contaminated by outliers due to the several steps involved in the data-generating process, from hybridization to image analysis. The performance of the proposed method is investigated using both simulated and real gene expression datasets. It is observed that the proposed method improves performance with outlier modifications for binary cancer classification.
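The two-step recipe in this abstract, modify outliers first, then rank genes by t-test p-value, can be sketched on toy data as follows; the IQR-based clipping rule below is a generic winsorization, an assumption for illustration, not necessarily the authors' exact modification rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy expression matrix: 200 genes x 20 samples, two classes of 10 samples.
X = rng.normal(size=(200, 20))
X[:5, 10:] += 3.0              # genes 0-4 are differentially expressed
X[100, 0] = 50.0               # inject a technical outlier into a noise gene
labels = np.array([0] * 10 + [1] * 10)

# Outlier modification: clip each gene's values to median +/- 3 * IQR.
q1 = np.percentile(X, 25, axis=1, keepdims=True)
q3 = np.percentile(X, 75, axis=1, keepdims=True)
med = np.median(X, axis=1, keepdims=True)
Xc = np.clip(X, med - 3 * (q3 - q1), med + 3 * (q3 - q1))

# Rank genes by the p-value of a two-sample t-test on the cleaned data.
_, pvals = stats.ttest_ind(Xc[:, labels == 0], Xc[:, labels == 1], axis=1)
top10 = np.argsort(pvals)[:10]  # candidate DE genes for the classifier
```

The clipped matrix `Xc` is what would then be fed to the Gaussian linear Bayes classifier, restricted to the top-ranked genes so the covariance matrix stays invertible.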
Classification accuracy analyses using Shannon's Entropy - IJERA Editor
There are many methods for determining classification accuracy. This paper shows the significance of the entropy of training signatures in classification. The entropy of the training signatures of a raw digital image represents the heterogeneity of the brightness values of the pixels in different bands. This implies that an image comprising a homogeneous lu/lc (land-use/land-cover) category will be associated with nearly the same reflectance values, resulting in a very low entropy value. On the other hand, an image characterized by diverse lu/lc categories will consist of largely differing reflectance values, so the entropy of such an image will be relatively high. This concept leads to analyses of classification accuracy. Although entropy has been used many times in remote sensing (RS) and GIS, its use in determining classification accuracy is a new approach.
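The homogeneous-versus-diverse contrast the abstract describes is easy to verify numerically; the sketch below computes Shannon entropy over the brightness-value histogram of two synthetic single-band images (the value ranges are assumptions for the example).

```python
import numpy as np

def shannon_entropy(band: np.ndarray) -> float:
    """Entropy (in bits) of a band's brightness-value histogram."""
    _, counts = np.unique(band, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
# One homogeneous lu/lc class: brightness values confined to 100-103.
homogeneous = rng.integers(100, 104, size=(64, 64))
# Diverse lu/lc classes: brightness values spread over the full 0-255 range.
diverse = rng.integers(0, 256, size=(64, 64))

e_low = shannon_entropy(homogeneous)   # near log2(4) = 2 bits
e_high = shannon_entropy(diverse)      # near log2(256) = 8 bits
```

The homogeneous image's entropy stays near 2 bits while the diverse one approaches 8, matching the paper's premise that signature entropy tracks lu/lc heterogeneity.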
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Data warehouses are structures with large amounts of data collected from heterogeneous sources to be used in a decision support system. Data warehouse analysis identifies hidden, initially unexpected patterns, and this analysis requires great memory and computation cost. Data reduction methods have been proposed to make this analysis easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GA) as evolutionary algorithms and Multiple Correspondence Analysis (MCA) as a factor analysis method to conduct this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, and proposes to find the fact profile that is closest to a reference. GAs identify the possible subsets, and the chi-square formula of the MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and n fact profiles extracted from the warehouse.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay... - IJERA Editor
This paper presents a new approach to selecting a reduced number of features in databases. Every database has a given number of features, but it is observed that some of these features can be redundant, can be harmful, and can confuse the process of classification. The proposed method first applies a binary-coded genetic algorithm to select a small subset of features. The importance of these features is judged by applying the Naïve Bayes (NB) method of classification. The best reduced subset of features, with high classification accuracy on the given databases, is adopted. The classification accuracy obtained by the proposed method is compared with that reported recently in publications on eight databases. It is noted that the proposed method performs satisfactorily on these databases and achieves higher classification accuracy with a smaller number of features.
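A GA-wrapped Naïve Bayes selector of the kind this abstract outlines can be sketched compactly; everything below (population size, mutation rate, generation count, the synthetic dataset) is an assumed toy configuration, not the paper's setup, and the GA is deliberately minimal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=10, random_state=0)

def fitness(mask):
    """Cross-validated Naive Bayes accuracy on the selected feature subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

# Tiny binary-coded GA: tournament selection, bit-flip mutation, elitism.
pop = rng.random((20, X.shape[1])) < 0.5       # chromosomes = feature masks
best_mask, best_fit = pop[0].copy(), fitness(pop[0])
for _ in range(10):
    scores = np.array([fitness(m) for m in pop])
    i = int(scores.argmax())
    if scores[i] > best_fit:                   # remember the best subset seen
        best_fit, best_mask = scores[i], pop[i].copy()
    winners = [max(rng.integers(0, len(pop), 2), key=lambda j: scores[j])
               for _ in range(len(pop))]       # 2-way tournaments
    pop = pop[winners] ^ (rng.random((len(pop), X.shape[1])) < 0.05)
```

The NB classifier serves purely as the fitness oracle here, exactly the wrapper role the abstract assigns it; crossover is omitted to keep the sketch short.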
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t... - Damian R. Mingle, MBA
Each year it has become more and more difficult for healthcare providers to determine whether a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective in terms of the quality of care provided to patients through the use of automated systems. However, in many cases automated systems can allow for misclassification and force providers to review more cases than necessary. In this study, we analyzed methods to increase true positives and lower false positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana... - IJRES Journal
Data clustering is a common technique for statistical data analysis; it is defined as a class of statistical techniques for classifying a set of observations into distinct groups. Cluster analysis seeks to minimize within-group variance and maximize between-group variance. In this study we formulate a mathematical programming model that chooses the most important variables in cluster analysis. A nonlinear binary model is suggested to select the most important variables in clustering a set of data. The idea of the suggested model depends on clustering data by minimizing the distance between observations within groups. Indicator variables are used to select the most important variables in the cluster analysis.
Automatic Feature Subset Selection using Genetic Algorithm for Clustering - idescitation
Feature subset selection is the process of selecting a minimal subset of relevant features and is a preprocessing technique for a wide variety of applications. High-dimensional data clustering is a challenging task in data mining. A reduced set of features helps make patterns easier to understand, and reduced feature sets are more significant when they are application specific. Almost all existing feature subset selection algorithms are neither automatic nor application specific. This paper attempts to find the feature subset that yields optimal clusters during clustering. The proposed Automatic Feature Subset Selection using Genetic Algorithm (AFSGA) identifies the required features automatically and reduces the computational cost of determining good clusters. The performance of AFSGA is tested using public and synthetic datasets with varying dimensionality. Experimental results show the improved efficacy of the algorithm in terms of optimal clusters and computational cost.
Analysis On Classification Techniques In Mammographic Mass Data Set - IJERA Editor
Data mining, the extraction of hidden information from large databases, is used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-mining classification techniques determine the group with which each data instance is associated. They can handle a wide variety of data, so large amounts of data can be involved in processing. This paper analyzes various data mining classification techniques, such as Decision Tree Induction, Naïve Bayes, and k-Nearest Neighbour (KNN) classifiers, on a mammographic mass dataset.
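A head-to-head comparison of the three classifier families named in this abstract is a short loop in scikit-learn; since the mammographic mass dataset is not bundled with scikit-learn, the breast cancer dataset is used below purely as a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for mammographic masses

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive bayes": GaussianNB(),
    # KNN is distance-based, so it gets a scaling step.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
```

Using the same cross-validation folds for every model keeps the comparison fair, which is the core of the kind of analysis the paper performs.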
INFLUENCE OF DATA GEOMETRY IN RANDOM SUBSET FEATURE SELECTION - IJDKP
The geometry of data, also known as its probability distribution, is an important consideration for the accurate computation of data mining tasks such as pre-processing, classification, and interpretation. The data geometry influences the outcome and accuracy of statistical analysis to a large extent. The current paper focuses on understanding the influence of data geometry on the feature subset selection process using the random forest algorithm. In practice, it is assumed that the data follows a normal distribution, and most of the time this may not be true. The dimensionality reduction varies with changes in the distribution of the data. A comparison is made using three standard distributions: Triangular, Uniform, and Normal. The results are discussed in this paper.
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u... - cscpconf
Improving the accuracy of supervised classification algorithms in biomedical applications, especially CADx, is an active area of research. This paper proposes the construction of rotation forest (RF) ensembles using 20 learners over two clinical datasets, namely lymphography and backache. We propose a new feature selection strategy based on support vector machines optimized by particle swarm optimization, which finds a relevant and minimal feature subset for obtaining higher ensemble accuracy. We quantitatively analyzed 20 base learners over the two datasets, carried out the experiments with 10-fold cross-validation and a leave-one-out strategy, and evaluated the performance of the 20 classifiers using the metrics accuracy (acc), kappa value (K), root mean square error (RMSE), and area under the receiver operating characteristic curve (ROC). The base classifiers achieved 79.96% and 81.71% average accuracy for the lymphography and backache datasets, respectively. The RF ensembles produced average accuracies of 83.72% and 85.77% for the respective diseases. The paper presents promising results using RF ensembles and provides a new direction towards the construction of reliable and robust medical diagnosis systems.
Metasem: An R Package For Meta-Analysis Using Structural Equation Modelling: ... - Pubrica
This presentation explains metaSEM, an R package for meta-analysis using structural equation modelling:
1. SEM is used to formulate meta-analytic models for analysing structural relationships; it supports univariate, multivariate, and three-level meta-analysis.
2. The structural equation models (SEM) are generally optimized and fitted using the OpenMx package.
3. Routine analyses can be run with R in batch mode, either interactively or non-interactively. Using a graphical interface such as RStudio is a convenient way for users to interact with the analysis.
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO... - cscpconf
Optimization problems are predominantly solved using computational intelligence. One of the issues that can be addressed in this context is attribute subset selection evaluation. This paper presents a computational intelligence technique for solving this optimization problem using a proposed model called Modified Genetic Search Algorithm (MGSA), which avoids bad local search spaces using merit and scaled fitness variables, detecting and deleting bad candidate chromosomes and thereby reducing the number of individual chromosomes in the search space and the iterations in subsequent generations. This paper aims to show that rotation forest ensembles are useful in the feature selection method. The base classifier is the multinomial logistic regression method, integrated with Haar wavelets as a projection filter, and the rank of each feature is reproduced with 10-fold cross-validation. It also discusses the main findings and concludes with the promising results of the proposed model. It explores the combination of MGSA for optimization with Naïve Bayes classification. The result obtained using the proposed MGSA model is validated mathematically using Principal Component Analysis. The goal is to improve the accuracy and quality of breast cancer diagnosis with robust machine learning algorithms. Compared to other works in the literature survey, the experimental results achieved in this paper are better, with statistical inference.
Study of Clustering of Data Base in Education Sector Using Data Mining - IJSRD
Data mining is a technology used in different disciplines to search for significant relationships among variables across many data sets, and it is frequently applied in a wide range of areas. In this paper, data mining is applied to the field of education. The relationship between students' university entrance examination results and their subsequent success was studied using cluster analysis and the k-means algorithm.
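A minimal k-means sketch of the kind of student clustering this abstract studies is shown below; the two-feature records (entrance-exam score, first-year GPA) and the three performance bands are hypothetical data invented for the example, not the paper's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical records: [entrance-exam score, first-year GPA] in three bands.
students = np.vstack([
    rng.normal([45.0, 2.0], [3.0, 0.3], size=(30, 2)),   # low band
    rng.normal([65.0, 3.0], [3.0, 0.3], size=(30, 2)),   # middle band
    rng.normal([85.0, 3.8], [3.0, 0.2], size=(30, 2)),   # high band
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(students)
exam_centres = sorted(km.cluster_centers_[:, 0])  # cluster means on exam axis
```

With well-separated bands like these, the recovered cluster centres land near the band means, which is the sort of entrance-score/success relationship the study examines.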
Feature Selection Approach based on Firefly Algorithm and Chi-square - IJECEIAES
The dimensionality problem is a well-known challenge for most classifiers when datasets have an unbalanced number of samples and features. Features may contain unreliable data, which may lead the classification process to produce undesirable results. Feature selection is considered a solution for this kind of problem. In this paper an enhanced firefly algorithm is proposed to serve as a feature selection solution for reducing dimensionality and picking the most informative features to be used in classification. The main purpose of the proposed model is to improve classification accuracy by using the features selected by the model, so that classification errors decrease. Modeling the firefly in this research is done by simulating the firefly position with a cell chi-square value, which changes after every move, and by simulating the firefly intensity through a set of different fitness functions calculated as a weight for each feature. K-nearest neighbor and discriminant analysis are used as classifiers to test the features selected by the proposed firefly algorithm. Experimental results showed that the proposed enhanced algorithm, based on the firefly algorithm with chi-square and different fitness functions, can provide better results than others. The results showed that reducing the dataset is useful for gaining higher classification accuracy.
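The firefly search itself is beyond a short sketch, but the chi-square scoring it is built around can be shown directly; the example below uses scikit-learn's `chi2` scorer on the Iris data (whose features are non-negative, as chi-square requires) as a stand-in for the paper's cell chi-square values, and KNN as one of the two test classifiers the abstract names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # 4 non-negative features

# Chi-square scores each feature against the class labels;
# keep the 2 most informative features.
sel = SelectKBest(chi2, k=2).fit(X, y)
X_sel = sel.transform(X)

acc = cross_val_score(KNeighborsClassifier(), X_sel, y, cv=5).mean()
```

In the paper, chi-square values like these would drive the firefly positions rather than a fixed top-k cut; the point of the sketch is only that chi-square provides a usable per-feature relevance weight.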
A Combined Approach for Feature Subset Selection and Size Reduction for High ... - IJERA Editor
Selection of relevant features from a given set of features is one of the important issues in the fields of data mining and classification. In general, a dataset may contain many features, but it is not necessary that the whole feature set is important for a particular analysis or decision, because features may share common information or may be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation or because of improper information availability about the observed system. In both cases, the data will contain features that merely increase the processing burden, which may ultimately cause improper outcomes when used for analysis. For these reasons, methods are required to detect and remove such features; hence, in this paper we present an efficient approach that not only removes unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier using 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
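Of the three stages this abstract chains together, the information-gain step is the simplest to illustrate; the sketch below scores features by mutual information with the class on synthetic data (the MST grouping and fuzzy c-means stages are omitted, and `mutual_info_classif` is used here as a stand-in estimator of information gain).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in: 20 features, only a handful carry class information.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=4, random_state=0)

# Information gain (mutual information) of each feature w.r.t. the class.
gain = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(gain)[::-1]        # most informative features first
```

In the full algorithm these scores would feed the minimum-spanning-tree grouping, so that only one representative per group of mutually similar features survives.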
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu... - IJECEIAES
In the classification of many diseases an accurate gene analysis is needed, for which the selection of the most informative genes is very important; this requires a decision technique suited to a complex context of ambiguity. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-sample t-test (2STT), entropy, and signal-to-noise ratio (SNR). This paper evaluates gene selection and classification on the basis of accurate gene selection using a structured complex decision technique (SCDT) and classifies the result using a fuzzy cluster-based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated with the leave-one-out cross-validation (LOOCV) metric, along with sensitivity, specificity, precision, and F1-score, against four different classifiers, namely 1) Radial Basis Function (RBF), 2) multi-layer perceptron (MLP), 3) feed-forward (FF), and 4) support vector machine (SVM), on three different datasets: DLBCL, Leukemia, and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results and can be considered a more accurate decision mechanism.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Data Warehouses are structures with large amount of data collected from heterogeneous sources to be
used in a decision support system. Data Warehouses analysis identifies hidden patterns initially unexpected
which analysis requires great memory and computation cost. Data reduction methods were proposed to
make this analysis easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GA)
as Evolutionary Algorithms and the Multiple Correspondence Analysis (MCA) as Analysis Factor Methods
to conduct this reduction. Our approach identifies reduced subset of dimensions p’ from the initial subset p
where p'<p where it is proposed to find the profile fact that is the closest to reference. Gas identify the
possible subsets and the Khi² formula of the ACM evaluates the quality of each subset. The study is based
on a distance measurement between the reference and n facts profile extracted from the warehouse.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...IJERA Editor
This paper presents a new approach to select reduced number of features in databases. Every database has a
given number of features but it is observed that some of these features can be redundant and can be harmful as
well as confuse the process of classification. The proposed method first applies a binary coded genetic algorithm
to select a small subset of features. The importance of these features is judged by applying Naïve Bayes (NB)
method of classification. The best reduced subset of features which has high classification accuracy on given
databases is adopted. The classification accuracy obtained by proposed method is compared with that reported
recently in publications on eight databases. It is noted that proposed method performs satisfactory on these
databases and achieves higher classification accuracy but with smaller number of feature
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...Damian R. Mingle, MBA
Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology
related to the vertebral column. There is great potential to become more efficient and effective in terms of quality
of care provided to patients through the use of automated systems. However, in many cases automated systems
can allow for misclassification and force providers to have to review more causes than necessary. In this study, we
analyzed methods to increase the True Positives and lower the False Positives while comparing them against stateof-the-art
techniques in the biomedical community. We found that by applying the studied techniques of a data-driven
model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized
in the current research community.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
Data clustering is a common technique for statistical data analysis; it is defined as a class of
statistical techniques for classifying a set of observations into completely different groups. Cluster analysis
seeks to minimize group variance and maximize between group variance. In this study we formulate a
mathematical programming model that chooses the most important variables in cluster analysis. A nonlinear
binary model is suggested to select the most important variables in clustering a set of data. The idea of the
suggested model depends on clustering data by minimizing the distance between observations within groups.
Indicator variables are used to select the most important variables in the cluster analysis.
Automatic Feature Subset Selection using Genetic Algorithm for Clusteringidescitation
Feature subset selection is the process of selecting a minimal subset of
relevant features, and is a pre-processing technique for a wide variety
of applications. Clustering high-dimensional data is a challenging task
in data mining. A reduced set of features makes the discovered patterns
easier to understand, and is more significant when it is application
specific. Most existing feature subset selection algorithms are neither
automatic nor application specific. This paper attempts to find the
feature subset that yields optimal clusters during clustering. The
proposed Automatic Feature Subset Selection using Genetic Algorithm
(AFSGA) identifies the required features automatically and reduces the
computational cost of determining good clusters. The performance of
AFSGA is tested on public and synthetic datasets of varying
dimensionality. Experimental results show the improved efficacy of the
algorithm in terms of cluster quality and computational cost.
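Genetic algorithms for feature subset selection of this kind typically evolve bitmask chromosomes; a minimal sketch is below. The fitness function, population sizes, and seed are stand-ins invented for illustration, not AFSGA's actual operators.

```python
# Minimal GA over feature-subset bitmasks (crossover + mutation + elitism).
# The fitness function is a toy: it rewards two 'informative' bits.
import random

def crossover(a, b, rng):
    """Single-point crossover of two bitmask chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(mask, rng, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ (rng.random() < rate) for bit in mask]

def evolve(fitness, n_bits, pop_size=20, generations=40, seed=3):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitism: keep the top half
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: bits 0 and 3 are informative; every selected bit costs a little.
fit = lambda m: 2 * (m[0] + m[3]) - sum(m)
print(evolve(fit, n_bits=6))
```

The real algorithm would replace the toy fitness with a cluster-quality criterion evaluated on the data.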
Analysis On Classification Techniques In Mammographic Mass Data SetIJERA Editor
Data mining, the extraction of hidden information from large databases, aims to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-mining classification techniques determine the group with which each data instance is associated, and can handle a wide variety of data, so large amounts of data can be involved in processing. This paper analyzes various data mining classification techniques, namely Decision Tree Induction, Naïve Bayes, and k-Nearest Neighbour (KNN) classifiers, on the mammographic mass dataset.
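Of the classifiers named above, k-nearest neighbour is the simplest to sketch; a minimal pure-Python version follows. The two features and the labelled points are made-up stand-ins, not the real mammographic mass data.

```python
# Minimal k-nearest-neighbour classifier: majority vote among the k
# closest training points. Data are invented for illustration.
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature records: (mass density, patient age) with labels.
train_X = [[1.0, 45], [1.2, 50], [3.0, 67], [3.2, 70], [2.9, 66]]
train_y = ["benign", "benign", "malignant", "malignant", "malignant"]
print(knn_predict(train_X, train_y, [3.1, 68]))  # → malignant
```

In practice features should be scaled before computing distances, since age dominates density in raw units here.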
INFLUENCE OF DATA GEOMETRY IN RANDOM SUBSET FEATURE SELECTIONIJDKP
The geometry of data, also known as its probability distribution, is an important consideration for accurate computation of data mining tasks such as pre-processing, classification and interpretation. The data geometry influences the outcome and accuracy of statistical analysis to a large extent. The current paper focuses on understanding the influence of data geometry on the feature subset selection process using the random forest algorithm. In practice it is often assumed that the data follow a normal distribution, which may not be true. The dimensionality reduction varies with changes in the distribution of the data. A comparison is made using three standard distributions, namely the triangular, uniform and normal distributions, and the results are discussed in this paper.
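The three distributions compared above can be sampled directly from the standard library to see how the "geometry" changes summary statistics; the parameters below (support [0, 1], normal mean 0.5 and sd 0.2) are illustrative choices, not the paper's settings.

```python
# Sample the Triangular/Uniform/Normal distributions and compare variance,
# showing how distribution shape alone changes the statistics downstream
# methods see. Parameters are illustrative.
import random
import statistics

random.seed(0)
n = 10_000
samples = {
    "triangular": [random.triangular(0, 1, 0.5) for _ in range(n)],
    "uniform":    [random.uniform(0, 1) for _ in range(n)],
    "normal":     [random.gauss(0.5, 0.2) for _ in range(n)],
}
for name, xs in samples.items():
    print(f"{name:10s} mean={statistics.mean(xs):.3f} "
          f"var={statistics.variance(xs):.4f}")
```

On [0, 1] the uniform has variance 1/12 ≈ 0.083 while the symmetric triangular has only 1/24 ≈ 0.042, so identical supports can still present very different geometry to a feature selector.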
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u...cscpconf
Improving the accuracy of supervised classification algorithms in biomedical applications,
especially CADx, is an active area of research. This paper proposes the construction of
rotation forest (RF) ensembles using 20 learners over two clinical datasets, lymphography
and backache. We propose a new feature selection strategy, based on support vector machines
optimized by particle swarm optimization, that finds a relevant and minimal feature subset
for obtaining higher ensemble accuracy. We quantitatively analyzed the 20 base learners
over the two datasets, carried out the experiments with 10-fold cross-validation and
leave-one-out strategies, and evaluated the performance of the 20 classifiers using
accuracy (acc), kappa value (K), root mean square error (RMSE) and area under the receiver
operating characteristic curve (ROC). The base classifiers achieved average accuracies of
79.96% and 81.71% for the lymphography and backache datasets respectively, while the RF
ensembles produced average accuracies of 83.72% and 85.77%. The paper presents promising
results using RF ensembles and provides a new direction towards construction of reliable and robust medical diagnosis systems.
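The evaluation metrics this abstract names (accuracy, Cohen's kappa, RMSE) have direct closed forms; a small sketch with invented binary predictions:

```python
# Accuracy, Cohen's kappa, and RMSE for binary 0/1 labels.
# The prediction vectors are toy values, not the paper's results.
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohens_kappa(y_true, y_pred):
    """Observed agreement corrected for chance agreement."""
    n = len(y_true)
    po = accuracy(y_true, y_pred)
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    pe = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    return (po - pe) / (1 - pe)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # → 0.75
```

Kappa is the more informative of the three on imbalanced clinical data, since raw accuracy can be inflated by always predicting the majority class.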
Metasem: An R Package For Meta-Analysis Using Structural Equation Modelling: ...Pubrica
This presentation explains metaSEM, an R package for meta-analysis using structural equation modelling:
1. SEM, which is used to analyse structural relationships, can be formulated as a meta-analytic model for conducting meta-analysis; such SEM-based meta-analyses can be univariate, multivariate, or three-level.
2. The structural equation model (SEM) is generally optimized and fitted using the OpenMx package.
3. Routine analyses can be run with the R package in batch mode, either interactively or non-interactively; a graphical interface such as RStudio is a convenient way for users to interact with the analysis.
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...cscpconf
Optimization problems are increasingly being solved using computational intelligence, and
one issue that can be addressed in this context is attribute subset selection evaluation.
This paper presents a computational intelligence technique for solving this optimization
problem using a proposed model called the Modified Genetic Search Algorithm (MGSA), which
avoids poor local search spaces using merit and scaled fitness variables, detecting and
deleting bad candidate chromosomes and thereby reducing the number of individual
chromosomes in the search space and in subsequent generations. The paper also aims to show
that rotation forest ensembles are useful in the feature selection method: the base
classifier is multinomial logistic regression integrated with Haar wavelets as the
projection filter, and the ranks of the features are reproduced with 10-fold
cross-validation. It discusses the main findings, explores the combination of MGSA for
optimization with Naïve Bayes classification, and concludes with promising results for the
proposed model; the result obtained using MGSA is validated mathematically using Principal
Component Analysis. The goal is to improve the accuracy and quality of breast cancer
diagnosis with robust machine learning algorithms. Compared with other works in the
literature, the experimental results achieved in this paper are better, with statistical inference.
Study of Clustering of Data Base in Education Sector Using Data MiningIJSRD
Data mining is a technology used in different disciplines to search for significant relationships among variables in large numbers of datasets, and it is frequently applied across many areas and applications. In this paper, data mining is applied to the field of education: the relationship between students' university entrance examination results and their subsequent success was studied using cluster analysis and k-means algorithm techniques.
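The k-means procedure used in the study above is easy to sketch in one dimension; the entrance-examination scores below are fabricated for illustration, and the even-spacing initialisation is one simple choice among many.

```python
# Plain 1-D k-means: assign points to the nearest centre, then move each
# centre to its cluster mean, and repeat. Scores are invented.
def kmeans_1d(xs, k, iters=20):
    """1-D k-means with centres initialised at evenly spaced sorted points (k >= 2)."""
    xs_sorted = sorted(xs)
    centers = [xs_sorted[i * (len(xs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Toy entrance-exam scores with three natural bands.
scores = [42, 45, 48, 70, 74, 76, 91, 94, 95]
print(kmeans_1d(scores, 3))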
Feature Selection Approach based on Firefly Algorithm and Chi-square IJECEIAES
The dimensionality problem is a well-known challenge for most classifiers when datasets have an unbalanced number of samples and features. Features may contain unreliable data, which may lead the classification process to produce undesirable results; feature selection is considered a solution to this kind of problem. In this paper an enhanced firefly algorithm is proposed as a feature selection solution for reducing dimensionality and picking the most informative features to be used in classification. The main purpose of the proposed model is to improve classification accuracy through the selected features it produces, thereby decreasing classification errors. The firefly is modelled in this research by simulating firefly position with the cell chi-square value, which changes after every move, and firefly intensity by calculating a set of different fitness functions as a weight for each feature. K-nearest neighbour and discriminant analysis are used as classifiers to test the proposed firefly algorithm in selecting features. Experimental results showed that the proposed enhanced algorithm, based on the firefly algorithm with chi-square and different fitness functions, can provide better results than others, and that reducing the dataset is useful for gaining higher classification accuracy.
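The chi-square statistic underlying the firefly positions above can be computed from a contingency table of feature value versus class; a minimal sketch with toy counts:

```python
# Chi-square statistic for a 2x2 contingency table of a binary feature
# against a binary class. Counts are toy values.
def chi_square(observed):
    """Chi-square statistic for [[a, b], [c, d]] (feature rows, class columns)."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Feature present/absent (rows) vs class positive/negative (columns).
informative = [[40, 10], [10, 40]]   # strongly associated with the class
noisy       = [[25, 25], [25, 25]]   # independent of the class
print(chi_square(informative), chi_square(noisy))  # → 36.0 0.0
```

A larger statistic means the feature's value distribution depends more strongly on the class, which is why it serves as a selection score.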
A Combined Approach for Feature Subset Selection and Size Reduction for High ...IJERA Editor
The selection of relevant features from a given feature set is one of the important issues in the
fields of data mining and classification. In general a dataset may contain many features, but it is not
necessary that the whole feature set is important for a particular analysis or decision, because the
features may share common information or be completely irrelevant to the undergoing
processing. This generally happens because of improper selection of features during dataset formation, or
because of incomplete information about the observed system. In both cases the data will
contain features that merely increase the processing burden, which may ultimately cause improper
outcomes when used for analysis. For these reasons, methods are required to detect and
remove such features; hence this paper presents an efficient approach for removing not just the
unimportant features but also reducing the size of the complete dataset. The proposed algorithm uses
information theory to compute the information gain of each feature and a minimum spanning tree to group
similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally the
algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the
results show that the presented algorithm not only reduces the feature set and data length but also improves the
performance of the classifier.
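The information-gain filter mentioned above is a standard entropy computation; a minimal sketch on a hypothetical four-row table:

```python
# Information gain of a discrete feature: entropy of the labels minus the
# weighted entropy after splitting on the feature. Toy data throughout.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """H(labels) minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    split = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        split += len(subset) / n * entropy(subset)
    return entropy(labels) - split

labels  = ["yes", "yes", "no", "no"]
strong  = ["a", "a", "b", "b"]      # perfectly predicts the label
useless = ["a", "b", "a", "b"]      # carries no information
print(info_gain(strong, labels), info_gain(useless, labels))  # → 1.0 0.0
```

Features scoring near zero are candidates for removal; the paper's minimum-spanning-tree step then groups the survivors by similarity.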
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
In the classification of many diseases an accurate gene analysis is needed, for which the selection of the most informative genes is very important; this requires a decision technique suited to a complex, ambiguous context. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-sample t-test (2STT), entropy, and the signal-to-noise ratio (SNR). This paper evaluates gene selection and classification based on accurate gene selection using a structured complex decision technique (SCDT), and classifies the result using a fuzzy cluster-based nearest neighbour classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using the leave-one-out cross-validation (LOOCV) metric, along with sensitivity, specificity, precision and F1-score, against four different classifiers, namely 1) radial basis function (RBF), 2) multi-layer perceptron (MLP), 3) feed-forward (FF) and 4) support vector machine (SVM), for three different datasets: DLBCL, Leukemia and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results and can be considered a more accurate decision mechanism.
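Of the traditional scores listed above, the signal-to-noise ratio is the simplest: the difference of class means divided by the sum of class standard deviations. The expression values below are invented.

```python
# Signal-to-noise ratio (SNR) gene score:
# (mean_class1 - mean_class2) / (std_class1 + std_class2).
import statistics

def snr(expr_class1, expr_class2):
    """SNR of one gene's expression values across two sample classes."""
    return ((statistics.mean(expr_class1) - statistics.mean(expr_class2))
            / (statistics.stdev(expr_class1) + statistics.stdev(expr_class2)))

gene_a = ([8.1, 8.3, 8.2], [2.0, 2.2, 2.1])   # well separated: large |SNR|
gene_b = ([5.0, 6.0, 4.0], [5.1, 4.9, 6.2])   # overlapping: small |SNR|
print(abs(snr(*gene_a)) > abs(snr(*gene_b)))  # → True
```

Genes are ranked by |SNR| and the top-ranked ones kept; the paper's contribution is a decision technique that goes beyond such single-score rankings.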
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has supplied a large volume of data to many fields, and gene microarray analysis and classification have demonstrated an effective way to diagnose diseases and cancers. Inasmuch as the data obtained from microarray technology are very noisy and have thousands of features, feature selection plays an important role in removing irrelevant and redundant features and reducing computational complexity. There are two important approaches to gene selection in microarray data analysis: filters and wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection method that combines the two approaches: candidate features are first selected from the original set via several effective filters, and the candidate feature set is then refined by more accurate wrappers. Thus we can take advantage of both the filters and the wrappers. Experimental results on 11 microarray datasets show that our mechanism is effective with a smaller feature set, and that these feature subsets can be obtained in a reasonable time.
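The filter-then-wrapper pipeline described above can be sketched in a few lines: a cheap filter score shortlists genes, then an exhaustive wrapper evaluates small subsets of the shortlist. The scores and the wrapper's accuracy table here are synthetic stand-ins for real filter statistics and classifier runs.

```python
# Hybrid filter + wrapper sketch: shortlist by filter score, then search
# all small subsets of the shortlist for the best wrapper accuracy.
from itertools import combinations

def filter_stage(scores, keep):
    """Keep the indices of the `keep` highest-scoring features."""
    return sorted(sorted(scores, key=scores.get, reverse=True)[:keep])

def wrapper_stage(candidates, evaluate, max_size=2):
    """Search every subset of the shortlist up to max_size features."""
    return max(
        (subset for r in range(1, max_size + 1)
         for subset in combinations(candidates, r)),
        key=evaluate,
    )

scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2, 4: 0.7}    # filter score per feature
accuracy = {(0,): 0.80, (2,): 0.75, (4,): 0.70,      # wrapper's accuracy table
            (0, 2): 0.92, (0, 4): 0.85, (2, 4): 0.78}
print(wrapper_stage(filter_stage(scores, 3), accuracy.get))  # → (0, 2)
```

The shortlist keeps the wrapper's exponential search tractable, which is exactly the division of labour the hybrid approach exploits.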
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Natalio Krasnogor
These slides are part of a presentation I gave in March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web applications for microarray and gene/protein set analysis,
ArrayMining.net and TopoGSA. These use ensemble and consensus methods, as well as
modular combinations of different analysis techniques, for an integrative view of
(microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network
topology. As an example of these integrative techniques, we use a microarray
consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net
Class Discovery Analysis module, and show how this approach can be combined in a modular
fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset from the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised
microarray feature selection analysis on ArrayMining.net can be investigated in further detail with
TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped on a
comprehensive human protein-protein interaction network. I discuss results from a TopoGSA
analysis of the complete set of genes currently known to be mutated in cancer.
Classification of medical datasets using back propagation neural network powe...IJECEIAES
Classification is one of the most indispensable domains in data mining and machine learning. The classification process has a good reputation in the area of disease diagnosis by computer systems, where progress in smart computer technologies can be invested in diagnosing various diseases based on real patient data documented in databases. The paper introduces a methodology for diagnosing a set of diseases covering two types of cancer (breast and lung), two diabetes datasets and a heart attack dataset. A back-propagation neural network plays the role of classifier, and its performance is enhanced by a genetic algorithm that provides the classifier with the optimal features to raise the classification rate as high as possible. The system showed high efficiency in dealing with databases that differ from each other in size, number of features and nature of the data, as the results illustrate: the classification rate reached 100% on most datasets.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...ijitcs
Sequencing projects arising from high-throughput technologies, including DNA microarray sequencing, allow the expression levels of millions of genes of a biological sample to be measured simultaneously, and those genes to be annotated and their roles (functions) identified. Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more relevant integration of the data in order to test and validate researchers' hypotheses. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
Data mining is indispensable for business organizations: it extracts, from huge volumes of stored data, useful information that can be used in managerial decision making to survive in the competition. Due to day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high-dimensional, and data mining prefers small datasets to high-dimensional ones. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations by feature selection should be the same or similar even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic in the research community. Selection stability has been quantified by various measures. This paper analyses the choice of a suitable search method and stability measure for feature selection algorithms, and also the influence of the characteristics of the dataset, since the best approach is highly problem-dependent.
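Selection stability as described above can be quantified by comparing the subsets chosen on perturbed copies of the data; the average pairwise Jaccard index is one common measure (the paper surveys several). The subsets below are illustrative.

```python
# Selection stability via the average pairwise Jaccard similarity of the
# feature subsets chosen across repeated runs. Subsets are toy values.
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stability(subsets):
    """Average pairwise Jaccard similarity over all runs."""
    pairs = [(subsets[i], subsets[j])
             for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

stable_runs   = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]   # near-identical subsets
unstable_runs = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]   # disjoint subsets
print(stability(stable_runs), stability(unstable_runs))
```

A stability near 1 means the selector is robust to perturbation; near 0, the selected "important" features are essentially arbitrary.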
Improved feature selection using a hybrid side-blotched lizard algorithm and ...IJECEIAES
Feature selection entails choosing, from a wide collection of original features, the significant features that are essential for predicting test data with a classifier. It is commonly used in applications such as bioinformatics, data mining, and the analysis of written texts, where the dataset contains tens or hundreds of thousands of features, making such a large feature set difficult to analyze. Removing irrelevant features improves predictor performance, making it more accurate and cost-effective. In this research, a novel hybrid technique for feature selection is presented that aims to enhance classification accuracy: a hybrid binary version of the side-blotched lizard algorithm (SBLA) with a genetic algorithm (GA), namely SBLAGA, which combines the strengths of both algorithms. We use a sigmoid function to adapt the continuous variable values into binary ones, and evaluate the proposed algorithm on twenty-three standard benchmark datasets, with average classification accuracy, average number of selected features and average fitness value as the evaluation criteria. According to the experimental results, SBLAGA demonstrated superior performance to SBLA and GA with regard to these criteria. We further compare SBLAGA with four wrapper feature selection methods that are widely used in the literature, and find it to be more efficient.
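The sigmoid transfer step mentioned above, mapping a continuous search-agent position to a binary feature mask, is a standard trick in binary metaheuristics; this sketch uses hypothetical position values, not SBLAGA's update equations.

```python
# Sigmoid transfer: bit j of the feature mask is 1 with probability
# sigmoid(position[j]). Position values are illustrative.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng):
    """Stochastically convert a continuous position to a 0/1 feature mask."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in position]

rng = random.Random(42)
position = [6.0, -6.0, 0.0, 5.0]   # strongly on, strongly off, undecided, on
print(binarize(position, rng))
```

Large positive coordinates almost always select the feature, large negative ones almost never, and values near zero leave the bit to chance, which preserves exploration.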
Opposition Based Firefly Algorithm Optimized Feature Subset Selection Approac...mlaij
A huge amount of data is now available in the field of medicine that, when analysed, helps doctors diagnose diseases. Data mining techniques can be applied to these medical data to extract knowledge so that disease prediction becomes more accurate and easier. In this work, cardiotocogram (CTG) data are analysed using a support vector machine (SVM) for predicting fetal risk, and an opposition-based firefly algorithm (OBFA) is proposed to extract the relevant features that maximise the classification performance of the SVM. The obtained results show that the opposition-based firefly algorithm outperforms the standard firefly algorithm (FA).
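The "opposition" step that distinguishes OBFA from the standard firefly algorithm is simple: each candidate x in a bounded range [lo, hi] also spawns its opposite lo + hi - x, doubling the chance of starting near a good region. The values below are illustrative.

```python
# Opposition-based learning: the opposite of each coordinate x in
# [lo, hi] is lo + hi - x. Candidate values are toy numbers.
def opposite(candidate, lo, hi):
    """Component-wise opposite point of a candidate within [lo, hi]."""
    return [lo + hi - x for x in candidate]

x = [0.25, 0.75, 0.5]
print(opposite(x, 0.0, 1.0))  # → [0.75, 0.25, 0.5]
```

In OBFA-style methods both the candidate and its opposite are evaluated, and the fitter of the two is kept in the population.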
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...ijsc
Data mining refers to the process of retrieving knowledge by discovering novel and relevant patterns in large datasets. Clustering and classification are two distinct phases in data mining that provide an established, proven structure from a voluminous collection of facts, and disease prediction and malady categorization are dominant areas of modern-day medical research. In this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering techniques and to compare the performance of classification algorithms on the clinical data. Feature selection is a supervised method that attempts to select a subset of the predictor features based on information gain. The Lymphography dataset comprises 18 predictor attributes and 148 instances, with a class label taking four distinct values. This paper reports the accuracy of eight clustering algorithms in detecting clusters of patient records and predictor attributes, and the performance of sixteen classification algorithms on the Lymphography dataset for accurate multi-class categorization of medical data. Our work shows that the Random Tree algorithm and Quinlan's C4.5 algorithm give 100 percent classification accuracy both with all the predictor features and with the feature subset selected by the Fisher Filtering feature selection algorithm, and that the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm offers increased clustering accuracy in less computation time.
Similar to Effect of Feature Selection on Gene Expression Datasets Classification Accuracy (20)
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
Fossil fuel consumption has increased quickly, contributing to climate
change that is evident in unusual flooding, droughts, and global warming.
Over the past ten years, women's involvement in society has grown
dramatically, and they have succeeded in playing a noticeable role in
reducing climate change. A bibliometric analysis of data from the last
ten years has been carried out to examine the role of women in addressing
climate change. The analysis's findings are discussed in relation to the
sustainable development goals (SDGs), particularly SDG 7 and SDG 13. The
results consider contributions made by women in various sectors while
taking geographic dispersion into account. The bibliometric analysis
delves into topics including women's leadership in environmental groups,
their involvement in policymaking, their contributions to sustainable
development projects, and the influence of gender diversity on attempts
to mitigate climate change. The study's results highlight how women have
influenced policies and actions related to climate change, point out
areas of research deficiency, and offer recommendations on how to
increase the role of women in addressing climate change and achieving
sustainability. To achieve more successful results, this initiative aims
to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition from grid-connected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their
Int J Elec & Comp Eng, ISSN: 2088-8708
Effect of Feature Selection on Gene Expression Datasets (Hicham Omara)
the redundancy in each class using mutual information (MI) measures. ReliefF is also widely used with cancer microarray data [13]; it detects features that are statistically relevant to the target concept.
Wrapper methods use a classification algorithm to evaluate which features are useful; that is, features are selected with the classification algorithm taken into account [14]. Many studies have applied wrapper selectors. Gheyas and Smith proposed a method named simulated annealing genetic algorithm (SAGA), which incorporates existing wrapper methods into a single solution [15]. Huerta et al. proposed the LDA-based genetic algorithm (LDA-GA) [16], which applies a t-statistic filter to retain a group of p top-ranking genes and then runs the LDA-based GA. Tang et al. proposed the leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, which combines the leave-one-out calculation measure with the sequential forward selection scheme [17]. The genetic algorithm-support vector machine (GA-SVM) of Perez and Marwala creates a population of chromosomes, binary strings representing feature subsets, which are evaluated using SVMs [18].
The third family of feature selection approaches is embedded methods, which take advantage of the two previous models by using their different evaluation criteria at different search stages [19]. Among the most widely applied embedded techniques is support vector machine recursive feature elimination (SVM-RFE), proposed for gene selection and cancer classification by Guyon et al. [20]. Maldonado et al. proposed an embedded approach called kernel-penalized SVM (KP-SVM), which introduces a penalty factor in the dual formulation of the SVM [21]. Mundra et al. hybridized two of the most popular feature selection approaches, SVM-RFE and mRMR [22]. Chuang et al. proposed a hybrid approach that combines correlation-based feature selection (CFS) with the Taguchi genetic algorithm (TGA), using KNN as the classifier with leave-one-out cross-validation (LOOCV) [23]. Lee and Liu [24] proposed an approach called genetic algorithm dynamic parameter (GADP), which generates every possible subset of genes and ranks the genes by their occurrence frequency.
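To make the elimination loop concrete, the following is a minimal numpy sketch in the spirit of SVM-RFE: fit a linear model, drop the feature with the smallest absolute weight, and refit until the desired number of features remains. As a simplifying assumption, a ridge-regularized least-squares classifier stands in for the linear SVM, so this illustrates the recursive-elimination idea rather than the exact procedure of Guyon et al. [20].

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Recursive feature elimination in the spirit of SVM-RFE.

    At each step a linear model is fit on the surviving features and the
    feature with the smallest absolute weight is discarded. NOTE: a ridge
    least-squares classifier replaces the linear SVM here purely to keep
    the sketch dependency-free."""
    features = list(range(X.shape[1]))
    y = np.where(y > 0, 1.0, -1.0)            # labels in {-1, +1}
    lam = 1e-2                                # ridge regularization
    while len(features) > n_keep:
        Xs = X[:, features]
        # closed-form ridge solution: w = (Xs'Xs + lam*I)^-1 Xs'y
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(features)),
                            Xs.T @ y)
        weakest = int(np.argmin(np.abs(w)))   # least useful surviving feature
        features.pop(weakest)
    return features

# Toy data: only features 0 and 1 carry the class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
selected = rfe_linear(X, y, n_keep=2)
print(sorted(selected))                       # the informative pair survives
```

The same loop applies unchanged to a microarray matrix with genes as columns; SVM-RFE proper simply swaps the ridge fit for a linear SVM and re-ranks the weights after every elimination.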
This paper therefore presents a review of widely used feature selection techniques, focusing on cancer classification. Other tasks related to microarray data analysis, such as handling missing values, normalization, and discretization, are also addressed, and commonly used classification methods are discussed. This study evaluated three filter algorithms (random forest, information gain, and chi-squared) on three cancer datasets, and assessed their effect on four classification algorithms: SOM, KNN, k-means, and random forest.
2. METHOD AND MATERIALS
2.1. General Background
Analysis of gene expression data is primarily based on the comparison of gene expression profiles. To do this, we need a measure quantifying the similarity between expression profiles; a variety of distance measures can be used, and this section describes the most common ones. The gene expression data from microarray experiments usually take the form of a large matrix $G_{(n+1)\times m}$ of expression levels of genes $g_1, g_2, \ldots, g_n$ under different experimental conditions $s_1, s_2, \ldots, s_m$, where the last row contains the label $Y$ of each sample, with values $y_j \in \{-1, 1\}$. Each element $G[i,j]$, denoted $g_{ij}$, represents the expression level of gene $g_i$ in sample $s_j$ (see Table 1). The expression profile of a gene $i$ can thus be represented as a row vector $g_i = (g_{i1}, g_{i2}, \ldots, g_{im})$:

$$G = \begin{pmatrix} g_1 \\ \vdots \\ g_n \\ Y \end{pmatrix} = \begin{pmatrix} g_{11} & \cdots & g_{1m} \\ \vdots & \ddots & \vdots \\ g_{n1} & \cdots & g_{nm} \\ y_1 & \cdots & y_m \end{pmatrix} \qquad (1)$$
Table 1. Microarray Dataset Example

Genes\Samples   S1       S2       S3      S4      …   Sm
g1              56.23    43.74    4.18    9.5     …   34.18
g2              33.54    30.5     4.71    32.18   …   43.71
g3              13       29.09    4.13    2.88    …   49.13
g4              64.25    70.24    76.1    31.4    …   36.91
⋮               ⋮        ⋮        ⋮       ⋮           ⋮
gn              3.54     0.5      40.71   2.99    …
Label: Y        Normal   Abnormal Normal  Abnormal …  Normal
ISSN: 2088-8708 | Int J Elec & Comp Eng, Vol. 8, No. 5, October 2018: 3194-3203
Pearson correlation coefficient: represented by the letter $\rho$, it can be obtained by substituting the covariance $\mathrm{cov}$ and the variances $\sigma^2$ estimated from a sample. For two genes $g_1$ and $g_2$, the formula for $\rho$ is:

$$\rho = \frac{\mathrm{cov}(g_1, g_2)}{\sqrt{\sigma^2(g_1)\,\sigma^2(g_2)}} = \frac{\sum_{i=1}^{n}(g_{1i}-\bar{g}_1)(g_{2i}-\bar{g}_2)}{\sqrt{\sum_{i=1}^{n}(g_{1i}-\bar{g}_1)^2}\,\sqrt{\sum_{i=1}^{n}(g_{2i}-\bar{g}_2)^2}} \qquad (2)$$

where $\bar{g}_1 = \frac{1}{n}\sum_{i=1}^{n} g_{1i}$ and $\bar{g}_2 = \frac{1}{n}\sum_{i=1}^{n} g_{2i}$ are the means of genes $g_1$ and $g_2$, respectively.
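As a concrete illustration of equation 2, a minimal sketch of the computation on two short, made-up profiles:

```python
import math

def pearson(g1, g2):
    """Pearson correlation between two gene expression profiles (equation 2)."""
    n = len(g1)
    m1 = sum(g1) / n          # mean of gene g1
    m2 = sum(g2) / n          # mean of gene g2
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in g1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in g2))
    return cov / (s1 * s2)

# Perfectly anti-correlated profiles give rho = -1.
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```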
Mutual information (MI): a distance measure that compares genes whose profiles are discrete. It can be calculated using Shannon's entropy and measures the dependency between two random variables based on their probability distributions. For two genes $g_1$ and $g_2$, the mutual information between them, $I(g_1, g_2)$, can be calculated as follows [25], [26]:

$$I(g_1, g_2) = H(g_1) - H(g_1|g_2) = H(g_2) - H(g_2|g_1) = H(g_1) + H(g_2) - H(g_1, g_2) \qquad (3)$$

where $H(g_1)$ and $H(g_2)$ are the Shannon entropies, expressed as:

$$H(g_1) = -\sum_{i=1}^{m} P(g_{1i}) \log_2 P(g_{1i})$$

$H(g_1, g_2)$ is the joint entropy of $g_1$ and $g_2$, defined as:

$$H(g_1, g_2) = -\sum_{i=1}^{m}\sum_{j=1}^{m} P(g_{1i}, g_{2j}) \log_2 P(g_{1i}, g_{2j})$$

and $H(g_2|g_1)$ is the conditional entropy of $g_2$ given $g_1$, calculated as:

$$H(g_2|g_1) = -\sum_{i=1}^{m}\sum_{j=1}^{m} P(g_{1i}, g_{2j}) \log_2 P(g_{2j}|g_{1i})$$

Note that $P(g_{1i})$ is the probability mass function; when gene $g_1$ is discrete it can be estimated as:

$$P(g_{1i}) = \frac{\text{number of instances with value } g_{1i}}{\text{total number of instances } (n)}$$

and $P(g_{1i}, g_{2j})$ is the joint probability mass function of genes $g_1$ and $g_2$.
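Equation 3 can be sketched directly from the entropy definitions; the two discrete toy profiles below are illustrative only:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) of a discrete profile, in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(g1, g2):
    """I(g1; g2) = H(g1) + H(g2) - H(g1, g2)  (equation 3)."""
    joint = list(zip(g1, g2))   # joint distribution of the two profiles
    return entropy(g1) + entropy(g2) - entropy(joint)

# A gene that is a deterministic function of another shares all its entropy:
g1 = ['low', 'low', 'high', 'high']
g2 = ['down', 'down', 'up', 'up']
print(mutual_information(g1, g2))  # 1.0 bit
```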
2.2. Feature Selection
The goal of feature selection is to select the smallest subset of features, typically by scoring all features and removing those that fall below a threshold. This process makes the classification problem easier to interpret and reduces model training time. Mathematically, for a feature set composed of all genes $G_f = \{g_{f1}, g_{f2}, \ldots, g_{fn}\}$, the feature selection process identifies a subset of features $S_f \subseteq G_f$ of dimension $k$, where $k \le n$. In this study, five filter algorithms are discussed: information gain, chi-squared, linear correlation, mRMR and random forest importance. Filter methods were chosen over wrapper methods because of the huge computational cost of wrappers [2].
Information gain (IG): a filter method that ranks features in decreasing order of information gain, i.e. by the value of their mutual information with the class label, computed using equation 3. Simplicity and low computational cost are the main advantages of this method. However, it does not take the dependencies between features into consideration; rather, it assumes independence, which is not always the case, so some of the selected features may carry redundant information.
Chi-squared ($Chi^2$): a statistical test of the dependency between two events, characterized by its simplicity of implementation (in feature selection, the two events are the occurrence of the feature and the occurrence of the class). The process consists of computing $Chi^2$ between every feature variable $g_{fi}$ and the label $Y$. If $Y$ is independent of $g_{fi}$, the feature is discarded; if they are dependent, the feature is kept in the training model [27]. The null hypothesis $H_0$ is the assumption that the two variables are uncorrelated, and it is tested with the $Chi^2$ formula:
Int J Elec & Comp Eng, ISSN: 2088-8708 | Effect of Feature Selection on Gene Expression Datasets …. (Hicham Omara)
$$Chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(g_{ij} - E_{ij})^2}{E_{ij}} \qquad (4)$$

where $g_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency under the null hypothesis, computed as:

$$E_{ij} = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$
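A minimal sketch of equation 4 on a hypothetical contingency table of observed counts (rows: discretized feature values, columns: class labels):

```python
def chi_squared(table):
    """Chi-square statistic (equation 4) from an r x c contingency table
    of observed counts g_ij between a discretized feature and the label Y."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # E_ij = row total * column total / n
            chi2 += (obs - exp) ** 2 / exp
    return chi2

print(chi_squared([[10, 10], [10, 10]]))  # 0.0  (feature independent of class)
print(chi_squared([[20, 0], [0, 20]]))    # 40.0 (feature predicts the class)
```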
A high value of $Chi^2$ indicates that the independence hypothesis is incorrect and that the feature is correlated with the class, so it should be selected for model training.
Linear correlation (Corr): a well-known similarity measure between two random variables, which can be calculated using the Pearson correlation coefficient $\rho$ defined in equation 2. The resulting value lies in $[-1, 1]$, with $-1$ meaning perfect negative correlation (as one variable increases, the other decreases), $+1$ meaning perfect positive correlation, and $0$ meaning no linear correlation between the two variables [28].
Minimum Redundancy-Maximum Relevancy (mRMR): the mRMR filter selects genes that have the highest relevance to the target class while being minimally redundant with each other [8], [29], using the mutual information of equation 3. The Maximum Relevance criterion selects the top $k$ genes with the highest relevance to the class labels, i.e. the largest values in the descending-ordered set of $I(g_i, Y)$ (equation 5). The Minimum Redundancy criterion was introduced in [14] to remove redundant features; it is defined by equation 6:

$$\max\left[\frac{1}{|G_f|}\sum_{g_i \in G_f} I(g_i; Y)\right] \qquad (5)$$

$$\min\left[\frac{1}{|G_f|^2}\sum_{g_i, g_j \in G_f} I(g_i; g_j)\right] \qquad (6)$$

The mRMR filter takes the mutual information between each pair of genes into consideration and combines the optimization criteria of equations 5 and 6.
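A minimal greedy sketch of the combined criterion, in the common "difference" form of mRMR (relevance minus mean redundancy, both measured with the discrete mutual information of equation 3); the toy genes and labels are hypothetical:

```python
import math
from collections import Counter

def mi(x, y):
    """Discrete mutual information I(x; y) in bits (equation 3)."""
    def h(v):
        n = len(v)
        return -sum((c / n) * math.log2(c / n) for c in Counter(v).values())
    return h(x) + h(y) - h(list(zip(x, y)))

def mrmr(genes, labels, k):
    """Greedy mRMR: start from the most relevant gene, then repeatedly add the
    gene maximizing relevance to the labels (eq. 5) minus mean redundancy with
    the already selected genes (eq. 6)."""
    remaining = sorted(genes)                 # deterministic candidate order
    selected = [max(remaining, key=lambda g: mi(genes[g], labels))]
    remaining.remove(selected[0])
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda g: mi(genes[g], labels)
                   - sum(mi(genes[g], genes[s]) for s in selected) / len(selected))
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [1, 1, 1, 0, 0, 0]
genes = {'g1': [1, 1, 1, 0, 0, 1],   # relevant
         'g2': [1, 1, 1, 0, 0, 1],   # exact copy of g1: relevant but redundant
         'g3': [1, 0, 1, 0, 1, 0]}   # weakly relevant, not redundant with g1
print(mrmr(genes, labels, 2))  # ['g1', 'g3'] -- the redundant copy g2 is skipped
```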
2.3. Classifiers
This part gives a brief description of the classifier algorithms commonly used for the classification task. Table 2 shows the parameters used for each classifier.

Table 2. Parameters of the Classifiers

Classifier      Parameters
K-means         K = 2 to 9; distance = Euclidean
KNN             distance = Euclidean; number of nearest neighbours = 5; kernel = rectangular
SOM             number of input neurons = 10×10; learning rate = 0.9; radius = 20; distance metric = Euclidean; initialization = random; number of iterations = 1000
Random Forest   number of trees = 500; number of variables tried at each split = 10
K-means: a clustering (unsupervised classification) algorithm that divides observations into $K$ clusters [30]-[32]. It can be adapted to the supervised case by dividing the data into a number of clusters equal to or greater than the number of classes. It takes a set $S$ of $m$ samples and the number of clusters $K$ as input, and outputs a set $C = \{c_1, c_2, \ldots, c_K\}$ of $K$ centroids. The algorithm starts by initializing all centroids randomly; it then iterates between two steps until a stopping criterion is met (often, the maximum number of iterations is reached). In the first step, each sample $s_i$ is assigned to its nearest centroid $c_k$, based on the distance between $s_i$ and $c_k$:

$$\underset{c_k \in C}{\arg\min}\; D_{euc}(c_k, s_i)^2 \qquad (7)$$

generating a set $S_k^c$ of samples assigned to the $k$-th cluster centroid. In the second step, each centroid $c_k$ is updated to the mean of all samples assigned to its set $S_k^c$:

$$c_k = \frac{1}{|S_k^c|} \sum_{s_j \in S_k^c} s_j \qquad (8)$$
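The two alternating steps can be sketched as follows; the four 2-D toy samples are illustrative, and the stopping criterion here is simply a fixed number of iterations:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(samples, k, iterations=100, seed=0):
    """Plain K-means: assignment step (equation 7), then update step (equation 8)."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)              # random initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:                           # assign s to nearest centroid (eq. 7)
            nearest = min(range(k), key=lambda i: euclidean(centroids[i], s))
            clusters[nearest].append(s)
        for i, members in enumerate(clusters):      # move centroid to cluster mean (eq. 8)
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

samples = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 9.9)]
centroids, _ = kmeans(samples, 2)
print(sorted(centroids))  # one centroid near (0.1, 0.05), one near (9.9, 9.95)
```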
Self-organizing maps (SOM): SOM is commonly used for visualizing and clustering multidimensional data, thanks to its ability to project high-dimensional data into a lower dimension [33]-[37]. A SOM typically consists of a regular grid of map units. Each unit is represented by a vector $W_j = (W_{j1}, W_{j2}, \ldots, W_{jm})$, where $m$ is the input sample dimension, and units are connected to adjacent ones by a neighbourhood relation. The SOM is trained iteratively. At each training step, a sample $S$ is randomly chosen from the input data set, and a distance is computed to all weight vectors $W_j$ to find the reference vector $W_{bmu}$ (the Best Matching Unit, BMU) that satisfies the minimum-distance (maximum-similarity) criterion of equation 9:

$$bmu(t) = \underset{1 \le i \le n}{\arg\min}\; \|S(t) - W_i(t)\| \qquad (9)$$

where $n$ is the number of neurons in the map. The weights of the BMU and its neighbours are then adjusted towards the input pattern:

$$W_i(t+1) = W_i(t) + \beta_{bmu,i}(t)\,\bigl(S(t) - W_i(t)\bigr) \qquad (10)$$

where $\beta_{bmu,i}$ is the neighbourhood function between the winning neuron $bmu$ and neighbour neuron $i$, defined by equation 11:

$$\beta_{bmu,i}(t) = \exp\left(-\frac{\|r_{bmu} - r_i\|^2}{2\sigma^2(t)}\right) \qquad (11)$$

where $r_{bmu}$ and $r_i$ are the positions of the BMU and of neuron $i$ on the Kohonen topological map, and $\sigma(t)$ decreases monotonically with time.
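A single training step can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it adds an explicit learning rate alpha (which equation 10 folds into the neighbourhood function) and keeps sigma fixed, whereas during training sigma would decay over time:

```python
import math

def som_step(weights, sample, sigma=1.0, alpha=0.5):
    """One SOM training step: find the BMU (eq. 9), then pull every unit toward
    the sample, weighted by the Gaussian neighbourhood of eq. 11 via the update
    of eq. 10. `weights` maps a grid position (row, col) to that unit's weight
    vector."""
    bmu = min(weights, key=lambda pos: math.dist(weights[pos], sample))   # eq. 9
    for pos, w in weights.items():
        grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        beta = math.exp(-grid_d2 / (2 * sigma ** 2))                      # eq. 11
        weights[pos] = [wi + alpha * beta * (si - wi)                     # eq. 10
                        for wi, si in zip(w, sample)]
    return bmu

# A 2x2 map of 2-dimensional units.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
print(som_step(weights, [0.9, 0.9]))  # (1, 1) -- the closest unit wins
```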
K nearest neighbours (k-NN): a non-parametric method used for classification [38]-[40]. The process begins by calculating the distance $D_{euc}(s_j, s_i)$ between the test sample $s_j$ and each training sample $s_i$, and sorting the distances in ascending order. It then selects the $k$ closest neighbours of sample $s_j$ and predicts its class by majority voting: the class that occurs most frequently among the nearest neighbours wins.
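The majority-vote procedure can be sketched in a few lines; the toy training set and its 'P'/'N' labels are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, test_sample, k=5):
    """Majority-vote k-NN: sort training samples by Euclidean distance to the
    test sample, keep the k closest, and return the most frequent label."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], test_sample))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'N'), ((0, 1), 'N'), ((1, 0), 'N'),
         ((9, 9), 'P'), ((9, 8), 'P')]
print(knn_predict(train, (8, 9), k=3))  # 'P'
```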
Random forest (RF): can be thought of as a form of nearest-neighbour predictor. It builds a set of decision trees from randomly selected subsets of the original training set, and aggregates the votes of the different trees to decide the final class of a test object. It is considered well suited to situations characterized by a large number of features [41]-[43].
2.4. Dataset Description
In this study, the following published datasets were used (a brief description is given in Table 3). The first one is the ALL/AML leukemia dataset proposed by Golub et al. in 1999 [3]; it contains 7129 genes and 72 samples split into two classes, and was used to classify patients with acute myeloid leukemia (labelled AML, 25 examples, 34.7%) and acute lymphoblastic leukemia (labelled ALL, 47 examples, 65.3%). The second dataset is the colon cancer dataset [44], which contains 62 samples: 40 biopsies are from tumours (labelled "P") and 22 (labelled "N") are from healthy parts of the colons of the same patients; the total number of genes to be tested is 2000. The third dataset is the lymphoma cancer dataset [45]; it includes 45 tissues and 4026 genes. The first category, Germinal Centre B-Like (labelled GCL), has 23 patients, and the second, Activated B-Like (labelled ACL), has 22; the problem is to distinguish the GCL samples from the ACL samples. This dataset contains about 3.28% missing values.
Before applying any learning algorithm, the data must be pre-processed: missing-value imputation, noisy data elimination, and normalization. Missing values: in general, a dataset contains missing values arising for a variety of reasons, including hybridization failures, artifacts on the microarray, insufficient resolution, and image noise and corruption; they may also occur systematically as a result of the spotting process. There are many techniques to handle missing values, such as omitting the entire record that contains a missing value, or imputing it by the median, the mean, or K-NN [46]. Data normalization: some algorithms, such as K-means and K-NN, may require the data to be normalized to increase both the efficacy and the efficiency of the algorithm. Normalization prevents distance measures from being distorted by attributes on different scales, placing all attributes within a similar range, usually [0, 1] [47]. Data discretization: discretization is the process of converting continuous variables into nominal ones. Studies have shown that discretization makes learning algorithms more accurate and faster [48]. The process can be done manually or by predefining thresholds on which to divide the data [49]-[51]. In this study, the percentage of missing values in our datasets is less than 5%, which leads us to impute the missing values by the mean; all data were then normalized, and the gene expression values were used directly as input features for the classifiers. The framework of our process is described in Figure 1.
Figure 1. Framework used in this research
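A minimal sketch of the two pre-processing steps applied here, mean imputation and [0, 1] normalization, with `None` standing in for a missing microarray value:

```python
def impute_mean(profile):
    """Replace missing values (None) in a gene profile by the profile mean."""
    observed = [v for v in profile if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in profile]

def min_max_normalize(profile):
    """Rescale a gene profile to the range [0, 1]."""
    lo, hi = min(profile), max(profile)
    if hi == lo:                      # constant profile: map everything to 0
        return [0.0] * len(profile)
    return [(v - lo) / (hi - lo) for v in profile]

g = impute_mean([2.0, None, 6.0, 4.0])   # missing value replaced by the mean 4.0
print(min_max_normalize(g))              # [0.0, 0.5, 1.0, 0.5]
```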
3. RESULTS AND ANALYSIS
In this study, five feature selectors were tested on four different classifiers using three gene expression datasets, labelled Leukemia, Colon and Lymphoma (a short description is given in Table 3). Classification accuracies before and after feature selection are presented in Table 4. The columns named ALL, RFS, IG, Chi-2, Corr and mRMR give the classification accuracy using all features, the Random Forest Selector, Information Gain, chi-squared, linear correlation and Minimum Redundancy Maximum Relevance filters, respectively.
Table 3. A Brief Summary of the Datasets Used

Dataset     No. of examples   No. of features   Class 1    Class 2
Leukemia    72                7129              47 (ALL)   25 (AML)
Colon       62                2000              40 (P)     22 (N)
Lymphoma    47                4026              22 (ACL)   23 (GCL)
Table 4. Effects of Feature Selection on Classifiers Using 100 Important Features (classification accuracy, %)

Classifier      Dataset     ALL     RFS     IG      Chi-2   Corr    mRMR
K-NN            Leukemia    89.28   96.43   92.86   95.24   94.29   100
                Colon       78.01   87.22   86.82   87.77   88.88   85.63
                Lymphoma    93.33   100     100     100     100     100
K-means         Leukemia    84.72   98.61   98.61   97.22   97.22   98.61
                Colon       79.03   87.09   88.70   88.70   88.70   90.32
                Lymphoma    82.22   93.33   97.7    100     100     100
SOM             Leukemia    93.05   91.66   94.44   87.5    95.83   94.44
                Colon       88.70   98.38   96.77   93.54   96.77   95.16
                Lymphoma    87.68   88.88   95.55   93.33   95.55   97.77
Random Forest   Leukemia    97.11   98.55   97.76   98.63   97.13   97.13
                Colon       83.39   88.40   86.50   88.73   86.82   85.06
                Lymphoma    90.74   98.33   95.16   93.05   93.16   100
The k-Nearest Neighbour (k-NN), Self-Organizing Map (SOM), K-means and Random Forest algorithms were used as classifiers in the experiments, and the accuracies obtained with the five filters (Random Forest Selector, Information Gain, chi-squared, linear correlation and Minimum Redundancy Maximum Relevance) when the top 100 features are selected were compared. Filters were chosen because of the enormous size of the datasets used, which makes more expensive search strategies impractical. For the k-NN classifier, we used the Euclidean distance as the distance metric and the best k between 2 and 9, and likewise for K-means. For SOM, we used the following parameters: 10×10 input neurons, a learning rate of 0.9, the Euclidean distance metric, random initialization of all neurons, and 1000 iterations. For Random Forest, we used 500 trees with 10 variables tried at each split. A summary is given in Table 2.
The results in Table 4 and in Figures 2, 3, 4 and 5 show a very important effect of feature selection on the classification rate (using the top 100 features in this experiment). From the table we can observe that mRMR and RFS are slightly better than the other methods on the Leukemia dataset when used with K-NN and K-means, with a large improvement over using all variables for classification. For the Lymphoma dataset, all the selectors work very well with all classifiers, with the exception of SOM, which performs best with mRMR and RFS; even so, there is an improvement over using ALL features. For the Colon dataset, the classification rate is lower in all cases, with an improvement when using SOM as the classifier and RFS as the filter.
Figure 2. Effects of feature selection on KNN using 100 important features
Figure 3. Effects of feature selection on K-means using 100 important features
Figure 4. Effects of feature selection on SOM using 100 important features
4. CONCLUSION
Feature selection is an important issue in classification because it can have a considerable effect on classifier accuracy. It reduces the dimensionality of the dataset, thereby reducing processor and memory usage, and makes the data more comprehensible and easier to study. In this study we investigated the influence of feature selection on four classifiers (SOM, K-NN, K-means and Random Forest) using three datasets. By using only the top 100 features, classification accuracy improved by up to 9% compared to using all features, while complexity and training time were reduced.
REFERENCES
[1] D. Devaraj, B. Yegnanarayana, and K. Ramar, “Radial basis function networks for fast contingency ranking,” Int. J.
Electr. Power Energy Syst., vol. 24, pp. 387–393, Jun. 2002.
[2] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of Discrimination Methods for the Classification of Tumors
Using Gene Expression Data,” J. Am. Stat. Assoc., vol. 97, no. 457, pp. 77–87, Mar. 2002.
[3] T. R. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring,” Science, vol. 286, no. 5439, pp. 531–537, Oct. 1999.
[4] L. Li, C. R. Weinberg, T. A. Darden, and L. G. Pedersen, “Gene selection for sample classification based on gene
expression data: study of sensitivity to choice of parameters of the GA/KNN method,” Bioinforma. Oxf. Engl., vol.
17, no. 12, pp. 1131–1142, Dec. 2001.
[5] A. Narayanan, E. C. Keedwell, J. Gamalielsson, and S. Tatineni, “Single-layer Artificial Neural Networks for Gene
Expression Analysis,” Neurocomput., vol. 61, no. C, pp. 217–240, Oct. 2004.
[6] A. Arauzo-Azofra, J. L. Aznarte, and J. M. Benítez, “Empirical Study of Feature Selection Methods Based on
Individual Feature Evaluation for Classification Problems,” Expert Syst Appl, vol. 38, no. 7, pp. 8170–8177, Jul.
2011.
[7] A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artif. Intell., vol.
97, no. 1, pp. 245–271, Dec. 1997.
[8] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” J.
Bioinform. Comput. Biol., vol. 3, no. 2, pp. 185–205, Apr. 2005.
[9] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” J. Mach. Learn. Res., vol. 3, no.
Mar, pp. 1157–1182, 2003.
[10] R. Ruiz, J. Riquelme, J. Aguilar-Ruiz, and M. Garcia Torres, “Fast feature selection aimed at high–dimensional
data via hybrid–sequential–ranked searches,” Expert Syst. Appl., vol. 39, pp. 11094–11102, Sep. 2012.
[11] A. Sharma, S. Imoto, and S. Miyano, “A top-r feature selection algorithm for microarray gene expression data,”
IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 9, no. 3, pp. 754–764, Jun. 2012.
[12] Z. Wang, V. Palade, and Y. Xu, “Neuro-Fuzzy Ensemble Approach for Microarray Cancer Gene Expression Data
Analysis,” in 2006 International Symposium on Evolving Fuzzy Systems, 2006, pp. 241–246.
[13] K. Kira and L. A. Rendell, “A Practical Approach to Feature Selection,” in Proceedings of the Ninth International
Workshop on Machine Learning, San Francisco, CA, USA, 1992, pp. 249–256.
[14] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, no. 1, pp. 273–324, Dec.
1997.
[15] I. Gheyas and L. Smith, “Feature subset selection in large dimensionality domains,” Pattern Recognit., vol. 43, pp.
5–13, Jan. 2010.
[16] E. B. Huerta, B. Duval, and J.-K. Hao, “Gene Selection for Microarray Data by a LDA-Based Genetic Algorithm,”
in Pattern Recognition in Bioinformatics, 2008, pp. 250–261.
[17] E. K. Tang, P. Suganthan, and X. Yao, “Gene selection algorithms for microarray data based on least squares
support vector machine,” BMC Bioinformatics, vol. 7, p. 95, Feb. 2006.
[18] M. Perez and T. Marwala, “Microarray data feature selection using hybrid genetic algorithm simulated annealing,”
in 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012, pp. 1–5.
[19] J. Canul-Reich, L. O. Hall, D. B. Goldgof, J. N. Korecki, and S. Eschrich, “Iterative feature perturbation as a gene
selector for microarray data,” Int. J. Pattern Recognit. Artif. Intell., vol. 26, no. 05, p. 1260003, Aug. 2012.
[20] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification using Support Vector
Machines,” Mach. Learn., vol. 46, no. 1–3, pp. 389–422, Jan. 2002.
[21] S. Maldonado, R. Weber, and J. Basak, “Simultaneous Feature Selection and Classification Using Kernel-penalized
Support Vector Machines,” Inf Sci, vol. 181, no. 1, pp. 115–128, Jan. 2011.
[22] P. A. Mundra and J. C. Rajapakse, “SVM-RFE With MRMR Filter for Gene Selection,” IEEE Trans.
NanoBioscience, vol. 9, no. 1, pp. 31–37, Mar. 2010.
[23] L.-Y. Chuang, C.-H. Yang, K.-C. Wu, and C.-H. Yang, “A Hybrid Feature Selection Method for DNA Microarray
Data,” Comput Biol Med, vol. 41, no. 4, pp. 228–237, Apr. 2011.
[24] C.-P. Lee and Y. Leu, “A novel hybrid feature selection method for microarray data analysis,” Appl. Soft Comput.,
vol. 11, no. 1, pp. 208–213, Jan. 2011.
[25] M. Bennasar, Y. Hicks, and R. Setchi, “Feature selection using Joint Mutual Information Maximisation,” Expert
Syst. Appl., vol. 42, no. 22, pp. 8520–8532, Dec. 2015.
[26] C. E. Shannon, “A Mathematical Theory of Communication,” SIGMOBILE Mob Comput Commun Rev, vol. 5,
no. 1, pp. 3–55, Jan. 2001.
[27] H. Zhang et al., “Informative Gene Selection and Direct Classification of Tumor Based on Chi-Square Test of
Pairwise Gene Interactions,” BioMed Res. Int., vol. 2014, p. e589290, Jul. 2014.
[28] H. F. Eid, A. E. Hassanien, T. Kim, and S. Banerjee, “Linear Correlation-Based Feature Selection for Network
Intrusion Detection Model,” in Advances in Security of Information and Communication Networks, Springer,
Berlin, Heidelberg, 2013, pp. 240–248.
[29] S. S. Shreem, S. Abdullah, M. Z. A. Nazri, and M. Alzaqebah, “Hybridizing relieff, mRMR filters and GA wrapper
approaches for gene selection,” J. Theor. Appl. Inf. Technol., vol. 46, no. 2, pp. 1034–1039, 2012.
[30] F.-X. Wu, W. J. Zhang, and A. J. Kusalik, “A Genetic K-means Clustering Algorithm Applied to Gene Expression
Data,” in Advances in Artificial Intelligence, 2003, pp. 520–526.
[31] K. R. Nirmal and K. V. V. Satyanarayana, “Issues of K Means Clustering While Migrating to Map Reduce
Paradigm with Big Data: A Survey,” Int. J. Electr. Comput. Eng. IJECE, vol. 6, no. 6, pp. 3047–3051, Dec. 2016.
[32] W. K. Oleiwi, “Using the Fuzzy Logic to Find Optimal Centers of Clusters of K-means,” Int. J. Electr. Comput.
Eng. IJECE, vol. 6, no. 6, pp. 3068–3072, Dec. 2016.
[33] C. Budayan, I. Dikmen, and M. T. Birgonul, “Comparing the performance of traditional cluster analysis, self-
organizing maps and fuzzy C-means method for strategic grouping,” Expert Syst. Appl., vol. 36, no. 9, pp. 11772–
11781, Nov. 2009.
[34] I. Valova, G. Georgiev, N. Gueorguieva, and J. Olson, “Initialization Issues in Self-organizing Maps,” Procedia
Comput. Sci., vol. 20, pp. 52–57, Jan. 2013.
[35] M. Ettaouil and M. Lazaar, “Vector Quantization by Improved Kohonen Algorithm,” J. Comput., vol. 4, no. 6, Jun.
2012.
[36] T. Kohonen, “Essentials of the self-organizing map,” Neural Netw., vol. 37, pp. 52–65, Jan. 2013.
[37] S. Pavel and K. Olga, “Visual analysis of self-organizing maps,” Nonlinear Anal Model Control, vol. 16, no. 4, pp.
488–504, Dec. 2011.
[38] T. Cover and P. Hart, “Nearest Neighbor Pattern Classification,” IEEE Trans Inf Theor, vol. 13, no. 1, pp. 21–27,
Sep. 2006.
[39] R. M. Parry et al., “k-Nearest neighbor models for microarray gene expression analysis and clinical outcome
prediction,” Pharmacogenomics J., vol. 10, no. 4, pp. 292–309, Aug. 2010.
[40] A. Alalousi, R. Razif, M. AbuAlhaj, M. Anbar, and S. Nizam, “A Preliminary Performance Evaluation of K-means,
KNN and EM Unsupervised Machine Learning Methods for Network Flow Classification,” Int. J. Electr. Comput.
Eng. IJECE, vol. 6, no. 2, pp. 778–784, Apr. 2016.
[41] D. Amaratunga, J. Cabrera, and Y.-S. Lee, “Enriched random forests,” Bioinformatics, vol. 24, no. 18, pp. 2010–
2014, Sep. 2008.
[42] L. Breiman, “Random Forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[43] X. Chen and H. Ishwaran, “Random Forests for Genomic Data Analysis,” Genomics, vol. 99, no. 6, pp. 323–329,
Jun. 2012.
[44] U. Alon et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues
probed by oligonucleotide arrays,” Proc. Natl. Acad. Sci. U. S. A., vol. 96, no. 12, pp. 6745–6750, Jun. 1999.
[45] A. A. Alizadeh et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,”
Nature, vol. 403, no. 6769, pp. 503–511, Feb. 2000.
[46] E. Acuna and C. Rodriguez, “The Treatment of Missing Values and its Effect on Classifier Accuracy,” in Journal of
Classification, 2004, pp. 639–647.
[47] Y. K. Jain and S. K. Bhandare, “Min Max Normalization based data Perturbation Method for Privacy Protection,”
International Journal of Computer & Communication Technology (IJCCT), vol. 2, no. 8, pp. 45–50, 2011.
[48] J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and Unsupervised Discretization of Continuous Features,” in
Machine Learning: Proceedings of the Twelfth International Conference, 1995, pp. 194–202.
[49] H. Liu, F. Hussain, C. L. Tan, and M. Dash, “Discretization: An Enabling Technique,” Data Min. Knowl. Discov.,
vol. 6, no. 4, pp. 393–423, Oct. 2002.
[50] P. E. Meyer, F. Lafitte, and G. Bontempi, “minet: A R/Bioconductor package for inferring large transcriptional
networks using mutual information,” BMC Bioinformatics, vol. 9, p. 461, Oct. 2008.
[51] Y. Yang and G. I. Webb, “On Why Discretization Works for Naive-Bayes Classifiers,” in AI 2003: Advances in
Artificial Intelligence, 2003, pp. 440–452.