This document evaluates several supervised machine learning algorithms for classifying gene expression data from microarray experiments. It analyzes two gene expression datasets, the leukemia and DLBCL datasets, using k-nearest neighbors, naive Bayes, decision trees, and support vector machines, with and without feature selection. The results show that support vector machines achieved the best overall performance and that feature selection improved the accuracy of all the algorithms.
Machine Learning Based Approaches for Cancer Classification Using Gene Expres... (mlaij)
The classification of different types of tumors is of great importance in cancer diagnosis and drug discovery. Earlier studies on cancer classification had limited diagnostic ability. The recent development of DNA microarray technology has made it possible to monitor the expression of thousands of genes simultaneously. Using this abundance of gene expression data, researchers are exploring the possibilities of cancer classification. A number of methods have been proposed with good results, but many issues still need to be addressed. This paper presents an overview of various cancer classification methods and evaluates them based on their classification accuracy, computational time, and ability to reveal gene information. We have also evaluated and introduced various proposed gene selection methods. Several issues related to cancer classification are also discussed.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif... (ahmad abdelhafeez)
Abstract- The goal of this paper is to compare different classifiers, and their fusion into multi-classifiers, with respect to accuracy in detecting breast cancer on four different datasets. We present an implementation of the best-known classification techniques in this field on four breast cancer datasets: two for diagnosis and two for prognosis. We fuse classifiers to find the best multi-classifier fusion approach for each dataset individually. Classification accuracy is obtained from a confusion matrix built within a 10-fold cross-validation procedure, and fusion uses majority voting (the mode of the classifier outputs). The experimental results show that no classification technique is better than the others across all datasets, since the classification task is affected by the type of dataset. Using multi-classifier fusion, accuracy improved on three of the four datasets.
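The fusion scheme this abstract describes (10-fold cross-validated predictions, a confusion matrix, and majority voting over classifier outputs) can be sketched with scikit-learn. This is an illustrative sketch, not the paper's implementation: sklearn's built-in Wisconsin breast cancer data stands in for the four datasets, and the three base classifiers are arbitrary common choices.

```python
# Sketch: majority-voting fusion of three classifiers, evaluated with a
# confusion matrix built from 10-fold cross-validated predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hard voting returns the mode of the individual classifier outputs.
fusion = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)

y_pred = cross_val_predict(fusion, X, y, cv=10)  # out-of-fold predictions
cm = confusion_matrix(y, y_pred)
acc = accuracy_score(y, y_pred)
print(cm)
print(f"10-fold CV accuracy: {acc:.3f}")
```

Swapping the estimator list per dataset mirrors the paper's finding that the best fusion differs from dataset to dataset.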
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER... (IJDKP)
Over the past few years, there has been a considerable spread of microarray technology in many biological settings, particularly those pertaining to cancers such as leukemia, prostate, and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so studying them efficiently and effectively requires reducing their dimension to a large extent. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.
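A minimal sketch of the NMF-then-classify pipeline described above. The rank (16 components) and the k-NN classifier are illustrative assumptions, and sklearn's digits data (non-negative, like microarray intensities) stands in for the cancer datasets; the 98% figure from the abstract is not reproduced here.

```python
# Sketch: reduce dimensionality with NMF (X ~= W @ H), then classify in the
# reduced space W instead of the original feature space.
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)   # 64 non-negative features per sample
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Factorize the training matrix; W is the low-dimensional representation.
nmf = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
W_tr = nmf.fit_transform(X_tr)
W_te = nmf.transform(X_te)           # project test data onto the same basis

clf = KNeighborsClassifier(n_neighbors=5).fit(W_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(W_te))
print(f"accuracy after NMF to 16 dims: {acc:.3f}")
```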
Classification of medical datasets using back propagation neural network powe... (IJECEIAES)
Classification is one of the most indispensable domains in data mining and machine learning. The classification process has a good reputation in computer-aided disease diagnosis, where progress in smart computing technologies can be invested in diagnosing various diseases based on real patient data documented in databases. The paper introduces a methodology for diagnosing a set of diseases, including two types of cancer (breast and lung), and two datasets for diabetes and heart attack. A back-propagation neural network plays the role of the classifier. The performance of the neural net is enhanced by a genetic algorithm, which provides the classifier with the optimal features to raise the classification rate as high as possible. The system showed high efficiency in dealing with databases that differ from each other in size, number of features, and nature of the data, as the results illustrate: the classification rate reached 100% on most datasets.
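The genetic-algorithm feature selection the abstract describes can be sketched as a search over binary feature masks scored by a classifier's cross-validated accuracy. This is a hedged sketch, not the paper's system: the paper scores masks with a back-propagation network, while this sketch substitutes logistic regression for speed, and the population size, generation count, and mutation rate are illustrative choices.

```python
# Sketch: GA over binary feature masks; fitness = 3-fold CV accuracy of a
# classifier trained on the selected features only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((10, n_features)) < 0.5          # random initial masks
for gen in range(5):
    scores = np.array([fitness(m) for m in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:5]]                      # truncation selection
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(5, size=2)]
        cut = rng.integers(1, n_features)         # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05      # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents] + [children])

scores = np.array([fitness(m) for m in pop])
best_mask = pop[scores.argmax()]
best_acc = scores.max()
print(f"selected {best_mask.sum()} features, CV accuracy {best_acc:.3f}")
```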
A new model for large dataset dimensionality reduction based on teaching lear... (TELKOMNIKA JOURNAL)
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all forms of cancer, BC is the commonest cause of death among women globally. Data mining and classification methods are effective ways of classifying such data. These methods are particularly useful in the medical field because medical datasets contain irrelevant and redundant attributes, which are not needed to obtain an accurate estimate of a disease diagnosis. Teaching-learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the proposed method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the proposed TLBO is an efficient feature optimization technique for sustaining data-based decision-making systems.
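A minimal sketch of TLBO-style feature-subset selection with logistic regression, in the spirit of the paper's setup. Only the teacher phase is shown (the learner phase and the paper's multi-objective handling are omitted); positions are continuous in [0, 1] and a feature is selected when its coordinate exceeds 0.5. Class size and iteration count are illustrative assumptions.

```python
# Sketch: TLBO teacher phase for feature selection. Each "learner" is a
# candidate feature subset scored by logistic regression CV accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X, y = load_breast_cancer(return_X_y=True)
d = X.shape[1]

def score(pos):
    mask = pos > 0.5
    if not mask.any():
        return 0.0
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

learners = rng.random((8, d))                  # the "class" of solutions
fit = np.array([score(p) for p in learners])

for it in range(5):
    teacher = learners[fit.argmax()]           # best learner is the teacher
    mean = learners.mean(axis=0)
    TF = rng.integers(1, 3)                    # teaching factor: 1 or 2
    r = rng.random((len(learners), d))
    new = np.clip(learners + r * (teacher - TF * mean), 0, 1)
    new_fit = np.array([score(p) for p in new])
    improved = new_fit > fit                   # greedy acceptance
    learners[improved] = new[improved]
    fit[improved] = new_fit[improved]

best_mask = learners[fit.argmax()] > 0.5
best_acc = fit.max()
print(f"TLBO selected {best_mask.sum()} features, CV accuracy {best_acc:.3f}")
```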
Classification of pneumonia from X-ray images using siamese convolutional net... (TELKOMNIKA JOURNAL)
Pneumonia is one of the leading global causes of death, especially for children under 5 years old. This is mainly because of the difficulty of identifying the cause of pneumonia; as a result, the treatment given may not be suitable for each pneumonia case. Recent studies have used deep learning approaches to obtain better classification of the cause of pneumonia. In this research, we used a siamese convolutional network (SCN) to classify chest X-ray pneumonia images into 3 classes: normal conditions, bacterial pneumonia, and viral pneumonia. A siamese convolutional network is a neural network architecture that learns similarity between pairs of image inputs based on the differences between their features. One of the important benefits of classifying data with an SCN is the availability of comparable images that can be used as a reference when determining the class. Using an SCN, our best model achieved 80.03% accuracy and a 79.59% F1 score, and improved result reasoning by providing the comparable images.
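The inference side of a siamese classifier (assign a test image the class of its most similar reference image) can be sketched without the network itself. This sketch is a simplification under strong assumptions: a real SCN learns a convolutional embedding from image pairs, whereas here raw pixel vectors of sklearn's digits data stand in for that learned embedding, purely to show the pair-comparison mechanism.

```python
# Sketch: nearest-reference classification, the comparison step of a
# siamese network, with raw pixels replacing the learned embedding.
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# One reference image per class: the "comparable images" the abstract
# mentions as a benefit for explaining a decision.
classes = np.unique(y)
refs = np.stack([X[y == c][0] for c in classes])

def predict(img):
    # Pairwise distance plays the role of the siamese similarity score;
    # smaller distance means more similar.
    dists = np.linalg.norm(refs - img, axis=1)
    return classes[dists.argmin()]

test_idx = rng.choice(len(X), 200, replace=False)
preds = np.array([predict(X[i]) for i in test_idx])
acc = (preds == y[test_idx]).mean()
print(f"nearest-reference accuracy: {acc:.3f}")
```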
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA... (ijsc)
As the size of biomedical databases grows day by day, finding essential features for disease prediction has become more complex due to high-dimensionality and sparsity problems. Also, due to the availability of a large number of microarray datasets in biomedical repositories, it is difficult to analyze, predict, and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty, and class imbalance on microarray datasets. The ensemble classifier is one of the scalable models for extreme learning machines due to its high efficiency and fast processing speed for real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high-dimensional data with high computational efficiency and a high true positive rate. In the proposed model, an optimized particle swarm optimization (PSO) based ensemble classification model was developed on high-dimensional microarray datasets. Experimental results proved that the proposed model has high computational efficiency compared to traditional feature-selection-based classification models as far as accuracy, true positive rate, and error rate are concerned.
Sample Work For Engineering Literature Review and Gap Identification (PhD Assistance)
Sample Work For Engineering Literature Review and Gap Identification - PhD Assistance - http://bit.ly/2E9fAVq
2.1 INTRODUCTION
2.2 RESEARCH GAPS IN EXISTING METHODS
2.3 OBJECTIVES OF THIS WORK
Read More : http://bit.ly/2Rl7XT5
An approach for breast cancer diagnosis classification using neural network (acijjournal)
Artificial neural networks have been widely used as an intelligent tool in various fields in recent years, such as artificial intelligence, pattern recognition, medical diagnosis, and machine learning. The classification of breast cancer is a medical application that poses a great challenge for researchers and scientists. Recently, the neural network has become a popular tool in the classification of cancer datasets. Classification is one of the most active research and application areas of neural networks. The major disadvantages of the artificial neural network (ANN) classifier are its sluggish convergence and its tendency to become trapped in local minima. To overcome this problem, the differential evolution (DE) algorithm has been used to determine optimal or near-optimal values for ANN parameters. DE has been applied successfully to improve ANN learning in previous studies. However, there are still some issues with the DE approach, such as longer training time and lower classification accuracy. To overcome these problems, an island-based model is proposed in this system. The aim of our study is to propose an approach for distinguishing between different classes of breast cancer. This approach is based on the Wisconsin Diagnostic and Prognostic Breast Cancer datasets. The proposed system implements the island-based training method to achieve better accuracy and less training time by using and analyzing two different migration topologies.
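Using differential evolution to set ANN parameters, as described above, can be sketched with a tiny one-hidden-layer network whose weights are the DE search vector. This is a hedged illustration, not the paper's system: the island model and migration topologies are omitted, iris stands in for the breast cancer data, and the layer sizes and DE settings (DE/rand/1/bin, F = 0.6, CR = 0.9) are illustrative choices.

```python
# Sketch: DE/rand/1/bin searching the weight vector of a small network,
# minimizing cross-entropy on the training data (no gradients used).
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(3)
X, y = load_iris(return_X_y=True)
X = (X - X.mean(0)) / X.std(0)
n_in, n_hid, n_out = 4, 5, 3
sizes = [n_in * n_hid, n_hid, n_hid * n_out, n_out]
n_params = sum(sizes)

def forward(w, X):
    W1, b1, W2, b2 = np.split(w, np.cumsum(sizes)[:-1])
    h = np.tanh(X @ W1.reshape(n_in, n_hid) + b1)
    return h @ W2.reshape(n_hid, n_out) + b2

def loss(w):                                   # mean cross-entropy
    z = forward(w, X)
    z = z - z.max(1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

NP, F, CR = 30, 0.6, 0.9
pop = rng.normal(0, 0.5, (NP, n_params))
fit = np.array([loss(w) for w in pop])
for gen in range(100):
    for i in range(NP):
        a, b, c = pop[rng.choice(NP, 3, replace=False)]
        cross = rng.random(n_params) < CR      # binomial crossover
        trial = np.where(cross, a + F * (b - c), pop[i])
        f = loss(trial)
        if f <= fit[i]:                        # greedy selection
            pop[i], fit[i] = trial, f

best = pop[fit.argmin()]
best_acc = (forward(best, X).argmax(1) == y).mean()
print(f"DE-trained network accuracy: {best_acc:.3f}")
```

An island model would run several such populations in parallel and periodically migrate their best individuals between islands.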
Large datasets are not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
Regularized Weighted Ensemble of Deep Classifiers (ijcsa)
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused together to generate the resultant decision for prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is done for improved precision of learning. Deep classifier ensemble learning has good scope for research. Feature subset selection is another way of creating individual classifiers to be fused for ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (the Iris, Ionosphere, and Seeds datasets), thereby increasing the generalization of the boundary between the classes of the dataset. Singular value decomposition with reduced norm-2 regularization in the two-level deep classifier ensemble gives the best result in our experiments.
Classification of Microarray Gene Expression Data by Gene Combinations using ... (IJCSEA Journal)
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify each gene combination into its relevant subtype using fuzzy logic. The genes are ranked based on their statistical scores, and highly informative genes are filtered. Such genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select the top gene combinations, which are then used to classify lymphoma subtypes with fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and analyzed to assess the accuracy of the results. The work is implemented in the Java language.
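The first stages of the pipeline above (rank genes by a statistical score, keep the top-ranked genes, fuzzify their expression) can be sketched on synthetic data. The t-statistic as the "statistical score", the triangular low/medium/high memberships, and the synthetic dataset are all assumptions for illustration; the paper's fuzzy rules and gene-combination step are not reproduced.

```python
# Sketch: per-gene two-sample t-statistic ranking, then triangular
# fuzzification of a top gene into low/medium/high memberships.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 40, 200
labels = np.repeat([0, 1], n_samples // 2)
expr = rng.normal(0, 1, (n_samples, n_genes))
expr[labels == 1, :5] += 2.0               # plant 5 informative genes

# Statistical score: two-sample t-statistic per gene.
a, b = expr[labels == 0], expr[labels == 1]
t = (a.mean(0) - b.mean(0)) / np.sqrt(a.var(0) / len(a) + b.var(0) / len(b))
top = np.argsort(-np.abs(t))[:10]           # keep the 10 most informative

def fuzzify(x, lo, hi):
    """Triangular memberships for low / medium / high expression."""
    mid = (lo + hi) / 2
    low = np.clip((mid - x) / (mid - lo), 0, 1)
    high = np.clip((x - mid) / (hi - mid), 0, 1)
    medium = 1 - low - high
    return low, medium, high

g = expr[:, top[0]]
low, med, high = fuzzify(g, g.min(), g.max())
print("top-ranked genes:", sorted(top.tolist()))
```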
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I... (Devansh16)
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that these synthetic data generation pipelines can be used as an alternative to bypass privacy concerns and to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since the SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Identification of Disease in Leaves using Genetic Algorithm (ijtsrd)
Plant disease is an impairment of the normal state of a plant that interrupts or modifies its vital functions. Many leaf diseases are caused by pathogens. Agriculture is the mainstay of the Indian economy. The perception of the human eye is not strong enough to observe minute variations in the infected part of a leaf. In this paper, we provide a software solution to automatically detect and classify plant leaf diseases. We use image processing techniques to classify diseases so that a diagnosis can be carried out quickly for each disease. This approach will enhance the productivity of crops. It includes image processing techniques from image acquisition through preprocessing, testing, and training. K. Beulah Suganthy, "Identification of Disease in Leaves using Genetic Algorithm", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22901.pdf
Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/22901/identification-of-disease-in-leaves-using-genetic-algorithm/k-beulah-suganthy
Accounting for variance in machine learning benchmarks (Devansh16)
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
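The paper's core recommendation can be shown in miniature: repeat the same benchmark while varying a seed-controlled source of variation and report the spread, not a single score. The dataset, model, number of trials, and the choice of data splitting as the varied source are illustrative stand-ins, not the paper's experimental setup.

```python
# Sketch: 20 trials of the same pipeline, varying only the train/test
# split seed, to expose the variance hidden by a single-number benchmark.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scores = []
for seed in range(20):                      # vary the data-sampling seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"over {len(scores)} trials")
```

Two algorithms whose intervals overlap under this protocol should not be declared different on the strength of one lucky seed.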
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr... (IJTET Journal)
Abstract— Pattern recognition (PR) plays an important role in the field of bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem using a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes by gene expression profiling. Microarray gene expression profiling (MGEP) is important in bioinformatics; it yields high-dimensional data used in various clinical applications such as cancer diagnostics and drug design. In this work a new scheme has been developed for classifying unknown malignant tumors into known classes. The scheme includes the transformation of very high-dimensional microarray data into Mahalanobis space before classification. The eligibility of the proposed classification scheme was proved on 10 commonly available cancer gene datasets, containing both binary and multiclass datasets. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
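Transforming data into "Mahalanobis space" before classification, as the abstract describes, amounts to whitening by the inverse square root of the covariance, so that Euclidean distance in the new space equals Mahalanobis distance in the original. A minimal sketch, assuming a pooled covariance and a nearest-centroid rule, with iris standing in for the cancer gene datasets:

```python
# Sketch: whiten x -> C^{-1/2} (x - mean), then classify by nearest
# class centroid in the whitened (Mahalanobis) space.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

mean = X.mean(0)
cov = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(cov)            # cov = V diag(e) V^T
W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # C^{-1/2}
Z = (X - mean) @ W                            # Mahalanobis-space data

centroids = np.stack([Z[y == c].mean(0) for c in np.unique(y)])
preds = np.linalg.norm(Z[:, None, :] - centroids, axis=2).argmin(1)
acc = (preds == y).mean()
print(f"nearest-centroid accuracy in Mahalanobis space: {acc:.3f}")
```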
Controlling informative features for improved accuracy and faster predictions... (Damian R. Mingle, MBA)
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
A new model for large dataset dimensionality reduction based on teaching lear...TELKOMNIKA JOURNAL
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all the forms of cancer, BC is the commonest cause of death among women globally. Some of the effective ways of data classification are data mining and classification methods. These methods are particularly efficient in the medical field due to the presence of irrelevant and redundant attributes in medical datasets. Such redundant attributes are not needed to obtain an accurate estimation of disease diagnosis. Teaching learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the projected method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the projected TLBO is an efficient features optimization technique for sustaining data-based decision-making systems.
Classification of pneumonia from X-ray images using siamese convolutional net...TELKOMNIKA JOURNAL
Pneumonia is one of the highest global causes of deaths especially for children under 5 years old. This happened mainly because of the difficulties in identifying the cause of pneumonia. As a result, the treatment given may not be suitable for each pneumonia case. Recent studies have used deep learning approaches to obtain better classification within the cause of pneumonia. In this research, we used siamese convolutional network (SCN) to classify chest x-ray pneumonia image into 3 classes, namely normal conditions, bacterial pneumonia, and viral pneumonia. Siamese convolutional network is a neural network architecture that learns similarity knowledge between pairs of image inputs based on the differences between its features. One of the important benefits of classifying data with SCN is the availability of comparable images that can be used as a reference when determining class. Using SCN, our best model achieved 80.03% accuracy, 79.59% f1 score, and an improved result reasoning by providing the comparable images.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the
availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection
based ensemble learning models is to classify the high dimensional data with high computational efficiency
and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray
datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
Sample Work For Engineering Literature Review and Gap IdentificationPhD Assistance
Sample Work For Engineering Literature Review and Gap Identification - PhD Assistance - http://bit.ly/2E9fAVq
2.1 INTRODUCTION
2.2 RESEARCH GAPS IN EXISTING METHODS
2.3 OBJECTIVES OF THIS WORK
Read More : http://bit.ly/2Rl7XT5
#gapanalysis #strategicmanagement #datagapanalysis #gapanalysisppt #gapanalysishealthcare #gapanalysisfinance #gapanalysisEngineering
An approach for breast cancer diagnosis classification using neural networkacijjournal
Artificial neural networks have been widely used as intelligent tools in recent years in fields such as artificial intelligence, pattern recognition, medical diagnosis, and machine learning. The classification of breast cancer is a medical application that poses a great challenge for researchers and scientists. Recently, the neural network has become a popular tool for classifying cancer datasets, and classification is one of the most active research and application areas of neural networks. The major disadvantages of the artificial neural network (ANN) classifier are its sluggish convergence and its tendency to become trapped in local minima. To overcome this problem, the differential evolution (DE) algorithm has been used to determine optimal or near-optimal values for the ANN parameters, and DE has been applied successfully to improve ANN learning in previous studies. However, the DE approach still has some issues, such as long training time and low classification accuracy. To overcome these problems, an island-based model is proposed in this system. The aim of our study is to propose an approach for distinguishing between different classes of breast cancer, based on the Wisconsin Diagnostic and Prognostic Breast Cancer datasets. The proposed system implements the island-based training method to achieve better accuracy and shorter training time by using and comparing two different migration topologies.
Large datasets are not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
Regularized Weighted Ensemble of Deep Classifiers ijcsa
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused to generate the resulting prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is performed for improved learning precision. Deep classifier ensemble learning has good scope for research. Feature subset selection is another way of creating the individual classifiers to be fused in ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (Iris, Ionosphere, and Seeds), thereby increasing the generalization of the boundary between the classes of the dataset. Singular value decomposition reduced norm-2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.
Classification of Microarray Gene Expression Data by Gene Combinations using ...IJCSEA Journal
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction are effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify gene combinations into their relevant subtypes using fuzzy logic. The genes are ranked based on their statistical scores, and highly informative genes are filtered. Such genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select top gene combinations, which are then used to classify lymphoma subtypes with fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and is analyzed to predict the accuracy of the results. The work is implemented in the Java language.
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I...Devansh16
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline, called SinGAN-Seg, to produce synthetic medical data with the corresponding annotated ground truth masks. We show that this synthetic data generation pipeline can be used as an alternative that bypasses privacy concerns and as an alternative way to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ using both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improve the performance of segmentation algorithms when the training dataset is very small. Since our SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Identification of Disease in Leaves using Genetic Algorithmijtsrd
Plant disease is an impairment of the normal state of a plant that interrupts or modifies its vital functions. Many leaf diseases are caused by pathogens. Agriculture is the mainstay of the Indian economy. The perception of the human eye is not strong enough to observe minute variations in the infected part of a leaf. In this paper, we provide a software solution to automatically detect and classify plant leaf diseases. We use image processing techniques to classify diseases so that diagnosis can be carried out quickly for each disease. This approach will enhance the productivity of crops. It includes image processing techniques starting from image acquisition, preprocessing, testing, and training. K. Beulah Suganthy "Identification of Disease in Leaves using Genetic Algorithm" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22901.pdf
Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/22901/identification-of-disease-in-leaves-using-genetic-algorithm/k-beulah-suganthy
Accounting for variance in machine learning benchmarksDevansh16
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...IJTET Journal
Abstract— Pattern Recognition (PR) plays an important role in the field of Bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem using a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes by gene expression profiling. Microarray Gene Expression Profiling (MGEP) is important in Bioinformatics; it yields high-dimensional data used in various clinical applications such as cancer diagnostics and drug design. In this work, a new scheme has been developed for the classification of unknown malignant tumors into known classes. The new classification scheme includes the transformation of very high-dimensional microarray data into Mahalanobis space before classification. The eligibility of the proposed classification scheme has been proved on 10 commonly available cancer gene datasets, containing both binary and multiclass data sets. To improve the performance of the classification, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
Performance enhancement of machine learning algorithm for breast cancer diagn...IJECEIAES
Breast cancer is the most fatal women’s cancer, and accurate diagnosis of this disease in the initial phase is crucial to abate death rates worldwide. The demand for computer-aided disease diagnosis technologies in healthcare is growing significantly to assist physicians in ensuring the effectual treatment of critical diseases. The vital purpose of this study is to analyze and evaluate the classification efficiency of several machine learning algorithms with hyperparameter optimization techniques using grid search and random search to reveal an efficient breast cancer diagnosis approach. Choosing the optimal combination of hyperparameters using hyperparameter optimization for machine learning models has a straight influence on the performance of models. According to the findings of several evaluation studies, the k-nearest neighbor is addressed in this study for effective diagnosis of breast cancer, which got a 100.00% recall value with hyperparameters found utilizing grid search. k-nearest neighbor, logistic regression, and multilayer perceptron obtained 99.42% accuracy after utilizing hyperparameter optimization. All machine learning models showed higher efficiency in breast cancer diagnosis with grid search-based hyperparameter optimization except for XGBoost. Therefore, the evaluation outcomes strongly validate the effectiveness and reliability of the proposed technique for breast cancer diagnosis.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project was implemented in the Anaconda environment, using Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta-estimator, extremely randomized trees, Gaussian process classifier, Ridge, Gaussian naive Bayes, k-nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
Classification of Breast Cancer Diseases using Data Mining Techniquesinventionjournals
Medical data mining has great potential for exploring new knowledge from large amounts of data. Classification is one of the most important data mining techniques. In this research work, we have used various data mining based classification techniques to classify whether or not a patient has a cancer disease. We applied the Breast Cancer-Wisconsin (Original) data set to different data mining techniques and compared the accuracy of the models with two different data partitions. BayesNet achieved the highest accuracy, 97.13%, in the case of 10-fold data partitions. We have also applied the info-gain feature selection technique to BayesNet and the Support Vector Machine (SVM) and achieved the best accuracy, 97.28%, with BayesNet in the case of a 6-feature subset.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
The goal of this paper is to compare different classifiers and multi-classifier fusion with respect to accuracy in discovering breast cancer on four different data sets. We present an implementation of various classification techniques, representing the best-known algorithms in this field, on four different breast cancer datasets: two for diagnosis and two for prognosis. We present a fusion between classifiers to find the best multi-classifier fusion approach for each data set individually. Classification accuracy is obtained from the confusion matrix, built using the 10-fold cross-validation technique, together with fusion by majority voting (the mode of the classifier outputs). The experimental results show that no classification technique is better than the others when used for all datasets, since the classification task is affected by the type of dataset. By using multi-classifier fusion, the results show that accuracy improved in three datasets out of four.
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...rahulmonikasharma
Classification problems in high-dimensional data with few observations have become more common, particularly in microarray data. The increasing amount of text data on internet sites also affects clustering analysis. Text clustering is a useful analysis technique for partitioning a huge amount of data into clusters; the most important problem affecting the text clustering technique is the presence of uninformative and sparse features in text documents. A broad class of boosting algorithms can be viewed as performing coordinate-wise gradient descent to minimize some potential function of the margins of a data set. This paper proposes a novel evaluation measure, the Q-statistic, that incorporates the stability of the selected feature set in addition to the prediction accuracy. We then propose the Booster of an FS algorithm, which enhances the value of the Q-statistic of the algorithm applied.
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
In the classification of many diseases an accurate gene analysis is needed, for which the selection of the most informative genes is very important, requiring a decision technique that works in a complex context of ambiguity. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-sample t-test (2STT), entropy, and the signal-to-noise ratio (SNR). This paper evaluates gene selection and classification on the basis of accurate gene selection using a structured complex decision technique (SCDT) and classifies the result using a fuzzy cluster-based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using the leave-one-out cross-validation (LOOCV) metric, along with sensitivity, specificity, precision, and F1-score, against four different classifiers, namely 1) radial basis function (RBF), 2) multi-layer perceptron (MLP), 3) feed forward (FF), and 4) support vector machine (SVM), for three different datasets: DLBCL, Leukemia, and Prostate tumor. The proposed SCDT & FC-NNC exhibits superior results, making it the more accurate decision mechanism.
Inference of Nonlinear Gene Regulatory Networks through Optimized Ensemble of...Arinze Akutekwe
Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in systems biology. Most methods for modeling and inferring the dynamics of GRNs, such as those based on state space models, vector autoregressive models, and the G1DBN algorithm, assume linear dependencies among genes. However, this strong assumption does not truly represent the time-course relationships across the genes, which are inherently nonlinear. Nonlinear modeling methods such as S-systems and causal structure identification (CSI) have been proposed, but are known to be statistically inefficient and analytically intractable in high dimensions. To overcome these limitations, we propose an optimized ensemble approach based on support vector regression (SVR) and dynamic Bayesian networks (DBNs). The method, called SVR-DBN, uses nonlinear kernels of the SVR to infer the temporal relationships among genes within the DBN framework. The two-stage ensemble is further improved by SVR parameter optimization using Particle Swarm Optimization. Results on eight in silico-generated datasets and two real-world datasets of Drosophila melanogaster and Escherichia coli show that our method outperformed the G1DBN algorithm by a total average accuracy of 12%. We further applied our method to model the time-course relationships of ovarian carcinoma. From our results, four hub genes were discovered. Stratified analysis further showed that the expression levels of the prostate differentiation factor and BTG family member 2 genes were significantly increased by the cisplatin and oxaliplatin platinum drugs, while the expression levels of the Polo-like kinase and Cyclin B1 genes were both decreased by the platinum drugs. These hub genes might be potential biomarkers for ovarian carcinoma.
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The multivariate sample similarity measure is evaluated with metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method is able to diagnose chest pain, thallium scan, and major vessels scanned using X-rays with a high capability to distinguish between healthy and heart disease patients with a 99.6% accuracy.
Mining of Important Informative Genes and Classifier Construction for Cancer ...ijsc
Microarray is a useful technique for measuring expression data for thousands of genes simultaneously. One of the challenges in cancer classification using high-dimensional gene expression data is to select a minimal number of relevant genes that can maximize classification accuracy. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust gene identification methods is fundamental. Many gene selection methods, as well as their corresponding classifiers, have been proposed. In the proposed method, a single gene with high class-discrimination capability is selected, and classification rules are generated for cancer based on gene expression profiles. The method first computes an importance factor for each gene of the experimental cancer dataset by counting the number of linguistic terms (defined in terms of different discrete quantities) with high class-discrimination capability according to their dependent degree of classes. Initial important genes are then selected according to their high importance factors to form an initial reduct. The traditional k-means clustering algorithm is then applied to each selected gene of the initial reduct to compute the misclassification error of each individual gene. The final reduct is formed by selecting the most important genes with the lowest misclassification errors. A classifier is then constructed based on decision rules induced by the selected important (single) genes from the training dataset, to classify cancerous and non-cancerous samples of the experimental test dataset. The proposed method was tested on four publicly available cancerous gene expression test datasets. In most cases, accurate classification outcomes are obtained by just using important (single) genes, and genes that are highly correlated with the pathogenesis of cancer are identified. To prove the robustness of the proposed method, the outcomes (correctly classified instances) are also compared with some existing well-known classifiers.
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND F...Kiogyf
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM
ABSTRACT
Cancer is a globally recognized cause of death. A proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expressions seem to be a successful platform for studying genetic diseases. Although standard machine learning (ML) approaches have been efficient in identifying significant genes and in classifying new types of cancer cases, their medical and practical application has faced several drawbacks, such as the limitations of DNA microarray data analysis, which involves an enormous number of features and a relatively small number of instances. To obtain reasonable and efficient information from a DNA microarray dataset, there is a need to extend the level of interpretability of the forecasting approach while maintaining a high level of precision. In this work, a novel way of classifying cancer based on gene expression profiles is presented. This method is a combination of the Firefly algorithm and the Mutual Information method. First, Mutual Information is used to select the features, before the Firefly algorithm is applied for feature reduction. Finally, the Support Vector Machine is used to classify cancer into types. The performance of the proposed system was evaluated by using it to classify datasets from colon cancer; the results of the evaluation were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
Similar to CSCI 6505 Machine Learning Project (20)
1. Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya [email_address] Winter 2006 Dalhousie University Machine Learning Prediction
Editor's Notes
(at the end) We used Weka to perform the experiments. We evaluated KNN, NB, DT, and SVM; each has its own strengths and limitations, and it would be difficult to say which one gives the best results, so it is necessary to evaluate them on the same datasets and with common evaluation criteria. In our experiments, we perform comparative studies using the full set of features as well as a subset of them. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
In a classification problem, we are given m training instances and l classes, where the instances consist of n features, and the known class labels C. The goal is to predict the class label for a new given instance. For our problem, we consider the features to be gene expression coefficients, and the instances correspond to patients. Here, n >> m. Overfitting: building models that are very good on the training set but perform poorly on future independent samples. How can we guard against overfitting? Split the data into a training set and a cross-validation set, use the latter to monitor the generalization performance, and stop the training process when overfitting sets in. Finding disease markers (classifiers) from gene expression data with machine learning algorithms carries a high risk of overfitting the data, due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). DNA microarray experiments on biological samples generate thousands of gene expression measurements; the resulting datasets are highly dimensional and often noisy due to the processes involved in the experiments. This is a challenging problem whose results can be used to diagnose a disease or predict the survival of a patient. The approach taken by this project is to provide comparative results indicating that a small number of instances can be used to create a useful model, and that feature selection improves the classification accuracy.
Golub et al. … their results demonstrate the feasibility of cancer classification based solely on gene expression. A. Rosenwald et al. … for diffuse large-B-cell lymphoma. Furey et al. … their results indicate that SVM is able to classify this kind of data and can be used to identify the presence of a disease. Guyon et al. … their results show an increase in the overall performance of SVM classification with the reduced set of features.
KNN - To classify a given instance I, the algorithm ranks the neighbors of I and uses the class labels of the k most similar neighbors to predict the class of I. After gathering the class labels of the neighbors, a majority vote is taken, and I is assigned the class label with the greatest number of votes among the k nearest neighbors. The best choice of k depends on the dataset.
NB - The training phase consists of calculating the conditional probability P(x|c) of an instance given a class label, and the prior probability P(c) of the class. To classify an unseen instance, the posterior probability of each class given the instance is calculated, and the instance is assigned the class with the highest probability.
DT - The algorithm builds a tree from a training dataset: it recursively partitions the set by choosing an attribute and creating a separate branch for each value of the chosen attribute. The best attribute to split on is the one with the highest information gain, i.e., the lowest entropy. To classify an instance, the method starts at the root node, tests the attribute specified by the node, then moves down the branch corresponding to the value of that attribute in the given instance. This process is repeated for the subtree rooted at the new node until a leaf is encountered, and the instance is labeled with the class indicated by the leaf.
SVM - The support vector machine (SVM) method finds a linear discriminant, called a hyperplane, that separates the classes in a given dataset. The best hyperplane is the one that keeps the maximum separation between the classes in order to better generalize the model, so we look for the maximum-margin hyperplane.
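The KNN voting procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration (the experiments themselves used Weka); the Euclidean distance metric and the tiny toy expression vectors are assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Rank all training instances by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), c) for x, c in zip(train, labels))
    # Majority vote over the class labels of the k closest neighbors.
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

# Toy "expression profiles": two instances per class (hypothetical values).
train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["ALL", "ALL", "AML", "AML"]
print(knn_predict(train, labels, (0.85, 0.85), k=3))  # AML
```

With k=3 the two nearby AML instances outvote the single nearest ALL instance, which is the behavior the note describes: the best k is data-dependent.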
The datasets used for this evaluation were obtained from the Kent Ridge Biomedical Data Set Repository. They correspond to gene expression data obtained from DNA microarrays. Leukemia dataset: the gene expression data were taken from bone marrow samples and blood samples. Diffuse large-B-cell lymphoma (DLBCL) dataset: this dataset consists of biopsy samples of 240 patients that were examined for gene expression with DNA microarrays. The number of microarray features is 7399, and each sample belongs to one of two classes, Alive or Dead. The two classes correspond to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma.
FEATURE SELECTION Due to the high-dimensional nature of this type of data, we chose a smaller set of features from the original feature set. Another reason to perform feature selection is that having a number of features much greater than the number of instances increases the potential problem of overfitting. TESTING METHODOLOGY We divided both datasets with different train/test ratios (66/34, 80/20, and 90/10) and averaged over the results (macroaveraging). However, given that our datasets are small, we also evaluated accuracy using 10-fold cross-validation. The major advantage of cross-validation is that all the cases in the dataset are used for testing and nearly all the cases are used for training the classifier, so this resampling technique can provide a good estimate of the accuracy.
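A rough sketch of this protocol: rank features by an information-based score and estimate accuracy with 10-fold cross-validation. Mutual information stands in here for the gain ratio criterion mentioned later, since scikit-learn does not ship a gain ratio scorer; the sample sizes and k = 50 selected features are illustrative assumptions.

```python
# Sketch of the evaluation protocol: information-based feature ranking
# (mutual information as a stand-in for gain ratio) + 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Many more features than instances, as in microarray data.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

# Keep only the 50 best-ranked features.
X_reduced = SelectKBest(mutual_info_classif, k=50).fit_transform(X, y)

scores = cross_val_score(GaussianNB(), X_reduced, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f}")
```

Note that selecting features on the full dataset before cross-validating, as done here for brevity, leaks test information into the selection step; wrapping `SelectKBest` and the classifier in a `Pipeline` inside `cross_val_score` would avoid that.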
The classification of the data corresponds to a binary classification task: we want to determine whether a patient is alive or dead, or which of two types of leukemia a patient has. However, using only accuracy can yield misleadingly overoptimistic estimates, which is why, to evaluate the performance of the classification algorithms, we also use precision, recall, and F-measure. Precision is the proportion of instances that actually have class C among all those classified as class C. Recall is the proportion of instances classified as class C among all instances that truly have class C, i.e., how much of the class was captured. In order to give equal importance to each class, we average the values of precision, recall, and F-measure obtained for each class C. Since the classes are almost evenly represented in the training samples, we can also trust accuracy as a measure of performance.
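The per-class averaging described above corresponds to macro-averaging in scikit-learn's metrics. A small illustration on toy binary labels (not the paper's data):

```python
# Macro-averaged precision, recall, and F-measure: each metric is computed
# per class and then averaged, giving both classes equal weight.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]   # ground-truth labels
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]   # classifier predictions

p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f = f1_score(y_true, y_pred, average="macro")
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

With these toy labels, class 0 has precision and recall 2/3 and class 1 has 4/5, so all three macro-averages come out to 11/15 ≈ 0.733.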
For both datasets there is an intuitive agreement between the evaluation over an independent test set and cross-validation; however, the cross-validation results are lower, most likely because cross-validation uses nearly all the data for both training and testing, giving a more realistic estimate. In the Leukemia dataset, the classification accuracies under both evaluation methods are remarkably high, as there are features that almost completely determine the class, and the Naive Bayes and SVM algorithms tend to slightly outperform KNN and DT. For SVM, this is because the classes are linearly separable; for NB, the success of its feature-independence assumption suggests that at least some features determine the class almost completely, despite possible redundant or noisy features. For the DLBCL dataset, the accuracy is significantly lower for all algorithms, with KNN (66.92% and 62.91%) being the best classifier. Decision Trees gave the lowest accuracy, likely due to the large number of features involved. Surprisingly, KNN outperforms SVM on DLBCL and almost matches it on Leukemia.
We must point out that reducing the dimensionality to the best-ranked features increases the accuracy compared with using the full set of features. The results obtained from the independent test set evaluations and from cross-validation still intuitively agree, with the cross-validation measures again slightly lower. For the Leukemia dataset, the reduced dimensionality brought a slight increase in overall accuracy, indicating that this dataset can be described to a high degree of accuracy by a reduced number of features. For the DLBCL dataset, feature selection significantly increased the overall performance of all the algorithms, with Naive Bayes (78.84% and 70.83%) and SVM (75.37% and 71.25%) achieving the highest accuracies.
Since cross-validation gives a more realistic view of the algorithms' behavior, the table summarizes the best performance of each classifier, with and without feature selection, in terms of 10-fold cross-validation. The figure shows the variation of the F-measure for each algorithm on both datasets, reinforcing the observation that SVM outperforms the rest. It is interesting to note that the measures are consistent across all the algorithms within each dataset: for example, Leukemia with all features lies in the range [0.847, 0.985], while DLBCL with feature selection lies in the range [0.612, 0.706].
Performance depends … This is confirmed by the remarkably high results obtained with the Leukemia dataset, which drop dramatically on the DLBCL data. Feature selection … No matter which algorithm is used, all of them benefit from feature selection, which increases performance. This is especially important for algorithms such as KNN, where distances must be computed over the features. The use of an information-gain-based method such as gain ratio seems to preserve the underlying correlation between the selected features and the class labels. SVM … As initially suspected, SVM classification gave the best results; however, despite performing well with high-dimensional data, we have shown that SVM can also benefit from reducing the dimensionality with feature selection. Decision Trees … It is widely known that they do not behave well with high-dimensional and noisy datasets.
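The gain ratio score mentioned above is the information gain of a feature normalized by the entropy of the feature's own value distribution, which penalizes many-valued attributes. A minimal from-scratch sketch for a single discrete feature, using the standard formulas (the toy feature/label vectors are made up for illustration):

```python
# Gain ratio for one discrete feature:
#   gain_ratio = (H(labels) - H(labels | feature)) / H(feature)
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete label/value sequence, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    n = len(labels)
    # Partition labels by feature value and compute conditional entropy.
    groups = {}
    for f, c in zip(feature, labels):
        groups.setdefault(f, []).append(c)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - cond
    split_info = entropy(feature)  # entropy of the feature's distribution
    return info_gain / split_info if split_info else 0.0

feature = [0, 0, 1, 1, 1, 0]  # toy binary feature
labels  = [0, 0, 1, 1, 0, 0]  # toy class labels
print(f"gain ratio: {gain_ratio(feature, labels):.3f}")
```

Ranking all features by this score and keeping the top-ranked ones is the selection scheme the results describe.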
Surprisingly, KNN … Its relatively strong performance makes it a good baseline choice when applied to gene expression data. The DLBCL dataset … The reason for the low results might be that predicting whether a patient is dead or alive a certain time after chemotherapy involves other circumstances, such as the patient's living environment and care, which cannot be numerically measured yet do affect the final prediction.
While our results indicate that SVM, by its very nature, deals well with high-dimensional gene expression data, we have shown that other methods work surprisingly well too. The datasets used contain relatively few instances and do not allow any one method to demonstrate absolute superiority. We have also shown that there is no single approach that works well in all situations, and the use of one algorithm over the others should be evaluated on a case-by-case basis.
Knowing that data transformation methods destroy the underlying meaning of the feature set, it would be interesting to see whether algorithms such as SVM, and Naive Bayes, which assumes feature independence, benefit from such transformations. Another direction for future research is the statistical analysis of the effect of noisy gene expression data on the reliability of the classifier. This is interesting because the methods used to obtain this type of data can be subject to "noise"; it is therefore crucial to determine the effects of noise on the results and to assess the robustness of an algorithm in the presence of noisy measurements or mislabeled classes. Finally, more experiments with other datasets should be performed before drawing final conclusions.
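The mislabeled-class robustness study proposed above could be prototyped as follows. This is a hypothetical experiment sketch, not something the paper ran: a fraction of binary labels is flipped at random and the 10-fold cross-validated accuracy is tracked as the noise level grows (synthetic data and noise levels are assumptions).

```python
# Hypothetical robustness probe: flip a fraction of the labels and watch
# how cross-validated accuracy degrades.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
rng = np.random.default_rng(0)

for noise in (0.0, 0.1, 0.2):
    y_noisy = y.copy()
    # Flip `noise` fraction of the binary labels at random positions.
    flip = rng.choice(len(y), size=int(noise * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]
    acc = cross_val_score(SVC(kernel="linear"), X, y_noisy, cv=10).mean()
    print(f"label noise {noise:.0%}: mean CV accuracy {acc:.3f}")
```

A fuller study would repeat each noise level over many random seeds and compare the degradation curves across classifiers.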