(at the end) We used Weka to perform the experiments. We evaluated KNN, NB, DT, and SVM. Each has its own strengths and limitations, so it is difficult to say in advance which one gives the best results. They must be evaluated on the same datasets and with common evaluation criteria. In our experiments, we perform comparative studies using the full set of features, as well as a subset of them. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
In a classification problem, we are given m training instances and l classes, where each instance consists of n features and a known class label drawn from C. The goal is to predict the class label for a new instance. For our problem, the features are gene expression coefficients and the instances correspond to patients. Here, n >> m. Overfitting: building models that are very good for the training set but perform poorly on future independent samples. How can we guard against overfitting? Split the data into a training set and a cross-validation set. Use the latter for monitoring the generalization performance. When overfitting sets in, stop the training process. Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). DNA microarray experiments on biological samples generate thousands of gene expression measurements. The datasets produced are highly dimensional and often noisy due to the process involved in the experiments. This is a challenging problem, and one whose results can be used to diagnose a disease or predict the survival of a patient. The approach taken by this project is to provide comparative results indicating that a small number of instances can be used to create a useful model, and that feature selection improves the classification accuracy.
Golub et al. … their results demonstrate the feasibility of cancer classification based solely on gene expression. A. Rosenwald et al. … for diffuse large-B-cell lymphoma. Furey et al. … their results indicate that SVM is able to classify this kind of data and can be used to identify the presence of a disease. Guyon et al. … their results show an increase in the overall performance of SVM classification with the reduced set of features.
KNN - To classify a given instance I, the algorithm ranks the neighbors of I by similarity and uses the class labels of the k most similar neighbors to predict the class of I. The labels of these neighbors are put to a majority vote, and I is assigned the class label with the greatest number of votes among the k nearest neighbors. The best choice of k depends on the dataset. NB - The training phase consists of calculating the conditional probability P(x|c) of an instance given a class label, and the prior probability P(c) of the class. To classify an unseen instance, the posterior probability of each class given the instance is calculated, and the instance is assigned the class with the highest probability. DT - The algorithm builds a tree from a training dataset. It recursively partitions the set by choosing an attribute and creating a separate branch for each value of the chosen attribute. The best attribute to split on is the one with the highest information gain, i.e. the one whose split leaves the lowest entropy. To classify an instance, the method starts at the root node, tests the attribute specified by the node, and moves down the branch corresponding to the value of that attribute in the given instance. This process is repeated for the subtree rooted at the new node until a leaf is encountered, and the instance is finally labeled with the class indicated by the leaf. SVM - The Support Vector Machine (SVM) method finds a linear discriminant, called a hyperplane, which separates the classes in a given dataset. The best hyperplane is the one that keeps the maximum separation between the classes, because this helps the model generalize, so we are looking for the maximum margin hyperplane.
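The KNN voting rule described above can be sketched in a few lines of Python. This is only an illustration: it assumes Euclidean distance as the similarity measure and uses a tiny made-up dataset (`train_X`, `train_y`), neither of which comes from the notes. The experiments themselves were run in Weka, not with this code.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, instance, k=3):
    """Predict the class of `instance` by majority vote among its k nearest neighbors."""
    # Rank training instances by Euclidean distance (an assumed similarity measure).
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], instance),
    )
    # Collect the class labels of the k most similar neighbors.
    top_labels = [label for _, label in ranked[:k]]
    # Assign the label with the greatest number of votes.
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical toy data: two expression values per patient.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.15, 0.15)]
train_y = ["ALL", "ALL", "AML", "AML", "ALL"]

prediction = knn_predict(train_X, train_y, (0.12, 0.18), k=3)
print(prediction)
```

With thousands of features per patient, every distance computation touches every feature, which is one reason feature selection matters for this method.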
The datasets used for this evaluation were obtained from the Kent Ridge Biomedical Data Set Repository. They correspond to gene expression data obtained from DNA microarrays. Leukemia dataset. The gene expression measurements were taken from bone marrow samples and blood samples. Diffuse Large-B-Cell Lymphoma (DLBCL) dataset. This dataset consists of biopsy samples from 240 patients that were examined for gene expression with the use of DNA microarrays. The number of microarray features is 7399, and each sample belongs to one of two classes: Alive or Dead. The two classes correspond to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma.
FEATURE SELECTION Due to the high-dimensional nature of this type of data, we chose a smaller set of features from the set of original features. Another reason to perform feature selection is that having many more features than instances increases the risk of overfitting. TESTING METHODOLOGY We divided both datasets with different ratios of train/test sets (66/34, 80/20, and 90/10), and averaged over the results (macroaveraging). However, given that our datasets are small, we also wanted to evaluate the accuracy on the basis of 10-fold cross-validation. The major advantage of cross-validation is that all the cases in the dataset are used for testing, and nearly all the cases are used for training the classifier. This resampling technique can provide a good estimate of the accuracy.
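To make the cross-validation idea concrete, here is a minimal Python sketch of how the instances could be divided into 10 folds so that each instance is tested exactly once. The function name `k_fold_indices` and the fixed random seed are assumptions made for illustration. The sketch does not stratify the folds by class, which an evaluation tool may do.

```python
import random

def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on all folds except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(72, k=10))  # 72 = number of samples in the Leukemia dataset
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each round trains on roughly 90% of the data, which is why nearly all cases are used for training while every case is still used once for testing.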
The classification of the data corresponds to a binary classification task: we want to determine whether a patient is alive or dead, or which of two types of leukemia the patient has. However, using only the accuracy can result in misleading, overoptimistic estimates. That is why we also use precision, recall, and F-measure to evaluate the performance of the classification algorithms. Precision is the proportion of instances that actually have class C among all those that were classified as class C. Recall is the proportion of instances that were classified as class C among all instances that truly have class C, i.e. how much of the class was captured. In order to give equal importance to each class, we average the values of precision, recall, and F-measure that we get for each class C. The classes are almost evenly represented in the training samples, which is why accuracy can still be trusted as a measure of performance.
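The per-class measures and their average can be sketched as follows. This is a minimal Python illustration with made-up labels (`y_true`, `y_pred`) and hypothetical function names. It is not the Weka implementation used in the experiments.

```python
def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, F-measure) for a single class `cls`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)  # classified as cls
    actual = sum(1 for t in y_true if t == cls)     # truly cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def macro_average(y_true, y_pred):
    """Average precision, recall and F-measure over classes, weighting each class equally."""
    classes = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[i] for s in scores) / len(classes) for i in range(3))

# Hypothetical predictions for six patients.
y_true = ["Alive", "Alive", "Alive", "Dead", "Dead", "Dead"]
y_pred = ["Alive", "Alive", "Alive", "Dead", "Dead", "Alive"]

print(macro_average(y_true, y_pred))
```

In this toy case the classifier captures every Alive patient but misses one Dead patient, so recall differs between the two classes while the average treats both classes alike.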
For both datasets there is an intuitive agreement between the evaluation over an independent test set and cross-validation. However, the cross-validation results are lower, most likely because cross-validation uses nearly all the data for training and testing, giving a more realistic estimate. In the Leukemia dataset, the classification accuracies under both evaluation methods are remarkably high, which suggests that there are features that completely determine the class. Naive Bayes and SVM tend to slightly outperform KNN and DT. In the case of SVM, this is because the classes are linearly separable. For NB, its good performance despite the feature-independence assumption suggests that at least some features determine the class on their own, despite possible redundant or noisy features. For the DLBCL dataset, the accuracy is low for all algorithms, with KNN (66.92% and 62.91%) being the best classifier. Decision Trees gave the lowest accuracy, which is due to the large number of features involved. Surprisingly, KNN outperforms SVM in DLBCL and almost matches it in Leukemia.
We must point out that reducing the dimensionality by keeping only the best-ranked features increases the accuracy compared with using the full set of features. The results obtained from the independent test set evaluations and from cross-validation still intuitively agree, with the cross-validation measures again a little lower. For the Leukemia dataset, the reduced dimensionality brought a slight increase in the overall accuracy, indicating that this dataset can be described to a high degree of accuracy by a reduced number of features. For the DLBCL dataset, feature selection significantly increased the overall performance of all the algorithms, with Naive Bayes (78.84% and 70.83%) and SVM (75.37% and 71.25%) reaching the highest accuracies.
Since cross-validation gives a more realistic view of the algorithms' behavior, the table summarizes the best performance for each type of classifier, with and without feature selection, in terms of 10-fold cross-validation. The figure shows the variation of the F-measure for each algorithm on both datasets, reinforcing the assumption that SVM outperforms the rest. It is interesting to point out that the measures are consistent among all the algorithms within each dataset. For example, Leukemia with all features is in the range [0.847, 0.985], and DLBCL with feature selection is in the range [0.612, 0.706].
Performance depends ... This is confirmed by the remarkably high results obtained with the Leukemia dataset, which drop dramatically with the DLBCL data. Feature selection … No matter which algorithm is used, all of them benefit from feature selection, which increases performance. This is especially important for algorithms such as KNN, where distances must be computed in terms of features. The use of an information-gain-based method such as gain ratio seems to preserve the underlying correlation between the selected features and the class labels. SVM … As initially suspected, SVM classification gave the best results. However, even though SVMs perform well with high-dimensional data, we have shown that they can also benefit from reducing the dimensionality with feature selection. Decision Trees … it is widely known that they do not behave well with high-dimensional and noisy datasets.
Surprisingly, KNN … its relatively strong performance makes it a good baseline when applied to gene expression data. The DLBCL dataset … The low results might be due to the fact that predicting whether a patient is dead or alive a certain time after chemotherapy involves other circumstances, such as the living environment and the care of the patient, which cannot be numerically measured and yet do affect the final outcome.
While our results indicate that SVM, by its very nature, deals well with high-dimensional gene expression data, we have shown that other methods work surprisingly well too. The datasets used contain a relatively small number of instances and do not allow any one method to demonstrate absolute superiority. We have also shown that there is no single approach that works well in all situations, and the choice of one algorithm over the others should be evaluated on a case-by-case basis.
Knowing that data transformation methods destroy the underlying meaning of the set of features, it would be interesting to see whether algorithms such as SVM and Naive Bayes (which assumes feature independence) benefit from the transformation. Another direction for future research is the statistical analysis of the effect of noisy gene expression data on the reliability of the classifier. Since the methods used to obtain this type of data are subject to noise, it is crucial to determine its effect on the results and to judge the robustness of an algorithm in the presence of noisy measurements or mislabeled classes. Finally, more experiments with other datasets should be performed before drawing final conclusions.
CSCI 6505 Machine Learning Project
Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya Winter 2006 Dalhousie University
Outline <ul><li>Introduction </li></ul><ul><li>Definition of the Problem </li></ul><ul><li>Related Work </li></ul><ul><li>Algorithms </li></ul><ul><li>Description of the Data </li></ul><ul><li>Methodology of Experiments </li></ul><ul><li>Results </li></ul><ul><li>Relevance of Results </li></ul><ul><li>Conclusions & Future Work </li></ul>
Introduction <ul><li>ML has gained attention in the biomedical field. </li></ul><ul><li>Need to turn biomedical data into meaningful information. </li></ul><ul><li>Microarray technology is used to generate gene expression data. </li></ul><ul><li>Gene expression data involves a huge number of numeric attributes (gene expression measurements). </li></ul><ul><li>This kind of data is also characterized by a small number of instances. </li></ul><ul><li>This work investigates the classification problem on such data. </li></ul>
Definition of the Problem <ul><li>Classifying Gene Expression Data </li></ul><ul><ul><li>Number of features (n) is much greater than the number of sample instances (m). (n >> m) </li></ul></ul><ul><ul><li>Typical data: n > 5000, and m < 100 </li></ul></ul><ul><ul><li>High risk of overfitting the data due to the abundance of attributes and shortage of available samples. </li></ul></ul><ul><ul><li>The datasets produced by microarray experiments are highly dimensional and often noisy due to the process involved in the experiments. </li></ul></ul>
Related Work <ul><li>Using gene expression data for the task of classification has recently gained attention in the biomedical community. </li></ul><ul><li>Golub et al. describe an approach to cancer classification based on gene expression applied to human acute leukemia (ALL vs AML). </li></ul><ul><li>A. Rosenwald et al. developed a model predictor of patient survival after chemotherapy (Alive vs Dead). </li></ul><ul><li>Furey et al. present a method to analyze microarray expression data using SVM. </li></ul><ul><li>Guyon et al. experiment with reducing the dimensionality of gene expression data. </li></ul>
Algorithms <ul><li>K-Nearest Neighbor (KNN) </li></ul><ul><ul><li>It is one of the simplest and most widely used algorithms for data classification. </li></ul></ul><ul><li>Naive Bayes (NB) </li></ul><ul><ul><li>It assumes that the effect of a feature value on a given class is independent of the values of other features. </li></ul></ul><ul><li>Decision Trees (DT) </li></ul><ul><ul><li>Internal nodes represent tests on one or more attributes and leaf nodes indicate decision outcomes. </li></ul></ul><ul><li>Support Vector Machines (SVM) </li></ul><ul><ul><li>Works well on high-dimensional data. </li></ul></ul>
Description of the Data <ul><li>Leukemia dataset. </li></ul><ul><ul><li>A collection of 72 expression measurements. The samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). </li></ul></ul><ul><li>Diffuse Large-B-Cell Lymphoma (DLBCL) dataset </li></ul><ul><ul><li>Biopsy samples that were examined for gene expression with the use of DNA microarrays. Each sample is labeled with the survival outcome after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead). </li></ul></ul>
Methodology of Experiments <ul><li>Feature Selection </li></ul><ul><ul><li>Remove irrelevant features (although they may still have biological meaning). </li></ul></ul><ul><ul><li>Use of GainRatio </li></ul></ul><ul><li>Selecting a Supervised Learning Method </li></ul><ul><ul><li>KNN, NB, DT, SVM </li></ul></ul><ul><li>Testing Methodology </li></ul><ul><ul><li>Evaluation over independent test set (train/test split) </li></ul></ul><ul><ul><ul><li>Ratios: 66/34, 80/20, 90/10 </li></ul></ul></ul><ul><ul><li>10-fold Cross-Validation </li></ul></ul><ul><ul><li>Compare both methods and see if they are in logical agreement </li></ul></ul>[Diagram labels: All features; Feature Selection (gene subset); Algorithm]
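As a rough illustration of ranking genes by gain ratio, here is a minimal Python sketch. The slides name GainRatio but do not say how the numeric expression values were discretized. This sketch assumes a single binary split at each feature's median, which is a simplification and not necessarily what Weka does. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` with respect to the class labels."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info

def rank_features(X, labels):
    """Return feature indices ordered by gain ratio, best first (median split assumed)."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        threshold = sorted(column)[(len(column) - 1) // 2]  # lower median
        scores.append((gain_ratio(column, labels, threshold), j))
    return [j for _, j in sorted(scores, reverse=True)]

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = ["ALL", "ALL", "AML", "AML"]
ranking = rank_features(X, labels)
print(ranking)
```

Keeping only the top-ranked indices from such a ranking yields the reduced gene subset that is then passed to each classifier.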
Methodology of Experiments (cont…) <ul><li>Measuring Performance </li></ul><ul><ul><li>Accuracy </li></ul></ul><ul><ul><li>Precision (p) </li></ul></ul><ul><ul><li>Recall (r) </li></ul></ul><ul><ul><li>F-Measure </li></ul></ul><ul><ul><ul><li>It is hard to compare two classifiers using two measures. F-Measure combines precision and recall into one measure. </li></ul></ul></ul><ul><ul><ul><li>F-Measure is the harmonic mean of precision and recall. </li></ul></ul></ul><ul><ul><ul><li>For F to be large, both p and r must be large. </li></ul></ul></ul>
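Written out, the harmonic mean of p and r takes the form below. The numbers in the worked example are made up purely to show why both values must be large for F to be large.

```latex
F = \frac{2\,p\,r}{p + r}
\qquad\text{e.g. } p = 0.9,\; r = 0.5:\quad
F = \frac{2 \cdot 0.9 \cdot 0.5}{0.9 + 0.5} = \frac{0.9}{1.4} \approx 0.643
```

The arithmetic mean of the same two values would be 0.7, so the harmonic mean is pulled toward the smaller of the two.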
Results <ul><li>Without Feature Selection </li></ul><ul><ul><li>Leukemia: Naive Bayes and SVM perform better </li></ul></ul><ul><ul><li>DLBCL: KNN and SVM perform better </li></ul></ul>Cross-validation results are lower; cross-validation uses nearly all the data for training and testing, giving a more realistic estimate.
Results (cont…) <ul><li>With Feature Selection </li></ul><ul><ul><li>Leukemia: KNN and SVM perform better </li></ul></ul><ul><ul><li>DLBCL: NB and SVM perform better </li></ul></ul>There is an increase in the overall accuracy, more noticeable in DLBCL.
Results (cont…) <ul><li>Summary of classification accuracies with cross-validation </li></ul><ul><li>F-Measures for both datasets with and without feature selection </li></ul>
Relevance of Results <ul><li>Performance depends on the characteristics of the problem, the quality of the measurements in the data, and the capabilities of the classifier in finding regularities in the data. </li></ul><ul><li>Feature selection helps to minimize the use of redundant and/or noisy features. </li></ul><ul><li>SVM gave the best results; it performs well with high-dimensional data and also benefits from feature selection. </li></ul><ul><li>Decision Trees had the worst overall performance; however, they still work at a competitive level. </li></ul>
Relevance of Results (cont…) <ul><li>Surprisingly, KNN behaves relatively well despite its simplicity; this simplicity allows it to scale well to large feature spaces. </li></ul><ul><li>In the case of the Leukemia dataset, very high accuracies were achieved for all the algorithms. Perfect accuracy was achieved in many cases. </li></ul><ul><li>The DLBCL dataset shows lower accuracies, although using feature selection improved them. </li></ul><ul><li>Overall, the observations from the accuracy results are consistent with those from the F-measure, giving us confidence in the relevance of the results obtained. </li></ul>
Conclusions & Future Work <ul><li>Supervised learning algorithms can be used for the classification of gene expression data from DNA microarrays with high accuracy. </li></ul><ul><li>SVM, by its very nature, deals well with high-dimensional gene expression data. </li></ul><ul><li>We have verified that there are subsets of features (genes) that are more relevant than others and better separate the classes. </li></ul><ul><li>The choice of one algorithm over the others should be evaluated on a case-by-case basis. </li></ul>
Conclusions & Future Work (cont…) <ul><li>The use of feature selection proved beneficial for the overall performance of the algorithms. This idea can be extended to the use of other feature selection methods or data transformations such as PCA. </li></ul><ul><li>Analysis of the effect of noisy gene expression data on the reliability of the classifier. </li></ul><ul><li>While the scope of our experimental results is confined to a couple of datasets, the analysis can be used as a baseline for future use of supervised learning algorithms for gene expression data. </li></ul>
References <ul><li>T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring. Science, Vol. 286, 531–537, 1999. </li></ul><ul><li>A. Rosenwald, G. Wright, W. C. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. New England Journal of Medicine, Vol. 346, 1937–1947, 2002. </li></ul><ul><li>Terrence S. Furey, Nello Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, Vol. 16, 906–914, 2001. </li></ul><ul><li>I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. BIOWulf Technical Report, 2000. </li></ul><ul><li>Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, 2004. </li></ul><ul><li>Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition. Morgan Kaufmann Publishers, 2005. </li></ul><ul><li>Wikipedia: www.wikipedia.org </li></ul><ul><li>Alvis Brazma, Helen Parkinson, Thomas Schlitt, Mohammadreza Shojatalab. A quick introduction to elements of biology-cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute. </li></ul>