Glioblastoma subtype classification using machine learning algorithms
Elsa Fecke†, Mark Kon‡, Xinying Mu‡, Mia Li‡
†St. Lawrence University; Boston University BRITE Bioinformatics, Summer 2016, ‡Department of Mathematics and Statistics, Boston University
Abstract

Machine learning (ML) methods for predicting cancer subtype form a growing area of research in computational biology. In this project, we use supervised ML algorithms to predict class membership (e.g., cancer subtype) from biomarker evidence (e.g., gene expression vectors). The goal of this project is to determine how well gene-level copy number (GLCN) data and gene expression (GE) data predict cancer subtype, both separately and together. The principal analysis is conducted on two glioblastoma datasets.1 The GE biomarkers alone yield higher classification accuracies than GLCN alone, and the combined data (GE and GLCN) performs similarly to GE alone; however, there may be other ways to optimize the accuracy of the combined data. The combined data offers a variety of uses, including not only cancer subtype prediction but also survival time prediction and classification of cancer metastasis.5

Introduction

GE data has been widely used to classify cancer subtypes. However, recent developments indicate that copy number variations (CNVs) are also important gene-level biomarkers of cancer subtype.5 CNVs are valuable biomarkers because they vary strongly and consistently across cancer subtypes. Normal cells carry two copies of each gene; a copy number variation occurs when a portion of a gene, or an entire gene, is duplicated or deleted during carcinogenesis.5 Therefore, like GE, CNV can be used to classify cancer subtypes. Combining these methods of subtype classification may provide better diagnosis, prevention, and treatment options: if subtype classification is optimized, the appropriate drugs can be administered to patients with higher confidence.

[Figure: CNV3]
Machine Learning & Dimensionality Reduction
[Figure: Reducing dimensionality]
[Figure: Feature extraction2]
[Figure: Feature selection2]
Machine learning comes in two flavors:
• unsupervised: data with no labels (e.g., clustering, K-means)
• supervised: data with labels (e.g., regression, classification)
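Where the two figures above differ: feature extraction builds new features from combinations of all genes, while feature selection keeps a subset of the original genes. A minimal sketch of both, in Python with scikit-learn (an assumed toolchain; the poster does not name its software, and all data shapes except the 8,569-gene dimension are placeholders):

# Feature extraction (PCA) vs. feature selection (univariate filter).
# Only the 8,569-gene dimension comes from the poster; the patient count
# and values here are random placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 8569))   # placeholder expression matrix
y = rng.integers(0, 2, 130)        # placeholder binary subtype labels

# Extraction: each new feature is a linear combination of all genes.
X_pca = PCA(n_components=50).fit_transform(X)

# Selection: keep 50 of the original genes, ranked by ANOVA F-score.
X_sel = SelectKBest(f_classif, k=50).fit_transform(X, y)
print(X_pca.shape, X_sel.shape)    # both reduce 8569 columns to 50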
Machine Learning Algorithms
Random Forest (RF) Classifier
[Figure: classification tree (CT) for gene X13]

A CT predicts that each observation belongs to the most commonly occurring class among the training observations in the node to which it belongs; the class proportions of the training set in that node are also considered. An RF is a collection of CTs in which each CT is trained on a different, random subset of the training set.6
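As a hedged illustration, the RF idea above can be sketched with scikit-learn (an assumption; the poster does not name its implementation, and the data and hyperparameters below are placeholders):

# Minimal RF sketch with synthetic stand-in data (not the poster's datasets).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for a pairwise GE/GLCN dataset.
X, y = make_classification(n_samples=200, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 500 trees is fit on a bootstrap sample of the training set,
# and each split considers a random subset of features.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # predictions are a majority vote across trees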
K-Nearest Neighbors (k-NN) Classifier

[Figure: Illustration of k-NN4]

k-NN identifies the k points in the training set nearest to the test observation x0. In k-NN classification the response is the class variable, so x0 is assigned a class by majority vote of its k neighbors.
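A minimal k-NN sketch under the same assumptions (scikit-learn, synthetic stand-in data, illustrative k):

# Minimal k-NN sketch with synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# For each test observation x0, find the k nearest training points and
# assign x0 the majority class among those neighbors.
knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an illustrative choice
knn.fit(X_train, y_train)
print(knn.predict(X_test[:3]))  # predicted classes for three test observations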
Support Vector Machine (SVM)

[Figures: SVC4 and SVM with an RBF* kernel4]

The SVM is an extension of the support vector classifier (SVC). An SVC classifies observations using a linear decision boundary, while an SVM can classify observations using non-linear decision boundaries.4
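To see why the non-linear boundary matters, here is a hedged comparison of a linear SVC and an RBF-kernel SVM on synthetic data that no straight line can separate (scikit-learn is an assumed toolchain):

# Linear SVC vs. SVM with an RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=2)

linear_svc = SVC(kernel="linear").fit(X, y)  # linear decision boundary
rbf_svm = SVC(kernel="rbf").fit(X, y)        # non-linear boundary via RBF kernel

# The RBF kernel computes dot products in a high-dimensional space implicitly,
# so the boundary can bend around the inner circle.
print(linear_svc.score(X, y), rbf_svm.score(X, y))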
Naïve Bayes (NB) Classifier

NB assumes independence between all features. Here x_i is a GE or GLCN value and C is the class (i.e., subtype**).

General formula:

    P(C \mid x_i) = \frac{P(x_i \mid C)\, P(C)}{P(x_i)}

Expansion over all 8569 features:

    P(C \mid x_1, x_2, \ldots, x_{8569}) = \frac{P(x_1 \mid C)\, P(x_2 \mid C) \cdots P(x_{8569} \mid C)\, P(C)}{P(x_1)\, P(x_2) \cdots P(x_{8569})}
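A hedged sketch of the classifier above; the poster does not state which NB variant it used, so Gaussian NB (a common choice for continuous features) and the synthetic data are assumptions:

# Gaussian naive Bayes sketch; GaussianNB is an assumed variant choice.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
# Two synthetic "subtypes" with shifted feature means, standing in for GE/GLCN.
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(1.0, 1.0, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# Bayes' rule is applied per feature and the likelihoods are multiplied,
# which is exactly the independence assumption in the expansion above.
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:2]))  # posterior P(C | x1, ..., x20) per sample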
Data Partition for Binary Classification

• Original GE dataset on 8569 genes with MES, PN, and PRO subtypes**
• 3 pairwise GE datasets on 8569 genes: (1) MES-PRO, (2) MES-PN, (3) PN-PRO
• Original GLCN dataset on 8569 genes with MES, PN, and PRO subtypes
• 3 pairwise GLCN datasets on 8569 genes: (1) MES-PRO, (2) MES-PN, (3) PN-PRO
• 3 combined datasets (containing GE & GLCN) on 8569 genes: (1) MES-PRO, (2) MES-PN, (3) PN-PRO
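One plausible way to build these pairwise and combined datasets, sketched with pandas; the frame names, shapes, and label encoding are all assumptions, not the authors' actual pipeline:

# Sketch of the data partition; names, shapes, and labels are placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
patients = [f"p{i}" for i in range(30)]
ge = pd.DataFrame(rng.normal(size=(30, 5)), index=patients)       # GE matrix
glcn = pd.DataFrame(rng.integers(0, 5, (30, 5)), index=patients)  # GLCN matrix
subtype = pd.Series(["MES", "PN", "PRO"] * 10, index=patients)

def pairwise(data, labels, a, b):
    """Restrict a dataset to the two subtypes of one binary problem."""
    keep = labels.isin([a, b])
    return data.loc[keep], labels.loc[keep]

ge_mes_pn, y_mes_pn = pairwise(ge, subtype, "MES", "PN")

# Combined dataset: GE and GLCN features concatenated per patient.
combined = pd.concat([ge.add_prefix("ge_"), glcn.add_prefix("cn_")], axis=1)
comb_mes_pn, _ = pairwise(combined, subtype, "MES", "PN")
print(ge_mes_pn.shape, comb_mes_pn.shape)  # (20, 5) and (20, 10)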
Results

[Figure: heatmap of 50 gene expression profiles, MES patients vs. PN patients]

The three highest LOOCV*** accuracies were obtained with RF on the MES-PN GE dataset (94.69%), NB on the MES-PN combined dataset (93.72%), and RF on the MES-PN combined dataset (92.75%). The MES-PN GE and MES-PN combined datasets performed best, on average, across all algorithms, each with a mean LOOCV accuracy of 92.51%. Across all datasets, RF performed best, with an average LOOCV accuracy of 85.09%.

Conclusion

Combined biomarker data appears to predict cancer subtype better than GE alone by about 0.55%, on average, and better than GLCN alone by about 8.24%, on average. There may be room for improvement in the data structure, but further analysis is required. Supplementary analysis could include data manipulation as well as implementing new ML algorithms, such as clustering.
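For reference, the LOOCV*** procedure used to score each algorithm-dataset pair can be sketched as follows (scikit-learn and the synthetic data are assumptions; only the procedure itself comes from the poster):

# LOOCV sketch; the poster's actual estimators and datasets are not shown here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for one pairwise dataset (e.g., MES-PN).
X, y = make_classification(n_samples=60, n_features=40, random_state=5)

# Train on n-1 samples, predict the held-out sample, repeat n times;
# the mean of the n per-sample results is the LOOCV accuracy.
scores = cross_val_score(RandomForestClassifier(random_state=5), X, y,
                         cv=LeaveOneOut())
print(scores.mean())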
Terminology

*RBF kernel: a function that computes the dot product of points in a high-dimensional space without explicitly mapping them into it.7
**Mesenchymal (MES), ProNeural (PN), and Proliferative (PRO) glioblastoma subtypes.
***LOOCV (leave-one-out cross-validation): leave out one sample, train on the remaining n-1 samples, and record a prediction for the held-out sample; repeat n times to obtain n predictions, which are compared with the n actual labels to yield the LOOCV accuracy rate.
Acknowledgements

This work was funded by the Boston University Bioinformatics BRITE REU program. A special thanks to all of my colleagues for their continued support.
References

1. Alvarez, M.J. (2014). diggit: Inference of genetic variants driving cellular phenotypes. R package version 1.4.0.
2. The approaches for reducing dimensionality [Digital image]. (n.d.). Retrieved from http://www.byclb.com/TR/Tutorials/neural_networks/ch5_1.htm
3. Copy number variation [Digital image]. (n.d.). Retrieved from http://readingroom.mindspec.org/?page_id=8221
4. James, G. et al. (2013). An introduction to statistical learning. New York: Springer Science+Business Media.
5. Kim, S. et al. (2015). A method for generating new datasets based on copy number for cancer analysis. BioMed Research International, 2015, Article ID 467514.
6. Kon, M. (2016). Decision tree, random forests [PowerPoint slides]. Mathematical and Statistical Methods of Bioinformatics, Boston University.
7. Ghose, A. (2016, Feb. 24). Why does the RBF (radial basis function) kernel map into infinite dimensional space? [Web log post]. Retrieved from https://www.quora.com/Why-does-the-RBF-radial-basis-function-kernel-map-into-infinite-dimensional-space
