Glioblastoma_Linkedin

Classifying Brain Cancer Subtypes using
Statistical Methods
Elsa Fecke
Boston University Bioinformatics
summer 2016

Project Overview
Question: how can we optimize classification accuracy for prediction of brain cancer subtype?
Approaches: try a variety of methods to see what works best…
1. Use gene expression data and copy number data, both separately and together
2. Compare different statistical methods

Project Overview
Glioma = brain tumor arising from glial cells (ranges from low grade to malignant)
Glioblastoma = malignant glioma (brain cancer)
Brain cancer subtypes:
• Mesenchymal (MES)
• Proneural (PN)
• Proliferative (PRO)

Data
Gene expression (GE) data → collected via Affymetrix microarray analysis
• Microarrays measure GE and produce different colored dots to indicate different GE levels
• Colors represent numeric values
Copy number (CN) data → how many copies a person has of a gene, or a portion of a gene

Original Data
• We have 3 types of patients
• and 2 types of data (GE and CN) on 8,569 genes from each patient
• Each gene has a GE level and a CN associated with it
Patient subtype      Patients   Gene expression levels (8,569 per patient)   Copy numbers (8,569 per patient)
Mesenchymal (MES)    146        GE1,1 , GE1,2 , … , GE1,8569                 CN1,1 , CN1,2 , … , CN1,8569
                                GE146,1 , GE146,2 , … , GE146,8569           CN146,1 , CN146,2 , … , CN146,8569
Proliferative (PRO)  22         GE147,1 , GE147,2 , … , GE147,8569           CN147,1 , CN147,2 , … , CN147,8569
                                GE168,1 , GE168,2 , … , GE168,8569           CN168,1 , CN168,2 , … , CN168,8569
Proneural (PN)       61         GE169,1 , GE169,2 , … , GE169,8569           CN169,1 , CN169,2 , … , CN169,8569
                                GE229,1 , GE229,2 , … , GE229,8569           CN229,1 , CN229,2 , … , CN229,8569
(For each subtype, the first and last patient are shown.)

Partitioned Data for Binary Classification
• Interested in pairwise comparisons (i.e., two subtypes in one dataset)
• Can we predict brain cancer subtype with higher accuracy using GE and CN at the same time (i.e., combined)?
*Gene-level copy number (GLCN) is an individual’s copy number converted to the gene level
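The pairwise partitioning can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `pairwise_datasets` is hypothetical, the toy feature values are made up, and "combined" is assumed to mean placing each patient's GE and CN features side by side. The patient counts come from the data table.

```python
from itertools import combinations

# Patient counts per subtype, taken from the data table.
SUBTYPE_COUNTS = {"MES": 146, "PRO": 22, "PN": 61}

def pairwise_datasets(ge, cn, labels):
    """Build one binary dataset per subtype pair, in GE, CN and combined form.

    ge, cn: lists of per-patient feature lists (same patient order).
    labels: list of subtype strings, one per patient.
    """
    datasets = {}
    for a, b in combinations(sorted(set(labels)), 2):
        keep = [i for i, lab in enumerate(labels) if lab in (a, b)]
        datasets[f"{a}-{b}"] = {
            "labels": [labels[i] for i in keep],
            "GE": [ge[i] for i in keep],
            "CN": [cn[i] for i in keep],
            # Assumption: "combined" means GE and CN features side by side.
            "combined": [ge[i] + cn[i] for i in keep],
        }
    return datasets

# Toy stand-in: 3 genes per patient instead of 8,569, with made-up values.
labels = [s for s, n in SUBTYPE_COUNTS.items() for _ in range(n)]
ge = [[0.0, 1.0, 2.0] for _ in labels]
cn = [[2.0, 2.0, 3.0] for _ in labels]
data = pairwise_datasets(ge, cn, labels)
for name, d in data.items():
    print(name, len(d["labels"]), len(d["combined"][0]))
```

With the real data, each combined row would hold 2 × 8,569 values per patient under this assumption.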

Statistical Methods
Naïve Bayes Classifier
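Formula:
$$P(C \mid x_i) = \frac{P(x_i \mid C)\,P(C)}{P(x_i)}$$
where $x_i$ is some GE level or CN and $C$ is the class (i.e., subtype)
• Key assumption: all features are independent
Expansion (over all $n$ features, using the independence assumption):
$$P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$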
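A naïve Bayes classifier along these lines might look like the sketch below. It assumes each feature follows a Gaussian distribution within each class, which the slides do not specify, and it uses made-up toy values. The function names are hypothetical. Because the denominator P(x) is the same for every class, the sketch compares only log P(C) plus the sum of log P(x_i | C).

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate P(C) and a per-feature mean/variance for each class C."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalised log P(C | x) = log P(C) + sum_i log P(x_i | C)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xi, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    """Return the class with the highest (unnormalised) log posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Made-up toy values: two features per patient.
X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [3.0, 1.0], [3.2, 0.8], [2.9, 1.1]]
y = ["MES", "MES", "MES", "PN", "PN", "PN"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 5.0]))
print(predict(model, [3.1, 0.9]))
```

Working in log space avoids numerical underflow, which would otherwise be a concern when multiplying thousands of small probabilities.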

Statistical Methods
K-Nearest Neighbors Classifier
• Given an unknown patient, this classifier looks to the unknown patient’s k nearest neighbors (where k is some positive integer) to decide what subtype the patient likely has
• Identifies the K neighboring points closest to the “unknown”, or unlabeled, patient
• Classifies patient subtype based on the majority vote of its K neighbors
[Figure: patients plotted on axes GE1 and GE2, with unlabeled patients marked “?”]
*try to imagine this in 8,569 dimensions…
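The neighbor search and majority vote can be sketched in a few lines. This is an illustration under assumptions: Euclidean distance is used because the slides do not name a metric, the function name `knn_predict` is hypothetical, and the toy values are made up.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance is an assumption; the slides do not name a metric.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up toy values: two GE levels (GE1, GE2) per patient.
train_X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [3.0, 1.0], [3.2, 0.8], [2.9, 1.1]]
train_y = ["MES", "MES", "MES", "PN", "PN", "PN"]
print(knn_predict(train_X, train_y, [1.1, 4.9], k=3))
print(knn_predict(train_X, train_y, [3.1, 1.0], k=3))
```

An odd k avoids tied votes when there are only two subtypes in the dataset.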

Statistical Methods
Support Vector Machine
SVM can classify observations based on non-linear decision boundaries
[Figure panels: Support Vector Classifier; SVM with an RBF* Kernel]
*RBF (radial basis function) kernel allows the machine to classify observations using
more complex functions rather than using a simple linear boundary
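To make the kernel idea concrete, the sketch below compares a linear kernel (a plain dot product) with the RBF kernel K(x, z) = exp(-gamma · ||x - z||²). The value of `gamma` is an arbitrary illustrative choice, not a setting taken from the project, and the function names are hypothetical.

```python
import math

def linear_kernel(x, z):
    """Dot product: the similarity used by a linear support vector classifier."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); equals 1 when x == z."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a, b, c = [0.0, 0.0], [1.0, 1.0], [3.0, 3.0]
print(rbf_kernel(a, a))   # identical points
print(rbf_kernel(a, b))   # nearby point
print(rbf_kernel(a, c))   # distant point: similarity decays toward 0
```

Because the RBF similarity depends only on distance and decays smoothly, a classifier built on it can draw curved boundaries around clusters of patients rather than a single straight line.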

Statistical Methods
Random Forest Classifier
• A random forest (RF) consists of several classification trees
• Each classification tree is built from a different random sample of genes
• In reality, the random sample (or each tree) has hundreds of genes in it
• The RF can have hundreds of these trees
[Figure: Classification Tree Example for Gene 1 and Gene 3]

Statistical Methods
Random Forest Classifier
• Once the RF is built, we can plug in new, unlabeled data (i.e., a patient with an unknown subtype) and predict the patient’s subtype with the RF
• The 8,569 GE levels for the “unclassified patient” are plugged into the RF
• RF decides patient subtype based on majority vote
[Figure: unlabeled patient “?” passed to Tree 1, . . . , Tree 50]
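The build-then-vote idea can be sketched as below. This is a deliberate simplification, not the project's implementation: each "tree" is a single-split decision stump rather than a full classification tree, each stump is fit on a bootstrap sample of patients and a random sample of genes, and all names and toy values are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """One-split 'tree': best (feature, threshold) among the sampled genes."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback if no split is possible: always predict the majority label.
    best = (len(y) + 1, features[0], float("inf"), majority, majority)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = Counter(lab for row, lab in zip(X, y) if row[f] <= t)
            right = Counter(lab for row, lab in zip(X, y) if row[f] > t)
            if not left or not right:
                continue
            l_lab = left.most_common(1)[0][0]
            r_lab = right.most_common(1)[0][0]
            errors = len(y) - left[l_lab] - right[r_lab]
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def fit_forest(X, y, n_trees=50, n_features=2, seed=0):
    """Fit n_trees stumps, each on resampled patients and a random gene subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]         # bootstrap sample of patients
        feats = rng.sample(range(len(X[0])), n_features)  # random sample of genes
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict_forest(forest, x):
    """Each tree votes for a subtype; the majority wins."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Made-up toy values: three genes per patient.
X = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.1], [0.9, 5.1, 0.3], [1.1, 4.9, 0.2],
     [3.0, 1.0, 0.9], [3.2, 0.8, 1.0], [2.9, 1.1, 0.8], [3.1, 0.9, 0.9]]
y = ["MES"] * 4 + ["PN"] * 4
forest = fit_forest(X, y)
print(predict_forest(forest, [1.0, 5.0, 0.2]))
print(predict_forest(forest, [3.0, 1.0, 0.9]))
```

A real random forest grows deeper trees and typically samples genes at every split, but the voting step works the same way.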

Results
• Which algorithm performed the best, on average?
– Random forest → 85.09% classification accuracy
• What type of data was most useful, on average?
– The combined data (GE and CN) → 0.55% higher classification accuracy than GE data alone
– GE data alone performs 8.24% better than CN data alone
• Highest individual classification accuracies:
1. RF classifier on the MES-PN GE dataset → 94.69%
2. NB classifier on the MES-PN combined dataset → 93.72%
3. RF classifier on the MES-PN combined dataset → 92.75%
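Classification accuracy is the share of patients whose subtype is predicted correctly. As a consistency check only, the sketch below shows that the three top accuracies match 196, 194 and 192 correct predictions out of the 207 MES-PN patients (146 + 61). These counts are an inference from the reported percentages. The slides do not state the counts or the evaluation protocol, so the real figures may have been obtained differently.

```python
def accuracy_percent(n_correct, n_total):
    """Classification accuracy as a percentage, rounded to two decimals."""
    return round(100 * n_correct / n_total, 2)

n_mes_pn = 146 + 61  # MES and PN patients in the data table
for n_correct in (196, 194, 192):  # inferred counts, not stated in the slides
    print(n_correct, accuracy_percent(n_correct, n_mes_pn))
```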
