Classifying Brain Cancer Subtypes using Statistical Methods
Elsa Fecke
Boston University Bioinformatics
Summer 2016
Project Overview
Question: How can we optimize classification accuracy for prediction of brain cancer subtype?
Approaches: Try a variety of methods to see what works best…
1. Use gene expression data and copy number data, both separately and together
2. Compare different statistical methods
Project Overview
Glioma = brain tumor arising from glial cells (ranges from low-grade to malignant)
Glioblastoma = malignant brain tumor (cancer); the most aggressive type of glioma
Brain cancer subtypes:
• Mesenchymal (MES)
• Proneural (PN)
• Proliferative (PRO)
Data
Gene expression (GE) data → collected via Affymetrix microarray analysis
• Microarrays measure GE and produce different colored dots to indicate different GE levels
• Colors represent numeric values
Copy number (CN) data → how many copies a person has of a gene, or a portion of a gene
Original Data
• We have 3 types of patients
• We have 2 types of data (GE and CN) on 8,569 genes from each patient
• Each gene has a GE level and a CN associated with it

Patient subtype      Patients  Gene expression levels                  Copy numbers
                               (8,569 per patient)                     (8,569 per patient)
Mesenchymal (MES)    146       GE1,1 , GE1,2 , … , GE1,8569            CN1,1 , CN1,2 , … , CN1,8569
                               …                                       …
                               GE146,1 , GE146,2 , … , GE146,8569      CN146,1 , CN146,2 , … , CN146,8569
Proliferative (PRO)  22        GE147,1 , GE147,2 , … , GE147,8569      CN147,1 , CN147,2 , … , CN147,8569
                               …                                       …
                               GE168,1 , GE168,2 , … , GE168,8569      CN168,1 , CN168,2 , … , CN168,8569
Proneural (PN)       61        GE169,1 , GE169,2 , … , GE169,8569      CN169,1 , CN169,2 , … , CN169,8569
                               …                                       …
                               GE229,1 , GE229,2 , … , GE229,8569      CN229,1 , CN229,2 , … , CN229,8569
Partitioned Data for Binary Classification
• Interested in pairwise comparisons (i.e., two subtypes in one dataset)
• Can we predict brain cancer subtype with higher accuracy using GE and CN at the same time (i.e., combined)?
*Gene-level copy number (GLCN) is an individual’s copy number converted to the gene level
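The partitioning described above can be sketched in a few lines. This is a toy illustration, not the project's actual pipeline: the patient records, the `features` and `pairwise_datasets` helpers, and the choice to build the "combined" dataset by concatenating each patient's GE and CN values are all assumptions made for the example (the slides do not say how the two data types were combined).

```python
from itertools import combinations

# Toy stand-in for the real data: 3 genes per patient instead of 8,569.
patients = [
    {"subtype": "MES", "ge": [7.1, 5.2, 9.0], "cn": [2, 2, 3]},
    {"subtype": "MES", "ge": [6.8, 5.9, 8.7], "cn": [2, 1, 3]},
    {"subtype": "PRO", "ge": [4.2, 8.1, 6.5], "cn": [2, 4, 2]},
    {"subtype": "PN", "ge": [5.5, 6.0, 3.9], "cn": [1, 2, 2]},
    {"subtype": "PN", "ge": [5.1, 6.4, 4.2], "cn": [2, 2, 2]},
]


def features(patient, data_type):
    """Return one patient's feature vector for the requested data type."""
    if data_type == "GE":
        return list(patient["ge"])
    if data_type == "CN":
        return list(patient["cn"])
    if data_type == "combined":
        # Assumption: "combined" means GE values followed by CN values.
        return list(patient["ge"]) + list(patient["cn"])
    raise ValueError(f"unknown data type: {data_type}")


def pairwise_datasets(patients, data_types=("GE", "CN", "combined")):
    """Build one (X, y) dataset per subtype pair and data type."""
    subtypes = sorted({p["subtype"] for p in patients})
    datasets = {}
    for a, b in combinations(subtypes, 2):
        subset = [p for p in patients if p["subtype"] in (a, b)]
        for data_type in data_types:
            X = [features(p, data_type) for p in subset]
            y = [p["subtype"] for p in subset]
            datasets[(f"{a}-{b}", data_type)] = (X, y)
    return datasets


datasets = pairwise_datasets(patients)
for name, data_type in sorted(datasets):
    X, y = datasets[(name, data_type)]
    print(name, data_type, len(X), "patients,", len(X[0]), "features")
```

With three subtypes and three data types this yields nine binary classification datasets; under the concatenation assumption, a combined dataset on the real data would have 2 × 8,569 = 17,138 features per patient.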
Statistical Methods
Naïve Bayes Classifier
• Key assumption: all features are independent
Formula:
$$P(C \mid x_i) = \frac{P(x_i \mid C)\,P(C)}{P(x_i)}$$
where $x_i$ is some GE level or CN and $C$ is the class (i.e., subtype)
Expansion: [equation not preserved in the slide text]
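Because the features are assumed independent given the class, the score for a class is the prior P(C) multiplied by the per-feature likelihoods P(x_i | C), and the predicted subtype is the class with the highest score. The sketch below shows one way this could look. It assumes a Gaussian likelihood for each feature, which the slides do not specify, and the function names, the `var_floor` parameter, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-6):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the largest log P(C) + sum_i log P(x_i | C)."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, s2 in zip(x, means, variances):
            # Log of the Gaussian density; summing logs avoids underflow
            # when thousands of feature likelihoods are multiplied.
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two features (think GE1, GE2) for three MES and three PN patients.
X_train = [[7.0, 2.0], [7.4, 2.2], [6.8, 1.9], [4.0, 5.0], [4.3, 5.4], [3.9, 4.8]]
y_train = ["MES", "MES", "MES", "PN", "PN", "PN"]

nb_model = fit_gaussian_nb(X_train, y_train)
print(predict_gaussian_nb(nb_model, [7.1, 2.1]))
print(predict_gaussian_nb(nb_model, [4.1, 5.1]))
```

The denominator P(x_i) is the same for every class, so it can be dropped when only the most probable class is needed.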
Statistical Methods
K-Nearest Neighbors Classifier
Given an unknown patient, this classifier looks to the unknown patient’s k nearest neighbors (where k is some positive integer) to decide what subtype the patient likely has
• Identifies the k labeled patients closest to the “unknown”, or unlabeled, patient
• Classifies patient subtype based on the majority vote of its k neighbors
[Figure: patients plotted on GE1 and GE2 axes; unlabeled patients are marked “?”]
*Try to imagine this in 8,569 dimensions…
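The two steps of the classifier (find the k closest labeled patients, then take a majority vote) can be sketched as follows. Euclidean distance is an assumption here, since the slides do not name a distance measure, and `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Pair every training patient's distance to x_new with that patient's label.
    distances = sorted(
        (math.dist(row, x_new), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: two expression levels (GE1, GE2) per patient.
X_train = [[7.0, 2.0], [7.4, 2.2], [6.8, 1.9], [4.0, 5.0], [4.3, 5.4], [3.9, 4.8]]
y_train = ["MES", "MES", "MES", "PN", "PN", "PN"]

print(knn_predict(X_train, y_train, [7.1, 2.1], k=3))
print(knn_predict(X_train, y_train, [4.1, 5.1], k=3))
```

The same function works unchanged for 8,569 features per patient, because `math.dist` accepts points of any dimension. Choosing an odd k avoids tied votes when there are only two subtypes.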
Statistical Methods
Support Vector Machine
SVM can classify observations based on non-linear decision boundaries
[Figure panels: Support Vector Classifier; SVM with an RBF* Kernel]
*RBF (radial basis function) kernel allows the machine to classify observations using
more complex functions rather than using a simple linear boundary
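To make the footnote concrete, the sketch below compares a linear kernel with the standard RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2). It only illustrates the similarity function an SVM would use, not a full SVM; the `gamma` value and the toy points are arbitrary choices for the example.

```python
import math


def linear_kernel(x, z):
    """Dot product: the similarity behind a linear decision boundary."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """RBF similarity: 1.0 for identical points, decaying toward 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


patient_a = [1.0, 2.0]
patient_b = [1.0, 2.0]
patient_c = [4.0, 6.0]

print(rbf_kernel(patient_a, patient_b))     # identical patients
print(rbf_kernel(patient_a, patient_c))     # distant patients
print(linear_kernel(patient_a, patient_c))  # linear similarity for comparison
```

Because the RBF similarity depends only on how close two patients are, the resulting decision boundary can curve around clusters of patients. A larger `gamma` makes the similarity fall off faster, which allows a more flexible boundary.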
Statistical Methods
Random Forest Classifier
• A random forest (RF) consists of several classification trees
• Each classification tree is built from a different random sample of genes
• In reality, the random sample (i.e., each tree) has hundreds of genes in it
• The RF can have hundreds of these trees
[Figure: Classification Tree Example for Gene 1 and Gene 3]
Statistical Methods
Random Forest Classifier
• Once the RF is built, we can plug in new, unlabeled data (i.e., a patient with an unknown subtype) and predict the patient’s subtype with the RF
• The 8,569 GE levels for the “unclassified patient” are plugged into the RF
• The RF decides patient subtype based on majority vote
[Figure: Tree 1, …, Tree 50]
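The build-then-vote idea can be sketched with a deliberately simplified forest. In this sketch each "tree" is a single-split decision stump rather than a full classification tree, each stump is trained on a bootstrap sample of patients and a random subset of genes, and prediction is a majority vote. The function names, `genes_per_tree`, the seed, and the toy data are hypothetical; the slides do not describe the software or settings actually used.

```python
import random
from collections import Counter


def fit_stump(X, y, genes):
    """Find the (gene, threshold) split with the fewest training errors."""
    best = None
    for g in genes:
        for t in sorted({row[g] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[g] <= t]
            right = [lab for row, lab in zip(X, y) if row[g] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left-side majority.
            right_label = Counter(right or left).most_common(1)[0][0]
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, g, t, left_label, right_label)
    return best[1:]


def fit_forest(X, y, n_trees=50, genes_per_tree=2, seed=0):
    """Train n_trees stumps, each on resampled patients and random genes."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]  # bootstrap sample of patients
        genes = rng.sample(range(len(X[0])), genes_per_tree)
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], genes))
    return forest


def predict_forest(forest, x):
    """Every tree votes for a subtype; the majority wins."""
    votes = Counter(left if x[g] <= t else right for g, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data: 4 genes per patient instead of 8,569.
X_train = [
    [9.0, 2.0, 8.5, 1.0], [8.6, 2.4, 8.9, 1.3], [9.3, 1.8, 8.1, 0.9],
    [3.0, 7.0, 2.5, 6.0], [3.4, 6.6, 2.9, 6.3], [2.8, 7.2, 2.2, 5.8],
]
y_train = ["MES", "MES", "MES", "PN", "PN", "PN"]

forest = fit_forest(X_train, y_train)
print(predict_forest(forest, [9.5, 1.5, 9.0, 0.5]))
print(predict_forest(forest, [2.5, 7.5, 2.0, 6.5]))
```

Individual stumps can be wrong, for example when a bootstrap sample happens to contain only one subtype, but the vote across all 50 smooths out such errors.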
Results
• What algorithm performed the best, on average?
– Random forest → 85.09% classification accuracy
• What type of data was most useful, on average?
– The combined data (GE and CN) → 0.55% higher classification accuracy than GE data alone
– GE data alone performs 8.24% better than CN data alone
• Highest individual classification accuracies:
1. RF classifier on the MES-PN GE dataset → 94.69%
2. NB classifier on the MES-PN combined dataset → 93.72%
3. RF classifier on the MES-PN combined dataset → 92.75%
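Classification accuracy is the share of patients whose predicted subtype matches their true subtype. As a rough sanity check, the three MES-PN accuracies above are consistent with 196, 194, and 192 correct predictions out of the 207 MES and PN patients (146 + 61). This is an inference from the reported numbers, not something the slides state, and the evaluation procedure (for example, cross-validation) is not described. The `accuracy_percent` helper is hypothetical.

```python
def accuracy_percent(n_correct, n_total):
    """Classification accuracy as a percentage, rounded to two decimals."""
    return round(100 * n_correct / n_total, 2)


n_mes, n_pn = 146, 61      # patient counts from the Original Data slide
n_total = n_mes + n_pn     # 207 patients in the MES-PN dataset

# Assumed counts of correct predictions that would reproduce the reported values.
for n_correct in (196, 194, 192):
    print(n_correct, "of", n_total, "correct:", accuracy_percent(n_correct, n_total))

# For context: always predicting the larger subtype (MES) on this pair.
majority_baseline = accuracy_percent(n_mes, n_total)
print("always-MES baseline:", majority_baseline)
```

The baseline line is included because the MES-PN dataset is imbalanced, so accuracy is best read against what a trivial classifier would already achieve.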
