Towards Prediction of Platinum Treatment Response in
Ovarian Cancer using Machine Learning Approaches
Abstract
Materials and Methods
Antoaneta Vladimirova, Richard Goold, Tod Klingler
Station X Inc., San Francisco, CA 94107, www.stationxinc.com
Figure 8. RNA-seq data was explored with deep-learning artificial neural networks (ANN)
8.A. and 8.B. Exploration of different architectures, showing ANN accuracy (8.A., increasing)
and ANN loss (8.B., decreasing) across epochs for the train and dev sets, using the top 50
differentially expressed genes of platinum-sensitive vs. platinum-resistant unscaled RNA-seq
samples. 8.C. Graphical representation of the best-performing tuned ANN and its best
parameters. 8.D. and 8.E. ANN deep-learning results after model tuning for the train set (8.D.)
and the test set (8.E.).
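The network in 8.C. survives only as diagram labels (ReLU hidden layers, sigmoid output), so the following is a minimal NumPy sketch of what a forward pass, loss and accuracy computation for such a binary classifier could look like. The 50 inputs mirror the top 50 differentially expressed genes named in the caption; the hidden layer sizes (16 and 8), the weight initialization and the random demo data are hypothetical and are not taken from the poster, whose models were built with Keras.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layers (He-style scaling)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(X, params):
    """ReLU on every hidden layer, sigmoid on the single output unit."""
    a = X
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W, b = params[-1]
    return sigmoid(a @ W + b).ravel()

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross-entropy, the usual loss for a sigmoid output."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def accuracy(y_true, p, threshold=0.5):
    return float(np.mean((p >= threshold).astype(int) == y_true))

# Hypothetical demo: 8 samples x 50 genes, labels 1 = resistant, 0 = sensitive.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(8, 50))
y_demo = rng.integers(0, 2, size=8)
params = init_params([50, 16, 8, 1])  # layer sizes are assumptions
p_demo = forward(X_demo, params)
print(binary_cross_entropy(y_demo, p_demo), accuracy(y_demo, p_demo))
```

Tracking these two quantities per epoch on the train and dev sets is what panels 8.A. and 8.B. plot; the training loop itself is omitted here.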
GenePool [1], a cloud-based Software as a Service (SaaS) platform built by Station
X, was used for analysis and visualization in assessing ovarian cancer primary
tumor patient samples with clinical and molecular data from The Cancer Genome
Atlas (TCGA) Consortium [2] and the Cancer Genomics Hub [3]. Clinical guidelines
on sensitivity and resistance to platinum drugs were researched and implemented
as additional annotation to the TCGA ovarian cancer clinical data and used for
further analysis in this project [4]. GenePool contains a comprehensive collection of
TCGA data covering thirty-three cancers, including ovarian cancer, with curated
clinical data along with data from six molecular
assays (copy number, methylation, miRNA-seq, protein expression, RNA-seq and
somatic mutations) from primary tumors and other patient tissues. For this
analysis we used ovarian (OV) primary tumor samples with their corresponding
clinical data and RNA-seq data to investigate the ability to predict platinum
resistance from clinical and expression data. We used the built-in analytical
workflows and visualization tools in GenePool as well as traditional machine
learning and deep learning methods. The results of the presented analysis were
generated and graphed with the Python (3.6) language [6], the Anaconda (3)
platform and a variety of Python libraries [5], including NumPy (1.13.1),
Pandas (0.20.3), Scikit-learn (0.19.0), Matplotlib (2.0.2) and Keras (2.0.7).
Figure 3. Selecting clinical data, building and evaluating models, and feature importances
3.A. Ovarian cancer TCGA data was split into “Resistant” and “Sensitive” sample sets based on “Platinum
Status” and “Platinum Interval” values. All evaluated samples are TCGA ovarian primary tumors with
existing Platinum Status and positive Platinum Interval. For “Extreme” samples, the top quartile of
“Resistant” and bottom quartile of “Sensitive” samples were omitted from model building to allow cleaner
non-overlapping values and signal. 3.B. Four models (NB, LR, RF and SVM) with tuned hyperparameters were
built on clinical data transformed for machine learning, to establish a baseline for platinum
status prediction without molecular data. 3.C. The tuned Random Forest (RF) model with optimal
hyperparameters was used to evaluate and plot the clinical feature importances that contribute to the
model's prediction of platinum resistance.
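The sample selection in 3.A. can be illustrated with a short pandas sketch. The column names “Platinum Status” and “Platinum Interval” come from the caption; the label strings, the interval unit (days) and the demo values are hypothetical. The sketch also assumes that “top quartile” and “bottom quartile” refer to Platinum Interval values, so the samples dropped are those closest to the boundary between the two groups.

```python
import pandas as pd

def select_extreme_samples(df, status_col="Platinum Status",
                           interval_col="Platinum Interval"):
    """Keep samples with a known status and positive interval, then drop the
    top quartile (longest intervals) of Resistant samples and the bottom
    quartile (shortest intervals) of Sensitive samples."""
    df = df[df[status_col].notna() & (df[interval_col] > 0)]
    resistant = df[df[status_col] == "Resistant"]
    sensitive = df[df[status_col] == "Sensitive"]
    r_cut = resistant[interval_col].quantile(0.75)  # upper-quartile boundary
    s_cut = sensitive[interval_col].quantile(0.25)  # lower-quartile boundary
    return pd.concat([
        resistant[resistant[interval_col] <= r_cut],
        sensitive[sensitive[interval_col] >= s_cut],
    ])

# Hypothetical demo data; intervals are in days.
demo = pd.DataFrame({
    "Platinum Status": ["Resistant"] * 6 + ["Sensitive"] * 8 + [None, "Sensitive"],
    "Platinum Interval": [30, 60, 90, 120, 150, 170,
                          200, 300, 400, 500, 600, 700, 800, 900,
                          250, 0],
})
extreme = select_extreme_samples(demo)
print(extreme.groupby("Platinum Status")["Platinum Interval"].agg(["min", "max", "count"]))
```

The last two demo rows (missing status, non-positive interval) are removed by the first filter, matching the caption's requirement of an existing Platinum Status and a positive Platinum Interval.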
•  Machine learning approaches reach about 80% accuracy in predicting platinum resistance/sensitivity
status using TCGA ovarian primary tumor clinical data and RNA-seq
•  ANNs performed best across all tuned models, likely due to model complexity
•  Prediction of status with clinical data alone is about 75% accurate and can offer some interpretation
•  Tuning of all methods led to a moderate increase in accuracy
•  Limiting samples led to a moderate increase in accuracy, while scaling data did appear to help
•  To sign up for a free trial GenePool account or to subscribe to GenePool please go to
www.stationxinc.com/#introducing-genepool
•  More information on Station X and GenePool platform can be obtained at www.stationxinc.com
•  For more information on this poster please contact antoaneta@stationxinc.com
•  Follow us on Twitter @StationXInc
1. GenePool by Station X, Inc.: http://www.stationxinc.com/
2. The Cancer Genome Atlas Consortium: http://cancergenome.nih.gov/
3. The Cancer Genomics Hub: https://cghub.ucsc.edu/
4. Integrated genomic analyses of ovarian carcinoma, Nature, vol 474, June 2011
5. Python, Anaconda platform, and Scikit-learn, NumPy, Pandas, Matplotlib and Keras libraries:
https://www.python.org, https://www.anaconda.com, http://scikit-learn.org,
http://www.numpy.org, http://pandas.pydata.org, https://matplotlib.org, https://keras.io
6. Python language for scripting: https://docs.python.org/3/reference/index.html
Ovarian cancer is the eighth most prevalent cancer in women in
the US and the fifth most common cause of cancer death. The
standard of care for ovarian cancer consists of a combination of
surgery and chemotherapy, typically platinum-taxane treatment.
However, many patients develop platinum resistance, defined
as a relapse within 6 months of starting platinum
chemotherapy. Predictive methods for early evaluation of the
potential for platinum resistance may benefit patients by
identifying those who might be better served with alternative
second-line therapies or by enrollment in relevant clinical trials.
 
In this study we generate and evaluate predictive models of
resistance and sensitivity to platinum drugs in ovarian cancer
by using the GenePool™ genomics platform and the integrated
TCGA (The Cancer Genome Atlas) ovarian cancer RNA-seq
and clinical data, including manually curated platinum status
information for each patient, and complement these analyses
with a battery of machine learning approaches.

We use GenePool best-practice RNA-seq workflows on
ovarian primary tumors to derive and prioritize the genes most
strongly associated with platinum response. First, using clinical
information, we define two ovarian cancer cohorts, one
platinum-sensitive and the other platinum-resistant. We then
compare the expression levels of all genes in these cohorts and
identify those that are most differentially expressed. The gene
expression results are analyzed using a variety of classifier
approaches, such as logistic regression, support vector
machines and deep learning techniques, to create, optimize and
evaluate predictive models of platinum sensitivity status. Our
workflows leverage cross-validation, dimensionality reduction and a
variety of performance metrics to evaluate our models, as well
as visualization of results to facilitate interpretation.
We demonstrate a combination of approaches to derive and
validate predictive models of platinum response in ovarian
cancer and illustrate the potential of similar approaches to
benefit cancer patient care.
Results
Figure 5. Model performance evaluation with best percent features and default model hyperparameters
on scaled and unscaled data, and on all and “extreme” samples
All four models were evaluated with the best number of features and default hyperparameters on scaled and
unscaled data, and on all or “extreme” samples, and cross-validation data was plotted with box-plots (5A. NB,
5B. LR, 5C. RF, 5D. SVM).
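A comparison of this kind can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the poster's code: the data is synthetic (`make_classification`), `GaussianNB` stands in for the unspecified Naïve Bayes variant, `StandardScaler` stands in for the unspecified scaling method, and `max_iter=1000` and `n_estimators=100` deviate slightly from library defaults to keep the run stable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples x genes expression matrix and binary status.
X, y = make_classification(n_samples=120, n_features=50, n_informative=8,
                           random_state=0)

models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def compare_scaling(models, X, y, cv):
    """Return per-fold accuracies for each model, with and without scaling."""
    results = {}
    for name, model in models.items():
        unscaled = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        # The scaler sits inside the pipeline so it is fit on each train fold only.
        scaled = cross_val_score(make_pipeline(StandardScaler(), model),
                                 X, y, cv=cv, scoring="accuracy")
        results[name] = {"unscaled": unscaled, "scaled": scaled}
    return results

results = compare_scaling(models, X, y, cv)
for name, r in results.items():
    print(name, round(r["unscaled"].mean(), 3), round(r["scaled"].mean(), 3))
```

The per-fold arrays in `results` are the kind of data a box-plot per model would be drawn from.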
Figure 1. Clinical guidelines for platinum sensitivity and resistance were applied to TCGA
ovarian cancer data in GenePool and then used for selection of “Platinum Resistant” and
“Platinum Sensitive” sample groups to analyze with machine learning methods
1.A. Schematic of platinum drug sensitivity and resistance for ovarian cancer patients based on
clinical guidelines implemented via manual curation in the GenePool genomics platform. 1.B.
GenePool was used to select the platinum-resistant and platinum-sensitive groups, each with
days to status, and their respective RNA-seq data to predict treatment outcomes with
machine learning approaches.
Figure 2 workflow diagram (text recovered from the graphic):
•  Input: GenePool TCGA data (RNA-seq and clinical data)
•  Samples: all primary tumors, or “extreme” primary tumors (reduce to most informative samples)
•  Features: all genes (54K) reduced to a subset of genes by dimensionality reduction (ANOVA), giving a subset of genes and samples, +/- data scaling
•  Split: train set (75%) and test set (25%)
•  K-fold cross-validation: dev (1 fold) and train ((K-1) folds); build classifiers on the K train sets, test models on the K dev sets with the same pipeline, then repeat and average over all K folds
•  Classifiers: NB = Naïve Bayes, LR = Logistic Regression, RF = Random Forest, SVM = Support Vector Machine, ANN = Artificial Neural Network
•  Hyperparameter tuning and model evaluation on a performance metric
•  Evaluate models with the best hyperparameters on the full train set, with learning curves (bias and variance)
•  Performance evaluation on the test set
Figure 4. Feature number evaluation with default hyperparameters for each model
All four models were evaluated with cross-validation across a variety of percentages of the original number
of features/genes as assessed by ANOVA dimensionality reduction analysis on unscaled data. The mean and
standard deviation values of 5 cross-validation pipelines were plotted and displayed for each model,
respectively: 4A. NB, 4B. LR, 4C. RF, 4D. SVM.
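The feature-number evaluation can be sketched as a sweep over ANOVA-selected percentages of features. This is a minimal illustration on synthetic data: `SelectPercentile` with `f_classif` is scikit-learn's ANOVA F-test selector and is assumed here as the selection step, and the percentile grid, the logistic regression settings and the dataset are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an expression matrix (samples x genes).
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

def sweep_percentiles(model, X, y, percentiles, cv=5):
    """Cross-validate the model on the top pct% of features ranked by ANOVA F-score.

    Returns {pct: (mean accuracy, standard deviation)} over the cv folds.
    """
    summary = {}
    for pct in percentiles:
        pipe = Pipeline([
            ("anova", SelectPercentile(f_classif, percentile=pct)),
            ("clf", model),
        ])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
        summary[pct] = (scores.mean(), scores.std())
    return summary

percentiles = [1, 5, 10, 25, 50, 100]  # hypothetical grid
summary = sweep_percentiles(LogisticRegression(max_iter=1000), X, y, percentiles)
best_pct = max(summary, key=lambda p: summary[p][0])
print(best_pct, summary[best_pct])
```

Placing the selector inside the pipeline means genes are re-ranked on each training fold, which keeps the dev fold out of the feature selection; whether the original analysis was structured this way is not stated in the poster.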
Figure 6. Model performance evaluation with tuned hyperparameters on scaled and unscaled
data, and on all and “extreme” samples on full train data set
All four models were tuned and evaluated with the best hyperparameters on scaled and unscaled data,
and on all or “extreme” samples, and cross-validation data was plotted with box-plots (6A. All
samples, unscaled; 6B. Extreme samples, unscaled; 6C. All samples, scaled; 6D. Extreme samples,
scaled). Test set results are displayed in green in the box-plots for each tuned model.
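Tuning followed by a single held-out evaluation might look like the sketch below. It uses the 75%/25% train/test split from the workflow diagram; the synthetic data, the choice of SVM as the example model and the parameter grid are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the selected genes and platinum status labels.
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Hold out 25% as the test set; tuning only ever sees the 75% train set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical grid; the poster does not list the values that were searched.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation over the grid, then a refit on the full train set.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# One final score on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

`best_score_` is the mean cross-validated accuracy for the winning setting, while `test_accuracy` plays the role of the green test set marker described in the caption.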
For further information
References
Conclusions
[Figure 8.C. diagram labels: Input; Hidden Layer 1 (ReLU); Hidden Layer (ReLU); Output (Sigmoid)]
Figure 7. Learning curves for all models and gene coefficients with LR model
Learning curves for all four hyperparameter-tuned models: cross-validated performance on the train
and dev sets for different numbers of samples, using all samples of unscaled data (7A. NB,
7B. LR, 7C. RF, 7D. SVM). 7E. Plot of gene coefficients for the LR model on all unscaled data.
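A learning curve of this kind can be computed with scikit-learn's `learning_curve`, as in the minimal sketch below. The synthetic data, the training-size grid and the logistic regression settings are assumptions; plotting is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the selected genes and platinum status labels.
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Fit on growing fractions of each training fold and score train and dev folds.
sizes, train_scores, dev_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)  # one value per training size
dev_mean = dev_scores.mean(axis=1)
gap = train_mean - dev_mean             # a large gap suggests high variance

for n, tr, dv in zip(sizes, train_mean, dev_mean):
    print(n, round(tr, 3), round(dv, 3))
```

Low scores on both curves point to bias, while a persistent gap between them points to variance, which is the reading the workflow diagram refers to as “bias and variance”.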
Figure 2. Workflow diagram of evaluating machine learning models on ovarian TCGA RNA-seq data
The schematic represents using TCGA clinical and molecular data, selecting samples and features
(genes), splitting into training, development and test data sets, building several models, model
cross-validation, hyperparameter tuning for best results, evaluation, and model scoring on test data.
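Test set accuracy values recovered from the figure panels (the extracted text does not preserve which panel or model each value belongs to):
0.734, 0.734, 0.750, 0.750
0.734, 0.641, 0.750, 0.750
0.694, 0.653, 0.735, 0.694
0.694, 0.776, 0.735, 0.735
Additional accuracy values: 0.88, 0.81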

Towards Prediction of Platinum Treatment Response in Ovarian Cancer using Machine Learning Approaches, ASHG 2017
