Towards Prediction of Platinum Treatment Response in Ovarian Cancer using Machine Learning Approaches, ASHG 2017
Abstract
Materials and Methods
Antoaneta Vladimirova, Richard Goold, Tod Klingler
Station X Inc., San Francisco, CA 94107, www.stationxinc.com
Figure 8. RNA-seq data was explored with deep-learning artificial neural networks (ANNs)
8.A. Exploration of different architectures, showing continuously increasing ANN train and dev
accuracy, and 8.B. continuously diminishing ANN train and dev loss across epochs, for the top 50
differentially expressed genes of platinum-sensitive vs. resistant unscaled RNA-seq samples.
8.C. Graphical representation of the best-performing tuned ANN and its best parameters.
ANN deep-learning results after model tuning for the train set (8.D.) and the test set (8.E.).
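The poster's ANN was built with Keras (2.0.7). As an illustrative stand-in that keeps the sketch self-contained, a similar network of ReLU hidden layers feeding a logistic output can be expressed with scikit-learn's MLPClassifier; the data below is synthetic and merely shaped like the top-50-gene input:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 50)                     # stand-in for top 50 differentially expressed genes
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # synthetic sensitive/resistant labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Two ReLU hidden layers with a logistic (sigmoid-like) output, loosely
# mirroring the architecture of Figure 8.C; layer sizes here are assumptions.
ann = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=500, random_state=0)
ann.fit(X_train, y_train)
train_acc = ann.score(X_train, y_train)
test_acc = ann.score(X_test, y_test)
```

Tracking accuracy and loss per epoch, as in panels 8.A and 8.B, is more natural in Keras, where `model.fit` returns a history object.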
GenePool[1], a cloud-based Software as a Service (SaaS) platform built by Station
X, was used for analysis and visualization in assessing ovarian cancer primary
tumor patient samples with clinical and molecular data from The Cancer Genome
Atlas (TCGA) Consortium[2] and the Cancer Genomics Hub[3]. Clinical guidelines
for sensitivity and resistance to platinum drugs were researched, implemented
as additional annotation to the TCGA ovarian cancer clinical data, and used for
further analysis in this project[4]. GenePool contains a comprehensive collection of
The Cancer Genome Atlas data, including ovarian cancer as one of thirty-three
cancers, with curated clinical data alongside data from six molecular
assays (copy number, methylation, miRNA-seq, protein expression, RNA-seq and
somatic mutations) from primary tumors and other patient tissues. For this
analysis we used ovarian (OV) primary tumor samples with their corresponding
clinical data and RNA-seq to investigate the ability to predict platinum-based
resistance from clinical and expression data. We utilized built-in analytical
workflows and visualization tools in GenePool as well as traditional machine
learning and deep learning methods. The Python (3.6) language[6], the Anaconda (3)
platform, and a variety of Python libraries[5] were used to generate and graph the
results of the presented analysis, including NumPy (1.13.1), Pandas (0.20.3),
Scikit-learn (0.19.0), Matplotlib (2.0.2) and Keras (2.0.7).
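A minimal sketch of the platinum-status cohort selection this annotation enables, using Pandas on a hypothetical clinical table; the column names `platinum_status` and `platinum_interval` and the toy values are assumptions, since the real annotations live in GenePool:

```python
import pandas as pd

# Hypothetical clinical table standing in for annotated TCGA ovarian data.
clinical = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "platinum_status": ["Resistant", "Sensitive", "Sensitive", "Resistant"],
    "platinum_interval": [120, 400, -5, 90],   # days; non-positive values are invalid
})

# Keep only samples with a known status and a positive platinum interval,
# mirroring the selection criteria described in Figure 3.A.
eligible = clinical[
    clinical["platinum_status"].isin(["Resistant", "Sensitive"])
    & (clinical["platinum_interval"] > 0)
]

resistant = eligible[eligible["platinum_status"] == "Resistant"]
sensitive = eligible[eligible["platinum_status"] == "Sensitive"]
```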
Figure 3. Selecting clinical data, building and evaluating models, and feature importances
3.A. Ovarian cancer TCGA data was split into “Resistant” and “Sensitive” sample sets based on “Platinum
Status” and “Platinum Interval” values. All evaluated samples are TCGA ovarian primary tumors with
existing Platinum Status and positive Platinum Interval. For “Extreme” samples, the top quartile of
“Resistant” and bottom quartile of “Sensitive” samples were omitted from model building to allow cleaner
non-overlapping values and signal. 3.B. Four models (NB, LR, RF and SVM) were built on
machine-learning-transformed clinical data with tuned hyperparameters to establish a baseline value for
platinum-status prediction without molecular data. 3.C. A tuned Random Forest (RF) model with optimal
hyperparameters was used to evaluate and plot the clinical feature importances that contribute to the
model's prediction of platinum resistance.
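The four baseline models of Figure 3.B and the feature-importance readout of Figure 3.C can be sketched with scikit-learn; the data is synthetic stand-in clinical features, and the actual tuned hyperparameters are not reproduced here:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(120, 8)           # stand-in for encoded clinical features
y = rng.randint(0, 2, 120)     # stand-in for platinum status labels

# The four baseline classifiers (here with default hyperparameters).
models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}
baseline = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}

# Clinical feature importances from the random forest, as in Figure 3.C.
rf = models["RF"].fit(X, y)
importances = rf.feature_importances_
```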
Conclusions
• Machine learning approaches achieve about 80% accuracy in predicting platinum resistance/
sensitivity status using TCGA ovarian primary tumor clinical data and RNA-seq
• ANNs perform best across all tuned models, likely due to model complexity
• Prediction of status with clinical data alone is about 75% accurate and can offer some interpretation
• Tuning all methods led to a moderate increase in accuracy
• Limiting samples to the “extreme” subsets led to a moderate increase in accuracy, while scaling the data did not appear to help
For further information
• To sign up for a free trial GenePool account or to subscribe to GenePool please go to
www.stationxinc.com/#introducing-genepool
• More information on Station X and the GenePool platform can be obtained at www.stationxinc.com
• For more information on this poster please contact antoaneta@stationxinc.com
• Follow us on Twitter @StationXInc
References
1. GenePool by Station X, Inc.: http://www.stationxinc.com/
2. The Cancer Genome Atlas Consortium: http://cancergenome.nih.gov/
3. The Cancer Genomics Hub: https://cghub.ucsc.edu/
4. Integrated genomic analyses of ovarian carcinoma, Nature, vol. 474, June 2011
5. Python, the Anaconda platform, and the Scikit-learn, NumPy, Pandas, Matplotlib and Keras libraries:
https://www.python.org, https://www.anaconda.com, http://scikit-learn.org,
http://www.numpy.org, http://pandas.pydata.org, https://matplotlib.org, https://keras.io
6. Python language reference: https://docs.python.org/3/reference/index.html
Ovarian cancer is the eighth most prevalent cancer in women in
the US and the fifth most common cause of cancer death in women. The
standard of care for ovarian cancer consists of a combination of
surgery and chemotherapy, typically platinum-taxane treatment.
However, many patients develop platinum resistance, defined
as a relapse within 6 months of starting platinum
chemotherapy. Predictive methods for early evaluation of the
potential for platinum resistance may benefit patients by
identifying those that might be better served with alternative
second-line therapies or by enrollment in relevant clinical trials.
In this study we generate and evaluate predictive models of
resistance and sensitivity to platinum drugs in ovarian cancer
by using the GenePool™ genomics platform and the integrated
TCGA (The Cancer Genome Atlas) ovarian cancer RNA-seq
and clinical data, including manually-curated platinum status
information for each patient, and complement these analyses
with a battery of machine learning approaches.
We utilize GenePool best practices RNA-seq workflows on
ovarian primary tumors to derive and prioritize genes most
strongly associated with platinum response. First, using clinical
information, we define two ovarian cancer cohorts, one
platinum-sensitive and the other platinum-resistant. We then
compare the expression levels of all genes in these cohorts and
identify those that are most differentially expressed. The gene
expression results are analyzed using a variety of classifier
approaches such as logistic regression, support vector
machines and deep learning techniques to create, optimize and
evaluate predictive models of platinum sensitivity status. Our
workflows leverage cross-validation, dimensionality reduction, and a
variety of performance metrics to evaluate our models, as well
as visualization of results to facilitate interpretation.
We demonstrate a combination of approaches to derive and
validate predictive models of platinum response in ovarian
cancer and illustrate the potential of similar approaches to
benefit cancer patient care.
Results
Figure 5. Model performance evaluation with best percent features and default model hyperparameters
on scaled and unscaled data, and on all and “extreme” samples
All four models were evaluated with the best number of features and default hyperparameters on scaled and
unscaled data, and on all or “extreme” samples; cross-validation data was plotted with box plots (5A. NB,
5B. LR, 5C. RF, 5D. SVM).
Figure 1. Clinical guidelines for platinum sensitivity and resistance were applied to TCGA
ovarian cancer data in GenePool and then used for selection of “Platinum Resistant” and
“Platinum Sensitive” sample groups to analyze with machine learning methods
1A. Schematic of sensitivity and resistance to platinum drugs for ovarian cancer patients based on
clinical guidelines, implemented via manual curation in the GenePool genomics platform. 1B.
GenePool was used to select the platinum-resistant and platinum-sensitive groups with their days
to status, along with their respective RNA-seq data, to predict treatment outcomes with
machine learning approaches.
[Figure 2 workflow diagram: GenePool TCGA data (RNA-seq and clinical data) → sample selection (all primary tumors vs. “extreme” primary tumors; reduce to most informative samples) → dimensionality reduction (ANOVA) from all genes (54K) to a subset of genes and samples → with or without data scaling → split into train set (75%) and test set (25%) → K-fold cross-validation: build classifiers (NB = Naïve Bayes, LR = Logistic Regression, RF = Random Forest, SVM = Support Vector Machine, ANN = Artificial Neural Network) on K train ((K-1)-fold) sets, test models on K dev (1-fold) sets with the same pipeline, repeat and average on all K folds → hyperparameter tuning and model evaluation on performance metric → evaluate models with best hyperparameters on full train set and learning curve (bias and variance) → performance evaluation on test set]
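A minimal version of the Figure 2 cross-validation arrangement, with ANOVA feature selection and scaling fitted inside each fold so the dev fold never leaks into feature selection, might look like the following; synthetic data stands in for the TCGA expression matrix:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 500)        # stand-in for the ~54K-gene RNA-seq matrix
y = rng.randint(0, 2, 100)    # stand-in for resistant/sensitive labels

# ANOVA selection, optional scaling, and the classifier are chained so that
# all three are re-fit on each fold's train split only.
pipe = Pipeline([
    ("anova", SelectPercentile(f_classif, percentile=10)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)   # one accuracy per fold
mean_acc = scores.mean()
```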
Figure 4. Feature number evaluation with default hyperparameters for each model
All four models were evaluated with cross-validation across a variety of percentages of the original number
of features/genes as assessed by ANOVA dimensionality reduction analysis on unscaled data. The mean and
standard deviation values of 5 cross-validation pipelines were plotted and displayed for each model,
respectively: 4A. NB, 4B. LR, 4C. RF, 4D. SVM.
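The percent-features sweep of Figure 4 can be sketched as a loop over ANOVA percentiles, here with Naïve Bayes on synthetic data; the percentile grid is an assumption, as the poster does not list the exact percentages tried:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)
X = rng.rand(80, 200)      # placeholder expression matrix
y = rng.randint(0, 2, 80)  # placeholder platinum status labels

# Evaluate each retained-feature percentage with cross-validation,
# recording mean and standard deviation as plotted in Figure 4.
percentiles = [1, 5, 10, 25, 50, 100]
results = {}
for pct in percentiles:
    pipe = Pipeline([("anova", SelectPercentile(f_classif, percentile=pct)),
                     ("clf", GaussianNB())])
    scores = cross_val_score(pipe, X, y, cv=5)
    results[pct] = (scores.mean(), scores.std())

best_pct = max(results, key=lambda p: results[p][0])
```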
Figure 6. Model performance evaluation with tuned hyperparameters on scaled and unscaled
data, and on all and “extreme” samples, on the full train data set
All four models were tuned and evaluated with the best hyperparameters on scaled and unscaled data,
and on all or “extreme” samples; cross-validation data was plotted with box plots (6A. All
samples, unscaled; 6B. “Extreme” samples, unscaled; 6C. All samples, scaled; 6D. “Extreme” samples,
scaled). Test set results are displayed in green in the box plots for each tuned model.
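Hyperparameter tuning as evaluated in Figure 6 can be sketched with scikit-learn's GridSearchCV; the SVM grid below is hypothetical, since the poster does not list the actual search spaces:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(0)
X = rng.rand(120, 20)         # placeholder feature matrix
y = rng.randint(0, 2, 120)    # placeholder labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Cross-validated grid search over a hypothetical SVM parameter grid;
# the best model is then scored once on the held-out test set, like the
# green test-set markers in Figure 6.
grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=5)
grid.fit(X_train, y_train)
best_params = grid.best_params_
test_acc = grid.score(X_test, y_test)
```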
[Figure 8.C diagram: tuned ANN architecture with an input layer, hidden layers using ReLU activations, and a sigmoid output layer]
Figure 7. Learning curves for all models and gene coefficients for the LR model
Learning curves for all four hyperparameter-tuned models: cross-validated performance for
different numbers of samples on unscaled data, for all samples, on train and dev sets (7A. NB,
7B. LR, 7C. RF, 7D. SVM). 7E. Plot of gene coefficients for the LR model on all unscaled data.
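Figure 7's learning curves and LR gene coefficients can be reproduced in outline with scikit-learn's `learning_curve` and the fitted model's `coef_` attribute; synthetic data stands in for the selected TCGA genes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.rand(100, 30)         # placeholder gene-expression features
y = rng.randint(0, 2, 100)    # placeholder platinum status labels

# Cross-validated train/dev accuracy at increasing sample counts,
# as plotted in Figure 7 A-D to inspect bias and variance.
sizes, train_scores, dev_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))

# Per-gene coefficients of the LR model fit on all data (Figure 7.E);
# large absolute coefficients flag the most influential genes.
lr = LogisticRegression().fit(X, y)
coefs = lr.coef_.ravel()
top_gene_idx = np.argsort(np.abs(coefs))[::-1][:5]
```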
Figure 2. Workflow diagram for evaluating machine learning models on ovarian TCGA RNA-seq data
The schematic represents using TCGA clinical and molecular data, selecting samples and features
(genes), splitting into training, development and test data sets, building several models, model cross-
validation, hyperparameter tuning for best results, evaluation, and model scoring on test data.
[Plot annotations from Figures 5–6: test-set accuracy values 0.734, 0.734, 0.750, 0.750; 0.734, 0.641, 0.750, 0.750; 0.694, 0.653, 0.735, 0.694; 0.694, 0.776, 0.735, 0.735; accuracy axis values 0.88 and 0.81]