Identifying Genes with Prognostic DNA Methylation
Rates for Breast Cancer Survival
Teun de Planque, Christopher Elamri
Department of Computer Science
Stanford University
Abstract—Breast cancer treatments using methylation inhibitors
are an effective new therapeutic option for breast cancer
patients [1]. We used four regression models (proportional
hazards regression, elastic net regression, ridge regression, and
lasso regression) to identify genes whose methylation rates
are strongly correlated with breast cancer survival. With each
model we identified genes whose high methylation rates are
strongly favorably correlated with breast cancer survival, and
genes whose high methylation rates are strongly adversely
correlated with breast cancer survival. A better understanding
of the relationship between DNA methylation rates and breast
cancer survival can assist in the development of patient-tailored
therapy strategies and the discovery of therapeutic targets.
I. INTRODUCTION
DNA methylation is an epigenetic process by which methyl
groups are added to the cytosine (C) or adenine (A) nucleotides
in the DNA molecule [1]. This addition of a methyl group
to DNA is used to regulate gene expression and assure stable
gene silencing. Abnormal DNA methylation patterns have been
associated with breast cancer development [1, 2, 3]. However,
epigenetic processes are reversible and inhibitors of DNA
methylation can reactivate silenced tumor suppressor genes,
and restore normal gene function. Therapeutic applications
of methylation inhibitors provide an effective new treatment
option for breast cancer patients. Identifying genes whose
methylation rates correlate with breast cancer survival is
challenging, however, because of the enormous number of human
genes. In this paper, we use four regression models
(proportional hazards regression, elastic net regression, ridge
regression, and lasso regression) to identify genes whose
methylation rates strongly correlate with breast cancer
survival rates.
II. TASK DEFINITION
Using the survival and methylation data of breast cancer
patients as input, our goal is to output a set of genes of which
the methylation rates are strong predictors of breast cancer
survival.
A. Datasets
We used two datasets from TCGA (The Cancer Genome
Atlas); one contains survival data of cancer patients, the other
contains genomic data, copy number variation (CNV) data,
and methylation data of cancer patients. The survival dataset
contains the type of cancer (11 different cancer types in total),
the "time to last contact or event," and whether the event
occurred (1: death) or not (0: no death at time of last contact)
for 8089 different patients [4]. Many of the survival times are
censored, i.e. the time of observation was cut off before death
occurred; this indicates that the patient either was still alive
at the end of the study or that the patient withdrew from the
study before the end of the study. The dataset with genomic
data, copy number variation (CNV) data, and methylation data
contains methylation data for more than 16,500 different genes
of over 1,000 different cancer patients [4, 5].
B. Input and Output
• Input: survival data of breast cancer patients including
the "time to last contact or event," and whether the event
occurred (1: death) or not (0: no death at time of last
contact), and the methylation data of these breast cancer
patients.
• Output: a set of genes of which the methylation rates
are strong predictors of breast cancer survival.
C. Evaluation Metric
We use 10-fold cross validation to measure the success of
our system by evaluating how well the survival of patients in
our test set can be predicted based on the methylation rates
of the genes chosen by our system. In order to do this we
compute the hazard of death of the patients in our test set given
their methylation rates of the genes chosen by our system.
We compute the hazard of death using the Cox proportional
hazards model. The hazard of death at time t can be interpreted
as the risk of dying at time t. Ideally, the computed hazard
is much higher for patients in the test set who die than for
patients in the test set who survive.
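The comparison described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes a coefficient vector `beta` has already been fitted on the training folds, and it takes the 'hazard ratio relative to the sample average' to mean exp(beta . (x - x_mean)), where x_mean is the mean covariate vector of the scored patients. The function names and the toy numbers are hypothetical.

```python
import math


def relative_hazards(beta, rows):
    """Hazard ratio of each patient relative to the sample average.

    Assumption: 'relative to the sample average' means
    exp(beta . (x - x_mean)), so a patient with average methylation
    rates gets a ratio of 1.
    """
    n = len(rows)
    p = len(beta)
    x_mean = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        math.exp(sum(b * (x - m) for b, x, m in zip(beta, row, x_mean)))
        for row in rows
    ]


def mean_ratio_by_outcome(ratios, died):
    """Average hazard ratio for patients who died and for censored patients."""
    dead = [r for r, d in zip(ratios, died) if d == 1]
    alive = [r for r, d in zip(ratios, died) if d == 0]
    return sum(dead) / len(dead), sum(alive) / len(alive)


# Toy test fold: one selected gene, four patients (made-up values).
beta = [2.0]
methylation = [[0.9], [0.7], [0.3], [0.1]]
died = [1, 1, 0, 0]

ratios = relative_hazards(beta, methylation)
avg_dead, avg_alive = mean_ratio_by_outcome(ratios, died)
print(avg_dead / avg_alive)
```

A ratio well above 1 between the two averages is the behaviour the evaluation metric rewards; a ratio of 1 corresponds to genes that carry no prognostic signal.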
III. APPROACH
A. Baseline
Doctors do not yet use methylation data of breast cancer patients
when selecting breast cancer therapy strategies or when
predicting breast cancer survival. In other words, methylation
rates currently do not affect hazard estimates, either for
patients who will survive breast cancer or for patients who
will die from it. Thus, the average 'hazard ratio relative to
the sample average' based on methylation rates is the same (1)
for both groups. We use this as our baseline: the average
'hazard ratio relative to the sample average' is 1 both for
patients who will survive breast cancer and for patients who
will die from breast cancer.
B. Oracle
Our oracle knows which genes are most correlated with
breast cancer survival. Thus, it identifies the genes for which
the average predicted ’hazard ratio relative to the sample
average’ of patients who will survive breast cancer is minimal,
or for which the average predicted ’hazard ratio relative to the
sample average’ of patients who will die from breast cancer is
maximal. We do not know what these genes are, so there is no
way for us to implement the oracle; the purpose of this work
is to identify those genes. Ideally, we can correctly predict
survival for all patients in our test set based on the methylation
rates of the genes selected by the oracle. This corresponds
to an average predicted ’hazard ratio relative to the sample
average’ of patients who will survive breast cancer of 0, and an
average predicted ’hazard ratio relative to the sample average’
of patients who will die from breast cancer of ∞.
C. Data Preprocessing
We merged the dataset containing genomic data, copy
number variation (CNV) data, and methylation data, with the
survival dataset by processing all 9,074 patient IDs, putting
all of them in the same format, and then finding the patient
IDs contained in both datasets. We then created a matrix with
methylation data of the 16,020 different genes of all the breast
cancer patients contained in both datasets (989 patients in
total). For all of the 989 patients, we added their survival data
including the "time to last contact or event," and whether the
event occurred (1: death) or not (0: no death at time of last
contact) to this new matrix. Because the matrix contains so many
genes (16,020), we reduced their number by removing genes whose
methylation rates correlate barely or not at all with breast
cancer survival. We identified these genes using our regression
models: we fitted each model to the survival and methylation
data and then removed the genes with the lowest absolute weights.
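The two preprocessing steps (matching patients across datasets, then pruning genes by weight) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the ID convention in `normalize_id`, the record layout, the patient IDs, and the gene names `GENE_A` to `GENE_C` are hypothetical and do not come from the datasets.

```python
def normalize_id(patient_id):
    """Put a patient ID in one canonical format.

    Hypothetical convention: upper case, '-' as separator, and only
    the first three fields (dropping any sample suffix).
    """
    cleaned = patient_id.strip().upper().replace(".", "-").replace("_", "-")
    return "-".join(cleaned.split("-")[:3])


def merge_on_patient(survival, methylation):
    """Keep only patients present in both tables, keyed by normalized ID."""
    surv = {normalize_id(k): v for k, v in survival.items()}
    meth = {normalize_id(k): v for k, v in methylation.items()}
    shared = sorted(surv.keys() & meth.keys())
    return {pid: (surv[pid], meth[pid]) for pid in shared}


def keep_top_genes(weights, n_keep):
    """Drop the genes with the lowest absolute fitted weights."""
    ranked = sorted(weights, key=lambda gene: abs(weights[gene]), reverse=True)
    return ranked[:n_keep]


# Made-up survival records: (days to last contact or event, event indicator).
survival = {
    "tcga.xx.0001": (259, 0),
    "TCGA-XX-0002": (548, 1),
    "TCGA-XX-0003": (90, 0),  # no methylation record, so it is dropped
}
# Made-up methylation rates per gene.
methylation = {
    "TCGA-XX-0001-01A": {"GENE_A": 0.81, "GENE_B": 0.12},
    "TCGA-XX-0002-01A": {"GENE_A": 0.35, "GENE_B": 0.10},
}

merged = merge_on_patient(survival, methylation)
selected = keep_top_genes({"GENE_A": -1.7, "GENE_B": 0.02, "GENE_C": 0.4}, n_keep=2)
print(sorted(merged), selected)
```

Ranking by absolute weight keeps strongly negative coefficients (favorable genes) as well as strongly positive ones (adverse genes).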
D. Regularized Least-squares Regression Using Ridge
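Ridge regression minimizes squared error while regularizing
the L2-norm of the weights [6]:
$$J(w) = \lambda \|w\|_2^2 + \sum_i (w^T x_i - y_i)^2 \tag{1}$$
Then the stationary condition is
$$\frac{1}{2}\frac{\partial J}{\partial w} = \lambda w + \sum_i (w^T x_i - y_i)\, x_i = 0 \tag{2}$$
Writing $X$ for the matrix whose columns are the $x_i$ and $y$ for the vector of the $y_i$, this gives
$$(X X^T + \lambda I)\, w = X y \tag{3}$$
$$w = (X X^T + \lambda I)^{-1} X y \tag{4}$$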
Ridge regression is well suited to problems with many predictors
(such as the 16,020 genes in our dataset), all with non-zero
coefficients drawn from a normal distribution. In particular,
ridge regression performs well with predictors that have small
effects, and prevents coefficients of regression models with
many correlated variables from being poorly determined and
exhibiting high variance [7].
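The closed-form ridge solution can be checked numerically on a tiny example. The sketch below is illustrative only: it uses plain Python lists and a small hand-written Gaussian elimination (`solve`, a helper introduced here) so that it runs without any numerical library, and the three-sample data set is made up. It builds the matrix sum_i x_i x_i^T + lambda I and the vector sum_i y_i x_i, then solves for w.

```python
def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def ridge(samples, y, lam):
    """Closed-form ridge weights: (sum_i x_i x_i^T + lam I)^(-1) sum_i y_i x_i."""
    p = len(samples[0])
    gram = [
        [sum(x[j] * x[k] for x in samples) + (lam if j == k else 0.0)
         for k in range(p)]
        for j in range(p)
    ]
    xy = [sum(x[j] * t for x, t in zip(samples, y)) for j in range(p)]
    return solve(gram, xy)


# Made-up data generated by y = 1 * x0 + 2 * x1.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

w_ols = ridge(samples, y, lam=0.0)    # no penalty: recovers the true weights
w_ridge = ridge(samples, y, lam=1.0)  # penalty shrinks the weights toward zero
print(w_ols, w_ridge)
```

With the penalty switched on, both weights shrink but neither becomes exactly zero, which is why ridge weights alone do not select a sparse gene set.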
E. Regularized Least-squares Regression Using Lasso
Lasso regression methods are widely used in domains with
massive datasets, such as genomics, for which efficient and
fast algorithms are essential [7]. However, lasso regularization
is not robust to high correlations among predictors. It will
arbitrarily choose one predictor, ignore other predictors, and
break down when all predictors are identical [8]. Moreover,
the lasso penalty expects many coefficients to be close to zero
and only a small subset of coefficients to be significantly larger
than zero. The lasso estimator uses the L1-norm penalized least
squares criterion to obtain a sparse solution to the following
optimization problem:
$$J(w) = \arg\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{5}$$
where $X$ holds one patient per row, $\|w\|_1 = \sum_{j=1}^{p} |w_j|$ is the
L1-norm penalty on $w$, which induces sparsity in the solution,
and $\lambda \ge 0$ is a tuning parameter.
The L1-norm penalty enables the lasso method to simultaneously
regularize the least squares fit and shrink some
components of $J(w)$ to zero for some suitably chosen $\lambda$.
However, the lasso method is unstable for high-dimensional
data and cannot select more variables than the sample size
before it saturates when $p > n$ [8].
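One way to see why the L1 penalty produces exact zeros is to minimize the lasso objective over a single coefficient while holding the others fixed. The derivation below is a standard textbook step rather than something stated by the authors. The symbols $r_i^{(j)}$, $\rho_j$, $z_j$ and the soft-thresholding operator $S$ are introduced here for illustration, and $x_{ij}$ denotes the methylation rate of gene $j$ for patient $i$.

```latex
% Lasso objective restricted to coordinate w_j (all other w_k held fixed)
\begin{align*}
r_i^{(j)} &= y_i - \sum_{k \neq j} x_{ik} w_k,
\qquad \rho_j = \sum_i x_{ij}\, r_i^{(j)},
\qquad z_j = \sum_i x_{ij}^2, \\
f(w_j) &= \sum_i \bigl(r_i^{(j)} - x_{ij} w_j\bigr)^2 + \lambda\,|w_j|
        = z_j w_j^2 - 2\rho_j w_j + \lambda\,|w_j| + \text{const}, \\
\hat{w}_j &= \frac{S(\rho_j,\ \lambda/2)}{z_j},
\qquad S(a, c) = \operatorname{sign}(a)\,\max(|a| - c,\ 0).
\end{align*}
% Hence \hat{w}_j = 0 exactly whenever |\rho_j| \le \lambda/2.
```

A gene whose correlation with the partial residual stays below the threshold $\lambda/2$ is therefore removed from the model entirely, whereas a ridge penalty would only shrink its weight.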
F. Regularized Least-squares Regression Using Elastic Net
The elastic net (ENET) is an extension of the lasso that is
robust to high correlations among the predictors. It circumvents
the instability of the lasso solution paths when predictors are
highly correlated, which can occur in our DNA methylation data,
and can efficiently analyze high-dimensional data [9]. In
particular, the ENET uses a mixture of the L1-norm (lasso) and
L2-norm (ridge regression) penalties and can be formulated as:
$$J(w) = \left(1 + \frac{\lambda_2}{n}\right) \arg\min_w \left( \|y - Xw\|_2^2 + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1 \right) \tag{6}$$
On setting $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, the ENET estimator is seen to be
equivalent to the minimizer of:
$$J(w) = \arg\min_w \; \|y - Xw\|_2^2 \tag{7}$$
subject to
$$P_\alpha(w) = (1 - \alpha) \|w\|_1 + \alpha \|w\|_2^2 \le s \quad \text{for some } s \tag{8}$$
where $P_\alpha(w)$ is the ENET penalty [9].
Thus, the ENET simplifies to simple ridge regression when
α = 1 and to the lasso when α = 0. The L1-norm part of the
ENET does automatic variable selection, while the L2-norm
part encourages grouped selection and stabilizes the solution
paths with respect to random sampling, thereby improving
prediction. By inducing a grouping effect during variable
selection, such that a group of highly correlated variables tend
to have coefficients of similar magnitude, the ENET can select
groups of correlated features when the groups are not known
in advance. Unlike the lasso, when p >> n the elastic net
selects more than n variables [9].
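The grouping effect can be illustrated with two identical predictors. The sketch below is an assumption-laden toy, not the authors' implementation: it minimizes the penalized objective inside the argmin (squared error plus an L2 term weighted by `lam2` plus an L1 term weighted by `lam1`) by coordinate descent, omits the (1 + lambda_2 / n) rescaling, and uses made-up data. Setting `lam2=0` gives the plain lasso for comparison.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(rows, y, lam1, lam2, sweeps=500):
    """Coordinate descent for
    sum_i (y_i - w . x_i)^2 + lam2 * sum_j w_j^2 + lam1 * sum_j |w_j|.
    Assumes every feature column has at least one non-zero entry.
    """
    p = len(rows[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                x[j] * (t - sum(x[k] * w[k] for k in range(p) if k != j))
                for x, t in zip(rows, y)
            )
            z = sum(x[j] ** 2 for x in rows)
            w[j] = soft_threshold(rho, lam1 / 2.0) / (z + lam2)
    return w


# Two identical (perfectly correlated) made-up predictors.
rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

w_lasso = elastic_net(rows, y, lam1=1.0, lam2=0.0)  # lasso: weight on one copy
w_enet = elastic_net(rows, y, lam1=1.0, lam2=1.0)   # ENET: weight shared equally
print(w_lasso, w_enet)
```

In this toy case the lasso solution is not unique, and which copy receives the weight depends only on the update order. The L2 term makes the objective strictly convex, so the two copies end up with equal coefficients.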
G. Cox Proportional Hazards Model
The Cox model is a well-recognised statistical technique
for analyzing the relationship between patient survival and
explanatory variables [10]. The Cox regression model (also
known as the proportional hazards regression model)
allows us to isolate the effects of several explanatory variables
and deal with the censored survival times. It models the
survival times of the patients on the gene methylation rates.
Proportional hazards regression produces an equation for the
hazard function of breast cancer patients given their DNA
methylation rates. The hazard function is the probability that a
breast cancer patient will die within a short time interval, given
that the breast cancer patient has survived up to the beginning
of the interval. The hazard at time t can be interpreted as the
risk that a breast cancer patient will die during time period t.
The hazard function obtained using the Cox regression
model is:
$$h(t) = h_0(t) \exp(\beta^T x) \tag{9}$$
where
t: time after the start of the study
$h_0(t)$: the baseline hazard function
$\beta$: vector of the regression coefficients
x: vector of the values of the explanatory variables
The baseline hazard function gives the hazard of dying when
all the methylation rates are zero. Based on the regression
coefficients we can identify the genes most correlated
to lower or higher survival rates. The regression coefficients
with low values correspond to genes of which the methylation
rates are favorably correlated with breast cancer survival, and
the regression coefficients with high values correspond to
genes of which the methylation rates are adversely correlated
with breast cancer survival. A disadvantage of the Cox model
is that the proportional hazards (PH) assumption assumes that
the impact of each covariate on hazard remains constant during
the entire follow-up time. However, in our case, the genomic
expression of a patient might slightly change during the study
time, thereby violating the PH assumption [10].
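To make concrete how the Cox model deals with censored survival times, the sketch below evaluates the standard Cox partial log-likelihood for a given coefficient vector. The text does not say how the model was fitted, so this is an illustration under stated assumptions: no tied event times, made-up data, and a hypothetical function name.

```python
import math


def cox_partial_log_likelihood(beta, rows, times, events):
    """Cox partial log-likelihood (assumes no tied event times).

    Each death contributes beta . x_i minus the log of the summed
    relative hazards of everyone still at risk at that time. Censored
    patients (event == 0) contribute no term of their own, but they stay
    in the risk sets until their time of last contact.
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    total = 0.0
    for i, (t_i, died) in enumerate(zip(times, events)):
        if died != 1:
            continue
        at_risk = [math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        total += scores[i] - math.log(sum(at_risk))
    return total


# Made-up data: one gene, three patients; the second patient is censored.
times = [100, 250, 400]
events = [1, 0, 1]
rows = [[0.9], [0.4], [0.2]]

ll_null = cox_partial_log_likelihood([0.0], rows, times, events)
ll_fit = cox_partial_log_likelihood([2.0], rows, times, events)
print(ll_null, ll_fit)
```

The baseline hazard cancels out of every term, so the coefficients can be compared without estimating it. In this toy data a positive coefficient fits better than zero because the patient with the highest methylation rate dies first.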
IV. RESULTS
A. Error Analysis
We evaluated the different regression models using 10-fold
cross-validation. We then used the Cox proportional hazards
model to compute the 'hazard ratios relative to the sample
average' for all patients in our test data based on the genes
that affect breast cancer prognosis as selected by the different
regression models. As visible in the bar graph with the average
hazard ratios of patients in the test data, the average
predicted 'hazard ratio relative to the sample average' based
on the chosen genes is considerably larger for patients who
die of breast cancer than for patients who are still alive at the
time of last contact. In particular, with the genes selected by
elastic net regression, the average hazard ratio is over 18.6
times higher for patients who die than for patients who were
still alive at the time of last contact. With the genes selected
using ridge regression, lasso regression, and the Cox
proportional hazards model, it is respectively 10.0, 3.8, and
3.4 times higher for patients who die than for patients who
are still alive at the time of last contact. In other words, the
genes selected using elastic net regression and ridge regression
are particularly useful for predicting the risk of death
of breast cancer patients within a certain time interval. The
genes selected using lasso regression and the Cox proportional
hazards model are also good predictors of that risk, but the
hazard predictions based on the genes selected using elastic
net regression and ridge regression are more accurate.
[Figure: Kaplan-Meier survival curve for the GRHPR gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low GRHPR methylation rate.]
[Figure: Kaplan-Meier survival curve for the GADD45A gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low GADD45A methylation rate.]
The Kaplan-Meier curves compare the survival over time of
patients with high and low methylation rates for four of the
genes selected using our methods [11]. As visible
in the GRHPR Kaplan-Meier curve, breast cancer patients with
a relatively high GRHPR methylation rate survive longer than
patients with a relatively low GRHPR methylation rate. The
Cox proportional hazards model, the lasso regression model,
and the ridge regression model all suggest that GRHPR is a
gene whose high methylation rate favorably affects breast
cancer survival. In fact, 5 years after the start of
the survival study, 85% of the breast cancer patients (who did
not withdraw from the study) with high GRHPR methylation
rates were still alive, compared with 73% of the breast cancer
patients (who did not withdraw from the study) with low GRHPR
methylation rates.
[Figure: Kaplan-Meier survival curve for the ENOX2 gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low ENOX2 methylation rate.]
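The survival percentages read off a Kaplan-Meier curve come from the product-limit estimator [11]. A minimal sketch follows, assuming made-up follow-up times. The function name is hypothetical, and the function would be applied separately to the high-methylation and low-methylation groups, however those groups are defined.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    S(t) is the product, over death times t_k <= t, of (1 - d_k / n_k),
    where d_k is the number of deaths at t_k and n_k is the number of
    patients still at risk just before t_k. Censored patients leave the
    risk set without lowering the curve.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Made-up follow-up times in days; event 1 = death, 0 = censored.
times = [200, 500, 500, 900, 1500]
events = [1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
print(curve)
```

Because censored patients are removed from the denominator only after their last contact, the estimate uses their follow-up time without treating them as deaths.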
Lasso regression indicates that GADD45A is a gene whose
high methylation rates are associated with high cancer survival
rates. The GADD45A Kaplan-Meier curve shows that 83% of
breast cancer patients (who did not withdraw from the study)
with relatively high GADD45A methylation are still alive 5
years after the start of the survival study, while 76% of breast
cancer patients (who did not withdraw from the study) with
relatively low GADD45A methylation are still alive 5 years
after the start of the survival study. Conversely, both elastic net
regression and ridge regression suggest that a high ENOX2
methylation rate negatively affects breast cancer survival, and
the Cox proportional hazards model indicates that a high
ANKRD52 methylation rate adversely affects breast cancer
survival. The curves for ANKRD52 and ENOX2 show that
high ANKRD52 and ENOX2 methylation rates do indeed
negatively affect breast cancer survival rates.
[Figure: Kaplan-Meier survival curve for the ANKRD52 gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low ANKRD52 methylation rate.]

Cox Proportional Hazards Model
Top 3 favorably prognostic genes    Top 3 adversely prognostic genes
EEF1A1P9                            COL6A2
GRHPR                               ANKRD52
CASP3                               C12orf41

Elastic Net Regression Model
Top 3 favorably prognostic genes    Top 3 adversely prognostic genes
CLEC2D                              DHDDS
C9orf89                             EXOC1
CASP3                               ENOX2

Lasso Regression Model
Top 3 favorably prognostic genes    Top 3 adversely prognostic genes
GRHPR                               GGCX
FUZ                                 GTPBP8
GADD45A                             GRHL2

Ridge Regression Model
Top 3 favorably prognostic genes    Top 3 adversely prognostic genes
ADH5                                DNAJC8
GRHPR                               ENOX2
CASP3                               EXOC1
B. Literature Review
Several other projects have focused on applying machine
learning techniques in order to extract valuable information
from DNA methylation data. Previous projects mainly focused
on evaluating different statistical methods for analyzing DNA
methylation data [18], while others analyzed DNA methylation
data for specific types of cancer, such as leukemia [19]. In that
context, our project fits in the second framework, since we use
different regression techniques for gene identification for breast
cancer specifically using DNA methylation data.
Relative to existing projects, our contribution is two-fold.
First, we have compared different regression models (lasso,
ridge, Cox proportional hazards, elastic net) to find
genes highly correlated to breast cancer survival, which
highlights the importance of using different methods
in gene identification (i.e. different genes can be found with
different methods). Second, we have found genes whose
methylation rates are highly correlated to breast cancer
survival, which may give additional directions for breast
cancer research and for the development of breast cancer
treatments.
In fact, the favorably and adversely prognostic genes iden-
tified by our methods might be worth looking at in order to
further understand breast cancer biological mechanisms. Many
of the genes we identified have been widely acknowledged in
the medical literature as genes strongly correlated to breast
cancer survival, such as: CASP3 [12, 13], GADD45A [14],
ENOX2 [15], GRHL2 [16], and COL6A2 [17]. Some of those
genes appear in the top-three list of only one method, such as
GADD45A and GRHL2 (lasso only) and COL6A2 (Cox proportional
hazards only). This underscores the benefits of
using distinct methods in the context of gene identification.
Moreover, given our success in identifying genes known to be
highly-correlated to breast cancer survival, the additional genes
we found might be worth investigating to further understand
breast cancer.
V. CONCLUSION
We have presented different regression techniques to identify
genes that are highly correlated to breast cancer survival
rates by analyzing the survival and DNA methylation data
of 989 breast cancer patients [4]. Our results identify genes
widely known in the medical literature to be involved in
breast cancer development. The identified genes may prove
to be helpful for the discovery of therapeutic targets, and the
development of patient-tailored therapy strategies.
ACKNOWLEDGMENT
This project would not have been possible without the help
of the Gevaert Biomedical Informatics Lab, which provided
both the datasets and ongoing support.
REFERENCES
[1] M. Szyf, ’DNA methylation signatures for breast cancer
classification and prognosis’, Genome Medicine, vol. 4, no.
3, p. 26, 2012.
[2] S. Baylin, ’Aberrant patterns of DNA methylation,
chromatin formation and gene expression in cancer’,
Human Molecular Genetics, vol. 10, no. 7, pp. 687-692,
2001.
[3] K. Hansen, W. Timp, H. Bravo, S. Sabunciyan, B.
Langmead, O. McDonald, B. Wen, H. Wu, Y. Liu, D. Diep,
E. Briem, K. Zhang, R. Irizarry and A. Feinberg, ’Increased
methylation variation in epigenetic domains across cancer
types’, Nature Genetics, vol. 43, no. 8, pp. 768-775, 2011.
[4] The Cancer Genome Atlas - National Cancer Institute,
’The Cancer Genome Atlas Home Page’, 2015. [Online].
Available: http://cancergenome.nih.gov/. [Accessed: 20-Nov-2015].
[5] C. Creighton, ’SR2-3: Integrative Genomic Analyses of
Breast Cancer from The Cancer Genome Atlas (TCGA).’,
Cancer Research, vol. 71, no. 24, pp. SR2-3-SR2-3, 2011.
[6] A. Hoerl and R. Kennard, ’Ridge Regression: Biased
Estimation for Nonorthogonal Problems’, Technometrics,
vol. 42, no. 1.
[7] J. Friedman, T. Hastie and R. Tibshirani, ’Regularization
Paths for Generalized Linear Models via Coordinate
Descent’, Journal of Statistical Software, vol. 33, no. 1, 2010.
[8] H. Zou, ’The Adaptive Lasso and Its Oracle Properties’,
Journal of the American Statistical Association, vol. 101,
no. 476, pp. 1418-1429, 2006.
[9] J. Ogutu, T. Schulz-Streeck and H. Piepho, ’Genomic
selection using regularized linear regression models: ridge
regression, lasso, elastic net and their extensions’, BMC
Proc, vol. 6, no. 2, p. S10, 2012.
[10] M. Abrahamowicz, T. Schopflocher, K. Leffondré, R. du
Berger and D. Krewski, ’Flexible Modeling of
Exposure-Response Relationship between Long-Term Average Levels
of Particulate Air Pollution and Mortality in the American
Cancer Society Study’, Journal of Toxicology and
Environmental Health, Part A, vol. 66, no. 16-19, pp. 1625-1654,
2003.
[11] E. Kaplan and P. Meier, ’Nonparametric Estimation
from Incomplete Observations’, Journal of the American
Statistical Association, vol. 53, no. 282, p. 457, 1958.
[12] O’Donovan N, Crown J, Stunell H, Hill AD, McDermott
E, O’Higgins N, Duffy MJ. ’Caspase 3 in breast cancer’,
Clin Cancer Res, pp. 738-742, 2003.
[13] E. Devarajan, A. Sahin, J. Chen, R. Krishnamurthy, N.
Aggarwal, A. Brun, A. Sapino, F. Zhang, D. Sharma, X. Yang,
A. Tora and K. Mehta, ’Down-regulation of caspase 3 in
breast cancer: a possible mechanism for chemoresistance’,
Oncogene, vol. 21, no. 57, pp. 8843-8851, 2002.
[14] J. Tront, Y. Huang, A. Fornace, B. Hoffman and
D. Liebermann, ’Gadd45a Functions as a Promoter or
Suppressor of Breast Cancer Dependent on the Oncogenic
Stress’, Cancer Research, vol. 70, no. 23, pp. 9671-9681,
2010.
[15] D. Morré and D. Morré, ECTO-NOX proteins. New
York, NY: Springer, 2013.
[16] X. Xiang, Z. Deng, X. Zhuang, S. Ju, J. Mu, H. Jiang, L.
Zhang, J. Yan, D. Miller and H. Zhang, ’Grhl2 Determines
the Epithelial Phenotype of Breast Cancers and Promotes
Tumor Progression’, PLoS ONE, vol. 7, no. 12, p. e50781,
2012.
[17] E. Karousou, M. D’Angelo, K. Kouvidi, D. Vigetti, M.
Viola, D. Nikitovic, G. De Luca and A. Passi, ’Collagen
VI and Hyaluronan: The Common Role in Breast Cancer’,
BioMed Research International, vol. 2014, pp. 1-10, 2014.
[18] T. Wilhelm, ’Phenotype prediction based on genome-wide
DNA methylation data’, BMC Bioinformatics, vol. 15, no.
1, p. 193, 2014.
[19] J. Nordlund, C. Bäcklin, V. Zachariadis, et al. ’DNA
methylation-based subtype prediction for pediatric acute
lymphoblastic leukemia’, Clin Epigenetics, vol. 7, no. 1, p.
11, 2015.

Gene expression mining for predicting survivability of patients in earlystage...ijbbjournal
 
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem Cells
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem CellsModeling the Dynamics of Glioblastoma Multiforme and Cancer Stem Cells
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem CellsStephen Steward
 
Machine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisMachine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisPramod Sharma
 
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...semualkaira
 
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...semualkaira
 
Week12sampling and feature selection technique to solve imbalanced dataset
Week12sampling and feature selection technique to solve imbalanced datasetWeek12sampling and feature selection technique to solve imbalanced dataset
Week12sampling and feature selection technique to solve imbalanced datasetMusTapha KaMal FaSya
 
Oncotype Dx Mammaprint
Oncotype Dx MammaprintOncotype Dx Mammaprint
Oncotype Dx Mammaprintfondas vakalis
 
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...European School of Oncology
 

Similar to Gene_Identification_Report (20)

i.a.Preoperative ovarian cancer diagnosis using neuro fuzzy approach
i.a.Preoperative ovarian cancer diagnosis using neuro fuzzy approachi.a.Preoperative ovarian cancer diagnosis using neuro fuzzy approach
i.a.Preoperative ovarian cancer diagnosis using neuro fuzzy approach
 
Construction and Validation of Prognostic Signature Model Based on Metastatic...
Construction and Validation of Prognostic Signature Model Based on Metastatic...Construction and Validation of Prognostic Signature Model Based on Metastatic...
Construction and Validation of Prognostic Signature Model Based on Metastatic...
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learning
 
Adjuvant radiation based on genomic risk factors emerging scenarios
Adjuvant radiation based on genomic risk factors   emerging scenariosAdjuvant radiation based on genomic risk factors   emerging scenarios
Adjuvant radiation based on genomic risk factors emerging scenarios
 
Predictors of survival in children with ependymoma from a single center: usi...
 Predictors of survival in children with ependymoma from a single center: usi... Predictors of survival in children with ependymoma from a single center: usi...
Predictors of survival in children with ependymoma from a single center: usi...
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
 
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
 
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
 
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
 
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
 
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
Prognosis of Invasive Micropapillary Carcinoma of the Breast Analyzed by Usin...
 
Gene expression mining for predicting survivability of patients in earlystage...
Gene expression mining for predicting survivability of patients in earlystage...Gene expression mining for predicting survivability of patients in earlystage...
Gene expression mining for predicting survivability of patients in earlystage...
 
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem Cells
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem CellsModeling the Dynamics of Glioblastoma Multiforme and Cancer Stem Cells
Modeling the Dynamics of Glioblastoma Multiforme and Cancer Stem Cells
 
Machine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisMachine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer Diagnosis
 
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
 
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...
 
Week12sampling and feature selection technique to solve imbalanced dataset
Week12sampling and feature selection technique to solve imbalanced datasetWeek12sampling and feature selection technique to solve imbalanced dataset
Week12sampling and feature selection technique to solve imbalanced dataset
 
Oncotype Dx Mammaprint
Oncotype Dx MammaprintOncotype Dx Mammaprint
Oncotype Dx Mammaprint
 
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...
Gene Profiling in Clinical Oncology - Slide 11 - J. Albanell Mestres - The Sp...
 

Gene_Identification_Report

Identifying Genes with Prognostic DNA Methylation Rates for Breast Cancer Survival

Teun de Planque, Christopher Elamri
Department of Computer Science
Stanford University

Abstract—Breast cancer treatments using methylation inhibitors are an effective new therapeutic option for breast cancer patients [1]. We used different regression models, i.e. proportional hazards regression, elastic net regression, ridge regression, and lasso regression, to identify genes of which methylation rates are strongly correlated to breast cancer survival. With each of the regression models we identified genes of which high methylation rates are strongly favorably correlated with breast cancer survival, and genes of which high methylation rates are strongly adversely correlated with breast cancer survival. A better understanding of the relationship between DNA methylation rates and breast cancer survival can assist in the development of patient-tailored therapy strategies, and the discovery of therapeutic targets.

I. INTRODUCTION

DNA methylation is an epigenetic process by which methyl groups are added to the cytosine (C) or adenine (A) nucleotides in the DNA molecule [1]. This addition of a methyl group to DNA is used to regulate gene expression and assure stable gene silencing. Abnormal DNA methylation patterns have been associated with breast cancer development [1, 2, 3]. However, epigenetic processes are reversible and inhibitors of DNA methylation can reactivate silenced tumor suppressor genes, and restore normal gene function. Therapeutic applications of methylation inhibitors provide an effective new treatment option for breast cancer patients. The identification of genes of which the methylation rates correlate with breast cancer survival rates is, however, challenging, because of the enormous number of human genes.
In this paper, we use different regression models, including proportional hazards regression, elastic net regression, ridge regression, and lasso regression, to identify genes of which methylation rates strongly correlate with breast cancer survival rates.

II. TASK DEFINITION

Using the survival and methylation data of breast cancer patients as input, our goal is to output a set of genes of which the methylation rates are strong predictors of breast cancer survival.

A. Datasets

We used two datasets from TCGA (The Cancer Genome Atlas): one contains survival data of cancer patients; the other contains genomic data, copy number variation (CNV) data, and methylation data of cancer patients. The survival dataset contains the type of cancer (11 different cancer types in total), the "time to last contact or event," and whether the event occurred (1: death) or not (0: no death at time of last contact) for 8,089 different patients [4]. Many of the survival times are censored, i.e. the time of observation was cut off before death occurred; this indicates that the patient either was still alive at the end of the study or withdrew from the study before it ended. The dataset with genomic data, CNV data, and methylation data contains methylation data for more than 16,500 different genes of over 1,000 different cancer patients [4, 5].

B. Input and Output

• Input: survival data of breast cancer patients, including the "time to last contact or event" and whether the event occurred (1: death) or not (0: no death at time of last contact), and the methylation data of these breast cancer patients.
• Output: a set of genes of which the methylation rates are strong predictors of breast cancer survival.

C. Evaluation Metric

We use 10-fold cross-validation to measure the success of our system by evaluating how well the survival of patients in our test set can be predicted based on the methylation rates of the genes chosen by our system.
To do this, we compute the hazard of death of the patients in our test set given their methylation rates of the genes chosen by our system. We compute the hazard of death using the Cox proportional hazards model. The hazard of death at time t can be interpreted as the risk of dying at time t. Ideally, the computed hazard is much higher for patients in the test set who die than for patients in the test set who survive.

III. APPROACH

A. Baseline

Doctors do not yet use methylation data of breast cancer patients when selecting breast cancer therapy strategies or when predicting breast cancer survival. In other words, methylation rates currently do not affect hazard estimates, either for patients who will survive breast cancer or for patients who will die from it. Thus, the average 'hazard ratio relative to the sample average' based on methylation rates is the same (1) for both groups of patients. We use this as our baseline.

B. Oracle

Our oracle knows which genes are most correlated with breast cancer survival. Thus, it identifies the genes for which the average predicted 'hazard ratio relative to the sample average' of patients who will survive breast cancer is minimal, or for which the average predicted 'hazard ratio relative to the sample average' of patients who will die from breast cancer is maximal. We do not know what these genes are, so there is no way for us to implement the oracle; the purpose of this work is to identify those genes. Ideally, we can correctly predict survival for all patients in our test set based on the methylation rates of the genes selected by the oracle. This corresponds to an average predicted 'hazard ratio relative to the sample average' of 0 for patients who will survive breast cancer, and of ∞ for patients who will die from breast cancer.

C. Data Preprocessing

We merged the dataset containing genomic data, copy number variation (CNV) data, and methylation data with the survival dataset by processing all 9,074 patient IDs, putting all of them in the same format, and then finding the patient IDs contained in both datasets. We then created a matrix with the methylation data of the 16,020 different genes for all breast cancer patients contained in both datasets (989 patients in total). For each of the 989 patients, we added the survival data, including the "time to last contact or event" and whether the event occurred (1: death) or not (0: no death at time of last contact), to this new matrix.
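The ID-matching step of the preprocessing can be sketched as follows. This is a minimal illustration, not the authors' code: the normalization rule (upper-casing, unifying separators, and keeping the first 12 characters of a TCGA-style barcode), the function names, and the toy records are all assumptions made for the example.

```python
def normalize_patient_id(raw_id: str) -> str:
    """Map an ID to one canonical format.

    Assumption: IDs are TCGA-style barcodes whose first 12 characters
    identify the patient; the real datasets may need a different rule.
    """
    cleaned = raw_id.strip().upper().replace(".", "-").replace("_", "-")
    return cleaned[:12]


def merge_by_patient(survival: dict, methylation: dict) -> dict:
    """Keep only patients present in both datasets and join their records."""
    surv = {normalize_patient_id(k): v for k, v in survival.items()}
    meth = {normalize_patient_id(k): v for k, v in methylation.items()}
    shared = sorted(surv.keys() & meth.keys())
    return {
        pid: {"time": surv[pid][0], "event": surv[pid][1], "methylation": meth[pid]}
        for pid in shared
    }


# Toy records (hypothetical IDs and values): (time in days, event flag).
survival_toy = {
    "TCGA-AA-0001": (259, 0),
    "tcga.aa.0002": (437, 1),
    "TCGA-AA-0003": (100, 0),  # no methylation record
}
methylation_toy = {
    "TCGA-AA-0001-01A": [0.10, 0.80],
    "TCGA-AA-0002-01A": [0.40, 0.20],
    "TCGA-ZZ-9999-01A": [0.55, 0.35],  # no survival record
}

merged = merge_by_patient(survival_toy, methylation_toy)
for pid, record in merged.items():
    print(pid, record)
```

Only the two patients found in both toy datasets survive the merge, which mirrors how the 989 breast cancer patients were obtained from the intersection of the two TCGA datasets.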
Because of the enormous number of genes included in this matrix (16,020), we reduced the number of genes by removing those of which the methylation rates do not or barely correlate with breast cancer survival. We identified these genes using our regression models: we fitted each model to the survival and methylation data and then removed the genes with the lowest absolute weights.

D. Regularized Least-squares Regression Using Ridge

Ridge regression minimizes squared error while regularizing the L2-norm of the weights [6]:

$$J(w) = \lambda \lVert w \rVert_2^2 + \sum_i (w^T x_i - y_i)^2 \tag{1}$$

Setting the gradient to zero and dividing by 2 gives the stationary condition

$$\tfrac{1}{2} \frac{\partial J}{\partial w} = \lambda w + \sum_i (w^T x_i - y_i)\, x_i = 0 \tag{2}$$

$$(X X^T + \lambda I)\, w = X y \tag{3}$$

$$w = (X X^T + \lambda I)^{-1} X y \tag{4}$$

where the columns of $X$ are the samples $x_i$ and $y$ is the vector of the responses $y_i$.

Ridge regression is ideal if there are many predictors (here, the 16,020 genes from our dataset), all with non-zero coefficients and drawn from a normal distribution. In particular, ridge regression performs well with predictors that have small effects, and prevents coefficients of regression models with many correlated variables from being poorly determined and exhibiting high variance [7].

E. Regularized Least-squares Regression Using Lasso

Lasso regression methods are widely used in domains with massive datasets, such as genomics, for which efficient and fast algorithms are essential [7]. However, lasso regularization is not robust to high correlations among predictors. It will arbitrarily choose one predictor, ignore the other predictors, and break down when all predictors are identical [8]. Moreover, the lasso penalty expects many coefficients to be close to zero and only a small subset of coefficients to be significantly larger than zero.
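The different behaviour of the two penalties is easiest to see with a single predictor, where both estimators have a closed form. The sketch below is illustrative only: the function names and toy numbers are assumptions, and the lasso criterion is taken to be the sum of squared errors plus λ|w|, for which the minimizer is a soft-thresholded least-squares fit.

```python
import math


def ridge_1d(x, y, lam):
    """Scalar case of the ridge solution: w = (sum x_i^2 + lam)^-1 * sum x_i y_i."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / (sxx + lam)


def lasso_1d(x, y, lam):
    """Minimizer of sum (y_i - w x_i)^2 + lam * |w| for one predictor.

    Setting the subgradient to zero gives soft-thresholding of sum x_i y_i
    at lam / 2. Assumes sum x_i^2 > 0.
    """
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data (hypothetical): the unpenalized least-squares weight is 0.5.
x_toy = [1.0, 2.0, 2.0]
y_toy = [0.5, 1.0, 1.0]

for lam in (0.0, 3.0, 9.0):
    print(lam, ridge_1d(x_toy, y_toy, lam), lasso_1d(x_toy, y_toy, lam))
```

As λ grows, the ridge weight shrinks towards zero without reaching it, whereas the lasso weight becomes exactly zero once λ/2 exceeds |Σ x_i y_i|. This is the mechanism by which the lasso discards genes outright while ridge keeps every gene with a small weight.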
The lasso estimator uses the L1-norm penalized least squares criterion to obtain a sparse solution to the following optimization problem:

$$\hat{w} = \arg\min_w \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1 \tag{5}$$

Here $\lVert w \rVert_1 = \sum_{j=1}^{p} |w_j|$ is the L1-norm penalty on $w$, which induces sparsity in the solution, and $\lambda \ge 0$ is a tuning parameter. The L1-norm penalty enables the lasso method to simultaneously regularize the least squares fit and shrink some components of $\hat{w}$ to zero for some suitably chosen $\lambda$. However, the lasso method is unstable for high-dimensional data, and when $p > n$ it cannot select more variables than the sample size $n$ before it saturates [8].

F. Regularized Least-squares Regression Using Elastic Net

The elastic net (ENET) is an extension of the lasso that is robust to high correlations among the predictors. It circumvents the instability of the lasso solution paths when predictors are highly correlated, which matters in the context of our DNA methylation analysis, and it can efficiently analyze high-dimensional data [9]. In particular, the ENET uses a mixture of the L1-norm (lasso) and L2-norm (ridge regression) penalties and can be formulated as:

$$\hat{w} = \left(1 + \frac{\lambda_2}{n}\right) \left( \arg\min_w \; \lVert y - Xw \rVert_2^2 + \lambda_2 \lVert w \rVert_2^2 + \lambda_1 \lVert w \rVert_1 \right) \tag{6}$$

On setting $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, the ENET estimator is seen to be equivalent to the minimizer of:

$$\hat{w} = \arg\min_w \; \lVert y - Xw \rVert_2^2 \tag{7}$$

subject to

$$P_\alpha(w) = (1 - \alpha) \lVert w \rVert_1 + \alpha \lVert w \rVert_2^2 \le s \quad \text{for some } s \tag{8}$$

where $P_\alpha(w)$ is the ENET penalty [9]. Thus, the ENET simplifies to simple ridge regression when $\alpha = 1$ and to the lasso when $\alpha = 0$. The L1-norm part of the ENET does automatic variable selection, while the L2-norm part encourages grouped selection and stabilizes the solution paths with respect to random sampling, thereby improving prediction. By inducing a grouping effect during variable selection, such that a group of highly correlated variables tends to have coefficients of similar magnitude, the ENET can select groups of correlated features when the groups are not known in advance. Unlike the lasso, when $p \gg n$ the elastic net can select more than $n$ variables [9].

G. Cox Proportional Hazards Model

The Cox model is a well-recognised statistical technique for analyzing the relationship between patient survival and explanatory variables [10]. The Cox regression model (also known as the proportional hazards regression model) allows us to isolate the effects of several explanatory variables and to deal with censored survival times. It models the survival times of the patients on the gene methylation rates. Proportional hazards regression produces an equation for the hazard function of breast cancer patients given their DNA methylation rates. The hazard function is the probability that a breast cancer patient will die within a short time interval, given that the patient has survived up to the beginning of the interval. The hazard at time t can be interpreted as the risk that a breast cancer patient will die at time t. The hazard function obtained using the Cox regression model is:

$$h(t) = h_0(t) \exp(\beta^T x) \tag{9}$$

where
t: time after the start of the study
h_0(t): the baseline hazard function
β: vector of the regression coefficients
x: vector of the values of the explanatory variables

The baseline hazard function represents the hazard of death when all the methylation rates are zero. Based on the regression coefficients we can identify the genes most correlated with lower or higher survival rates.
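The 'hazard ratio relative to the sample average' used in the evaluation can be illustrated with a short sketch. It assumes one common reading of that phrase, namely the ratio between a patient's hazard and the hazard of a hypothetical patient whose covariates equal the sample means; under the Cox model the baseline hazard cancels, leaving exp(βᵀ(x − x̄)). The coefficients and methylation rates below are made up for illustration.

```python
import math


def relative_hazards(beta, rows):
    """Hazard ratio of each patient relative to the sample-average patient.

    h(t | x) / h(t | x_mean) = exp(beta^T (x - x_mean)); the baseline
    hazard h0(t) cancels, so it never has to be estimated here.
    """
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(beta))]
    ratios = []
    for row in rows:
        linear_predictor = sum(b * (v - m) for b, v, m in zip(beta, row, means))
        ratios.append(math.exp(linear_predictor))
    return ratios


# Hypothetical coefficients for two genes: a positive weight raises the
# hazard with methylation, a negative weight lowers it.
beta_toy = [1.0, -2.0]
# Hypothetical methylation rates, one row per patient.
rows_toy = [
    [0.2, 0.5],
    [0.4, 0.5],
    [0.6, 0.5],
]

ratios_toy = relative_hazards(beta_toy, rows_toy)
print(ratios_toy)
```

A patient whose methylation rates equal the sample means gets a ratio of 1, which is exactly the value assigned to every patient by the baseline of Section III-A; ratios above 1 indicate a higher-than-average predicted risk.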
The regression coefficients with low (strongly negative) values correspond to genes of which the methylation rates are favorably correlated with breast cancer survival, and the regression coefficients with high (strongly positive) values correspond to genes of which the methylation rates are adversely correlated with breast cancer survival. A disadvantage of the Cox model is its proportional hazards (PH) assumption, namely that the impact of each covariate on the hazard remains constant during the entire follow-up time. In our case, however, the genomic expression of a patient might change slightly during the study time, thereby violating the PH assumption [10].

IV. RESULTS

A. Error Analysis

We evaluated the different regression models using 10-fold cross-validation. We then used the Cox proportional hazards model to compute the 'hazard ratios relative to the sample average' for all patients in our test data based on the genes that affect breast cancer prognosis as selected by the different regression models. As visible in the bar graph with the average hazard ratios of patients in the test data, the computed average predicted 'hazard ratio relative to the sample average' based on the chosen genes is substantially larger for patients who die of breast cancer than for patients who are still alive at the time of last contact. In particular, the 'hazard ratio relative to the sample average' based on the genes selected with elastic net regression is over 18.6 times higher for patients who die than for patients who were still alive at the time of last contact. The 'hazard ratio relative to the sample average' based on the genes selected using ridge regression, lasso regression, and the Cox proportional hazards model is respectively 10.0, 3.8, and 3.4 times higher for patients who die than for patients who are still alive at the time of last contact. In other words, the genes selected using elastic net regression and ridge regression are particularly useful for predicting the risk of death of breast cancer patients within a certain time interval. The genes selected using lasso regression and the Cox proportional hazards model are also good predictors of the probability that a breast cancer patient will die within a certain time period, but the hazard predictions based on the genes selected using elastic net regression and ridge regression are more accurate.

The Kaplan-Meier curves compare how long patients with high and low methylation rates of four of the genes selected using our methods survive [11].

[Figure: Kaplan-Meier Survival Curve for GRHPR Gene. x-axis: Time (days); y-axis: Cumulative Survival Percentage (%); curves: high vs. low GRHPR methylation rate]

[Figure: Kaplan-Meier Survival Curve for GADD45A Gene. x-axis: Time (days); y-axis: Cumulative Survival Percentage (%); curves: high vs. low GADD45A methylation rate]

[Figure: Kaplan-Meier Survival Curve for ENOX2 Gene. x-axis: Time (days); y-axis: Cumulative Survival Percentage (%); curves: high vs. low ENOX2 methylation rate]

[Figure: Kaplan-Meier Survival Curve for ANKRD52 Gene. x-axis: Time (days); y-axis: Cumulative Survival Percentage (%); curves: high vs. low ANKRD52 methylation rate]

As visible in the GRHPR Kaplan-Meier curve, breast cancer patients with a relatively high GRHPR methylation rate survive longer than patients with a relatively low GRHPR methylation rate. The Cox proportional hazards model, the lasso regression model, and the ridge regression model all suggest that GRHPR is a gene of which a high methylation rate favorably affects breast cancer survival. In fact, 5 years after the start of the survival study, 85% of the breast cancer patients (who did not withdraw from the study) with high GRHPR methylation rates were still alive, compared with 73% of those with low GRHPR methylation rates. Lasso regression indicates that GADD45A is a gene of which high methylation rates are associated with high cancer survival rates. The GADD45A Kaplan-Meier curve shows that 83% of breast cancer patients (who did not withdraw from the study) with relatively high GADD45A methylation are still alive 5 years after the start of the survival study, compared with 76% of those with relatively low GADD45A methylation. Similarly, both elastic net regression and ridge regression suggest that a high ENOX2 methylation rate negatively affects breast cancer survival, and the Cox proportional hazards model indicates that a high ANKRD52 methylation rate adversely affects breast cancer survival. The curves for ENOX2 and ANKRD52 show that high ENOX2 and ANKRD52 methylation rates do indeed negatively affect breast cancer survival rates.

Cox Proportional Hazards Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
EEF1A1P9 | COL6A2
GRHPR | ANKRD52
CASP3 | C12orf41

Elastic Net Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
CLEC2D | DHDDS
C9orf89 | EXOC1
CASP3 | ENOX2

Lasso Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
GRHPR | GGCX
FUZ | GTPBP8
GADD45A | GRHL2

Ridge Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
ADH5 | DNAJC8
GRHPR | ENOX2
CASP3 | EXOC1

B. Literature Review

Several other projects have applied machine learning techniques to extract valuable information from DNA methylation data. Some previous projects focused on evaluating different statistical methods for analyzing DNA methylation data [18], while others analyzed DNA methylation data for specific types of cancer, such as leukemia [19]. Our project fits in the second framework, since we use different regression techniques for gene identification for breast cancer specifically, using DNA methylation data.

Relative to existing projects, our contribution is two-fold. First, we have compared different regression models (lasso, ridge, Cox proportional hazards, elastic net) to find potential genes highly correlated with breast cancer survival, which further highlights the importance of using different methods in gene identification (i.e. different genes can be found with different methods).
Second, we have found genes of which the methylation rates are highly correlated with breast cancer development (i.e., genes of which methylation has been shown to be linked to breast cancer survival), which may give additional directions for breast cancer research and breast cancer treatment development.

In fact, the favorably and adversely prognostic genes identified by our methods might be worth looking at in order to further understand the biological mechanisms of breast cancer. Many of the genes we identified have been widely acknowledged in the medical literature as genes strongly correlated with breast cancer survival, such as CASP3 [12, 13], GADD45A [14], ENOX2 [15], GRHL2 [16], and COL6A2 [17]. Some of those genes appear among the top three genes of only one method, such as GADD45A and GRHL2 (lasso only) and COL6A2 (Cox proportional hazards only). This underscores the benefits of using distinct methods in the context of gene identification. Moreover, given our success in identifying genes known to be highly correlated with breast cancer survival, the additional genes we found might be worth investigating to further understand breast cancer.

V. CONCLUSION

We have presented different regression techniques to identify genes that are highly correlated with breast cancer survival rates by analyzing the survival and DNA methylation data of 989 breast cancer patients [4]. Our results identify genes widely known in the medical literature to be involved in breast cancer development. The identified genes may prove helpful for the discovery of therapeutic targets and the development of patient-tailored therapy strategies.
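For readers who want to reproduce survival percentages like those quoted in Section IV-A, the Kaplan-Meier product-limit estimator [11] can be sketched in a few lines. This is a generic illustration with made-up follow-up data, not the authors' analysis code; the function name and toy values are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient (time to last contact or event)
    events: 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival probability) at each death time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for five patients (days, event flag).
times_toy = [100, 200, 200, 300, 400]
events_toy = [1, 0, 1, 0, 1]

km_curve = kaplan_meier(times_toy, events_toy)
print(km_curve)
```

Censored patients contribute to the risk set up to their last contact but never count as deaths, which is why the estimator can use withdrawn patients without treating them as survivors or as deaths. Applying it separately to the high-methylation and low-methylation groups of a gene yields the two curves that are compared in each figure.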
ACKNOWLEDGMENT

This project would not have been possible without the help of the Gevaert Biomedical Informatics Lab, which provided both the datasets and ongoing support.

REFERENCES

[1] M. Szyf, 'DNA methylation signatures for breast cancer classification and prognosis', Genome Medicine, vol. 4, no. 3, p. 26, 2012.
[2] S. Baylin, 'Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer', Human Molecular Genetics, vol. 10, no. 7, pp. 687-692, 2001.
[3] K. Hansen, W. Timp, H. Bravo, S. Sabunciyan, B. Langmead, O. McDonald, B. Wen, H. Wu, Y. Liu, D. Diep, E. Briem, K. Zhang, R. Irizarry and A. Feinberg, 'Increased methylation variation in epigenetic domains across cancer types', Nature Genetics, vol. 43, no. 8, pp. 768-775, 2011.
[4] The Cancer Genome Atlas - National Cancer Institute, 'The Cancer Genome Atlas Home Page', 2015. [Online]. Available: http://cancergenome.nih.gov/. [Accessed: 20-Nov-2015].
[5] C. Creighton, 'SR2-3: Integrative Genomic Analyses of Breast Cancer from The Cancer Genome Atlas (TCGA).', Cancer Research, vol. 71, no. 24, pp. SR2-3-SR2-3, 2011.
[6] A. Hoerl and R. Kennard, 'Ridge Regression: Biased Estimation for Nonorthogonal Problems', Technometrics, vol. 42, no. 1.
[7] J. Friedman, T. Hastie and R. Tibshirani, 'Regularization Paths for Generalized Linear Models via Coordinate Descent', Journal of Statistical Software, vol. 33, no. 1, 2010.
[8] H. Zou, 'The Adaptive Lasso and Its Oracle Properties', Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418-1429, 2006.
[9] J. Ogutu, T. Schulz-Streeck and H. Piepho, 'Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions', BMC Proc, vol. 6, no. 2, p. S10, 2012.
[10] M. Abrahamowicz, T. Schopflocher, K. Leffondré, R. du Berger and D. Krewski, 'Flexible Modeling of Exposure-Response Relationship between Long-Term Average Levels of Particulate Air Pollution and Mortality in the American Cancer Society Study', Journal of Toxicology and Environmental Health, Part A, vol. 66, no. 16-19, pp. 1625-1654, 2003.
[11] E. Kaplan and P. Meier, 'Nonparametric Estimation from Incomplete Observations', Journal of the American Statistical Association, vol. 53, no. 282, p. 457, 1958.
[12] N. O'Donovan, J. Crown, H. Stunell, A. Hill, E. McDermott, N. O'Higgins and M. Duffy, 'Caspase 3 in breast cancer', Clin Cancer Res, pp. 738-742, 2003.
[13] E. Devarajan, A. Sahin, J. Chen, R. Krishnamurthy, N. Aggarwal, A. Brun, A. Sapino, F. Zhang, D. Sharma, X. Yang, A. Tora and K. Mehta, 'Down-regulation of caspase 3 in breast cancer: a possible mechanism for chemoresistance', Oncogene, vol. 21, no. 57, pp. 8843-8851, 2002.
[14] J. Tront, Y. Huang, A. Fornace, B. Hoffman and D. Liebermann, 'Gadd45a Functions as a Promoter or Suppressor of Breast Cancer Dependent on the Oncogenic Stress', Cancer Research, vol. 70, no. 23, pp. 9671-9681, 2010.
[15] D. Morré and D. Morré, ECTO-NOX proteins. New York, NY: Springer, 2013.
[16] X. Xiang, Z. Deng, X. Zhuang, S. Ju, J. Mu, H. Jiang, L. Zhang, J. Yan, D. Miller and H. Zhang, 'Grhl2 Determines the Epithelial Phenotype of Breast Cancers and Promotes Tumor Progression', PLoS ONE, vol. 7, no. 12, p. e50781, 2012.
[17] E. Karousou, M. D'Angelo, K. Kouvidi, D. Vigetti, M. Viola, D. Nikitovic, G. De Luca and A. Passi, 'Collagen VI and Hyaluronan: The Common Role in Breast Cancer', BioMed Research International, vol. 2014, pp. 1-10, 2014.
[18] T. Wilhelm, 'Phenotype prediction based on genome-wide DNA methylation data', BMC Bioinformatics, vol. 15, no. 1, p. 193, 2014.
[19] J. Nordlund, C. Bäcklin, V. Zachariadis, et al., 'DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia', Clin Epigenetics, vol. 7, no. 1, p. 11, 2015.