Gene_Identification_Report

Identifying Genes with Prognostic DNA Methylation
Rates for Breast Cancer Survival
Teun de Planque, Christopher Elamri
Department of Computer Science
Stanford University
Abstract—Breast cancer treatments using methylation in-
hibitors are an effective new therapeutic option for breast cancer
patients [1]. We used four regression models (proportional
hazards regression, elastic net regression, ridge regression, and
lasso regression) to identify genes whose methylation rates
are strongly correlated with breast cancer survival. With each
of the regression models we identified genes whose high
methylation rates are strongly favorably correlated with breast
cancer survival, and genes whose high methylation rates are
strongly adversely correlated with breast cancer survival. A better
understanding of the relationship between DNA methylation rates
and breast cancer survival can assist in the development of
patient-tailored therapy strategies and the discovery of
therapeutic targets.
I. INTRODUCTION
DNA methylation is an epigenetic process by which methyl
groups are added to the cytosine (C) or adenine (A) nucleotides
in the DNA molecule [1]. This addition of a methyl group
to DNA is used to regulate gene expression and ensure stable
gene silencing. Abnormal DNA methylation patterns have been
associated with breast cancer development [1, 2, 3]. However,
epigenetic processes are reversible, and inhibitors of DNA
methylation can reactivate silenced tumor suppressor genes
and restore normal gene function. Therapeutic applications
of methylation inhibitors provide an effective new treatment
option for breast cancer patients. Identifying the genes
whose methylation rates correlate with breast cancer
survival is, however, challenging because of the enormous
number of human genes. In this paper, we use different
regression models, including proportional hazards regression,
elastic net regression, ridge regression, and lasso regression,
to identify genes whose methylation rates strongly correlate
with breast cancer survival.
II. TASK DEFINITION
Using the survival and methylation data of breast cancer
patients as input, our goal is to output a set of genes whose
methylation rates are strong predictors of breast cancer
survival.
A. Datasets
We used two datasets from TCGA (The Cancer Genome
Atlas): one contains survival data of cancer patients, the other
contains genomic data, copy number variation (CNV) data,
and methylation data of cancer patients. The survival dataset
contains the type of cancer (11 different cancer types in total),
the "time to last contact or event," and whether the event
occurred (1: death) or not (0: no death at time of last contact)
for 8,089 different patients [4]. Many of the survival times are
censored, i.e. the time of observation was cut off before death
occurred; this indicates that the patient either was still alive
at the end of the study or withdrew from the study before it
ended. The dataset with genomic
data, copy number variation (CNV) data, and methylation data
contains methylation data for more than 16,500 different genes
of over 1,000 different cancer patients [4, 5].
B. Input and Output
• Input: survival data of breast cancer patients, including
the "time to last contact or event" and whether the event
occurred (1: death) or not (0: no death at time of last
contact), and the methylation data of these breast cancer
patients.
• Output: a set of genes whose methylation rates
are strong predictors of breast cancer survival.
C. Evaluation Metric
We use 10-fold cross-validation to measure the success of
our system by evaluating how well the survival of patients in
the test set can be predicted from the methylation rates
of the genes chosen by our system. To do this, we
compute the hazard of death of the patients in the test set given
their methylation rates for the chosen genes, using the Cox
proportional hazards model. The hazard of death at time t can be
interpreted as the risk of dying at time t. Ideally, the computed
hazard is much higher for patients in the test set who die than for
patients in the test set who survive.
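As an illustration of the 10-fold split described above, the following is a minimal sketch in Python. The function name `k_fold_indices`, the fixed seed, and the round-robin fold assignment are assumptions made for this example; the report does not state how the folds were actually drawn.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# 989 is the number of breast cancer patients reported in the paper.
splits = list(k_fold_indices(989, k=10))
```

Each patient appears in exactly one test fold, so every hazard prediction is made by a model that did not see that patient during fitting.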
III. APPROACH
A. Baseline
Doctors do not yet use methylation data of breast cancer
patients when selecting breast cancer therapy strategies or when
predicting breast cancer survival. In other words, methylation
rates currently do not affect hazard estimates, either for
patients who will survive breast cancer or for patients
who will die from it. Thus, the average 'hazard
ratio relative to the sample average' based on methylation rates
is the same (1) for patients who will survive breast cancer
and for patients who will die from breast cancer. We use this
as our baseline.
B. Oracle
Our oracle knows which genes are most correlated with
breast cancer survival. Thus, it identifies the genes for which
the average predicted 'hazard ratio relative to the sample
average' of patients who will survive breast cancer is minimal,
or for which the average predicted 'hazard ratio relative to the
sample average' of patients who will die from breast cancer is
maximal. We do not know what these genes are, so there is no
way for us to implement the oracle; the purpose of this work
is to identify those genes. Ideally, we can correctly predict
survival for all patients in our test set based on the methylation
rates of the genes selected by the oracle. This corresponds
to an average predicted 'hazard ratio relative to the sample
average' of 0 for patients who will survive breast cancer, and
of ∞ for patients who will die from breast cancer.
C. Data Preprocessing
We merged the dataset containing genomic data, copy
number variation (CNV) data, and methylation data with the
survival dataset by processing all 9,074 patient IDs, putting
all of them in the same format, and then finding the patient
IDs contained in both datasets. We then created a matrix with
methylation data of the 16,020 different genes for all the breast
cancer patients contained in both datasets (989 patients in
total). For each of the 989 patients, we added their survival data,
including the "time to last contact or event" and whether the
event occurred (1: death) or not (0: no death at time of last
contact), to this matrix. Because of the enormous number
of genes included in this matrix (16,020), we reduced the
number of genes by removing genes whose methylation rates
correlate weakly or not at all with breast cancer survival.
We identified these genes using our regression models: we
fitted each model to the survival and methylation data and
then removed the genes with the lowest absolute weights.
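The two preprocessing steps above (matching patients across datasets, then dropping low-weight genes) can be sketched as follows. This is only an illustration: the ID normalization rule, the function names, and all toy IDs, values, and weights are hypothetical, since the report does not give the actual ID formats or cut-off.

```python
def normalize_id(patient_id):
    """Put a patient ID in one format (hypothetical rule: upper-case, '.' -> '-')."""
    return patient_id.strip().upper().replace(".", "-")

def merge_by_patient(survival, methylation):
    """Keep only patients present in both tables, keyed by normalized ID."""
    surv = {normalize_id(k): v for k, v in survival.items()}
    meth = {normalize_id(k): v for k, v in methylation.items()}
    shared = sorted(surv.keys() & meth.keys())
    return {pid: (surv[pid], meth[pid]) for pid in shared}

def keep_top_genes(weights, n_keep):
    """Return the n_keep gene names with the largest absolute model weights."""
    ranked = sorted(weights, key=lambda g: abs(weights[g]), reverse=True)
    return ranked[:n_keep]

# Toy inputs, invented for illustration only.
survival = {"tcga.a1.0001": (1200, 0), "TCGA-A1-0002": (340, 1)}  # (days, event)
methylation = {"TCGA-A1-0001": {"GENE_A": 0.8}, "TCGA-A1-0003": {"GENE_A": 0.1}}

merged = merge_by_patient(survival, methylation)
kept = keep_top_genes({"GENE_A": -0.9, "GENE_B": 0.05, "GENE_C": 0.4}, n_keep=2)
```

Ranking by absolute weight keeps both strongly favorable (negative) and strongly adverse (positive) genes, which matches the goal of finding genes in either direction.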
D. Regularized Least-squares Regression Using Ridge
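Ridge regression minimizes the squared error while regularizing
the L2-norm of the weights [6]:
$$J(w) = \lambda \|w\|_2^2 + \sum_i (w^T x_i - y_i)^2 \tag{1}$$
Then the stationary condition is
$$\frac{1}{2} \frac{\partial J}{\partial w} = \lambda w + \sum_i (w^T x_i - y_i)\, x_i = 0 \tag{2}$$
$$(X X^T + \lambda I)\, w = X y \tag{3}$$
$$w = (X X^T + \lambda I)^{-1} X y \tag{4}$$
where the columns of X are the feature vectors x_i.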
Ridge regression is ideal if there are many predictors (i.e.
the 16,020 genes from our dataset), all with non-zero coef-
ficients and drawn from a normal distribution. In particular,
ridge regression performs well with predictors that have small
effects, and prevents coefficients of regression models with
many correlated variables from being poorly determined and
exhibiting high variance [7].
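The closed-form solution in (4) can be checked numerically with a short sketch. It assumes NumPy and uses synthetic data; following (3), X is stored with one column per sample, and the name `ridge_fit` is chosen for this example only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X X^T + lam I) w = X y, with X of shape (d, n): one column per sample."""
    d = X.shape[0]
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))                 # 5 synthetic features, 40 synthetic samples
w_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X.T @ w_true + 0.1 * rng.normal(size=40)

lam = 1.0
w = ridge_fit(X, y, lam)
# Left-hand side of the stationary condition (2); it should vanish at the solution.
grad = lam * w + X @ (X.T @ w - y)
```

Increasing `lam` shrinks the norm of the solution, which is the variance-reducing effect described above.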
E. Regularized Least-squares Regression Using Lasso
Lasso regression methods are widely used in domains with
massive datasets, such as genomics, for which efficient and
fast algorithms are essential [7]. However, lasso regularization
is not robust to high correlations among predictors: it tends to
arbitrarily choose one predictor from a correlated group and
ignore the others, and it breaks down when all predictors are
identical [8]. Moreover, the lasso penalty expects many
coefficients to be close to zero and only a small subset to be
substantially different from zero. The lasso estimator uses the
L1-norm penalized least squares criterion to obtain a sparse
solution to the following optimization problem:
$$\hat{w} = \arg\min_w J(w), \qquad J(w) = \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{5}$$
where $\|w\|_1 = \sum_{j=1}^{p} |w_j|$ is the L1-norm penalty on w,
which induces sparsity in the solution, and λ ≥ 0 is a tuning
parameter. The L1-norm penalty enables the lasso method to
simultaneously regularize the least squares fit and shrink some
components of w to zero for a suitably chosen λ.
However, the lasso method is unstable for high-dimensional
data, and when p > n it saturates: it cannot select more
variables than the sample size n [8].
F. Regularized Least-squares Regression Using Elastic Net
The elastic net (ENET) is an extension of the lasso that is
robust to high correlations among the predictors. It circumvents
the instability of the lasso solution paths when predictors are
highly correlated, as in our DNA methylation analysis, and can
efficiently analyze high-dimensional data [9]. In particular, the
ENET uses a mixture of the L1-norm (lasso) and L2-norm (ridge
regression) penalties and can be formulated as:
$$\hat{w}_{\mathrm{ENET}} = \left(1 + \frac{\lambda_2}{n}\right) \arg\min_w \left\{ \|y - Xw\|_2^2 + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1 \right\} \tag{6}$$
On setting α = λ2/(λ1 + λ2), the ENET estimator is seen to be
equivalent to the minimizer of:
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 \tag{7}$$
subject to
$$P_\alpha(w) = (1 - \alpha) \|w\|_1 + \alpha \|w\|_2^2 \le s \quad \text{for some } s \tag{8}$$
where P_α(w) is the ENET penalty [9].
Thus, the ENET simplifies to simple ridge regression when
α = 1 and to the lasso when α = 0. The L1-norm part of the
ENET does automatic variable selection, while the L2-norm
part encourages grouped selection and stabilizes the solution
paths with respect to random sampling, thereby improving
prediction. By inducing a grouping effect during variable
selection, such that a group of highly correlated variables tends
to have coefficients of similar magnitude, the ENET can select
groups of correlated features when the groups are not known
in advance. Unlike the lasso, when p >> n the elastic net
can select more than n variables [9].
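The relationship between the two penalties can be sketched in code. The example below assumes NumPy and synthetic data, with one penalty weight for the L1 term (`lam1`) and one for the L2 term (`lam2`); the function names are chosen for this illustration and the coordinate-descent solver is a stand-in, not necessarily the one used for the reported results.

```python
import numpy as np

def enet_alpha(lam1, lam2):
    """Mixing parameter: 0 gives the lasso penalty, 1 gives the ridge penalty."""
    return lam2 / (lam1 + lam2)

def enet_penalty(w, alpha):
    """ENET penalty: (1 - alpha) * ||w||_1 + alpha * ||w||_2^2."""
    return (1 - alpha) * np.abs(w).sum() + alpha * (w ** 2).sum()

def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent on ||y - Xw||^2 + lam2*||w||^2 + lam1*||w||_1,
    followed by the (1 + lam2 / n) rescaling of the ENET estimator."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            # The L2 term only enlarges the denominator, so it never zeroes a weight.
            w[j] = shrunk / (col_sq[j] + lam2)
    return (1.0 + lam2 / n) * w

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))                  # synthetic data, one row per sample
y = X @ np.array([1.0, 1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=60)

w_enet = elastic_net_cd(X, y, lam1=10.0, lam2=3.0)
```

With `lam1 = 0` the inner problem is plain ridge regression, and with `lam2 = 0` it is the lasso, matching the two limiting cases of α.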
G. Cox Proportional Hazards Model
The Cox model is a well-recognized statistical technique
for analyzing the relationship between patient survival and
explanatory variables [10]. The Cox regression model (also
known as the proportional hazards regression model)
allows us to isolate the effects of several explanatory variables
and to deal with censored survival times. It models the
survival times of the patients as a function of the gene
methylation rates. Proportional hazards regression produces an
equation for the hazard function of breast cancer patients given
their DNA methylation rates. The hazard function describes the
probability that a breast cancer patient will die within a short
time interval, given that the patient has survived up to the
beginning of the interval. The hazard at time t can be
interpreted as the risk that a breast cancer patient will die at
time t.
The hazard function obtained using the Cox regression
model is:
$$h(t) = h_0(t) \exp(\beta^T x) \tag{9}$$
where
t: time after the start of the study
h_0(t): the baseline hazard function
β: vector of the regression coefficients
x: vector of the values of the explanatory variables
The baseline hazard function represents the hazard of
dying when all the methylation rates are zero. Based on the
regression coefficients we can identify the genes most correlated
with lower or higher survival rates. Strongly negative regression
coefficients correspond to genes whose methylation
rates are favorably correlated with breast cancer survival
(higher methylation lowers the hazard), and strongly positive
regression coefficients correspond to
genes whose methylation rates are adversely correlated
with breast cancer survival. A disadvantage of the Cox model
is the proportional hazards (PH) assumption: the impact of each
covariate on the hazard is assumed to remain constant during
the entire follow-up time. In our case, the genomic
expression of a patient might change slightly during the study
time, thereby violating the PH assumption [10].
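One way to read the 'hazard ratio relative to the sample average' used in this report is as the Cox hazard of a patient divided by the hazard of a hypothetical patient with average methylation rates; the baseline hazard h_0(t) then cancels. The report does not spell out this definition, so the sketch below is an assumption, and the coefficients and methylation values in it are invented for illustration.

```python
import math

def relative_hazard(beta, x, x_mean):
    """exp(beta^T (x - x_mean)): hazard at x divided by hazard at the sample mean.

    Under h(t) = h0(t) * exp(beta^T x), the baseline h0(t) cancels in the ratio.
    """
    return math.exp(sum(b * (xi - mi) for b, xi, mi in zip(beta, x, x_mean)))

# Invented coefficients: the first gene is favorable (negative weight),
# the second adverse (positive weight).
beta = [-0.5, 0.8]
x_mean = [0.4, 0.3]          # hypothetical average methylation rates
patient = [0.9, 0.1]         # high favorable-gene and low adverse-gene methylation

hr = relative_hazard(beta, patient, x_mean)
```

A value below 1 means the patient's predicted hazard is lower than that of the average patient, and a patient at the sample mean has a ratio of exactly 1, which is the baseline value discussed in Section III-A.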
IV. RESULTS
A. Error Analysis
We evaluated the different regression models using 10-fold
cross-validation. We then used the Cox proportional hazards
model to compute the 'hazard ratios relative to the sample
average' for all patients in our test data based on the genes
that affect breast cancer prognosis as selected by the different
regression models. As visible in the bar graph with the average
hazard ratios of patients in the test data, the average
predicted 'hazard ratio relative to the sample average' based
on the chosen genes is significantly larger for patients who
die of breast cancer than for patients who are still alive at the
time of last contact. In particular, the 'hazard ratio relative to
the sample average' based on the genes selected with elastic
net regression is over 18.6 times higher for patients who die
than for patients who were still alive at the time of last
contact. The 'hazard ratio relative to the sample average' based
on the genes selected using ridge regression, lasso regression,
and the Cox proportional hazards model is respectively 10.0, 3.8,
and 3.4 times higher for patients who die than for patients who
are still alive at the time of last contact. In other words, the
genes selected using elastic net regression and ridge regression
are particularly useful for predicting the risk of death
of breast cancer patients within a certain time interval. The
genes selected using lasso regression and the Cox proportional
hazards model are also good predictors of the probability that
a breast cancer patient will die within a certain
time period, but the hazard predictions based on the genes
selected using elastic net regression and ridge regression are
more accurate.
[Figure: Kaplan-Meier survival curve for the GRHPR gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low GRHPR methylation rate.]
[Figure: Kaplan-Meier survival curve for the GADD45A gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low GADD45A methylation rate.]
The Kaplan-Meier curves compare how long patients with high
and low methylation rates of four of the genes selected using
our methods survive [11]. As visible
in the GRHPR Kaplan-Meier curve, breast cancer patients with
a relatively high GRHPR methylation rate survive longer than
patients with a relatively low GRHPR methylation rate. The
Cox proportional hazards model, the lasso regression model,
and the ridge regression model all suggest that GRHPR is a
gene for which a high methylation rate favorably
affects breast cancer survival. In fact, 5 years after the start of
the survival study, 85% of the breast cancer patients (who did
not withdraw from the study) with high GRHPR methylation
rates were still alive, compared with 73% of the breast cancer
patients (who did not withdraw from the study) with low GRHPR
methylation rates.
[Figure: Kaplan-Meier survival curve for the ENOX2 gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low ENOX2 methylation rate.]
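The survival percentages quoted from Kaplan-Meier curves come from the product-limit estimator [11], which a short sketch can make concrete. The toy follow-up times and event flags below are invented; they use the same coding as the survival dataset (1: death, 0: censored at last contact).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time (events: 1 = death, 0 = censored)."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the fraction of at-risk patients who survive past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve

# Invented toy data: follow-up in days and event indicators.
times = [100, 200, 200, 300, 400]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients lower the number at risk without lowering the survival estimate, which is why the percentages above are stated for patients who did not withdraw from the study. Computing one such curve per methylation group (high vs. low) gives the comparison shown in the figures.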
Lasso regression indicates that GADD45A is a gene for which
high methylation rates are associated with high cancer survival
rates. The GADD45A Kaplan-Meier curve shows that 83% of
breast cancer patients (who did not withdraw from the study)
with relatively high GADD45A methylation are still alive 5
years after the start of the survival study, compared with 76% of
breast cancer patients (who did not withdraw from the study) with
relatively low GADD45A methylation. In contrast, both elastic net
regression and ridge regression suggest that a high ENOX2
methylation rate negatively affects breast cancer survival, and
the Cox proportional hazards model indicates that a high
ANKRD52 methylation rate adversely affects breast cancer
survival. The curves for ENOX2 and ANKRD52 show that
high ENOX2 and ANKRD52 methylation rates do indeed
negatively affect breast cancer survival rates.
[Figure: Kaplan-Meier survival curve for the ANKRD52 gene. x-axis: time (days), 0 to 3000; y-axis: cumulative survival, 0.5 to 1.0; curves: high vs. low ANKRD52 methylation rate.]

Cox Proportional Hazards Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
EEF1A1P9                         | COL6A2
GRHPR                            | ANKRD52
CASP3                            | C12orf41

Elastic Net Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
CLEC2D                           | DHDDS
C9orf89                          | EXOC1
CASP3                            | ENOX2

Lasso Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
GRHPR                            | GGCX
FUZ                              | GTPBP8
GADD45A                          | GRHL2

Ridge Regression Model
Top 3 favorably prognostic genes | Top 3 adversely prognostic genes
ADH5                             | DNAJC8
GRHPR                            | ENOX2
CASP3                            | EXOC1

B. Literature Review
Several other projects have focused on applying machine
learning techniques in order to extract valuable information
from DNA methylation data. Previous projects mainly focused
on evaluating different statistical methods for analyzing DNA
methylation data [18], while others analyzed DNA methylation
data for specific types of cancer, such as leukemia [19]. In that
context, our project fits in the second framework, since we use
different regression techniques for gene identification for breast
cancer specifically using DNA methylation data.
Relative to existing projects, our contribution is two-fold.
First, we have compared different regression models (lasso,
ridge, Cox proportional hazards, elastic net) to find potential
genes highly correlated with breast cancer survival, which
further highlights the importance of using different methods
in gene identification (i.e. different genes can be found with
different methods). Second, we have found genes whose
methylation rates are highly correlated with breast cancer
survival, including genes that have previously been linked to
breast cancer, which may
give additional directions for breast cancer research and breast
cancer treatment development.
In fact, the favorably and adversely prognostic genes
identified by our methods might be worth looking at in order to
further understand breast cancer biological mechanisms. Many
of the genes we identified have been widely acknowledged in
the medical literature as genes strongly correlated with breast
cancer survival, such as CASP3 [12, 13], GADD45A [14],
ENOX2 [15], GRHL2 [16], and COL6A2 [17]. Some of those
genes appear in the top-3 list of only one method, such as
GADD45A and GRHL2 (lasso only) and COL6A2 (Cox proportional
hazards only). This underscores the benefits of
using distinct methods in the context of gene identification.
Moreover, given our success in identifying genes known to be
highly correlated with breast cancer survival, the additional genes
we found might be worth investigating to further understand
breast cancer.
V. CONCLUSION
We have presented different regression techniques to identify
genes that are highly correlated with breast cancer survival
rates by analyzing the survival and DNA methylation data
of 989 breast cancer patients [4]. Our results identify genes
widely known in the medical literature to be involved in
breast cancer development. The identified genes may prove
to be helpful for the discovery of therapeutic targets and the
development of patient-tailored therapy strategies.

ACKNOWLEDGMENT
This project would not have been possible without the help
of the Gevaert Biomedical Informatics Lab, which provided
both the datasets and ongoing support.
REFERENCES
[1] M. Szyf, ’DNA methylation signatures for breast cancer
classification and prognosis’, Genome Medicine, vol. 4, no.
3, p. 26, 2012.
[2] S. Baylin, ’Aberrant patterns of DNA methylation,
chromatin formation and gene expression in cancer’,
Human Molecular Genetics, vol. 10, no. 7, pp. 687-692,
2001.
[3] K. Hansen, W. Timp, H. Bravo, S. Sabunciyan, B.
Langmead, O. McDonald, B. Wen, H. Wu, Y. Liu, D. Diep,
E. Briem, K. Zhang, R. Irizarry and A. Feinberg, ’Increased
methylation variation in epigenetic domains across cancer
types’, Nature Genetics, vol. 43, no. 8, pp. 768-775, 2011.
[4] The Cancer Genome Atlas - National Cancer Institute,
’The Cancer Genome Atlas Home Page’, 2015. [Online].
Available: http://cancergenome.nih.gov/. [Accessed: 20-Nov-2015].
[5] C. Creighton, ’SR2-3: Integrative Genomic Analyses of
Breast Cancer from The Cancer Genome Atlas (TCGA).’,
Cancer Research, vol. 71, no. 24, pp. SR2-3-SR2-3, 2011.
[6] A. Hoerl and R. Kennard, ’Ridge Regression: Biased
Estimation for Nonorthogonal Problems’, Technometrics,
vol. 42, no. 1.
[7] J. Friedman, T. Hastie and R. Tibshirani, ’Regularization
Paths for Generalized Linear Models via Coordinate
Descent’, Journal of Statistical Software, vol. 33, no. 1, 2010.
[8] H. Zou, ’The Adaptive Lasso and Its Oracle Properties’,
Journal of the American Statistical Association, vol. 101,
no. 476, pp. 1418-1429, 2006.
[9] J. Ogutu, T. Schulz-Streeck and H. Piepho, ’Genomic
selection using regularized linear regression models: ridge
regression, lasso, elastic net and their extensions’, BMC
Proc, vol. 6, no. 2, p. S10, 2012.
[10] M. Abrahamowicz, T. Schopflocher, K. Leffondré, R. du
Berger and D. Krewski, ’Flexible Modeling of
Exposure-Response Relationship between Long-Term Average Levels
of Particulate Air Pollution and Mortality in the American
Cancer Society Study’, Journal of Toxicology and
Environmental Health, Part A, vol. 66, no. 16-19, pp. 1625-1654,
2003.
[11] E. Kaplan and P. Meier, ’Nonparametric Estimation
from Incomplete Observations’, Journal of the American
Statistical Association, vol. 53, no. 282, p. 457, 1958.
[12] O’Donovan N, Crown J, Stunell H, Hill AD, McDermott
E, O’Higgins N, Duffy MJ. ’Caspase 3 in breast cancer’,
Clin Cancer Res, pp. 738-742, 2003.
[13] E. Devarajan, A. Sahin, J. Chen, R. Krishnamurthy, N.
Aggarwal, A. Brun, A. Sapino, F. Zhang, D. Sharma, X. Yang,
A. Tora and K. Mehta, ’Down-regulation of caspase 3 in
breast cancer: a possible mechanism for chemoresistance’,
Oncogene, vol. 21, no. 57, pp. 8843-8851, 2002.
[14] J. Tront, Y. Huang, A. Fornace, B. Hoffman and
D. Liebermann, ’Gadd45a Functions as a Promoter or
Suppressor of Breast Cancer Dependent on the Oncogenic
Stress’, Cancer Research, vol. 70, no. 23, pp. 9671-9681,
2010.
[15] D. Morré and D. Morré, ECTO-NOX proteins. New
York, NY Springer, 2013.
[16] X. Xiang, Z. Deng, X. Zhuang, S. Ju, J. Mu, H. Jiang, L.
Zhang, J. Yan, D. Miller and H. Zhang, ’Grhl2 Determines
the Epithelial Phenotype of Breast Cancers and Promotes
Tumor Progression’, PLoS ONE, vol. 7, no. 12, p. e50781,
2012.
[17] E. Karousou, M. D’Angelo, K. Kouvidi, D. Vigetti, M.
Viola, D. Nikitovic, G. De Luca and A. Passi, ’Collagen
VI and Hyaluronan: The Common Role in Breast Cancer’,
BioMed Research International, vol. 2014, pp. 1-10, 2014.
[18] T. Wilhelm, ’Phenotype prediction based on genome-wide
DNA methylation data’, BMC Bioinformatics, vol. 15, no.
1, p. 193, 2014.
[19] J. Nordlund, C. Bäcklin, V. Zachariadis, et al. ’DNA
methylation-based subtype prediction for pediatric acute
lymphoblastic leukemia’, Clin Epigenetics, vol. 7, no. 1, p.
11, 2015.
