Integrative Pathway-based Survival Prediction
utilizing the Interaction between
Gene Expression and
DNA Methylation in Breast Cancer
So Yeon Kim, Tae Rim Kim,
Hyun-Hwan Jeong, Kyung-Ah Sohn
Translational Bioinformatics Conference 2017
Los Angeles, California
Introduction
Motivation
• Integrative analysis of multi-omics data for precise cancer classification
• Causal relationships between gene expression and DNA methylation in breast cancer data
• Integrated analysis of gene expression and methylation data using pathway information to find important pathways and genes in breast cancer
Biological pathway
(http://www.genome.jp/dbget-bin/www_bget?pathway:map04010/)
Gene expression data
(https://en.wikipedia.org/wiki/Gene_expression_profiling)
Related works (1/3)
• Many studies for integrating multi-omics data by
considering gene-gene interactions and graph-based
integration approaches using clinical information
• Incorporating pathway information in the graph in order to
find important pathways and genes for cancer
• How to incorporate pathway information?
Related works (2/3)
• Several studies about inferring
pathway activity
• A directed random walk-based
pathway activity inference
method (DRW) to identify
topologically important genes
and pathways for predicting
clinical outcome
Liu, et al. Bioinformatics (2013)
Weight genes with their topological importance
Pathway network
Related works (3/3)
• Integrated extension to multi-omics data (DRW-GM)
– Improved accuracy of cancer classification by joint analysis of gene expression and metabolite data
– Found many risk metabolite pathways and topologically important genes for cancer
• What about other profiles?
Liu, et al. Scientific reports (2015)
Overview
(1) DRW-based approach on an integrated gene-gene graph (gene profile to pathway profile)
(2) DA-based feature selection and survival prediction
Methods
Pathway-guided integrated gene-gene graph construction (1/2)
• Assign bi-directional edges
1) to all the overlapping genes
[Figure: global directed pathway graph; the gene expression (RNA-seq) and methylation profiles are linked through overlapping genes]
Pathway-guided integrated gene-gene graph construction (2/2)
• Assign bi-directional edges
1) to all the overlapping genes
2) when the expression and methylation values of the same gene are negatively correlated
[Figure: global directed pathway graph; the gene expression (RNA-seq) and methylation profiles are linked through overlapping genes]
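The edge-assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `(layer, gene)` node encoding, and the choice to copy the directed pathway edges into both profile layers are assumptions. The `require_negative` flag switches between linking all overlapping genes (rule 1) and linking only those whose expression and methylation values are negatively correlated (rule 2).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def build_integrated_graph(pathway_edges, expr, meth, require_negative=True):
    """Return a set of directed edges over (layer, gene) nodes.

    pathway_edges: list of (source_gene, target_gene) pairs from the pathway graph
    expr, meth:    dict gene -> list of per-sample values (same sample order)
    """
    edges = set()
    # Assumption: the directed pathway edges are copied into each profile layer.
    for layer, profile in (("expr", expr), ("meth", meth)):
        for src, dst in pathway_edges:
            if src in profile and dst in profile:
                edges.add(((layer, src), (layer, dst)))
    # Link the two layers through overlapping genes with bi-directional edges.
    for gene in sorted(set(expr) & set(meth)):
        if require_negative and pearson(expr[gene], meth[gene]) >= 0:
            continue  # rule 2: keep only negatively correlated pairs
        edges.add((("expr", gene), ("meth", gene)))
        edges.add((("meth", gene), ("expr", gene)))
    return edges
```

Calling `build_integrated_graph(edges, expr, meth, require_negative=False)` gives the graph of rule 1, in which every overlapping gene is linked regardless of correlation.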
Directed Random Walk based pathway activity inference (1/2)
Integrative Directed Random Walk (iDRW)
• Initial weight of genes w_g: from DESeq2 for RNA-seq (genes × samples) and from a t-test for methylation (genes × samples)
• A random walker moves on the global directed pathway graph (gene expression and methylation genes linked through overlapping genes, plus a ground node)
$W_0 = -\log(w_g + \epsilon)$
$W_{t+1} = (1 - r)\, M^{T} W_t + r\, W_0$
• The iteration yields the final weight vector $W_\infty$
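The random-walk update on this slide can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: M is taken to be the row-normalized adjacency matrix of the graph, W_0 is scaled to sum to 1, the restart probability `r=0.7`, the value of `eps`, and the stopping tolerance are all hypothetical defaults, and the ground node is omitted.

```python
import numpy as np

def initial_weights(p_values, eps=1e-10):
    """W_0 = -log(w_g + eps), scaled to sum to 1 (eps value is an assumption)."""
    w0 = -np.log(np.asarray(p_values, dtype=float) + eps)
    return w0 / w0.sum()

def directed_random_walk(adjacency, w0, r=0.7, tol=1e-10, max_iter=10_000):
    """Iterate W_{t+1} = (1 - r) M^T W_t + r W_0 until the L1 change is below tol."""
    A = np.asarray(adjacency, dtype=float)
    out_degree = A.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without outgoing edges keep an all-zero row.
    M = np.divide(A, out_degree, out=np.zeros_like(A), where=out_degree > 0)
    w = w0.copy()
    for _ in range(max_iter):
        w_next = (1 - r) * M.T @ w + r * w0
        if np.abs(w_next - w).sum() < tol:
            return w_next
        w = w_next
    return w
```

Because the update shrinks differences by a factor of (1 - r) per step, the loop converges for any restart probability r in (0, 1]; genes with small p-values start with large W_0 and pass part of that weight along their outgoing edges.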
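Directed Random Walk based pathway activity inference (2/2)
Integrative Directed Random Walk (iDRW)
• Initial weight of genes w_g: from DESeq2 for RNA-seq (genes × samples) and from a t-test for methylation (genes × samples)
• A random walker moves on the global directed pathway graph (gene expression and methylation genes linked through overlapping genes, plus a ground node)
$W_0 = -\log(w_g + \epsilon)$
$W_{t+1} = (1 - r)\, M^{T} W_t + r\, W_0$
• Pathway profile (pathways × samples): the activity of pathway $P_j$ is
$a(P_j) = \dfrac{\sum_{i=1}^{n_j} W_\infty(g_i) \cdot score(g_i) \cdot z(g_i)}{\sqrt{\sum_{i=1}^{n_j} \left(W_\infty(g_i)\right)^2}}$
where the sums run over the $n_j$ genes $g_i$ considered in pathway $P_j$, $z(g_i)$ is the value of gene $g_i$ in the gene profile, and $score(g_i)$ is the log2 fold change (gene expression) or $sign(tscore(g_i))$ (methylation)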
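As a worked illustration of the pathway activity score, the sketch below computes a(P_j) for one pathway in one sample. It assumes the normalization by the square root of the summed squared weights used in the DRW method of Liu et al. (2013); the function names and the plain-list inputs are hypothetical.

```python
from math import sqrt

def sign(t):
    """Sign of a t-score: +1, -1, or 0 (used as score(g_i) for methylation features)."""
    return (t > 0) - (t < 0)

def pathway_activity(w_inf, score, z):
    """a(P_j) = sum_i W_inf(g_i) * score(g_i) * z(g_i) / sqrt(sum_i W_inf(g_i)^2).

    w_inf: random-walk weights of the genes in the pathway
    score: log2 fold change (expression) or sign of the t-score (methylation)
    z:     values of the same genes in one sample
    """
    numerator = sum(w * s * v for w, s, v in zip(w_inf, score, z))
    denominator = sqrt(sum(w * w for w in w_inf))
    return numerator / denominator if denominator > 0 else 0.0
```

For example, with weights (3, 4), scores (1, -1), and sample values (2, 1), the numerator is 3·1·2 + 4·(-1)·1 = 2 and the denominator is 5, giving an activity of 0.4. Repeating this for every pathway and sample fills the pathway profile matrix.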
Feature selection and survival prediction (1/5)
[Figure: the pathway profile (pathways × samples) is fed to the DA for feature selection (input, hidden, and output layers), followed by risk pathway identification (e.g., pathway 00410, pathway 00060) and survival prediction]
Feature selection and survival prediction (2/5)
• Denoising Autoencoder (DA)
– Effective in selecting features that are robust to input noise
– Captures more distinctive features by learning latent representations of the input data
– The DA is trained to obtain the weight matrix between the input and hidden layers
[Figure: DA for feature selection, with input layer x, hidden layer y, and output layer z]
Feature selection and survival prediction (3/5)
• Denoising Autoencoder (DA)
– Training with a corrupted input yields features that are robust to input noise
– The denoising process can differentiate the features further and discover interesting structure in the input
– Each input (pathway) feature is ranked by the mean of the weights from its input node to all hidden nodes
[Figure: DA for feature selection. The corrupted input $\tilde{x} = x + noise$ enters the input layer; the hidden layer y is the latent representation of x; the output layer z should be seen as the original input x reconstructed from y]
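The DA-based ranking can be sketched with NumPy as below. This is a minimal sketch and not the authors' network: the single sigmoid hidden layer, linear output layer, Gaussian corruption, squared-error loss, and all hyperparameters (`n_hidden`, `noise_std`, `lr`, `epochs`) are assumptions, as is ranking by the signed mean of the weights in descending order.

```python
import numpy as np

def train_denoising_autoencoder(X, n_hidden=8, noise_std=0.1, lr=0.05,
                                epochs=300, seed=0):
    """Train a one-hidden-layer DA; return (W, losses), W of shape (n_features, n_hidden)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W = rng.normal(0.0, 0.1, size=(n_features, n_hidden))  # input -> hidden
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, size=(n_hidden, n_features))  # hidden -> output
    c = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        X_noisy = X + rng.normal(0.0, noise_std, size=X.shape)  # corrupted input
        Y = 1.0 / (1.0 + np.exp(-(X_noisy @ W + b)))            # latent representation y
        Z = Y @ V + c                                           # reconstruction z
        err = Z - X                                             # target is the clean input
        losses.append(float((err ** 2).sum(axis=1).mean()))
        # Backpropagate the squared reconstruction error averaged over samples.
        grad_Z = 2.0 * err / n_samples
        grad_V = Y.T @ grad_Z
        grad_c = grad_Z.sum(axis=0)
        grad_pre = (grad_Z @ V.T) * Y * (1.0 - Y)
        grad_W = X_noisy.T @ grad_pre
        grad_b = grad_pre.sum(axis=0)
        W -= lr * grad_W
        b -= lr * grad_b
        V -= lr * grad_V
        c -= lr * grad_c
    return W, losses

def rank_features(W):
    """Score each input feature by the mean of its weights to all hidden nodes."""
    scores = W.mean(axis=1)
    order = np.argsort(-scores)  # highest mean weight first
    return order, scores
```

With a pathway profile arranged as samples × pathways, `rank_features(W)[0]` gives pathway indices from highest to lowest mean weight; whether the absolute value of the weights should be used instead is a design choice the slides leave open.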
Feature selection and survival prediction (4/5)
• Feature ranking strategy
1) each pathway feature is ranked by the weight matrix of the DA
2) each pathway feature is ranked by its p-value from a t-test of pathway activities, without using the DA
[Figure: the pathway profile (pathways × samples) is ranked either (1) by the DA for feature selection or (2) by a t-test]
Feature selection and survival prediction (5/5)
• A logistic regression model classifies the samples into good and poor groups
• The top-N pathway features that gave the best classification performance in 5-fold cross-validation were selected
[Figure: the ranked pathway profile is split into training and test folds for 5-fold cross-validation, followed by risk pathway identification (e.g., pathway 00410, pathway 00060) and survival prediction]
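The top-N selection step can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: a plain gradient-descent logistic regression stands in for whatever solver was used, accuracy is the only metric scored (the slides also report AUC), and the function names, learning rate, and fold shuffling are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def cv_accuracy(X, y, k=5, seed=0):
    """Mean test accuracy over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    accuracies = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # everything outside the test fold
        w, b = fit_logistic(X[train], y[train])
        pred = (X[fold] @ w + b > 0).astype(int)
        accuracies.append(float(np.mean(pred == y[fold])))
    return float(np.mean(accuracies))

def select_top_n(X, y, ranking, max_n=None):
    """Score the top-1, top-2, ... ranked features and return the best N."""
    max_n = max_n or len(ranking)
    results = {n: cv_accuracy(X[:, ranking[:n]], y) for n in range(1, max_n + 1)}
    best_n = max(results, key=results.get)
    return best_n, results
```

Here `X` is a samples × pathways matrix, `y` holds 0/1 group labels, and `ranking` is an array of pathway indices ordered from most to least important, for instance from the DA weights or from t-test p-values.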
Experiments
Dataset
• TCGA breast cancer data from Broad Institute GDAC Firehose
– Gene expression data from RNA sequencing: 17,673 genes
– DNA methylation data obtained as gene-level features: 17,037 genes
• 465 samples are divided into a good group (218 samples) and a poor group (247 samples)
– poor group: survival days <= 3 years & vital status = 0
– good group: survival days > 3 years
• Directed pathway graph as given in [Liu, et al., 2013]
– Interactions between genes defined from KEGG pathways
– In total, 4,113 genes for each gene profile and 88,440 directed edges in 279 KEGG pathways
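The grouping rule for the samples can be written as a small function. This is a sketch under assumptions: 365 days per year is an assumed conversion, the vital-status coding is taken exactly as written on the slide, and returning `None` for samples that match neither rule is a hypothetical way to express exclusion.

```python
DAYS_PER_YEAR = 365  # assumption: the slides do not say how years map to days

def survival_group(survival_days, vital_status, threshold_years=3):
    """Return 'good', 'poor', or None for samples that match neither rule."""
    threshold = threshold_years * DAYS_PER_YEAR
    if survival_days > threshold:
        return "good"   # survival days > 3 years
    if vital_status == 0:
        return "poor"   # survival days <= 3 years & vital status = 0
    return None         # short follow-up with a different vital status
```

For example, `survival_group(500, 0)` falls in the poor group and `survival_group(2000, 1)` in the good group.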
Incorporating pathway information improved
classification performance
• Show the utility of pathway
profiles obtained by DRW
• Improved the classification
performance using pathway
profiles extracted from the
DRW method
[Figure: mean AUC and mean accuracy (%)]
The proposed integrated gene-gene graph showed the best
classification performance
• Shows the utility of the DRW method on the integrated gene-gene graph (iDRW) with the combined feature data
• The proposed method combined with DA (iDRW+DA) performed best
[Figure: mean AUC and mean accuracy (%)]
DA showed effectiveness in prioritizing significant pathways
• Without DA (iDRW), the method detects pathways that are generally important for many cancers
• With DA (iDRW+DA), the proposed method helps find pathways specifically related to breast cancer
DA found pathways specifically related to breast cancer
Pathway name                                           | Frequency | Total genes | DE genes | DM features
Dorso-ventral axis formation                           | 10/50     | 27          | 4        | 0
Pancreatic secretion                                   | 8/50      | 65          | 26       | 3
Neurotrophin signaling pathway                         | 7/50      | 90          | 47       | 3
Prion diseases                                         | 7/50      | 30          | 12       | 0
One carbon pool by folate                              | 5/50      | 33          | 6        | 1
alpha-Linolenic acid metabolism                        | 5/50      | 23          | 8        | 1
Pyruvate metabolism                                    | 5/50      | 96          | 7        | 1
PPAR signaling pathway                                 | 5/50      | 61          | 13       | 1
T cell receptor signaling pathway                      | 5/50      | 85          | 52       | 8
Focal adhesion                                         | 5/50      | 148         | 83       | 11
Ribosome                                               | 5/50      | 143         | 1        | 0
Glioma                                                 | 5/50      | 52          | 27       | 0
Circadian rhythm – fly                                 | 5/50      | 8           | 4        | 1
Tropane, piperidine and pyridine alkaloid biosynthesis | 5/50      | 26          | 1        | 0
Most patients showed genetic alterations in the differentially expressed genes of the dorso-ventral axis formation pathway
(Breast Invasive Carcinoma dataset in cBioPortal, http://www.cbioportal.org/)
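Differentially expressed genes and methylation features
in the selected pathways are related to breast cancer
Disease                          | Gene   | GDA score
Breast Carcinoma                 | AKT1   | 0.2418
Breast Carcinoma                 | PIK3CD | 0.0448
Breast Carcinoma                 | MAPK3  | 0.0118
Breast Carcinoma                 | HRAS   | 0.0077
Breast Carcinoma                 | BCAR1  | 0.0074
Malignant neoplasm of breast     | AKT1   | 0.2420
Malignant neoplasm of breast     | PIK3CD | 0.0475
Malignant neoplasm of breast     | KDR    | 0.0119
Malignant neoplasm of breast     | MAPK3  | 0.0110
Malignant neoplasm of breast     | PAK1   | 0.0095
Triple Negative Breast Neoplasms | PIK3CD | 0.0047
Triple Negative Breast Neoplasms | AKT1   | 0.0022
Triple Negative Breast Neoplasms | AKT3   | 0.0011
Triple Negative Breast Neoplasms | MAPK3  | 0.0011
Triple Negative Breast Neoplasms | KDR    | 0.0008
* GDA (gene-disease association) score obtained from the DisGeNET database (http://www.disgenet.org/)
The hub genes (MAPK3, HRAS, AKT1, PIK3CD, …) have all been reported as
highly related to pathways that are associated with cancers
About 74% of hub genes are highly related to
breast cancer-related diseases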
Conclusion
• A DRW-based method on a pathway-based integrated gene-
gene graph utilizing the interaction between gene expression
and methylation profile
• Contributions
1. Integrative pathway-guided gene-gene graph construction using the
interaction between gene expression and DNA methylation in breast
cancer data
2. A joint analysis on gene expression and methylation data using the
integrated gene-gene graph
3. Cancer-specific pathways and genes prioritization using a Denoising
Autoencoder (DA)
Acknowledgements
All lab members of LAMDA lab
Kyung-Ah Sohn
Byungkon Kang
Yenewondim Biadgie
Garam Lee
Tefrie Kaleab Getaneh
Habtamu Minassie Aycheh
Sehee Wang
Jungryul Seo
Nam-Hyuk Ahn
Ho-Min Park
Min-Soo Kim
Tae-rim Kim
Young-Bum Choi
Jun-hyung Yu
Jeong-hyun Moon
Jeong Gi Kim
Hae-Yong Hwang
Sijin Kim
Jae-Min Nah
Hyun-Hwan Jeong, Ph.D.
Post-doctoral associate
Baylor College of Medicine
Texas Children’s Hospital
Kyung-Ah Sohn, Ph.D.
Associate Professor
Department of Software and Computer Engineering,
Ajou University
Tae-rim Kim
Graduate student, Masters course
Department of Software and Computer Engineering,
Ajou University
Thank you!
