Robust Pathway-based Multi-Omics Data Integration
using Directed Random Walk for Survival Prediction
in Multiple Cancer Studies
So Yeon Kim, Hyun-Hwan Jeong, Jaesik Kim,
Jeong-Hyeon Moon, Kyung-Ah Sohn
17TH ANNUAL INTERNATIONAL CONFERENCE ON CRITICAL
ASSESSMENT OF MASSIVE DATA ANALYSIS (CAMDA 2018)
CANCER DATA INTEGRATION CHALLENGE
CAMDA 2018 1
Outline
• Introduction
− Motivation
− Related works
• Methods
• Results
• Conclusion
Introduction
Motivation (1/3)
• The rich information in multi-omics data provides opportunities for better biological understanding and improved clinical outcome prediction
• Integrative analysis is important for discovering interrelationships between multiple levels of data
Weinstein et al. Nature genetics, 2013
Motivation (2/3)
• Graph-based integration methods are effective at combining multi-omics data because they take the interactions between different types of genomic data into account
Kim et al. Journal of the American Medical Informatics Association, 2014
Motivation (3/3)
• Incorporating genomic knowledge such as pathway information into the integrated graph can increase predictive power and help find important genes and pathways in cancers
Liu et al. Bioinformatics, 2013
Related works (1/7)
• Pathway-based integrative methods
− These methods simply transform a single genomic profile into a pathway profile using an activity scoring measure
− Pathway level analysis of gene expression (PLAGE) uses the singular vector from the singular value decomposition of a given gene set
Tomfohr et al. Bioinformatics, 2005
Related works (2/7)
• Pathway-based integrative methods
− The z-score method converts the gene expression profile into z-scores and combines the z-scores of the genes in each pathway per sample
− These methods treat a pathway as a plain set of genes
− It is better to also consider gene-gene interactions
Lee et al. PLoS Comput Biol, 2008
Related works (3/7)
• Some methods utilize gene-gene interactions on a graph
− A denoising algorithm based on relevance network topology (DART) integrates pathways by deriving perturbation signatures that reflect gene contributions in each pathway
Jiao et al. Bioinformatics, 2011
Related works (4/7)
• Some methods utilize gene-gene interactions on a graph
− A directed random walk-based pathway activity inference method (DRW) identifies topologically important genes and pathways by weighting the genes in the gene-gene network
Liu et al. Bioinformatics, 2013
Related works (5/7)
• Some methods utilize gene-gene interactions on a graph
− DRW-GM extends DRW to integrate multi-omics data
− It improved prediction performance
− A joint analysis of gene expression and metabolite data found many risk metabolite pathways and topologically important genes for cancer
Liu et al. Scientific reports, 2015
Related works (6/7)
• Integrative DRW (iDRW) builds on DRW-based methods and incorporates the interactions between gene expression and methylation features
Kim et al. BMC Medical Genomics, 2018 (to appear)
Related works (7/7)
• iDRW improved survival prediction power and jointly analyzed gene expression and methylation data on an integrated gene-gene graph
Kim et al. BMC Medical Genomics, 2018 (to appear)
Overview
• Investigate the effectiveness of the iDRW method on other types of genomic profiles for two different cancers
• Reflect the interactions between gene expression and copy number data on an integrated graph
• Construct the graph with the updated pathway database
• Classify breast cancer and neuroblastoma patient samples into survival groups
Methods
Integrated gene-gene graph construction (1/2)
• 327 human pathways and their corresponding gene sets from the KEGG database
• Interactions between genes were defined using the R KEGGgraph package
• Integrated directed gene-gene graph
− 7,390 nodes and 58,426 edges
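A minimal sketch of how per-pathway edge lists could be merged into a single directed gene-gene graph. It is written in Python rather than R and does not use KEGGgraph; the function name `merge_pathways`, the pathway IDs and the gene names are all hypothetical, and the toy input is not real KEGG content.

```python
def merge_pathways(pathway_edges):
    """Union the directed gene-gene edges of several pathways into one graph.

    pathway_edges maps a pathway ID to a list of (source_gene, target_gene)
    pairs. Returns (nodes, edges) as sets, so an edge shared by several
    pathways is stored only once.
    """
    nodes, edges = set(), set()
    for edge_list in pathway_edges.values():
        for src, dst in edge_list:
            nodes.update((src, dst))
            edges.add((src, dst))
    return nodes, edges


# Toy input with made-up pathway and gene names (assumption, not KEGG data).
toy_pathways = {
    "pathway_1": [("geneA", "geneB"), ("geneB", "geneC")],
    "pathway_2": [("geneB", "geneC"), ("geneC", "geneD")],
}
nodes, edges = merge_pathways(toy_pathways)
print(len(nodes), "nodes,", len(edges), "edges")
```

Storing edges as ordered pairs keeps the direction of each interaction, which the random walk later relies on.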
[Figure: KEGG PATHWAY Database and genes, labeled A and B]
Integrated gene-gene graph construction (2/2)
• To reflect the impact of copy number variation on gene expression,
we assign directional edges to all the overlapping genes
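The cross-omics step can be sketched as follows. This is an illustration under stated assumptions: nodes are modelled as `(layer, gene)` tuples, pathway edges are copied into each layer whose gene set contains both endpoints, and the edge for an overlapping gene is directed from its copy number node to its gene expression node (the direction is inferred from "impact of copy number variation on gene expression"). The names `cross_omics_edges` and `integrate` are hypothetical.

```python
def cross_omics_edges(expr_genes, cna_genes):
    """Directed edges CNA -> EXP for every gene present in both profiles."""
    overlap = sorted(set(expr_genes) & set(cna_genes))
    return [(("CNA", gene), ("EXP", gene)) for gene in overlap]


def integrate(pathway_edges, expr_genes, cna_genes):
    """Build the edge set of a two-layer graph from one pathway edge list."""
    expr_genes, cna_genes = set(expr_genes), set(cna_genes)
    edges = set()
    for src, dst in pathway_edges:
        # Copy the pathway edge into each layer that measures both genes.
        if src in expr_genes and dst in expr_genes:
            edges.add((("EXP", src), ("EXP", dst)))
        if src in cna_genes and dst in cna_genes:
            edges.add((("CNA", src), ("CNA", dst)))
    # Link the two layers through the overlapping genes.
    edges.update(cross_omics_edges(expr_genes, cna_genes))
    return edges


# Toy example with made-up gene names.
integrated = integrate(
    pathway_edges=[("g1", "g2"), ("g2", "g3")],
    expr_genes={"g1", "g2", "g3"},
    cna_genes={"g2", "g3", "g4"},
)
for edge in sorted(integrated):
    print(edge)
```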
[Figure: gene expression and copy number alteration layers joined through their overlapping genes]
Pathway activity inference
• The weight of gene $g$, $w_g$, is the p-value from
- DESeq2 analysis (RNA-Seq)
- Two-tailed t-test (Microarray)
- $\chi^2$-test of independence (Copy number data)
[Figure: gene-by-sample matrices of gene expression and CNA values $z(g_i)$]
Weight initialization
$W_0 = -\log(w_g + \epsilon)$
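The weight initialization can be illustrated with a few lines of Python. This is a sketch, not the authors' code: the value of epsilon, the use of the natural logarithm and the optional normalization to unit sum are assumptions, and the p-values are made up.

```python
import math

EPS = 1e-10  # hypothetical value; the slides do not state epsilon


def initial_weights(p_values, eps=EPS):
    """W_0(g) = -log(w_g + eps): smaller p-values give larger weights."""
    return {gene: -math.log(p + eps) for gene, p in p_values.items()}


def normalize(weights):
    """Optional step (assumption): rescale the weights so they sum to 1."""
    total = sum(weights.values())
    return {gene: w / total for gene, w in weights.items()}


# Made-up p-values for three genes.
p_values = {"geneA": 0.001, "geneB": 0.05, "geneC": 0.9}
w0 = initial_weights(p_values)
w0_unit = normalize(w0)
for gene in p_values:
    print(gene, round(w0[gene], 3), round(w0_unit[gene], 3))
```

The epsilon term keeps the weight finite when a test returns a p-value of exactly zero.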
Pathway activity inference
[Figure: a random walker moves over the global directed gene-gene graph, which contains gene expression nodes, copy number alteration nodes and a ground node; the walk starts from the initial weights $W_0$ and converges to $W_\infty$]
Integrative Directed Random Walk (iDRW)
$W_{t+1} = (1 - r) M^T W_t + r W_0$
where $M$ is the row-normalized adjacency matrix of the graph and $r$ is the restart probability
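The update rule can be iterated to convergence as sketched below. The sketch assumes that M is the row-normalized adjacency matrix (each node spreads its weight equally over its outgoing edges) and that r is the restart probability; the default `r=0.7`, the tolerance and the toy graph are illustrative choices, and the ground node from the figure is not modelled.

```python
def directed_random_walk(edges, w0, r=0.7, tol=1e-10, max_iter=1000):
    """Iterate W_{t+1} = (1 - r) * M^T W_t + r * W_0 until it stops changing.

    edges: iterable of distinct (source, target) pairs.
    w0:    dict with an initial weight for every node that appears in edges.
    """
    edges = list(edges)
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1

    w = dict(w0)
    for _ in range(max_iter):
        # Restart term r * W_0 ...
        nxt = {node: r * w0[node] for node in w0}
        # ... plus the weight that flows along each directed edge.
        for src, dst in edges:
            nxt[dst] += (1 - r) * w[src] / out_degree[src]
        change = sum(abs(nxt[node] - w[node]) for node in w0)
        w = nxt
        if change < tol:
            break
    return w


# Toy graph with made-up node names.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
toy_w0 = {"a": 0.5, "b": 0.3, "c": 0.2}
w_inf = directed_random_walk(toy_edges, toy_w0)
print({node: round(value, 5) for node, value in w_inf.items()})
```

Because the walk follows edge directions, nodes that many weighted genes point to accumulate weight, while a node without incoming edges keeps only its restart share.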
Pathway activity inference
[Figure: the converged weights $W_\infty$ from the Integrative Directed Random Walk (iDRW) on the global directed graph (gene expression nodes, copy number alteration nodes, ground node) turn the gene profiles into a pathway-by-sample pathway profile with entries $a(P_j)$]
$a(P_j) = \frac{\sum_{i=1}^{n_j} W_\infty(g_i) \cdot score(g_i) \cdot z(g_i)}{\sqrt{\sum_{i=1}^{n_j} (W_\infty(g_i))^2}}$
$P_j = \{g_1, g_2, \ldots, g_{n_j}\}$: the $n_j$ differential genes of pathway $P_j$
The score of gene $g_i$, $score(g_i)$, is
- $\log_2$ fold change from DESeq2 analysis (RNA-Seq)
- $sign(tscore(g_i))$ (Microarray)
- $mean(CNA(g_i)_{poor}) - mean(CNA(g_i)_{good})$ (Copy number data)
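For one sample and one pathway, the activity score can be computed as in the sketch below. The division by the square root of the summed squared weights follows the DRW formulation and is an assumption here; the function name and all numbers are made up.

```python
import math


def pathway_activity(genes, w_inf, score, z):
    """a(P_j) for one sample.

    genes: the differential genes of pathway P_j
    w_inf: converged random-walk weight of each gene
    score: per-gene score (log2 fold change, sign of the t-score,
           or difference of mean CNA between the poor and good groups)
    z:     the sample's value for each gene
    """
    numerator = sum(w_inf[g] * score[g] * z[g] for g in genes)
    denominator = math.sqrt(sum(w_inf[g] ** 2 for g in genes))
    return numerator / denominator if denominator > 0 else 0.0


# Toy values for a two-gene pathway.
activity = pathway_activity(
    genes=["g1", "g2"],
    w_inf={"g1": 0.6, "g2": 0.8},
    score={"g1": 1.0, "g2": -1.0},
    z={"g1": 2.0, "g2": 0.5},
)
print(round(activity, 6))
```

Repeating this for every pathway and every sample yields the pathway profile used in the next step.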
Pathway feature selection and survival prediction
• Feature ranking strategy
− Pathways are ranked by the p-values from the t-test of their pathway activities
− The activities of the top-k pathways across samples are the input to the classification model
[Figure: pathway profile (pathways by samples, entries $a(P_j)$) ranked by p-value from the t-test; top-k pathway feature selection]
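The ranking step might look like the following sketch. It assumes the t-test compares the good and poor survival groups, and it ranks by the absolute value of Student's pooled-variance t statistic: because every pathway is tested on the same samples, the degrees of freedom are identical, so a larger |t| always means a smaller two-tailed p-value. Function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def top_k_pathways(activities, labels, k):
    """Return the k pathways whose activities differ most between groups.

    activities: {pathway: [activity per sample]}
    labels:     "good" or "poor" for each sample, in the same order
    """
    ranked = []
    for pathway, values in activities.items():
        good = [v for v, y in zip(values, labels) if y == "good"]
        poor = [v for v, y in zip(values, labels) if y == "poor"]
        ranked.append((abs(t_statistic(good, poor)), pathway))
    ranked.sort(reverse=True)  # largest |t| first = smallest p-value first
    return [pathway for _, pathway in ranked[:k]]


# Made-up pathway profile for six samples.
labels = ["good", "good", "good", "poor", "poor", "poor"]
activities = {
    "pathway_x": [1.0, 1.2, 0.9, -1.0, -1.1, -0.8],
    "pathway_y": [0.1, -0.2, 0.3, 0.0, 0.2, -0.1],
    "pathway_z": [0.5, 0.7, 0.4, 0.1, -0.1, 0.2],
}
selected = top_k_pathways(activities, labels, k=2)
print(selected)
```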
Pathway feature selection and survival prediction
• Survival prediction
− A logistic regression model classifies the samples into good and poor groups
− The top-k pathway features that showed the best classification performance are selected empirically
[Figure: ranked pathway profile (e.g. pathway 00410, pathway 00060) feeding risk-active pathway identification and survival prediction]
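The empirical choice of k can be sketched with a small logistic regression trained by batch gradient descent. This is an illustration only: the learning rate, the number of epochs, the label encoding (1 = poor, 0 = good), the single train/validation split and the toy data are all assumptions, and a real analysis would normally use an established library and cross-validation.

```python
import math


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, X):
    w, b = model
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_k(X_train, y_train, X_val, y_val, max_k):
    """Try the first k ranked features for k = 1..max_k; keep the best k."""
    best_k, best_acc = 0, -1.0
    for k in range(1, max_k + 1):
        model = fit_logistic([row[:k] for row in X_train], y_train)
        acc = accuracy(y_val, predict(model, [row[:k] for row in X_val]))
        if acc > best_acc:  # ties keep the smaller k
            best_k, best_acc = k, acc
    return best_k, best_acc


# Toy pathway profile: columns are pathways already sorted by rank.
X_train = [[2.0, 0.1], [1.5, -0.3], [1.8, 0.2],
           [-2.0, 0.0], [-1.7, 0.4], [-1.9, -0.2]]
y_train = [1, 1, 1, 0, 0, 0]
X_val = [[1.6, 0.3], [-1.4, -0.1]]
y_val = [1, 0]
best_k, best_acc = select_k(X_train, y_train, X_val, y_val, max_k=2)
print(best_k, best_acc)
```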
Results
Challenge Dataset (1/2)
• Breast cancer patient data from the METABRIC dataset
− mRNA expression profiles of 24,368 genes from the Illumina Human v3 microarray (log intensity levels)
− Putative copy number alteration data for 22,544 genes
− 1,648 patient samples are divided into 908 good (> 10 years) and 740 poor (≤ 10 years) samples
Average survival years: 10
Average age at diagnosis: 62
Performance evaluation (1/2)
Survival prediction: 1,648 patients, 908 in the good group (long-term survival) and 740 in the poor group (short-term survival)
              Predicted good   Predicted poor
Actual good   TP               FN
Actual poor   FP               TN
Classification accuracy
$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$
5-fold cross-validation: the samples are split into folds 1 to 5; each fold serves once as the validation set while the remaining folds form the training set
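The evaluation scheme can be sketched in plain Python. The split below uses contiguous folds without shuffling or stratification, which is a simplification; the function names and the confusion-matrix counts in the example are made up.

```python
def confusion_counts(y_true, y_pred, positive="good"):
    """Count TP, FN, FP, TN with the good group as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Fold sizes for the 1,648 breast cancer samples.
folds = list(k_fold_indices(1648, k=5))
print([len(validation) for _, validation in folds])

# Accuracy for one made-up set of predictions.
counts = confusion_counts(
    ["good", "good", "poor", "poor", "good"],
    ["good", "poor", "poor", "good", "good"],
)
print(counts, accuracy(*counts))
```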
Challenge Dataset (2/2)
• Neuroblastoma dataset from NCBI GSE49711
− RNA sequencing gene expression profiles of 60,586 genes
− DNA copy number data for 22,692 genes
− 144 patient samples are divided into 38 good and 105 poor samples (binary class label for overall survival days provided by the NCBI dataset)
88 56
Average survival years: < 1 year
Average age at diagnosis: 16 months
Performance evaluation (2/2)
Survival prediction: 144 patients, 38 in the good group (long-term survival) and 105 in the poor group (short-term survival)
              Predicted good   Predicted poor
Actual good   TP               FN
Actual poor   FP               TN
Classification accuracy
$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$
Leave-one-out cross-validation: with N samples there are N folds (fold 1, fold 2, ..., fold N); each sample serves once as the validation set while the remaining samples form the training set
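Leave-one-out cross-validation can be sketched as a loop that holds out one sample at a time. To keep the example short, the classifier is a nearest-class-mean rule on a single feature, a stand-in for the logistic regression model; all names and values are hypothetical.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, held_out_index) once for every sample."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], i


def loocv_accuracy(X, y, fit, predict_one):
    """Fraction of held-out samples that are classified correctly."""
    correct = 0
    for train_idx, test_idx in leave_one_out(len(X)):
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct += predict_one(model, X[test_idx]) == y[test_idx]
    return correct / len(X)


def fit_means(X, y):
    """Stand-in classifier: remember the mean feature value of each class."""
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


# Made-up one-dimensional pathway activities.
X = [0.9, 1.1, 1.0, -1.0, -0.8, -1.2]
y = ["good", "good", "good", "poor", "poor", "poor"]
loo_acc = loocv_accuracy(X, y, fit_means, predict_nearest)
print(loo_acc)
```

With only 144 samples, holding out one sample at a time leaves almost all data for training in every round, at the cost of fitting the model N times.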
Pathway-based methods
• For the gene expression data in each dataset, four pathway-based methods were compared
− PLAGE [Tomfohr et al. Bioinformatics, 2005]
− Z-score [Lee et al. PLoS Comput Biol, 2008]
− DART [Jiao et al. Bioinformatics, 2011]
− DRW [Liu et al. Bioinformatics, 2013]
• Classification performance was evaluated in the same way as for the proposed method
Integrative analysis on multi-omics data improves
survival prediction performance (1/2)
• Four pathway-based methods applied to the gene expression profile alone
• The iDRW method applied to the gene expression profile and copy number data
[Figure: comparison of the methods in (A) breast cancer and (B) neuroblastoma patients]
Integrative analysis on multi-omics data improves
survival prediction performance (2/2)
• Performance improved when the interactions between genes on a graph were utilized
• DRW-based methods in particular contributed more to the performance improvement
• iDRW performed the best on both cancer datasets
[Figure: panels for breast cancer and neuroblastoma]
iDRW identifies cancer-associated pathways and genes (1/5)
Dataset: Breast cancer (k = 25)
Pathway ID  Pathway name                                          Total genes  EXP  CNA
hsa04740    Olfactory transduction                                419          54   268
hsa04014    Ras signaling pathway                                 232          68   164
hsa04015    Rap1 signaling pathway                                206          64   142
hsa04916    Melanogenesis                                         101          37   73
hsa04722    Neurotrophin signaling pathway                        119          38   84
hsa05200    Pathways in cancer                                    526          166  359
hsa04933    AGE-RAGE signaling pathway in diabetic complications  99           37   67
hsa04530    Tight junction                                        170          53   107
hsa04510    Focal adhesion                                        199          76   125
hsa04080    Neuroactive ligand-receptor interaction               278          64   193
hsa05225    Hepatocellular carcinoma                              168          56   112
hsa04020    Calcium signaling pathway                             182          59   136
hsa04024    cAMP signaling pathway                                198          58   139
Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the total number of genes and the number of significant genes whose p-value ($w_g$) < 0.05 from gene expression (EXP) or copy number data (CNA) are shown.
iDRW identifies cancer-associated pathways and genes (2/5)
Dataset: Breast cancer (k = 25), continued
Pathway ID  Pathway name                                          Total genes  EXP  CNA
hsa04217    Necroptosis                                           164          49   97
hsa04060    Cytokine-cytokine receptor interaction                270          70   192
hsa05152    Tuberculosis                                          179          58   112
hsa05165    Human papillomavirus infection                        319          103  210
hsa04810    Regulation of actin cytoskeleton                      208          64   132
hsa04151    PI3K-Akt signaling pathway                            352          119  241
hsa04022    cGMP-PKG signaling pathway                            163          58   109
hsa04630    Jak-STAT signaling pathway                            162          43   112
hsa05167    Kaposi's sarcoma-associated herpesvirus infection     186          61   114
hsa04010    MAPK signaling pathway                                295          87   209
hsa04371    Apelin signaling pathway                              137          46   99
hsa04390    Hippo signaling pathway                               154          58   100
Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the total number of genes and the number of significant genes whose p-value ($w_g$) < 0.05 from gene expression (EXP) or copy number data (CNA) are shown.
iDRW identifies cancer-associated pathways and genes (3/5)
Hanahan et al. Cell, 2011
Six biological capabilities are acquired during tumor development
Some of the top-ranked pathways (Ras signaling, Necroptosis, Regulation of actin cytoskeleton, and PI3K-Akt signaling pathway) are related to at least one of these six capabilities
“…overexpression of 34 Olfactory Receptors genes has been reported in patients bearing breast tumors caused by CHEK2 1100delC mutation…”
iDRW identifies cancer-associated pathways and genes (4/5)
Dataset: Neuroblastoma (k = 5)
Pathway ID  Pathway name                             Total genes  EXP  CNA
hsa04976    Bile secretion                           71           13   5
hsa05034    Alcoholism                               180          22   7
hsa01100    Metabolic pathways                       1273         43   93
hsa04080    Neuroactive ligand-receptor interaction  278          21   24
hsa04151    PI3K-Akt signaling pathway               352          19   31
Top-k pathways ranked by the iDRW method in neuroblastoma data. For each pathway, the total number of genes and the number of significant genes whose p-value ($w_g$) < 0.05 from gene expression (EXP) or copy number data (CNA) are shown.
iDRW identifies cancer-associated pathways and genes (5/5)
“… we propose a mechanism underlying a potent and selective anti-tumor effect of LCA in cultured human neuroblastoma cells …”
“…the level of Urinary catecholamine metabolites which consist of vanillylmandelic acid (VMA), homovanillic acid (HVA) and dopamine elevated in neuroblastoma patients…”
Conclusions
• We showed the effectiveness of an integrative directed random walk-based method utilizing pathway information (iDRW) on different cancer datasets
• We benchmarked iDRW against several state-of-the-art pathway-based methods for survival prediction
Conclusions
• Contributions
− Rebuilt the directed gene-gene graph to take the interactions between gene expression and copy number data into account
− Jointly identified cancer-related pathways and genes from gene expression and copy number data for the breast cancer and neuroblastoma datasets
Acknowledgements
All lab members of LAMDA lab
Kyung-Ah Sohn
Byungkon Kang
Yenewondim Biadgie
Garam Lee
Habtamu Minassie Aycheh
Sehee Wang
Jungryul Seo
Nam-Hyuk Ahn
Min-Soo Kim
Tae-rim Kim
Young-Bum Choi
Jun-hyung Yu
Jeong-hyun Moon
Jaesik Kim
Sijin Kim
Heejin Kim
Joon-seon Hwang
Hyun-Hwan Jeong, Ph.D.
Post-doctoral associate
Baylor College of Medicine
Texas Children’s Hospital
Kyung-Ah Sohn, Ph.D.
Associate Professor
Department of Software and Computer Engineering,
Ajou University
Jaesik Kim
Graduate student, Masters course
Department of Software and Computer Engineering,
Ajou University
Jeong-Hyeon Moon
Graduate student, Masters course
Department of Software and Computer Engineering,
Ajou University
Thank you !
Q & A
