17th Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA 2018)
Cancer Data Integration Challenge (http://camda.info/)
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk for Survival Prediction in Multiple Cancer Studies
1. Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk for Survival Prediction in Multiple Cancer Studies
So Yeon Kim, Hyun-Hwan Jeong, Jaesik Kim, Jeong-Hyeon Moon, Kyung-Ah Sohn
17th Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA 2018)
Cancer Data Integration Challenge
CAMDA 2018 1
4. Motivation (1/3)
• The rich information in multi-omics data provides opportunities for better biological understanding and improved clinical outcome prediction
• Integrative analysis is important for discovering interrelationships between multiple levels of data
Weinstein et al. Nature Genetics, 2013
5. Motivation (2/3)
• Graph-based integration methods are effective at combining multi-omics data while accounting for the interactions between different types of genomic data
Kim et al. Journal of the American Medical Informatics Association, 2014
6. Motivation (3/3)
• Incorporating genomic knowledge such as pathway information into the integrated graph can increase prediction power and help find important genes and pathways in cancers
Liu et al. Bioinformatics, 2013
7. Related works (1/7)
• Pathway-based integrative methods
− These methods simply transform a single genomic profile into a pathway profile using an activity scoring measure
− Pathway level analysis of gene expression (PLAGE) uses the first singular vector from the singular value decomposition of a given gene set
Tomfohr et al. Bioinformatics, 2005
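The PLAGE idea above can be sketched in a few lines. This is a minimal illustration rather than the published implementation: it assumes a genes × samples matrix for one gene set, standardizes each gene across samples, and uses power iteration (a stand-in for a full SVD routine) to approximate the first right singular vector. The function names and toy data are hypothetical.

```python
import math

def standardize_rows(matrix):
    """Scale each gene (row) to zero mean and unit variance across samples.
    Assumes at least two samples per gene."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (len(row) - 1))
        out.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return out

def plage_activity(gene_set_expr, n_iter=200):
    """Approximate the first right singular vector of the standardized
    genes x samples matrix X by power iteration on X^T X.
    Returns one activity value per sample (the sign is arbitrary)."""
    x = standardize_rows(gene_set_expr)
    n_samples = len(x[0])
    # Non-constant start vector: rows of X have zero mean, so a constant
    # vector would be mapped to zero.
    v = [1.0 + 0.1 * j for j in range(n_samples)]
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(n_samples)) for row in x]  # u = X v
        v = [sum(x[g][j] * u[g] for g in range(len(x))) for j in range(n_samples)]  # v = X^T u
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0:
            return [0.0] * n_samples
        v = [c / norm for c in v]
    return v

# Hypothetical gene set: three genes measured in four samples.
activity = plage_activity([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]])
```

Because a singular vector is only defined up to sign, downstream steps would need to treat `activity` and its negation as equivalent.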
8. Related works (2/7)
• Pathway-based integrative methods
− The Z-score method converts the gene expression profile into z-scores and combines the z-scores of the genes in each pathway per sample
− These methods treat a pathway as a plain set of genes
− It is better to also consider gene-gene interactions
Lee et al. PLoS Comput Biol, 2008
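A small sketch of the Z-score scoring described above. The slide does not say how the per-gene z-scores are combined; this example assumes they are summed and divided by the square root of the number of genes, which is one common choice. The gene names and values are made up for illustration.

```python
import math

def zscores(values):
    """Standardize one gene's expression values across samples."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

def pathway_zscore_activity(expr, pathway_genes):
    """Combine per-gene z-scores into one activity value per sample.

    Assumption: z-scores are summed and scaled by sqrt(number of genes).
    """
    members = [g for g in pathway_genes if g in expr]
    if not members:
        raise ValueError("no pathway gene found in the expression profile")
    z = [zscores(expr[g]) for g in members]
    n_samples = len(z[0])
    scale = math.sqrt(len(members))
    return [sum(row[i] for row in z) / scale for i in range(n_samples)]

# Hypothetical expression profile: gene -> values for three samples.
expr = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0], "g3": [5.0, 5.0, 5.0]}
activity = pathway_zscore_activity(expr, ["g1", "g2"])
```

Note that the pathway is used only as a list of member genes here, which is exactly the limitation the slide points out.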
9. Related works (3/7)
• Some methods utilized gene-gene interactions on a graph
− A denoising algorithm based on relevance network topology (DART) integrates pathways by deriving perturbation signatures which reflect gene contributions in each pathway
Jiao et al. Bioinformatics, 2011
10. Related works (4/7)
• Some methods utilized gene-gene interactions on a graph
− A directed random walk-based pathway activity inference method (DRW) identifies topologically important genes and pathways by weighting the genes in the gene-gene network
Liu et al. Bioinformatics, 2013
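To make the gene-weighting step concrete, here is a minimal sketch of a random walk with restart on a directed graph. The slide does not give the update rule, so the form W_{t+1} = (1 - r) M^T W_t + r W_0 (with M the row-normalized adjacency matrix and W_0 normalized to sum to 1) and the restart probability r = 0.7 are assumptions, as are the function name and the toy graph.

```python
def directed_random_walk(edges, w0, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate W_{t+1} = (1 - r) * M^T W_t + r * W_0 until convergence.

    edges: list of (source, target) pairs; every node must be a key of w0.
    w0:    dict node -> non-negative initial weight (normalized internally).
    """
    nodes = sorted(w0)
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    total = sum(w0.values())
    p0 = {n: w0[n] / total for n in nodes}
    w = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for src, dst in edges:
            # Each node spreads its weight evenly over its outgoing edges.
            nxt[dst] += (1 - restart) * w[src] / out_degree[src]
        diff = sum(abs(nxt[n] - w[n]) for n in nodes)
        w = nxt
        if diff < tol:
            break
    return w

# Hypothetical three-gene chain A -> B -> C with equal initial weights.
weights = directed_random_walk([("A", "B"), ("B", "C")],
                               {"A": 1.0, "B": 1.0, "C": 1.0})
```

In this toy chain the downstream gene C ends up with the largest weight, which illustrates how topology, and not only the initial weight, shapes the final gene scores.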
11. Related works (5/7)
• Some methods utilized gene-gene interactions on a graph
− DRW-GM: an integrated extension of DRW to multi-omics data
− Improved prediction performance
− Found many risk metabolite pathways and topologically important genes for cancer by a joint analysis of gene expression and metabolite data
Liu et al. Scientific Reports, 2015
12. Related works (6/7)
• Integrative DRW (iDRW) incorporates the interaction between gene expression and methylation features by building on DRW-based methods
Kim et al. BMC Medical Genomics, 2018 (to appear)
13. Related works (7/7)
• iDRW improved survival prediction power and jointly analyzed gene expression and methylation data on an integrated gene-gene graph
Kim et al. BMC Medical Genomics, 2018 (to appear)
14. Overview
• Investigate the effectiveness of the iDRW method on other types of genomic profiles for two different cancers
• Reflect the interactions between gene expression and copy number data on an integrated graph
• Construct the graph with the updated pathway database
• Classify breast cancer and neuroblastoma patient samples into survival groups
17. Integrated gene-gene graph construction (1/2)
• 327 human pathways and corresponding gene sets from the KEGG database
• Interactions between genes were defined using the R KEGGgraph package
• Integrated directed gene-gene graph
− 7,390 nodes and 58,426 edges
[Figure: KEGG PATHWAY Database; genes labelled A and B]
18. Integrated gene-gene graph construction (2/2)
• To reflect the impact of copy number variation on gene expression, we assign directed edges to all genes that overlap between the two profiles
[Figure: gene expression and copy number alteration graphs connected through overlapping genes]
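The two-layer construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each profile gets its own copy of the pathway-derived gene-gene edges, and that the cross-layer edges run from the copy number node of a gene to its expression node (the direction suggested by "impact of copy number variation on gene expression"). The layer tags `"EXP"` and `"CNA"` and all gene names are hypothetical.

```python
def build_integrated_graph(pathway_edges, exp_genes, cna_genes):
    """Return directed edges over (layer, gene) nodes.

    pathway_edges: list of (source_gene, target_gene) pairs from the pathway database.
    exp_genes, cna_genes: sets of genes measured in each profile.
    """
    edges = []
    # Within-layer edges: keep a pathway edge when both genes are measured.
    for layer, genes in (("EXP", exp_genes), ("CNA", cna_genes)):
        for src, dst in pathway_edges:
            if src in genes and dst in genes:
                edges.append(((layer, src), (layer, dst)))
    # Cross-layer edges for overlapping genes (assumed direction: CNA -> EXP).
    for gene in sorted(exp_genes & cna_genes):
        edges.append((("CNA", gene), ("EXP", gene)))
    return edges

# Hypothetical toy input: two pathway edges, gene B measured in both profiles.
graph = build_integrated_graph([("A", "B"), ("B", "C")],
                               exp_genes={"A", "B"},
                               cna_genes={"B", "C"})
```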
19. Pathway activity inference
• The weight of gene $g$, $w_g$, is the p-value from
- DESeq2 analysis (RNA-Seq)
- Two-tailed t-test (Microarray)
- $\chi^2$-test of independence (Copy number data)
• Weight initialization
$$W_0 = -\log(w_g + \epsilon)$$
[Figure: genes × samples matrices for gene expression and CNA, with entries $z_{gi}$]
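As a quick illustration of the weight initialization, the sketch below maps p-values to initial weights with $W_0 = -\log(w_g + \epsilon)$. The slide does not give the base of the logarithm or the value of $\epsilon$; the natural logarithm and `eps=1e-10` are assumptions, and the gene names are placeholders.

```python
import math

def initial_weights(p_values, eps=1e-10):
    """Map each gene's p-value w_g to an initial weight -log(w_g + eps).

    eps keeps the logarithm finite when a p-value is exactly 0.
    """
    return {gene: -math.log(p + eps) for gene, p in p_values.items()}

# Hypothetical p-values for three genes.
w0 = initial_weights({"g1": 0.05, "g2": 1.0, "g3": 0.0})
```

Smaller p-values give larger initial weights, so genes that differ most between the survival groups start the random walk with more mass.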
23. Pathway feature selection and survival prediction
• Feature ranking strategy
• Pathways are ranked by the p-values from the t-test of pathway activities
• The top-k pathways across samples are the input to the classification model
[Figure: pathway profile (pathways × samples, entries $a(P_j)$) ranked by t-test p-value; top-k pathway feature selection]
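The ranking step can be sketched without a statistics library. This example assumes a pooled-variance two-sample t-test between the good and poor groups; because every pathway then shares the same degrees of freedom, sorting by |t| in descending order gives the same order as sorting by two-tailed p-value in ascending order. The pathway names, labels, and values are hypothetical.

```python
import math

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se if se > 0 else 0.0

def top_k_pathways(activities, labels, k):
    """Rank pathways by |t| between the two survival groups and keep the top k."""
    scores = {}
    for pathway, values in activities.items():
        good = [v for v, y in zip(values, labels) if y == "good"]
        poor = [v for v, y in zip(values, labels) if y == "poor"]
        scores[pathway] = abs(pooled_t_statistic(good, poor))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical pathway activities for six samples.
labels = ["good", "good", "good", "poor", "poor", "poor"]
activities = {
    "p1": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "p2": [1.0, 2.0, 3.0, 1.1, 2.1, 2.9],
    "p3": [0.0, 0.5, 1.0, 1.0, 1.5, 2.0],
}
selected = top_k_pathways(activities, labels, k=2)
```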
24. Pathway feature selection and survival prediction
• Survival prediction
• A logistic regression model classifies the samples into good and poor groups
• The top-k pathway features that showed the best classification performance are selected empirically
[Figure: pathway profile (pathways × samples, entries $a(P_j)$) ranked to give risk-active pathway identification (e.g. pathway 00410, pathway 00060) and survival prediction]
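For readers who want to see the classification step end to end, here is a minimal logistic regression fitted by batch gradient descent. The slides only state that logistic regression is used; the optimizer, learning rate, number of epochs, the 0/1 label encoding, and the toy feature values are all assumptions made for this sketch.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]

# Hypothetical top-2 pathway activities; label 1 = good, 0 = poor (assumed encoding).
X = [[-2.0, 0.5], [-1.5, -0.3], [-1.0, 0.2], [1.0, -0.1], [1.5, 0.4], [2.0, 0.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
predictions = predict(X, w, b)
```

Repeating this fit for several values of k and keeping the best-scoring one would correspond to the empirical top-k selection mentioned on the slide.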
26. Challenge Dataset (1/2)
• Breast cancer patient data from the METABRIC dataset
• mRNA expression profile of 24,368 genes from the Illumina Human v3 microarray, with log intensity levels
• Putative copy-number alteration data for 22,544 genes
• 1,648 patient samples are divided into 908 good (> 10 years) and 740 poor (≤ 10 years) samples
Average survival years: 10
Average age at diagnosis: 62
27. Performance evaluation (1/2)
Survival prediction: 1,648 patients, 908 in the good group (long-term survival) and 740 in the poor group (short-term survival)

| | Predicted good | Predicted poor |
|---|---|---|
| Actual good | TP | FN |
| Actual poor | FP | TN |

Classification accuracy
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
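The accuracy definition translates directly into code. In this small sketch the "good" group is treated as the positive class, matching the layout of the confusion matrix; the label lists are invented for illustration.

```python
def confusion_counts(actual, predicted, positive="good"):
    """Count TP, FN, FP, TN with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical labels for five patients.
actual = ["good", "good", "poor", "poor", "poor"]
predicted = ["good", "poor", "poor", "good", "poor"]
counts = confusion_counts(actual, predicted)
score = accuracy(*counts)
```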
28. Performance evaluation (1/2)
5-fold cross-validation: the samples are split into folds 1 to 5, and each fold is held out in turn as the validation set while the remaining folds form the training set
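A minimal sketch of how the 5-fold split could be generated. It assigns consecutive index blocks to folds; shuffling and stratification by survival group are left out for brevity, and whether the original analysis used them is not stated in the slides.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds.

    Fold sizes differ by at most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Split the 1,648 breast cancer samples into five folds.
splits = list(k_fold_splits(1648, k=5))
fold_sizes = [len(validation) for _, validation in splits]
```

The reported accuracy would then typically be the mean of the per-fold validation accuracies.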
29. Challenge Dataset (2/2)
• Neuroblastoma dataset from NCBI GSE49711
• RNA sequencing gene expression profile of 60,586 genes
• DNA copy number data for 22,692 genes
• 144 patient samples are divided into 38 good and 105 poor samples (binary class label for overall survival days provided by the NCBI dataset)
[Figure: chart with values 88 and 56]
Average survival years: < 1 year
Average age at diagnosis: 16 months
30. Performance evaluation (1/2)
Survival prediction: 144 patients, 38 in the good group (long-term survival) and 105 in the poor group (short-term survival)

| | Predicted good | Predicted poor |
|---|---|---|
| Actual good | TP | FN |
| Actual poor | FP | TN |

Classification accuracy
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
31. Performance evaluation (1/2)
Leave-one-out cross-validation: with N samples there are N folds (fold 1, fold 2, …, fold N), each holding out a single sample as the validation set while the remaining samples form the training set
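Leave-one-out cross-validation can be written as a short loop. In the sketch below the classifier is a deliberately trivial majority-class model that ignores the features; it is only a stand-in so the loop can run, not the model used in the study. The 38 good and 105 poor labels follow the group sizes quoted for the neuroblastoma data.

```python
from collections import Counter

def majority_fit(train_X, train_y):
    """Stand-in model: remember the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]

def majority_predict(model, x):
    """Predict the remembered label regardless of the features."""
    return model

def leave_one_out_accuracy(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(y)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(y)

# Labels with the quoted group sizes; features are placeholders.
y = ["good"] * 38 + ["poor"] * 105
X = [[0.0] for _ in y]
baseline = leave_one_out_accuracy(X, y, majority_fit, majority_predict)
```

With groups this imbalanced, always predicting the larger group already scores 105/143, so accuracy figures on this dataset are best read against that baseline.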
32. Pathway-based methods
• For gene expression data in each dataset, four pathway-based methods were compared
− PLAGE [Tomfohr et al. Bioinformatics, 2005]
− Z-score [Lee et al. PLoS Comput Biol, 2008]
− DART [Jiao et al. Bioinformatics, 2011]
− DRW [Liu et al. Bioinformatics, 2013]
• Classification performance was evaluated in the same way as for the proposed method
33. Integrative analysis on multi-omics data improves survival prediction performance (1/2)
• Four pathway-based methods on a single gene expression profile
• iDRW method on the gene expression profile and copy number data in breast cancer (A) or in neuroblastoma patients (B)
[Figure: panels (A) Breast cancer and (B) Neuroblastoma]
34. Integrative analysis on multi-omics data improves survival prediction performance (2/2)
• Performance improved when utilizing interactions between genes on a graph
• In particular, DRW-based methods contributed more to the performance improvement
• iDRW performed the best in both cancer datasets
[Figure: panels for Breast cancer and Neuroblastoma]
35. iDRW identifies cancer-associated pathways and genes (1/5)
Dataset: Breast cancer (k = 25)

| Pathway ID | Pathway name | Total genes | EXP | CNA |
|---|---|---|---|---|
| hsa04740 | Olfactory transduction | 419 | 54 | 268 |
| hsa04014 | Ras signaling pathway | 232 | 68 | 164 |
| hsa04015 | Rap1 signaling pathway | 206 | 64 | 142 |
| hsa04916 | Melanogenesis | 101 | 37 | 73 |
| hsa04722 | Neurotrophin signaling pathway | 119 | 38 | 84 |
| hsa05200 | Pathways in cancer | 526 | 166 | 359 |
| hsa04933 | AGE-RAGE signaling pathway in diabetic complications | 99 | 37 | 67 |
| hsa04530 | Tight junction | 170 | 53 | 107 |
| hsa04510 | Focal adhesion | 199 | 76 | 125 |
| hsa04080 | Neuroactive ligand-receptor interaction | 278 | 64 | 193 |
| hsa05225 | Hepatocellular carcinoma | 168 | 56 | 112 |
| hsa04020 | Calcium signaling pathway | 182 | 59 | 136 |
| hsa04024 | cAMP signaling pathway | 198 | 58 | 139 |

Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
36. iDRW identifies cancer-associated pathways and genes (2/5)
Dataset: Breast cancer (k = 25)

| Pathway ID | Pathway name | Total genes | EXP | CNA |
|---|---|---|---|---|
| hsa04217 | Necroptosis | 164 | 49 | 97 |
| hsa04060 | Cytokine-cytokine receptor interaction | 270 | 70 | 192 |
| hsa05152 | Tuberculosis | 179 | 58 | 112 |
| hsa05165 | Human papillomavirus infection | 319 | 103 | 210 |
| hsa04810 | Regulation of actin cytoskeleton | 208 | 64 | 132 |
| hsa04151 | PI3K-Akt signaling pathway | 352 | 119 | 241 |
| hsa04022 | cGMP-PKG signaling pathway | 163 | 58 | 109 |
| hsa04630 | Jak-STAT signaling pathway | 162 | 43 | 112 |
| hsa05167 | Kaposi's sarcoma-associated herpesvirus infection | 186 | 61 | 114 |
| hsa04010 | MAPK signaling pathway | 295 | 87 | 209 |
| hsa04371 | Apelin signaling pathway | 137 | 46 | 99 |
| hsa04390 | Hippo signaling pathway | 154 | 58 | 100 |

Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
37. iDRW identifies cancer-associated pathways and genes (3/5)
Hanahan et al. Cell, 2011
Six biological capabilities are acquired during tumor development
Some of the top-ranked pathways (Ras signaling, Necroptosis, Regulation of actin cytoskeleton, and PI3K-Akt signaling pathway) are related to at least one of these six capabilities
“…overexpression of 34 Olfactory Receptors genes has been reported in patients bearing breast tumors caused by CHEK2 1100delC mutation…”
38. iDRW identifies cancer-associated pathways and genes (4/5)
Dataset: Neuroblastoma (k = 5)

| Pathway ID | Pathway name | Total genes | EXP | CNA |
|---|---|---|---|---|
| hsa04976 | Bile secretion | 71 | 13 | 5 |
| hsa05034 | Alcoholism | 180 | 22 | 7 |
| hsa01100 | Metabolic pathways | 1273 | 43 | 93 |
| hsa04080 | Neuroactive ligand-receptor interaction | 278 | 21 | 24 |
| hsa04151 | PI3K-Akt signaling pathway | 352 | 19 | 31 |

Top-k pathways ranked by the iDRW method in neuroblastoma data. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
39. iDRW identifies cancer-associated pathways and genes (5/5)
“… we propose a mechanism underlying a potent and selective anti-tumor effect of LCA in cultured human neuroblastoma cells …”
“…the level of Urinary catecholamine metabolites which consist of vanillylmandelic acid (VMA), homovanillic acid (HVA) and dopamine elevated in neuroblastoma patients…”
40. Conclusions
• We showed the effectiveness of an integrative directed random walk-based method utilizing pathway information (iDRW) on different cancer datasets
• We benchmarked iDRW against several state-of-the-art pathway-based methods for the survival prediction model
41. Conclusions
• Contributions
− Revamped a directed gene-gene graph to consider the interactions between gene expression and copy number data
− Jointly identified cancer-related pathways and genes from gene expression and copy number data for breast cancer and neuroblastoma datasets
42. Acknowledgements
All lab members of LAMDA lab
Kyung-Ah Sohn
Byungkon Kang
Yenewondim Biadgie
Garam Lee
Habtamu Minassie Aycheh
Sehee Wang
Jungryul Seo
Nam-Hyuk Ahn
Min-Soo Kim
Tae-rim Kim
Young-Bum Choi
Jun-hyung Yu
Jeong-hyun Moon
Jaesik Kim
Sijin Kim
Heejin Kim
Joon-seon Hwang
Hyun-Hwan Jeong, Ph.D., Post-doctoral associate, Baylor College of Medicine, Texas Children’s Hospital
Kyung-Ah Sohn, Ph.D., Associate Professor, Department of Software and Computer Engineering, Ajou University
Jaesik Kim, Graduate student, Masters course, Department of Software and Computer Engineering, Ajou University
Jeong-Hyeon Moon, Graduate student, Masters course, Department of Software and Computer Engineering, Ajou University