This document discusses machine learning and network analysis approaches for predicting clinically relevant outcomes from biomedical data. It describes the research group led by Francisco Azuaje at the Luxembourg Institute of Health that uses these techniques. Their approaches involve representing biological data as networks, extracting topological features, and using models like deep learning on these features to integrate multi-omics data and predict outcomes like disease progression or drug response. They have applied these methods successfully to neuroblastoma and leukemia patient datasets. The research aims to advance biological understanding and identify biomarkers and drug targets through interpretable network-based machine learning models.
1. Machine Learning and Network Analysis Approaches for
Predicting Clinically Relevant Outcomes
Francisco Azuaje, PhD
Head of Bioinformatics
Luxembourg Institute of Health (LIH)
4. Mission
To enable patient-oriented research and biological
understanding through advanced computational approaches
Bioinformatics research and support @ LIH
Members (April 2019):
F. Azuaje (PI)
P. Nazarov (Scientist)
T. Kaoma (Bioinformatician)
S.Y. Kim (Bioinformatician)
A. Muller (Bioinformatician)
K. Baum (Postdoctoral Fellow, part-time)
Y. Zhang (PhD Candidate)
+ MSc research students
5. Our research activities
Questions: diagnostic; prognostic; predictive (drug response); other descriptive/modeling
Data: multiple sources/technologies; multi-omics; clinically relevant; cells, animals, patients
Approaches: statistical models; machine learning; network-based models; their combinations
Outcomes: biological understanding; candidate biomarkers, drugs and targets; software, workflows
Collaborations
National and international
Leading and non-leading partner
Funding targets
FNR
EU
7. Biomedical research: larger and diverse datasets
(Topol, 2014, Cell)
(Eisenstein, 2015, Nature)
High inter-individual variability
Datasets change in time and space
High intra-individual variability
8. Key challenges in the field
Heterogeneity: Data, events, states,
within and between individuals…
Data not always “big”: relative lack of
labelled data, curse of dimensionality
Data: multi-layered, hierarchical
For same data type/layer: multiple
measurement platforms
9. Shared, key challenges in the field (2)
Interpretability, understandability:
Global and local, novelty and consistency
with prior knowledge
Reproducibility:
Crucial requirement
“Gold standards”/“ground truth”:
Lack, limitations
Complexity of pattern recurrence,
regularities
10. Addressing key challenges through combination of ML and
biological network models
Why networks?
• Networks are intuitive and biologically-meaningful representations of
biological data
• Networks can be used to encode and visualize data, and more
importantly: to extract features and make predictions about the data
• Network-based models can address different predictive modelling
challenges, including: multi-modal/-layered data analysis applications
and interpretable models
11. A biological network can be represented as a graph that is
biologically meaningful
From: McGillivray et al., 2018, Annu. Rev.
Biomed. Data Sci.
13. Using biological networks and machine learning for multi-omics
patient stratification
Hypothesis: information encoded in graphs is biologically relevant.
Protein-protein network
Jeong et al., Nature (2001)
Patient similarity network
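A patient similarity network can be illustrated with a minimal sketch. It assumes, purely for illustration, that patients are nodes and that two patients are linked when their expression profiles have a Pearson correlation at or above a cut-off. The function names, the 0.8 threshold and the toy profiles are hypothetical and are not taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def patient_similarity_network(profiles, threshold=0.8):
    """Build an undirected graph as an adjacency dict.

    Nodes are patients; an edge links two patients whose profiles
    correlate at or above `threshold` (a hypothetical cut-off).
    """
    ids = list(profiles)
    graph = {p: set() for p in ids}
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if pearson(profiles[p], profiles[q]) >= threshold:
                graph[p].add(q)
                graph[q].add(p)
    return graph


# Toy expression profiles (made-up values, 4 genes per patient).
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [1.1, 2.1, 2.9, 4.2],
    "P3": [4.0, 3.0, 2.0, 1.0],
    "P4": [3.9, 3.2, 1.8, 1.1],
}
network = patient_similarity_network(profiles)
```

In this toy case P1 and P2 end up connected, as do P3 and P4, while the two pairs stay apart because their profiles are anti-correlated.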
14. Using biological networks and machine learning for multi-omics
patient stratification (cont.)
Figure panels: global strategy; examples of centrality features
• 4 categories of topological features: centrality (12 measures), modularity features (from 7 to 153 features), diffusion features (1000), Node2Vec-derived features (256).
• Each category generates a model
• Integrated models (weighted voting) also investigated
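The two ideas in the bullets above, per-node topological features and weighted voting across per-category models, can be sketched as follows. This is an illustration under stated assumptions: only two of the centrality measures are shown (degree and closeness), the toy graph is made up, and the model weights and predictions passed to `weighted_vote` are hypothetical placeholders rather than values from the study.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph):
    """Closeness from breadth-first shortest paths.

    Uses the Wasserman-Faust scaling: (reached / (n - 1)) * (reached / total
    distance), which reduces to (n - 1) / total distance on a connected graph.
    """
    n = len(graph)
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        reached = len(dist) - 1
        total = sum(dist.values())
        scores[source] = (reached / (n - 1)) * (reached / total) if total else 0.0
    return scores


def weighted_vote(predictions, weights):
    """Return the label with the largest summed weight across models."""
    score = {}
    for model, label in predictions.items():
        score[label] = score.get(label, 0.0) + weights[model]
    return max(score, key=score.get)


# Toy undirected graph (path B - A - C - D); nodes stand in for patients.
graph = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)

# One prediction per feature category; weights are hypothetical.
predictions = {"centrality": 1, "modularity": 0, "diffusion": 1, "node2vec": 0}
weights = {"centrality": 0.4, "modularity": 0.2, "diffusion": 0.3, "node2vec": 0.1}
vote = weighted_vote(predictions, weights)
```

Each node's centrality values form one row of a feature matrix, so a classifier can be trained per feature category before the votes are combined.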
15. Application example (1): neuroblastoma multi-omic datasets
from the CAMDA challenge
Dataset 1 (498 patients,
2 omic datasets)
Dataset 2 (142 patients,
3 omic datasets)
Focus on Dataset 1
6,300 classification models
• Models based on graph topology features outperform models based on “classical” approach
• Among topological features, centrality metrics are most predictive (followed by diffusion-based features)
16. Application example (2): Neuroblastoma multi-omics datasets
from the CAMDA challenge, a deep learning approach*
Global strategy
• Network features from each dataset: centrality (12) and modularity (30 to 47) features.
• Models based on each feature category, and their combination
• Data: 498 patients (2 omic datasets, gene expression data)
• Training (50% of total data), validation and test datasets
• DNNs: multiple architectures, Rectified Linear Units (ReLU), Softmax function (2 outputs)
Prediction performance on test dataset (top models)
Algorithm   Parameters                               Balanced accuracy
Death from disease, Fischer-M
DNN         h=[8,8,8,2], o=Adam, lr=1e-3, d=0.3      87.3% *
SVM         t=RBF, c=64, g=0.25                      75.4%
RF          n=100                                    75.1% *
Disease progression, Fischer
DNN         h=[4,2,2,2], o=Adam, lr=1e-3, d=0.3      84.7% *
SVM         t=RBF, c=16, g=0.0625                    81.8%
RF          n=100                                    78.1% *
Top DNN: Input features are graph centrality measures
Fischer-M: 1 dataset only (microarrays)
Fischer: Combination of 2 datasets (microarrays and RNA-Seq)
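To make the DNN parameter strings more concrete, here is a minimal inference-only sketch of a feed-forward network with ReLU hidden units and a 2-output softmax. It assumes that h=[8,8,8,2] denotes three hidden layers of 8 units followed by the 2-unit output layer, which is one reading of the notation and is not confirmed by the slides. The weights are random and untrained; training with Adam (lr=1e-3) is omitted, and all function names are hypothetical.

```python
import math
import random


def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax; outputs sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_network(n_inputs, layer_sizes, seed=0):
    """Random, untrained weights; a stand-in for a trained model."""
    rng = random.Random(seed)
    layers, fan_in = [], n_inputs
    for size in layer_sizes:
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(size)]
        layers.append((weights, [0.0] * size))
        fan_in = size
    return layers


def forward(layers, features):
    """ReLU on hidden layers, softmax on the final 2-unit layer.

    Dropout (d=0.3) is a training-time operation, so it does not appear
    in this inference-only pass.
    """
    activation = features
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))


# 12 inputs, matching the 12 centrality measures; values are made up.
layers = init_network(n_inputs=12, layer_sizes=[8, 8, 8, 2])
features = [0.1 * i for i in range(12)]
probabilities = forward(layers, features)
```

The two softmax outputs can be read as class probabilities for a binary outcome such as death from disease.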
* Article submitted in
cooperation with:
17. Further evaluation using independent datasets
Global strategy
• Additional independent dataset (Versteeg, 88 patients, microarray dataset)
• Network centrality features
• 3000 DNNs / classification task
• DNNs: Rectified Linear Units (ReLU), Softmax function (2 outputs)
Death from disease, centralities
Train       Test        DNN     SVM     RF
Fischer-M   Fischer-M   87.3%   75.4%   75.1%
Fischer-M   Fischer-R   82.1%   53.5%   66.8%
Fischer-M   Versteeg    75.0%   53.3%   67.5%
Fischer-R   Fischer-R   85.8%   66.0%   62.4%
Fischer-R   Fischer-M   81.5%   75.4%   61.2%
Fischer-R   Versteeg    70.8%   68.3%   67.5%
Deep neural nets using graph centrality-based input features offer best prediction performance
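The earlier results slide names balanced accuracy as its metric. Assuming the cross-dataset figures here use the same metric, the sketch below shows how it is computed: the mean of the per-class recalls, which for two classes equals (sensitivity + specificity) / 2. The labels in the example are made up to show why the metric matters when one outcome is rare.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up imbalanced labels: 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A trivial classifier that always predicts "no event".
y_pred = [0] * 10

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the trivial classifier reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, the same as random guessing on two classes.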
* Article submitted in
cooperation with:
18. Example 2: Linking gene network centrality to anti-cancer drug response
• The biological relevance of central genes/proteins has previously been established in several model organisms and phenotypes.
• Their predictive capability in gene co-expression networks, in the specific context of cancer-related drug response, remains to be investigated in depth.
Hubs in a pan-cancer cell line co-expression
network are biologically meaningful and
predictive of drug responses
19. Linking gene centrality to anti-cancer drug response (cont.)
• A (linear) model based on the expression of 47 hubs shows accurate drug sensitivity prediction capability (CCLE and GDSC datasets)
• Independent of expression platform technology (microarrays, RNA-Seq, qPCR)
• Comparable performance to published models
• Relatively accurate predictions in other independent cell lines and drugs
Figure: predicted vs. actual drug sensitivity in the CCLE dataset
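Hub selection in a co-expression network can be sketched as below. The sketch assumes that two genes are linked when the absolute Pearson correlation of their expression across cell lines reaches a cut-off, and that hubs are the genes with the highest degree. The 0.7 threshold, the function names and the toy expression values are hypothetical. Fitting the linear drug-sensitivity model on the selected hubs is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_hubs(expression, threshold=0.7, top_k=1):
    """Rank genes by degree in a thresholded co-expression network.

    An edge links two genes whose |correlation| >= threshold
    (a hypothetical cut-off). Returns the top_k genes and all degrees.
    """
    genes = list(expression)
    degree = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expression[g], expression[h])) >= threshold:
                degree[g] += 1
                degree[h] += 1
    # Sort by descending degree, breaking ties by gene name.
    ranked = sorted(genes, key=lambda g: (-degree[g], g))
    return ranked[:top_k], degree


# Toy expression values across 4 cell lines (made up for illustration).
expression = {
    "G1": [6, 4, 6, 4],
    "G2": [6, 4, 5, 5],
    "G3": [5, 5, 6, 4],
    "G4": [6, 6, 4, 4],
}
hubs, degree = coexpression_hubs(expression)
```

In this toy network G1 is co-expressed with both G2 and G3, which are not co-expressed with each other, so G1 is the single hub.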
20. Example 3: Biological pathway-focused prediction of drug sensitivity
Expression of autophagy-related genes accurately predicts anti-cancer drug response.
Prediction accuracy in GDSC dataset
Tests on a leukemia patient dataset
Patients treated with Cytarabine (Data from Farge et al., Cancer Disc 2017)
Article in
preparation in
cooperation with:
22. Takeaways:
• Many ML challenges in biomedical research are shared with other application domains, but this field also poses its own unique challenges.
• Supervised learning, including deep learning, will meet many of these needs. However, unbiased exploration, hypothesis generation and interpretation (incl. “mechanistic”) remain crucial.
• The use of graphs/networks to represent data, extract predictive features and integrate datasets, together with ML, will continue to enable new discoveries and applications closer to the patient.