Challenges and opportunities for machine
learning in biomedical research
Francisco Azuaje, PhD.
Luxembourg Institute of Health (LIH)
Presentation for the Data Science Luxembourg Meetup
12 September 2018
Graphics by M. Fraiture
The Bioinformatics and Modelling Research Group (BIOMOD) @ LIH
F. Azuaje (PI)
P. Nazarov (Scientist)
L.C. Tranchevent (Postdoc. Fellow)
T. Kaoma (Bioinformatician)
S.Y. Kim (Bioinformatician)
A. Muller (Bioinformatician)
K. Baum (Postdoc. Fellow)
Y. Zhang (PhD. Candidate)
Our key funders:
Mission
To enable patient-oriented research and biological understanding
through advanced computational research
Today’s presentation:
• Biomedical Data Science and ML: data landscape, trends in
approaches, crucial challenges.
• A selection of recent, attractive examples.
• The synergy between ML and network analysis, examples.
• Takeaways.
(Topol, 2014, Cell)
(Eisenstein, 2015, Nature)
Biomedical research: larger and diverse datasets
High inter-individual variability
Datasets change in time and space
High intra-individual variability
(Hutter and Zenklusen, Cell, 2018)
Key example of “big data” in cancer research
Typical questions answered by such datasets
“Fundamental” research
• What is the “behavior”, “mechanism”…?
• Within a data layer, how are samples or features related?
• How are different data layers interrelated…?
• What if …?
• Why…?
“Applied” research
• Risk assessment
• Diagnosis
• Prognosis
• Other clinical outcome prediction
• Prevention
• Drug targets
• Therapeutic strategies
Statistics and machine learning
Koohy, 2018, F1000 Research
ML in biomedical research
Global usage of ML techniques
Trends of ML techniques (excluding PCA and LRM)
Trends of ML techniques
SVM, RF, DNN
ML in biomedical research: Examples of model diversity and applications (1)
Large-scale phenotypic image analysis
Novelty/anomaly detection
Prediction of hard-to-discretize states
Image classification according to phenotypes
(e.g., here with Cell Profiler Analyst)
Smith et al., 2018, Cell Systems
ML in biomedical research: Examples of model diversity and applications (2)
Kermany et al., 2018, Cell
Medical Diagnoses with Transfer Image-Based Deep Learning
Retina images Retinal diseases
• DL system: Low classification errors, comparable to humans (on 1000
images)
• Strategy also successfully applied to analysis of chest X-ray images
• Potential “generalized” platform for image-based diagnoses (?)
ML in biomedical research: Examples of model diversity and applications (3)
Ambale-Venkatesh et al., Circ. Res., 2017
Random survival forests for predicting cardiovascular (CV) events
Variable importance for each of the 735 variables used in analysis
Variable importance is measured using the minimum
depth of the maximal subtree
• Accurate prediction of 6 CV outcomes (in
asymptomatic population).
• Subsets of predictive features for each
event.
• 12-year follow-up, multi-center, multi-ethnic,
wide age range.
Imaging, noninvasive
tests, questionnaires,
biomarker panels
Top-20 features
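The minimal-depth measure of variable importance mentioned above can be illustrated with a small sketch. This is not the analysis from Ambale-Venkatesh et al.: the tree encoding (nested dicts with the hypothetical keys `var`, `left`, `right`) and the variable names are assumptions made for illustration. Within one tree, the minimal depth of a variable is the depth of its split closest to the root, and smaller values suggest a more important variable. Trees that never split on a variable are simply skipped here, which is a simplification.

```python
def minimal_depths(tree, depth=0, found=None):
    """Return {variable: depth of its shallowest split} for one tree.

    A tree is a nested dict {"var": name, "left": subtree, "right": subtree};
    a leaf is None. This encoding is hypothetical.
    """
    if found is None:
        found = {}
    if tree is None:
        return found
    var = tree["var"]
    if var not in found or depth < found[var]:
        found[var] = depth
    minimal_depths(tree.get("left"), depth + 1, found)
    minimal_depths(tree.get("right"), depth + 1, found)
    return found


def forest_minimal_depth(trees):
    """Average minimal depth per variable over the trees that split on it."""
    totals, counts = {}, {}
    for tree in trees:
        for var, d in minimal_depths(tree).items():
            totals[var] = totals.get(var, 0) + d
            counts[var] = counts.get(var, 0) + 1
    return {var: totals[var] / counts[var] for var in totals}


# Two toy trees with made-up variable names.
tree_a = {"var": "age",
          "left": {"var": "calcium_score", "left": None, "right": None},
          "right": None}
tree_b = {"var": "age",
          "left": {"var": "ldl", "left": None, "right": None},
          "right": {"var": "calcium_score", "left": None, "right": None}}

# Variables sorted from most to least important (smallest depth first).
ranking = sorted(forest_minimal_depth([tree_a, tree_b]).items(),
                 key=lambda kv: kv[1])
```

In this toy forest `age` splits at the root of both trees, so it ranks first.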
Shared, key challenges in the field
Heterogeneity: Data, events, states,
within and between individuals…
Data not always “big”: relative lack of
labelled data, curse of dimensionality
Data: multi-layered, hierarchical
For same data type/layer: multiple
measurement platforms
Shared, key challenges in the field (2)
Interpretability, understandability:
Global and local, novelty and consistency
with prior knowledge
Reproducibility:
Crucial requirement
“Gold standards”/“ground truth”:
Lack, limitations
Complexity of pattern recurrence,
regularities
Addressing key challenges through combination of ML and
biological network models
Why networks?
• Networks are intuitive and biologically-meaningful representations of
biological data
• Networks can be used to encode and visualize data, and more
importantly: to extract features and make predictions about the data
• Network-based models can address different predictive modelling
challenges, including: multi-modal/-layered data analysis applications
and interpretable models
A biological network can be represented as a graph that is
biologically meaningful
From: McGillivray et al., 2018, Annu. Rev.
Biomed. Data Sci.
Addressing key challenges through combination of ML and
biological network models (cont.)
Our strategy:
Data → Processed data → Graphs → Features → Prediction models
Combining ML and biological networks:
Two application examples in cancer research
Application examples from SINGALUN project:
New tools for the prediction of influential nodes and links in multi-level cancer-related networks
L.C. Tranchevent
(Postdoc. Fellow)
PI: J.C. Rajapakse
PI: F. Azuaje
Using biological networks and machine learning for multi-omics
patient stratification
Hypothesis: information encoded in graphs is biologically relevant.
Protein-protein network
Jeong et al., Nature (2001)
Patient similarity network
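One way a patient similarity network could be derived from omics profiles is sketched below. The talk does not state how its networks were built, so the similarity measure (Pearson correlation), the cut-off of 0.8, and the toy profiles are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def patient_similarity_network(profiles, threshold=0.8):
    """Build an undirected graph as {patient: set(neighbours)}.

    Two patients are linked when their profiles correlate at or above
    `threshold` (an illustrative cut-off, not a value from the talk).
    """
    graph = {p: set() for p in profiles}
    for p, q in combinations(profiles, 2):
        if pearson(profiles[p], profiles[q]) >= threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


# Toy expression profiles (made-up values for four patients, four genes).
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [1.1, 2.1, 2.9, 4.2],
    "P3": [4.0, 3.0, 2.0, 1.0],
    "P4": [3.9, 3.2, 1.8, 1.1],
}
network = patient_similarity_network(profiles)
```

Here P1 and P2 end up linked, as do P3 and P4, while the two pairs stay disconnected because their profiles are anti-correlated.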
Using biological networks and machine learning for multi-omics
patient stratification (cont.)
Global strategy Examples of centrality features
• 4 categories of topological features: Centrality (12 measures), modularity
features (from 7 to 153 features), diffusion features (1000),
Node2Vec-derived features (256).
• Each category generates a model
• Integrated models (weighted voting) also investigated
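To make the idea of centrality features concrete, here is a minimal sketch that turns a small graph into one feature vector per patient. It covers only two measures (degree and closeness) out of the 12 mentioned above; the choice of these two, the toy graph, and the dict-of-sets representation are assumptions for illustration.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path lengths to them."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Toy patient similarity network (hypothetical): P2 is linked to P1, P3 and P4.
graph = {"P1": {"P2"}, "P2": {"P1", "P3", "P4"}, "P3": {"P2"}, "P4": {"P2"}}

deg = degree_centrality(graph)
clo = closeness_centrality(graph)
# One feature vector per patient, usable as classifier input.
features = {p: [deg[p], clo[p]] for p in graph}
```

The hub patient P2 gets the highest value on both measures; vectors like these would replace or complement the raw omics values as model inputs.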
Application example (1): neuroblastoma multi-omic datasets
from the CAMDA challenge
Dataset 1 (498 patients,
2 omic datasets)
Dataset 2 (142 patients,
3 omic datasets)
Focus on Dataset 1
6,300 classification models
• Models based on graph topology features outperform models based on “classical” approach
• Among topological features, centrality metrics are most predictive (followed by diffusion-based features)
Application example (2): Neuroblastoma multi-omics datasets
from the CAMDA challenge, a deep learning approach*
Global strategy
• Network features from each dataset: Centrality (12), modularity
(30 to 47) features.
• Models based on each feature category, and their combination
• Data: 498 patients (2 omic datasets, gene expression data)
• Training (50% of total data), validation and test datasets
• DNNs: multiple architectures, Rectified Linear Units (ReLU),
Softmax function (2 outputs)
Prediction performance on test dataset (top models)
Algorithm   Parameters                             Balanced accuracy
Death from disease, Fischer-M
DNN         h=[8,8,8,2], o=Adam, lr=1e-3, d=0.3    87.3% *
SVM         t=RBF, c=64, g=0.25                    75.4%
RF          n=100                                  75.1% *
Disease progression, Fischer
DNN         h=[4,2,2,2], o=Adam, lr=1e-3, d=0.3    84.7% *
SVM         t=RBF, c=16, g=0.0625                  81.8%
RF          n=100                                  78.1% *
Top DNN: Input features are graph centrality measures
Fischer-M: 1 dataset only (microarrays)
Fischer: 2 datasets (microarrays + RNA-Seq)
*Article in preparation
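The performance figures above are balanced accuracies. The slides do not define the metric, so the sketch below assumes the usual definition: the mean of the per-class recalls, which for two classes is (sensitivity + specificity) / 2. The toy labels are made up.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes, (sensitivity + specificity) / 2."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels: 8 patients without the event (0), 2 with it (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a model that always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

With imbalanced outcomes such as these, plain accuracy rewards always predicting the majority class, whereas balanced accuracy does not, which is a common reason for preferring it in clinical outcome prediction.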
Global strategy
• Additional Independent dataset (Versteeg, 88 patients)
(microarray dataset)
• Network features: Centrality and modularity features
concatenated
• 3000 DNNs / classification task
• DNNs: Rectified Linear Units (ReLU), Softmax function (2
outputs)
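The network shape described above (ReLU hidden units, a 2-output softmax) can be sketched as a plain forward pass. This is an untrained illustration, not the authors' implementation: the layer sizes (12 centrality inputs, hidden layers of 8 units each) are assumptions, the weights are random, and training, dropout and the optimizer are left out.

```python
import math
import random


def relu(v):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]


def dense(v, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]


def init_layers(sizes, seed=0):
    """Random, untrained parameters for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers):
    """ReLU on every hidden layer, softmax on the final 2-unit layer."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


# Assumed shape: 12 centrality features in, three hidden layers of 8 units, 2 outputs.
layers = init_layers([12, 8, 8, 8, 2])
probs = forward([0.1] * 12, layers)  # two class probabilities summing to 1
```

The softmax output can be read as the predicted probabilities of the two classes (for example, event versus no event) for one patient.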
Death from disease, centralities
Train       Test        DNN     SVM     RF
Fischer-M   Fischer-M   87.3%   75.4%   75.1%
Fischer-M   Fischer-R   82.1%   53.5%   66.8%
Fischer-M   Versteeg    75.0%   53.3%   67.5%
Fischer-R   Fischer-R   85.8%   66.0%   62.4%
Fischer-R   Fischer-M   81.5%   75.4%   61.2%
Fischer-R   Versteeg    70.8%   68.3%   67.5%
Further evaluation using independent datasets
Deep neural nets using graph centrality-based
input features offer best prediction performance
Takeaways:
• Many ML challenges in BM research are shared with other application domains, but
this field also poses its own unique challenges.
• ML in BM research will continue advancing driven by: more data, new expectations
and emerging questions.
• Supervised learning, including deep learning, will meet many of these needs.
However, unbiased exploration, hypothesis generation and interpretation (incl.
“mechanistic”) remain crucial.
• The use of graphs/networks to represent data, extract predictive features and
integrate datasets together with ML will continue enabling new discoveries and
applications closer to the patient.
Thanks to:
Funding from:
Bioinformatics and Modelling Research
Group (BIOMOD)
Our research partners in Luxembourg and abroad
