2. Multi-centric learning from medical data: why?
Multiple sets of medical data exist in different hospitals
Currently: models are built from data in one center only
Currently: external model validation requires standardization
Data privacy is an issue: data cannot (easily) leave hospitals
3. Multi-centric learning from medical data: why?
General Hypothesis:
The current way of learning from medical data is suboptimal, as modeling techniques do not have access to all available data.
Specific Hypothesis:
A distributed learning environment (euroCAT), by giving local access to all data, can be used to produce optimal models, the same as if the data were centralized*.
* For some modeling techniques
4. Learning from medical data: the current state
[Diagram: a model built at one center (C1-C5) is validated at each of the other four centers]
We learn a model from one center only and validate it at the other centers (where possible).
We also compare against the doctors' own predictions. (The gold standard?)
6. Learning from medical data: the challenge
The optimal solution is to learn from centralized data; in practice the data is decentralized across the centers. How can we achieve that?
[Diagram: the data for learning sits at centers 2-5; the data to predict is at center 1]
Option 1: Centralized learning. Combine the data from the four learning sites into one central database, that is, bring the data to the model: NOT FEASIBLE.
Option 2: Distributed learning. Apply "distributed learning" to the data at the four sites, that is, bring the model to the data: FEASIBLE.
7. Centralized Learning
Nalbantov and Wiessler, Oct 2012
[Diagram: data centralization, followed by learning]
9. Distributed Learning: doing it
- For distributed learning we need a statistical modeling technique that can learn in distributed mode, that is, without being able to "see" all the data at once.
- There exist learning models that can find the optimal solution regardless of whether the data is scattered across different centers. We choose one of them for this study: SVMs.
- SVMs have shown excellent results across a wide range of data-analysis problems
- They are robust to the inclusion of many features (bye-bye to the "15-1" rule of thumb)
- They can be constructed in distributed-learning mode
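The distributed construction referred to here is consensus ADMM (Boyd et al.): each center repeatedly solves a small subproblem on its own patients and shares only its model coefficients, never the data, while a consensus step averages the local models. A minimal numpy sketch of that idea follows; it uses a squared-hinge SVM and a few gradient steps as the local solver, which is a simplification for illustration, not the euroCAT implementation, and all data in it is synthetic.

```python
import numpy as np

def local_update(X, y, z, u, rho, w, n_steps=50, lr=0.1):
    # Approximately solve the local ADMM subproblem at one center:
    #   min_w  sum_i max(0, 1 - y_i x_i^T w)^2 + (rho/2) ||w - z + u||^2
    # Only this center's (X, y) is used; z is the shared consensus model.
    for _ in range(n_steps):
        margins = 1 - y * (X @ w)
        active = margins > 0                      # samples inside the margin
        grad = -2 * (X[active] * (y[active] * margins[active])[:, None]).sum(axis=0)
        grad += rho * (w - z + u)                 # pull toward the consensus
        w = w - lr * grad / len(y)
    return w

def consensus_svm(datasets, lam=1.0, rho=1.0, n_iter=60):
    # datasets: list of (X, y) pairs, one per center, labels in {-1, +1}.
    # Only model vectors (ws, us, z) travel between centers.
    d = datasets[0][0].shape[1]
    K = len(datasets)
    ws = [np.zeros(d) for _ in range(K)]
    us = [np.zeros(d) for _ in range(K)]
    z = np.zeros(d)
    for _ in range(n_iter):
        ws = [local_update(X, y, z, u, rho, w)
              for (X, y), u, w in zip(datasets, us, ws)]
        # z-update: regularized average of the local models
        # (closed form for the ridge penalty (lam/2)||z||^2).
        w_bar, u_bar = np.mean(ws, axis=0), np.mean(us, axis=0)
        z = rho * K * (w_bar + u_bar) / (lam + rho * K)
        # Dual update keeps each center's model tied to the consensus.
        us = [u + w - z for u, w in zip(us, ws)]
    return z

# Demo on synthetic, linearly separable data split across 3 "centers".
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for _ in range(3):
    X = rng.normal(size=(60, 3))
    datasets.append((X, np.sign(X @ w_true)))
z = consensus_svm(datasets, lam=0.1)
X_all = np.vstack([X for X, _ in datasets])
y_all = np.concatenate([y for _, y in datasets])
acc = np.mean(np.sign(X_all @ z) == y_all)
```

The point of the sketch is the communication pattern: the z-update only needs the averages of the local coefficient vectors, so each center "sees" nothing but model parameters.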
11. The “Trial”:
prediction of 2-year survival of lung cancer patients
Patients: 322 (Maastro) lung cancer patients, distributed across 5 sites:
Maastro 186
Liege 52
Hasselt/Genk 45
Aachen 7
Eindhoven 32
Endpoint: 2-year survival
Method: distributed learning SVMs (ref: Boyd, ADMM), euroCAT
Predictive features: gender, WHO performance status, FEV1, number of
PLNSs, GTV (volume) and “EQD2,T”
12. The “Trial”:
prediction of 2-year survival of lung cancer patients
The data for the trial:

site       patients  2-year survival  gender  WHO  FEV1  no. of PLNSs  GTV (volume)  EQD2,T
Maastro    186       …                …       …    …     …             …             …
Liege      52        …                …       …    …     …             …             …
Hasselt/G  45        …                …       …    …     …             …             …
Aachen     7         …                …       …    …     …             …             …
Eindhoven  32        …                …       …    …     …             …             …
13. The "gold standard": doctors' predictions
prediction of 2-year survival of lung cancer patients
14. "Traditional" solution
prediction of 2-year survival of lung cancer patients
[Diagram: build the model at center 1 (Maastro, 186 patients); validate it at centers 2-5 (Liege 52, Hasselt/G 45, Aachen 7, Eindhoven 32)]
Step 1. Build an SVM model from the data in center 1.
- There is no single button to press: SVM turns out to be a "family" of models, and the trained statistician has to choose one family member in much the same way as a surgeon has to choose from a variety of "knives".
Step 2. Model evaluation: how will our model perform outside my hospital?
- Perform cross-validation to find the optimal SVM from the "family".
Step 3. Build the final model using the best-performing SVM from the SVM family.
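In scikit-learn terms, the SVM "family" indexed by lambda corresponds to a grid over the inverse-regularization parameter C, and Steps 2-3 are a cross-validated grid search followed by a refit. A minimal sketch, assuming scikit-learn is available and using synthetic stand-in data (the feature matrix and labels below are not the trial data):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for center 1: 186 patients, 6 predictive features.
rng = np.random.default_rng(1)
X = rng.normal(size=(186, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=186) > 0).astype(int)

# Step 2: cross-validate over the regularization grid (the SVM "family")
# scoring by ROC AUC, as in the slides.
grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]

# Step 3: refit the chosen family member on all of center 1's data.
final_model = LinearSVC(C=best_C).fit(X, y)
```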
15. "Traditional" solution
prediction of 2-year survival of lung cancer patients
[Diagram: SVM family members with different lambdas (e.g. lambda = 1, ..., 5) are compared by cross-validation (ROC); the final model, "SVM with lambda = 2", is built on all data from center 1]

External validation  AUC
Center 2             0.723
Center 3             0.757
Center 4             0.671
Center 5             0.6
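Each external AUC above comes from freezing the center-1 model and scoring it on another center's patients. A minimal sketch of that validation step, again on synthetic stand-in data rather than the trial data, assuming scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
# Stand-ins: "center 1" (186 patients) for training, "center 2" (52) for validation.
X1 = rng.normal(size=(186, 6)); y1 = (X1[:, 0] > 0).astype(int)
X2 = rng.normal(size=(52, 6));  y2 = (X2[:, 0] > 0).astype(int)

# The final model is trained on center 1 only...
model = LinearSVC(C=1.0).fit(X1, y1)

# ...and evaluated, unchanged, on center 2's data: the external AUC.
auc = roc_auc_score(y2, model.decision_function(X2))
```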
16. euroCAT solution:
prediction of 2-year survival of lung cancer patients
Learning from centralized data (the optimal solution): bring the data to the model. NOT FEASIBLE.
Distributed learning (the optimal solution*): bring the model to the data. FEASIBLE (ex: euroCAT). The decentralized data BEHAVES like centralized data*.
[Diagram: in both panels the data for learning sits at centers 2-5 and the data to predict at center 1]
*Using ADMM to solve the SVM
17. euroCAT: a breakthrough
What is the potential benefit from multi-centric batch learning on predicting survival?

Training site(s)  Predicted site  AUC    color
2                 1               0.754  red
3                 1               0.678  green
4                 1               0.610  cyan
5                 1               0.723  pink
2,3,4,5           1               0.766  blue (euroCAT learning; same result as centralized)
1,2,3,4,5         world           ?      ?
18. The future
How can patients and clinics benefit from distributed-learning medical environments?
- Use real multi-centric data for modeling
- Use multiple endpoints: survival, dyspnea, dysphagia, fibrosis, etc.
- Include more variables: imaging, DNA, etc.
- Use standardized data (and more data)
- Etc…
19. Thank you for your attention
Do you have any questions or remarks?
www.maastro.nl
info@maastro.nl