2. Multi-centric learning from medical data: why?
Multiple sets of medical data exist in different hospitals
Currently: models are built from data in one center only
Currently: external model validation requires standardization
Data privacy is an issue: data cannot (easily) leave hospitals
3. Multi-centric learning from medical data: why?
General Hypothesis:
The current way of learning from medical data is suboptimal, as modeling techniques do not have access to all available data.
Specific Hypothesis:
A distributed learning environment (euroCAT), by giving local access to all data, can be used to produce optimal models, the same as if the data were centralized*.
* For some modeling techniques
4. Learning from medical data: the current state
[Diagram: a model built at one center (C1-C5) is validated at each of the other four centers]
We learn a model from one center only and validate it at the other centers (where possible).
We also compare against the doctors' own predictions. (The gold standard?)
6. Learning from medical data: the challenge
The optimal solution is to learn from centralized data; in practice the data is decentralized across the centers. How can we achieve that?
[Diagram: the data for learning sits at centers 2-5; the data to predict is at center 1]
Option 1: Centralized learning. Combine the data from the four learning sites into one central database, that is, bring the data to the model: NOT FEASIBLE.
Option 2: Distributed learning. Apply "distributed learning" to the data at the four sites, that is, bring the model to the data: FEASIBLE.
7. Centralized Learning
Nalbantov and Wiessler, Oct 2012
[Diagram: data centralization, followed by learning]
9. Distributed Learning: doing it
- For distributed learning we need a statistical modeling technique that can learn in distributed mode, that is, without being able to "see" all the data at once.
- There exist learning models that can find the optimal solution regardless of whether the data is scattered across different centers. We choose one of them for this study: SVMs.
- SVMs have shown excellent results across a wide range of data-analysis problems
- They are robust to the inclusion of many features (bye-bye to the "15-1" rule of thumb)
- They can be constructed in distributed-learning mode
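The distributed construction referred to here is consensus ADMM (Boyd et al.): each center repeatedly solves a small subproblem on its own patients and shares only its model coefficients, never the data, while a consensus step averages the local models. A minimal numpy sketch of that idea follows; it uses a squared-hinge SVM and a few gradient steps as the local solver, which is a simplification for illustration, not the euroCAT implementation, and all data in it is synthetic.

```python
import numpy as np

def local_update(X, y, z, u, rho, w, n_steps=50, lr=0.1):
    # Approximately solve the local ADMM subproblem at one center:
    #   min_w  sum_i max(0, 1 - y_i x_i^T w)^2 + (rho/2) ||w - z + u||^2
    # Only this center's (X, y) is used; z is the shared consensus model.
    for _ in range(n_steps):
        margins = 1 - y * (X @ w)
        active = margins > 0                      # samples inside the margin
        grad = -2 * (X[active] * (y[active] * margins[active])[:, None]).sum(axis=0)
        grad += rho * (w - z + u)                 # pull toward the consensus
        w = w - lr * grad / len(y)
    return w

def consensus_svm(datasets, lam=1.0, rho=1.0, n_iter=60):
    # datasets: list of (X, y) pairs, one per center, labels in {-1, +1}.
    # Only model vectors (ws, us, z) travel between centers.
    d = datasets[0][0].shape[1]
    K = len(datasets)
    ws = [np.zeros(d) for _ in range(K)]
    us = [np.zeros(d) for _ in range(K)]
    z = np.zeros(d)
    for _ in range(n_iter):
        ws = [local_update(X, y, z, u, rho, w)
              for (X, y), u, w in zip(datasets, us, ws)]
        # z-update: regularized average of the local models
        # (closed form for the ridge penalty (lam/2)||z||^2).
        w_bar, u_bar = np.mean(ws, axis=0), np.mean(us, axis=0)
        z = rho * K * (w_bar + u_bar) / (lam + rho * K)
        # Dual update keeps each center's model tied to the consensus.
        us = [u + w - z for u, w in zip(us, ws)]
    return z

# Demo on synthetic, linearly separable data split across 3 "centers".
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for _ in range(3):
    X = rng.normal(size=(60, 3))
    datasets.append((X, np.sign(X @ w_true)))
z = consensus_svm(datasets, lam=0.1)
X_all = np.vstack([X for X, _ in datasets])
y_all = np.concatenate([y for _, y in datasets])
acc = np.mean(np.sign(X_all @ z) == y_all)
```

The point of the sketch is the communication pattern: the z-update only needs the averages of the local coefficient vectors, so each center "sees" nothing but model parameters.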
11. The “Trial”:
prediction of 2-year survival of lung cancer patients
Patients: 322 (Maastro) lung cancer patients, distributed across 5 sites:
Maastro 186
Liege 52
Hasselt/Genk 45
Aachen 7
Eindhoven 32
Endpoint: 2-year survival
Method: distributed learning SVMs (ref: Boyd, ADMM), euroCAT
Predictive features: gender, WHO performance status, FEV1, number of
PLNSs, GTV (volume) and “EQD2,T”
12. The “Trial”:
prediction of 2-year survival of lung cancer patients
The data for the trial:

site       patients  2-year survival  gender  WHO  FEV1  no. of PLNSs  GTV (volume)  EQD2,T
Maastro    186       …                …       …    …     …             …             …
Liege      52        …                …       …    …     …             …             …
Hasselt/G  45        …                …       …    …     …             …             …
Aachen     7         …                …       …    …     …             …             …
Eindhoven  32        …                …       …    …     …             …             …
13. The "gold standard": doctors' predictions
prediction of 2-year survival of lung cancer patients
14. "Traditional" solution
prediction of 2-year survival of lung cancer patients
[Diagram: build the model at center 1 (Maastro, 186 patients); validate it at centers 2-5 (Liege 52, Hasselt/G 45, Aachen 7, Eindhoven 32)]
Step 1. Build an SVM model from the data in center 1.
- There is no single button to press: SVM turns out to be a "family" of models, and the trained statistician has to choose one family member in much the same way as a surgeon has to choose from a variety of "knives".
Step 2. Model evaluation: how will our model perform outside my hospital?
- Perform cross-validation to find the optimal SVM from the "family".
Step 3. Build the final model using the best-performing SVM from the SVM family.
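In scikit-learn terms, the SVM "family" indexed by lambda corresponds to a grid over the inverse-regularization parameter C, and Steps 2-3 are a cross-validated grid search followed by a refit. A minimal sketch, assuming scikit-learn is available and using synthetic stand-in data (the feature matrix and labels below are not the trial data):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for center 1: 186 patients, 6 predictive features.
rng = np.random.default_rng(1)
X = rng.normal(size=(186, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=186) > 0).astype(int)

# Step 2: cross-validate over the regularization grid (the SVM "family")
# scoring by ROC AUC, as in the slides.
grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["C"]

# Step 3: refit the chosen family member on all of center 1's data.
final_model = LinearSVC(C=best_C).fit(X, y)
```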
15. "Traditional" solution
prediction of 2-year survival of lung cancer patients
[Diagram: SVM family members with different lambdas (e.g. lambda = 1, ..., 5) are compared by cross-validation (ROC); the final model, "SVM with lambda = 2", is built on all data from center 1]

External validation  AUC
Center 2             0.723
Center 3             0.757
Center 4             0.671
Center 5             0.6
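Each external AUC above comes from freezing the center-1 model and scoring it on another center's patients. A minimal sketch of that validation step, again on synthetic stand-in data rather than the trial data, assuming scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
# Stand-ins: "center 1" (186 patients) for training, "center 2" (52) for validation.
X1 = rng.normal(size=(186, 6)); y1 = (X1[:, 0] > 0).astype(int)
X2 = rng.normal(size=(52, 6));  y2 = (X2[:, 0] > 0).astype(int)

# The final model is trained on center 1 only...
model = LinearSVC(C=1.0).fit(X1, y1)

# ...and evaluated, unchanged, on center 2's data: the external AUC.
auc = roc_auc_score(y2, model.decision_function(X2))
```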
16. euroCAT solution:
prediction of 2-year survival of lung cancer patients
Learning from centralized data (the optimal solution): bring the data to the model. NOT FEASIBLE.
Distributed learning (the optimal solution*): bring the model to the data. FEASIBLE (ex: euroCAT). The decentralized data BEHAVES like centralized data*.
[Diagram: in both panels the data for learning sits at centers 2-5 and the data to predict at center 1]
*Using ADMM to solve the SVM
17. euroCAT: a breakthrough
What is the potential benefit from multi-centric batch learning on predicting survival?

Training site(s)  Predicted site  AUC    color
2                 1               0.754  red
3                 1               0.678  green
4                 1               0.610  cyan
5                 1               0.723  pink
2,3,4,5           1               0.766  blue (euroCAT learning; same result as centralized)
1,2,3,4,5         world           ?      ?
18. The future
How can patients and clinics benefit from distributed-learning medical environments?
- Use real multi-centric data for modeling
- Use multiple endpoints: survival, dyspnea, dysphagia, fibrosis, etc.
- Include more variables: imaging, DNA, etc.
- Use standardized data (and more data)
- Etc…
19. Thank you for your attention
Do you have any questions or remarks?
www.maastro.nl
info@maastro.nl