Multi-centric learning from medical data




                        Nov 2012
                        Georgi Nalbantov
Multi-centric learning from medical data: why?



- Multiple sets of medical data exist in different hospitals

- Currently: models are built from data in 1 center only

- Currently: external model validation requires standardization

- Data privacy is an issue: data cannot leave hospitals (easily)
Multi-centric learning from medical data: why?



General Hypothesis:

       The current way of learning from medical data is suboptimal, as modeling
       techniques do not have access to all available data




Specific Hypothesis:

       A distributed learning environment (euroCAT), by giving local access to all
       data, can be used to produce optimal models, identical to those obtained if
       the data were centralized*


 * For some modeling techniques
Learning from medical data: the current state



     Center 1            Center 2             Center 3             Center 4         Center 5




C2   C3   C4   C5   C1    C3   C4   C5   C1    C2   C4   C5   C1    C2   C3   C5   C1   C2   C3   C4




     We learn a model from one center only and validate at other centers (if possible)

     We also check the predictions of the doctors. (The gold standard?)
Learning from medical data: the challenge

Optimal solution: learning from centralized data.
The data is decentralized (C2, C3, C4, C5). How to achieve that?

     Data for learning:   Center 2, Center 3, Center 4, Center 5
     Data to predict:     Center 1

Option 1: Centralized learning
     Combine data from sites 2, 3, 4 and 5 in one central database.
     That is, bring the data to the model: NOT FEASIBLE

Option 2: Distributed learning
     Apply “distributed learning” to the data from sites 2, 3, 4 and 5.
     That is, bring the model to the data: FEASIBLE
Centralized Learning
                       Nalbantov and Wiessler, Oct 2012

    Data centralization

    Learning
Distributed Learning
                       Nalbantov and Wiessler, Oct 2012
Distributed Learning: doing it


- For distributed learning we need a statistical modeling technique that can
  learn in distributed mode, that is, without being able to “see” all the data at once.

- There exist learning models that find the same optimal solution
  whether or not the data is scattered across different centers.


We choose one of them for this study: support vector machines (SVMs)

- Have shown excellent results across a wide range of data-analysis problems
- Robust to the inclusion of many features (bye-bye to the “15-1” rule of thumb)
- Can be constructed in distributed-learning mode
Learning from medical data: distributed learning


              SVMs




    Model evaluation:
    ROC curve
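
The ROC-based evaluation named above is usually summarised by the area under the curve (AUC). A minimal sketch, assuming binary labels (1 = event, 0 = no event) and one real-valued score per patient; the function name `auc` and the toy numbers are illustrative and not taken from the slides. It uses the rank interpretation of AUC: the fraction of (positive, negative) pairs in which the positive patient gets the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive case ranked above negative case
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): three of the four pairs are ordered correctly.
example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(example_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the external-validation numbers in this deck should be read.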
The “Trial”:
 prediction of 2-year survival of lung cancer patients


Patients: 322 (Maastro) lung cancer patients, distributed across 5 sites:

Maastro            186
Liege              52
Hasselt/Genk       45
Aachen             7
Eindhoven          32



Endpoint: 2-year survival


Method: distributed learning SVMs (ref: Boyd, ADMM), euroCAT

Predictive features: gender, WHO performance status, FEV1, number of
PLNSs, GTV (volume) and “EQD2,T”
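
The slides name ADMM (ref: Boyd) as the way to train the SVM without moving data. The following is a minimal sketch of consensus ADMM under several assumptions that are not stated in the deck: a linear SVM, a squared hinge loss (so plain gradient descent can solve each local step), synthetic data, and made-up values for `lam`, `rho` and the iteration counts. All function names are hypothetical. Each site only ever sends its local coefficient vector; the raw rows stay where they are.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hinge2_grad(w, X, y):
    """Gradient of the squared hinge loss  sum_j max(0, 1 - y_j * <w, x_j>)^2."""
    g = [0.0] * len(w)
    for xj, yj in zip(X, y):
        slack = 1.0 - yj * dot(w, xj)
        if slack > 0.0:
            for k, xk in enumerate(xj):
                g[k] -= 2.0 * slack * yj * xk
    return g

def descend(w, X, y, anchor, weight, steps):
    """Gradient descent on  loss(w) + (weight / 2) * ||w - anchor||^2."""
    step = 1.0 / (2.0 * sum(dot(x, x) for x in X) + weight)  # 1 / Lipschitz bound
    for _ in range(steps):
        g = hinge2_grad(w, X, y)
        w = [wk - step * (gk + weight * (wk - ak)) for wk, gk, ak in zip(w, g, anchor)]
    return w

def admm_consensus_svm(sites, lam=1.0, rho=5.0, rounds=300, local_steps=50):
    dim, n = len(sites[0][0][0]), len(sites)
    z = [0.0] * dim                       # shared (consensus) model
    w = [[0.0] * dim for _ in sites]      # one local model per site
    u = [[0.0] * dim for _ in sites]      # scaled dual variables
    for _ in range(rounds):
        for i, (X, y) in enumerate(sites):            # runs inside each hospital
            anchor = [zk - uk for zk, uk in zip(z, u[i])]
            w[i] = descend(w[i], X, y, anchor, rho, local_steps)
        # the coordinator sees only w[i] and u[i], never patient rows
        z = [rho * sum(w[i][k] + u[i][k] for i in range(n)) / (lam + n * rho)
             for k in range(dim)]
        for i in range(n):
            u[i] = [uk + wk - zk for uk, wk, zk in zip(u[i], w[i], z)]
    return z

def make_site(rng, size):
    """Synthetic site: two features plus a constant 1.0 for the intercept."""
    X, y = [], []
    for _ in range(size):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        X.append([a, b, 1.0])
        y.append(1.0 if a + 0.5 * b + rng.gauss(0, 0.3) > 0 else -1.0)
    return X, y

rng = random.Random(0)
sites = [make_site(rng, size) for size in (12, 8, 5)]
z_distributed = admm_consensus_svm(sites)

# Reference: the same objective fitted on the pooled data.
X_all = [x for X, _ in sites for x in X]
y_all = [t for _, y in sites for t in y]
w_central = descend([0.0] * 3, X_all, y_all, [0.0] * 3, 1.0, 5000)
gap = max(abs(a - b) for a, b in zip(z_distributed, w_central))
print("largest coefficient gap:", gap)
```

The comparison at the end is the point of the sketch: the pooled fit and the distributed fit minimise the same objective, so their coefficients should agree up to numerical tolerance, which is the sense in which the data "behaves like centralized".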
The “Trial”:
prediction of 2-year survival of lung cancer patients



                          The data for the trial

  site           patients   2-year survival   gender   WHO   FEV1   number of PLNSs   GTV (volume)   EQD2,T
  Maastro        186        …                 …        …     …      …                 …              …
  Liege          52         …                 …        …     …      …                 …              …
  Hasselt/Genk   45         …                 …        …     …      …                 …              …
  Aachen         7          …                 …        …     …      …                 …              …
  Eindhoven      32         …                 …        …     …      …                 …              …
The “gold standard”: doctors’ predictions
prediction of 2-year survival of lung cancer patients
“Traditional” solution
    prediction of 2-year survival of lung cancer patients

    Build the model at center 1. Validate the model at centers 2, 3, 4 and 5.

    site            patients
    Maastro         186
    Liege           52
    Hasselt/Genk    45
    Aachen          7
    Eindhoven       32

Step 1. Build an SVM model from the data in center 1.
             - There is no “one button” to press. SVM is a “family” of models, and the trained statistician has to
               choose one family member, much as a surgeon has to choose from a variety of “knives”.


Step 2. Model evaluation: how will our model perform outside our hospital?
             - Perform cross-validation to find the optimal SVM from the “family”


Step 3. Build the final model using the “best-performing” SVM from the SVM family.
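
Steps 1 to 3 can be sketched as a small model-selection loop. This is an illustration under assumptions, not the procedure used in the study: the data is synthetic, the lambda grid 1 to 5 and the 5 folds are chosen for the example, the SVM is linear with a squared hinge loss fitted by gradient descent, and all function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_svm(X, y, lam, steps=300):
    """Linear SVM (squared hinge loss + (lam/2)*||w||^2) via gradient descent."""
    w = [0.0] * len(X[0])
    step = 1.0 / (2.0 * sum(dot(x, x) for x in X) + lam)
    for _ in range(steps):
        g = [lam * wk for wk in w]
        for xj, yj in zip(X, y):
            slack = 1.0 - yj * dot(w, xj)
            if slack > 0.0:
                for k, xk in enumerate(xj):
                    g[k] -= 2.0 * slack * yj * xk
        w = [wk - step * gk for wk, gk in zip(w, g)]
    return w

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l > 0]
    neg = [s for l, s in zip(labels, scores) if l < 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k):
    """Split indices into k folds with a similar class mix in each fold."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    return [order[f::k] for f in range(k)]

def cross_validated_auc(X, y, lam, k=5):
    total = 0.0
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        w = train_svm([X[i] for i in train_idx], [y[i] for i in train_idx], lam)
        total += auc([y[i] for i in fold], [dot(w, X[i]) for i in fold])
    return total / k

# Synthetic "center 1" data: 40 patients, two features plus an intercept column.
rng = random.Random(1)
y = [1.0 if i % 2 == 0 else -1.0 for i in range(40)]
X = [[rng.gauss(0.8 * t, 1.0), rng.gauss(0.0, 1.0), 1.0] for t in y]

lambdas = [1.0, 2.0, 3.0, 4.0, 5.0]                       # Step 1: the "family"
cv_auc = {lam: cross_validated_auc(X, y, lam) for lam in lambdas}   # Step 2
best_lambda = max(cv_auc, key=cv_auc.get)
final_model = train_svm(X, y, best_lambda)                # Step 3: refit on all data
print(best_lambda, cv_auc[best_lambda])
```

The cross-validated AUC only estimates performance inside center 1. Whether the selected model transfers to other hospitals still has to be checked by external validation.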
“Traditional” solution
    prediction of 2-year survival of lung cancer patients

Center 1: SVM family with different lambdas (from “SVM with lambda = 1” to “SVM with lambda = 5”),
compared by cross-validation ROC.

Build the final SVM model on all data from center 1, that is “SVM with lambda = 2”.

    External validation    AUC
    Center 2               0.723
    Center 3               0.757
    Center 4               0.671
    Center 5               0.6
euroCAT solution:
        prediction of 2-year survival of lung cancer patients

In both cases the data is decentralized (C2, C3, C4, C5).

                         Learning from centralized data    Distributed learning
    Solution             optimal solution                  optimal solution*
    Approach             Bring the data to the model       Bring the model to the data
    Data for learning    Centers 2, 3, 4 and 5             Centers 2, 3, 4 and 5 (data BEHAVES like centralized*)
    Data to predict      Center 1                          Center 1
    Feasibility          NOT FEASIBLE                      FEASIBLE (ex: euroCAT)

* Using ADMM to solve SVM
euroCAT: a breakthrough

            What is the potential benefit from multi-centric batch learning on
            predicting survival?

Training site(s)   Predicted site   AUC euroCAT learning on PREDICTED site   color
                                    (same result as centralized)
2                  1                0.754                                    red
3                  1                0.678                                    green
4                  1                0.610                                    cyan
5                  1                0.723                                    pink
2,3,4,5            1                0.766                                    blue
1,2,3,4,5          world            ?                                        ?
The future



How can patients/clinics profit from distributed learning medical environments?


- Use real multi-centric data for modeling

- Use multiple endpoints: survival, dyspnea, dysphagia, fibrosis, etc.

- Include more variables: imaging, DNA, etc.

- Use standardized data (and more data)

- Etc…
Thank you for your attention




Do you have any questions or comments?
www.maastro.nl
info@maastro.nl

