Dekker, TROG - Learning outcome prediction models from cancer data - 2017
1. Learning outcome prediction models
from cancer data
Andre Dekker
Department of Radiation Oncology (MAASTRO)
GROW - Maastricht University Medical Centre +
Maastricht, The Netherlands
SLIDES AVAILABLE ON SLIDESHARE
(slideshare.net/AndreDekker)
2. 2
Disclosures
• Research collaborations incl. funding and speaker honoraria
– Varian (VATE, SAGE, ROO, chinaCAT, euroCAT), Siemens (euroCAT), Sohard (SeDI,
CloudAtlas), Mirada Medical (CloudAtlas), Philips (EURECA, TraIT, SWIFT-RT, BIONIC),
Xerox (EURECA), De Praktijkindex (DLRA), ptTheragnostic (DART, Strategy), CZ (My
Best Treatment)
• Public research funding
– Radiomics (USA-NIH/U01CA143062), euroCAT (EU-Interreg), duCAT & Strategy
(NL-STW), EURECA (EU-FP7), SeDI & CloudAtlas & DART (EU-EUROSTARS), TraIT
(NL-CTMM), DLRA (NL-NVRO), BIONIC (NWO)
• Spin-offs and commercial ventures
– MAASTRO Innovations B.V. (CSO)
– Various patents on medical machine learning
3. 3
TROG 2017 talks
• Learning outcome prediction models from cancer data
– Technical Research Workshop, Monday 8:40-9:10, followed by
Panel Discussion
• Big Data in Radiation Oncology
– Statistical Methods, Evidence Appraisal and Research for
Trainees, Monday 14:50-15:20
• Knowledge Engineering in Oncology
– TROG Plenary, Tuesday, 9:25-10:00
• Radiomics for Oncology
– TROG Plenary, Thursday, 11:50-12:20
(Slide annotations: "Some overlap", "No overlap" between the talks above)
4. 4
Learning objectives
After the lecture, attendees should be able to
• Name the major sources of cancer data and their absolute and relative size
• Understand the challenges of sharing data and solutions to these
• Itemize steps in the methodology to go from data to models
• Appraise papers that describe models incl. using TRIPOD
7. 7
Barriers to sharing data
[..] the problem is not really technical […]. Rather, the problems
are ethical, political, and administrative.
Lancet Oncol 2011;12:933
1. Administrative (I don’t have the resources)
2. Political (I don’t want to)
3. Ethical (I am not allowed to)
4. Technical (I can’t)
8. 8
Common approaches to sharing
• Sharing standardized, highly curated data from clinical
research programs
• Very useful, but only 3% of patients (if that)
• Sharing standardized, highly curated data to clinical
registries
• Very useful, but limited amount of features and a lot of
work
• Big Data companies, usually cloud-based (Watson
Health Cloud, Flatiron/Google, ASCO/SAP CancerLinQ)
• Worries about privacy, loss of control, limited
reusability, silos
9. 9
Data landscape
• Clinical research
• 3% of patients
• 100% of features
• 5% missing
• 285 data points
• Clinical registries
• 100% of patients
• 3% of features
• 20% missing
• 240 data points
• Clinical routine
• 100% of patients
• 100% of features
• 80% missing
• 2000 data points
(Chart axes: data elements vs. patients)
10. 10
A different approach
• If sharing is the problem: Don’t share the data
• If you can’t bring the data to the research
• You have to bring the research to the data
• Challenges
– The research application has to be distributed (trains & track)
– The data has to be understandable by an application (i.e. not a human) -> FAIR data stations
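The "bring the research to the data" idea can be sketched in a few lines. This is a minimal toy of federated gradient averaging for logistic regression, assuming a central coordinator and two hypothetical hospital datasets; it illustrates the trains-and-track pattern, not the actual euroCAT implementation.

```python
import math

def local_gradient(weights, X, y):
    """Logistic-regression gradient on one hospital's private data.
    Only this aggregate leaves the site; the patient records never do."""
    grad = [0.0] * len(weights)
    for xi, yi in zip(X, y):
        z = sum(w * x for w, x in zip(weights, xi))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xij in enumerate(xi):
            grad[j] += (p - yi) * xij
    return [g / len(y) for g in grad]

def federated_fit(sites, n_features, lr=0.5, epochs=300):
    """The learning 'train' visits every data station: each site returns
    its local gradient, the coordinator averages them and updates the model."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grads = [local_gradient(w, X, y) for X, y in sites]
        avg = [sum(g[j] for g in grads) / len(grads) for j in range(n_features)]
        w = [wj - lr * a for wj, a in zip(w, avg)]
    return w

# Two hypothetical hospitals; column 0 is a bias term, column 1 a risk feature
site_a = ([[1.0, 0.1], [1.0, 0.2], [1.0, 0.9]], [0, 0, 1])
site_b = ([[1.0, 0.3], [1.0, 0.8], [1.0, 0.7]], [0, 1, 1])
weights = federated_fit([site_a, site_b], n_features=2)
```

Because every site only ever exports a gradient vector, the coordinator learns the same model it would on pooled data without any record crossing an institutional boundary.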
12. 12
Typical Data Quality challenges
• Data are unstructured
• Data are not understandable
• Data are missing
• Data are incorrect
• Data are contradicting
• Data are biased
• Data are missing in a biased way (not missing at random)
• Garbage in – Garbage out?
Examples:
• 声门下区 (Chinese: "subglottic region")
• T4N0M0 Stage IV patient
• Patient weighing 1000 kg
• Grade 3+ toxicities
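Problems like these can be caught with simple plausibility rules before any learning starts. A sketch, with field names and thresholds that are illustrative only (not from the talk):

```python
def check_record(rec):
    """Flag records with the kinds of problems listed above."""
    issues = []
    w = rec.get("weight_kg")
    if w is None:
        issues.append("weight missing")
    elif not 20 <= w <= 400:
        issues.append(f"implausible weight: {w} kg")
    # Rough contradiction check: M0 in the TNM string together with an
    # overall stage of IV usually needs review (though some TNM editions
    # do allow T4N0M0 to map to stage IV for certain tumour sites).
    if "M0" in rec.get("tnm", "") and rec.get("stage") == "IV":
        issues.append("TNM (M0) vs overall stage (IV): review for contradiction")
    return issues

flags = check_record({"weight_kg": 1000, "tnm": "T4N0M0", "stage": "IV"})
```

Rules like this turn "garbage in" into an explicit worklist instead of silent noise in the model.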
16. 16
A bit more technical detail
• Keep data locally
• Standardize it according to
an ontology
• Make and send around
learning “bots”
• Share the results - not the
data!
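The "standardize according to an ontology" step can be pictured as a lookup from each hospital's local vocabulary onto shared concepts, so a visiting learning bot can query every station the same way. The term URIs below are illustrative placeholders, not real ontology codes:

```python
# Local (field, value) pairs mapped onto shared ontology terms.
# In practice these would be codes from a published ontology; the
# "onto:" terms here are made up for illustration.
ONTOLOGY = {
    ("gender", "m"): "onto:Male",
    ("gender", "male"): "onto:Male",
    ("gender", "f"): "onto:Female",
    ("stage", "4"): "onto:StageIV",
    ("stage", "iv"): "onto:StageIV",
}

def standardize(record):
    """Replace local values by ontology terms; unmapped values are marked."""
    out = {}
    for field, value in record.items():
        key = (field, str(value).strip().lower())
        out[field] = ONTOLOGY.get(key, f"onto:UNMAPPED({value})")
    return out

# Two hospitals coding the same patient differently now agree:
a = standardize({"gender": "M", "stage": "IV"})
b = standardize({"gender": "male", "stage": "4"})
```

Once both stations emit the same terms, one learning application can run against either without site-specific code.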
17. 17
Even more technical details
• De-identification
• Semantic web, linked data
• Imaging/DICOM data & clinical data stream
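A minimal sketch of the de-identification step, assuming a salted one-way hash as pseudonym; field names are illustrative and a real pipeline would cover far more identifiers (dates, free text, DICOM headers):

```python
import hashlib

DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}

def deidentify(record, salt):
    """Drop direct identifiers and replace the patient ID with a salted
    one-way hash. The salt never leaves the hospital, so outside parties
    cannot reverse or rebuild the pseudonym."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    token = hashlib.sha256((salt + str(record["patient_id"])).encode()).hexdigest()
    out["patient_id"] = token[:16]
    return out

rec = deidentify({"patient_id": "12345", "name": "J. Doe", "stage": "IV"},
                 salt="local-secret")
```

Keeping the salt inside the hospital means the same patient always gets the same pseudonym locally (so records can still be linked across systems) while the mapping stays irreversible outside.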
20. 20
How much data do you need?
• Rule of thumb: min. 10 events per input feature
• 200 NSCLC patients
• 25% survival at two years
• 50 events
• 10 input features
• More is better
Source: vitalflux.com (2017)
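The rule of thumb is simple arithmetic. With the slide's numbers (200 NSCLC patients, 25% in the event class, i.e. 50 events), 10 events per feature supports a 5-feature model; a 10-feature model would need roughly twice the cohort:

```python
import math

def max_features(n_patients, event_rate, events_per_feature=10):
    """Largest model the 10-events-per-feature rule of thumb supports."""
    return int(n_patients * event_rate) // events_per_feature

def required_patients(n_features, event_rate, events_per_feature=10):
    """Patients needed for 10 events per feature at a given event rate."""
    return math.ceil(n_features * events_per_feature / event_rate)

supported = max_features(200, 0.25)       # 50 events -> 5 features
needed = required_patients(10, 0.25)      # 10 features -> 400 patients
```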
22. 22
Considerations for machine learning
• Discrimination (AUC)
• Calibration (Brier)
• Interpretability (black box vs. transparent)
• Can it handle low data quality (of training and validation)?
• Can it be learned in a distributed setting?
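The two headline metrics are easy to compute by hand. A self-contained sketch of ROC AUC (via its rank-comparison formulation) and the Brier score, on a made-up four-patient example:

```python
def auc(y_true, y_prob):
    """Discrimination: the chance that a randomly chosen event patient gets
    a higher predicted risk than a randomly chosen non-event patient
    (the rank-comparison formulation of ROC AUC)."""
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Brier score: mean squared error of predicted probabilities
    (0 is perfect; 0.25 is an uninformative coin-flip model).
    Note it mixes calibration and discrimination into one number."""
    return sum((p - t) ** 2 for p, t in zip(y_prob, y_true)) / len(y_true)

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
```

Here `auc(y, p)` is 0.75 (three of four event/non-event pairs are ranked correctly) and `brier(y, p)` is about 0.158.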
23. 23
Choose already
Simple and quick, but need complete data
• Logistic regression
• Support Vector Machines
Intuitive and can handle missing data
• Bayesian Networks
All can be learned in a distributed setting
Review pending
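The "can handle missing data" property of Bayesian models can be shown with the simplest special case, naive Bayes: a feature whose value is unknown is simply left out of the product, which (because the conditional probabilities of its values sum to one) is exactly marginalizing it out. The classes, features, and numbers below are made up for illustration:

```python
def predict(priors, cond, record):
    """Posterior over classes; a feature with value None is skipped,
    which for naive Bayes equals marginalizing the missing value out."""
    scores = {}
    for c, prior in priors.items():
        s = prior
        for feat, val in record.items():
            if val is not None:
                s *= cond[c][feat][val]
        scores[c] = s
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical two-class model: 2-year death vs survival, with tumour
# stage and WHO performance status as features
priors = {"dead": 0.25, "alive": 0.75}
cond = {
    "dead":  {"stage": {"early": 0.2, "late": 0.8}, "who": {"0-1": 0.4, "2+": 0.6}},
    "alive": {"stage": {"early": 0.7, "late": 0.3}, "who": {"0-1": 0.8, "2+": 0.2}},
}
# WHO status missing -> prediction still works, from stage alone
posterior = predict(priors, cond, {"stage": "late", "who": None})
```

Logistic regression and SVMs, by contrast, need every input filled in (or imputed) before they can score a patient.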
25. 25
Model validation
• Discrimination: Is the model able to classify the population into two
or more groups with different observed survival?
• Calibration: Is the estimated probability of survival equal to the
observed survival probability?
• Clinical usefulness: Is the data on which the model is based
representative for my patient and is the predicted outcome clinically
relevant for my patient?
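Discrimination and calibration can be checked together by grouping patients on predicted risk and comparing predicted with observed event rates. A minimal two-group sketch on hypothetical data (a real validation would use more groups and confidence intervals):

```python
def calibration_by_group(y_true, y_prob, cutoff=0.5):
    """Split into low/high predicted-risk groups, then compare mean
    predicted risk with the observed event rate per group.
    Discrimination: the groups should differ in observed outcome.
    Calibration: predicted should be close to observed in each group."""
    groups = {"low": [], "high": []}
    for t, p in zip(y_true, y_prob):
        groups["low" if p < cutoff else "high"].append((t, p))
    result = {}
    for name, rows in groups.items():
        if rows:
            predicted = sum(p for _, p in rows) / len(rows)
            observed = sum(t for t, _ in rows) / len(rows)
            result[name] = (predicted, observed)
    return result

# Toy cohort: 1 = event (e.g. death at two years); probabilities made up
groups = calibration_by_group([0, 0, 1, 0, 1, 1],
                              [0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
```

Clinical usefulness cannot be computed this way at all: it requires judging whether the development cohort resembles the patient in front of you.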
31. 31
Learning objectives
After the lecture, attendees should be able to
• Name the major sources of cancer data and their absolute and relative size
• Understand the challenges of sharing data and solutions to these
• Itemize steps in the methodology to go from data to models
• Appraise papers that describe models incl. using TRIPOD
32. 32
Acknowledgements
• Fudan Cancer Center, Shanghai, China
• Varian, Palo Alto, CA, USA
• Siemens, Malvern, PA, USA
• RTOG, Philadelphia, PA, USA
• MAASTRO, Maastricht, Netherlands
• Policlinico Gemelli, Roma, Italy
• UH Ghent, Belgium
• UZ Leuven, Belgium
• Radboud, Nijmegen, Netherlands
• University of Sydney, Australia
• University of Michigan, Ann Arbor, USA
• Liverpool and Macarthur CC, Australia
• CHU Liege, Belgium
• Uniklinikum Aachen, Germany
• LOC Genk/Hasselt, Belgium
• Princess Margaret CC, Canada
• The Christie, Manchester, UK
• UH Leuven, Belgium
• State Hospital, Rovigo, Italy
• Illawarra Shoalhaven CC, Australia
• Catharina Zkh, Eindhoven, Netherlands
• Philips, Eindhoven, Netherlands
More info on: www.predictcancer.org www.cancerdata.org
www.eurocat.info www.mistir.info
33. Thank you for your attention
Andre Dekker
Department of Radiation Oncology (MAASTRO)
GROW - Maastricht University Medical Centre +
Maastricht, The Netherlands