Elsevier Health Analytics
Medical Graph v1
Empowering Knowledge™
Towards
• A map of medicine
• Personalized decision support in a clinical setting
Paul Hellwig
Director Research & Development
p.hellwig@elsevier.com
https://www.linkedin.com/in/paulhellwig
Nov, 2016
Elsevier
• Publisher and world-leading provider of information solutions
• 6,700 people worldwide, €2.8 billion in revenue [1]
• >2,200 journals, >25,000 book titles
• ScienceDirect, Scopus, ClinicalKey and Nursing Consult
• Health Analytics Team in Berlin

LexisNexis
• Helps predict and manage risk for industry and government
• 7,200 people, €2.2 billion in revenue [1]
• 35 years of experience in managing big data, currently >5 petabytes
• Developed the HPCC [2] supercomputer platform

[1] 2015
[2] High Performance Computing Cluster

Elsevier Health Analytics combines RELX Group's medical and big data analytics expertise
Elsevier Health Analytics
- Our vision -
Trends driving changes in physician-patient interaction…

1. Medical data explosion
2. Patient empowerment
3. Information explosion

• 25 million biomedical articles referenced on PubMed; 1.2 million new biomedical articles p.a.
• 4,500 tests for gene disorders available (2013: 3,200; +20% CAGR)
• $1,245 cost to sequence a full genome (10/2014: $5,730)
• PatientsLikeMe has 400,000+ members donating data: 31 million data points covering 2,500+ conditions
• 105 mm ECG biosensor: high ECG quality, heart rate, respiration, body temperature, activity, body position; watertight, induction charged, Bluetooth, continuous data feed
…and the real challenge
< 10 minutes [1]
[1] Europe; US up to 20 minutes: Ray KN, Chari AV, Engberg J, Bertolet M, Mehrotra A. Disparities in Time Spent Seeking Medical Care in the United States. JAMA Intern Med. 2015;175(12):1983-1986. doi:10.1001/jamainternmed.2015.4468.
Medical Graph – Research Goal A:
Risk predictions: Which diseases are you likely to get within 4 years?
From Electronic Health Record… to Top Risks
Medical Graph – Research Goal B:
Map: How are diseases, medications and other data connected?
[Figure: example graph with the nodes I65 (occlusion and stenosis of precerebral arteries), G40 (epilepsy), I61 (intracerebral hemorrhage), C71 (malignant neoplasm of brain) and further covariates, connected by has_successor [1] edges; example edge label: odds ratio 1.12; … for 1600 target diseases]
[1] Criteria based on: Jensen et al. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications. 2014 Jun 24;5:4022. doi:10.1038/ncomms5022.
Medical Graph development
Example: Model to predict "I50 – Heart Failure"

Analysis Design
Predict 4-year long-term effects, balanced for all covariates

Timeline
• "PAST" (2009, 2010): I50 -
• "FUTURE" (2011 to 2014): I50 - (coded as 0) or I50 + (coded as 1)

Covariates
• Age
• Gender
• Other diseases
• Medications
• Other
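The analysis design can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the record layout `(patient_id, year, code)`, the helper name `build_labels`, the exact window boundaries, and the exclusion of patients who already carry the target diagnosis in the "PAST" window are all hypothetical choices.

```python
from collections import defaultdict

PAST_YEARS = range(2009, 2011)    # 2009-2010, assumed "PAST" window
FUTURE_YEARS = range(2011, 2015)  # 2011-2014, assumed "FUTURE" window


def build_labels(records, target="I50"):
    """Return {patient_id: 0 or 1} for patients without `target` in the past window."""
    past = defaultdict(set)
    future = defaultdict(set)
    for patient_id, year, code in records:
        if year in PAST_YEARS:
            past[patient_id].add(code)
        elif year in FUTURE_YEARS:
            future[patient_id].add(code)

    labels = {}
    for patient_id in set(past) | set(future):
        if target in past[patient_id]:
            continue  # already diagnosed in the past, so not a new case
        labels[patient_id] = 1 if target in future[patient_id] else 0
    return labels


# Toy records, invented for illustration only
records = [
    ("p1", 2010, "I10"), ("p1", 2013, "I50"),  # new heart failure: coded as 1
    ("p2", 2010, "I10"), ("p2", 2012, "M54"),  # no heart failure: coded as 0
    ("p3", 2009, "I50"), ("p3", 2012, "I50"),  # diagnosed in the past: excluded
]
labels = build_labels(records)
```

The same function could be called once per target code to produce one label vector per target disease.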
Billing data flow
• Primary care: visits & diagnoses
• Secondary care: visits, diagnoses & procedures
• Medication: drug prescriptions
• Other data: further cooperations just started; will enable analysis of vital and laboratory parameters

Source: 60+ sickness funds; anonymized

Feature extraction: our observation / feature matrix
3,943 features for 3.8m patients
• 1,623 targets, 2011-2014
• 2,320 covariates, 2010
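To get a feeling for the size of this matrix, the arithmetic below compares dense storage with compressed sparse row (CSR) storage. The patient and feature counts come from the slide; the 8-byte values, 4-byte indices and the 1% density are assumptions made only for this sketch.

```python
N_PATIENTS = 3_800_000
N_TARGETS = 1623
N_COVARIATES = 2320
N_FEATURES = N_TARGETS + N_COVARIATES  # 3943 features in total


def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Memory for a dense matrix with fixed-size values."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Memory for CSR storage: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


cells = N_PATIENTS * N_FEATURES
nnz = cells // 100  # hypothetical density of 1%, not a figure from the slides

print(f"cells: {cells:,}")
print(f"dense, 8-byte values: {dense_bytes(N_PATIENTS, N_FEATURES) / 1e9:.1f} GB")
print(f"CSR at assumed 1% density: {csr_bytes(N_PATIENTS, nnz) / 1e9:.1f} GB")
```

Whatever the true density is, the dense figure suggests why a sparse representation is the natural choice for diagnosis and prescription indicators.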
Predictive Modeling for ~1600 target diseases
Multiple attempts – no software is perfect

Attempt #1 (on server)
• Machine learning algorithm: component-wise gradient boosting (mboost); GLM for p-values
• Did it work for the full dataset? Worked for 100k patients. Failure reason: RAM (extensive dataset copying)
• Runtime: ~7 min / target model (on 100k patients)

Attempt #2 (on cluster)
• Machine learning algorithm: logistic regression with LASSO; GLM for p-values
• Did it work for the full dataset? Worked for 138 models. Failure reason: memory leak every 30-40 models
• Runtime: ~8 min / target model (on 3.8m patients)

Attempt #3 (on server)
• Machine learning algorithm: linear gradient boosting (sklearn + xgboost); F-test for p-values
• Did it work for the full dataset? Worked for 800k patients. Failure reason: int32 as index for sparse matrices
• Runtime: ~7 min / target model (on 800k patients)
Code for model building

# model 1: component-wise linear boosting (R, mboost)
library(mboost)
# Balanced training set: all positives plus an equally sized random sample of negatives
balanced_rows <- c(which_one, sample(which_zero, length(which_one), replace = FALSE))
boost_train_ds <- glmboost(
  as.formula(paste(icd_atc_use_names[i], "~ .")),
  data = data[ins, ][balanced_rows, ],
  family = Binomial(),
  control = boost_control(mstop = 400, trace = TRUE, center = FALSE))
...

# model 1: GLM with ElasticNet (Python, H2O)
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
model1 = H2OGeneralizedLinearEstimator(
    model_id=post_col, family='binomial', solver='IRLSM',
    alpha=0.99,  # mainly LASSO
    lambda_search=True, standardize=True, intercept=True)
model1.train(x=index_cols, y=post_col, training_frame=training, validation_frame=val)
...

# model 1: linear boosting (Python, XGBoost with the gblinear booster)
import xgboost as xgb
params = {
    'silent': 0, 'nthread': 4,
    'eval_metric': ['error', 'map', 'map@' + str(top1percent_train),
                    'map@' + str(top1percent_eval), 'auc'],
    'objective': 'binary:logistic', 'booster': 'gblinear',
    'lambda': 0,   # L2 regularization (Ridge): none
    'alpha': 500}  # L1 regularization (LASSO)
booster = xgb.train(params, dtrain, num_boost_round=settings.boosting_iterations,
                    evals=[(dtrain, 'train'), (dtest, 'eval')],
                    early_stopping_rounds=10, evals_result=quality)
...
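The mboost snippet trains on a balanced sample: all positive cases plus an equally sized random draw of negatives, taken without replacement. A minimal Python sketch of that sampling step follows. The names `balanced_indices` and `labels` are hypothetical, and the fixed seed is an addition for reproducibility that the original code does not show.

```python
import random


def balanced_indices(labels, seed=0):
    """Return indices of all positives plus an equal number of sampled negatives."""
    rng = random.Random(seed)
    which_one = [i for i, y in enumerate(labels) if y == 1]
    which_zero = [i for i, y in enumerate(labels) if y == 0]
    # As many negatives as positives, drawn without replacement.
    # rng.sample raises ValueError if there are fewer negatives than positives.
    sampled_zero = rng.sample(which_zero, len(which_one))
    return which_one + sampled_zero


# Toy label vector, invented for illustration: 2 positives, 8 negatives
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
idx = balanced_indices(labels)
```

Undersampling shifts the intercept of a logistic model, so predicted probabilities from a model trained this way would need recalibration before being read as absolute risks.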
Validate & test
Interesting effects between disease chapters
[Figure: effects between the disease chapters "Diseases of the nervous system" and "Neoplasms"]
Medical Graph backend

From last run:
• 2,261 nodes
• 434,995 edges

Edges
(N relations = number of relations; Incidents with source = proportion of incidents that have the source; Source with incident = proportion of source patients who get the incident)

Relation       Source     Target   OR      beta     p-value   N relations  Incidents with source  Source with incident  Mean age
has_successor  Intercept  ICD_M54  0.2483  -1.3930
has_successor  AGE        ICD_M54  1.0517  0.0504   0.000000               100.0%                 21.9%
has_successor  GENDER     ICD_M54  0.9944  -0.0056  0.000000  82556        47.2%                  21.2%                 42
has_successor  ICD_I10    ICD_M54  0.9260  -0.0768  0.000000  45013        25.8%                  20.4%                 62
has_successor  ICD_H35    ICD_M54  0.9469  -0.0545  0.000000  8125         4.6%                   19.5%                 62
has_successor  ATC_D01AC  ICD_M54  1.0022  0.0022   0.000000  3382         1.9%                   17.8%                 47
has_successor  ATC_M01AB  ICD_M54  1.2207  0.1994   0.000000  16534        9.5%                   17.0%                 52
has_successor  ICD_H26    ICD_M54  0.9420  -0.0597  0.000000  7550         4.3%                   19.1%                 67
has_successor  ATC_C09AA  ICD_M54  0.9603  -0.0405  0.000000  16840        9.6%                   20.1%                 62
has_successor  ATC_C08CA  ICD_M54  0.9299  -0.0727  0.000000  9892         5.7%                   19.5%                 67
has_successor  ATC_C07BB  ICD_M54  1.0031  0.0031   0.000000  2197         1.3%                   21.3%                 62
has_successor  ICD_H52    ICD_M54  1.0006  0.0006   0.000000  35331        20.2%                  20.5%                 52
has_successor  ATC_M01AE  ICD_M54  1.0450  0.0440   0.000000  22808        13.0%                  16.4%                 42
has_successor  ICD_H43    ICD_M54  1.0300  0.0296   0.000000  3599         2.1%                   20.2%                 62
has_successor  ICD_L85    ICD_M54  0.9362  -0.0660  0.000978  1244         0.7%                   18.4%                 47
has_successor  ICD_H02    ICD_M54  1.0165  0.0164   0.000000  1734         1.0%                   19.8%                 57
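The OR and beta columns appear to be related by OR = exp(beta), as expected for coefficients of a logistic model. The sketch below stores a few edges from the table in a small record type and checks that relation. The `Edge` type and its field names are hypothetical and say nothing about how the actual backend stores edges.

```python
import math
from typing import NamedTuple


class Edge(NamedTuple):
    relation: str
    source: str
    target: str
    beta: float
    odds_ratio: float


# A few rows copied from the edge table
edges = [
    Edge("has_successor", "Intercept", "ICD_M54", -1.3930, 0.2483),
    Edge("has_successor", "AGE", "ICD_M54", 0.0504, 1.0517),
    Edge("has_successor", "ICD_I10", "ICD_M54", -0.0768, 0.9260),
    Edge("has_successor", "ATC_M01AB", "ICD_M54", 0.1994, 1.2207),
]


def odds_ratio(beta):
    """Odds ratio implied by a logistic-regression coefficient."""
    return math.exp(beta)


# Largest gap between the recomputed and the tabulated odds ratio;
# small gaps are expected because both columns are rounded to 4 decimals.
max_error = max(abs(odds_ratio(e.beta) - e.odds_ratio) for e in edges)
```

Read this way, an edge with an OR above 1 (for example ATC_M01AB) marks a source associated with higher odds of the target, and an OR below 1 marks lower odds.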
Medical Graph frontend

Key Learnings
Key learnings from 5 years of working with medical data
• Physicians want explanations. Otherwise they will not trust the predictions.
• Typical best-in-class classification methods (deep learning, random forest) do not yet deliver explainable models. This won't do.
• Open source tools have failures (as have proprietary tools). Debugging can be a nightmare.
• In practice, you need to save the user's processing time, not add to it. Visualization is key.
• Building a classification model using open source tools is simple. Scaling input data size is also manageable. Building 1000+ models is complex.
• Implementing, applying and maintaining a security framework to keep personal health information secure is a substantial effort.
• Feature engineering is not dead. If you want explainable effects, you most probably need linear models, so you need to engineer non-linear effects, e.g. using clusters.
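One simple way to let a linear model express a non-linear effect is to replace a numeric covariate with indicator features for value ranges, so that each range gets its own coefficient. The slides do not specify how the clusters were built. The sketch below swaps in plain binning with hypothetical age boundaries purely as an illustration.

```python
from bisect import bisect_right

AGE_EDGES = [18, 40, 65]  # hypothetical bin boundaries, not from the slides


def age_bin_features(age, edges=AGE_EDGES):
    """One-hot encode age into len(edges) + 1 bins: <18, 18-39, 40-64, 65+."""
    features = [0] * (len(edges) + 1)
    features[bisect_right(edges, age)] = 1
    return features


print(age_bin_features(10))
print(age_bin_features(40))
print(age_bin_features(70))
```

Each indicator stays individually interpretable (one odds ratio per age group), which fits the explainability requirement better than a single opaque non-linear term.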

Elsevier Medical Graph – Using Machine Learning to Achieve Precision Medicine
