Data Science for (Health) Science:
tales from a challenging front line, and how to cross a few T's
Paolo Missier
School of Computing
Newcastle University, UK
March 2021
A talk given to
The School of Information Sciences
Center for Informatics Research in Science and Scholarship
University of Illinois Urbana-Champaign
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier
2
The message:
1. “Data Science” for Health is hard. The hard part is the data
2. “AI for Health” is (Deep) Machine Learning
3. Ethics. Fairness. Trust. Acceptance.
4. Data Provenance for Data Science: Solution or distraction?
• Transparency
• Trustworthiness
• Traceability
3
A Grand Challenge
https://epsrc.ukri.org/research/ourportfolio/themes/healthcaretechnologies/strategy/grandchallenges/
4
AI for healthcare – the UK landscape
https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences
AI and data science will improve the detection, diagnosis, and treatment of
illness. They will optimise the provision of services, and support health service
providers to anticipate demand and deliver improved patient care.
• Explainability / Interpretability
• Exploiting EHR (Electronic Health Records)
• Image interpretation
• Fairness, Bias
• Ethical issues in …
• Predicting <disease / critical event> …
5
Personalised, Predictive, Preventive, Participatory Medicine (P4)
Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds.
Nat Biotechnol. 2017;35:747.
6
(*) Data-Driven, Personalised, Predictive, Preventive, Participatory
D2P4 (*)
Healthcare
research
• Cleaning
• Integration
• Alignment
• Imputation
• NLP
• …
Physical Activity monitoring
(wearables)
In-patient hospital records
Primary care health records
+ prescriptions
Clinical protocols
Multi-omics
(genomics, transcriptomics,
proteomics, metabolomics…)
Images -- Histology, X-ray, …
Early detection of Type 2 Diabetes /
Metabolic / age-related diseases
Early detection of Parkinson’s
Frailty / intrinsic capacity assessment
Multi Morbidity Long Term Conditions (MLTC)
Covid risk / Post-Acute Covid Syndrome (PACS)
Liver disease progression: NAFLD / NASH
Liquid biopsy
Programming:
Scripting: python, R, …
Workflows: Knime, RapidMiner, …
Methods
Clustering (ML)
Predictive modelling (ML)
Image interpretation / Deep Learning
… “AI”…
(plus traditional statistics!)
7
Big Data for Health Care
Genomics for
personalized medicine
personal monitors /
wearables
Medical Records
Source: Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195
9
D2P4 → Accelerometry
10
Digital biomarkers
Digital biomarkers come from "novel sensing systems capable of continuously tracking
behavioral signals […] capture people's everyday routines, actions, and physiological
changes that can explain outcomes related to health, cognitive abilities, and more"
(Choudhury 2018).
Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156.
https://doi.org/10.1145/3266285
- physical activity
- glucose levels
- blood oxygen levels
- …
Inexpensive → scalable personalised self-monitoring
11
A first project: markers from accelerometers?
Initial study Digital biomarkers + UK Biobank Dataset + Type 2 Diabetes outcome
- physical activity
- glucose levels
- blood oxygen levels
- …
Aligned with the P4 agenda
Readily available dataset
(+) 3,500+ features
(+) multi-omics coverage
(+) genomics
(+) links to EHR
(+) Activity monitors made in Newcastle!
(-) Limited follow-ups: little longitudinal data
(-) Population not random
(-) Activity data / person very limited
100K
Activity traces
13
Using wearable activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P.
Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer
Cohort. JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press)
Feature
extraction
Clustering
Classification
??
14
Granular activity representation
feature extraction 60 features / day
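As a rough illustration of what per-day feature extraction from an accelerometer trace can look like, here is a minimal sketch. The four features and both thresholds are assumptions made up for this example; they are not the 60 features used in the study.

```python
from statistics import mean, pstdev

def daily_features(trace, sedentary_below=0.03, vigorous_above=0.25):
    """Summarise one day of acceleration magnitudes (in g).

    The thresholds and the choice of features are illustrative
    assumptions, not the feature set used in the study.
    """
    if not trace:
        raise ValueError("empty trace")
    n = len(trace)
    return {
        "mean": mean(trace),
        "sd": pstdev(trace),
        # Fraction of samples below / above the assumed thresholds.
        "frac_sedentary": sum(v < sedentary_below for v in trace) / n,
        "frac_vigorous": sum(v > vigorous_above for v in trace) / n,
    }

# Toy "day" with only four samples, to keep the arithmetic easy to follow.
day = [0.01, 0.02, 0.10, 0.30]
features = daily_features(day)
print(features)
```

A real pipeline would apply a function like this to every participant-day and stack the results into one row per day.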
15
Filter:
Accelerometry study?
103,712
Split criteria:
Type 2 Diabetes?
At baseline: 2,755
Through EHR analysis: 1,321
Total: 4,076
Non-Diabetes
99,636
Filter:
EHR data available?
19,852
All UK Biobank participants: 502,664
Filter:
QC on activity traces
3,103
Positives:
T2D vs Norm-0
Physical Impairment analysis
Severe impairment
1,666
No impairment
8,463
A great UG project!
your (biomedical) dataset may not be as big as it looks
T2D vs Norm-1
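The cohort counts on this slide can be sanity-checked with a few lines of arithmetic. The sketch below only uses numbers that appear on the slide; the helper name share is made up for this example.

```python
# Counts copied from the slide.
all_participants = 502_664
accelerometry = 103_712
t2d_baseline = 2_755
t2d_from_ehr = 1_321
t2d_total = 4_076
non_diabetes = 99_636

# The two sources of T2D cases add up to the stated total,
assert t2d_baseline + t2d_from_ehr == t2d_total
# and the rest of the accelerometry cohort is the non-diabetes group.
assert accelerometry - t2d_total == non_diabetes

def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1)

print(share(accelerometry, all_participants))  # accelerometry cohort vs. all participants
print(share(t2d_total, accelerometry))         # T2D cases within the accelerometry cohort
```

Only about a fifth of participants have activity traces, and only a few percent of those are T2D cases, which is the point of "your dataset may not be as big as it looks".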
16
(some) results
Features:    HLAF            SDL             HLAF+SDL
Negatives:   Norm-0  Norm-2  Norm-0  Norm-2  Norm-0  Norm-2
RF           .80     .68     .83     .78     .86     .77
LR           .79     .70     .83     .78     .86     .78
XGB          .78     .66     .80     .74     .85     .75
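To compare feature sets across classifiers, the scores above can be loaded into a small lookup structure. This is only a reading aid: the slide does not name the metric, and the function name best_feature_set is an assumption of this sketch.

```python
# Scores copied from the slide; keys are (feature set, negative group).
scores = {
    "RF":  {("HLAF", "Norm-0"): .80, ("HLAF", "Norm-2"): .68,
            ("SDL", "Norm-0"): .83, ("SDL", "Norm-2"): .78,
            ("HLAF+SDL", "Norm-0"): .86, ("HLAF+SDL", "Norm-2"): .77},
    "LR":  {("HLAF", "Norm-0"): .79, ("HLAF", "Norm-2"): .70,
            ("SDL", "Norm-0"): .83, ("SDL", "Norm-2"): .78,
            ("HLAF+SDL", "Norm-0"): .86, ("HLAF+SDL", "Norm-2"): .78},
    "XGB": {("HLAF", "Norm-0"): .78, ("HLAF", "Norm-2"): .66,
            ("SDL", "Norm-0"): .80, ("SDL", "Norm-2"): .74,
            ("HLAF+SDL", "Norm-0"): .85, ("HLAF+SDL", "Norm-2"): .75},
}

def best_feature_set(model, negatives):
    """Feature set with the highest score for one model and negative group."""
    candidates = {fs: s for (fs, neg), s in scores[model].items() if neg == negatives}
    return max(candidates, key=candidates.get)

for model in scores:
    print(model, best_feature_set(model, "Norm-0"), best_feature_set(model, "Norm-2"))
```

Against Norm-0 the combined feature set scores highest for every classifier, while against Norm-2 the picture is mixed.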
17
Ongoing work
Are there better embedded representations for accelerometry data?
Can they be used as predictors for other outcomes?
Representation learning
Embedded
feature space
LSTM Autoencoder
Outcome:
Insulin sensitivity
DIRECT
DB
Standard classification
19
D2P4 → COVID
D. Ferrari (1), Prof. F. Mandreoli (1), Prof. G. Guaraldi (2)
Prof. P. Missier
Predicting respiratory failure in patients with COVID-19
pneumonia: a case study from Northern Italy
Peak of Italian Covid crisis (March 2020 onwards)
Issue: ICU Capacity
Question: will my next patient require ICU resources? How soon?
(1)
(2)
Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges,
strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
21
Study structure
Applied Machine Learning driven by a clinical question
An example of typical data science pattern:
• Data selection → inclusion, exclusion criteria
• Data preparation / cleaning
• Variable selection
• Model learning → multiple models
• Model evaluation
With additional challenges:
“Live” evolving dataset with multiple versions of a patient database
• changes in recording practices
• Inconsistencies
• Lots of missing data
Small data: 198 patients → 1,068 observations → 31-90 variables (symptoms, lab biomarkers)
In the data collection period, the dataset was growing daily with the average of 84 new records per day, with a mean of 10 new data points/patient.
Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed to generate 1,068 valuable observations. In detail, 603 observations contributed to the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not meet this definition.
Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, and others were recorded upon clinical indications.
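The outcome label used in the study follows directly from the PaO2/FiO2 threshold quoted above. A minimal sketch of that labelling step is shown below; the function names and the three example observations are invented for illustration.

```python
RESP_FAILURE_THRESHOLD = 150  # mmHg, the threshold quoted on the slide

def pf_ratio(pao2_mmhg, fio2_fraction):
    """PaO2/FiO2 ratio; FiO2 is a fraction from 0.21 (room air) to 1.0."""
    if not 0.21 <= fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be a fraction in [0.21, 1.0]")
    return pao2_mmhg / fio2_fraction

def is_respiratory_failure(pao2_mmhg, fio2_fraction):
    """True when the observation meets the respiratory failure definition."""
    return pf_ratio(pao2_mmhg, fio2_fraction) < RESP_FAILURE_THRESHOLD

# Hypothetical observations: (PaO2 in mmHg, FiO2 as a fraction).
observations = [(90, 0.21), (60, 0.50), (75, 0.40)]
labels = [is_respiratory_failure(p, f) for p, f in observations]
print(labels)

# The class counts reported on the slide add up to the observation total.
assert 603 + 465 == 1068
```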
22
A case study to illustrate the problem
24
Modelling Requirements
• Parsimonious → few variables
• Robust to missing data → imputation not an option
• Explainable → Trust
• model reveals the relative importance of each variable for each prediction it
makes
• Minimize the number of false negatives
• risk of under-estimating the severity of a patient’s condition
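To make the false-negative requirement concrete, the sketch below computes the false negative rate and shows how it moves with the decision threshold. Lowering the threshold is a generic technique used here for illustration only, not the approach taken in the study; the labels and probabilities are made up.

```python
def false_negative_rate(y_true, y_pred):
    """FN / (FN + TP): the share of true positive cases the model misses."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if fn + tp == 0:
        raise ValueError("no positive cases in y_true")
    return fn / (fn + tp)

def predict(probabilities, threshold):
    """Turn predicted probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up example: three patients who deteriorate, three who do not.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]

fnr_default = false_negative_rate(y_true, predict(probs, 0.5))
fnr_lowered = false_negative_rate(y_true, predict(probs, 0.3))
print(fnr_default, fnr_lowered)
```

With the lower threshold no deteriorating patient is missed, at the price of one extra false alarm.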
26
Approach
• Parsimonious → feature ranking and selection
• Robust to missing data
• Explainable → Shapley values
• Minimize FN → bespoke loss function
Ensemble of Decision trees
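The slide does not spell out the bespoke loss function. One common way to penalise false negatives is a class-weighted log loss, sketched below under that assumption; the weight of 5 is arbitrary and the function name is made up.

```python
import math

def weighted_log_loss(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Binary cross-entropy where errors on positive cases cost fn_weight times more.

    This is an assumed stand-in for the study's bespoke loss,
    not a reconstruction of it.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(y_true)

y = [1, 0]
missed_positive = weighted_log_loss(y, [0.1, 0.1])  # the positive case is scored low
false_alarm = weighted_log_loss(y, [0.9, 0.9])      # the negative case is scored high
print(missed_positive, false_alarm)
```

With fn_weight=1 the two mistakes cost the same; with a larger weight a model trained on this loss is pushed towards over-calling rather than under-calling severity.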
27
Testing multiple models - Results
Parsimony:
Model 1 - suboptimal prediction accuracy
Model 2:
Adding biomarkers including respiratory variables increased performance
Model 3:
boosted mixed model - still requires about 20 variables
From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice.
What our approach offers in support to the decision-making process is a simple interpretation of the predictions.
28
Which are the most important predictors?
SHAP (Shapley) values
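For intuition, Shapley values can be computed exactly for a tiny model by averaging each variable's marginal contribution over all orderings. The toy risk score below, including its variable names and weights, is invented for this sketch; practical tools approximate this computation for real models.

```python
from itertools import permutations

def shapley_values(value, players):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    contrib = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            contrib[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: c / len(orderings) for p, c in contrib.items()}

def risk(present):
    """Hypothetical risk score over a set of abnormal findings."""
    score = 0.0
    if "low_spo2" in present:
        score += 0.4
    if "high_crp" in present:
        score += 0.2
    if "low_spo2" in present and "high_crp" in present:
        score += 0.2  # interaction term, split equally between the two
    return score

phi = shapley_values(risk, ["low_spo2", "high_crp", "age"])
print(phi)
```

The values sum to the difference between the full score and the empty score, which is what lets a clinician read them as a per-prediction breakdown.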
29
Summary
Good results on “live” data, predicting a useful outcome for the purpose of ICU management
Major selling points:
• Variables (relatively) easy to collect in routine visits and in-hospital
• Models are explainable: medics can reality-check them against their own understanding
… Opened the door to further collaborations:
New project on PACS: Post-Acute Covid Syndrome:
Following up recovery paths for 300 patients across 5 hospitals
30
D2P4 → EHR analysis for dynamic risk prediction
Survival analysis
Longitudinal prediction models
31
Longitudinal data: Health-related events
https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270
UK Biobank - Primary Care Linked Data
32
Clinical Risk Prediction Models
Healthy participant or missing data/under-reported conditions?
Number/pattern of records is a proxy for health?
Informed presence bias
Individuals in EHR data are systematically different to those who are not (Goldstein et al, 2016)
36
Case study: Type 2 Diabetes
[Figure 17: Example output of the phenotyping tool. Panels: glycated hemoglobin HbA1c (mmol/mol), with pre-diabetes, diabetes and remission bands; fasting plasma glucose (mmol/l), with normoglycaemic, pre-diabetic and diabetic bands; electronic health record events (Obs, Drug, Diag, Op) in primary and secondary care, 1987 to 2015, with the estimated observation period and diabetes records marked. Points come from primary care records and UK Biobank visits. Participant: REDACTED.]
37
Case study: Type 2 Diabetes – remission study
Type 2 diabetes remission
Longitudinal phenotyping with large–scale observational data
Philip Darke
EPSRC Centre for Doctoral
Training in Cloud Computing for
Big Data Newcastle University
UK Biobank (ukbiobank.ac.uk) is a UK-based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort with nearly 70,000 diagnoses [2] expected by 2027. Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes as illustrated in figure 1. The approach will be expanded to all participants when further data is released.
[2] Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123-126, September 2012. doi:10.1016/j.hlpt.2012.07.003
Figure 1: Model output showing HbA1c (mmol/mol), weight (kg), periods of medication (biguanides) and inferred diabetic status (pre-diabetes, type 2 diabetes, remission) for an example participant, 2004 to 2016. Long-term remission was achieved by sustained weight loss post diagnosis.
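A heavily simplified version of this kind of longitudinal phenotyping can be sketched as a rule over dated HbA1c readings and medication periods. The cut-offs used here (48 mmol/mol for diabetes, 42 to 47 for pre-diabetes) are common clinical conventions assumed for the example, and the rules ignore how long a state must be sustained; this is not the logic of the phenotyping tool itself.

```python
def glycaemic_status(hba1c_mmol_mol):
    """Classify one HbA1c reading using assumed cut-offs."""
    if hba1c_mmol_mol >= 48:
        return "diabetes"
    if hba1c_mmol_mol >= 42:
        return "pre-diabetes"
    return "normoglycaemic"

def phenotype(readings, medicated_years):
    """Label each (year, HbA1c) reading, tracking status after diagnosis.

    After a diabetes-range reading, a sub-threshold reading counts as
    remission only if the participant is off medication that year.
    """
    diagnosed = False
    timeline = []
    for year, hba1c in readings:
        status = glycaemic_status(hba1c)
        if status == "diabetes":
            diagnosed = True
        elif diagnosed:
            status = "diabetes (treated)" if year in medicated_years else "remission"
        timeline.append((year, status))
    return timeline

# Made-up participant: diagnosed in 2006, medicated until 2008.
readings = [(2004, 44), (2006, 55), (2008, 46), (2012, 40)]
timeline = phenotype(readings, medicated_years={2006, 2008})
print(timeline)
```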
Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long-term remission and cease anti-diabetes medication. This project will examine the pathways to remission at scale using ob-
38
D2P4 → MLTC-M
NLP
39
Multimorbidity and Long-Term Conditions
Patients with multimorbidities have the greatest healthcare needs and generate the
highest expenditure in the health system.
There is an increasing focus on identifying specific disease combinations for
addressing poor outcomes.
Matrix factorization / factor analysis
Clustering
Multiple correspondence analysis
Network analysis
…
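As a starting point for the network-analysis option listed above, disease pairs can be counted across patients to form the weighted edges of a co-occurrence graph. The patient records below are invented, and the function name is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(patients):
    """Count how many patients have each unordered pair of conditions."""
    pair_counts = Counter()
    for conditions in patients:
        # sorted(set(...)) removes repeats and gives each pair a stable order
        for a, b in combinations(sorted(set(conditions)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical condition lists, one per patient.
patients = [
    ["T2D", "hypertension", "CKD"],
    ["T2D", "hypertension"],
    ["hypertension", "depression"],
    ["T2D"],
]
pairs = cooccurrence(patients)
print(pairs.most_common(1))
```

The resulting counts can feed a graph library or a clustering step; with fragmented data sources, the hard part remains assembling the per-patient condition lists in the first place.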
Which data?
Fragmented / disconnected data sources
→ Data access
→ Data governance
40
D2P4 → NAFLD / non-alcoholic fatty liver disease
41
D2P4 → NAFLD / NASH
NASH = non-alcoholic steatohepatitis
Aims:
- Integrate cross-sectional and longitudinal clinical outcome data with
a multi-dimensional ‘omics’ record
- Hypothesis: a precision medicine approach leads to better
understanding of individuals’ trajectories
- Personalised biomarkers → liquid biopsy
Dataset: European NAFLD Registry
7,750 patients with histologically proven NAFLD/NASH
- Omics (cross-sectional)
- Longitudinal follow ups
Methods:
- Precision: clustering
- Anticipating progression: Learn cluster-specific longitudinal models
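The two-step method above (cluster first, then learn one longitudinal model per cluster) can be illustrated with a toy example. The sketch assumes cluster labels are already assigned and uses a straight-line fit as a stand-in for whatever longitudinal model is eventually chosen; the records are made up.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x

def cluster_trends(records):
    """records: (cluster_id, years_since_baseline, marker_value) tuples."""
    by_cluster = defaultdict(list)
    for cluster_id, t, value in records:
        by_cluster[cluster_id].append((t, value))
    return {c: fit_line(pts) for c, pts in by_cluster.items()}

# Hypothetical follow-up data for two patient clusters.
records = [
    ("A", 0, 1.0), ("A", 1, 1.5), ("A", 2, 2.0),  # marker rising over time
    ("B", 0, 1.0), ("B", 1, 1.0), ("B", 2, 1.0),  # marker stable
]
trends = cluster_trends(records)
print(trends)
```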
42
DP4DS: Data Provenance for Data Science
D2P4
+
DP4DS(*)
43
Data → Model → Predictions
Model
pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
Data
collection
Instances
Key decisions are made during data selection and
processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability
1. Can we explain these decisions?
2. Are these explanations useful?
44
Explaining data preparation
Data
collection
Model
Population data pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused /
repurposed?
- What is their quality?
Instances
- Scripts → Python / TensorFlow, Pandas, Spark
- Workflows → Knime, …
Provenance → Transparency
46
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
- How much does it cost? → provenance volume
- Does it help? → queries against the provenance database
[1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier,
P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
47
Operators
op
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (eg one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation → adding columns
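One-hot encoding is a convenient example of vertical augmentation: it adds columns, and each new column derives from a known source column. The sketch below returns that column-level derivation alongside the data. The function and the row format (a list of dicts) are assumptions of this example, not the representation used in the paper.

```python
def one_hot(rows, column):
    """Replace `column` by one 0/1 column per category.

    Returns the new rows plus a derivation map {new_column: source_column}.
    """
    categories = sorted({row[column] for row in rows})
    new_columns = {f"{column}={c}": c for c in categories}
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for name, category in new_columns.items():
            new_row[name] = 1 if row[column] == category else 0
        encoded.append(new_row)
    derived_from = {name: column for name in new_columns}
    return encoded, derived_from

rows = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}]
encoded, derived_from = one_hot(rows, "sex")
print(encoded)
print(derived_from)
```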
48
Code instrumentation
Create a provlet for
a specific
transformation
Initialize provenance
capture
…code injection is now being automated!
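To give a flavour of code instrumentation, the sketch below wraps a preprocessing function so that each call appends a record to a provenance log. It captures only coarse, operator-level information, whereas the work described here collects fine-grained provenance; the decorator, the log and the example operator are all hypothetical names.

```python
import functools

PROVENANCE_LOG = []  # hypothetical in-memory provenance store

def capture_provenance(operation_name):
    """Decorator that records what a row-based operator consumed and produced."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            result = func(rows, *args, **kwargs)
            PROVENANCE_LOG.append({
                "operation": operation_name,
                "function": func.__name__,
                "rows_in": len(rows),
                "rows_out": len(result),
                "columns_in": sorted({k for r in rows for k in r}),
                "columns_out": sorted({k for r in result for k in r}),
            })
            return result
        return wrapper
    return decorator

@capture_provenance("instance selection")
def drop_missing(rows, column):
    """Keep only rows where `column` has a value."""
    return [r for r in rows if r.get(column) is not None]

data = [{"id": 1, "hba1c": 41}, {"id": 2, "hba1c": None}]
clean = drop_missing(data, "hba1c")
print(PROVENANCE_LOG[0])
```

The pipeline code itself stays unchanged apart from the decorator line, which is what makes automated injection plausible.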
49
Provenance patterns
50
Provenance templates
Template + binding rules = instantiated provenance fragment
op
{old values: F, I, V} → {new values: F’, J, V’} +
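The idea of "template + binding rules = instantiated provenance fragment" can be sketched as simple variable substitution over a list of triples. The relation names follow W3C PROV; the "var:" prefix, the identifiers and the function name are conventions made up for this example.

```python
# A template is a list of (subject, relation, object) triples with variables.
TEMPLATE = [
    ("var:new_value", "wasGeneratedBy", "var:op"),
    ("var:op", "used", "var:old_value"),
    ("var:new_value", "wasDerivedFrom", "var:old_value"),
]

def instantiate(template, bindings):
    """Replace every 'var:<name>' term with its bound identifier."""
    def resolve(term):
        if term.startswith("var:"):
            return bindings[term[4:]]  # KeyError signals an unbound variable
        return term
    return [tuple(resolve(term) for term in triple) for triple in template]

# Hypothetical binding for one imputed cell.
bindings = {
    "op": "op:impute_1",
    "old_value": "cell:hba1c/row7/v0",
    "new_value": "cell:hba1c/row7/v1",
}
fragment = instantiate(TEMPLATE, bindings)
for triple in fragment:
    print(triple)
```

Each operator type would contribute its own template, and running the pipeline generates one set of bindings per affected data element.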
51
This applies to all operators…
52
Putting it all together
53
Evaluation - performance
54
Evaluation: Provenance capture and query times
55
Scalability
56
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? → what is the benefit to data analysts?
Work in progress! Interest? Ideas?
57
Acknowledgments
Prof. Mike Catt
PhD Students: Ben Lam, Philip Darke
MSc student: Sam Butterfield
Prof. Guaraldi
Prof. Mandreoli
MSc student: Davide Ferrari
Prof. Torlone
MSc student: Giulia Simonelli
Prof. Chapman

Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's

  • 1.
    Data Science for(Health) Science: tales from a challenging front line, and how to cross a few T's Paolo Missier School of Computing Newcastle University, UK March 2021 A talk given to The School of Information Sciences Center for Informatics Research in Science and Scholarship University of Illinois Urbana-Champaign paolo.missier@ncl.ac.uk LinkedIn: paolomissier Twitter: @PMissier
  • 2.
    2 The message: 1. “DataScience” for Health is hard. The hard part is the data 2. “AI for Health” is (Deep) Machine Learning 3. Ethics. Fairness. Trust. Acceptance. 4. Data Provenance for Data Science: Solution or distraction? • Transparency • Trustworthiness • Traceability
  • 3.
  • 4.
    4 AI for healthcare– the UK landscape https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences AI and data science will improve the detection, diagnosis, and treatment of illness. They will optimise the provision of services, and support health service providers to anticipate demand and deliver improved patient care. • Explainability / Interpretability • Exploiting EHR (Electronic Health Records) • Image interpretation • Fairness, Bias • Ethical issues in … • Predicting <disease / critical event> …
  • 5.
    5 Personalised, Predictive, Preventive,Participatory Medicine (P4) Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat Biotechnol. 2017;35:747.
  • 6.
    6 (*) Data-Driven, Personalised,Predictive, Preventive, Participatory D2P4 (*) Healthcare research • Cleaning • Integration • Alignment • Imputation • NLP • … Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”… (plus traditional statistics!)
  • 7.
    7 Big Data forHealth Care Genomics for personalized medicine personal monitors / wearables Medical Records Article Source: Big Data: Astronomical or Genomical? Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
  • 8.
    9 D2P4  Accelerometry PhysicalActivity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy • Cleaning • Integration • Alignment • Imputation • NLP • … Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML)
  • 9.
    10 Digital biomarkers Digital biomarkerscome from "novel sensing systems capable of continuously tracking behavioral signals […] capture people's everyday routines, actions, and physiological changes that can explain outcomes related to health, cognitive abilities, and more” (Choudhury 2018). Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156. https://doi.org/10.1145/3266285 - physical activity - glucose levels - blood oxygen levels - … Inexpensive  scalable personalised self-monitoring
  • 10.
    11 A first project:markers from accelerometers? Initial study Digital biomarkers + UK Biobank Dataset + Type 2 Diabetes outcome - physical activity - glucose levels - blood oxygen levels - … Aligned with the P4 agenda Readily available dataset (+) 3,500+ features (+) multi-omics coverage (+) genomics (+) links to EHR (+) Activity monitors made in Newcastle! (-) Limited follow-ups – little longitude (-) Population not random (-) Activity data / person very limited 100K Activity traces
  • 11.
    13 Using wearable activitytrackers to predict Type-2 Diabetes Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations. Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer Cohort -- JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press) Feature extraction Clustering Classification ??
  • 12.
    14 Granular activity representation featureextraction 60 features / day
  • 13.
    15 Filter: Accelerometry study? 103,712 Split criteria: Type2 Diabetes? At baseline: 2,755 Through EHR analysis: 1,321 Total: 4,076 Non-Diabetes 99,636 Filter: EHR data available? 19,852 502, 664 All UK Biobank participants: Filter: QC on activity traces 3,103 Positives: T2D vs Norm-0 Physical Impairment analysis Severe impairment 1,666 No impairment 8,463 A great UG project! your (biomedical) dataset may not be as big as it looks T2D vs Norm-1
  • 14.
    16 (some) results Negatives: HLAFSDL HLAF+SDL Norm-0 Norm-2 Norm-0 Norm-2 Norm-0 Norm-2 RF .80 .68 .83 .78 .86 .77 LR .79 .70 .83 .78 .86 .78 XGB .78 .66 .80 .74 .85 .75
  • 15.
    17 Ongoing work Are therebetter embedded representations for acceleremetry data? Can they be used as predictors for other outcomes? Representation learning Embedded feature space LSTM Autoencoder Outcome: Insulin sensitivity DIRECT DB Standard classification
  • 16.
    19 D2P4  COVID PhysicalActivity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML)
  • 17.
    D. Ferrari1, Prof.F. Mandreoli1, Prof. G. Guaraldi2 Prof. P. Missier Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from Northern Italy Peak of Italian Covid crisis (March 2020 onwards) Issue: ICU Capacity Question: will my next patient require ICU resources? How soon? (1) (2) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health emergency Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
  • 18.
    21 Study structure Applied MachineLearning driven by a clinical question An example of typical data science pattern: • Data selection  inclusion, exclusion criteria • Data preparation / cleaning • Variable selection • Model learning  multiple models • Model evaluation With additional challenges: “Live” evolving dataset with multiple versions of a patients database • changes in recording practices • Inconsistencies • Lots of missing data Small data: 198 patients  1068 observations  31-90 variables (symptoms, lab biomarkers) In the data collection period, the dataset was growing daily with the average of 84 new records per day, with a mean of 10 new data points/patient. out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed to generate 1068 valuable observations. In detail, 603 observations contributed to the definition of respiratory failure (PaO2/ FiO2 < 150 mmHg) and 465 did not meet this definition. Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomark- ers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a sub- set of 119 patients). Some variables were collected daily, and others were recorded upon clinical indications.
  • 19.
    22 A case studyto illustrate the problem
  • 20.
    24 Modelling Requirements • Parsimonious few variables • Robust to missing data  imputation not an option • Explainable  Trust • model reveals the relative importance of each variable for each prediction it makes • Minimize the number of false negatives • risk of under-estimating the severity of a patient’s condition
  • 21.
    26 Approach • Parsimonious feature ranking and selection • Robust to missing data • Explainable  Shapley values • Minimize FN  bespoke loss function Ensemble of Decision trees
  • 22.
    27 Testing multiple models- Results Parsimony: Model 1 - suboptimal prediction accuracy Model 2: Adding biomarkers including respiratory variables increased performance Model 3: boosted mixed model - still requires about 20 variables From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice. What our approach offers in support to the decision-making process is a simple interpretation of the predictions.
  • 23.
    28 Which are themost important predictors? Shap values
  • 24.
    29 Summary Good results on“live” data, predicting a useful outcome for the purpose of ICU management Major selling points: • Variables (relatively) easy to collect in routine visits and in-hospital • Models are explainable, medics can reality-check against their own understanding … Opened the door to further collaborations: New project on PACS: Post-Acute Covid Syndrome: Following up recovery paths for 300 patients across 5 hospitals
  • 25.
    30 D2P4  EHRanalysis for dynamic risk prediction D2P4 (*) Healthcare research Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Survival analysis Longitudinal prediction models
  • 26.
    31 Longitudinal data: Health-relatedevents https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270 UK Biobank - Primary Care Linked Data
  • 27.
    32 Clinical Risk PredictionModels Healthy participant or missing data/under- reported conditions? Number/pattern of records is a proxy for health? Informed presence bias Individuals in EHR data are systematically different to those who are not (Goldstein et al, 2016)
  • 28.
    36 Case study: Type2 Diabetes ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 40 50 60 70 80 Pre⌧diabetes D iabetes R em ission G lycated hem oglobin HbA 1c (m m ol/m ol) Participant: R ED A C T ED ● ● ● ● 4 6 8 10 12 ● Prim ary care records U K B B visit ● ● ● N orm oglycaem ic Pre⌧diabetic D iabetic Fasting plasm a glucose (m m ol/l) P r im a r y ca r e Se con d a r y ca r e E v e n t O b s D r u g D ia g O p 1987 (age X ) 1991 (age X ) 1995 (age X ) 1999 (age X ) 2003 (age X ) 2007 (age X ) 2011 (age X ) 2015 (age X ) Estim ated observation period R ecord D iabetes record Electronic health records Figure 17: Example output of the phenotyping tool. 39
  • 29.
    37 Case study: Type 2 Diabetes – remission study
    Type 2 diabetes remission: Longitudinal phenotyping with large–scale observational data. Philip Darke, EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University.
    UK Biobank (ukbiobank.ac.uk) is a UK–based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort with nearly 70,000 diagnoses expected by 2027 (Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, September 2012. doi:10.1016/j.hlpt.2012.07.003). Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes as illustrated in figure 1. The approach will be expanded to all participants when further data is released.
    [Figure 1: Model output showing HbA1c (mmol/mol), weight (kg), periods of medication (biguanides) and inferred diabetic status (pre-diabetes, type 2 diabetes, remission) for an example participant, 2004 to 2016. Long–term remission was achieved by sustained weight loss post diagnosis.]
    Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long–term remission and cease anti–diabetes medication. This project will examine the pathways to remission at scale using ob-
  • 30.
    38 D2P4 → MLTC-M Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multiple Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: Python, R, … Workflows: KNIME, RapidMiner, … Methods: Clustering (ML) Predictive modelling (ML) NLP
  • 31.
    39 Multimorbidity and Long-Term Conditions Patients with multimorbidities have the greatest healthcare needs and generate the highest expenditure in the health system. There is an increasing focus on identifying specific disease combinations for addressing poor outcomes. Matrix factorization / factor analysis Clustering Multiple correspondence analysis Network analysis … Which data? Fragmented / disconnected data sources → Data access → Data governance
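Clustering and network analysis of disease combinations typically start from some measure of how often conditions occur together. A minimal sketch of that first step follows, assuming a made-up mapping from patients to their long-term conditions; the data and the `cooccurrence_counts` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patients and their long-term conditions (made-up data).
PATIENT_CONDITIONS = {
    "p1": {"diabetes", "hypertension", "ckd"},
    "p2": {"diabetes", "hypertension"},
    "p3": {"copd", "hypertension"},
    "p4": {"diabetes"},
}


def cooccurrence_counts(patient_conditions):
    """Count how many patients have each unordered pair of conditions."""
    pairs = Counter()
    for conditions in patient_conditions.values():
        # Sorting gives each pair one canonical (alphabetical) orientation.
        for a, b in combinations(sorted(conditions), 2):
            pairs[(a, b)] += 1
    return pairs


pair_counts = cooccurrence_counts(PATIENT_CONDITIONS)
most_common_pair, n_patients = pair_counts.most_common(1)[0]
```

The resulting pair counts can serve as edge weights in a disease network, or as the basis of a similarity matrix for clustering. With fragmented data sources, a patient's condition set may be incomplete, so such counts are likely to be underestimates.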
  • 32.
    40 D2P4 → NAFLD / non-alcoholic fatty liver disease Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy • Cleaning • Integration • Alignment • Imputation • NLP • … Programming: Scripting: Python, R, … Workflows: KNIME, RapidMiner, … Methods: Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”…
  • 33.
    41 D2P4 → NAFLD / NASH NASH = non-alcoholic steatohepatitis Aims: - integrate cross-sectional and longitudinal clinical outcomes data with a multi-dimensional ‘omics’ record - Hypothesis: a precision medicine approach leads to better understanding of individuals’ trajectories - Personalised biomarkers → liquid biopsy Dataset: European NAFLD Registry, 7,750 patients with histologically proven NAFLD/NASH - Omics (cross-sectional) - Longitudinal follow-ups Methods: - Precision: clustering - Anticipating progression: learn cluster-specific longitudinal models
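The two-step method on this slide (cluster first, then learn one longitudinal model per cluster) can be sketched as follows. This is a deliberately simplified stand-in: cluster labels are assumed to be given by an earlier step, the marker values are made up, and a pooled least-squares trend line replaces whatever longitudinal model the project actually uses, which the slide does not specify.

```python
from collections import defaultdict

# Hypothetical follow-up data: (patient_id, years_since_baseline, marker_value).
FOLLOW_UPS = [
    ("a", 0, 1.0), ("a", 1, 1.5), ("a", 2, 2.0),
    ("b", 0, 1.2), ("b", 2, 2.2),
    ("c", 0, 3.0), ("c", 1, 3.0), ("c", 2, 3.0),
]
# Cluster labels are assumed to come from an earlier clustering step.
CLUSTER_OF = {"a": "fast", "b": "fast", "c": "stable"}


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def fit_per_cluster(follow_ups, cluster_of):
    """Pool each cluster's observations and fit one trend line per cluster."""
    grouped = defaultdict(list)
    for pid, t, y in follow_ups:
        grouped[cluster_of[pid]].append((t, y))
    return {cluster: fit_line(points) for cluster, points in grouped.items()}


models = fit_per_cluster(FOLLOW_UPS, CLUSTER_OF)  # cluster -> (intercept, slope)
```

Pooling ignores within-patient correlation, so a real analysis would more likely use a mixed-effects or similar longitudinal model; the sketch only shows where the cluster assignment enters.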
  • 34.
    42 DP4DS: Data Provenance for Data Science D2P4 + DP4DS (*) Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: Python, R, … Workflows: KNIME, RapidMiner, … Methods: Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”… (plus traditional statistics!)
  • 35.
    43 Data  Model Predictions Model pre-processing Raw datasets features Predicted you: - Ranking - Score - Class Data collection Instances Key decisions are made during data selection and processing: - Where does the data come from? - What’s in the dataset? - What transformations were applied? Complementing current ML approaches to model interpretability 1. Can we explain these decisions? 2. Are these explanations useful?
  • 36.
    44 Explaining data preparation Data collection Model Population data pre-processing Raw datasets features Predicted you: - Ranking - Score - Class - Integration - Cleaning - Outlier removal - Normalisation - Feature selection - Class rebalancing - Sampling - Stratification - … Data acquisition and wrangling: - How were datasets acquired? - How recently? - For what purpose? - Are they being reused / repurposed? - What is their quality? Instances - Scripts → Python / TensorFlow, Pandas, Spark - Workflows → KNIME, … Provenance → Transparency
  • 37.
    46 Recent early results A small grassroots project… [1] - Formalisation of provenance patterns for pipeline operators - Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines - Reality check: - How much does it cost? → provenance volume - Does it help? → queries against the provenance database [1] Chapman, A., Missier, P., Simonelli, G., & Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.
  • 38.
    47 Operators: Data reduction - Feature selection - Instance selection; Data augmentation - Space transformation - Instance generation - Encoding (e.g. one-hot…); Data transformation - Data repair - Binarisation - Normalisation - Discretisation - Imputation. Ex.: vertical augmentation → adding columns
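Vertical augmentation, the example operator on this slide, adds new columns whose values are computed row by row from existing features (editor's note #48 describes this as applying a function f to the values of the X features). The sketch below shows one way such an operator could record fine-grained provenance alongside the data. The function name, the record layout and the BMI example are assumptions for illustration and do not reproduce the implementation described in [1].

```python
def vertical_augmentation(rows, input_features, new_feature, f):
    """Add column `new_feature` = f(values of input_features) to every row.

    Returns the augmented rows plus one provenance record per new cell,
    listing the (row index, feature) cells it was derived from.
    """
    augmented, provenance = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the input rows stay untouched
        new_row[new_feature] = f(*(row[name] for name in input_features))
        augmented.append(new_row)
        provenance.append({
            "generated": (i, new_feature),
            "used": [(i, name) for name in input_features],
            "operator": "vertical_augmentation",
        })
    return augmented, provenance


# Made-up example: derive BMI from weight and height.
ROWS = [
    {"weight_kg": 80.0, "height_m": 2.0},
    {"weight_kg": 45.0, "height_m": 1.5},
]
augmented, provenance = vertical_augmentation(
    ROWS, ["weight_kg", "height_m"], "bmi", lambda w, h: w / h ** 2
)
```

One provenance record per generated cell is what makes the "how much does it cost?" question pressing: the number of records grows with rows times new columns.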
  • 39.
    48 Code instrumentation: Create a provlet for a specific transformation. Initialize provenance capture. …code injection is now being automated!
  • 40.
  • 41.
    50 Provenance templates: Template + binding rules = instantiated provenance fragment. op: {old values: F, I, V} → {new values: F’, J, V’}
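The "template + binding rules" idea can be sketched as simple variable substitution: a template holds provenance statements with placeholders, and each binding fills them in for one transformed data item. In the sketch below the relation names `wasDerivedFrom`, `used` and `wasGeneratedBy` come from the W3C PROV vocabulary. Everything else is hypothetical: the `var:` convention, the `instantiate` and `cell_id` helpers, and the reading of F and I as feature name and instance index.

```python
# A template is a list of PROV-style statements whose arguments may be
# variables (strings starting with "var:"). Layout is illustrative only.
TEMPLATE = [
    ("wasDerivedFrom", "var:new_cell", "var:old_cell"),
    ("used", "var:op", "var:old_cell"),
    ("wasGeneratedBy", "var:new_cell", "var:op"),
]


def instantiate(template, binding):
    """Replace every bound variable in the template with its value."""
    fragment = []
    for relation, *args in template:
        fragment.append((relation, *(binding.get(a, a) for a in args)))
    return fragment


def cell_id(feature, index):
    """Identify one data cell by feature name and instance index."""
    return f"cell:{feature}/{index}"


# One binding per transformed cell: old cell (F, I) -> new cell (F', J).
BINDINGS = [
    {"var:op": "op:normalise",
     "var:old_cell": cell_id("age", 0),
     "var:new_cell": cell_id("age_norm", 0)},
    {"var:op": "op:normalise",
     "var:old_cell": cell_id("age", 1),
     "var:new_cell": cell_id("age_norm", 1)},
]

fragments = [instantiate(TEMPLATE, b) for b in BINDINGS]
```

Because the template is fixed per operator type, only the bindings need to be generated at run time, which is what makes systematic capture across many operators plausible.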
  • 42.
    51 This applies to all operators…
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
    56 Summary Multiple hypotheses regarding Data Provenance for Data Science: 1. Is it practical to collect fine-grained provenance? (a) To what extent can it be done automatically? (b) How much does it cost? 2. Is it also useful? → What is the benefit to data analysts? Work in progress! Interest? Ideas?
  • 48.
    57 Acknowledgments Prof. Mike Catt PhD Students: Ben Lam, Philip Darke MSc student: Sam Butterfield Prof. Guaraldi Prof. Mandreoli MSc student: Davide Ferrari Prof. Torlone MSc student: Giulia Simonelli Prof. Chapman

Editor's Notes

  • #13 CVD was the leading cause of death for males (15.5%) and the second for females (8.8%) in 2015 (*)
  • #44 How about the data used to train / build the model?
  • #45 Relatively easy to keep track of data pre-processing → provenance
  • #48 Features $X = [\mathbf{a}_1 \ldots \mathbf{a}_k]$; new features $Y = [\mathbf{a}'_1 \ldots \mathbf{a}'_l]$. New values for each row are obtained by applying $f$ to the values in the $X$ features.