Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's

Paolo Missier
School of Computing
Newcastle University, UK
March 2021
A talk given to
The School of Information Sciences
Center for Informatics Research in Science and Scholarship
University of Illinois Urbana-Champaign
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier

2
The message:
1. “Data Science” for Health is hard. The hard part is the data
2. “AI for Health” is (Deep) Machine Learning
3. Ethics. Fairness. Trust. Acceptance.
4. Data Provenance for Data Science: Solution or distraction?
• Transparency
• Trustworthiness
• Traceability

3
A Grand Challenge
https://epsrc.ukri.org/research/ourportfolio/themes/healthcaretechnologies/strategy/grandchallenges/

4
AI for healthcare – the UK landscape
https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences
AI and data science will improve the detection, diagnosis, and treatment of
illness. They will optimise the provision of services, and support health service
providers to anticipate demand and deliver improved patient care.
• Explainability / Interpretability
• Exploiting EHR (Electronic Health Records)
• Image interpretation
• Fairness, Bias
• Ethical issues in …
• Predicting <disease / critical event> …

5
Personalised, Predictive, Preventive, Participatory Medicine (P4)
Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds.
Nat Biotechnol. 2017;35:747.

6
D2P4 (*) Healthcare research
(*) Data-Driven, Personalised, Predictive, Preventive, Participatory
• Cleaning
• Integration
• Alignment
• Imputation
• NLP
• …
Physical Activity monitoring (wearables)
In-patient hospital records
Primary care health records + prescriptions
Clinical protocols
Multi-omics (genomics, transcriptomics, proteomics, metabolomics…)
Images -- Histology, X-ray, …
Early detection of Type 2 Diabetes / Metabolic / age-related diseases
Early detection of Parkinson’s
Frailty / intrinsic capacity assessment
Multi Morbidity Long Term Conditions (MLTC)
Covid risk / Post-Acute Covid Syndrome (PACS)
Liver disease progression: NAFLD / NASH
Liquid biopsy
Programming:
Scripting: Python, R, …
Workflows: Knime, RapidMiner, …
Methods
Clustering (ML)
Predictive modelling (ML)
Image interpretation / Deep Learning
… “AI”…
(plus traditional statistics!)

7
Big Data for Health Care
Genomics for personalized medicine
Personal monitors / wearables
Medical Records
Article Source: Big Data: Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195

9
D2P4 → Accelerometry

10
Digital biomarkers
Digital biomarkers come from “novel sensing systems capable of continuously tracking behavioral signals […] capture people's everyday routines, actions, and physiological changes that can explain outcomes related to health, cognitive abilities, and more” (Choudhury 2018).
Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156. https://doi.org/10.1145/3266285
- physical activity
- glucose levels
- blood oxygen levels
- …
Inexpensive → scalable personalised self-monitoring

11
A first project: markers from accelerometers?
Initial study: Digital biomarkers + UK Biobank Dataset + Type 2 Diabetes outcome
- physical activity
- glucose levels
- blood oxygen levels
- …
Aligned with the P4 agenda
Readily available dataset
(+) 3,500+ features
(+) multi-omics coverage
(+) genomics
(+) links to EHR
(+) Activity monitors made in Newcastle!
(-) Limited follow-ups – little longitudinal data
(-) Population not random
(-) Activity data per person very limited
100K activity traces

13
Using wearable activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P
Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer
Cohort -- JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press)
Feature extraction → Clustering → Classification → ??

14
Granular activity representation
Feature extraction: 60 features / day

15
All UK Biobank participants: 502,664
Filter: Accelerometry study? 103,712
Split criteria: Type 2 Diabetes?
- Type 2 Diabetes: at baseline 2,755; through EHR analysis 1,321; total 4,076
- Non-Diabetes: 99,636
Filter: EHR data available? 19,852
Filter: QC on activity traces
Positives: 3,103
Physical Impairment analysis
- Severe impairment: 1,666
- No impairment: 8,463
T2D vs Norm-0
T2D vs Norm-1
A great UG project!
Your (biomedical) dataset may not be as big as it looks
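The remark that a biomedical dataset may not be as big as it looks can be made concrete by tracking what share of the original cohort survives each filter. The following is a minimal sketch using the counts on this slide. The `attrition` helper and the step labels are illustrative, and it assumes that the 3,103 positives are the Type 2 Diabetes cases remaining after QC on the activity traces.

```python
def attrition(steps):
    """Return (label, count, share of the first count) for each cohort step."""
    start = steps[0][1]
    return [(label, n, n / start) for label, n in steps]


# Counts taken from the slide; the labels are paraphrased.
steps = [
    ("All UK Biobank participants", 502_664),
    ("In accelerometry study", 103_712),
    ("Type 2 Diabetes (baseline + EHR analysis)", 2_755 + 1_321),
    ("Positives after QC on activity traces", 3_103),
]

rows = attrition(steps)
for label, n, share in rows:
    # Print each step with its absolute size and its share of the full cohort.
    print(f"{label:<45} {n:>8,} {share:8.2%}")
```

Under these assumptions the usable positive class is well under 1% of the half-million participants the cohort starts with.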

16
(some) results
             HLAF              SDL               HLAF+SDL
Negatives:   Norm-0   Norm-2   Norm-0   Norm-2   Norm-0   Norm-2
RF           .80      .68      .83      .78      .86      .77
LR           .79      .70      .83      .78      .86      .78
XGB          .78      .66      .80      .74      .85      .75

17
Ongoing work
Are there better embedded representations for accelerometry data?
Can they be used as predictors for other outcomes?
Representation learning
Embedded feature space
LSTM Autoencoder
Outcome: Insulin sensitivity
DIRECT DB
Standard classification

19
D2P4 → COVID

D. Ferrari1, Prof. F. Mandreoli1, Prof. G. Guaraldi2
Prof. P. Missier
Predicting respiratory failure in patients with COVID-19
pneumonia: a case study from Northern Italy
Peak of Italian Covid crisis (March 2020 onwards)
Issue: ICU Capacity
Question: will my next patient require ICU resources? How soon?
(1)
(2)
Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges,
strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172

21
Study structure
Applied Machine Learning driven by a clinical question
An example of a typical data science pattern:
• Data selection → inclusion, exclusion criteria
• Data preparation / cleaning
• Variable selection
• Model learning → multiple models
• Model evaluation
With additional challenges:
“Live” evolving dataset with multiple versions of a patient database
• Changes in recording practices
• Inconsistencies
• Lots of missing data
Small data: 198 patients → 1068 observations → 31-90 variables (symptoms, lab biomarkers)
In the data collection period, the dataset was growing daily with an average of 84 new records per day, with a mean of 10 new data points/patient.
Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed to generate 1068 valuable observations. In detail, 603 observations contributed to the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not meet this definition.
Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, and others were recorded upon clinical indications.
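The outcome definition above (respiratory failure when PaO2/FiO2 < 150 mmHg) is simple enough to write down directly. This is a minimal sketch of how an observation could be labelled; the function names and the example readings are made up for illustration, and it assumes FiO2 is recorded as a fraction between 0 and 1.

```python
RESPIRATORY_FAILURE_THRESHOLD = 150.0  # mmHg, the cut-off stated on the slide


def pf_ratio(pao2_mmhg, fio2_fraction):
    """PaO2/FiO2 ratio, with FiO2 given as a fraction (0.21 for room air)."""
    if not 0.0 < fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be a fraction in (0, 1]")
    return pao2_mmhg / fio2_fraction


def is_respiratory_failure(pao2_mmhg, fio2_fraction):
    """Label one observation as positive when the ratio is below the cut-off."""
    return pf_ratio(pao2_mmhg, fio2_fraction) < RESPIRATORY_FAILURE_THRESHOLD


# Made-up (PaO2 in mmHg, FiO2 fraction) readings, not data from the study.
observations = [(60.0, 0.50), (90.0, 0.21), (75.0, 0.40)]
labels = [is_respiratory_failure(pao2, fio2) for pao2, fio2 in observations]

# Sanity check on the class balance reported on the slide.
positives, negatives = 603, 465
total_observations = positives + negatives
```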

22
A case study to illustrate the problem

24
Modelling Requirements
• Parsimonious → few variables
• Robust to missing data → imputation not an option
• Explainable → Trust
  • model reveals the relative importance of each variable for each prediction it makes
• Minimize the number of false negatives
  • risk of under-estimating the severity of a patient’s condition

26
Approach
• Parsimonious → feature ranking and selection
• Robust to missing data
• Explainable → Shapley values
• Minimize FN → bespoke loss function
Ensemble of Decision trees
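The slide does not say what the bespoke loss function looks like. One common way to make a classifier reluctant to miss positives is to weight the positive class more heavily in the log loss, sketched below. The name `weighted_log_loss` and the default `fn_weight=5.0` are assumptions for illustration, not the study's actual choice.

```python
import math


def weighted_log_loss(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Mean log loss in which errors on true positives cost fn_weight times more.

    y_true: 0/1 labels; p_pred: predicted probability of the positive class.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -fn_weight * math.log(p)  # a low p here is a likely false negative
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# A confident miss on a positive case versus an equally confident false alarm.
missed_positive = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
```

With `fn_weight` above 1 the missed positive is penalised more than the false alarm, which pushes a model trained on this loss towards higher sensitivity at the price of more false positives.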

27
Testing multiple models - Results
Parsimony:
Model 1 - suboptimal prediction accuracy
Model 2:
Adding biomarkers including respiratory variables increased performance
Model 3:
boosted mixed model - still requires about 20 variables
From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice.
What our approach offers in support to the decision-making process is a simple interpretation of the predictions.

28
Which are the most important predictors?
Shap values
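For readers unfamiliar with Shapley values, the sketch below computes them exactly for a tiny toy model by averaging each variable's marginal contribution over all orderings. It only illustrates the definition: it is not the SHAP library, which relies on much faster algorithms, and the toy model and its three inputs are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Enumerates every feature ordering, so it is only feasible for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        z = list(baseline)
        previous = predict(z)
        for i in order:
            z[i] = x[i]  # switch feature i from its baseline to its actual value
            current = predict(z)
            phi[i] += current - previous
            previous = current
    return [value / len(orders) for value in phi]


def toy_model(z):
    """Hypothetical risk score: one additive term plus one interaction."""
    return 0.5 * z[0] + 2.0 * z[1] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions add up to the difference between the prediction at `x` and the prediction at the baseline, which is what lets a clinician read them as a per-prediction breakdown.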

29
Summary
Good results on “live” data, predicting a useful outcome for the purpose of ICU management
Major selling points:
• Variables (relatively) easy to collect in routine visits and in-hospital
• Models are explainable, medics can reality-check against their own understanding
… Opened the door to further collaborations:
New project on PACS: Post-Acute Covid Syndrome:
Following up recovery paths for 300 patients across 5 hospitals

30
D2P4 → EHR analysis for dynamic risk prediction
Survival analysis
Longitudinal prediction models

31
Longitudinal data: Health-related events
https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270
UK Biobank - Primary Care Linked Data

32
Clinical Risk Prediction Models
Healthy participant or missing data/under-reported conditions?
Number/pattern of records is a proxy for health?
Informed presence bias
Individuals in EHR data are systematically different to those who are not (Goldstein et al, 2016)

36
Case study: Type 2 Diabetes
[Figure 17: Example output of the phenotyping tool. Panels: glycated hemoglobin HbA1c (mmol/mol), with pre-diabetes, diabetes and remission periods; fasting plasma glucose (mmol/l), with normoglycaemic, pre-diabetic and diabetic ranges, marking primary care records and UKBB visits; electronic health records timeline of primary and secondary care events (Obs, Drug, Diag, Op) from 1987 to 2015, marking records, diabetes records and the estimated observation period. Participant: REDACTED.]

37
Case study: Type 2 Diabetes – remission study
Type 2 diabetes remission
Longitudinal phenotyping with large–scale observational data
Philip Darke
EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University
UK Biobank (ukbiobank.ac.uk) is a UK–based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort with nearly 70,000 diagnoses [2] expected by 2027. Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes as illustrated in figure 1. The approach will be expanded to all participants when further data is released.
[2] Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, September 2012. doi:10.1016/j.hlpt.2012.07.003
[Figure 1: Model output showing HbA1c, weight, periods of medication and inferred diabetic status for an example participant. Long–term remission was achieved by sustained weight loss post diagnosis. Panels: HbA1c (mmol/mol) with pre-diabetes, type 2 diabetes and remission periods; weight (kg); biguanides; time axis 2004 to 2016.]
Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long–term remission and cease anti–diabetes medication. This project will examine the pathways to remission at scale using ob-
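The rules used by the phenotyping tool are not given on these slides. As a rough illustration of what longitudinal phenotyping from HbA1c readings involves, the sketch below labels each reading using widely used HbA1c cut-offs (48 mmol/mol and above for diabetes, 42 to 47 for pre-diabetes). The function names and the example trajectory are hypothetical, and inferring remission would additionally require medication records, which this sketch ignores.

```python
def glycaemic_status(hba1c_mmol_mol):
    """Classify a single HbA1c reading using common diagnostic cut-offs."""
    if hba1c_mmol_mol >= 48:
        return "diabetic"
    if hba1c_mmol_mol >= 42:
        return "pre-diabetic"
    return "normoglycaemic"


def label_trajectory(readings):
    """Label a list of (year, HbA1c) readings in chronological order."""
    return [(year, glycaemic_status(value)) for year, value in sorted(readings)]


# Made-up readings for one participant, not data from UK Biobank.
readings = [(2010, 55.0), (2006, 44.0), (2014, 40.0)]
trajectory = label_trajectory(readings)
```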

38
D2P4 → MLTC-M

39
Multimorbidity and Long-Term Conditions
Patients with multimorbidities have the greatest healthcare needs and generate the
highest expenditure in the health system.
There is an increasing focus on identifying specific disease combinations for
addressing poor outcomes.
Matrix factorization / factor analysis
Clustering
Multiple correspondence analysis
Network analysis
…
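Before any of the methods listed above is applied, a simple way to look for disease combinations is to count how often pairs of conditions occur in the same patient; the resulting counts can also serve as edge weights for a network analysis. This is a minimal sketch on made-up records, and the `pair_counts` helper and the condition labels are illustrative.

```python
from collections import Counter
from itertools import combinations


def pair_counts(patients):
    """Count, for every pair of conditions, the patients who have both."""
    counts = Counter()
    for conditions in patients.values():
        # Sort and de-duplicate so each pair has one canonical key per patient.
        for pair in combinations(sorted(set(conditions)), 2):
            counts[pair] += 1
    return counts


# Made-up patient records: patient id -> list of long-term conditions.
patients = {
    "p1": ["T2D", "hypertension", "CKD"],
    "p2": ["T2D", "hypertension"],
    "p3": ["asthma"],
    "p4": ["hypertension", "CKD"],
}

counts = pair_counts(patients)
for pair, n in sorted(counts.items()):
    print(pair, n)
```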
Which data?
Fragmented / disconnected data sources
→ Data access
→ Data governance

40
D2P4 → NAFLD / non-alcoholic fatty liver disease

41
D2P4 → NAFLD / NASH
NASH = non-alcoholic steatohepatitis
Aims:
- Integrate cross-sectional and longitudinal clinical outcome data with a multi-dimensional ‘omics’ record
- Hypothesis: a precision medicine approach leads to better understanding of individuals’ trajectories
- Personalised biomarkers → liquid biopsy
Dataset: European NAFLD Registry
7,750 patients with histologically proven NAFLD/NASH
- Omics (cross-sectional)
- Longitudinal follow ups
Methods:
- Precision: clustering
- Anticipating progression: Learn cluster-specific longitudinal models

42
DP4DS: Data Provenance for Data Science
D2P4 + DP4DS (*)

43
Data → Model → Predictions
Data collection → Raw datasets → pre-processing → Instances, features → Model → Predicted you:
- Ranking
- Score
- Class
Key decisions are made during data selection and processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability
1. Can we explain these decisions?
2. Are these explanations useful?

44
Explaining data preparation
Data collection → Raw datasets → pre-processing → Instances, features → Model → Predicted you:
- Ranking
- Score
- Class
Population data pre-processing:
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused / repurposed?
- What is their quality?
Scripts → Python / TensorFlow, Pandas, Spark; Workflows → Knime, …
Provenance → Transparency

46
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
  - How much does it cost? → provenance volume
  - Does it help? → queries against the provenance database
[1] Chapman, A., Missier, P., Simonelli, G., & Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.

47
Operators
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (e.g. one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation → adding columns

48
Code instrumentation
Create a provlet for a specific transformation
Initialize provenance capture
…code injection is now being automated!
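The instrumentation code itself is not reproduced in this transcript. As a rough illustration of the idea (initialise capture once, then record something for each transformation), the sketch below wraps a preprocessing function in a decorator that logs which columns it added or removed. The `ProvenanceTracker` class and all other names are hypothetical, this is not the authors' API, and the record is far coarser than the fine-grained provenance the project collects.

```python
import functools


def _columns(rows):
    """Set of column names appearing in a list of dict records."""
    return set().union(*(row.keys() for row in rows)) if rows else set()


class ProvenanceTracker:
    """Collects one coarse provenance record per instrumented transformation."""

    def __init__(self):
        self.records = []

    def capture(self, func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            out = func(rows, *args, **kwargs)
            self.records.append({
                "operator": func.__name__,
                "rows_in": len(rows),
                "rows_out": len(out),
                "added_columns": sorted(_columns(out) - _columns(rows)),
                "removed_columns": sorted(_columns(rows) - _columns(out)),
            })
            return out
        return wrapper


tracker = ProvenanceTracker()  # initialise provenance capture once per pipeline


@tracker.capture  # instrument one specific transformation
def add_bmi(rows):
    """Vertical augmentation: derive a new column from two existing ones."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]


data = [{"weight_kg": 70.0, "height_m": 1.75}, {"weight_kg": 90.0, "height_m": 1.80}]
result = add_bmi(data)
```

Querying `tracker.records` afterwards answers questions such as which operator introduced a given column.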

50
Provenance templates
Template + binding rules = instantiated provenance fragment
{old values: F, I, V} → {new values: F’, J, V’}
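The equation "template + binding rules = instantiated provenance fragment" can be illustrated with a toy example. In the sketch below a template is a list of triples containing variables, and a binding maps each variable to a concrete identifier. The relation names are taken from W3C PROV, but the triple format, the `var:` prefix and the cell identifiers are assumptions made for illustration, not the templates used in the paper.

```python
# A template: PROV-style statements whose terms may be variables ("var:...").
TEMPLATE = [
    ("var:new_value", "wasGeneratedBy", "var:op"),
    ("var:op", "used", "var:old_value"),
    ("var:new_value", "wasDerivedFrom", "var:old_value"),
]


def instantiate(template, binding):
    """Replace every variable in the template with its bound identifier."""
    return [tuple(binding.get(term, term) for term in triple) for triple in template]


def instantiate_all(template, op_name, derivations):
    """Apply the template once per (old value, new value) pair of an operator."""
    fragment = []
    for old_value, new_value in derivations:
        binding = {
            "var:op": op_name,
            "var:old_value": old_value,
            "var:new_value": new_value,
        }
        fragment.extend(instantiate(template, binding))
    return fragment


# Hypothetical identifiers for one input cell and the output cell derived from it.
derivations = [("cell(F=weight_kg,I=0)", "cell(F=bmi,I=0)")]
fragment = instantiate_all(TEMPLATE, "add_bmi", derivations)
```

Because one template is instantiated once per derived value, the volume of provenance grows with the size of the data, which is why the cost question raised earlier matters.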

51
This applies to all operators…

54
Evaluation: Provenance capture and query times

56
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? → what is the benefit to data analysts?
Work in progress! Interest? Ideas?

57
Acknowledgments
Prof. Mike Catt
PhD Students: Ben Lam, Philip Darke
MSc student: Sam Butterfield
Prof. Guaraldi
Prof. Mandreoli
MSc student: Davide Ferrari
Prof. Torlone
MSc student: Giulia Simonelli
Prof. Chapman
