Realising the potential of Health Data Science: opportunities and challenges to practical adoption

Professor Paolo Missier
School of Computing
Newcastle University
October 2023

The promise of data-driven medicine and healthcare
Predictive, Preventative, Personalised, Participatory: a systems biology perspective on the future of
medicine and health care
Hood L, Heath JR, Phelps ME, Lin B. Systems biology and new technologies enable predictive and preventative medicine. Science. 2004;306(5696):640–643.
Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century through systems approaches. Biotechnol J. 2012;7(8):992–1001. Provides an overview of the science and
technological foundations of predictive, preventive, personalized and participatory healthcare
Flores M, Glusman G, Brogaard K, Price ND, Hood L. P4 medicine: how systems medicine will transform the healthcare sector and society. Per Med. 2013;10(6):565-576. doi:
10.2217/pme.13.57. PMID: 25342952; PMCID: PMC4204402.
Schmidt, Charlie. ‘Leroy Hood Looks Forward to P4 Medicine: Predictive, Personalized, Preventive, and Participatory’. JNCI Journal of the National Cancer Institute 106, no. 12
(December 2014): dju416–dju416. https://doi.org/10.1093/jnci/dju416.
[1] Sagner, M, A McNeil, P Puska, and R Arena. ‘The P4 Health Spectrum – A Predictive, Preventive, Personalized and Participatory Continuum for Promoting Healthspan’.
Progress in Cardiovascular Diseases 59, no. 5 (2017): 506–21. https://doi.org/10.1016/j.pcad.2016.08.002.
A new approach in medicine that is predictive, preventive, personalized and participatory, which we
label here as “P4” holds great promise to reduce the burden of chronic diseases by harnessing
technology and an increasingly better understanding of environment-biology interactions, evidence-
based interventions and the underlying mechanisms of chronic diseases. [1]

Data about us
Sagner, M, A McNeil, P Puska, and R Arena. ‘The P4 Health Spectrum – A Predictive, Preventive, Personalized and Participatory Continuum for Promoting Healthspan’.
Progress in Cardiovascular Diseases 59, no. 5 (2017): 506–21. https://doi.org/10.1016/j.pcad.2016.08.002.

Outline
• AI for HealthCare: a convergence of needs and opportunities
• A complex multifaceted landscape
• Challenges, opportunities, state of the art through two first-hand case studies
• Costs and Challenges throughout the data value chain

Understanding the facets of Health data
• Which data types?
  - Clinical
  - Lifestyle, social
• Where do datasets come from?
  - Prospective vs retrospective
• How much do they cost?
  - Acquisition
  - Curation, annotation
• How large?
  - Small vs Big Health Data
• Who can use it and how?
  - Governance
  - Protection
Data Science and Engineering
Benefits to patients

Which data? Capturing individuals’ complexity
Primary care records:
- Clinical tests / GP notes, diagnoses / Prescriptions
Secondary care records:
- hospital admission / diagnoses / operations / prescriptions
Multi-omics data:
- genotypes, exomes, genomes.
- Transcriptomics, proteomics
Digital Health:
- Data streams from wearable and environment sensors,
self-monitoring
Socio-demographics:
- Area of residence, family, social deprivation

Example: UK Biobank
[Figure: UK Biobank tables linked by participant identifier (eid): baseline assessment, GP events, prescriptions, HESIN diagnoses, hospital events, operations. Labels on the figure: N = 240,000; N = 500,000; 57,698,505; 123,644,445; up to 20 years of records. Hospital events are used to determine admission / re-admission patterns.]
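A minimal sketch of how hospital events keyed by a participant identifier could be turned into re-admission flags. The record layout, the 30-day window and the function name are illustrative assumptions, not the actual UK Biobank schema or the definition used in the study.

```python
from collections import defaultdict
from datetime import date

def flag_readmissions(episodes, window_days=30):
    """Return (eid, admission_date) pairs that start within
    `window_days` of the same participant's previous discharge.

    `episodes` is a list of (eid, admission_date, discharge_date) tuples;
    this layout is hypothetical, not the actual UK Biobank schema.
    """
    by_patient = defaultdict(list)
    for eid, admitted, discharged in episodes:
        by_patient[eid].append((admitted, discharged))

    readmissions = []
    for eid, stays in by_patient.items():
        stays.sort()  # chronological order by admission date
        for (_, prev_discharge), (admitted, _) in zip(stays, stays[1:]):
            gap = (admitted - prev_discharge).days
            if 0 <= gap <= window_days:
                readmissions.append((eid, admitted))
    return readmissions

# Toy episodes, made up for illustration.
episodes = [
    (1001, date(2015, 1, 3), date(2015, 1, 10)),
    (1001, date(2015, 1, 25), date(2015, 1, 28)),  # 15 days after discharge
    (1001, date(2016, 6, 1), date(2016, 6, 2)),    # well outside the window
    (1002, date(2015, 3, 1), date(2015, 3, 5)),
]
print(flag_readmissions(episodes))
```

Only the second stay of participant 1001 falls inside the window here; a real definition would also need to separate planned from unplanned admissions.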

CPRD
Data access fee for research ~£60K
(non-commercial license)
Population makeup:
over 2,000 primary care practices
60 million patients (18 million active registered patients)
at least 20 years of follow-up for 25% of the patients
Core dataset:
Demographics
Diagnoses and symptoms
Drug exposures
Vaccination history
Laboratory tests
Referrals to hospital and specialist care
Data linkages:
Hospital care (A&E; Inpatient; Outpatient; Imaging)
Death registry
Cancer registry and treatment
Mental health services
Socio-economic measures

A convergence of needs and opportunities
P4 Data-driven Healthcare
Personal self-monitoring devices
Health Data Science and Engineering
Governance, consent
Secure data access
(Big) Health Data
- Operations  Research
- ML, AI Methods
- Scalable computing
Medical grade ↔ Consumer grade
- Privacy (eg GDPR)
- Opt-in vs opt-out
- Trusted Research Environments
Bigger == more useful?

The data-to-actions loop
Monitoring
Clinical testing
Data Engineering
Predictive Analytics / AI
Personalised Predictions
- Prevention
- Interventions

A complex health data science landscape for translational research
Pipeline stages: Data ingestion → Data preparation / engineering → Descriptive analytics → Pattern discovery → Predictions
Tasks and methods:
- Protocol design; dataset search and selection (prospective and retrospective data)
- Data cleaning, data standardisation
- Data augmentation: annotation amplification, synthetic data
- Population characterisation; subgroup identification (patient subtyping, disease subtyping)
- “Group by”, clustering, Latent Class Analysis
- Risk prediction, next disease prediction, {bio, digital} marker discovery, other outcomes
- Process modelling, HMM; established ML; Deep NN; Generative AI (eg BEHRT)
Challenges (data and methods):
- Data integration: cross-source integration across types (clinical / EHR / omics / sensors)
- Understanding data semantics
- Data and annotation scarcity
- Managing the quality/quantity/cost envelope
- Bias control
- Data noise
- Advancing the methods: “Better data science for better science”
Challenges (architectures):
- Data governance, computational scalability → Safe Data Environments
- End-to-end explainability → provenance engineering, demonstrating the benefits
- Reproducible Analytics Pipelines (RAP)

II. Prospective vs retrospective datasets
Prospective: defined for research purposes
✓ Stable and predictable
✓ Follow protocol
✓ Research ready
✓ Potentially well-curated
✓ Bias known a priori
✗ Expensive
✗ Not very reusable
✗ Scarce
Example: UK Biobank
- 500,000 volunteer participants
- General health information
- Genotypes and whole genomes
- Selected internal organ imaging study (100K)
- Bias: 40+ years, geographic / social bias
Retrospective: typically operational data
✓ Potentially more reusable
✓ Natural bias (reflects natural cohort locality)
✗ Generally not research ready
✗ Require data engineering
Example: Clinical Practice Research Datalink
- Data collected from UK GP practices
- 60+ million patients
- (also prospective)

Cost of health data
Retrospective: integration/harmonisation, curation, cleaning
Prospective: cost of cohort recruitment, data collection, data processing
Acquisition + processing cost by data type, from low to high:
- Routinely collected clinical variables (GP test)
- Tests requiring specialist labs; proteomics; genotyping (a few genes)
- Whole exome sequencing
- Whole genome sequencing

Case study: LITMUS
Retrospective data collected from hospital datasets (N ≅ 10K)
Prospective data from active recruitment (N ≅ 2K)
- Routine clinical tests
- Omics (genotypes, transcriptomes, proteomes)
- Biopsies → provide label annotations
• EU IMI2 project
• Non-Alcoholic Fatty Liver Disease (NAFLD / steatosis) and NASH (fibrosis, cirrhosis)
https://litmus-project.eu/litmus-partners/
Main contributor: Matt McTeer, PhD student
From multivariate linear regression to non-linear combinations of markers

Data scarcity / sparsity issues
N= 9,449
Clinical: 8,745
GWAS: 2,216
miRNA: 183
RNASeq: 461

Exploring the cost/quality/importance envelope
Stratified feature set: Core → Extended → Specialist (85% missing)
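One way to explore this envelope is to count how many complete training rows survive as each feature tier is added. The sketch below is illustrative only: the tier contents, feature names and toy records are assumptions, not the LITMUS variables.

```python
def complete_cases(records, features):
    """Count records with a non-missing value for every listed feature."""
    return sum(all(r.get(f) is not None for f in features) for r in records)

# Hypothetical tiers and toy records; None marks a missing value.
TIERS = {
    "core": ["age", "bmi", "alt"],
    "extended": ["ferritin"],
    "specialist": ["biomarker_x"],
}

records = [
    {"age": 54, "bmi": 31.0, "alt": 60, "ferritin": 210, "biomarker_x": 1.4},
    {"age": 61, "bmi": 28.5, "alt": 45, "ferritin": 180, "biomarker_x": None},
    {"age": 47, "bmi": 35.2, "alt": 82, "ferritin": None, "biomarker_x": None},
    {"age": 58, "bmi": None, "alt": 51, "ferritin": None, "biomarker_x": None},
]

def usable_rows_by_tier(records, tiers):
    """Cumulatively add tiers and report how many complete rows remain."""
    selected, result = [], {}
    for name, feats in tiers.items():
        selected = selected + feats
        result[name] = complete_cases(records, selected)
    return result

print(usable_rows_by_tier(records, TIERS))
```

Each added tier widens the feature set but shrinks the usable training set, unless the missing values are imputed.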

Core variables may be enough?
Outcome: “at-risk NASH”

LITMUS
Tasks and methods:
- Protocol design; dataset search and selection (prospective and retrospective data)
- Data cleaning
- Data augmentation: synthetic data
- Patient subtyping, disease subtyping
- “Group by”, clustering
- Statistical modelling: multivariate regression
- Risk prediction, other outcomes
- Established ML, Deep NN
Challenges (data):
- Data integration across types
- Understanding data semantics
- Managing the quality/quantity/cost envelope
- Bias control
- Data noise
Challenges (architectural):
- Data governance, computational scalability → Safe Data Environment
Highlighted for LITMUS:
- “Long and thin” vs “short and broad” training sets
- Feature completeness vs importance, imputation
- Binary classifiers across multiple feature sets

Issues requiring Data Engineering
Recurring data issues
Data-driven, AI-based clinical practice: experiences, challenges, and research directions
DATA SPARSITY AND SCARCITY
• EHR: irregular collections of time series
• Imputation is not always possible
DATA IMBALANCE
• Predicting rare events can be a priority
• No downsampling option
DATA INCONSISTENCY and INSTABILITY
• Retrospective data are often a source of inconsistency, and their schemas are unstable
NOT ALL ERRORS ARE EQUALLY WRONG
• In high-stakes domains, a bias towards one type of error is sometimes preferable
HUMAN-IN-THE-LOOP
• Explanations engender trust in the models
• Trust should include not only the clinician but also the patient.

Sparsity/ scarcity, imbalance
Classifiers are not resilient to class imbalance:
- Models will be biased towards predicting
majority class regardless of the input features
- Will struggle to generalise correctly on the
minority class
- In clinical datasets, data scarcity/sparsity often
conspires with data imbalance
- Imbalance is very common in medical datasets
Typical mitigation:
- Downsample the majority class → lose training examples
- Upsample the minority class → SMOTE (Synthetic Minority Oversampling Technique)
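The two mitigations listed above can be sketched in a few lines. This is a simplified, SMOTE-style interpolation written for illustration, not the reference SMOTE implementation; the function names, the toy 2-D data and the choice of k are assumptions.

```python
import random

def downsample_majority(majority, minority, rng):
    """Randomly drop majority examples until both classes have equal size."""
    return rng.sample(majority, len(minority)), minority

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating between a
    sampled minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy data: 50 majority points and 5 minority points in two dimensions.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

maj_small, _ = downsample_majority(majority, minority, rng)
new_points = smote_like(minority, n_new=45, k=3, rng=rng)
print(len(maj_small), len(minority) + len(new_points))
```

Downsampling balances the classes by discarding 45 of the 50 majority examples, while the interpolation balances them by adding 45 synthetic minority points that stay between existing minority examples.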
When modelling processes, these mitigations do not work
We used Hidden Markov Models (HMMs) to predict oxygen-therapy state-transitions
However, intubation is an infrequent state (and so is “death”)
This makes it difficult to accurately learn probability distributions.
[1] proposes a novel, generic ensemble technique to mitigate the imbalance problem in HMM

Instability
Retrospective studies are often unstable:
Data acquisition and management practices may change over time, following changes in
- Clinical practices
- Public policy
- Hospital resources
- Data collection technologies
- In our COVID dataset clinical tests vary daily depending on the patient’s
condition
- Scientific evidence for the need of certain tests changed rapidly
- Example: new biomarkers like interleukin-6 were introduced in “mid flight”
- Thus earlier study datasets completely miss this variable
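A simple check for this kind of instability is to record when each variable first appears in the data. The sketch below assumes a hypothetical layout of (date, measurements) pairs; the variable names, values and dates are made up for illustration.

```python
from datetime import date

def first_seen(records):
    """Map each variable name to the earliest date on which it was recorded."""
    seen = {}
    for day, measurements in records:
        for name, value in measurements.items():
            if value is not None and (name not in seen or day < seen[name]):
                seen[name] = day
    return seen

def late_variables(records):
    """Variables whose first recording is later than the start of the dataset."""
    start = min(day for day, _ in records)
    return {n: d for n, d in first_seen(records).items() if d > start}

# Toy records; variable names and dates are illustrative only.
records = [
    (date(2020, 3, 20), {"crp": 85.0, "lymphocytes": 0.9}),
    (date(2020, 3, 21), {"crp": 90.0, "lymphocytes": 0.8}),
    (date(2020, 4, 15), {"crp": 40.0, "lymphocytes": 1.1, "il6": 32.0}),
]
print(late_variables(records))
```

Variables flagged this way are structurally missing for earlier patients, which is a different situation from values that are missing at random and argues against blind imputation.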

Translational challenge: Not all errors are equally wrong
- In high-stakes domains, prediction errors are not symmetric:
- Typically, underestimating risk is less desirable than overestimating it
- Standard model performance metrics (eg AUC, F1 etc) fail to capture this distinction
Cost-sensitive learning (cf eg [1,2,3])
- Introduce an explicit penalty for misclassifying samples
- Note that cost-sensitive methods can sometimes deal with imbalanced datasets without altering the original data distribution [4]
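A minimal sketch of the asymmetric-penalty idea: choose the decision threshold that minimises total misclassification cost when a missed high-risk patient is costlier than a false alarm. This is post-hoc threshold selection, one simple form of cost-sensitive decision making, and not the training-time methods of [1-4]. The cost values and toy scores are assumptions.

```python
def total_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Sum of misclassification costs at a given decision threshold."""
    cost = 0.0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 0:
            cost += cost_fn   # missed high-risk patient (risk underestimated)
        elif y == 0 and predicted == 1:
            cost += cost_fp   # false alarm (risk overestimated)
    return cost

def best_threshold(y_true, y_prob, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    candidates = sorted(set(y_prob)) + [1.01]  # 1.01 means "flag nobody"
    return min(
        candidates,
        key=lambda t: total_cost(y_true, y_prob, t, cost_fn, cost_fp),
    )

# Toy labels and predicted risks, made up for illustration.
y_true = [0, 0, 1, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

symmetric = best_threshold(y_true, y_prob, cost_fn=1, cost_fp=1)
asymmetric = best_threshold(y_true, y_prob, cost_fn=5, cost_fp=1)
print(symmetric, asymmetric)
```

With equal costs the best threshold on this toy data is 0.7; making a false negative five times as costly moves it down to 0.3, accepting more false alarms to avoid missing any positive case.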
[1] Lomax, S., and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Comput. Surveys 45, 1–35. doi: 10.1145/2431211.2431215
[2] Wang, H., Cui, Z., Chen, Y., Avidan, M., Abdallah, A. B., and Kronzer, A. (2018). Predicting hospital readmission via cost-sensitive deep learning. ACM Trans.
Comput. Biol. Bioinformatics 15, 1968–1978. doi: 10.1109/TCBB.2018.2827029
[3] Freitas, A., Costa-Pereira, A., and Brazdil, P. (2007). “Cost-sensitive decision trees applied to medical data,” in Data Warehousing and Knowledge Discovery
(Regensburg), 303–312. doi: 10.1007/978-3-540-74553-2_28
[4] Mienye, I. D., and Sun, Y. (2021). Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlock.
25:100690. doi: 10.1016/j.imu.2021.100690

Translational challenge: human-in-the-loop AI
• Essential in medical AI
• Evidence of performance is not enough
• Black-box AI not acceptable in clinical practice
“Explanation gap”:
From technical explanations:
• non-linear [1] and Deep Learning [2] models
• Shapley values [3]
• Interpretable ML [4,5]
To expert involvement in the learning process:
- by accepting/rejecting predictions
- by expressing preference for a given error type
Causal Machine Learning (CML) [6,7]:
- Visualisation and reasoning over complex clinical scenarios
- Counterfactuals, what-if scenarios
Also importantly:
Patient and Public Involvement (PPI) is essential in publicly funded clinical research
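To make the Shapley-value style of technical explanation concrete, here is a small exact computation for a toy risk score. It is a sketch under stated assumptions: the risk function, the feature names and the reference patient are hypothetical, and practical tools use approximations because the exact computation grows factorially with the number of features.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all
    feature orderings. Features outside the coalition take the baseline
    value. Cost grows factorially, so this is only for tiny examples."""
    names = list(instance)
    contrib = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for n in order:
            current[n] = instance[n]
            value = predict(current)
            contrib[n] += value - previous
            previous = value
    return {n: c / len(orderings) for n, c in contrib.items()}

# Hypothetical risk score with an interaction term; not a clinical model.
def risk(x):
    return 0.02 * x["age"] + 0.5 * x["diabetes"] + 0.3 * x["diabetes"] * x["smoker"]

patient = {"age": 70, "diabetes": 1, "smoker": 1}
reference = {"age": 50, "diabetes": 0, "smoker": 0}
phi = shapley_values(risk, patient, reference)
print(phi)
```

The attributions add up to the difference between the score for the patient and the score for the reference, and the interaction term is shared equally between the two features involved.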

IEEE BigData 2022
Multimorbidities and disease prediction
Multiple Long-Term Conditions, defined as [1,2]:
• Two/Four or more long-term (chronic) conditions
A Long Term Condition (LTC) is a condition that cannot, at present, be cured
but is controlled by medication and/or other treatment/therapies (*)
(*) NHS and UK Dept. of Health, Long Term Conditions Compendium of Information Third Edition,
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/216528/dh_134486.pdf
[1] M. C. Johnston, M. Crilly, C. Black, G. J. Prescott, and S. W. Mercer, “Defining and measuring multimorbidity: a systematic review of systematic reviews,”
European journal of public health, vol. 29, no. 1, pp. 182–189, 2019.
[2] B. P. Nunes, T. R. Flores, G. I. Mielke, E. Thumé, and L. A. Facchini, “Multimorbidity and mortality in older adults: A systematic review and meta-analysis.”
Archives of gerontology and geriatrics, vol. 67, pp. 130–138, Dec. 2016, place: Netherlands.
Significant research investment by NIHR, the core
translational medicine funder in the UK
The number of people with multiple LTCs in the UK is set to
rise to 2.9 million in 2018 from 1.9 million in 2008.

Multiple Long Term Conditions: research at Newcastle
Characterising the inter-relationships between multiple long-term conditions (MLTC)
and polypharmacy
Funding: NIHR, 2022-2024 (CO-I)
➢ Disease clustering based on co-occurrence in patients’ medical timelines
➢ Patient clustering based on timeline similarity
➢ Predicting outcomes using diagnoses + prescriptions / deep learning
  • Disease embeddings → supervised learning for outcome prediction
➢ Characterise patient pathways through hospital → process modelling

[Figure: programme overview, from datasets through AI methods to outputs and impacts, within 5 years.
Datasets (Discover, Test, Replicate): UK-Biobank, CPRD, GNCR, ELPR, Connected Bradford; local health intelligence datasets, replication datasets, national discovery datasets.
AI methods: event characterisation, event prediction, event spatial & temporal order.
Outputs & impacts: shared standards; portable pipelines; identification of high-risk situations & tipping points; trial emulation; local and national policies (high risk groups); training & capacity building; clinical dashboard; clinical support tools; communication of results; explainable research & explainable AI.
Other labels: NIHR AIM CISC, NIHR AIM OPTIMAL.
Aims: improve patient care, reduce health inequalities.]

Predicting MLTC-PP outcomes: hospital readmission
Model inputs:
- LTC embedding: 200x100
- Diagnosis embedding: 251x50
- Historical prescriptions embedding: 512x50
- Preadmission prescriptions embedding: 512x50
- Postadmission prescriptions embedding: 512x50
- Demographics vector: sex, ethnicity, Townsend, etc.
Feature vector, size: 6048
Gradient Boost, size: 100 estimators
Output: {0, 1}
• Neural network + xgboost combination
• Our readmission cohort is more general than the domain-specific cohorts in the literature
• Our model performs better than the current literature despite the more complex problem
Adding explanations: which predictors are more relevant to explain unplanned readmission?
- LTCs and how they accumulate
- Prescriptions given between discharge and readmission?

AI-MULTIPLY
Tasks and methods:
- Protocol design; dataset search and selection (prospective and retrospective data)
- Data cleaning
- Data augmentation: synthetic data
- Patient subtyping, disease subtyping
- “Group by”, clustering
- Risk prediction, other outcomes
- Established ML, Deep NN
Challenges (data):
- Data integration across types
- Understanding data semantics
- Managing the quality/quantity/cost envelope
- Bias control
- Data noise
Challenges (architectural):
- Data governance, computational scalability → Safe Data Environment
Highlighted for AI-MULTIPLY:
- Challenges in drug coding in UKBB
- Defining and predicting hospital readmission
- Defining and coding MLTC
- Reproduce DNN results across sites
- Disease clustering and cluster prediction

Challenges and opportunities
Data:
• Multi-site research presents opportunities for cross-validation of results, but also challenges
• Newcastle → UK Biobank
• QMUL → CPRD
• Projects like these tend to “piggyback” on existing data licenses, which may be restrictive
Modelling:
• LLMs and genAI have shown potential to “sidestep” some of the more traditional prediction techniques
➢ Next disease prediction becomes a case of “sentence completion”
Engineering / reproducibility:
• at this stage, prototyping and experimenting are distributed across sites and each piece is owned by
one researcher
• reproducibility and reusability both seem like distant goals…
Patient and Public Involvement and Engagement:
• Establishing a productive and sustained relationship between PPIE members and researchers is a
priority
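The “sentence completion” view of next disease prediction can be illustrated with a toy bigram model over diagnosis timelines. This is a deliberately simple stand-in, not an LLM or BEHRT; it only shows the idea of treating a patient timeline as a sentence of codes. The diagnosis labels and timelines are made up.

```python
from collections import Counter, defaultdict

def fit_bigram(timelines):
    """Count how often each diagnosis code is followed by each other code."""
    counts = defaultdict(Counter)
    for timeline in timelines:
        for current, nxt in zip(timeline, timeline[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, history):
    """Return the most frequent successor of the last code, or None."""
    successors = counts.get(history[-1])
    if not successors:
        return None
    return successors.most_common(1)[0][0]

# Toy timelines of diagnosis labels; entirely made up for illustration.
timelines = [
    ["hypertension", "type2_diabetes", "ckd"],
    ["hypertension", "type2_diabetes", "retinopathy"],
    ["obesity", "type2_diabetes", "ckd"],
    ["hypertension", "atrial_fibrillation"],
]
model = fit_bigram(timelines)
print(predict_next(model, ["obesity", "type2_diabetes"]))
```

A transformer-based model conditions on the whole history rather than only the last code, but the framing is the same: complete the sequence with the most likely next token.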

Role of PPIE in Health Data Science / AI projects
PPIE involvement “built into” every NIHR-funded project: it’s an asset and opportunity
BUT: need to make it work!
What kind of involvement? Consultation vs research co-design
• Periodic, scheduled “themed” sessions at designated project checkpoints
• Key research questions defined upfront, but opportunities to revise / refine mid-flight
• The academic perspective and the lived experiences are very different
• Need to find a common language
• But also to find a way to ensure mutual benefit and a two-way learning experience

PPIE: some elements for reflection
Engendering trust in AI and in secure data management practices
• Where is your data held? How do TREs work?
• What are the boundaries of legitimate use of your data for research? How is the law changing?
• Transparency and explainability: how can we achieve effective communication about what an algorithm is doing?
What outcomes are most relevant? Are those aligned with the data we work with?
• Ex.: ensuring good Quality of Life for LTC patients: very important, but data hardly available
Medication / prescriptions:
• Meeting expectations like “predict the best combination of medicines” presents hard challenges
Data limitations: “you don’t know half my story”

Data governance issues: the emerging UK landscape
https://www.goldacrereview.org/
Build a small number of Trusted Research Environments, avoiding duplication
Promote culture of reuse of code (curation pipelines, analytics)
- “Reproducible Analytical Pipelines”, a set of best practices
- Promote high quality, shared, reviewable, re-usable, well-documented code for
standardized data curation and analysis
- Promote transparency, avoid black box analysis
Adopt single governance rules for integrated data access
- Rationalise approvals: create one map of all approval processes
Build appropriate capabilities:
- Train academic researchers and NHS analysts in computational data science
techniques

Cluster analysis workflows
[Figure: workflow from patients’ medical histories (EHR), with time discarded, to a patients x diagnoses binary matrix (cross-sectional data). Boxes on the figure: disease clustering (ex: topic modelling); patient / cluster associations; patient clustering; Latent Class Analysis; cluster phenotyping. Citation markers on the figure: [1], [2], [1,2].]
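The first step of such a workflow, turning timestamped histories into a patients x diagnoses binary matrix, can be sketched as follows. The input layout, the diagnosis labels and the co-occurrence helper are illustrative assumptions; the clustering methods themselves (Latent Class Analysis, topic modelling) are not implemented here.

```python
def binary_matrix(histories):
    """Build a patients x diagnoses 0/1 matrix, discarding event times.

    `histories` maps a patient id to a list of (date_string, diagnosis)
    events; the layout is a hypothetical simplification of an EHR extract.
    """
    diagnoses = sorted({dx for events in histories.values() for _, dx in events})
    patients = sorted(histories)
    matrix = [
        [1 if dx in {d for _, d in histories[p]} else 0 for dx in diagnoses]
        for p in patients
    ]
    return patients, diagnoses, matrix

def cooccurrence(matrix):
    """Count patients having both diagnoses i and j (one possible input to
    disease clustering)."""
    n = len(matrix[0])
    return [
        [sum(row[i] * row[j] for row in matrix) for j in range(n)]
        for i in range(n)
    ]

# Toy histories, made up for illustration.
histories = {
    "p1": [("2010-02-01", "asthma"), ("2015-07-12", "copd")],
    "p2": [("2012-05-30", "copd"), ("2018-01-09", "heart_failure")],
    "p3": [("2011-11-11", "asthma"), ("2011-12-01", "asthma")],
}
patients, diagnoses, matrix = binary_matrix(histories)
print(diagnoses)
print(matrix)
print(cooccurrence(matrix))
```

Repeated diagnoses collapse to a single 1 and the order of events is lost, which is exactly what discarding time implies for the downstream clustering.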

LCA example
[2]

Cluster phenotyping example
[2]

Workflow - example
[1]

Summary
Enablers:
Data availability
Scalable data processing technology
Inexpensive, accurate self-monitoring
Mature data science and engineering methods
Rapidly advancing AI
A unique convergence of opportunities and challenges to achieve a “P4” vision of data-driven medicine
and healthcare management
Blockers:
Data access and governance, data integration
Data Quality control, device tolerance, intrusiveness
Data engineering expensive and ad hoc
Still very experimental. Trustworthy, Ethical, Responsible AI
Hard “management” questions:
- how do you calculate the “total cost of operation” for data-driven medicine?
- at which point does it become cost-effective for the health service?
- what are the real benefits to patients?
- …
