Delivering on the promise of data-driven healthcare: trade-offs, challenges, and research perspective

Paolo Missier
School of Computing
Newcastle University, UK
Comsys 2022
IIT Ropar, India
(online presentation)
Delivering on the promise of data-driven healthcare:
trade-offs, challenges, and research perspectives

Outline
• AI for HealthCare: a convergence of needs and opportunities
• A complex multifaceted landscape
• Data engineering for healthcare data: intrinsic and translational requirements
• Extracting actionable knowledge from EHRs
• Recent work
• Some Challenges

The promise of data-driven medicine and healthcare
Predictive, Preventative, Personalised, Participatory: a systems biology perspective on the future of
medicine and health care
Hood L, Heath JR, Phelps ME, Lin B. Systems biology and new technologies enable predictive and preventative medicine. Science. 2004;306(5696):640–643.
Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century through systems approaches. Biotechnol J. 2012;7(8):992–1001. Provides an overview of the science and
technological foundations of predictive, preventive, personalized and participatory healthcare
Flores M, Glusman G, Brogaard K, Price ND, Hood L. P4 medicine: how systems medicine will transform the healthcare sector and society. Per Med. 2013;10(6):565-576. doi:
10.2217/pme.13.57. PMID: 25342952; PMCID: PMC4204402.
Schmidt, Charlie. ‘Leroy Hood Looks Forward to P4 Medicine: Predictive, Personalized, Preventive, and Participatory’. JNCI Journal of the National Cancer Institute 106, no. 12
(December 2014): dju416–dju416. https://doi.org/10.1093/jnci/dju416.
[1] Sagner, M, A McNeil, P Puska, and R Arena. ‘The P4 Health Spectrum – A Predictive, Preventive, Personalized and Participatory Continuum for Promoting Healthspan’.
Progress in Cardiovascular Diseases 59, no. 5 (2017): 506–21. https://doi.org/10.1016/j.pcad.2016.08.002.
A new approach in medicine that is predictive, preventive, personalized and participatory, which we
label here as “P4” holds great promise to reduce the burden of chronic diseases by harnessing
technology and an increasingly better understanding of environment-biology interactions, evidence-
based interventions and the underlying mechanisms of chronic diseases. [1]

Five pillars of P4 medicine
Pillar 1
■ Cutting-edge technologies for generating data regarding multiple dimensions of each person's experience of
health and disease.
Pillar 2
■ A digital infrastructure linking participating discovery science and clinical institutions, as well as
patients/consumers.
Pillar 3
■ Personalized data clouds providing information about multiple dimensions of each individual's unique dynamic
experience of health and disease ranging from the molecular to the social. These data will include genetic and
phenotypic characteristics, medical history, demographics and other sociometrics.
Pillar 4
■ New analytic techniques and technologies for deriving actionable knowledge from the data.
Pillar 5
■ Systems biology models for understanding the unique health status of each individual in terms of dynamic
network states that can be manipulated by cost-effective strategies
Source: [1]

A convergence of needs and opportunities
Diagram: P4 and data-driven healthcare at the convergence of:
• Personal self-monitoring devices
  - Medical grade and consumer grade
• Health Data Science and Engineering
  - Operations Research
  - ML, AI methods
  - Scalable computing
• Governance, consent, secure data access
  - Privacy (e.g. GDPR)
  - Opt-in vs opt-out
  - Trusted Research Environments
• (Big) Health Data
  - Bigger == more useful?

The data-to-actions loop
Diagram (loop): Monitoring and clinical testing → Data Engineering → Predictive Analytics / AI → Personalised predictions → Prevention and interventions

Understanding the facets of Health data
• Which data types?
  - Clinical
  - Lifestyle, social
• Where do datasets come from?
  - Prospective vs retrospective
• How much do they cost?
  - Acquisition
  - Curation, annotation
• How large?
  - Small vs Big Health Data
• Who can use it and how?
  - Governance
  - Protection
Data Science and Engineering → Benefits to patients

I. Which data? Capturing individuals’ complexity
Primary care records:
- Clinical tests / GP notes, diagnoses / Prescriptions
Secondary care records:
- hospital admission / diagnoses / operations / prescriptions
Multi-omics data:
- genotypes, exomes, genomes.
- Transcriptomics, proteomics
Digital Health:
- Data streams from wearable and environment sensors,
self-monitoring
Socio-demographics:
- Area of residence, family, social deprivation

Example: UK Biobank
Diagram: record tables keyed by participant identifier (eid), with up to 20 years of records:
- Baseline assessment
- GP events and prescriptions
- Hospital events: HESIN diagnoses and operations, used to determine admission / re-admission patterns
Figures shown in the diagram: N = 500,000; N = 240,000; 57,698,505; 123,644,445

II. Prospective vs retrospective datasets
Prospective: defined for research purposes
✓ Stable and predictable
✓ Follow protocol
✓ Research ready
✓ Potentially well-curated
✓ Bias known a priori
✗ Expensive
✗ Not very reusable
✗ Scarce
Example: UK Biobank
- 500,000 volunteer participants
- General health information
- Genotypes and whole genomes
- Selected internal organ imaging study (100K)
- Bias: participants aged 40+, geographic / social bias
Retrospective: typically operational data
✓ Potentially more reusable
✓ Natural bias (reflects natural cohort locality)
✗ Generally not research ready
✗ Require data engineering
Example: Clinical Practice Research Datalink
- Data collected from UK GP practices
- 60+ million patients
- (also prospective)

Example: LITMUS
Retrospective + prospective data collection project
• EU IMI2 project
• Collecting data across centres (EU + USA) on Non-Alcoholic Fatty
Liver Disease (NAFLD) and NASH (liver steatosis, fibrosis, cirrhosis)
https://litmus-project.eu/litmus-partners/
Phase 1a:
- Retrospective data collected from hospital datasets
- Around 10,000 patients
- Varying degrees of quality / completeness
- Central curation required
Phase 1b:
- Prospective data from active recruitment
- Around 2,000 patients
- Omics data more abundant

Discovery + validation experimental design
Actual design will depend on dataset characteristics
Ex. “Explore relationship between social deprivation and mortality rate in MLTC population”
An ideal study will include both Prospective and Retrospective datasets
UK Biobank → machine-learning friendly → modelling → discovery of candidate associations
Regional dataset → validation dataset
Regional UK dataset:
- 50K actual patients
- Data availability depending on operational systems
- Likely data quality problems (incomplete, incorrect)
- Bias: geographic location → natural distribution of social deprivation
UK Biobank:
- Nationwide data
- 140K MLTC (Multiple Long-Term Conditions) participants
- Complete set of multiple deprivation indicators available (*)
(*) Townsend deprivation index at recruitment, Index of Multiple Deprivation (England, Scotland, Wales),
plus education score and other socio-demographic indices. Distribution across the population is documented on the UKBB site.

III - Cost of health data
Retrospective: integration/harmonisation, curation, cleaning
Prospective: cost of cohort recruitment, data collection, data processing
Acquisition + processing cost by data type, from low to high:
- Routinely collected clinical variables (GP test)
- Tests requiring specialist labs
- Proteomics
- Genotyping (a few genes)
- Whole exome sequencing
- Whole genome sequencing

Example: NAFLD dataset
N= 9,449
However:
Clinical: 8,745
GWAS: 2,216
miRNA: 183
RNASeq: 461
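
The gap between the headline cohort size and the per-modality counts can be made explicit with a short calculation. The sketch below only uses the counts listed above; the function names are illustrative and not part of any project code.

```python
# Record counts per data modality, copied from the slide.
COHORT_SIZE = 9_449
MODALITY_COUNTS = {
    "Clinical": 8_745,
    "GWAS": 2_216,
    "miRNA": 183,
    "RNASeq": 461,
}


def coverage(counts, cohort_size):
    """Return the fraction of the cohort that has data for each modality."""
    return {name: n / cohort_size for name, n in counts.items()}


def missing_fraction(counts, cohort_size):
    """Return the fraction of the cohort lacking each modality."""
    return {name: 1.0 - frac for name, frac in coverage(counts, cohort_size).items()}


if __name__ == "__main__":
    for name, frac in coverage(MODALITY_COUNTS, COHORT_SIZE).items():
        print(f"{name:<8} {frac:6.1%} covered, {1 - frac:6.1%} missing")
```

Read this way, the omics modalities cover only a small fraction of the nominal cohort, which is what the completeness issue on the next slide refers to.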

Issue: Data completeness
Possible issue: 85%+ missing on 7 out of 9 Specialist variables

Benefit analysis for Extended and Specialist variables
response: At-Risk NASH

Challenge: making the best of expensive features
Diagram labels: record sets RS1, RS2 (rows); feature sets FS1, FS2 (columns)
FS1: core features; FS2: extended features
- FS1 available on entire cohort
- FS2 only available on a subset
Training set 1: (RS1+RS2, FS1)
Training set 2: (RS1, FS1+FS2)
How do we leverage a model learnt using Training set 1
to improve a model learnt from Training set 2?

Cost of data: the imbalance problem
Example: physical monitoring
- Everyday fitness: cost low, abundance high
- Pathological conditions (e.g. cognitive impairment): cost high, abundance low
Consequence: class imbalance in classification tasks

IV - Size: Big Data for Health Care
Genomics for
personalized medicine
Article Source: Big Data: Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195

Size: Do I always need the full granularity?
Dataset size = <Data point size> x <Number of data points>
→ How much information do I lose by downsampling?
Genomics:
- 1 exome: >1 billion data points (base pairs) x N exomes. No downsampling
Medical imaging:
- 1 image = N pixels
- Some downsampling may be acceptable
Sensor data (e.g. accelerometers):
- Do I need 10 Hz or 100 Hz?
- Typically very noisy
Feature engineering vs representation learning
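
The "10 Hz or 100 Hz?" question can be explored empirically. The following is a minimal sketch, assuming block averaging as the downsampling method and reconstruction RMSE as a crude proxy for lost information; both choices, the function names, and the synthetic trace are illustrative assumptions rather than the method used in the studies cited in this talk.

```python
import math


def downsample_mean(signal, factor):
    """Average consecutive blocks of `factor` samples (e.g. 100 Hz -> 10 Hz with factor=10)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    n_blocks = len(signal) // factor  # an incomplete trailing block is dropped
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


def reconstruction_rmse(signal, factor):
    """RMSE between the signal and its block means held constant over each block."""
    low = downsample_mean(signal, factor)
    usable = len(low) * factor
    if usable == 0:
        return 0.0
    squared = [(signal[i] - low[i // factor]) ** 2 for i in range(usable)]
    return math.sqrt(sum(squared) / usable)


if __name__ == "__main__":
    rate_hz = 100
    # Hypothetical 2-second trace: slow 1 Hz movement plus a weaker 30 Hz component.
    trace = [
        math.sin(2 * math.pi * 1 * t / rate_hz)
        + 0.2 * math.sin(2 * math.pi * 30 * t / rate_hz)
        for t in range(2 * rate_hz)
    ]
    for factor in (1, 2, 10):
        kept = len(downsample_mean(trace, factor))
        print(factor, kept, round(reconstruction_rmse(trace, factor), 3))
```

Whether the discarded high-frequency content matters depends on the downstream task, which is why the choice between hand-crafted features and learnt representations is listed alongside the sampling question.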

Case study: Using activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam, B; Catt, M; Cassidy, S; Bacardit, J; Darke, P; Butterfield, S; Alshabrawy, O; Trenell, M; and Missier, P, Using wearable activity trackers
to predict Type-2 Diabetes: A machine learning-based cross-sectional study of the UK Biobank accelerometer cohort. JMIR Diabetes.
January 2021. http://doi.org/10.2196/23364
Pipeline: Feature extraction → Clustering → Classification

Cohort selection funnel (diagram):
- All UK Biobank participants: 502,664
- Filter: accelerometry study? 103,712
- Split criteria: Type 2 Diabetes?
  - T2D: 4,076 in total (at baseline: 2,755; through EHR analysis: 1,321)
  - Non-diabetes: 99,636
- Filter: EHR data available? 19,852
- Filter: QC on activity traces. Positives: 3,103
- Physical impairment analysis: severe impairment 1,666; no impairment 8,463
- Comparisons: T2D vs Norm-0, T2D vs Norm-1
Your (biomedical) dataset may not be as big as it looks.
A great UG project!

V - Data governance issues: the emerging UK landscape
https://www.goldacrereview.org/
Build a small number of Trusted Research Environments, avoiding duplication
Promote culture of reuse of code (curation pipelines, analytics)
- “Reproducible Analytical Pipelines”, a set of best practices
- Promote high quality, shared, reviewable, re-usable, well-documented code for
standardized data curation and analysis
- Promote transparency, avoid black box analysis
Adopt single governance rules for integrated data access
- Rationalise approvals: create one map of all approval processes
Build appropriate capabilities:
- Train academic researchers and NHS analysts in computational data science
techniques

Data engineering for healthcare data
My Smart Age with HIV (MySAwH)
- International multi-center prospective study
- Aimed at studying and monitoring healthy aging in People Living with HIV (PLWH)
- Data from routine clinical assessments and innovative PROs, collected through mobile and wearable devices;
- Retrospective studies
- Focus on the hospital resource management and clinical decision-making problems that emerged during the COVID-19
pandemic
Mandreoli, Federica, Davide Ferrari, Veronica Guidetti, Federico Motta, and Paolo Missier. ‘Real-World Data Mining Meets Clinical Practice: Research
Challenges and Perspective’. Frontiers in Big Data 5 (2022). https://doi.org/10.3389/fdata.2022.1021621.
Ferrari, D., Guaraldi, G., Mandreoli, F., Martoglia, R., Milić, J., and Missier, P. (2020a). “Data-driven vs. knowledge-driven inference of health outcomes in the
ageing population: a case study,” in Proceedings of the Workshops of the EDBT-ICDT Joint Conference, Vol. 2578.
Ferrari, D., Mandreoli, F., Guaraldi, G., Milić, J., and Missier, P. (2020b). “Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from
Northern Italy,” in Proceedings of the 1st International Advances in Artificial Intelligence for Healthcare Workshop, Vol. 2820, (Santiago de Compostela), 32–38.
Ferrari, D., Milić, J., Mussini, C., Mandreoli, F., Missier, P., Guaraldi, G., et al. (2020c). Machine learning in predicting respiratory failure in patients with COVID-19
pneumonia–Challenges, strengths, and opportunities in a global health emergency. PLoS ONE 15:e0239172. doi: 10.1371/journal.pone.0239172
Mandreoli, F., Motta, F., and Missier, P. (2021). “An HMM-ensemble approach to predict severity progression of ICU treatment for hospitalized COVID-19
patients,” in 20th IEEE International Conference on Machine Learning and Applications (Pasadena, CA), 1299–1306. doi: 10.1109/ICMLA52953.2021.00211

Issues requiring Data Engineering
Recurring data issues
Data-driven, AI-based clinical practice: experiences, challenges, and research directions
DATA SPARSITY AND SCARCITY
• EHR: irregular collections of time series
• Imputation is not always possible
DATA IMBALANCE
• Predicting rare events can be a priority
• No downsampling option
DATA INCONSISTENCY AND INSTABILITY
• Retrospective data are often a source of inconsistency, and their schemas are unstable
NOT ALL ERRORS ARE EQUALLY WRONG
• In high-stakes domains, a bias towards one type of error is sometimes preferable
HUMAN-IN-THE-LOOP
• Explanations engender trust in the models
• Trust should include not only the clinician but also the patient

Data sizes, data issues by study

Sparsity/ scarcity, imbalance
Classifiers are not resilient to class imbalance:
- Models will be biased towards predicting
majority class regardless of the input features
- Will struggle to generalise correctly on the
minority class
- In clinical datasets, data scarcity/sparsity often
conspires with data imbalance
- Imbalance is very common in medical datasets
Typical mitigation:
- Downsample the majority class → lose training examples
- Upsample the minority class → SMOTE (Synthetic Minority Oversampling Technique)
When modelling processes, these mitigations do not work:
- We used Hidden Markov Models (HMMs) to predict oxygen-therapy state transitions
- However, intubation is an infrequent state (and so is “death”)
- This makes it difficult to accurately learn probability distributions
- Mandreoli, Motta, and Missier (2021) propose a novel, generic ensemble technique to mitigate the imbalance problem in HMMs
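
For the two typical mitigations named above (not the HMM case), a minimal sketch follows. It uses plain random duplication and random subsetting rather than SMOTE, which instead synthesises new minority points by interpolating between neighbours; the function names and the seed parameter are illustrative assumptions.

```python
import random
from collections import Counter


def split_by_class(samples, labels):
    """Group samples by their class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    out_x, out_y = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_x.extend(group + extra)
        out_y.extend([label] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    out_x, out_y = [], []
    for label, group in groups.items():
        out_x.extend(rng.sample(group, target))
        out_y.extend([label] * target)
    return out_x, out_y


if __name__ == "__main__":
    # Hypothetical cohort: 95 controls (label 0) and 5 cases (label 1).
    x = list(range(100))
    y = [0] * 95 + [1] * 5
    print(Counter(random_oversample(x, y)[1]))
    print(Counter(random_undersample(x, y)[1]))
```

Both functions treat samples as independent rows, which is one way to see why they do not carry over to state-transition models: resampling individual observations would break the sequences from which transition probabilities are estimated.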

Instability
Retrospective studies are often unstable:
Data acquisition and management practices may change over time, following changes in
- Clinical practices
- Public policy
- Hospital resources
- Data collection technologies
In our COVID dataset:
- Clinical tests vary daily depending on the patient’s condition
- Scientific evidence for the need of certain tests changed rapidly
- Example: new biomarkers like interleukin-6 were introduced “mid flight”
- Thus earlier study datasets completely miss this variable

Translational challenge: Not all errors are equally wrong
- In high-stakes domains, prediction errors are not symmetric:
- Typically, underestimating risk is less desirable than overestimating it
- Standard model performance metrics (eg AUC, F1 etc) fail to capture this distinction
Cost-sensitive learning (cf eg [1,2,3])
- Introduce an explicit penalty for misclassifying samples
- Note that cost-sensitive methods can sometimes deal with imbalanced datasets without
altering the original data distribution [4]
[1] Lomax, S., and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Comput. Surveys 45, 1–35. doi: 10.1145/2431211.2431215
[2] Wang, H., Cui, Z., Chen, Y., Avidan, M., Abdallah, A. B., and Kronzer, A. (2018). Predicting hospital readmission via cost-sensitive deep learning. ACM Trans.
Comput. Biol. Bioinformatics 15, 1968–1978. doi: 10.1109/TCBB.2018.2827029
[3] Freitas, A., Costa-Pereira, A., and Brazdil, P. (2007). “Cost-sensitive decision trees applied to medical data,” in Data Warehousing and Knowledge Discovery
(Regensburg), 303–312. doi: 10.1007/978-3-540-74553-2_28
[4] Mienye, I. D., and Sun, Y. (2021). Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlock.
25:100690. doi: 10.1016/j.imu.2021.100690
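
One simple way to act on asymmetric error costs is to move the decision threshold of an existing risk model. The sketch below is an illustration under assumptions: the cost values, scores, and function names are hypothetical, and threshold selection is only one of several cost-sensitive techniques covered in [1-4].

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Sum misclassification costs when predicting positive for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 0:
            cost += cost_fn  # missed high-risk patient (risk underestimated)
        elif label == 0 and predicted == 1:
            cost += cost_fp  # false alarm (risk overestimated)
    return cost


def best_threshold(scores, labels, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost on the given data."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp))


def bayes_threshold(cost_fn, cost_fp):
    """Cost-minimising threshold when scores are well-calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)


if __name__ == "__main__":
    # Hypothetical risk scores and outcomes for five patients.
    scores = [0.1, 0.3, 0.4, 0.6, 0.8]
    labels = [0, 1, 0, 1, 1]
    # Assumed costs: missing a case is five times worse than a false alarm.
    print(best_threshold(scores, labels, cost_fn=5, cost_fp=1))
    print(bayes_threshold(cost_fn=5, cost_fp=1))
```

Raising the false-negative cost pushes the threshold down, so more patients are flagged as at risk; AUC is unchanged by this shift, which is why it cannot express the preference.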

Translational challenge: human-in-the-loop AI
• Essential in medical AI
• Evidence of performance is not enough
• Black-box AI not acceptable in clinical practice
“Explanation gap”:
From technical explanations:
• Non-linear [1] and Deep Learning [2] models
• Shapley values [3]
• Interpretable ML [4,5]
To expert involvement in the learning process:
- By accepting/rejecting predictions
- By expressing preference for a given error type
Causal Machine Learning (CML) [6,7]:
- Visualisation and reasoning over complex clinical scenarios
- Counterfactuals, what-if scenarios
Also importantly:
Patient and Public Involvement (PPI) is essential in publicly funded clinical research

References on explainability and human-in-the-loop
[1] Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., et al. (2020). From local explanations to global
understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67. doi: 10.1038/s42256-019-0138-9
[2] Singh, A., Sengupta, S., and Lakshminarayanan, V. (2020). Explainable deep learning models in medical image analysis. J. Imaging
6:52. doi: 10.3390/jimaging6060052
[3] Lundberg, S. M., and Lee, S. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information
Processing Systems, Vol. 30. (Long Beach, CA).
[4] Ahmad, M. A., Eckert, C., and Teredesai, A. (2018). “Interpretable machine learning in healthcare,” in Proceedings of the ACM
International Conference on Bioinformatics, Computational Biology, and Health Informatics (Washington, DC), 559–560. doi:
10.1145/3233547.3233667
[5] Abdullah, T. A. A., Zahid, M. S. M., and Ali, W. (2021). A review of interpretable ML in healthcare: taxonomy, applications,
challenges, and future directions. Symmetry 13:2439. doi: 10.3390/sym13122439
[6] Oneto, L., and Chiappa, S. (2020). “Fairness in machine learning,” in Recent Trends in Learning From Data: Tutorials from the INNS
Big Data and Deep Learning Conference (Sestri Levante, Genova), 155–196. doi: 10.1007/978-3-030-43883-8_7
[7] Sanchez, P., Voisey, J. P., Xia, T., Watson, H. I., O’Neil, A. Q., and Tsaftaris, S. A. (2022). Causal machine learning for healthcare and
precision medicine. R. Soc. Open Sci. 9:220638. doi: 10.1098/rsos.220638

EHR data: Traditional statistics and machine learning methods
Discovering patterns of multimorbid conditions and relationships between multimorbidities,
Socio-Demographic Factors, Health-Related Quality of Life, mortality
• See Systematic review [1]
• Clustering methods, differing in their proximity measures. Also, patient clusters vs disease
clusters?
• Other specific methods:
• Latent Class Analysis, Cox Regression models, SVM, Cox proportional hazard model,
Random Forests [2,3,4,5,6]
• Multilevel logistic regression for longitudinal analysis [7]
[1] Ng, Shu Kay, Richard Tawiah, Michael Sawyer, and Paul Scuffham. ‘Patterns of Multimorbid Health Conditions: A Systematic Review of Analytical Methods and
Comparison Analysis.’ International Journal of Epidemiology 47, no. 5 (1 October 2018): 1687–1704. https://doi.org/10.1093/ije/dyy134.

EHR data: Traditional methods – references
[2] Larsen, Finn Breinholt, Marie Hauge Pedersen, Karina Friis, Charlotte Glümer, and Mathias Lasgaard. ‘A Latent Class Analysis of
Multimorbidity and the Relationship to Socio-Demographic Factors and Health-Related Quality of Life. A National Population-Based Study
of 162,283 Danish Adults.’ PloS One 12, no. 1 (2017): e0169426. https://doi.org/10.1371/journal.pone.0169426.
[3] Jani, Bhautesh Dinesh, Peter Hanlon, Barbara I. Nicholl, Ross McQueenie, Katie I. Gallacher, Duncan Lee, and Frances S. Mair. ‘Relationship
between Multimorbidity, Demographic Factors and Mortality: Findings from the UK Biobank Cohort’. BMC Medicine 17, no. 1 (10 April 2019):
74. https://doi.org/10.1186/s12916-019-1305-x.
[4] Whitson, Heather E., Kimberly S. Johnson, Richard Sloane, Christine T. Cigolle, Carl F. Pieper, Lawrence Landerman, and Susan N. Hastings.
‘Identifying Patterns of Multimorbidity in Older Americans: Application of Latent Class Analysis.’ Journal of the American Geriatrics Society 64,
no. 8 (August 2016): 1668–73. https://doi.org/10.1111/jgs.14201.
[5] Zemedikun, Dawit T., Laura J. Gray, Kamlesh Khunti, Melanie J. Davies, and Nafeesa N. Dhalwani. ‘Patterns of Multimorbidity in Middle-
Aged and Older Adults: An Analysis of the UK Biobank Data.’ Mayo Clinic Proceedings 93, no. 7 (July 2018): 857–66.
https://doi.org/10.1016/j.mayocp.2018.02.012.
[6] Zhu, Yajing, Duncan Edwards, Jonathan Mant, Rupert A. Payne, and Steven Kiddle. ‘Characteristics, Service Use and Mortality of Clusters of
Multimorbid Patients in England: A Population-Based Study.’ BMC Medicine 18, no. 1 (10 April 2020): 78. https://doi.org/10.1186/s12916-020-01543-8.
[7] Ashworth, Mark, Stevo Durbaba, David Whitney, James Crompton, Michael Wright, and Hiten Dodhia. ‘Journey to Multimorbidity:
Longitudinal Analysis Exploring Cardiovascular Risk Factors and Sociodemographic Determinants in an Urban Setting.’ BMJ Open 9, no. 12 (23
December 2019): e031649. https://doi.org/10.1136/bmjopen-2019-031649.

EHR data: Deep Learning
Recent survey on DNN methods for EHR-based modelling [1]:
- DNNs fully exploit the longitudinal nature of EHRs
- Useful to predict outcomes where patient history is a relevant predictor
Target outcomes:
- Future disease given medical history
- Unplanned readmission
- Disease progression
- Specific complications, e.g. heart failure, cataract
- Patient mortality
State-of-the-art methods (before 2020):
- eNRBM [2]
- Deep Patient [3]
- Deepr [4]
- RETAIN [5]
Summary of results:
- The methods are competitive
- Achieving AUC > 0.8 on each of the outcomes above
[1] Ayala Solares, Jose Roberto, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, et al. ‘Deep
Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Architectures’. Journal of Biomedical Informatics 101 (1 January 2020):
103337. https://doi.org/10.1016/j.jbi.2019.103337.

EHR data: Deep Learning -- References
[2] T. Tran, T. D. Nguyen, D. Phung, S. Venkatesh, Learning vector representation of medical objects via EMR-driven nonnegative restricted
Boltzmann machines (eNRBM), Journal of Biomedical Informatics 54 (2015) 96 – 105. doi:https://doi.org/10.1016/j.jbi.2015.01.012.
[3] Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health
Records. Sci Rep. 2016 May 17;6:26094. doi: 10.1038/srep26094. PMID: 27185194; PMCID: PMC4869115.
[4] P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: A convolutional net for medical records, IEEE Journal of Biomedical and Health
Informatics 21 (1) (2017) 22–30. doi:10.1109/JBHI.2016.2633963.
[5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, RETAIN: An Interpretable Predictive Model for Healthcare using Reverse
Time Attention Mechanism, in: Advances in Neural Information Processing Systems, 2016, pp. 3504–3512.

A different approach: predicting target disease clusters

IEEE BigData 2022
Can we model the likelihood of next disease?
Experimental models exist for:
- Modelling disease progression [1]
- Discovering clinical pathway patterns [2]
- Predicting next disease(s) [3]
However, they are not very robust, nor actually deployed in practice
[1] Wang, Xiang, David Sontag, and Fei Wang. ‘Unsupervised Learning of Disease Progression Models’. In Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 85–94. KDD ’14. New York, NY, USA, 2014 https://doi.org/10.1145/2623330.2623754.
[2] Huang, Zhengxing, Wei Dong, Lei Ji, Chenxi Gan, Xudong Lu, and Huilong Duan. ‘Discovery of Clinical Pathway Patterns from Event Logs Using Probabilistic Topic
Models’. Journal of Biomedical Informatics 47 (1 February 2014): 39–57. https://doi.org/10.1016/j.jbi.2013.09.003.
[3] Men, Lu, Noyan Ilk, Xinlin Tang, and Yuan Liu. ‘Multi-Disease Prediction Using LSTM Recurrent Neural Networks’. Expert Systems with Applications 177 (1
September 2021): 114905. https://doi.org/10.1016/j.eswa.2021.114905.
Our goal: use Electronic Health Records (diagnosis event logs) to predict patients’
long-term associations to a specific disease cluster

Research hypothesis
It is possible to identify clusters of diseases that:
1. Are described using disease terms that are familiar to health domain experts
2. Are clinically significant → based on expert validation
3. Admit a quantitative association of individual patients with each of the clusters
What we hope to find:
1. A significant majority of patients are stable relative to the clustering
2. Stability emerges early in their medical history (*)
(*) limited to LTCs

Contributions
• We use Topic Modelling as a form of semantic clustering
• Topics are defined by ranked lists of disease terms
• We define a cluster’s gravitational pull: patients are differently attracted by each
cluster at different points in time
• We propose a quantitative measure of stability with respect to clusters over time
• We study how stability increases as timelines progress

Dynamic patient-cluster associations: examples
Patient 1 topic 1 topic 2 topic 3 topic 4
OA 0.20 0.21 0.07 0.04
skin ulcer 1.29 0.21 0.07 0.05
dermatitis 1.76 0.21 0.07 0.05
erectile dysfunction 1.76 0.68 0.07 0.51
Primary malignancy skin 2.14 1.46 0.07 0.51
spondylosis 0.31 0.37 0.26 0.00
obesity 0.45 0.50 0.34 0.50
urine incontinence 0.75 0.50 1.06 0.51
female genital prolapse 1.22 0.51 1.75 0.51
type 2 diabetes 1.22 0.51 1.75 1.43
unspecified rare diabetes 1.22 0.51 1.75 2.56
fracture of the hip 1.23 2.62 1.75 2.57
dermatitis 0.47 0.00 0.00 0.00
hypertension 0.55 0.16 0.00 0.16
atrial fibrillation 0.55 1.41 0.00 0.16
OA 0.75 1.62 0.07 0.20
tinnitus 1.24 2.02 0.33 0.20
PTSD 0.22 0.00 0.68 0.00
COPD 0.54 0.94 0.69 0.08
Neuromuscular dysfunction of bladder 0.54 1.26 1.87 0.09
OA 1.21 1.47 2.63 0.12
Peripheral venous and lymphatic disease 0.59 0.24 0.07 0.00
psoriasis 1.07 0.45 0.07 0.55
CHD 1.54 1.11 0.76 0.84
Alcohol dependence 1.74 1.59 0.76 1.25
obesity 1.88 1.72 0.84 1.75
hearing loss 2.21 2.12 0.85 1.76
urine incontinence 2.51 2.12 1.57 1.76
asthma 0.80 0.00 0.00 0.00
hypertension 0.88 0.16 0.00 0.16
hearing loss 1.21 0.56 0.01 0.17
Alcohol dependence 1.41 1.04 0.01 0.58
Patient: disease term sequence T = [t1, t2, …, tn]
Patient stages: T1 = [t1], T2 = [t1, t2], …, Tn = T = [t1, t2, …, tn]
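
The stage-by-stage rows in the table can be reproduced with a small sketch. It assumes that a patient's association with each topic at stage Tk is the running sum of per-term topic weights over t1..tk; this additive reading is consistent with the row differences in the table (e.g. "dermatitis" and "OA" contribute the same increments wherever they occur) but is an assumption, not a statement of the authors' exact method. The weights in `EXAMPLE_WEIGHTS` are back-computed from those row differences, and all names are illustrative.

```python
# Per-term topic weights, back-computed from row differences in the table above.
EXAMPLE_WEIGHTS = {
    "dermatitis": [0.47, 0.00, 0.00, 0.00],
    "hypertension": [0.08, 0.16, 0.00, 0.16],
    "atrial fibrillation": [0.00, 1.25, 0.00, 0.00],
    "OA": [0.20, 0.21, 0.07, 0.04],
    "tinnitus": [0.49, 0.40, 0.26, 0.00],
}


def stages(terms):
    """Return the prefixes T1 = [t1], T2 = [t1, t2], ..., Tn = T."""
    return [terms[:k] for k in range(1, len(terms) + 1)]


def stage_associations(terms, term_topic_weights, n_topics):
    """Accumulate per-term topic weights over each stage (assumed additive)."""
    running = [0.0] * n_topics
    rows = []
    for term in terms:
        weights = term_topic_weights.get(term, [0.0] * n_topics)
        running = [r + w for r, w in zip(running, weights)]
        rows.append(list(running))
    return rows


def dominant_topic(row):
    """Index of the topic with the strongest association at one stage."""
    return max(range(len(row)), key=lambda i: row[i])


if __name__ == "__main__":
    history = ["dermatitis", "hypertension", "atrial fibrillation", "OA", "tinnitus"]
    for stage, row in zip(stages(history), stage_associations(history, EXAMPLE_WEIGHTS, 4)):
        print(stage[-1], [round(v, 2) for v in row], "dominant topic:", dominant_topic(row) + 1)
```

Tracking how the dominant topic changes, or stops changing, from one stage to the next is one informal way to think about stability; the quantitative stability measure proposed in the work itself is not reproduced here.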

Challenge: synthetic data generation for specialized data types
Self-monitoring contains potentially useful signal to anticipate specific conditions
- But data heavily imbalanced towards healthy controls
- Case data points harder to collect
Can we use the available “seed” true data points to generate new synthetic and plausible ones?
Specifically: physical activity data → general problem of time-series data generation
Challenge:
Existing GAN / TimeGAN approaches insufficient
- Hard to scale
- Require very strong signal

Key messages
• The weaknesses are in the data not in the models!
• The need for data integration + curation + engineering dominates the need for size
• Investments driven by “health crisis”
• Mental (dementia, Parkinson’s)
• Physical: multimorbidity in older population
• Focus on EHR:
• Good advances in using AI to draw insights from EHR, but data quality is a big barrier
AI for HealthCare: great opportunities for impactful research,
but many challenges remain
