Paolo Missier
School of Computing
Newcastle University, UK
Comsys 2022
IIT Ropar, India
(online presentation)
Delivering on the promise of data-driven healthcare:
trade-offs, challenges, and research perspectives
2
<event
name>
Outline
• AI for HealthCare: a convergence of needs and opportunities
• A complex multifaceted landscape
• Data engineering for healthcare data: intrinsic and translational requirements
• Extracting actionable knowledge from EHRs
• Recent work
• Some Challenges
The promise of data-driven medicine and healthcare
Predictive, Preventative, Personalised, Participatory: a systems biology perspective on the future of
medicine and health care
Hood L, Heath JR, Phelps ME, Lin B. Systems biology and new technologies enable predictive and preventative medicine. Science. 2004;306(5696):640–643.
Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century through systems approaches. Biotechnol J. 2012;7(8):992–1001. Provides an overview of the science and
technological foundations of predictive, preventive, personalized and participatory healthcare
Flores M, Glusman G, Brogaard K, Price ND, Hood L. P4 medicine: how systems medicine will transform the healthcare sector and society. Per Med. 2013;10(6):565-576. doi:
10.2217/pme.13.57. PMID: 25342952; PMCID: PMC4204402.
Schmidt, Charlie. ‘Leroy Hood Looks Forward to P4 Medicine: Predictive, Personalized, Preventive, and Participatory’. JNCI Journal of the National Cancer Institute 106, no. 12
(December 2014): dju416–dju416. https://doi.org/10.1093/jnci/dju416.
[1] Sagner, M, A McNeil, P Puska, and R Arena. ‘The P4 Health Spectrum – A Predictive, Preventive, Personalized and Participatory Continuum for Promoting Healthspan’.
Progress in Cardiovascular Diseases 59, no. 5 (2017): 506–21. https://doi.org/10.1016/j.pcad.2016.08.002.
A new approach in medicine that is predictive, preventive, personalized and participatory, which we
label here as “P4” holds great promise to reduce the burden of chronic diseases by harnessing
technology and an increasingly better understanding of environment-biology interactions, evidence-
based interventions and the underlying mechanisms of chronic diseases. [1]
Five pillars of P4 medicine
Pillar 1
■ Cutting-edge technologies for generating data regarding multiple dimensions of each person's experience of
health and disease.
Pillar 2
■ A digital infrastructure linking participating discovery science and clinical institutions, as well as
patients/consumers.
Pillar 3
■ Personalized data clouds providing information about multiple dimensions of each individual's unique dynamic
experience of health and disease ranging from the molecular to the social. These data will include genetic and
phenotypic characteristics, medical history, demographics and other sociometrics.
Pillar 4
■ New analytic techniques and technologies for deriving actionable knowledge from the data.
Pillar 5
■ Systems biology models for understanding the unique health status of each individual in terms of dynamic
network states that can be manipulated by cost-effective strategies
Source: [1]
A convergence of needs and opportunities
P4 Data-driven Healthcare, at the convergence of:
Personal self-monitoring devices
- Medical grade → Consumer grade
Health Data Science and Engineering
- Operations → Research
- ML, AI Methods
- Scalable computing
Governance, consent, secure data access
- Privacy (eg GDPR)
- Opt-in vs opt-out
- Trusted Research Environments
(Big) Health Data
- Bigger == more useful?
The data-to-actions loop
[Diagram: a loop connecting Monitoring and Clinical testing; Data Engineering; Predictive Analytics / AI; Personalised Predictions; Prevention and interventions.]
Understanding the facets of Health data
Which data types?
• Clinical
• Lifestyle, social
Where do datasets come from?
• Prospective vs retrospective
How much do they cost?
• Acquisition
• Curation, annotation
How large?
• Small vs Big Health Data
Who can use it and how?
• Governance
• Protection
Data Science and Engineering
Benefits to patients
I. Which data? Capturing individuals’ complexity
Primary care records:
- Clinical tests / GP notes, diagnoses / Prescriptions
Secondary care records:
- hospital admission / diagnoses / operations / prescriptions
Multi-omics data:
- genotypes, exomes, genomes.
- Transcriptomics, proteomics
Digital Health:
- Data streams from wearable and environment sensors,
self-monitoring
Socio-demographics:
- Area of residence, family, social deprivation
Example: UK Biobank
[Figure: records linked by participant id (eid), with up to 20 years of records. Panels: baseline assessment; GP events; prescriptions; hospital events (HESIN diagnoses, operations), used to determine admission / re-admission patterns. Counts shown: N = 240,000; N = 500,000; 57,698,505; 123,644,445.]
II. Prospective vs retrospective datasets
Prospective: defined for research purposes
✓ Stable and predictable
✓ Follow protocol
✓ Research ready
✓ Potentially well-curated
✓ Bias known a priori
✗ Expensive
✗ Not very reusable
✗ Scarce
Retrospective: typically operational data
→ Potentially more reusable
→ Natural bias (reflects natural cohort locality)
✗ Generally not research ready
✗ Require data engineering
Example:
Clinical Practice Research Datalink
- Data collected from UK GP practices
- 60+ million patients
- (also prospective)
Prospective datasets:
Example: UK Biobank
- 500,000 volunteer participants
- General health information
- Genotypes and whole genomes
- Selected internal organ imaging study (100K)
- Bias: 40+ years, geographic / social bias
Example: LITMUS
Retrospective + prospective data collection project
• EU IMI2 project
• Collecting data across Centres (EU + USA) on Non-Alcoholic Fatty
Liver Disease (NAFLD) and NASH (liver steatosis, fibrosis, cirrhosis)
https://litmus-project.eu/litmus-partners/
Phase 1a:
- Retrospective data collected from hospital datasets
- Around 10,000 patients
- Varying degrees of quality / completeness
- Central curation required
Phase 1b:
- Prospective data from active recruitment
- Around 2,000 patients
- Omics data more abundant
Discovery + validation experimental design
Actual design will depend on dataset characteristics
Ex. “Explore relationship between social deprivation and mortality rate in MLTC population”
An ideal study will include both Prospective and Retrospective datasets
UK Biobank → machine-learning friendly → modelling → discovery of candidate associations
Regional dataset → validation dataset
Regional UK dataset:
- 50K actual patients
- Data availability depending on operational systems
- Likely data quality problems (incomplete, incorrect)
- Bias: geographic location → natural distribution of social deprivation
UK Biobank:
- Nationwide data
- 140K MLTC participants
- Complete set of multiple deprivation indicators available (*)
(*) Townsend deprivation index at recruitment, Index of Multiple Deprivation (England, Scotland, Wales),
plus education score and other socio-demographic indices. Distribution across the population is documented on the UKBB site.
III - Cost of health data
Retrospective: integration/harmonisation, curation, cleaning
Prospective: cost of cohort recruitment, data collection, data processing
Acquisition + processing cost by data type, from low to high:
- Routinely collected clinical variables (GP test)
- Tests requiring specialist labs; Proteomics; Genotyping (a few genes)
- Whole exome sequencing
- Whole genome sequencing
Example: NAFLD dataset
N= 9,449
However:
Clinical: 8,745
GWAS: 2,216
miRNA: 183
RNASeq: 461
Issue: Data completeness
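The completeness issue can be made concrete with a little arithmetic on the counts above. This is a minimal sketch; the function name and dictionary layout are illustrative assumptions, only the counts come from the slide.

```python
# Counts copied from the slide; the dictionary keys are just labels.
COHORT_SIZE = 9_449
AVAILABLE = {"Clinical": 8_745, "GWAS": 2_216, "miRNA": 183, "RNASeq": 461}


def completeness(available, cohort_size):
    """Return the percentage of the cohort that has each data type."""
    return {name: round(100 * n / cohort_size, 1) for name, n in available.items()}


coverage = completeness(AVAILABLE, COHORT_SIZE)
for name, pct in coverage.items():
    print(f"{name:8s} {pct:5.1f}% of cohort")
# Clinical  92.5%, GWAS 23.5%, miRNA 1.9%, RNASeq 4.9%
```

Under these counts, any model that requires all four data types at once can only be trained on a small fraction of the nominal cohort.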
Possible issue: 85%+ missing on 7 out of 9 Specialist variables
Benefit analysis for Extended and Specialist variables
response: At-Risk NASH
Challenge: making the best of expensive features
FS1: core features; FS2: extended features
FS1 is available on the entire cohort (RS1 + RS2)
FS2 is only available on a subset (RS1)
Training set 1: (RS1+RS2, FS1)
Training set 2: (RS1, FS1+FS2)
How do we leverage a model learnt using Training set 1
to improve a model learnt from Training set 2?
Cost of data: the imbalance problem
Example: Physical Monitoring

             Everyday fitness    Pathological conditions (eg cognitive impairment)
Cost         Low                 High
Abundance    High                Low

Consequence: class imbalance in classification tasks
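Why imbalance matters can be seen with a trivial baseline. The sketch below uses a hypothetical cohort of 950 controls and 50 cases (made-up numbers, not from any study in this talk) and a "model" that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually detects."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical cohort: 950 healthy controls (0) and 50 cases (1).
y_true = [0] * 950 + [1] * 50
# Baseline that ignores its input and always predicts "healthy".
y_majority = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_majority)
baseline_recall = recall(y_true, y_majority)
print(baseline_accuracy)  # 0.95
print(baseline_recall)    # 0.0
```

The baseline looks accurate while missing every case, which is why accuracy alone is a poor guide on imbalanced clinical data.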
IV - Size: Big Data for Health Care
Genomics for
personalized medicine
Article Source: Big Data: Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195
Size: Do I always need the full granularity?
Dataset size = <Data point size> x <Number of data points>
→ how much information do I lose by downsampling?
Genomics:
1 exome: >1 Billion data points (base pairs) x N exomes. No downsampling
Medical imaging:
1 image = N pixels
Some downsampling may be acceptable
Sensor data (eg accelerometers)
- Do I need 10 Hz or 100 Hz?
- Typically very noisy
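The 10 Hz vs 100 Hz question can be explored with a simple block-averaging downsampler. This is a sketch under stated assumptions: the signal is a synthetic sine wave, not real accelerometer data, and block averaging is only a crude low-pass filter; a real pipeline would normally use a proper anti-aliasing filter.

```python
import math


def downsample_mean(signal, factor):
    """Average consecutive blocks of `factor` samples (a trailing partial block is dropped)."""
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


# Synthetic example: a 1 Hz sine wave sampled at 100 Hz for 2 seconds.
FS_HIGH = 100
signal_100hz = [math.sin(2 * math.pi * t / FS_HIGH) for t in range(2 * FS_HIGH)]

# Reduce to 10 Hz: one averaged value per block of 10 samples.
signal_10hz = downsample_mean(signal_100hz, factor=10)
print(len(signal_100hz), "->", len(signal_10hz), "samples")
```

Comparing features extracted from both versions is one way to judge how much information the lower rate loses for a given task.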
Feature engineering
vs
Representation learning
Case study: Using activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam, B; Catt, M; Cassidy, S; Bacardit, J; Darke, P; Butterfield, S; Alshabrawy, O; Trenell, M; and Missier, P, Using wearable activity trackers
to predict Type-2 Diabetes: A machine learning-based cross-sectional study of the UK Biobank accelerometer cohort. JMIR Diabetes.
January 2021. http://doi.org/10.2196/23364
Feature
extraction
Clustering
Classification
Cohort selection (flow diagram):
All UK Biobank participants: 502,664
Filter: Accelerometry study? 103,712
Split criteria: Type 2 Diabetes?
- T2D at baseline: 2,755
- T2D through EHR analysis: 1,321
- T2D total: 4,076
- Non-Diabetes: 99,636
Filter: EHR data available? 19,852
Filter: QC on activity traces. Positives: 3,103
Physical Impairment analysis:
- Severe impairment: 1,666
- No impairment: 8,463
Comparisons: T2D vs Norm-0, T2D vs Norm-1
A great UG project!
Your (biomedical) dataset may not be as big as it looks
V - Data governance issues: the emerging UK landscape
https://www.goldacrereview.org/
Build a small number of Trusted Research Environments, avoiding duplication
Promote culture of reuse of code (curation pipelines, analytics)
- “Reproducible Analytical Pipelines”, a set of best practices
- Promote high quality, shared, reviewable, re-usable, well-documented code for
standardized data curation and analysis
- Promote transparency, avoid black box analysis
Adopt single governance rules for integrated data access
- Rationalise approvals: create one map of all approval processes
Build appropriate capabilities:
- Train academic researchers and NHS analysts in computational data science
techniques
Data engineering for healthcare data
My Smart Age with HIV (MySAwH)
- International multi-center prospective study
- Aimed at studying and monitoring healthy aging in People Living with HIV (PLWH)
- Data from routine clinical assessments and innovative patient-reported outcomes (PROs), collected through mobile and wearable devices
- retrospective studies
- focus on the hospital resource management and clinical decision-making problems that emerged during the COVID-19 pandemic
Mandreoli, Federica, Davide Ferrari, Veronica Guidetti, Federico Motta, and Paolo Missier. ‘Real-World Data Mining Meets Clinical Practice: Research
Challenges and Perspective’. Frontiers in Big Data 5 (2022). https://doi.org/10.3389/fdata.2022.1021621.
Ferrari, D., Guaraldi, G., Mandreoli, F., Martoglia, R., Milić, J., and Missier, P. (2020a). “Data-driven vs. knowledge-driven inference of health outcomes in the
ageing population: a case study,” in Proceedings of the Workshops of the EDBT-ICDT Joint Conference, Vol. 2578.
Ferrari, D., Mandreoli, F., Guaraldi, G., Milić, J., and Missier, P. (2020b). “Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from
Northern Italy,” in Proceedings of the 1st International Advances in Artificial Intelligence for Healthcare Workshop, Vol. 2820, (Santiago de Compostela), 32–38.
Ferrari, D., Milić, J., Mussini, C., Mandreoli, F., Missier, P., Guaraldi, G., et al. (2020c). Machine learning in predicting respiratory failure in patients with COVID-19
pneumonia–Challenges, strengths, and opportunities in a global health emergency. PLoS ONE 15:e239172. doi: 10.1371/journal.pone.0239172
Mandreoli, F., Motta, F., and Missier, P. (2021). “An HMM-ensemble approach to predict severity progression of ICU treatment for hospitalized COVID-19
patients,” in 20th IEEE International Conference on Machine Learning and Applications (Pasadena, CA), 1299–1306. doi: 10.1109/ICMLA52953.2021.00211
Issues requiring Data Engineering
Recurring data issues
Data–driven, AI–based clinical practice: experiences, challenges, and research directions
DATA SPARSITY AND SCARCITY
• EHR: irregular collections of time series
• Imputation is not always possible
DATA IMBALANCE
• Predicting rare events can be a priority
• No downsampling option
DATA INCONSISTENCY and INSTABILITY
• Retrospective data are often a source of inconsistency, and their schemas are unstable
NOT ALL ERRORS ARE EQUALLY WRONG
• In high-stakes domains, a bias towards one type of error is sometimes preferable
HUMAN-IN-THE-LOOP
• Explanations engender trust in the models
• Trust should include not only the clinician but also the patient
Data sizes, data issues by study
Sparsity/ scarcity, imbalance
Classifiers are not resilient to class imbalance:
- Models will be biased towards predicting
majority class regardless of the input features
- Will struggle to generalise correctly on the
minority class
- In clinical datasets, data scarcity/sparsity often
conspires with data imbalance
- Imbalance is very common in medical datasets
Typical mitigation:
- Downsample the majority class → lose training examples
- Upsample the minority class → SMOTE (Synthetic Minority Oversampling Technique)
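The two mitigations can be sketched in a few lines of plain Python. This is an illustrative, simplified version written for this note: `smote_like` follows the basic SMOTE idea (interpolating between a minority point and one of its nearest minority neighbours) but is not the reference implementation, and the two-dimensional data are synthetic.

```python
import random


def undersample_majority(majority, n, rng):
    """Keep a random subset of n majority-class points (the rest are discarded)."""
    return rng.sample(majority, n)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a random minority
    point and one of its k nearest minority neighbours."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


rng = random.Random(0)
# Synthetic 2-D data: 200 majority points, 10 minority points.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

kept = undersample_majority(majority, len(minority), rng)  # 10 vs 10
synthetic = smote_like(minority, n_new=190, k=3, rng=rng)  # 200 vs 10 + 190
print(len(kept), len(synthetic))
```

Both routines treat samples as independent points, which is one reason they do not transfer directly to sequences of states.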
When modelling processes, these mitigations do not work
We used Hidden Markov Models (HMMs) to predict oxygen-therapy state-transitions
However, intubation is an infrequent state (and so is “death”)
This made it difficult to accurately learn probability distributions.
[1] proposes a novel, generic ensemble technique to mitigate the imbalance problem in HMMs
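A simplified illustration of why rare states are hard to learn: the sketch below estimates transition probabilities for a fully observed Markov chain, which is a deliberate simplification and neither an HMM nor the ensemble technique of [1]. The state labels and the sequences are hypothetical, and additive smoothing is an assumption added here.

```python
from collections import Counter, defaultdict


def transition_counts(sequences):
    """Count observed state-to-state transitions across all sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return counts


def transition_probs(counts, states, alpha=1.0):
    """Row-normalised transition estimates with additive smoothing `alpha`."""
    probs = {}
    for src in states:
        total = sum(counts[src].values()) + alpha * len(states)
        probs[src] = {dst: (counts[src][dst] + alpha) / total for dst in states}
    return probs


# Hypothetical therapy-state sequences; "intubation" is deliberately rare.
STATES = ["oxygen", "niv", "intubation"]
sequences = [
    ["oxygen", "oxygen", "niv", "oxygen"],
    ["oxygen", "niv", "niv", "oxygen"],
    ["oxygen", "oxygen", "oxygen"],
    ["oxygen", "niv", "intubation", "niv"],
]

counts = transition_counts(sequences)
support = {s: sum(counts[s].values()) for s in STATES}
probs = transition_probs(counts, STATES)
print(support)  # {'oxygen': 6, 'niv': 4, 'intubation': 1}
```

With a single observed transition out of the rare state, its estimated row is driven almost entirely by the smoothing prior rather than by data.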
Instability
Retrospective studies are often unstable:
Data acquisition and management practices may change over time, following changes in
- Clinical practices
- Public policy
- Hospital resources
- Data collection technologies
- In our COVID dataset clinical tests vary daily depending on the patient’s
condition
- Scientific evidence for the need of certain tests changed rapidly
- Example: new biomarkers like interleukin-6 were introduced in “mid flight”
- Thus earlier study datasets completely miss this variable
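Schema drift of this kind can be made explicit when batches are merged. The sketch below is an assumption-laden illustration: the record layout, the column names (`crp`, `il6`) and all values are made up, and the choice shown is simply to mark a variable that was never collected as missing rather than to impute it.

```python
def harmonise(records, columns):
    """Project every record onto a common column list; absent variables become None."""
    return [{col: rec.get(col) for col in columns} for rec in records]


def missing_rate(rows, column):
    """Fraction of rows in which `column` was never recorded."""
    return sum(row[column] is None for row in rows) / len(rows)


# Hypothetical batches: the later one adds interleukin-6 ("il6").
early_batch = [
    {"patient": "a", "crp": 12.0},
    {"patient": "b", "crp": 40.5},
]
late_batch = [
    {"patient": "c", "crp": 8.1, "il6": 35.0},
    {"patient": "d", "crp": 22.3, "il6": 90.2},
]

COLUMNS = ["patient", "crp", "il6"]
rows = harmonise(early_batch + late_batch, COLUMNS)
il6_missing = missing_rate(rows, "il6")
print(il6_missing)  # 0.5
```

Here the missingness depends on when a patient was admitted, not on the patient, which is why simple imputation can be misleading.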
Translational challenge: Not all errors are equally wrong
- In high-stakes domains, prediction errors are not symmetric:
- Typically, underestimating risk is less desirable than overestimating it
- Standard model performance metrics (eg AUC, F1 etc) fail to capture this distinction
Cost-sensitive learning (cf eg [1,2,3])
- Introduce an explicit penalty for misclassifying samples
- Note that cost-sensitive methods can sometimes deal with imbalanced datasets without
altering the original data distribution [4]
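One simple form of cost-sensitivity is to move the decision threshold rather than retrain the model. The sketch below assumes a calibrated classifier, zero cost for correct decisions, and made-up costs and predictions; it is not the method of any of the cited papers. Predicting "positive" has expected cost (1 - p) * cost_fp, predicting "negative" has expected cost p * cost_fn, so the positive decision is cheaper when p >= cost_fp / (cost_fp + cost_fn).

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimises expected cost for a calibrated classifier."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, p_pos, threshold, cost_fp, cost_fn):
    """Sum misclassification costs when predicting positive at p >= threshold."""
    cost = 0.0
    for y, p in zip(y_true, p_pos):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp   # false alarm
        elif pred == 0 and y == 1:
            cost += cost_fn   # missed case
    return cost


# Made-up labels and predicted risks, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
p_pos = [0.05, 0.10, 0.30, 0.45, 0.25, 0.60, 0.15, 0.40]

# Assumption: missing a case is five times worse than a false alarm.
COST_FP, COST_FN = 1.0, 5.0

default_cost = total_cost(y_true, p_pos, 0.5, COST_FP, COST_FN)
tuned_threshold = cost_threshold(COST_FP, COST_FN)  # 1/6
tuned_cost = total_cost(y_true, p_pos, tuned_threshold, COST_FP, COST_FN)
print(default_cost, tuned_cost)  # 10.0 2.0
```

The lower threshold accepts two false alarms in exchange for catching the two cases missed at 0.5, which reflects the stated preference for overestimating risk.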
[1] Lomax, S., and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Comput. Surveys 45, 1–35. doi: 10.1145/2431211.2431215
[2] Wang, H., Cui, Z., Chen, Y., Avidan, M., Abdallah, A. B., and Kronzer, A. (2018). Predicting hospital readmission via cost-sensitive deep learning. ACM Trans.
Comput. Biol. Bioinformatics 15, 1968–1978. doi: 10.1109/TCBB.2018.2827029
[3] Freitas, A., Costa-Pereira, A., and Brazdil, P. (2007). “Cost-sensitive decision trees applied to medical data,” in Data Warehousing and Knowledge Discovery
(Regensburg), 303–312. doi: 10.1007/978-3-540-74553-2_28
[4] Mienye, I. D., and Sun, Y. (2021). Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlock.
25:100690. doi: 10.1016/j.imu.2021.100690
Translational challenge: human-in-the-loop AI
• Essential in medical AI
• Evidence of performance is not enough
• Black-box AI not acceptable in clinical practice
From technical explanations:
• non-linear [1] and Deep Learning [2] models
• Shapley values [3]
• Interpretable ML [4,5]
Also importantly:
Patient and Public Involvement (PPI) is essential in publicly funded clinical research
“Explanation gap”:
To expert involvement in the learning process:
- by accepting/rejecting predictions
- By expressing preference for a given error type
Causal Machine Learning (CML) [6,7]:
- Visualisation and reasoning over complex clinical scenarios
- Counterfactuals, what-if scenarios
References on explainability and human-in-the-loop
[1] Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., et al. (2020). From local explanations to global
understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67. doi: 10.1038/s42256-019-0138-9
[2] Singh, A., Sengupta, S., and Lakshminarayanan, V. (2020). Explainable deep learning models in medical image analysis. J. Imaging
6:52. doi: 10.3390/jimaging6060052
[3] Lundberg, S. M., and Lee, S. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information
Processing Systems, Vol. 30. (Long Beach, CA).
[4] Ahmad, M. A., Eckert, C., and Teredesai, A. (2018). “Interpretable machine learning in healthcare,” in Proceedings of the ACM
International Conference on Bioinformatics, Computational Biology, and Health Informatics (Washington, DC), 559–560. doi:
10.1145/3233547.3233667
[5] Abdullah, T. A. A., Zahid, M. S. M., and Ali, W. (2021). A review of interpretable ML in healthcare: taxonomy, applications,
challenges, and future directions. Symmetry 13:2439. doi: 10.3390/sym13122439
[6] Oneto, L., and Chiappa, S. (2020). “Fairness in machine learning,” in Recent Trends in Learning From Data: Tutorials from the INNS
Big Data and Deep Learning Conference (Sestri Levante, Genova), 155–196. doi: 10.1007/978-3-030-43883-8_7
[7] Sanchez, P., Voisey, J. P., Xia, T., Watson, H. I., O’Neil, A. Q., and Tsaftaris, S. A. (2022). Causal machine learning for healthcare and
precision medicine. R. Soc. Open Sci. 9:220638. doi: 10.1098/rsos.220638
EHR data: Traditional statistics and machine learning methods
Discovering patterns of multimorbid conditions and relationships between multimorbidities,
Socio-Demographic Factors, Health-Related Quality of Life, mortality
• See Systematic review [1]
• Clustering methods → but differing in proximity measures. Also, patient clusters vs disease clusters?
• Other specific methods:
• Latent Class Analysis, Cox Regression models, SVM, Cox proportional hazard model,
Random Forests [2,3,4,5,6]
• Multilevel logistic regression for longitudinal analysis [7]
[1] Ng, Shu Kay, Richard Tawiah, Michael Sawyer, and Paul Scuffham. ‘Patterns of Multimorbid Health Conditions: A Systematic Review of Analytical Methods and
Comparison Analysis.’ International Journal of Epidemiology 47, no. 5 (1 October 2018): 1687–1704. https://doi.org/10.1093/ije/dyy134.
EHR data: Traditional methods – references
[2] Larsen, Finn Breinholt, Marie Hauge Pedersen, Karina Friis, Charlotte Glümer, and Mathias Lasgaard. ‘A Latent Class Analysis of
Multimorbidity and the Relationship to Socio-Demographic Factors and Health-Related Quality of Life. A National Population-Based Study
of 162,283 Danish Adults.’ PloS One 12, no. 1 (2017): e0169426. https://doi.org/10.1371/journal.pone.0169426.
[3] Jani, Bhautesh Dinesh, Peter Hanlon, Barbara I. Nicholl, Ross McQueenie, Katie I. Gallacher, Duncan Lee, and Frances S. Mair. ‘Relationship
between Multimorbidity, Demographic Factors and Mortality: Findings from the UK Biobank Cohort’. BMC Medicine 17, no. 1 (10 April 2019):
74. https://doi.org/10.1186/s12916-019-1305-x.
[4] Whitson, Heather E., Kimberly S. Johnson, Richard Sloane, Christine T. Cigolle, Carl F. Pieper, Lawrence Landerman, and Susan N. Hastings.
‘Identifying Patterns of Multimorbidity in Older Americans: Application of Latent Class Analysis.’ Journal of the American Geriatrics Society 64,
no. 8 (August 2016): 1668–73. https://doi.org/10.1111/jgs.14201.
[5] Zemedikun, Dawit T., Laura J. Gray, Kamlesh Khunti, Melanie J. Davies, and Nafeesa N. Dhalwani. ‘Patterns of Multimorbidity in Middle-
Aged and Older Adults: An Analysis of the UK Biobank Data.’ Mayo Clinic Proceedings 93, no. 7 (July 2018): 857–66.
https://doi.org/10.1016/j.mayocp.2018.02.012.
[6] Zhu, Yajing, Duncan Edwards, Jonathan Mant, Rupert A. Payne, and Steven Kiddle. ‘Characteristics, Service Use and Mortality of Clusters of
Multimorbid Patients in England: A Population-Based Study.’ BMC Medicine 18, no. 1 (10 April 2020): 78.
https://doi.org/10.1186/s12916-020-01543-8.
[7] Ashworth, Mark, Stevo Durbaba, David Whitney, James Crompton, Michael Wright, and Hiten Dodhia. ‘Journey to Multimorbidity:
Longitudinal Analysis Exploring Cardiovascular Risk Factors and Sociodemographic Determinants in an Urban Setting.’ BMJ Open 9, no. 12 (23
December 2019): e031649. https://doi.org/10.1136/bmjopen-2019-031649.
EHR data: Deep Learning
Recent survey on DNN methods for EHR-based modelling [1]:
- DNNs fully exploit the longitudinal nature of EHRs
- Useful to predict outcomes where patient history is a relevant predictor
Target outcomes:
- Future disease given medical history
- Unplanned readmission
- Disease progression
- Specific complication, eg heart failure, cataract
- Patient mortality
State-of-the-art methods (before 2020):
- eNRBM [2]
- Deep Patient [3]
- Deepr [4]
- RETAIN [5]
Summary of results:
- The methods are competitive
- Achieving AUC > 0.8 on each of the target outcomes
[1] Ayala Solares, Jose Roberto, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, et al. ‘Deep
Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Architectures’. Journal of Biomedical Informatics 101 (1 January 2020):
103337. https://doi.org/10.1016/j.jbi.2019.103337.
EHR data: Deep Learning -- References
[2] T. Tran, T. D. Nguyen, D. Phung, S. Venkatesh, Learning vector representation of medical objects via EMR-driven nonnegative restricted
Boltzmann machines (eNRBM), Journal of Biomedical Informatics 54 (2015) 96 – 105. doi:https://doi.org/10.1016/j.jbi.2015.01.012.
[3] Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health
Records. Sci Rep. 2016 May 17;6:26094. doi: 10.1038/srep26094. PMID: 27185194; PMCID: PMC4869115.
[4] P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: A convolutional net for medical records, IEEE Journal of Biomedical and Health
Informatics 21 (1) (2017) 22–30. doi:10.1109/JBHI.2016.2633963.
[5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, RETAIN: An Interpretable Predictive Model for Healthcare using Reverse
Time Attention Mechanism, in: Advances in Neural Information Processing Systems, 2016, pp. 3504–3512.
A different approach: predicting target disease clusters
IEEE BigData 2022
Can we model the likelihood of next disease?
Experimental models exist for
- Modelling disease progression [1]
- Discovering clinical pathway patterns [2]
- Predicting next disease(s) [3]
However, these are not very robust, nor actually deployed in practice
[1] Wang, Xiang, David Sontag, and Fei Wang. ‘Unsupervised Learning of Disease Progression Models’. In Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 85–94. KDD ’14. New York, NY, USA, 2014 https://doi.org/10.1145/2623330.2623754.
[2] Huang, Zhengxing, Wei Dong, Lei Ji, Chenxi Gan, Xudong Lu, and Huilong Duan. ‘Discovery of Clinical Pathway Patterns from Event Logs Using Probabilistic Topic
Models’. Journal of Biomedical Informatics 47 (1 February 2014): 39–57. https://doi.org/10.1016/j.jbi.2013.09.003.
[3] Men, Lu, Noyan Ilk, Xinlin Tang, and Yuan Liu. ‘Multi-Disease Prediction Using LSTM Recurrent Neural Networks’. Expert Systems with Applications 177 (1
September 2021): 114905. https://doi.org/10.1016/j.eswa.2021.114905.
Our goal: use Electronic Health Records (diagnosis event logs) to predict patients’
long-term associations to a specific disease cluster
Research hypothesis
It is possible to identify clusters of diseases that:
1. Are described using disease terms that are familiar to health domain experts
2. Are clinically significant → based on expert validation
3. Admit a quantitative association of individual patients with each of the clusters
What we hope to find:
1. A significant majority of patients are stable relative to the clustering
2. Stability emerges early in their medical history (*)
(*) limited to LTCs
Contributions
• We use Topic Modelling as a form of semantic clustering
• Topics are defined by ranked lists of disease terms
• We define a cluster’s gravitational pull: patients are differently attracted by each
cluster at different points in time
• We propose a quantitative measure of stability with respect to clusters over time
• We study how stability increases as timelines progress
Dynamic patient-cluster associations: examples

Patient 1                                 topic 1  topic 2  topic 3  topic 4
OA                                           0.20     0.21     0.07     0.04
skin ulcer                                   1.29     0.21     0.07     0.05
dermatitis                                   1.76     0.21     0.07     0.05
erectile dysfunction                         1.76     0.68     0.07     0.51
Primary malignancy skin                      2.14     1.46     0.07     0.51

Patient 2                                 topic 1  topic 2  topic 3  topic 4
spondylosis                                  0.31     0.37     0.26     0.00
obesity                                      0.45     0.50     0.34     0.50
urine incontinence                           0.75     0.50     1.06     0.51
female genital prolapse                      1.22     0.51     1.75     0.51
type 2 diabetes                              1.22     0.51     1.75     1.43
unspecified rare diabetes                    1.22     0.51     1.75     2.56
fracture of the hip                          1.23     2.62     1.75     2.57

Patient 3                                 topic 1  topic 2  topic 3  topic 4
dermatitis                                   0.47     0.00     0.00     0.00
hypertension                                 0.55     0.16     0.00     0.16
atrial fibrillation                          0.55     1.41     0.00     0.16
OA                                           0.75     1.62     0.07     0.20
tinnitus                                     1.24     2.02     0.33     0.20

Patient 4                                 topic 1  topic 2  topic 3  topic 4
PTSD                                         0.22     0.00     0.68     0.00
COPD                                         0.54     0.94     0.69     0.08
Neuromuscular dysfunction of bladder         0.54     1.26     1.87     0.09
female genital prolapse                      1.01     1.26     2.55     0.09
OA                                           1.21     1.47     2.63     0.12

Patient 5                                 topic 1  topic 2  topic 3  topic 4
Peripheral venous and lymphatic disease      0.59     0.24     0.07     0.00
psoriasis                                    1.07     0.45     0.07     0.55
female genital prolapse                      1.54     0.45     0.76     0.55
CHD                                          1.54     1.11     0.76     0.84
Alcohol dependence                           1.74     1.59     0.76     1.25
obesity                                      1.88     1.72     0.84     1.75
hearing loss                                 2.21     2.12     0.85     1.76
urine incontinence                           2.51     2.12     1.57     1.76

Patient 6                                 topic 1  topic 2  topic 3  topic 4
asthma                                       0.80     0.00     0.00     0.00
hypertension                                 0.88     0.16     0.00     0.16
hearing loss                                 1.21     0.56     0.01     0.17
Alcohol dependence                           1.41     1.04     0.01     0.58

Patient: disease terms sequence T = [t1, t2, …, tn]
Patient stages: T1 = [t1], T2 = [t1, t2], …, Tn = T = [t1, t2, …, tn]
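One way to read the tables above is to track, stage by stage, which topic a patient is most strongly associated with. The sketch below is an assumption made for illustration, not the paper's definition of gravitational pull or stability: it takes each row as the patient's association with the four topics after that stage, picks the dominant topic per stage, and reports the earliest stage from which the dominant topic no longer changes. The values are the Patient 3 rows.

```python
def dominant_topics(stage_rows):
    """0-based index of the highest-scoring topic at each patient stage."""
    return [max(range(len(row)), key=row.__getitem__) for row in stage_rows]


def stable_from(dominant):
    """Earliest stage (0-based) from which the dominant topic never changes again."""
    final = dominant[-1]
    stage = len(dominant) - 1
    while stage > 0 and dominant[stage - 1] == final:
        stage -= 1
    return stage


# Patient 3, one row per stage (dermatitis, hypertension, atrial fibrillation, OA, tinnitus).
patient3 = [
    (0.47, 0.00, 0.00, 0.00),
    (0.55, 0.16, 0.00, 0.16),
    (0.55, 1.41, 0.00, 0.16),
    (0.75, 1.62, 0.07, 0.20),
    (1.24, 2.02, 0.33, 0.20),
]

dominant = dominant_topics(patient3)
first_stable_stage = stable_from(dominant)
print(dominant)            # [0, 0, 1, 1, 1] -> topic 1, topic 1, topic 2, topic 2, topic 2
print(first_stable_stage)  # 2 -> stable from the third term onwards
```

Under this reading, Patient 3 switches from topic 1 to topic 2 at the third diagnosis and stays there.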
Challenge: making the best of expensive features
FS1: core features; FS2: extended features
FS1 is available on the entire cohort (RS1 + RS2)
FS2 is only available on a subset (RS1)
Training set 1: (RS1+RS2, FS1)
Training set 2: (RS1, FS1+FS2)
How do we leverage a model learnt using Training set 1
to improve a model learnt from Training set 2?
Challenge: synthetic data generation for specialized data types
Self-monitoring data contain potentially useful signals to anticipate specific conditions
- But the data are heavily imbalanced towards healthy controls
- Case data points are harder to collect
Can we use the available “seed” true data points to generate new synthetic and plausible ones?
Specifically: physical activity data → general problem of time-series data generation
Challenge:
Existing GAN / TimeGAN approaches are insufficient
- Hard to scale
- Require a very strong signal
Key messages
• The weaknesses are in the data not in the models!
• The need for data integration + curation + engineering dominates the need for size
• Investments driven by “health crisis”
• Mental (dementia, Parkinson’s)
• Physical: multimorbidity in older population
• Focus on EHR:
• Good advances in using AI to draw insights from EHR, but data quality is a big barrier
AI for HealthCare: great opportunities for impactful research,
but many challenges remain

Health IT Conference – iHT2
 
Fighting Neurodegenerative Diseases
Fighting Neurodegenerative DiseasesFighting Neurodegenerative Diseases
Fighting Neurodegenerative Diseases
InsideScientific
 
Implementing Clinical Decision
Implementing Clinical DecisionImplementing Clinical Decision
Implementing Clinical Decision
CMDLMS
 
Are we ready for disruption in Translational Research through Digital Medicine?
Are we ready for disruption in Translational Research through Digital Medicine?Are we ready for disruption in Translational Research through Digital Medicine?
Are we ready for disruption in Translational Research through Digital Medicine?
Ashish Atreja, MD, MPH
 

Delivering on the promise of data-driven healthcare: trade-offs, challenges, and research perspective

  • 1. Paolo Missier, School of Computing, Newcastle University, UK. Comsys 2022, IIT Ropar, India (online presentation). Delivering on the promise of data-driven healthcare: trade-offs, challenges, and research perspectives
  • 2. Outline • AI for HealthCare: a convergence of needs and opportunities • A complex multifaceted landscape • Data engineering for healthcare data: intrinsic and translational requirements • Extracting actionable knowledge from EHRs • Recent work • Some Challenges
  • 3. The promise of data-driven medicine and healthcare. Predictive, Preventative, Personalised, Participatory: a systems biology perspective on the future of medicine and health care. Hood L, Heath JR, Phelps ME, Lin B. Systems biology and new technologies enable predictive and preventative medicine. Science. 2004;306(5696):640–643. Hood L, Balling R, Auffray C. Revolutionizing medicine in the 21st century through systems approaches. Biotechnol J. 2012;7(8):992–1001. Provides an overview of the science and technological foundations of predictive, preventive, personalized and participatory healthcare. Flores M, Glusman G, Brogaard K, Price ND, Hood L. P4 medicine: how systems medicine will transform the healthcare sector and society. Per Med. 2013;10(6):565-576. doi: 10.2217/pme.13.57. PMID: 25342952; PMCID: PMC4204402. Schmidt, Charlie. ‘Leroy Hood Looks Forward to P4 Medicine: Predictive, Personalized, Preventive, and Participatory’. JNCI Journal of the National Cancer Institute 106, no. 12 (December 2014): dju416–dju416. https://doi.org/10.1093/jnci/dju416. [1] Sagner, M, A McNeil, P Puska, and R Arena. ‘The P4 Health Spectrum – A Predictive, Preventive, Personalized and Participatory Continuum for Promoting Healthspan’. Progress in Cardiovascular Diseases 59, no. 5 (2017): 506–21. https://doi.org/10.1016/j.pcad.2016.08.002. A new approach in medicine that is predictive, preventive, personalized and participatory, which we label here as “P4” holds great promise to reduce the burden of chronic diseases by harnessing technology and an increasingly better understanding of environment-biology interactions, evidence-based interventions and the underlying mechanisms of chronic diseases. [1]
  • 4. Five pillars of P4 medicine. Pillar 1 ■ Cutting-edge technologies for generating data regarding multiple dimensions of each person's experience of health and disease. Pillar 2 ■ A digital infrastructure linking participating discovery science and clinical institutions, as well as patients/consumers. Pillar 3 ■ Personalized data clouds providing information about multiple dimensions of each individual's unique dynamic experience of health and disease ranging from the molecular to the social. These data will include genetic and phenotypic characteristics, medical history, demographics and other sociometrics. Pillar 4 ■ New analytic techniques and technologies for deriving actionable knowledge from the data. Pillar 5 ■ Systems biology models for understanding the unique health status of each individual in terms of dynamic network states that can be manipulated by cost-effective strategies. Source: [1]
  • 5. Outline • AI for HealthCare: a convergence of needs and opportunities • A complex multifaceted landscape • Data engineering for healthcare data: intrinsic and translational requirements • Extracting actionable knowledge from EHRs • Recent work • Some Challenges
  • 6. A convergence of needs and opportunities. P4 Data-driven Healthcare Personal self-monitoring devices Health Data Science and Engineering Governance, consent Secure data access (Big) Health Data - Operations → Research - ML, AI Methods - Scalable computing Medical grade → Consumer grade - Privacy (e.g. GDPR) - Opt-in vs opt-out - Trusted Research Environments Bigger == more useful?
  • 7. The data-to-actions loop
      [Diagram] Monitoring, clinical testing → data engineering → predictive analytics / AI → personalised predictions → prevention, interventions → back to monitoring
  • 9. Understanding the facets of Health data
      - Which data types? Clinical; lifestyle, social
      - Where do datasets come from? Prospective vs retrospective
      - How much do they cost? Acquisition; curation, annotation
      - How large? Small vs Big Health Data
      - Who can use it and how? Governance; protection
      [Diagram labels: Data Science and Engineering; benefits to patients]
  • 10. I. Which data? Capturing individuals’ complexity
      - Primary care records: clinical tests, GP notes, diagnoses, prescriptions
      - Secondary care records: hospital admissions, diagnoses, operations, prescriptions
      - Multi-omics data: genotypes, exomes, genomes; transcriptomics, proteomics
      - Digital health: data streams from wearable and environment sensors, self-monitoring
      - Socio-demographics: area of residence, family, social deprivation
  • 11. Example: UK Biobank
      [Schema diagram: tables linked by participant identifier (eid)] Baseline assessment; GP events; prescriptions; hospital events (HESIN), used to determine admission/re-admission patterns; diagnoses; operations. Up to 20 years of records.
      Figures shown on the diagram: N = 240,000; N = 500,000; 57,698,505; 123,644,445.
  • 12. II. Prospective vs retrospective datasets
      Prospective: defined for research purposes
      ✓ Stable and predictable
      ✓ Follow protocol
      ✓ Research ready
      ✓ Potentially well-curated
      ✓ Bias known a priori
      ✗ Expensive
      ✗ Not very reusable
      ✗ Scarce
      Retrospective: typically operational data
      - Potentially more reusable
      - Natural bias (reflects natural cohort locality)
      ✗ Generally not research ready
      ✗ Require data engineering
      Example (retrospective): Clinical Practice Research Datalink
      - Data collected from UK GP practices
      - 60+ million patients
      - (also prospective)
      Example (prospective): UK Biobank
      - 500,000 volunteer participants
      - General health information
      - Genotypes and whole genomes
      - Selected internal organ imaging study (100K)
      - Bias: 40+ years, geographic / social bias
  • 13. Example: LITMUS, a retrospective + prospective data collection project
      - EU IMI2 project
      - Collecting data across centres (EU + USA) on Non-Alcoholic Fatty Liver Disease (NAFLD) and NASH (liver steatosis, fibrosis, cirrhosis)
      - https://litmus-project.eu/litmus-partners/
      Phase 1a:
      - Retrospective data collected from hospital datasets
      - Around 10,000 patients
      - Varying degrees of quality / completeness
      - Central curation required
      Phase 1b:
      - Prospective data from active recruitment
      - Around 2,000 patients
      - Omics data more abundant
  • 14. Discovery + validation experimental design
      The actual design will depend on dataset characteristics. Example: “Explore relationship between social deprivation and mortality rate in MLTC population”.
      An ideal study will include both prospective and retrospective datasets:
      - UK Biobank → machine-learning friendly → modelling → discovery of candidate associations
      - Regional dataset → validation dataset
      UK Biobank:
      - Nationwide data
      - 140K MLTC participants
      - Complete set of multiple deprivation indicators available (*)
      Regional UK dataset:
      - 50K actual patients
      - Data availability depending on operational systems
      - Likely data quality problems (incomplete, incorrect)
      - Bias: geographic location → natural distribution of social deprivation
      (*) Townsend deprivation index at recruitment, Index of Multiple Deprivation (England, Scotland, Wales), plus education score and other socio-demographic indices. Distribution across the population is documented on the UKBB site.
  • 15. III. Cost of health data
      Retrospective: integration/harmonisation, curation, cleaning
      Prospective: cost of cohort recruitment, data collection, data processing
      Acquisition + processing cost by data type, from low to high:
      - Routinely collected clinical variables (GP test)
      - Tests requiring specialist labs
      - Proteomics
      - Genotyping (a few genes)
      - Whole exome sequencing
      - Whole genome sequencing
  • 16. Example: NAFLD dataset
      N = 9,449. However, availability by data type: Clinical: 8,745; GWAS: 2,216; miRNA: 183; RNASeq: 461
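The gap between the headline cohort size and what is actually available per data type can be made concrete with a few lines of arithmetic. The sketch below only uses the counts quoted on the slide; the dictionary layout and the `coverage` helper are illustrative names, not part of the LITMUS tooling.

```python
# Counts copied from the slide; names and layout are illustrative assumptions.
COHORT_SIZE = 9449
available = {"Clinical": 8745, "GWAS": 2216, "miRNA": 183, "RNASeq": 461}


def coverage(counts, total):
    """Return the fraction of the cohort that has each data type."""
    return {name: n / total for name, n in counts.items()}


cov = coverage(available, COHORT_SIZE)
for name, frac in cov.items():
    print(f"{name:>8}: {frac:6.1%} of cohort")
```

Read this way, fewer than a quarter of participants have GWAS data and only a few percent have miRNA or RNASeq data, which is the point the slide makes with "However".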
  • 17. Issue: Data completeness. Possible issue: 85%+ missing on 7 out of 9 Specialist variables
  • 18. Benefit analysis for Extended and Specialist variables (response: At-Risk NASH)
  • 19. Challenge: making the best of expensive features
      - FS1: core features, available on the entire cohort (RS1 + RS2)
      - FS2: extended features, only available on a subset (RS1)
      - Training set 1: (RS1 + RS2, FS1)
      - Training set 2: (RS1, FS1 + FS2)
      How do we leverage a model learnt using Training set 1 to improve a model learnt from Training set 2?
  • 20. Cost of data: the imbalance problem
      Example: physical monitoring
      - Everyday fitness: cost low, abundance high
      - Pathological conditions (e.g. cognitive impairment): cost high, abundance low
      Consequence: class imbalance in classification tasks
  • 21. IV. Size: Big Data for Health Care
      Genomics for personalized medicine
      Source: Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
  • 22. Size: Do I always need the full granularity?
      Dataset size = <data point size> x <number of data points> → how much information do I lose by downsampling?
      - Genomics: 1 exome: >1 billion data points (base pairs) x N exomes. No downsampling.
      - Medical imaging: 1 image = N pixels. Some downsampling may be acceptable.
      - Sensor data (e.g. accelerometers): do I need 10 Hz or 100 Hz? Typically very noisy.
      Feature engineering vs representation learning
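The sensor-data question on the slide (10 Hz or 100 Hz?) can be illustrated with a small sketch. It downsamples a hypothetical 100 Hz trace to 10 Hz by block averaging, which is only one of several possible downsampling schemes and is not claimed to be the one used in the talk; `block_average`, `dataset_size` and the synthetic sine trace are all illustrative assumptions.

```python
import math


def block_average(signal, factor):
    """Downsample by averaging consecutive blocks of `factor` samples.

    A trailing incomplete block is dropped.
    """
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


def dataset_size(point_size, n_points):
    """Dataset size = data point size x number of data points."""
    return point_size * n_points


# Hypothetical trace: 2 seconds of a 1 Hz sine wave sampled at 100 Hz.
fs = 100
trace = [math.sin(2 * math.pi * t / fs) for t in range(2 * fs)]
low = block_average(trace, 10)  # now effectively 10 Hz

print(len(trace), "samples at 100 Hz ->", len(low), "samples at 10 Hz")
print("size ratio:", dataset_size(len(trace), 1) / dataset_size(len(low), 1))
```

Averaging also smooths high-frequency noise, so whether the lost detail matters depends on the downstream task, which is the trade-off the slide raises.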
  • 23. Case study: Using activity trackers to predict Type-2 Diabetes
      Objective: to determine the extent to which accelerometer traces can be used to distinguish individuals with Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
      Pipeline: feature extraction → clustering → classification
      Lam, B; Catt, M; Cassidy, S; Bacardit, J; Darke, P; Butterfield, S; Alshabrawy, O; Trenell, M; and Missier, P. Using wearable activity trackers to predict Type-2 Diabetes: A machine learning-based cross-sectional study of the UK Biobank accelerometer cohort. JMIR Diabetes. January 2021. http://doi.org/10.2196/23364
  • 24. Your (biomedical) dataset may not be as big as it looks
      [Cohort selection flowchart; labels as extracted]
      - All UK Biobank participants: 502,664
      - Filter: accelerometry study? 103,712
      - Split criterion: Type 2 Diabetes? At baseline: 2,755; through EHR analysis: 1,321; total: 4,076. Non-diabetes: 99,636
      - Filter: QC on activity traces: 3,103 (positives)
      - Filter: EHR data available? 19,852
      - Physical impairment analysis: severe impairment 1,666; no impairment 8,463
      - Comparisons: T2D vs Norm-0; T2D vs Norm-1
      A great UG project!
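The counts in the flowchart can be cross-checked with simple arithmetic. The sketch below uses only numbers from the slide; reading 3,103 as the T2D positives remaining after quality control is an assumption based on how the labels were extracted.

```python
# All counts are taken from the slide; variable names are illustrative.
ALL_UKBB = 502_664
ACCELEROMETRY = 103_712
T2D_BASELINE = 2_755
T2D_FROM_EHR = 1_321
T2D_AFTER_QC = 3_103  # assumption: positives left after QC on activity traces

t2d_total = T2D_BASELINE + T2D_FROM_EHR
non_diabetes = ACCELEROMETRY - t2d_total

print("T2D total:", t2d_total)        # matches the 4,076 on the slide
print("Non-diabetes:", non_diabetes)  # matches the 99,636 on the slide
print(f"Accelerometry share of UK Biobank: {ACCELEROMETRY / ALL_UKBB:.1%}")
print(f"Usable positives as share of UK Biobank: {T2D_AFTER_QC / ALL_UKBB:.2%}")
```

Starting from half a million participants, well under 1% end up as usable positive cases, which is the slide's warning about apparent dataset size.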
  • 25. V. Data governance issues: the emerging UK landscape
      https://www.goldacrereview.org/
      - Build a small number of Trusted Research Environments, avoiding duplication
      - Promote a culture of code reuse (curation pipelines, analytics):
        - “Reproducible Analytical Pipelines”, a set of best practices
        - Promote high-quality, shared, reviewable, re-usable, well-documented code for standardized data curation and analysis
        - Promote transparency, avoid black-box analysis
      - Adopt single governance rules for integrated data access:
        - Rationalise approvals: create one map of all approval processes
      - Build appropriate capabilities:
        - Train academic researchers and NHS analysts in computational data science techniques
  • 27. Data engineering for healthcare data
      My Smart Age with HIV (MySAwH):
      - International multi-center prospective study
      - Aimed at studying and monitoring healthy aging in People Living with HIV (PLWH)
      - Data from routine clinical assessments and innovative PROs, collected through mobile and wearable devices
      Retrospective studies:
      - Focus on the hospital resource management and clinical decision-making problems that emerged during the COVID-19 pandemic
      References:
      Mandreoli, Federica, Davide Ferrari, Veronica Guidetti, Federico Motta, and Paolo Missier. ‘Real-World Data Mining Meets Clinical Practice: Research Challenges and Perspective’. Frontiers in Big Data 5 (2022). https://doi.org/10.3389/fdata.2022.1021621.
      Ferrari, D., Guaraldi, G., Mandreoli, F., Martoglia, R., Milić, J., and Missier, P. (2020a). “Data-driven vs. knowledge-driven inference of health outcomes in the ageing population: a case study,” in Proceedings of the Workshops of the EDBT-ICDT Joint Conference, Vol. 2578.
      Ferrari, D., Mandreoli, F., Guaraldi, G., Milić, J., and Missier, P. (2020b). “Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from Northern Italy,” in Proceedings of the 1st International Advances in Artificial Intelligence for Healthcare Workshop, Vol. 2820, (Santiago de Compostela), 32–38.
      Ferrari, D., Milić, J., Mussini, C., Mandreoli, F., Missier, P., Guaraldi, G., et al. (2020c). Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia–Challenges, strengths, and opportunities in a global health emergency. PLoS ONE 15:e239172. doi: 10.1371/journal.pone.0239172
      Mandreoli, F., Motta, F., and Missier, P. (2021). “An HMM-ensemble approach to predict severity progression of ICU treatment for hospitalized COVID-19 patients,” in 20th IEEE International Conference on Machine Learning and Applications (Pasadena, CA), 1299–1306. doi: 10.1109/ICMLA52953.2021.00211
  • 28. Issues requiring Data Engineering: recurring data issues
      (Data-driven, AI-based clinical practice: experiences, challenges, and research directions)
      DATA SPARSITY AND SCARCITY
      • EHR: irregular collections of time series
      • Imputation is not always possible
      DATA IMBALANCE
      • Predicting rare events can be a priority
      • No downsampling option
      DATA INCONSISTENCY AND INSTABILITY
      • Retrospective data are often a source of inconsistency, and their schemas are unstable
      NOT ALL ERRORS ARE EQUALLY WRONG
      • In high-stakes domains a bias towards one type of error is sometimes preferable
      HUMAN-IN-THE-LOOP
      • Explanations engender trust in the models
      • Trust should include not only the clinician but also the patient
  • 30. Sparsity/scarcity, imbalance
      Classifiers are not resilient to class imbalance:
      - Models will be biased towards predicting the majority class regardless of the input features
      - They will struggle to generalise correctly on the minority class
      - In clinical datasets, data scarcity/sparsity often conspires with data imbalance
      - Imbalance is very common in medical datasets
      Typical mitigation:
      - Downsample the majority class → lose training examples
      - Upsample the minority class → SMOTE (Synthetic Minority Oversampling Technique)
      When modelling processes, these mitigations do not work:
      - We used Hidden Markov Models (HMMs) to predict oxygen-therapy state transitions
      - However, intubation is an infrequent state (and so is “death”), which made it difficult to learn the probability distributions accurately
      - [1] proposes a novel, generic ensemble technique to mitigate the imbalance problem in HMMs ([1] appears to be the HMM-ensemble paper by Mandreoli, Motta and Missier, 2021, listed under slide 27)
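The two typical mitigations named on the slide can be sketched in a few lines. This is a simplified illustration, not the method used in the talk: `undersample_majority` keeps a random subset of the majority class, and `smote_like` follows the SMOTE idea of interpolating between a minority point and a minority neighbour, but uses only the single nearest neighbour, whereas SMOTE proper picks among several nearest neighbours. All names and the toy data are assumptions.

```python
import random


def undersample_majority(majority, minority, rng):
    """Randomly keep as many majority rows as there are minority rows."""
    k = min(len(majority), len(minority))
    return rng.sample(majority, k)


def smote_like(minority, n_new, rng):
    """Create synthetic minority points by interpolation.

    Simplification of SMOTE: each synthetic point lies on the segment
    between a random minority point and its single nearest minority
    neighbour.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        j = min(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist2(base, minority[j]),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, minority[j]))
        )
    return synthetic


rng = random.Random(0)
majority = [(float(x), float(x % 3)) for x in range(20)]   # toy majority class
minority = [(1.0, 5.0), (2.0, 6.0), (4.0, 5.5)]            # toy minority class

kept = undersample_majority(majority, minority, rng)
new_points = smote_like(minority, len(majority) - len(minority), rng)
balanced_minority = minority + new_points
print(len(kept), "majority rows kept;", len(balanced_minority), "minority rows after oversampling")
```

Both operations treat rows as independent samples. That independence is lost when the data are state sequences, as in the HMM setting on the slide, which is one way to read the remark that these mitigations do not carry over to process modelling.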
  • 31. Instability
      Retrospective studies are often unstable: data acquisition and management practices may change over time, following changes in
      - Clinical practices
      - Public policy
      - Hospital resources
      - Data collection technologies
      In our COVID dataset:
      - Clinical tests vary daily depending on the patient’s condition
      - Scientific evidence for the need for certain tests changed rapidly
      - Example: new biomarkers like interleukin-6 were introduced “mid flight”
      - Thus earlier study datasets completely miss this variable
  • 32. Translational challenge: Not all errors are equally wrong
      In high-stakes domains, prediction errors are not symmetric:
      - Typically, underestimating risk is less desirable than overestimating it
      - Standard model performance metrics (e.g. AUC, F1) fail to capture this distinction
      Cost-sensitive learning (cf. e.g. [1,2,3]):
      - Introduce an explicit penalty for misclassifying samples
      - Note that cost-sensitive methods can sometimes deal with imbalanced datasets without altering the original data distribution [4]
      [1] Lomax, S., and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Comput. Surveys 45, 1–35. doi: 10.1145/2431211.2431215
      [2] Wang, H., Cui, Z., Chen, Y., Avidan, M., Abdallah, A. B., and Kronzer, A. (2018). Predicting hospital readmission via cost-sensitive deep learning. ACM Trans. Comput. Biol. Bioinformatics 15, 1968–1978. doi: 10.1109/TCBB.2018.2827029
      [3] Freitas, A., Costa-Pereira, A., and Brazdil, P. (2007). “Cost-sensitive decision trees applied to medical data,” in Data Warehousing and Knowledge Discovery (Regensburg), 303–312. doi: 10.1007/978-3-540-74553-2_28
      [4] Mienye, I. D., and Sun, Y. (2021). Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlock. 25:100690. doi: 10.1016/j.imu.2021.100690
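One simple way to make the asymmetry between error types explicit is to choose the decision threshold that minimises a total misclassification cost, with a higher penalty for missed cases than for false alarms. The sketch below illustrates that idea only; it is a cost-sensitive thresholding step applied to an already trained scorer, not the cost-sensitive training methods of [1-4], and the labels, scores and cost values are made up for illustration.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Sum misclassification costs when predicting 1 for score >= threshold."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 0:
            cost += cost_fn  # missed a true case (risk underestimated)
        elif y == 0 and predicted == 1:
            cost += cost_fp  # false alarm (risk overestimated)
    return cost


def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp),
    )


# Hypothetical labels (1 = high-risk patient) and model risk scores.
y_true = [0, 0, 1, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]

symmetric = best_threshold(y_true, scores, cost_fn=1, cost_fp=1)
asymmetric = best_threshold(y_true, scores, cost_fn=5, cost_fp=1)
print("symmetric costs  -> threshold", symmetric)
print("missed case x5   -> threshold", asymmetric)
```

With equal costs the chosen threshold is 0.7 and the patient scored 0.3 is missed; once a missed case costs five times a false alarm, the threshold drops to 0.3 and the model accepts two false alarms to avoid that miss. A metric such as AUC is identical in both settings, which is the limitation the slide points out.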
  • 33. Translational challenge: human-in-the-loop AI
      • Essential in medical AI
      • Evidence of performance is not enough
      • Black-box AI is not acceptable in clinical practice
      “Explanation gap”: from technical explanations
      • Non-linear [1] and Deep Learning [2] models
      • Shapley values [3]
      • Interpretable ML [4,5]
      to expert involvement in the learning process:
      - by accepting/rejecting predictions
      - by expressing preference for a given error type
      Causal Machine Learning (CML) [6,7]:
      - Visualisation and reasoning over complex clinical scenarios
      - Counterfactuals, what-if scenarios
      Also importantly: Patient and Public Involvement (PPI) is essential in publicly funded clinical research
  • 34. References on explainability and human-in-the-loop
      [1] Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., et al. (2020). From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 2522–5839. doi: 10.1038/s42256-019-0138-9
      [2] Singh, A., Sengupta, S., and Lakshminarayanan, V. (2020). Explainable deep learning models in medical image analysis. J. Imaging 6:52. doi: 10.3390/jimaging6060052
      [3] Lundberg, S. M., and Lee, S. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, Vol. 30. (Long Beach, CA).
      [4] Ahmad, M. A., Eckert, C., and Teredesai, A. (2018). “Interpretable machine learning in healthcare,” in Proceedings of the ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (Washington, DC), 559–560. doi: 10.1145/3233547.3233667
      [5] Abdullah, T. A. A., Zahid, M. S. M., and Ali, W. (2021). A review of interpretable ML in healthcare: taxonomy, applications, challenges, and future directions. Symmetry 13:2439. doi: 10.3390/sym13122439
      [6] Oneto, L., and Chiappa, S. (2020). “Fairness in machine learning,” in Recent Trends in Learning From Data: Tutorials from the INNS Big Data and Deep Learning Conference (Sestri Levante, Genova), 155–196. doi: 10.1007/978-3-030-43883-8_7
      [7] Sanchez, P., Voisey, J. P., Xia, T., Watson, H. I., O’Neil, A. Q., and Tsaftaris, S. A. (2022). Causal machine learning for healthcare and precision medicine. R. Soc. Open Sci. 9:220638. doi: 10.1098/rsos.220638
  • 36. EHR data: Traditional statistics and machine learning methods
      Discovering patterns of multimorbid conditions and relationships between multimorbidities, socio-demographic factors, health-related quality of life, and mortality
      • See systematic review [1]
      • Clustering methods, differing in proximity measures. Also: patient clusters vs disease clusters?
      • Other specific methods:
        • Latent Class Analysis, Cox regression models, SVM, Cox proportional hazard model, Random Forests [2,3,4,5,6]
        • Multilevel logistic regression for longitudinal analysis [7]
      [1] Ng, Shu Kay, Richard Tawiah, Michael Sawyer, and Paul Scuffham. ‘Patterns of Multimorbid Health Conditions: A Systematic Review of Analytical Methods and Comparison Analysis.’ International Journal of Epidemiology 47, no. 5 (1 October 2018): 1687–1704. https://doi.org/10.1093/ije/dyy134.
  • 37. EHR data: Traditional methods – references
      [2] Larsen, Finn Breinholt, Marie Hauge Pedersen, Karina Friis, Charlotte Glümer, and Mathias Lasgaard. ‘A Latent Class Analysis of Multimorbidity and the Relationship to Socio-Demographic Factors and Health-Related Quality of Life. A National Population-Based Study of 162,283 Danish Adults.’ PloS One 12, no. 1 (2017): e0169426. https://doi.org/10.1371/journal.pone.0169426.
      [3] Jani, Bhautesh Dinesh, Peter Hanlon, Barbara I. Nicholl, Ross McQueenie, Katie I. Gallacher, Duncan Lee, and Frances S. Mair. ‘Relationship between Multimorbidity, Demographic Factors and Mortality: Findings from the UK Biobank Cohort’. BMC Medicine 17, no. 1 (10 April 2019): 74. https://doi.org/10.1186/s12916-019-1305-x.
      [4] Whitson, Heather E., Kimberly S. Johnson, Richard Sloane, Christine T. Cigolle, Carl F. Pieper, Lawrence Landerman, and Susan N. Hastings. ‘Identifying Patterns of Multimorbidity in Older Americans: Application of Latent Class Analysis.’ Journal of the American Geriatrics Society 64, no. 8 (August 2016): 1668–73. https://doi.org/10.1111/jgs.14201.
      [5] Zemedikun, Dawit T., Laura J. Gray, Kamlesh Khunti, Melanie J. Davies, and Nafeesa N. Dhalwani. ‘Patterns of Multimorbidity in Middle-Aged and Older Adults: An Analysis of the UK Biobank Data.’ Mayo Clinic Proceedings 93, no. 7 (July 2018): 857–66. https://doi.org/10.1016/j.mayocp.2018.02.012.
      [6] Zhu, Yajing, Duncan Edwards, Jonathan Mant, Rupert A. Payne, and Steven Kiddle. ‘Characteristics, Service Use and Mortality of Clusters of Multimorbid Patients in England: A Population-Based Study.’ BMC Medicine 18, no. 1 (10 April 2020): 78. https://doi.org/10.1186/s12916-020-01543-8.
      [7] Ashworth, Mark, Stevo Durbaba, David Whitney, James Crompton, Michael Wright, and Hiten Dodhia. ‘Journey to Multimorbidity: Longitudinal Analysis Exploring Cardiovascular Risk Factors and Sociodemographic Determinants in an Urban Setting.’ BMJ Open 9, no. 12 (23 December 2019): e031649. https://doi.org/10.1136/bmjopen-2019-031649.
  • 38. EHR data: Deep Learning
      Recent survey on DNN methods for EHR-based modelling [1]:
      - DNNs fully exploit the longitudinal nature of EHRs
      - Useful for predicting outcomes where patient history is a relevant predictor
      Target outcomes:
      - Future disease given medical history
      - Unplanned readmission
      - Disease progression
      - Specific complications, e.g. heart failure, cataract
      - Patient mortality
      State-of-the-art methods (before 2020):
      - eNRBM [2]
      - Deep Patient [3]
      - Deepr [4]
      - RETAIN [5]
      Summary of results:
      - The methods are competitive
      - They achieve AUC > 0.8 on each of the outcomes above
      [1] Ayala Solares, Jose Roberto, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, et al. ‘Deep Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Architectures’. Journal of Biomedical Informatics 101 (1 January 2020): 103337. https://doi.org/10.1016/j.jbi.2019.103337.
  • 39. EHR data: Deep Learning – references
      [2] T. Tran, T. D. Nguyen, D. Phung, S. Venkatesh, Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM), Journal of Biomedical Informatics 54 (2015) 96–105. doi: https://doi.org/10.1016/j.jbi.2015.01.012.
      [3] Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci Rep. 2016 May 17;6:26094. doi: 10.1038/srep26094. PMID: 27185194; PMCID: PMC4869115.
      [4] P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: A convolutional net for medical records, IEEE Journal of Biomedical and Health Informatics 21 (1) (2017) 22–30. doi: 10.1109/JBHI.2016.2633963.
      [5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism, in: Advances in Neural Information Processing Systems, 2016, pp. 3504–3512.
  • 40. A different approach: predicting target disease clusters
  • 41. Can we model the likelihood of next disease? (IEEE BigData 2022)
      Experimental models exist for
      - Modelling disease progression [1]
      - Discovering clinical pathway patterns [2]
      - Predicting next disease(s) [3]
      However, they are not very robust, nor actually deployed in practice.
      Our goal: use Electronic Health Records (diagnosis event logs) to predict patients’ long-term associations to a specific disease cluster
      [1] Wang, Xiang, David Sontag, and Fei Wang. ‘Unsupervised Learning of Disease Progression Models’. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 85–94. KDD ’14. New York, NY, USA, 2014. https://doi.org/10.1145/2623330.2623754.
      [2] Huang, Zhengxing, Wei Dong, Lei Ji, Chenxi Gan, Xudong Lu, and Huilong Duan. ‘Discovery of Clinical Pathway Patterns from Event Logs Using Probabilistic Topic Models’. Journal of Biomedical Informatics 47 (1 February 2014): 39–57. https://doi.org/10.1016/j.jbi.2013.09.003.
      [3] Men, Lu, Noyan Ilk, Xinlin Tang, and Yuan Liu. ‘Multi-Disease Prediction Using LSTM Recurrent Neural Networks’. Expert Systems with Applications 177 (1 September 2021): 114905. https://doi.org/10.1016/j.eswa.2021.114905.
  • 42. Research hypothesis
      It is possible to identify clusters of diseases that:
      1. Are described using disease terms that are familiar to health domain experts
      2. Are clinically significant, based on expert validation
      3. Admit a quantitative association of individual patients with each of the clusters
      What we hope to find:
      1. A significant majority of patients are stable relative to the clustering
      2. Stability emerges early in their medical history (*)
      (*) limited to LTCs
  • 43. Contributions
      • We use Topic Modelling as a form of semantic clustering
      • Topics are defined by ranked lists of disease terms
      • We define a cluster’s gravitational pull: patients are differently attracted by each cluster at different points in time
      • We propose a quantitative measure of stability with respect to clusters over time
      • We study how stability increases as timelines progress
  • 44. Dynamic patient-cluster associations: examples
      Patient: disease terms sequence T = [t1, t2, …, tn]
      Patient stages: T1 = [t1], T2 = [t1, t2], …, Tn = T = [t1, t2, …, tn]
      Rows follow the order of each patient's disease terms; each row gives the patient-topic associations at the stage ending with that term.

      Patient 1                                topic 1  topic 2  topic 3  topic 4
      OA                                          0.20     0.21     0.07     0.04
      skin ulcer                                  1.29     0.21     0.07     0.05
      dermatitis                                  1.76     0.21     0.07     0.05
      erectile dysfunction                        1.76     0.68     0.07     0.51
      Primary malignancy skin                     2.14     1.46     0.07     0.51

      Patient 2                                topic 1  topic 2  topic 3  topic 4
      spondylosis                                 0.31     0.37     0.26     0.00
      obesity                                     0.45     0.50     0.34     0.50
      urine incontinence                          0.75     0.50     1.06     0.51
      female genital prolapse                     1.22     0.51     1.75     0.51
      type 2 diabetes                             1.22     0.51     1.75     1.43
      unspecified rare diabetes                   1.22     0.51     1.75     2.56
      fracture of the hip                         1.23     2.62     1.75     2.57

      Patient 3                                topic 1  topic 2  topic 3  topic 4
      dermatitis                                  0.47     0.00     0.00     0.00
      hypertension                                0.55     0.16     0.00     0.16
      atrial fibrillation                         0.55     1.41     0.00     0.16
      OA                                          0.75     1.62     0.07     0.20
      tinnitus                                    1.24     2.02     0.33     0.20

      Patient 4                                topic 1  topic 2  topic 3  topic 4
      PTSD                                        0.22     0.00     0.68     0.00
      COPD                                        0.54     0.94     0.69     0.08
      Neuromuscular dysfunction of bladder        0.54     1.26     1.87     0.09
      female genital prolapse                     1.01     1.26     2.55     0.09
      OA                                          1.21     1.47     2.63     0.12

      Patient 5                                topic 1  topic 2  topic 3  topic 4
      Peripheral venous and lymphatic disease     0.59     0.24     0.07     0.00
      psoriasis                                   1.07     0.45     0.07     0.55
      female genital prolapse                     1.54     0.45     0.76     0.55
      CHD                                         1.54     1.11     0.76     0.84
      Alcohol dependence                          1.74     1.59     0.76     1.25
      obesity                                     1.88     1.72     0.84     1.75
      hearing loss                                2.21     2.12     0.85     1.76
      urine incontinence                          2.51     2.12     1.57     1.76

      Patient 6                                topic 1  topic 2  topic 3  topic 4
      asthma                                      0.80     0.00     0.00     0.00
      hypertension                                0.88     0.16     0.00     0.16
      hearing loss                                1.21     0.56     0.01     0.17
      Alcohol dependence                          1.41     1.04     0.01     0.58
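The example tables suggest that a patient's association with each topic accumulates as disease terms are added: the same term contributes the same increment for different patients (for instance, hypertension adds 0.08, 0.16, 0.00, 0.16 for both Patient 3 and Patient 6). The sketch below reproduces the Patient 3 table under that reading. The per-term weights are obtained by differencing consecutive rows of the slide, and the `stable_from` rule (first stage after which the dominant topic no longer changes) is a hypothetical stand-in for the stability measure mentioned in the talk, whose actual definition is not given here.

```python
# Per-term topic weights, derived by differencing consecutive rows of the
# Patient 3 table on the slide (an assumption about how rows are built).
TERM_WEIGHTS = {
    "dermatitis":          [0.47, 0.00, 0.00, 0.00],
    "hypertension":        [0.08, 0.16, 0.00, 0.16],
    "atrial fibrillation": [0.00, 1.25, 0.00, 0.00],
    "OA":                  [0.20, 0.21, 0.07, 0.04],
    "tinnitus":            [0.49, 0.40, 0.26, 0.00],
}


def stage_associations(terms, weights):
    """Cumulative topic association after each stage T1, T2, ..., Tn."""
    totals = [0.0] * len(next(iter(weights.values())))
    stages = []
    for term in terms:
        totals = [t + w for t, w in zip(totals, weights[term])]
        stages.append(totals)
    return stages


def dominant_topics(stages):
    """1-based index of the topic with the highest association per stage."""
    return [max(range(len(s)), key=lambda k: s[k]) + 1 for s in stages]


def stable_from(dominant):
    """Hypothetical stability rule: first stage (1-based) from which the
    dominant topic stays the same until the end of the timeline."""
    stage = len(dominant)
    while stage > 1 and dominant[stage - 2] == dominant[-1]:
        stage -= 1
    return stage


patient3 = ["dermatitis", "hypertension", "atrial fibrillation", "OA", "tinnitus"]
stages = stage_associations(patient3, TERM_WEIGHTS)
dominant = dominant_topics(stages)

for term, row in zip(patient3, stages):
    print(f"{term:<20}", " ".join(f"{v:5.2f}" for v in row))
print("dominant topic per stage:", dominant)
print("stable from stage:", stable_from(dominant))
```

Under this reading Patient 3 starts closest to topic 1, switches to topic 2 when atrial fibrillation is recorded, and stays there for the rest of the timeline.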
  • 46. Challenge: synthetic data generation for specialized data types
      Self-monitoring contains potentially useful signal to anticipate specific conditions
      - But data are heavily imbalanced towards healthy controls
      - Case data points are harder to collect
      Can we use the available “seed” true data points to generate new synthetic and plausible ones?
      Specifically: physical activity data → general problem of time-series data generation
      Challenge: existing GAN / TimeGAN approaches are insufficient
      - Hard to scale
      - Require very strong signal
  • 47. Key messages
      AI for HealthCare: great opportunities for impactful research, but many challenges remain
      • The weaknesses are in the data, not in the models!
      • The need for data integration + curation + engineering dominates the need for size
      • Investments driven by “health crisis”:
        • Mental (dementia, Parkinson’s)
        • Physical: multimorbidity in older population
      • Focus on EHR:
        • Good advances in using AI to draw insights from EHRs, but data quality is a big barrier

Editor's Notes

  1. Mention "reusable analysis pipelines" (RAP). NHS data in the UK are a prime example of retrospective data. In principle they are accessible for research, but there are governance issues, and the data require coding and integration.
  2. Figure 2a shows that the total volume of data in genomics is considerably smaller than the data generated by earth science [26], but orders of magnitude larger than the social sciences. The data growth trend in genomics, however, is greater than in other disciplines. In fact, some researchers have suggested that if the genomics data generation growth trend remains constant, genomics will soon generate more data than applications such as social media, earth sciences, and astronomy [27]. In Fig. 2b, we compare genomics to other data-driven disciplines in the biological sciences. This analysis clearly shows that the large amount of early biological data was not in genomics, but rather in macromolecular structure. Only in 2001, for example, did the number of datasets in genomics finally surpass protein-structure data. More recently, new trends have emerged with the rapidly increasing amount of electron microscopy data, due to the advent of cryo-electron microscopy, and of mass-spectrometry-based proteomics data. Perhaps these trends will shift the balance of biomedical data science in the future.
  3. Vision is generating actionable knowledge from big health data. What sort of clinical questions are we considering?