
Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's

A talk given to the School of Information Sciences
Center for Informatics Research in Science and Scholarship
University of Illinois Urbana-Champaign


  1. 1. Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's. Paolo Missier, School of Computing, Newcastle University, UK. March 2021. A talk given to the School of Information Sciences, Center for Informatics Research in Science and Scholarship, University of Illinois Urbana-Champaign. paolo.missier@ncl.ac.uk | LinkedIn: paolomissier | Twitter: @PMissier
  2. 2. 2 The message: 1. “Data Science” for Health is hard. The hard part is the data 2. “AI for Health” is (Deep) Machine Learning 3. Ethics. Fairness. Trust. Acceptance. 4. Data Provenance for Data Science: Solution or distraction? • Transparency • Trustworthiness • Traceability
  3. 3. 3 A Grand Challenge https://epsrc.ukri.org/research/ourportfolio/themes/healthcaretechnologies/strategy/grandchallenges/
  4. 4. 4 AI for healthcare – the UK landscape https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences AI and data science will improve the detection, diagnosis, and treatment of illness. They will optimise the provision of services, and support health service providers to anticipate demand and deliver improved patient care. • Explainability / Interpretability • Exploiting EHR (Electronic Health Records) • Image interpretation • Fairness, Bias • Ethical issues in … • Predicting <disease / critical event> …
  5. 5. 5 Personalised, Predictive, Preventive, Participatory Medicine (P4) Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat Biotechnol. 2017;35:747.
  6. 6. 6 D2P4 (*): Data-Driven, Personalised, Predictive, Preventive, Participatory Healthcare research. Data sources: physical activity monitoring (wearables); in-patient hospital records; primary care health records + prescriptions; clinical protocols; multi-omics (genomics, transcriptomics, proteomics, metabolomics…); images -- histology, X-ray, … Data preparation: • Cleaning • Integration • Alignment • Imputation • NLP • … Target outcomes: early detection of Type 2 Diabetes / metabolic / age-related diseases; early detection of Parkinson’s; frailty / intrinsic capacity assessment; Multimorbidity / Multiple Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); liver disease progression: NAFLD / NASH; liquid biopsy. Tools: programming / scripting (python, R, …), workflows (Knime, RapidMiner, …). Methods: clustering (ML), predictive modelling (ML), image interpretation / deep learning, … “AI”… (plus traditional statistics!)
  7. 7. 7 Big Data for Health Care: genomics for personalized medicine, personal monitors / wearables, medical records. Source: Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
  8. 8. 9 D2P4  Accelerometry Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy • Cleaning • Integration • Alignment • Imputation • NLP • … Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML)
  9. 9. 10 Digital biomarkers Digital biomarkers come from "novel sensing systems capable of continuously tracking behavioral signals […] capture people's everyday routines, actions, and physiological changes that can explain outcomes related to health, cognitive abilities, and more” (Choudhury 2018). Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156. https://doi.org/10.1145/3266285 - physical activity - glucose levels - blood oxygen levels - … Inexpensive  scalable personalised self-monitoring
  10. 10. 11 A first project: markers from accelerometers? Initial study: digital biomarkers (physical activity, glucose levels, blood oxygen levels, …) + UK Biobank dataset + Type 2 Diabetes outcome. Aligned with the P4 agenda. A readily available dataset with 100K activity traces: (+) 3,500+ features (+) multi-omics coverage (+) genomics (+) links to EHR (+) activity monitors made in Newcastle! (-) limited follow-ups, little longitudinal depth (-) population not random (-) activity data per person very limited.
  11. 11. 13 Using wearable activity trackers to predict Type-2 Diabetes Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations. Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer Cohort -- JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press) Feature extraction Clustering Classification ??
  12. 12. 14 Granular activity representation feature extraction 60 features / day
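The exact 60-feature-per-day representation is defined in the JMIR paper and is not reproduced here. As a rough illustration of the idea, the sketch below derives hour-level summary features from a raw acceleration trace with pandas; column names ('timestamp', 'acc') and the sedentary threshold are hypothetical.

```python
# Sketch: derive a granular daily feature vector from a raw accelerometer trace.
# The study's concrete 60-feature representation is not reproduced; this only
# illustrates the general pattern (hypothetical column names and thresholds).
import pandas as pd

def daily_features(trace: pd.DataFrame) -> pd.DataFrame:
    """trace: one row per epoch, columns 'timestamp' (datetime) and
    'acc' (mean vector-magnitude acceleration for that epoch)."""
    trace = trace.set_index("timestamp").sort_index()
    hourly = trace["acc"].resample("1H").mean()                    # hourly means
    per_day = hourly.groupby(hourly.index.date)
    feats = pd.DataFrame({
        "mean_acc": per_day.mean(),                                # overall daily level
        "peak_hour_acc": per_day.max(),                            # most active hour
        "low_activity_hours": per_day.apply(lambda h: (h < 5).sum()),  # illustrative cut-off
    })
    # Wide per-hour profile: one column per hour of the day
    profile = (hourly.to_frame("acc")
               .assign(day=lambda d: d.index.date, hour=lambda d: d.index.hour)
               .pivot_table(index="day", columns="hour", values="acc"))
    return feats.join(profile)
```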
  13. 13. 15 Cohort selection funnel. All UK Biobank participants: 502,664  Filter: accelerometry study? 103,712  Split criteria: Type 2 Diabetes? (at baseline: 2,755; through EHR analysis: 1,321; total: 4,076) vs non-diabetes: 99,636  Filter: EHR data available? 19,852  Filter: QC on activity traces  3,103 positives. Comparisons: T2D vs Norm-0, T2D vs Norm-1. Physical impairment analysis: severe impairment 1,666; no impairment 8,463. A great UG project! Your (biomedical) dataset may not be as big as it looks.
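Such inclusion/exclusion funnels are routine in biomedical data science. A minimal sketch of how the filtering might be expressed with pandas; the boolean flag columns are hypothetical, not actual UK Biobank field names.

```python
# Sketch of a cohort-selection funnel with pandas (column names are hypothetical,
# not the real UK Biobank fields).
import pandas as pd

def select_cohort(ukb: pd.DataFrame) -> pd.DataFrame:
    steps = []
    cohort = ukb                                               # all participants
    steps.append(("All UK Biobank participants", len(cohort)))

    cohort = cohort[cohort["in_accelerometry_study"]]          # wore the wrist device
    steps.append(("Accelerometry study", len(cohort)))

    cohort = cohort[cohort["has_linked_ehr"]]                  # primary-care linkage
    steps.append(("EHR data available", len(cohort)))

    cohort = cohort[cohort["activity_trace_qc_pass"]]          # QC on activity traces
    steps.append(("QC on activity traces", len(cohort)))

    # Case/control split: T2D at baseline or via EHR analysis vs. normoglycaemic
    cohort = cohort.assign(is_t2d=cohort["t2d_at_baseline"] | cohort["t2d_from_ehr"])
    for name, n in steps:
        print(f"{name}: {n}")
    return cohort
```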
  14. 14. 16 (some) results. Rows: classifiers; column groups: feature sets; negatives: Norm-0 / Norm-2 control groups.

               HLAF            SDL           HLAF+SDL
           Norm-0  Norm-2  Norm-0  Norm-2  Norm-0  Norm-2
      RF    .80     .68     .83     .78     .86     .77
      LR    .79     .70     .83     .78     .86     .78
      XGB   .78     .66     .80     .74     .85     .75
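The study compares random forests (RF), logistic regression (LR) and gradient boosting (XGB). Below is a generic sketch of such a comparison with scikit-learn and xgboost using cross-validated AUC; hyperparameters and the exact evaluation protocol of the JMIR paper are not reproduced here.

```python
# Generic sketch of the RF / LR / XGB comparison with cross-validated AUC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier   # assumes the xgboost package is installed

def compare_models(X, y):
    models = {
        "RF":  RandomForestClassifier(n_estimators=500, random_state=0),
        "LR":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "XGB": XGBClassifier(n_estimators=300, eval_metric="logloss"),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name}: AUC = {auc.mean():.2f} (+/- {auc.std():.2f})")
```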
  15. 15. 17 Ongoing work. Are there better embedded representations for accelerometry data? Can they be used as predictors for other outcomes? Representation learning: LSTM autoencoder  embedded feature space  standard classification. Outcome: insulin sensitivity (DIRECT DB).
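As a sketch of the representation-learning step, here is a minimal LSTM autoencoder in Keras: the encoder compresses a fixed-length activity sequence into an embedded feature space, which can then feed a standard classifier or regressor. The dimensions and architecture choices are illustrative, not those of the ongoing work.

```python
# Illustrative LSTM autoencoder for embedding fixed-length accelerometry sequences.
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_channels, latent_dim = 96, 1, 16   # arbitrary example sizes

inputs = keras.Input(shape=(timesteps, n_channels))
encoded = layers.LSTM(latent_dim)(inputs)                    # embedded feature space
decoded = layers.RepeatVector(timesteps)(encoded)
decoded = layers.LSTM(32, return_sequences=True)(decoded)
decoded = layers.TimeDistributed(layers.Dense(n_channels))(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)                       # reused for downstream features
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_traces, X_traces, epochs=50, batch_size=64)
# Z = encoder.predict(X_traces)   # feed Z to a standard classifier / regressor
```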
  16. 16. 19 D2P4  COVID Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML)
  17. 17. D. Ferrari, Prof. F. Mandreoli, Prof. G. Guaraldi, Prof. P. Missier. Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from Northern Italy. Context: peak of the Italian Covid crisis (March 2020 onwards). Issue: ICU capacity. Question: will my next patient require ICU resources? How soon? Reference: Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
  18. 18. 21 Study structure. Applied machine learning driven by a clinical question; an example of a typical data science pattern: • Data selection  inclusion, exclusion criteria • Data preparation / cleaning • Variable selection • Model learning  multiple models • Model evaluation. With additional challenges: a “live”, evolving dataset with multiple versions of the patients database (changes in recording practices, inconsistencies, lots of missing data). Small data: 198 patients  1,068 observations  31-90 variables (symptoms, lab biomarkers). During the data collection period the dataset grew daily, at an average of 84 new records per day and a mean of 10 new data points per patient. Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed 1,068 usable observations: 603 observations met the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not. Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, others were recorded upon clinical indication.
  19. 19. 22 A case study to illustrate the problem
  20. 20. 24 Modelling Requirements • Parsimonious  few variables • Robust to missing data  imputation not an option • Explainable  Trust • model reveals the relative importance of each variable for each prediction it makes • Minimize the number of false negatives • risk of under-estimating the severity of a patient’s condition
  21. 21. 26 Approach • Parsimonious  feature ranking and selection • Robust to missing data • Explainable  Shapley values • Minimize FN  bespoke loss function Ensemble of Decision trees
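The bespoke loss function used in the paper is not reproduced here. One common way to realise the "minimize false negatives" requirement with an ensemble of trees is a class-weighted logistic objective, sketched below for XGBoost; the cost ratio FN_WEIGHT is an arbitrary illustration. XGBoost also handles missing values natively, which fits the "robust to missing data, imputation not an option" requirement.

```python
# Sketch: a class-weighted logistic objective that makes missing a severe case
# (false negative) more costly than a false alarm. Not the loss used in the paper.
import numpy as np
import xgboost as xgb

FN_WEIGHT = 5.0   # illustrative: a false negative costs 5x a false positive

def weighted_logloss(preds, dtrain):
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))         # sigmoid of the raw margin
    w = np.where(y == 1, FN_WEIGHT, 1.0)     # up-weight the positive (severe) class
    grad = w * (p - y)                       # first derivative of weighted logloss
    hess = w * p * (1.0 - p)                 # second derivative
    return grad, hess

# dtrain = xgb.DMatrix(X_train, label=y_train)   # NaNs handled natively, no imputation
# booster = xgb.train({"max_depth": 4}, dtrain, num_boost_round=200,
#                     obj=weighted_logloss)
```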
  22. 22. 27 Testing multiple models: results. Parsimony: Model 1 achieves suboptimal prediction accuracy. Model 2: adding biomarkers, including respiratory variables, increases performance. Model 3: a boosted mixed model, which still requires about 20 variables. From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice. What our approach offers in support of the decision-making process is a simple interpretation of the predictions.
  23. 23. 28 Which are the most important predictors? Shap values
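A minimal sketch of how per-prediction variable importance is typically obtained with the shap library for a fitted tree ensemble; `model` and `X_test` are placeholders for the trained model and held-out feature matrix.

```python
# Sketch: SHAP values for a fitted tree ensemble (model and X_test are placeholders).
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)    # one contribution per feature per prediction
# Note: for some binary classifiers shap_values is a list with one array per class.

# Global view: which variables matter most across the test set
shap.summary_plot(shap_values, X_test)

# Local view: why the model flagged (or did not flag) one specific patient
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])
```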
  24. 24. 29 Summary Good results on “live” data, predicting a useful outcome for the purpose of ICU management Major selling points: • Variables (relatively) easy to collect in routine visits and in-hospital • Models are explainable, medics can reality-check against their own understanding … Opened the door to further collaborations: New project on PACS: Post-Acute Covid Syndrome: Following up recovery paths for 300 patients across 5 hospitals
  25. 25. 30 D2P4  EHR analysis for dynamic risk prediction D2P4 (*) Healthcare research Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Survival analysis Longitudinal prediction models
  26. 26. 31 Longitudinal data: Health-related events https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270 UK Biobank - Primary Care Linked Data
  27. 27. 32 Clinical Risk Prediction Models. Healthy participant, or missing data / under-reported conditions? Is the number and pattern of records a proxy for health? Informed presence bias: individuals in EHR data are systematically different from those who are not (Goldstein et al., 2016).
  28. 28. 36 Case study: Type 2 Diabetes. [Figure 17: example output of the phenotyping tool for a redacted example participant. Glycated haemoglobin HbA1c (mmol/mol) and fasting plasma glucose (mmol/l) from primary care records and UK Biobank visits, plotted over the estimated observation period, together with primary/secondary care events (observations, drugs, diagnoses, operations) and the inferred normoglycaemic / pre-diabetic / diabetic / remission status.]
  29. 29. 37 Case study: Type 2 Diabetes, remission study. Type 2 diabetes remission: longitudinal phenotyping with large-scale observational data. Philip Darke, EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University. UK Biobank (ukbiobank.ac.uk) is a UK-based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort, with nearly 70,000 diagnoses expected by 2027 (Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123-126, September 2012. doi:10.1016/j.hlpt.2012.07.003). Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes, as illustrated in figure 1. The approach will be expanded to all participants when further data is released. [Figure 1: model output showing HbA1c (mmol/mol), weight (kg), periods of medication (biguanides) and inferred diabetic status for an example participant; long-term remission was achieved by sustained weight loss post diagnosis.] Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long-term remission and cease anti-diabetes medication. This project will examine the pathways to remission at scale using observational data.
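To make the phenotyping idea concrete, below is a heavily simplified rule-based sketch over primary-care HbA1c records. The thresholds are the widely used clinical cut-offs (HbA1c of 48 mmol/mol or above for diabetes, 42-47 for pre-diabetes); the actual phenotyping algorithm in this work also uses fasting glucose, diagnosis codes and medication history, and is not reproduced here.

```python
# Simplified rule-based longitudinal phenotyping over primary-care HbA1c records.
# Thresholds are common clinical cut-offs; the study's real rules are more involved.
import pandas as pd

def phenotype(hba1c: pd.DataFrame, on_medication: pd.Series) -> pd.Series:
    """hba1c: columns 'date' and 'value' (mmol/mol);
    on_medication: boolean Series indexed by the same dates."""
    hba1c = hba1c.sort_values("date").set_index("date")
    status = pd.Series("normoglycaemic", index=hba1c.index)
    status[hba1c["value"].between(42, 47)] = "pre-diabetic"
    status[hba1c["value"] >= 48] = "diabetic"
    # Remission: previously diabetic, now below threshold and off anti-diabetes drugs
    ever_diabetic = (hba1c["value"] >= 48).cummax()
    remission = (ever_diabetic & (hba1c["value"] < 48)
                 & ~on_medication.reindex(hba1c.index, fill_value=False))
    status[remission] = "remission"
    return status
```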
  30. 30. 38 D2P4  MLTC-M Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multiple Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) NLP
  31. 31. 39 Multimorbidity and Long-Term Conditions. Patients with multimorbidities have the greatest healthcare needs and generate the highest expenditure in the health system. There is an increasing focus on identifying specific disease combinations in order to address poor outcomes. Methods: matrix factorization / factor analysis, clustering, multiple correspondence analysis, network analysis, … Which data? Fragmented / disconnected data sources  data access, data governance.
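As an illustration of the kinds of methods listed above, here is a minimal sketch over a hypothetical patient-by-condition binary matrix: matrix factorisation for soft "disease combination" factors, clustering for multimorbidity profiles, and a co-occurrence matrix as the starting point for network analysis. The data source and dimensions are invented.

```python
# Sketch over a hypothetical patient x condition 0/1 matrix.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

conditions = pd.read_csv("patient_conditions.csv", index_col="patient_id")  # 0/1 flags

# (a) Matrix factorisation: soft "disease combination" factors
nmf = NMF(n_components=8, init="nndsvda", random_state=0)
patient_loadings = nmf.fit_transform(conditions)     # patients x factors
factor_conditions = nmf.components_                  # factors x conditions

# (b) Clustering patients into multimorbidity profiles
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(conditions)

# (c) Condition co-occurrence matrix, the input to a network analysis
co_occurrence = conditions.T @ conditions
```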
  32. 32. 40 D2P4  NAFLD / non-alcohol fatty liver disease Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy • Cleaning • Integration • Alignment • Imputation • NLP • … Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”…
  33. 33. 41 D2P4  NAFLD / NASH NASH = non-alcoholic steatohepatitis Aims: - integrate cross-sectional and longitudinal outcomes clinical data with a multi-dimensional ‘omics’ record - Hypothesis: a precision medicine approach leads to better understanding of individuals’ trajectories - Personalised biomarkers  liquid biopsy Dataset: European NAFLD Registry 7,750 patients with histologically proven NAFLD/NASH - Omics (cross-sectional) - Longitudinal follow ups Methods: - Precision: clustering - Anticipating progression: Learn cluster-specific longitudinal models
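A minimal sketch of the two-step "cluster, then model trajectories per cluster" idea, under the assumption of a baseline omics table and longitudinal follow-up scores; file names, variable names and model form are hypothetical, not those of the European NAFLD Registry analysis.

```python
# Sketch: (1) cluster patients on cross-sectional omics, (2) fit a cluster-specific
# longitudinal model of disease progression. All names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import statsmodels.formula.api as smf

omics = pd.read_csv("omics_baseline.csv", index_col="patient_id")
follow_up = pd.read_csv("follow_up.csv")   # patient_id, years_since_baseline, fibrosis_score

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(omics))
follow_up["cluster"] = follow_up["patient_id"].map(pd.Series(labels, index=omics.index))

# One random-intercept model per cluster: progression rate may differ by cluster
for c, grp in follow_up.groupby("cluster"):
    model = smf.mixedlm("fibrosis_score ~ years_since_baseline", grp,
                        groups=grp["patient_id"]).fit()
    print(c, model.params["years_since_baseline"])   # cluster-specific slope
```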
  34. 34. 42 DP4DS: Data Provenance for Data Science D2P4 + DP4DS(*) Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”… (plus traditional statistics!)
  35. 35. 43 Data  Model  Predictions. Pipeline: data collection  raw datasets  pre-processing  instances / features  model  the “predicted you”: a ranking, a score, a class. Key decisions are made during data selection and processing: Where does the data come from? What’s in the dataset? What transformations were applied? Complementing current ML approaches to model interpretability: 1. Can we explain these decisions? 2. Are these explanations useful?
  36. 36. 44 Explaining data preparation. Same pipeline: data collection (population data)  raw datasets  pre-processing  instances / features  model  the “predicted you”: a ranking, a score, a class. Pre-processing steps: integration, cleaning, outlier removal, normalisation, feature selection, class rebalancing, sampling, stratification, … Implemented as scripts (Python / TensorFlow, Pandas, Spark) or workflows (Knime, …). Data acquisition and wrangling: How were datasets acquired? How recently? For what purpose? Are they being reused / repurposed? What is their quality? Provenance  Transparency.
  37. 37. 46 Recent early results A small grassroots project… [1] - Formalisation of provenance patterns for pipeline operators - Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines - Reality check: - How much does it cost?  provenance volume - Does it help?  queries against the provenance database [1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
  38. 38. 47 Operators. Data reduction: feature selection, instance selection. Data augmentation: space transformation, instance generation, encoding (e.g. one-hot…). Data transformation: data repair, binarisation, normalisation, discretisation, imputation. Ex.: vertical augmentation  adding columns.
  39. 39. 48 Code instrumentation Create a provlet for a specific transformation Initialize provenance capture …code injection is now being automated!
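The actual capture library of Chapman et al. (PVLDB 2021) is far more fine-grained and is not reproduced here. As a toy illustration of what "creating a provlet for a specific transformation" can look like, the sketch below wraps a pandas preprocessing step in a decorator that records what the step read and produced.

```python
# Toy illustration of provlet-style instrumentation: wrap a DataFrame transformation
# and log coarse provenance of what it used and generated. Not the real capture library.
import functools, json, time
import pandas as pd

def provlet(op_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(df: pd.DataFrame, *args, **kwargs):
            before = {"columns": list(df.columns), "rows": len(df)}
            out = fn(df, *args, **kwargs)
            after = {"columns": list(out.columns), "rows": len(out)}
            record = {"operator": op_name, "timestamp": time.time(),
                      "used": before, "generated": after,
                      "new_columns": sorted(set(after["columns"]) - set(before["columns"]))}
            with open("provenance.jsonl", "a") as f:     # append one provlet per call
                f.write(json.dumps(record) + "\n")
            return out
        return wrapper
    return decorator

@provlet("imputation")
def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna(df.median(numeric_only=True))
```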
  40. 40. 49 Provenance patterns
  41. 41. 50 Provenance templates. Template + binding rules = instantiated provenance fragment. op: {old values: F, I, V}  {new values: F’, J, V’}
  42. 42. 51 This applies to all operators…
  43. 43. 52 Putting it all together
  44. 44. 53 Evaluation - performance
  45. 45. 54 Evaluation: Provenance capture and query times
  46. 46. 55 Scalability
  47. 47. 56 Summary. Multiple hypotheses regarding Data Provenance for Data Science: 1. Is it practical to collect fine-grained provenance? (a) To what extent can it be done automatically? (b) How much does it cost? 2. Is it also useful?  what is the benefit to data analysts? Work in progress! Interest? Ideas?
  48. 48. 57 Acknowledgments Prof. Mike Catt PhD Students: Ben Lam, Philip Darke MSc student: Sam Butterfield Prof. Guaraldi Prof. Mandreoli MSc student: Davide Ferrari Prof. Torlone MSc student: Giulia Simonelli Prof. Chapman
