ReComp and P4@NU: Reproducible Data Science for Health

A brief overview of the ReComp project (on selective recurring re-computation of complex analytics), and a brief outlook on the P4@NU project, which seeks digital biomarkers for age-related metabolic diseases.

  1. Paolo Missier, School of Computing, Newcastle University, and Alan Turing Institute, UK. 3rd MAQC Conference, Riva del Garda, Italy, April 2019. ReComp(1) and P4@NU(2): Reproducible Data Science for Health. In collaboration with the Institute of Genetic Medicine.
  2. ReComp: Context. [Figure: Big Data feeds "The Big Analytics Machine" ((Life Science) Analytics, Data Science…), which yields “Valuable Knowledge” in versions V1, V2, V3 over time t. The machine relies on meta-knowledge that also changes over time: algorithms, tools, libraries, reference datasets.]
  3. ReComp: selective recurring re-computation of complex analytics
      • Threats: Will any of the changes invalidate prior findings?
      • Opportunities: Can the findings be improved over time?
      • Scope: expensive analysis + frequent changes + high change impact
  4. Use case: genomic variant analysis (image credits: Broad Institute)
      • Current cost of whole-genome sequencing: < £1,000
      • (Our) processing time: about 40 min / GB
      • → at 11 GB / sample (exome): 8 hours
      • → at 300-500 GB / sample (genome): …
  5. GATK best practices pipeline: expensive, evolving
      • Called variants for a 16-patient dataset (7 AD, 9 FTD-ALS), using different versions of the Freebayes variant caller
      • Input: ≈15 GB compressed / 40 GB uncompressed (FASTQ) per sample
      • 30-40 input samples → 1.2-1.6 TB uncompressed
      • A small 6-sample run takes about 30 h on an HPC cluster
      • Reference: Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016.
  6. Genomics: WES / WGS, variant calling → variant interpretation
      • SVI: Simple Variant Interpretation
      • Variant classification: pathogenic, benign, and unknown/uncertain
      • Reference: Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  7. Changes that affect variant interpretation. Evolution in the number of variants that affect patients: (a) with a specific phenotype; (b) across all phenotypes. [Figure panels: phenotype-specific variants; across all phenotypes.]
  8. Baseline: blind re-computation, ≈7 minutes / patient (single-core VM). Sparsity issue ("a game of battleship"):
      • About 500 executions
      • 33 patients
      • Total runtime about 60 hours
      • Only 14 relevant output changes detected
      • 4.2 hours of computation per change
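
The cost-per-change figure on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that each of the roughly 500 executions takes the quoted ≈7 minutes; the slide gives that figure per patient, so treating it as a per-execution cost is an assumption.

```python
# Figures quoted on the slide; treating 7 minutes as the cost of ONE execution
# is an assumption made for this sketch.
executions = 500            # "about 500 executions"
minutes_per_execution = 7   # "≈7 minutes / patient (single-core VM)"
relevant_changes = 14       # "only 14 relevant output changes detected"

total_hours = executions * minutes_per_execution / 60
hours_per_change = total_hours / relevant_changes
hit_rate = relevant_changes / executions

print(f"total runtime: {total_hours:.1f} h")
print(f"cost per detected change: {hours_per_change:.1f} h")
print(f"executions yielding a relevant change: {hit_rate:.1%}")
```

Under that assumption the total comes to a little under 60 hours and about 4.2 hours per detected change, in line with the slide; fewer than 3% of the blind executions find anything.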
  9. ReComp and reproducibility. Current reproducibility practice focuses on how; ReComp aims to determine when, and to what extent.
  10. The ReComp meta-process
      • Cycle: record execution history → detect and quantify changes → estimate impact of changes → select relevant sub-processes → partially re-execute
      • History DB: log / provenance of the analytics process P
      • Change events: data diff(d, d’)
      • Changes (P → P’): process components, library versions, OS stack, reference datasets
      • Impact estimation functions: Impact(d→d’, o)
  11. Key outcomes
      1. A generic, customizable decision framework
         • CONFIG: user-defined change detection and impact functions
         • IN: process P, old inputs, changes
         • OUT: recomp / no-recomp decisions on P
      2. Framework validation on two very different case studies
         • Flood simulations
         • High-throughput genomics data processing: variant calling and interpretation
      Reference: Cała J, Missier P. Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study. Journal of Big Data Research 2018;13:76-94. doi:10.1016/j.bdr.2018.06.001
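
A minimal sketch of what such a decision framework could look like, assuming the CONFIG part is a pair of user-supplied functions. The class `ReCompDecider`, the helpers `dict_diff` and `overlap_impact`, and the threshold rule are hypothetical names and choices made for illustration; they are not the ReComp API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

D = TypeVar("D")  # a versioned input, e.g. a reference dataset
C = TypeVar("C")  # a description of the change between two versions
O = TypeVar("O")  # a previously computed outcome (or a summary of it)


@dataclass
class ReCompDecider(Generic[D, C, O]):
    """Hypothetical decision component: CONFIG = diff + impact functions."""
    diff: Callable[[D, D], C]        # user-defined change detection
    impact: Callable[[C, O], float]  # user-defined impact estimate
    threshold: float = 0.0           # recompute only above this impact

    def should_recompute(self, old_input: D, new_input: D, old_outcome: O) -> bool:
        change = self.diff(old_input, new_input)
        return self.impact(change, old_outcome) > self.threshold


# Example CONFIG for dictionary-shaped reference data (illustrative only).
def dict_diff(old: dict, new: dict) -> set:
    """Keys that were added, removed, or whose value changed."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def overlap_impact(changed: set, used_keys: set) -> float:
    """Fraction of the keys an outcome depended on that have changed."""
    return len(changed & used_keys) / len(used_keys) if used_keys else 0.0


old_ref = {"a": 1, "b": 2, "c": 3}
new_ref = {"a": 1, "b": 5, "c": 3, "d": 4}
decider = ReCompDecider(diff=dict_diff, impact=overlap_impact)

print(decider.should_recompute(old_ref, new_ref, {"a", "c"}))  # outcome used unchanged keys
print(decider.should_recompute(old_ref, new_ref, {"b"}))       # outcome used a changed key
```

Keeping `diff` and `impact` as plain callables mirrors the CONFIG / IN / OUT split on the slide: the framework stays generic, and each case study supplies its own functions.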
  12. SVI as an eScience Central workflow. [Figure: workflow blocks Phenotype to genes, Variant selection, Variant classification; inputs Phenotype and Patient variants; reference datasets GeneMap and ClinVar; output Classified variants.]
  13. Workflow provenance. Each process invocation generates a provenance trace. [Figure: the “plan” (Workflow WF with Program blocks B1, B2) and the “plan execution” (WFexec with Executions B1exec, B2exec), connected by partOf, usage, generation, and association relations to Data and to db, an Entity (ref data).]
  14. Approach 1/3: partial re-execution. Which process fragments are impacted by the change? Insight: impact analysis driven by the provenance trace. Overhead: caching intermediate data.
      Time savings:
      Change in   Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
      GeneMap     325                          455                           28.5
      ClinVar     287                          455                           37
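
One way to picture provenance-driven impact analysis is as a traversal of usage and generation edges, starting from the changed data item. The sketch below uses the block and dataset names from the SVI workflow slide, but the wiring between them and the intermediate items `genes` and `selected variants` are assumptions made for illustration, as is the function `impacted_blocks`.

```python
from collections import deque

# Assumed wiring: the slides name the blocks and datasets but not the edges.
# "genes" and "selected variants" are hypothetical intermediate data items.
USES = {
    "Phenotype to genes": {"Phenotype", "GeneMap"},
    "Variant selection": {"Patient variants", "genes"},
    "Variant classification": {"selected variants", "ClinVar"},
}
GENERATES = {
    "Phenotype to genes": {"genes"},
    "Variant selection": {"selected variants"},
    "Variant classification": {"Classified variants"},
}


def impacted_blocks(changed_item):
    """Blocks that used the changed item, directly or via derived data."""
    impacted, seen = [], set()
    queue = deque([changed_item])
    while queue:
        item = queue.popleft()
        for block, inputs in USES.items():
            if item in inputs and block not in seen:
                seen.add(block)
                impacted.append(block)
                queue.extend(GENERATES[block])  # outputs are now stale too
    return impacted


def time_saving_pct(partial_sec, complete_sec):
    """Relative saving of partial over complete re-execution."""
    return 100 * (complete_sec - partial_sec) / complete_sec


print(impacted_blocks("ClinVar"))
print(impacted_blocks("GeneMap"))
print(round(time_saving_pct(325, 455), 1), round(time_saving_pct(287, 455), 1))
```

With this assumed wiring a ClinVar change touches only the final block, while a GeneMap change propagates through all three, which agrees in direction with the larger saving reported for ClinVar. The saving formula gives about 28.6% and 36.9% for the two rows of the table.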
  15. Approach 2/3: differential execution. Idea: instead of re-executing P(<New Input>), execute P(diff(<New Input>, <Old Input>)). Potentially effective but hard to generalise.
      ClinVar versions (from → to)   To-version record count   Difference record count   Reduction
      15-02 → 16-05                  290815                    38216                     87%
      15-02 → 16-02                  285042                    35550                     88%
      16-02 → 16-05                  290815                    3322                      98.9%
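
A toy illustration of differential execution, assuming the process P treats each record independently and that records are only added or modified between versions. The functions `diff_records`, `pathogenic_ids`, and `reduction_pct`, and the record identifiers, are hypothetical; the record counts passed to `reduction_pct` are the ones in the table.

```python
def diff_records(old, new):
    """Records in `new` that are new or changed relative to `old`.
    Deletions are ignored in this sketch and would need separate handling."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}


def pathogenic_ids(records):
    """Toy process P: ids of the records labelled pathogenic."""
    return {k for k, v in records.items() if v == "pathogenic"}


def reduction_pct(to_version_count, difference_count):
    """Share of the new version that does NOT have to be processed."""
    return 100 * (1 - difference_count / to_version_count)


old = {"v1": "benign", "v2": "uncertain"}
new = {"v1": "benign", "v2": "pathogenic", "v3": "benign"}

delta = diff_records(old, new)
old_outcome = pathogenic_ids(old)
# Drop stale results for changed records, then add results computed on the diff.
incremental = (old_outcome - delta.keys()) | pathogenic_ids(delta)

print(incremental == pathogenic_ids(new))
for to_count, diff_count in [(290815, 38216), (285042, 35550), (290815, 3322)]:
    print(f"{reduction_pct(to_count, diff_count):.1f}%")
```

The merge step works here only because the toy P is record-wise; a process that aggregates across records could not be patched this way, which is one reason such a scheme does not carry over to arbitrary processes.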
  16. Approach 3/3: scope across a population. Which instances of the process are affected? Goal: avoid re-executing instances that are not affected by the change. Sample result: reduction in the number of complete re-executions, 495 → 71.
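
Scoping can be sketched as a cheap filter that runs before any re-execution: keep only the instances whose inputs intersect the change. The cohort, variant identifiers, and function name below are invented for illustration; only the 495 and 71 figures come from the slide.

```python
def affected_instances(patient_variants, changed_variants):
    """Patients with at least one variant among the changed records."""
    return sorted(
        patient
        for patient, variants in patient_variants.items()
        if variants & changed_variants
    )


# Hypothetical cohort: patient id -> set of variant ids used in their analysis.
cohort = {
    "p01": {"v1", "v7"},
    "p02": {"v2", "v3"},
    "p03": {"v4"},
}
changed = {"v3", "v9"}  # variants whose reference record changed

print(affected_instances(cohort, changed))

# Effect of scoping reported on the slide: 495 candidate re-executions -> 71.
avoided_pct = 100 * (1 - 71 / 495)
print(f"re-executions avoided: {avoided_pct:.1f}%")
```

In this toy cohort only `p02` would be re-run; on the slide's sample result, scoping avoids roughly 86% of the complete re-executions.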
  17. P4@NU: Data Science for Health. New case studies, linking with the Turing: data science processes to support personalized healthcare. 2-year pilot project.
  18. P4@NU: Data Science for Health
      • Biological age is the mismatch between chronological age and the stage of an individual along the ageing process. [2]
      • Digital biomarkers come from "novel sensing systems capable of continuously tracking behavioral signals […]" [1]
      • The UK Biobank
      [1] Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156.
      [2] Partridge L, Deelen J, Slagboom PE. Facing up to the global challenges of ageing. Nature. 2018;561(7721):45.
  19. Research questions. Focus on preventable metabolic diseases associated with inactivity and ageing: cardiovascular disease and type II diabetes.
      • Science: What is the role of digital biomarkers in predicting the onset of age-related diseases?
        - Signals ahead of phenotype expression
        - Interaction with genetic and biochemical markers
      • Data Science: What kind of learning framework is needed to efficiently extract and integrate multi-scale, multi-source features at scale?
  20. Approach: activity + genetics
      • UK Biobank data: raw activity traces, disease codes, health records, genotypes
      • Activity side: signal processing → cleaning → segmentation → (annotations); feature extraction and representation learning; candidate digital biomarkers. Tools: Oxford accel analysis, GGIR
      • Genetics side: genotypes, GWAS, relevant genetic loci (*), polygenic risk factors for T2D
      • Predictive modelling: are any of the biomarkers also predictors of risk?
      • Cohort: 500K → 100k participants → 50k traces, each 24h x 7d @ 100 Hz
      • Training dataset: 130 features / data point, 20K data points / trace, 50k traces; fine-grained: 1E11 data points
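
The data volumes on this slide can be put in perspective with some back-of-the-envelope arithmetic. The 30-second epoch length below is an assumption, chosen because it reproduces the quoted "20K data points / trace"; the slide does not state how a data point is defined, nor what exactly the 1E11 figure counts.

```python
# Quoted on the slide
sampling_hz = 100
days, hours_per_day = 7, 24
traces = 50_000
features_per_point = 130

# Assumption for this sketch: one data point summarises a 30-second epoch.
epoch_seconds = 30

seconds_per_trace = days * hours_per_day * 3600
raw_samples_per_trace = seconds_per_trace * sampling_hz
points_per_trace = seconds_per_trace // epoch_seconds
total_points = points_per_trace * traces
total_feature_values = total_points * features_per_point

print(f"raw samples per trace (per axis): {raw_samples_per_trace:,}")
print(f"data points per trace: {points_per_trace:,}")
print(f"data points over all traces: {total_points:.2e}")
print(f"feature values over all traces: {total_feature_values:.2e}")
```

Under this reading each trace holds about 60 million raw samples per axis and about 20K epochs. The product of epochs, traces, and features is on the order of 1E11, which may be what the slide's fine-grained figure refers to.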
  21. Thank you
      • School of Computing: Dr. Jacek Cala, Sr. Researcher; Dr. Jannetta Steyn, Sr. Researcher; Ben Lam, PhD student
      • Institute of Genetic Medicine: Prof. Joris Veltman, Head of the Institute; Prof. Heather Cordell, Genetics Statistics
      • National Innovation Centre for Ageing