
Preserving the currency of analytics outcomes over time through selective re-computation: techniques, initial findings, and open challenges


Invited talk at the University of Leeds School of Computing (School Colloquia series), Nov. 24, 2017



  1. 1. 1 ReComp–UniversityofLeeds November,2017 Preserving the currency of analytics outcomes over time through selective re-computation: techniques, initial findings, and open challenges recomp.org.uk Paolo Missier, Jacek Cala, Jannetta Steyn School of Computing Newcastle University, UK University of Leeds School of Computing Colloquia series November, 2017 Meta-* In collaboration with • Cambridge University (Prof. Chinnery, Department of Clinical Neurosciences) • Institute of Genetic Medicine, Newcastle University • School of GeoSciences, Newcastle University
  2. 2. 2 ReComp–UniversityofLeeds November,2017 Data Science Meta-knowledge Big Data The Big Analytics Machine Algorithms Tools Middleware Reference datasets “Valuable Knowledge”
  3. 3. 3 ReComp–UniversityofLeeds November,2017 Data Science over time Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t Life Science Analytics
  4. 4. 4 ReComp–UniversityofLeeds November,2017 Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • ReComp: selective re-computation to refresh outcomes in reaction to change • Techniques and initial findings • Open challenges
  5. 5. 5 ReComp–UniversityofLeeds November,2017 Data Analytics enabled by Next-Gen Sequencing. Genomics: WES / WGS, variant calling, variant interpretation → diagnosis (eg the 100K Genome Project, Genomics England, GeCIP). Submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface. [Pipeline figure: Stage 1: align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants; Stage 3: filter variants, annotate.] Metagenomics: species identification (eg the EBI Metagenomics portal)
  6. 6. 6 ReComp–UniversityofLeeds November,2017 SVI: Simple Variant Interpretation. Genomics: WES / WGS, variant calling, variant interpretation → diagnosis (eg the 100K Genome Project, Genomics England, GeCIP). [Same three-stage pipeline figure: align/clean/recalibrate/coverage, call and recalibrate variants, filter and annotate.] SVI filters, then classifies variants into three categories: pathogenic, benign, and unknown/uncertain. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer
  7. 7. 7 ReComp–UniversityofLeeds November,2017 Changes that affect variant interpretation What changes: - Improved sequencing / variant calling - ClinVar, OMIM evolve rapidly - New reference data sources Evolution in number of variants that affect patients (a) with a specific phenotype (b) Across all phenotypes
  8. 8. 8 ReComp–UniversityofLeeds November,2017 Baseline: blind re-computation Sparsity issue: • About 500 executions • 33 patients • total runtime about 60 hours • Only 14 relevant output changes detected: 4.2 hours of computation per change ≈7 minutes / patient (single-core VM) Should we care about database updates?
  9. 9. 9 ReComp–UniversityofLeeds November,2017 Whole-exome variant calling: expensive. Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43 GATK quality score recalibration; Annovar functional annotations (eg MAF, synonymy, SNPs…) followed by in-house annotations; BWA, Bowtie, Novoalign; Picard: MarkDuplicates; GATK-HaplotypeCaller, FreeBayes, SamTools; variant recalibration
  10. 10. 10 ReComp–UniversityofLeeds November,2017 Whole-Exome Sequencing pipeline: scale Data stats per sample: 4 files per sample (2-lane, pair-end, reads) ≈15 GB of compressed text data (gz) ≈40 GB uncompressed text data (FASTQ) Usually 30-40 input samples 0.45-0.6 TB of compressed data 1.2-1.6 TB uncompressed Most steps use 8-10 GB of reference data Small 6-sample run takes about 30h on the IGM HPC machine (Stage1+2) Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016
  11. 11. 11 ReComp–UniversityofLeeds November,2017 Workflow Design: a typical "wrapper" block script (alongside utility blocks, with From/To connections):

    echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
    mkdir -p $PICARD_OUTDIR
    mkdir -p $PICARD_TEMP
    echo Starting PICARD to clean BAM files...
    $Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
    echo Starting PICARD to remove duplicates...
    $Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
    echo Adding read group information to bam file...
    $Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
    echo Indexing bam files...
    samtools index $SORTED_BAM_FILE_NODUPS
  12. 12. 12 ReComp–UniversityofLeeds November,2017 Workflow design. Conceptual: the three-stage pipeline (Stage 1: align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants; Stage 3: filter variants, annotate). Actual: 11 workflows, 101 blocks, 28 tool blocks
  13. 13. 13 ReComp–UniversityofLeeds November,2017 Parallelism in the pipeline align-clean- recalibrate-coverage … align-clean- recalibrate-coverage Sample 1 Sample n Variant calling recalibration Variant calling recalibration Variant filtering annotation Variant filtering annotation …… Chromosome split Per-sample Parallel processing Per-chromosome Parallel processing Stage I Stage II Stage III
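The per-sample / per-chromosome fan-out on this slide can be sketched in a few lines. A minimal Python sketch, where `stage1_align_clean` and `stage2_call_variants` are hypothetical stand-ins for the real e-Science Central blocks:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions standing in for the real workflow blocks.
def stage1_align_clean(sample):
    # Per-sample Stage I: align, clean, recalibrate alignments, calculate coverage.
    return f"{sample}.bam"

def stage2_call_variants(bam, chrom):
    # Per-chromosome Stages II/III: call, recalibrate, filter, annotate variants.
    return f"{bam}:{chrom}.vcf"

def run_pipeline(samples, chromosomes):
    with ThreadPoolExecutor() as pool:
        # Stage I fans out over samples...
        bams = list(pool.map(stage1_align_clean, samples))
        # ...then Stages II/III fan out over (sample, chromosome) pairs.
        jobs = [(b, c) for b in bams for c in chromosomes]
        vcfs = list(pool.map(lambda args: stage2_call_variants(*args), jobs))
    return vcfs

print(len(run_pipeline(["S1", "S2"], ["chr1", "chr2", "chr3"])))  # 2 x 3 = 6 outputs
```

In the real pipeline each stage function would launch a workflow engine invocation rather than return a string; the fan-out structure is the point.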
  14. 14. 15 ReComp–UniversityofLeeds November,2017 Performance. Configuration for the 3-VM experiments: Azure workflow engines on D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04. [Chart: response time (hh:mm, 00:00 to 72:00) vs. number of samples (0 to 24), for 3 engines (24 cores), 6 engines (48 cores), and 12 engines (96 cores).]
  15. 15. 17 ReComp–UniversityofLeeds November,2017 Whole-exome variant calling: unstable. Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43 GATK quality score recalibration; Annovar functional annotations (eg MAF, synonymy, SNPs…) followed by in-house annotations; BWA, Bowtie, Novoalign; Picard: MarkDuplicates; GATK-HaplotypeCaller, FreeBayes, SamTools; variant recalibration. dbSNP builds: 150 (2/17), 149 (11/16), 148 (6/16), 147 (4/16). Any of these stages may change over time, semi-independently
  16. 16. 18 ReComp–UniversityofLeeds November,2017 Comparing three versions of FreeBayes. Should we care about changes in the pipeline? • Tested three versions of the caller: • 0.9.10 (Dec 2013) • 1.0.2 (Dec 2015) • 1.1 (Nov 2016) • The Venn diagram shows a quantitative comparison (% and number) of filtered variants • Phred quality score > 30 • 16 patient BAM files (7 AD, 9 FTD-ALS)
  17. 17. 20 ReComp–UniversityofLeeds November,2017 Impact on SVI classification. Patient phenotypes: 7 Alzheimer's (AD), 9 FTD-ALS. The ONLY change in the pipeline is the version of FreeBayes used to call variants. (R)ed: confirmed pathogenicity; (A)mber: uncertain pathogenicity.

     Patient ID:        B_0190  B_0191  B_0192  B_0193  B_0195  B_0196  B_0198  B_0199  B_0201  B_0202  B_0203  B_0208  B_0209  B_0211  B_0213  B_0214
     FreeBayes 0.9.10:  A       A       R       A       R       R       R       R       R       A       R       R       R       R       A       R
     FreeBayes 1.0.2:   A       A       R       A       R       R       A       A       R       A       R       A       R       A       A       R
     FreeBayes 1.1:     A       A       R       A       R       R       A       A       R       A       R       A       R       A       A       R
     Phenotype:         ALS-FTD ALS-FTD ALS-FTD ALS-FTD ALS-FTD ALS-FTD AD      ALS-FTD AD      AD      AD      AD      AD      ALS-FTD ALS-FTD AD
  18. 18. 21 ReComp–UniversityofLeeds November,2017 Changes: frequency / impact / cost. [Chart: change frequency (high to low) vs. change impact on a cohort (low to high), placing GATK, the variant caller, variant annotations (Annovar), the reference human genome, variant DBs (eg ClinVar), phenotype → disease mapping (eg OMIM GeneMap), and new sequences (the N+1 problem); grouped into variant calling vs. variant interpretation.]
  19. 19. 22 ReComp–UniversityofLeeds November,2017 Changes: frequency / impact / cost. [Same chart: change frequency vs. change impact on a cohort, for GATK, the variant caller, variant annotations (Annovar), the reference human genome, variant DBs (eg ClinVar), phenotype → disease mapping (eg OMIM GeneMap), and new sequences (the N+1 problem), with the ReComp space highlighted.]
  20. 20. 23 ReComp–UniversityofLeeds November,2017 Understanding change Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t • Threats: Will any of the changes invalidate prior findings? • Opportunities: Can the findings be improved over time? Challenge space = expensive analysis + frequent changes + high impact Case studies: - Bioinformatics pipelines: long running time per case x thousands of cases (- Long-running simulations: modelling flood events with terrain changes)
  21. 21. 24 ReComp–UniversityofLeeds November,2017 When should we repeat an expensive simulation? CityCat flood simulator, run on an extreme rainfall event; processing the Newcastle area takes 5 hours. New buildings may alter data flow. Can we predict high-difference areas?
  22. 22. 25 ReComp–UniversityofLeeds November,2017 Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • ReComp: selective re-computation to refresh outcomes in reaction to change • Techniques and initial findings • Open challenges Project structure • 3 years funding - Feb. 2016 - Jan. 2019 • In collaboration with • Cambridge University (Prof. Chinnery, Department of Clinical Neurosciences) • Institute of Genetic Medicine, Newcastle University • School of GeoSciences, Newcastle University
  23. 23. 26 ReComp–UniversityofLeeds November,2017 The ReComp meta-process: Estimate impact of changes, Select and Enact, Record execution history, Detect and measure changes; History DB; Data diff(.,.) functions; Change Events; Process P; Observe Exec. 1. Capture the history of past computations: - Process structure and dependencies - Cost - Provenance of the outcomes 2. Metadata analytics: learn from history - Estimation models for impact, cost, benefits. Approach: 1. Quantify data-diff and impact of changes on prior outcomes 2. Collect and exploit process history metadata. Changes: • Algorithms and tools • Accuracy of input sequences • Reference databases (HGMD, ClinVar, OMIM GeneMap…)
  24. 24. 27 ReComp–UniversityofLeeds November,2017 Diff functions: example ClinVar 1/2016 ClinVar 1/2017 diff (unchanged)
  25. 25. 28 ReComp–UniversityofLeeds November,2017 Compute difference sets – ClinVar. The ClinVar dataset: 30 columns. Changes: records 349,074 → 543,841 (added 200,746, removed 5,979, updated 27,662)
  26. 26. 29 ReComp–UniversityofLeeds November,2017 For tabular data, difference is just Select-Project. Key columns: {"#AlleleID", "Assembly", "Chromosome"}; "where" columns: {"ClinicalSignificance"}
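The Select-Project diff for tabular data can be sketched as follows, assuming each record is a dict and using the key and "where" columns above; `diff_table` is a hypothetical name, not part of ReComp:

```python
def diff_table(old_rows, new_rows, key_cols, where_cols):
    """Key-based diff of two table versions (each row a dict).
    Rows are matched on key_cols; a matched row counts as 'updated' only
    if it differs on one of the where_cols (e.g. ClinicalSignificance)."""
    key = lambda r: tuple(r[c] for c in key_cols)
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    added   = [new[k] for k in new.keys() - old.keys()]
    removed = [old[k] for k in old.keys() - new.keys()]
    updated = [new[k] for k in new.keys() & old.keys()
               if any(old[k][c] != new[k][c] for c in where_cols)]
    return added, removed, updated

# Two toy "ClinVar versions":
old = [{"#AlleleID": "1", "Assembly": "GRCh37", "Chromosome": "1", "ClinicalSignificance": "Benign"},
       {"#AlleleID": "2", "Assembly": "GRCh37", "Chromosome": "2", "ClinicalSignificance": "Pathogenic"}]
new = [{"#AlleleID": "2", "Assembly": "GRCh37", "Chromosome": "2", "ClinicalSignificance": "Uncertain"},
       {"#AlleleID": "3", "Assembly": "GRCh37", "Chromosome": "3", "ClinicalSignificance": "Benign"}]
a, r, u = diff_table(old, new, ["#AlleleID", "Assembly", "Chromosome"], ["ClinicalSignificance"])
print(len(a), len(r), len(u))  # 1 1 1
```

Run at scale this yields exactly the added/removed/updated counts reported on the previous slide.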
  27. 27. 31 ReComp–UniversityofLeeds November,2017 History DB: Workflow Provenance. Each invocation of an eSC workflow generates a provenance trace. [ProvONE model diagram (http://vcvcomputing.com/provone/provone.html): User, Execution, Program/Workflow, Channel, Port, Entity/Collection, related by wasPartOf, hadMember, wasDerivedFrom, hasSubProgram, controlledBy, used, wasGeneratedBy, wasAssociatedWith and the qualified relations.] Instance pattern: workflow WF (the "plan") contains blocks B1, B2; WFexec (the "plan execution") contains B1exec, B2exec; B1exec uses db and generates Data; B2exec uses Data.
  28. 28. 32 ReComp–UniversityofLeeds November,2017 Approach – a combination of techniques 1. Partial re-execution • Identify and re-enact the portion of a process that are affected by change 2. Differential execution • Input to the new execution consists of the differences between two versions of a changed dataset • Only feasible if some algebraic properties of the process hold 3. Identifying the scope of change – Loss-less • Exclude instances of the population that are certainly not affected
  29. 29. 33 ReComp–UniversityofLeeds November,2017 Approach – a combination of techniques 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – Loss-less
  30. 30. 34 ReComp–UniversityofLeeds November,2017 SVI as eScience Central workflow Phenotype to genes Variant selection Variant classification Patient variants GeneMap ClinVar Classified variants Phenotype
  31. 31. 35 ReComp–UniversityofLeeds November,2017 1. Partial re-execution. Ex. db = "ClinVar v.x".
  1. Change detection: a provenance fact wasDerivedFrom("db", Dnew) indicates that a new version Dnew of database db is available.
  2. Reacting to the change:
  2.1 Find the entry point(s) into the workflow where db was used: :- execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db")
  2.2 Discover the rest of the sub-workflow graph (executed recursively): :- execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data)
  [Provenance pattern: the "plan" / "plan execution" diagram with WF, B1, B2, WFexec, B1exec, B2exec, db, Data.]
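Step 2.2, discovering the sub-workflow downstream of the blocks that used the changed database, amounts to a reachability query over the dataflow graph. A minimal sketch; the graph encoding and the function name are assumptions, not the ReComp implementation:

```python
from collections import deque

def affected_subworkflow(edges, used_by, changed_input):
    """Given dataflow edges {block: [downstream blocks]} and a map from
    external inputs to the blocks that use them, return the minimal set of
    blocks that must be re-executed when changed_input (e.g. 'ClinVar')
    gets a new version."""
    frontier = deque(used_by.get(changed_input, []))  # entry points (step 2.1)
    affected = set(frontier)
    while frontier:                                   # recursive discovery (step 2.2)
        b = frontier.popleft()
        for succ in edges.get(b, []):
            if succ not in affected:
                affected.add(succ)
                frontier.append(succ)
    return affected

# Toy SVI-like workflow:
edges = {"phenotype_to_genes": ["variant_selection"],
         "variant_selection": ["variant_classification"]}
used_by = {"GeneMap": ["phenotype_to_genes"], "ClinVar": ["variant_classification"]}
print(affected_subworkflow(edges, used_by, "ClinVar"))  # {'variant_classification'}
```

This mirrors the minimal sub-graph result on the next slide: a ClinVar change touches only the final block, while a GeneMap change propagates through the whole workflow.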
  32. 32. 36 ReComp–UniversityofLeeds November,2017 Minimal sub-graphs in SVI: change in ClinVar vs. change in GeneMap. Overhead: cache intermediate data required for partial re-execution (156 MB for GeneMap changes, 37 kB for ClinVar changes). Time savings:

                 Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
     GeneMap     325                          455                           28.5
     ClinVar     287                          455                           37
  33. 33. 41 ReComp–UniversityofLeeds November,2017 Approach – a combination of techniques 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – Loss-less
  34. 34. 42 ReComp–UniversityofLeeds November,2017 2. Differential execution ClinVar 1/2016 ClinVar 1/2017 diff (unchanged)
  35. 35. 43 ReComp–UniversityofLeeds November,2017 P2: Differential execution. Suppose D is a relation (a table), with an old version D1 and a new version D2. diffD(D1, D2) can be expressed as the three difference sets (D+, D-, DΔ), where D+ = D2 \ D1 (added records), D- = D1 \ D2 (removed records), and DΔ contains the records updated between the two versions. We compute the new outcome P(D2) as the combination of the cached prior outcome P(D1) and P applied only to the difference sets. This is effective if the difference sets are much smaller than D2. This can be achieved by evaluating P on D+ ∪ DΔ and retracting the prior results derived from D- ∪ DΔ, provided P satisfies the required algebraic properties (essentially, that it distributes over union).
  36. 36. 44 ReComp–UniversityofLeeds November,2017 P2: Partial re-computation using input difference. Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)). Works for SVI, but hard to generalise: depends on the type of process. Bigger gain: diff(CV1, CV2) is much smaller than CV2.

     GeneMap versions (from → to)   ToVersion records   Difference records   Reduction
     16-03-08 → 16-06-07            15910               1458                 91%
     16-03-08 → 16-04-28            15871               1386                 91%
     16-04-28 → 16-06-01            15897               78                   99.5%
     16-06-01 → 16-06-02            15897               2                    99.99%
     16-06-02 → 16-06-07            15910               33                   99.8%

     ClinVar versions (from → to)   ToVersion records   Difference records   Reduction
     15-02 → 16-05                  290815              38216                87%
     15-02 → 16-02                  285042              35550                88%
     16-02 → 16-05                  290815              3322                 98.9%
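The idea of querying the diff instead of the full new ClinVar version can be sketched as follows, under the assumption that the interpretation step is a per-record lookup (so it distributes over union); all names here are hypothetical:

```python
def recompute_with_diff(prior_output, per_record_query, added, removed_keys, key):
    """Differential re-execution sketch: instead of re-querying the full new
    database, drop prior results derived from removed records and classify
    only the added records. Updated records are handled as removed + added.
    Valid only when per_record_query distributes over union, as a
    Select-Project lookup does."""
    kept  = [r for r in prior_output if key(r) not in removed_keys]
    fresh = [r for r in map(per_record_query, added) if r is not None]
    return kept + fresh

# Toy example: prior classified variants, plus a small ClinVar diff.
prior = [{"id": "v1", "class": "R"}, {"id": "v2", "class": "A"}]
added = [{"id": "v3", "sig": "Pathogenic"}]
classify = lambda rec: {"id": rec["id"], "class": "R"} if rec["sig"] == "Pathogenic" else None
out = recompute_with_diff(prior, classify, added, removed_keys={"v2"}, key=lambda r: r["id"])
print(sorted(r["id"] for r in out))  # ['v1', 'v3']
```

The gain is exactly the reduction shown in the tables: the work done is proportional to the diff, not to the full new version.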
  37. 37. 45 ReComp–UniversityofLeeds November,2017 Approach – a combination of techniques 1. Partial re-execution 2. Differential execution 3. Identifying the scope of change – Loss-less
  38. 38. 46 ReComp–UniversityofLeeds November,2017 3: precisely identify the scope of a change. Patient / DB-version impact matrix. Strong scope (fine-grained provenance). Weak scope (coarse-grained provenance, next slide): "if CVi was used in the processing of pj then pj is in scope". Semantic scope (domain-specific scoping rules).
  39. 39. 47 ReComp–UniversityofLeeds November,2017 A weak scoping algorithm (coarse-grained provenance). Candidate invocation: any invocation I of P whose provenance contains statements of the form used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF). Sketch of the algorithm:
     - For each candidate invocation I of P:
       - partially re-execute using the difference sets as inputs  # see (2)
       - find the minimal subgraph P' of P that needs re-computation  # see (1)
       - repeat: execute P' one step at a time until <empty output> or <P' completed>
       - if <P' completed> and not <empty output> then execute P' on the full inputs
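The sketch above can be rendered in Python roughly as follows; the provenance encoding and the step functions are simplified assumptions:

```python
def recomp_candidates(used_facts, db):
    """Weak scoping: invocation I is a candidate iff coarse-grained
    provenance records that I used `db` at all, i.e. facts of the form
    used(A, db), wasPartOf(A, I), here flattened to (I, db) pairs."""
    return {i for (i, inp) in used_facts if inp == db}

def reexecute_if_in_scope(steps, diff_input):
    """Run the affected sub-workflow one step at a time on the difference
    sets; an empty intermediate output means the change cannot reach the
    final result, so the expensive full re-execution can be skipped."""
    data = diff_input
    for step in steps:
        data = step(data)
        if not data:
            return False, None   # out of scope: stop early
    return True, data            # in scope: trigger full re-execution

# Toy check: a step that keeps only AD-phenotype records.
steps = [lambda rows: [r for r in rows if r["phenotype"] == "AD"]]
print(reexecute_if_in_scope(steps, [{"phenotype": "ALS-FTD"}])[0])  # False
```

This early-exit behaviour is what cuts the number of complete re-executions reported on the next slide.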
  40. 40. 48 ReComp–UniversityofLeeds November,2017 Scoping: precision • The approach avoids the majority of re-computations given a ClinVar change • Reduction in number of complete re-executions from 495 down to 71
  41. 41. 49 ReComp–UniversityofLeeds November,2017 ReComp challenges. Reproducibility: virtualisation. Sensitivity analysis is unlikely to work well: small input perturbations can have a potentially large impact on diagnosis. Learning useful estimators is hard. Diff functions are both type- and application-specific (specific → generic). Not all runtime environments support provenance recording.
  42. 42. 50 ReComp–UniversityofLeeds November,2017 Questions? http://recomp.org.uk/ Meta-*
  43. 43. 51 ReComp–UniversityofLeeds November,2017 The Metadata Analytics challenge: Learning from a metadata DB of execution history to support automated ReComp decisions
  44. 44. 52 ReComp–UniversityofLeeds November,2017 Changes, data diff, impact. 1) Observed change events (to inputs, dependencies, or both). 2) Type-specific diff functions. 3) Impact occurs to various degrees on multiple prior outcomes: the impact of change C on the processing of a specific X is process- and data-specific.
  45. 45. 53 ReComp–UniversityofLeeds November,2017 Impact: importance and Scope Scope: which cases are affected? - Individual variants have an associated phenotype. - Patient cases also have a phenotype “a change in variant v can only have impact on a case X if V and X share the same phenotype” Importance: “Any variant with status moving from/to Red causes High impact on any X that is affected by the variant”
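The phenotype-sharing scoping rule quoted above is straightforward to express; a minimal sketch with hypothetical inputs:

```python
def semantic_scope(changed_variants, cases, variant_phenotype, case_phenotype):
    # Domain-specific scoping rule from the slide: a changed variant v can
    # only have impact on a patient case x if v and x share the same phenotype.
    return {x for x in cases
              for v in changed_variants
              if variant_phenotype[v] == case_phenotype[x]}

scope = semantic_scope({"v1"}, {"p1", "p2"},
                       {"v1": "AD"},
                       {"p1": "AD", "p2": "ALS-FTD"})
print(sorted(scope))  # ['p1']
```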
  46. 46. 54 ReComp–UniversityofLeeds November,2017 History Database HDB: A metadata-database containing records of past executions: Execution records: C1 C2 C3 GATK (Haplotype caller) FreeBayes 0.9 FreeBayes 1.0 FreeBayes 1.1    X1 X2 X3 X4 X5 Y11 Y21 Y31 Y41 Y51 Y12 Y52 Y43 Y53 HDB Example: Consider only one type of change: Variant caller
  47. 47. 55 ReComp–UniversityofLeeds November,2017 ReComp decisions. Given a population X of processed inputs and a change C, ReComp must learn to make a yes/no decision for each input: a decision function that returns True if P is to be executed again on X, and False otherwise. To decide, ReComp must estimate the impact of the change on each prior outcome (as well as estimate the re-computation cost). Objective: maximise reward.
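A minimal sketch of such a decision function and of a simple reward model; the units and the specific reward shape are illustrative assumptions, not ReComp's actual estimators:

```python
def recomp_decision(est_impact, est_cost, cost_weight=1.0):
    # Recompute X iff the estimated impact of the change outweighs the
    # (weighted) estimated re-computation cost. Units are hypothetical.
    return est_impact > cost_weight * est_cost

def reward(decided, true_impact, cost):
    # Illustrative reward: a re-computation earns its realised impact minus
    # its cost; skipping a case is penalised by the impact that was missed.
    return true_impact - cost if decided else -true_impact

print(recomp_decision(est_impact=10, est_cost=3))  # True
```

Maximising the cumulative reward over the whole population X is then the objective the learner optimises.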
  48. 48. 57 ReComp–UniversityofLeeds November,2017 History DB and Differences DB Whenever P is re-computed on input X, a new er’ is added to HDB for X: Using diff() functions we produce a derived difference record dr: … collected in a Differences database: dr1 = Imp(C1,X1) dr2= Imp(C12,X4) dr3 = Imp(C1,X5) dr4 = Imp(C2,X5) DDB C1 C2 C3 GATK (Haplotype caller) FreeBayes 0.9 FreeBayes 1.0 FreeBayes 1.1    X1 X2 X3 X4 X5 Y11 Y21 Y31 Y41 Y51 Y12 Y52 Y43 Y53 HDB   
  49. 49. 58 ReComp–UniversityofLeeds November,2017 ReComp algorithm. [Diagram: given change C and population X, ReComp uses the evidence E: HDB, DDB to produce re-computation decisions, yielding updated evidence E': HDB', DDB'.]
  50. 50. 59 ReComp–UniversityofLeeds November,2017 Learning challenges • Evidence is small and sparse • How can it be used for selecting from X? • Learning a reliable imp() function is not feasible • What's the use of history? You never see the same change twice! • Must somehow use evidence from related changes • A possible approach: • ReComp makes probabilistic decisions, takes chances • Associate a reward to each ReComp decision → reinforcement learning • Bayesian inference (use new evidence to update probabilities)
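One way to realise the probabilistic, evidence-updating decisions suggested here is a Beta-Bernoulli estimate per change type, with occasional exploratory re-computations; a minimal sketch (an illustration of the idea, not ReComp's implementation):

```python
import random

class BetaBandit:
    """Beta-Bernoulli estimate of P(a change of this type has impact),
    updated from the outcome of each re-computation we do perform."""
    def __init__(self, a=1, b=1):
        self.a, self.b = a, b            # uniform Beta(1,1) prior

    def update(self, had_impact):
        if had_impact:
            self.a += 1
        else:
            self.b += 1

    def p_impact(self):
        return self.a / (self.a + self.b)  # posterior mean

    def decide(self, threshold=0.5, epsilon=0.1, rng=random.random):
        # epsilon-greedy: occasionally recompute anyway to gather evidence,
        # otherwise recompute only when impact looks likely enough.
        return rng() < epsilon or self.p_impact() > threshold

b = BetaBandit()
for outcome in [True, False, True, True]:   # outcomes of past re-computations
    b.update(outcome)
print(round(b.p_impact(), 2))  # Beta(4, 2) posterior mean: 0.67
```

Evidence from related changes (e.g. earlier ClinVar releases) can seed the prior for a new, never-seen change of the same type.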
  51. 51. 60 ReComp–UniversityofLeeds November,2017 Questions? http://recomp.org.uk/ Meta-*
