
Selective and incremental re-computation in reaction to changes: an exercise in metadata analytics


Invited research seminar given at Durham University, Computer Science, about findings from the ReComp project http://recomp.org.uk/


  1. Selective and incremental re-computation in reaction to changes: an exercise in metadata analytics. recomp.org.uk. Paolo Missier, Jacek Cala, Jannetta Steyn, School of Computing, Newcastle University, UK. Durham University, May 31st, 2018. Meta-*. In collaboration with: Institute of Genetic Medicine, Newcastle University; School of GeoSciences, Newcastle University.
  2. Data Science. [Diagram: Big Data feeds "The Big Analytics Machine" (algorithms, tools, middleware, reference datasets, meta-knowledge), producing "Valuable Knowledge".]
  3. Data Science over time. [Diagram: the same pipeline with the meta-knowledge (algorithms, tools, middleware, reference datasets) evolving over time t, producing successive versions V1, V2, V3 of the "Valuable Knowledge". Example domain: Life Science Analytics.]
  4. Understanding change. [Diagram as on slide 3, with evolving inputs and meta-knowledge.] Threats: will any of the changes invalidate prior findings? Opportunities: can the findings be improved over time? ReComp space = expensive analysis + frequent changes + high impact. Analytics within the ReComp space: C1: are resource-intensive and thus expensive when repeatedly executed over time, i.e., on a cloud or HPC cluster; C2: require sophisticated implementations to run efficiently, such as workflows with a nested structure; C3: depend on multiple reference datasets and software libraries and tools, some of which are versioned and evolve over time; C4: apply to a possibly large population of input instances; C5: deliver valuable knowledge.
  5. Talk outline. ReComp: selective re-computation to refresh outcomes in reaction to change. • Case study 1: re-computation decisions for flood simulations: learning useful estimators for the impact of change; black-box computation, coarse-grained changes. • Case study 2: high-throughput genomics data processing: an exercise in provenance collection and analytics; white-box computation, fine-grained changes. • Open challenges.
  6. Case study 1: flood modelling simulation. City Catchment Analysis Tool (CityCAT), Vassilis Glenis et al., School of Engineering, NU. Simulation characteristics: part of Newcastle upon Tyne; DTM of ≈2.3M cells, 2x2 m cell size; building and green areas from Nov 2017; rainfall event with a return period of 50 years; simulation time: 60 mins; 10–25 frames with water depth and velocity in each cell; output size: 23x65 MiB ≈ 1.5 GiB. [Figure: water depth heat map.]
  7. When should we repeat an expensive simulation? Extreme rainfall event simulation (in Newcastle) with the CityCAT flood simulator, producing flood diffusion time series. New buildings / green areas may alter the data flow. Can we predict high-difference areas without re-running the simulation? Running CityCAT is generally expensive: processing for the Newcastle area takes ≈3 h on a 4-core i7 3.2 GHz CPU, and it is a placeholder for more expensive simulations. Map updates are infrequent (6 months), but useful when simulating changes, e.g. for planning purposes.
  8. Estimating the impact of a flood simulation. Suppose we are able to quantify the difference in inputs, diff(M, M'), and the difference in outputs, diff(F, F'). Suppose also that we are only interested in large enough changes between two outputs: diff(F, F') > θ_O (1), for some user-defined parameter θ_O. Problem statement: can we define an ideal ReComp decision function which operates on the two versions of the inputs, M and M', and the old output F, and returns true iff (1) would return true when F' is actually computed? In other words, can we predict when F' needs to be computed?
  9. Approach. 1. Define input diff and output diff functions, diff_I(M, M') and diff_O(F, F'). 2. Define an impact function imp(diff_I(M, M'), F). 3. Define the ReComp decision function: re-compute iff imp(diff_I(M, M'), F) > θ_imp, where θ_imp is a tunable parameter. ReComp approximates (1), so it is subject to errors. False positives: a re-computation is triggered but the actual output change is below θ_O. False negatives: no re-computation is triggered although the actual output change exceeds θ_O. 4. Use ground data to determine values for θ_imp as a function of FPR and FNR. Note: the ReComp decision function should be much less expensive to compute than sim().
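A minimal sketch of the decision rule just described, in Python. The names diff_inputs, impact and theta_imp are placeholders standing in for the project's actual diff and impact functions, which the slides only describe informally.

```python
# Hypothetical sketch of the ReComp decision rule on slide 9. diff_inputs() and
# impact() stand in for the project-specific functions; they are assumptions,
# not the actual ReComp implementation.

def recomp_decision(M_old, M_new, F_old, diff_inputs, impact, theta_imp):
    """Return True iff the estimated impact of the input change exceeds theta_imp."""
    delta_in = diff_inputs(M_old, M_new)        # e.g. changed building / green-area polygons
    estimated_impact = impact(delta_in, F_old)  # e.g. max avg. water depth around the changes
    return estimated_impact > theta_imp
```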
  10. Diff and impact functions. Polygon classes: B: buildings, L: other land, H: hard surface. The input diff function f() partitions polygon changes into 6 types (e.g. B− L+, B− ∩ L+, B−, ...). For each type, compute the average water depth d within and around the footprint of the change; the impact function returns the max of the average water depth over all changes. The output diff is the max of the differences between the spatially averaged F and F' over a window W.
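For illustration, a rough sketch of that output diff, assuming the two flood maps are available as 2-D water-depth arrays; the block-mean windowing below is an assumption about how the spatial averaging might be done, not the CityCAT code.

```python
import numpy as np

def output_diff(F_old, F_new, window=10):
    """Max absolute difference between spatially averaged water-depth grids.

    F_old, F_new: 2-D numpy arrays of water depth (same shape).
    window: side of the averaging window in cells (e.g. 10 cells = 20 m at 2x2 m).
    """
    h, w = F_old.shape
    h, w = h - h % window, w - w % window          # crop to a multiple of the window
    def block_mean(F):
        return F[:h, :w].reshape(h // window, window, w // window, window).mean(axis=(1, 3))
    return np.max(np.abs(block_mean(F_old) - block_mean(F_new)))
```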
  11. Tuning the threshold parameter θ_imp. Ground data from all past re-computations: FP: <1,0>, FN: <0,1>. Set the FNR to be close to 0, then experimentally find the θ_imp that minimises the FPR (max specificity). [Plots: precision, recall, accuracy and specificity as functions of θ_imp (0.10–0.25), for window size 20x20 m and θ_O = 0.2 m, over all changes and over consecutive changes.]
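A sketch of how such a tuning step could be coded, assuming ground-truth labels (whether the true output change exceeded θ_O) are available for past re-computations; the function and variable names are illustrative only.

```python
import numpy as np

def tune_theta_imp(impacts, truly_changed, candidate_thetas, max_fnr=0.0):
    """Pick the theta_imp with the lowest FPR among thresholds whose FNR <= max_fnr.

    impacts: array of impact scores from past (M, M') pairs.
    truly_changed: boolean array, True where the recomputed diff(F, F') > theta_O.
    """
    best = None
    for theta in candidate_thetas:
        fired = impacts > theta
        fn = np.sum(~fired & truly_changed)
        fp = np.sum(fired & ~truly_changed)
        fnr = fn / max(np.sum(truly_changed), 1)
        fpr = fp / max(np.sum(~truly_changed), 1)
        if fnr <= max_fnr and (best is None or fpr < best[1]):
            best = (theta, fpr)
    return best  # (theta_imp, achieved FPR), or None if no threshold meets max_fnr
```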
  12. Experimental results.
  13. Summary of the approach. [Diagram: from M, M' and the old output F, the decision function returns True (compute F') or False; historical data <M, M', F, F'> provides the ground data used to tune θ_imp against a target FPR.]
  14. Talk outline. ReComp: selective re-computation to refresh outcomes in reaction to change. • Case study 1: re-computation decisions for flood simulations: learning useful estimators for the impact of change. • Case study 2: high-throughput genomics data processing: an exercise in provenance collection and analytics. • Open challenges.
  15. Data analytics enabled by Next-Gen Sequencing. Genomics: WES / WGS, variant calling, variant interpretation → diagnosis (e.g. the 100K Genome Project, Genomics England, GeCIP). Submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface. [Pipeline diagram, three stages: Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage (coverage information); Stage 2: call variants → recalibrate variants → filter variants; Stage 3: annotate (annotated variants).] Metagenomics: species identification (e.g. the EBI Metagenomics portal).
  16. Whole-exome variant calling pipeline. Alignment: BWA, Bowtie, Novoalign; Picard: MarkDuplicates; GATK quality score recalibration; variant calling: GATK HaplotypeCaller, FreeBayes, SamTools; variant recalibration; Annovar functional annotations (e.g. MAF, synonymy, SNPs, …) followed by in-house annotations. Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
  17. Expensive. Data stats per sample: 4 files per sample (2-lane, pair-end reads); ≈15 GB of compressed text data (gz); ≈40 GB of uncompressed text data (FASTQ). Usually 30–40 input samples: 0.45–0.6 TB of compressed data, 1.2–1.6 TB uncompressed. Most steps use 8–10 GB of reference data. A small 6-sample run takes about 30 h on the IGM HPC machine (Stages 1+2). Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016.
  18. SVI: Simple Variant Interpretation. Genomics: WES / WGS, variant calling, variant interpretation → diagnosis (e.g. the 100K Genome Project, Genomics England, GeCIP). [Pipeline diagram as on slide 15.] SVI filters and then classifies variants into three categories: pathogenic, benign and unknown/uncertain. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  19. Changes that affect variant interpretation. What changes: improved sequencing / variant calling; ClinVar and OMIM evolve rapidly; new reference data sources. [Charts: evolution in the number of variants that affect patients, (a) with a specific phenotype, (b) across all phenotypes.]
  20. Baseline: blind re-computation. Sparsity issue: about 500 executions over 33 patients, total runtime about 60 hours, but only 14 relevant output changes detected: 4.2 hours of computation per change, ≈7 minutes per patient (single-core VM). Should we care about database updates?
  21. Unstable. Any of the pipeline stages may change over time, semi-independently. dbSNP builds: 150 (2/17), 149 (11/16), 148 (6/16), 147 (4/16). Human reference genome: h19 → h37, h38, … [Pipeline diagram as on slide 16 (Van der Auwera et al. 2013).]
  22. FreeBayes vs SamTools vs GATK HaplotypeCaller. GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. https://doi.org/10.1101/gr.107524.110. FreeBayes: Garrison, Erik, and Gabor Marth. "Haplotype-based variant detection from short-read sequencing." arXiv preprint arXiv:1207.3907 (2012). GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835. Adam Cornish and Chittibabu Guda, "A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference," BioMed Research International, vol. 2015, Article ID 456479, 11 pages, 2015. doi:10.1155/2015/456479. Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports, 5(December), 17875. https://doi.org/10.1038/srep17875.
  23. Comparing three versions of FreeBayes. Should we care about changes in the pipeline? Tested three versions of the caller: 0.9.10 (Dec 2013), 1.0.2 (Dec 2015), 1.1 (Nov 2016). [Venn diagram: quantitative comparison (% and number) of filtered variants.] Phred quality score > 30; 16 patient BAM files (7 AD, 9 FTD-ALS).
  24. Impact on SVI classification. Patient phenotypes: 7 Alzheimer's (AD), 9 FTD-ALS. The ONLY change in the pipeline is the version of FreeBayes used to call variants. (R)ed: confirmed pathogenicity; (A)mber: uncertain pathogenicity. Patient IDs: B_0190, B_0191, B_0192, B_0193, B_0195, B_0196, B_0198, B_0199, B_0201, B_0202, B_0203, B_0208, B_0209, B_0211, B_0213, B_0214. FreeBayes 0.9.10: A A R A R R R R R A R R R R A R. FreeBayes 1.0.2: A A R A R R A A R A R A R A A R. FreeBayes 1.1: A A R A R R A A R A R A R A A R. Phenotype: ALS-FTD, ALS-FTD, ALS-FTD, ALS-FTD, ALS-FTD, ALS-FTD, AD, ALS-FTD, AD, AD, AD, AD, AD, ALS-FTD, ALS-FTD, AD.
  25. Changes: frequency / impact / cost. [Chart: change types plotted by change frequency (low–high) against change impact on a cohort (low–high): GATK, variant caller, variant annotations (Annovar), reference human genome, variant DB (e.g. ClinVar), phenotype → disease mapping (e.g. OMIM GeneMap), new sequences (the N+1 problem); spanning variant calling and variant interpretation.]
  26. Changes: frequency / impact / cost (continued). [Same chart, with the high-frequency, high-impact region highlighted as the ReComp space.]
  27. When is ReComp effective?
  28. The ReComp meta-process. [Diagram: change events and data diff(.,.) functions feed a loop that observes executions of process P, detects and measures changes, estimates the impact of changes, selects and enacts re-computations, and records execution history in a History DB.] Approach: 1. Quantify data-diff and the impact of changes on prior outcomes. 2. Collect and exploit process history metadata: capture the history of past computations (process structure and dependencies, cost, provenance of the outcomes), then learn from that history (estimation models for impact, cost, benefits). Changes: algorithms and tools; accuracy of input sequences; reference databases (HGMD, ClinVar, OMIM GeneMap, …).
  29. Changes, data diff, impact. 1) Observed change events (to inputs, dependencies, or both), e.g. C = {d → d'}. 2) Type-specific diff functions, e.g. diff_D(d, d'). 3) Impact occurs to various degrees on multiple prior outcomes; the impact of a change C on the processing of a specific X is process- and data-specific.
  30. Impact. Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output. However, a change in one of the dependencies, C = {d → d'}, affects all outputs y_t where version d of D was used.
  31. Impact: importance and scope. Scope: which cases are affected? Individual variants have an associated phenotype; patient cases also have a phenotype: "a change in variant v can only have impact on a case X if v and X share the same phenotype". Importance: "any variant with status moving from/to Red causes High impact on any X that is affected by the variant".
  32. Approach – a combination of techniques. 1. Partial re-execution: identify and re-enact the portions of a process that are affected by a change. 2. Differential execution: the input to the new execution consists of the differences between two versions of a changed dataset; only feasible if some algebraic properties of the process hold. 3. Identifying the scope of change (loss-less): exclude instances of the population that are certainly not affected.
  33. Approach – a combination of techniques. 1. Partial re-execution. 2. Differential execution. 3. Identifying the scope of change (loss-less).
  34. Role of workflow provenance in partial re-runs. [Diagram: a PROV-style workflow provenance model relating User, Execution, Program, Workflow, Channel, Port and Controller through relations such as wasPartOf, wasDerivedFrom, hasSubProgram, hadPlan, controlledBy/controls, hasInPort/hasOutPort, connectsTo, used, wasGeneratedBy, wasInformedBy, wasAssociatedWith and their qualified forms.]
  35. History DB: workflow provenance. Each invocation of an eSC (eScience Central) workflow generates a provenance trace. [Diagram: the "plan" (workflow WF with program blocks B1, B2) and the "plan execution" (WFexec with B1exec, B2exec) are linked by association and partOf relations; the Data entity is generated by B1exec and used by B2exec, and the reference database db (an Entity) is linked by a usage relation.]
  36. SVI as an eScience Central workflow. [Workflow diagram: inputs Phenotype and Patient variants; blocks Phenotype to genes, Variant selection, Variant classification; reference data GeneMap and ClinVar; output Classified variants.]
  37. 1. Partial re-execution. 1. Change detection: a provenance fact indicates that a new version Dnew of database db is available: wasDerivedFrom("db", Dnew). Example: db = "ClinVar v.x". 2. Reacting to the change. 2.1 Find the entry point(s) into the workflow where db was used, by matching the provenance pattern: execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db"). 2.2 Discover the rest of the sub-workflow graph, recursively, using the pattern: execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data). [Provenance pattern diagram: "plan" WF with B1, B2; "plan execution" WFexec with B1exec, B2exec; Data generated by B1exec and used by B2exec; db linked by a usage relation.]
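To make the two queries above concrete, here is a small Python sketch that assumes the provenance facts are available as in-memory sets of tuples; only the predicate names come from the slide, while the data layout and helper names are invented for illustration.

```python
# Illustrative sketch of steps 2.1 and 2.2, assuming provenance facts are held as
# simple Python sets of tuples; this is not the actual ReComp/eSC implementation.

def entry_points(used, was_part_of, wf_exec, db_id):
    """Step 2.1: block executions of wf_exec that used the changed database."""
    # used: set of (execution, entity) pairs; was_part_of: set of (execution, wf_exec) pairs
    return {x for (x, d) in used
            if d == db_id and (x, wf_exec) in was_part_of}

def affected_subgraph(used, was_generated_by, seeds):
    """Step 2.2: follow generation/usage edges downstream, recursively."""
    # was_generated_by: set of (entity, execution) pairs
    affected, frontier = set(seeds), set(seeds)
    while frontier:
        data_out = {d for (d, x) in was_generated_by if x in frontier}   # data they produced
        frontier = {x for (x, d) in used if d in data_out} - affected    # executions that used it
        affected |= frontier
    return affected
```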
  38. Minimal sub-graphs in SVI. [Diagrams: the minimal sub-workflows re-executed for a change in ClinVar and for a change in GeneMap.] Overhead: cache the intermediate data required for partial re-execution: 156 MB for GeneMap changes and 37 kB for ClinVar changes. Time savings: GeneMap: 325 s partial re-execution vs 455 s complete re-execution (28.5% saving); ClinVar: 287 s vs 455 s (37% saving).
  39. Approach – a combination of techniques. 1. Partial re-execution. 2. Differential execution. 3. Identifying the scope of change (loss-less).
  40. Diff functions: example. [Example records from ClinVar 1/2016 and ClinVar 1/2017 and their diff; unchanged records are excluded.]
  41. Compute difference sets – ClinVar. The ClinVar dataset: 30 columns. Changes: records 349,074 → 543,841; added 200,746; removed 5,979; updated 27,662.
  42. For tabular data, difference is just Select-Project. Key columns: {"#AlleleID", "Assembly", "Chromosome"}; "where" columns: {"ClinicalSignificance"}.
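A sketch of what such a select-project diff might look like with pandas, assuming both ClinVar versions are loaded as DataFrames; the key and "where" columns are taken from the slide, while the function itself and its return structure are illustrative assumptions rather than the ReComp code.

```python
import pandas as pd

# Key columns identify a record across versions; "where" columns are the ones whose
# change we care about (from the slide). Everything else here is illustrative.
KEY = ["#AlleleID", "Assembly", "Chromosome"]
WHERE = ["ClinicalSignificance"]

def table_diff(old: pd.DataFrame, new: pd.DataFrame):
    merged = old[KEY + WHERE].merge(new[KEY + WHERE], on=KEY, how="outer",
                                    suffixes=("_old", "_new"), indicator=True)
    added   = merged[merged["_merge"] == "right_only"]
    removed = merged[merged["_merge"] == "left_only"]
    both    = merged[merged["_merge"] == "both"]
    updated = both[(both[[c + "_old" for c in WHERE]].values
                    != both[[c + "_new" for c in WHERE]].values).any(axis=1)]
    return added, removed, updated
```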
  43. Differential execution. [Example records from ClinVar 1/2016 and ClinVar 1/2017 and their diff, as on slide 40; unchanged records are excluded.]
  44. Differential execution. Suppose D is a relation (a table). diff_D(D, D') can then be expressed as a pair of difference sets: the added records (in D' but not in D) and the removed records (in D but not in D'). We compute the new output P(D') as the combination of the previous output P(D) with P applied to the difference sets. This is effective if the difference sets are much smaller than D'. This can be achieved provided P is distributive wrt set union and difference. Cf. F. McSherry, D. Murray, R. Isaacs, and M. Isard, "Differential dataflow," in Proceedings of CIDR 2013, 2013.
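A toy illustration of that distributivity condition, under the assumption that the combination takes the form new_output = (old_output ∪ P(added)) \ P(removed); the per-record selection used as P below is invented for the example, not taken from SVI.

```python
# Toy illustration of differential execution for a process P that is distributive
# over set union and difference. P is a simple per-record selection (a hypothetical
# predicate); real processes need the same algebraic property to qualify.

def P(records):
    return {r for r in records if r[1] == "Pathogenic"}   # hypothetical selection

def differential_P(old_output, added, removed):
    """Recompute P on the new version using only the difference sets."""
    return (old_output | P(added)) - P(removed)

D_old = {("v1", "Benign"), ("v2", "Pathogenic")}
added, removed = {("v3", "Pathogenic")}, {("v2", "Pathogenic")}
D_new = (D_old | added) - removed
assert differential_P(P(D_old), added, removed) == P(D_new)
```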
  45. Partial re-computation using input difference. Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)). Works for SVI, but hard to generalise: it depends on the type of process. Bigger gain: diff(CV1, CV2) is much smaller than CV2. GeneMap versions (from → to; to-version record count; difference record count; reduction): 16-03-08 → 16-06-07: 15,910; 1,458; 91%. 16-03-08 → 16-04-28: 15,871; 1,386; 91%. 16-04-28 → 16-06-01: 15,897; 78; 99.5%. 16-06-01 → 16-06-02: 15,897; 2; 99.99%. 16-06-02 → 16-06-07: 15,910; 33; 99.8%. ClinVar versions: 15-02 → 16-05: 290,815; 38,216; 87%. 15-02 → 16-02: 285,042; 35,550; 88%. 16-02 → 16-05: 290,815; 3,322; 98.9%.
  46. Approach – a combination of techniques. 1. Partial re-execution. 2. Differential execution. 3. Identifying the scope of change (loss-less).
  47. 3: Precisely identify the scope of a change. Patient / DB-version impact matrix. Strong scope (fine-grained provenance); weak scope: "if CVi was used in the processing of pj then pj is in scope" (coarse-grained provenance, next slide); semantic scope (domain-specific scoping rules).
  48. A weak scoping algorithm (coarse-grained provenance). Candidate invocation: any invocation I of P whose provenance contains statements of the form used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF). Sketch of the algorithm: for each candidate invocation I of P: partially re-execute using the difference sets as inputs (see previous slides); find the minimal subgraph P' of P that needs re-computation (see above); repeat: execute P' one step at a time until <empty output> or <P' completed>; if <P' completed> and not <empty output> then execute P' on the full inputs. [Provenance pattern diagram as on slide 35.]
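A hypothetical rendering of that loop in Python; candidate invocations, minimal_subgraph(), run_step() and run_full() are placeholders for the workflow machinery, which the slides do not spell out.

```python
# Sketch of the weak scoping loop. All helper functions are assumptions made for
# illustration; only the control flow follows the algorithm sketch on the slide.

def weak_scoping(candidates, diff_sets, minimal_subgraph, run_step, run_full):
    refreshed = []
    for inv in candidates:                      # invocations whose provenance used "db"
        sub_wf = minimal_subgraph(inv)          # part of P affected by the change
        outputs = diff_sets                     # start from the difference sets only
        completed = True
        for step in sub_wf:
            outputs = run_step(step, outputs)   # execute one step at a time
            if not outputs:                     # <empty output>: the change cannot propagate
                completed = False
                break
        if completed and outputs:               # change reaches the result: re-run fully
            refreshed.append(run_full(inv))
    return refreshed
```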
  49. Scoping: precision. The approach avoids the majority of re-computations given a ClinVar change: a reduction in the number of complete re-executions from 495 down to 71.
  50. Summary of ReComp challenges. [Diagram as on slide 28: change events, data diff(.,.) functions, process P, execution observation, History DB.] Reproducibility: virtualisation. Sensitivity analysis is unlikely to work well: small input perturbations → potentially large impact on diagnosis. Learning useful estimators is hard. Diff functions are both type- and application-specific (specific → generic). Not all runtime environments support provenance recording.
  51. Come to our workshop during Provenance Week! https://sites.google.com/view/incremental-recomp-workshop July 12th (pm) and 13th (am), King's College London. http://provenanceweek2018.org/
  52. Questions? http://recomp.org.uk/ Meta-*
  53. The Metadata Analytics challenge: learning from a metadata DB of execution history to support automated ReComp decisions.
  54. History Database. HDB: a metadata database containing records of past executions (execution records). Example: consider only one type of change, the variant caller. [Diagram: a sequence of caller changes C1, C2, C3 across GATK (HaplotypeCaller), FreeBayes 0.9, FreeBayes 1.0 and FreeBayes 1.1; inputs X1–X5 with recorded outputs Y11, Y21, Y31, Y41, Y51, Y12, Y52, Y43, Y53 in HDB.]
  55. Impact (again). Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output. A change in one of the dependencies, C = {d → d'}, affects all outputs y_t where version d of D was used.
  56. ReComp decisions. Given a population X of prior inputs and a change C, ReComp makes a yes/no decision for each prior input: it returns True if P is to be executed again on that input, and False otherwise. To decide, ReComp must estimate the impact of the change on each prior outcome (as well as estimate the re-computation cost).
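A sketch of what such a decision loop could look like; estimate_impact(), estimate_cost(), the impact threshold and the overall cost budget are assumptions added for illustration (the slides mention impact and cost estimation, not a budget).

```python
# Hypothetical per-input ReComp decision loop; not the actual ReComp algorithm.

def recomp_decisions(prior_outputs, change, estimate_impact, estimate_cost,
                     impact_threshold, budget):
    """Return a True/False decision per prior input, keeping total cost within budget."""
    decisions, spent = {}, 0.0
    for x, y in prior_outputs.items():          # y is the previously computed output for x
        worthwhile = estimate_impact(change, y) > impact_threshold
        cost = estimate_cost(x)
        if worthwhile and spent + cost <= budget:
            decisions[x] = True
            spent += cost
        else:
            decisions[x] = False
    return decisions
```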
  57. Two possible approaches. 1. A direct estimator of the impact function: here the problem is to learn such a function for specific P, C, and data types Y. 2. Learning an emulator (surrogate) for P which is simpler to compute and provides a useful approximation, y = f̂(x) + ε, where ε is a stochastic term that accounts for the error in approximating f. Learning requires a training set {(x_i, y_i)}. If such a surrogate can be found, then we can hope to use it to approximate the impact of a change on the output without re-running P.
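As an illustration of the second approach, a minimal surrogate-fitting sketch; the choice of model (a random forest) and the way the surrogate is used to score a change are assumptions for the example, not something the slides prescribe.

```python
# Illustrative emulator: fit a cheap regression surrogate f_hat on past (x, y) pairs
# and use it to screen changed inputs instead of re-running the expensive process P.
from sklearn.ensemble import RandomForestRegressor

def fit_emulator(X_train, y_train):
    f_hat = RandomForestRegressor(n_estimators=100, random_state=0)
    f_hat.fit(X_train, y_train)                 # training set {(x_i, y_i)} from the History DB
    return f_hat

def predicted_impact(f_hat, x_old, x_new):
    """Approximate |f(x') - f(x)| with the surrogate rather than with P itself."""
    return float(abs(f_hat.predict([x_new])[0] - f_hat.predict([x_old])[0]))
```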
  58. History DB and Differences DB. Whenever P is re-computed on input X, a new execution record er' is added to HDB for X. Using diff() functions we produce a derived difference record dr, collected in a Differences database (DDB), e.g. dr1 = imp(C1, Y11), dr2 = imp(C12, Y41), dr3 = imp(C1, Y51), dr4 = imp(C2, Y52). [Diagram as on slide 54: caller changes C1–C3, inputs X1–X5, outputs Y11…Y53 in HDB.]
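A possible shape for the two kinds of record, purely as a sketch; the field names are assumptions made for illustration, since the slides only name the records (er, dr) and the imp() values they carry.

```python
# Hypothetical record layouts for HDB and DDB entries; field names are assumptions.
from dataclasses import dataclass

@dataclass
class ExecutionRecord:          # one HDB entry per (re-)execution of P on an input
    input_id: str               # e.g. "X4"
    output_id: str              # e.g. "Y43"
    dependencies: dict          # versions used, e.g. {"variant_caller": "FreeBayes 1.1"}
    cost_seconds: float

@dataclass
class DifferenceRecord:         # one DDB entry per observed change/outcome pair
    change_id: str              # e.g. "C2"
    old_output_id: str
    new_output_id: str
    impact: float               # imp(C, Y) computed with the type-specific diff()
```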
  59. ReComp algorithm. [Diagram: given a change C, the population X and the current evidence E (HDB, DDB), ReComp produces re-computation decisions and the updated evidence E' (HDB', DDB').]
  60. Learning challenges. Evidence is small and sparse: how can it be used for selecting from X? Learning a reliable imp() function is not feasible. What's the use of history? You never see the same change twice, so we must somehow use evidence from related changes. A possible approach: ReComp makes probabilistic decisions and takes chances; associate a reward with each ReComp decision → reinforcement learning; Bayesian inference (use new evidence to update probabilities). [Diagram as on slide 58.]
