
Preserving the currency of genomics outcomes over time through selective re-computation: models and initial findings


  1. 1. ReComp–BITSmeeting Italy,June,2017–P.Missier 1 Preserving the currency of genomics outcomes over time through selective re-computation: models and initial findings recomp.org.uk Paolo Missier, Jacek Cala, Jannetta Steyn School of Computing Newcastle University, UK 14th Annual Meeting of the Bioinformatics Italian Society Cagliari, Italy July, 2017 (*) Painting by Johannes Moreelse (*) Panta Rhei (Heraclitus)
  2. 2. ReComp–BITSmeeting Italy,June,2017–P.Missier 2 Data Science Meta-knowledge Big Data The Big Analytics Machine Algorithms Tools Middleware Reference datasets “Valuable Knowledge”
  3. 3. ReComp–BITSmeeting Italy,June,2017–P.Missier 3 Data Science over time Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t Life Science Analytics
  4. 4. ReComp–BITSmeeting Italy,June,2017–P.Missier 4 Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • Cloud e-Genome: WES data processing on the cloud using workflow technology • ReComp: selective re-analysis of cases in reaction to change • Techniques and initial findings • Open challenges
  5. 5. ReComp–BITSmeeting Italy,June,2017–P.Missier 5 Data Analytics enabled by NGS Genomics: WES / WGS, Variant calling, Variant interpretation  diagnosis - Eg 100K Genome Project, Genomics England, GeCIP Submission of sequence data for archiving and analysis Data analysis using selected EBI and external software tools Data presentation and visualisation through web interface Visualisation raw sequences align clean recalibrate alignments calculate coverage call variants recalibrate variants filter variants annotate coverage information annotated variants raw sequences align clean recalibrate alignments calculate coverage coverage informationraw sequences align clean calculate coverage coverage information recalibrate alignments annotate annotated variants annotate annotated variants Stage 1 Stage 2 Stage 3 filter variants filter variants Metagenomics: Species identification - Eg The EBI metagenomics portal
  6. 6. ReComp–BITSmeeting Italy,June,2017–P.Missier 6 SVI: Simple Variant Interpretation Genomics: WES / WGS, Variant calling, Variant interpretation  diagnosis - Eg 100K Genome Project, Genomics England, GeCIP raw sequences align clean recalibrate alignments calculate coverage call variants recalibrate variants filter variants annotate coverage information annotated variants raw sequences align clean recalibrate alignments calculate coverage coverage informationraw sequences align clean calculate coverage coverage information recalibrate alignments annotate annotated variants annotate annotated variants Stage 1 Stage 2 Stage 3 filter variants filter variants Filters then classifies variants into three categories: pathogenic, benign and unknown/uncertain SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer
  7. 7. ReComp–BITSmeeting Italy,June,2017–P.Missier 7 Changes that affect variant interpretation What changes: - Improved sequencing / variant calling - ClinVar, OMIM evolve rapidly - New reference data sources Evolution in number of variants that affect patients (a) with a specific phenotype (b) Across all phenotypes
  8. 8. ReComp–BITSmeeting Italy,June,2017–P.Missier 8 Baseline: blind re-computation Sparsity issue: • About 500 executions • 33 patients • total runtime about 60 hours • Only 14 relevant output changes detected: 4.2 hours of computation per change ≈7 minutes / patient (single-core VM)
  9. 9. ReComp–BITSmeeting Italy,June,2017–P.Missier 9 1. Whole-exome variant calling GATK quality score recalibration Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations BWA, Bowtie, Novoalign Picard MarkDuplicates GATK-Haplotype Caller FreeBayes SamTools Variant recalibration Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43 dbSNP builds: 150 (2/17), 149 (11/16), 148 (6/16), 147 (4/16)
  10. 10. ReComp–BITSmeeting Italy,June,2017–P.Missier 10 FreeBayes vs SamTools vs GATK-HC GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110 FreeBayes: Garrison, Erik, and Gabor Marth. "Haplotype-based variant detection from short-read sequencing." arXiv preprint arXiv:1207.3907 (2012). GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835 Adam Cornish and Chittibabu Guda, “A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference,” BioMed Research International, vol. 2015, Article ID 456479, 11 pages, 2015. doi:10.1155/2015/456479 Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports, 5(December), 17875. https://doi.org/10.1038/srep17875
  11. 11. ReComp–BITSmeeting Italy,June,2017–P.Missier 11 Our study: comparing three versions of Freebayes • Tested three versions of the caller: • 0.9.10 → Dec 2013 • 1.0.2 → Dec 2015 • 1.1 → Nov 2016 • The Venn diagram shows quantitative comparison (% and number) of filtered variants; • Phred quality score >30 • 16 patient BAM files (7 AD, 9 FTD-ALS)
  12. 12. ReComp–BITSmeeting Italy,June,2017–P.Missier 12 Impact on SVI classification Patient phenotypes: 7 Alzheimer’s, 9 FTD-ALS The ONLY change in the pipeline is the version of Freebayes used to call variants The table shows the final SVI classification: (R)ed – confirmed pathogenicity, (A)mber – uncertain pathogenicity
Patient ID:       B_0190 B_0191 B_0192 B_0193 B_0195 B_0196 B_0198 B_0199 B_0201 B_0202 B_0203 B_0208 B_0209 B_0211 B_0213 B_0214
Freebayes 0.9.10: A A R A R R R R R A R R R R A R
Freebayes 1.0.2:  A A R A R R A A R A R A R A A R
Freebayes 1.1:    A A R A R R A A R A R A R A A R
Phenotype: ALS-FTD ALS-FTD ALS-FTD ALS-FTD ALS-FTD ALS-FTD AD ALS-FTD AD AD AD AD AD ALS-FTD ALS-FTD AD
In four cases a change in the caller version changes the classification
  13. 13. ReComp–BITSmeeting Italy,June,2017–P.Missier 14 Changes: frequency vs impact Change Frequency Changeimpactonacohort GATK Variant annotations (Annovar) Reference Human genome Variant DB (eg ClinVar) Phenotype  disease mapping (eg OMIM GeneMap) New sequences LowHigh Low High Variant Caller Variant calling N+1 problem Variant interpretation
  14. 14. ReComp–BITSmeeting Italy,June,2017–P.Missier 15 Changes: frequency vs impact Change Frequency Changeimpactonacohort GATK Variant annotations (Annovar) Reference Human genome Variant DB (eg ClinVar) Phenotype  disease mapping (eg OMIM GeneMap) New sequences LowHigh Low High Variant Caller Variant calling N+1 problem Variant interpretation ReComp space
  15. 15. ReComp–BITSmeeting Italy,June,2017–P.Missier 16 Understanding change Big Data Life Sciences Analytics “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t • Threats: Will any of the changes invalidate prior findings? (diagnoses) • Opportunities: Can the findings (diagnoses) be improved over time? • Impact: Which patients/samples are going to be affected? To what extent? Many of the elements involved in producing analytical knowledge change over time: • Algorithms and tools • Accuracy of input sequences • Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard,…) ReComp space = expensive analysis + frequent changes + high impact
  16. 16. ReComp–BITSmeeting Italy,June,2017–P.Missier 17 Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • Cloud e-Genome: WES data processing on the cloud using workflow technology • ReComp: selective re-analysis of cases in reaction to change • Techniques and initial findings • Open challenges
  17. 17. ReComp–BITSmeeting Italy,June,2017–P.Missier 20 WES pipeline: scale Data stats per sample: 4 files per sample (2-lane, pair-end, reads) ≈15 GB of compressed text data (gz) ≈40 GB uncompressed text data (FASTQ) Usually 30-40 input samples 0.45-0.6 TB of compressed data 1.2-1.6 TB uncompressed Most steps use 8-10 GB of reference data Small 6-sample run takes about 30h on the IGM HPC machine (Stage1+2)
  18. 18. ReComp–BITSmeeting Italy,June,2017–P.Missier 22 Workflow Design
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
“Wrapper” blocks / Utility blocks
  19. 19. ReComp–BITSmeeting Italy,June,2017–P.Missier 23 Workflow design raw sequences align clean recalibrate alignments calculate coverage call variants recalibrate variants filter variants annotate coverage information annotated variants raw sequences align clean recalibrate alignments calculate coverage coverage informationraw sequences align clean calculate coverage coverage information recalibrate alignments annotate annotated variants annotate annotated variants Stage 1 Stage 2 Stage 3 filter variants filter variants Conceptual: Actual: 11 workflows 101 blocks 28 tool blocks
  20. 20. ReComp–BITSmeeting Italy,June,2017–P.Missier 25 Parallelism in the pipeline Chr1 Chr2 ChrM Chr1 Chr2 ChrM Chr1 Chr2 ChrM align, clean, recalibrate call variants annotate align, clean, recalibrate align, clean, recalibrate Stage 1 Stage 2 Stage 3 annotate annotate call variants call variants Chr1 Chr1 Chr1 Chr2 Chr2 Chr2 ChrM ChrM ChrM chromosomesplit samplesplit chromosomesplit samplesplit Sample 1 Sample 2 Sample N Annotated variants Annotated variants Annotated variants align-clean- recalibrate-coverage … align-clean- recalibrate-coverage Sample 1 Sample n Variant calling recalibration Variant calling recalibration Variant filtering annotation Variant filtering annotation …… Chromosome split Per-sample Parallel processing Per-chromosome Parallel processing Stage I Stage II Stage III
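The two-level fan-out on the slide above – samples processed in parallel, each sample then split by chromosome for variant calling, and the results merged for annotation – can be illustrated in miniature. This is hypothetical Python with stand-in functions for the three stages; the real pipeline runs as e-Science Central workflows:

```python
# Sketch of per-sample / per-chromosome parallelism (illustrative only).
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = ["chr1", "chr2", "chrM"]

def align_clean_recalibrate(sample):
    return f"{sample}.bam"            # stand-in for Stage 1

def call_variants(bam, chrom):
    return f"{bam}:{chrom}.vcf"       # stand-in for Stage 2, one chromosome

def annotate(vcfs):
    return sorted(vcfs)               # stand-in for Stage 3 merge + annotate

def process_sample(sample):
    bam = align_clean_recalibrate(sample)
    # chromosome split: per-chromosome parallel variant calling
    with ThreadPoolExecutor() as pool:
        vcfs = list(pool.map(lambda c: call_variants(bam, c), CHROMOSOMES))
    return annotate(vcfs)

# sample split: per-sample parallel processing
with ThreadPoolExecutor() as pool:
    results = dict(zip(["S1", "S2"], pool.map(process_sample, ["S1", "S2"])))
```

The same shape applies whether the workers are threads, cloud workflow engines, or HPC nodes; only the stage bodies change.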
  21. 21. ReComp–BITSmeeting Italy,June,2017–P.Missier 27 Workflow on Azure Cloud – modular configuration <<Azure VM>> Azure Blob store e-SC db backend <<Azure VM>> e-Science Central main server JMS queue REST APIWeb UI web browser rich client app workflow invocations e-SC control data workflow data <<worker role>> Workflow engine <<worker role>> Workflow engine e-SC blob store <<worker role>> Workflow engine Workflow engines Module configuration: 3 nodes, 24 cores
  22. 22. ReComp–BITSmeeting Italy,June,2017–P.Missier 28 Performance Configurations for the 3-VM experiments: HPC cluster (dedicated nodes): 3x8-core compute nodes, Intel Xeon E5640 2.67GHz CPU, 48 GiB RAM, 160 GB scratch space. Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04. [Chart: response time [hh:mm] vs number of samples, for 3 eng (24 cores), 6 eng (48 cores), 12 eng (96 cores)]
  23. 23. ReComp–BITSmeeting Italy,June,2017–P.Missier 29 Comparison with HPC [Chart: response time [hours] vs number of input samples, for HPC (3 compute nodes), Azure (3xD13 – SSD) – sync, and Azure (3xD13 – SSD) – chained] Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016
  24. 24. ReComp–BITSmeeting Italy,June,2017–P.Missier 32 Cost A 6-engine configuration achieves near-optimal cost per sample [Chart: cost per sample [£] and cost per GiB [£] vs number of samples and size of the input data [GiB], for 3 eng (24 cores), 6 eng (48 cores), 12 eng (96 cores)]
  25. 25. ReComp–BITSmeeting Italy,June,2017–P.Missier 33 What about flexibility? GATK-HC → FreeBayes [Diagrams: the conceptual three-stage pipeline (Stage 1: align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants; Stage 3: filter variants, annotate) and the corresponding e-Science Central workflows, with the GATK haplotype-caller blocks swapped for FreeBayes caller blocks]
  26. 26. ReComp–BITSmeeting Italy,June,2017–P.Missier 35 From Cloud-eGenome to ReComp • Variant calling / interpretation pipelines are still computationally expensive • Workflow technology suitable and scalable, but flexibility largely an illusion • With the additional advantage of automated recording of data provenance
  27. 27. ReComp–BITSmeeting Italy,June,2017–P.Missier 37 Talk Outline • The importance of quantifying changes to meta-knowledge, and their impact • Cloud e-Genome: WES data processing on the cloud using workflow technology • ReComp: selective re-analysis of cases in reaction to change • Techniques and initial findings • Open challenges
  28. 28. ReComp–BITSmeeting Italy,June,2017–P.Missier 38 The ReComp meta-process Estimate impact of changes Enact on demand Record execution history Detect and measure changes History DB Data diff(.,.) functions Change Events Process P Observe Exec 1. Capture the history of past computations: - Process Structure and dependencies - Cost - Provenance of the outcomes 2. Metadata analytics: Learn from history - Estimation models for impact, cost, benefits Approach: 1. Metadata 2. Data-specific “diff” functions
  29. 29. ReComp–BITSmeeting Italy,June,2017–P.Missier 39 History DB: Workflow Provenance Each invocation of an eSC workflow generates a detailed provenance trace http://vcvcomputing.com/provone/provone.html [Diagram: the ProvONE provenance model – User, Execution, Association, Usage, Generation, Entity, Collection, Controller, Program, Workflow, Channel and Port, connected by relations such as wasPartOf, hadMember, wasDerivedFrom, hasSubProgram, hadPlan, controlledBy, wasAssociatedWith, wasInformedBy, wasGeneratedBy and used]
  30. 30. ReComp–BITSmeeting Italy,June,2017–P.Missier 40 Diff functions: example ClinVar 1/2016 ClinVar 1/2017 diff (unchanged)
  31. 31. ReComp–BITSmeeting Italy,June,2017–P.Missier 41 Compute difference sets – ClinVar The ClinVar dataset: 30 columns Changes: Records: 349,074  543,841 Added 200,746 Removed 5,979. Updated 27,662
  32. 32. ReComp–BITSmeeting Italy,June,2017–P.Missier 42 A generic tool to compute difference sets for tabular data Key columns: {"#AlleleID", "Assembly", "Chromosome”} “where” columns:{"ClinicalSignificance”}
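The key-column / where-column behaviour of such a diff tool can be sketched in a few lines. This is illustrative Python, not the actual ReComp tool, and for brevity the example uses a two-column key, whereas the slide's key also includes Assembly:

```python
# Generic difference function for tabular data (hypothetical sketch).
# Rows are dicts; `key_cols` identify a record across versions,
# `where_cols` decide whether a surviving record counts as "updated".

def diff_tables(old_rows, new_rows, key_cols, where_cols):
    def key(row):
        return tuple(row[c] for c in key_cols)

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}

    added = [new[k] for k in new.keys() - old.keys()]
    removed = [old[k] for k in old.keys() - new.keys()]
    updated = [
        new[k]
        for k in old.keys() & new.keys()
        if any(old[k][c] != new[k][c] for c in where_cols)
    ]
    return {"added": added, "removed": removed, "updated": updated}


old = [
    {"#AlleleID": 15091, "Chromosome": "4", "ClinicalSignificance": "Benign"},
    {"#AlleleID": 15092, "Chromosome": "7", "ClinicalSignificance": "Pathogenic"},
]
new = [
    {"#AlleleID": 15091, "Chromosome": "4", "ClinicalSignificance": "Pathogenic"},
    {"#AlleleID": 15093, "Chromosome": "X", "ClinicalSignificance": "Benign"},
]
d = diff_tables(old, new, ["#AlleleID", "Chromosome"], ["ClinicalSignificance"])
```

Note that columns outside `where_cols` (e.g. LastEvaluated) never trigger an "updated" record, which is exactly the sensitivity/noise trade-off the slide's column choices control.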
  33. 33. ReComp–BITSmeeting Italy,June,2017–P.Missier 45 Reducing the computation performed in reaction to changes Constraint: selective re-computation should be lossless • All instances that are subject to impact will be considered 1. Partial re-execution • Identify and re-enact the portions of a process that are affected by the change 2. Differential execution • Input to the new execution consists of the differences between two versions of a changed dataset • Only feasible if some algebraic properties of the process hold 3. Identifying the scope of change • Determine which instances out of a population of outcomes are going to be affected by the change
  34. 34. ReComp–BITSmeeting Italy,June,2017–P.Missier 46 SVI as eScience Central workflow Phenotype to genes Variant selection Variant classification Patient variants GeneMap ClinVar Classified variants Phenotype
  35. 35. ReComp–BITSmeeting Italy,June,2017–P.Missier 47 1. Partial re-execution 1. Change detection: A provenance fact indicates that a new version Dnew of database d is available wasDerivedFrom(“db”,Dnew) :- execution(WFexec), wasPartOf(B1exec,WFexec), used(B1exec, “db”) 2.1 Find the entry point(s) into the workflow, where db was used :- execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec,Data) 2.2 Discover the rest of the sub-workflow graph (execute recursively) 2. Reacting to the change: Provenance pattern: “plan” “plan execution” Ex. db = “ClinVar v.x” WF B1 B2 B1exec B2exec Data WFexec partOf partOf usagegeneration association association association db usage
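The two provenance queries above amount to a reachability computation over used / wasGeneratedBy facts: find the block executions that read the changed database, then follow data dependencies downstream. A minimal sketch (illustrative names, not the ReComp implementation):

```python
# Downstream sub-workflow discovery over a provenance trace (sketch).
# used: list of (block_exec, entity) facts; generated_by: list of
# (entity, block_exec) facts.

def downstream_blocks(changed_entity, used, generated_by):
    # Step 2.1: entry points -- block executions that used the changed entity
    frontier = {b for (b, e) in used if e == changed_entity}
    affected = set(frontier)
    # Step 2.2: follow generation -> usage edges until fixpoint
    while frontier:
        produced = {e for (e, b) in generated_by if b in frontier}
        frontier = {b for (b, e) in used if e in produced} - affected
        affected |= frontier
    return affected


# Toy trace: B1 reads "db" and feeds B2; B3 is independent of "db".
used = [("B1", "db"), ("B2", "d1"), ("B3", "x")]
generated_by = [("d1", "B1"), ("d2", "B2"), ("y", "B3")]
scope = downstream_blocks("db", used, generated_by)
```

In ReComp the same traversal is expressed declaratively against the provenance store; the fixpoint loop here corresponds to the "execute recursively" step on the slide.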
  36. 36. ReComp–BITSmeeting Italy,June,2017–P.Missier 49 Minimal sub-graphs in SVI Change in ClinVar Change in GeneMap Partial execution following a change in only one of the databases requires caching the intermediate data at the boundary of the blue and red areas
  37. 37. ReComp–BITSmeeting Italy,June,2017–P.Missier 49 Generating the sub-workflow
  38. 38. ReComp–BITSmeeting Italy,June,2017–P.Missier 50 Workflows are stored in the History DB Neo4J Graph DB Serialise
  39. 39. ReComp–BITSmeeting Italy,June,2017–P.Missier 51 Workflow copy’n’paste Neo4J Graph DB ReComp Partial rerun {Starting blocks} Sub-workflow extractor
  40. 40. ReComp–BITSmeeting Italy,June,2017–P.Missier 52 Workflow copy’n’paste Neo4J Graph DB ReComp Partial rerun {Starting blocks} exec Sub-workflow extractor De-serialise
  41. 41. ReComp–BITSmeeting Italy,June,2017–P.Missier 53 Results • Overhead: storing interim data required in partial re-execution • 156 MB for GeneMap changes and 37 kB for ClinVar changes • Time savings: GeneMap – partial re-execution 325 s vs complete re-execution 455 s (28.5% saving); ClinVar – partial re-execution 287 s vs complete re-execution 455 s (37% saving)
  42. 42. ReComp–BITSmeeting Italy,June,2017–P.Missier 54 2. Differential execution ClinVar 1/2016 ClinVar 1/2017 diff (unchanged)
  43. 43. ReComp–BITSmeeting Italy,June,2017–P.Missier 56 P2: Partial re-computation using input difference Idea: run SVI but replace ClinVar query with a query on ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)) Works for SVI, but hard to generalise: depends on the type of process Bigger gain: diff(CV1, CV2) much smaller than CV2
GeneMap versions (from → to | to-version record count | difference record count | reduction):
16-03-08 → 16-06-07 | 15910 | 1458 | 91%
16-03-08 → 16-04-28 | 15871 | 1386 | 91%
16-04-28 → 16-06-01 | 15897 | 78 | 99.5%
16-06-01 → 16-06-02 | 15897 | 2 | 99.99%
16-06-02 → 16-06-07 | 15910 | 33 | 99.8%
ClinVar versions (from → to | to-version record count | difference record count | reduction):
15-02 → 16-05 | 290815 | 38216 | 87%
15-02 → 16-02 | 285042 | 35550 | 88%
16-02 → 16-05 | 290815 | 3322 | 98.9%
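A toy illustration of the Q(CV) → Q(diff(CV1, CV2)) idea (hypothetical code): if the classification step distributes over its reference data, applying it to the ClinVar delta reaches every patient variant whose class can change, without touching the unchanged bulk of the database. Handling removed records would require the removed set as well; this sketch covers additions and updates only.

```python
# Differential execution sketch: re-run the lookup on the delta only.

def classify(patient_variants, clinvar):
    # clinvar: {variant_id: clinical_significance}; the "query" Q
    return {v: clinvar[v] for v in patient_variants if v in clinvar}

cv_old = {"v1": "Benign", "v2": "Pathogenic"}
cv_new = {"v1": "Pathogenic", "v2": "Pathogenic", "v3": "Benign"}

# delta: records added, or whose significance changed, between versions
delta = {v: s for v, s in cv_new.items() if cv_old.get(v) != s}

patient = ["v1", "v2", "v3"]
full = classify(patient, cv_new)                             # Q(CV2)
incremental = {**classify(patient, cv_old), **classify(patient, delta)}
```

Because the delta is tiny compared to the full database (98.9% smaller in the ClinVar 16-02 → 16-05 case above), the incremental run touches far less data while producing the same output.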
  44. 44. ReComp–BITSmeeting Italy,June,2017–P.Missier 57 3. Identifying the scope of change: a game of battleship Patient / change impact matrix Challenge: precisely identify the scope of a change Blind reaction to change: recompute the entire matrix Can we do better? - Hit the high impact cases (the X) without re- computing the entire matrix
  45. 45. ReComp–BITSmeeting Italy,June,2017–P.Missier 58 A scoping algorithm Candidate invocation: Any invocation I of P whose provenance contains statements of the form: used(A,”db”),wasPartOf(A,I),wasAssociatedWith(I,_,WF) - For each candidate invocation I of WF: - partially re-execute using the difference sets as inputs # see (2) - find the minimal subgraph P’ of P that needs re-computation # see (1) - repeat: execute P’ one step at-a-time until <empty output> or <P’ completed> - If <P’ completed> and not <empty output> then - Execute P’ on the full inputs Sketch of the algorithm: WF B1 B2 B1exec B2exec Data WFexec partOf partOf usagegeneration association association association db usage
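The core loop of the scoping sketch above can be written out under the simplifying assumption (noted in the talk) that every step of the minimal sub-workflow P' distributes over the difference sets; names and data here are hypothetical:

```python
# Scoping loop (simplified): run P' one step at a time on the difference
# set; an empty intermediate output proves the invocation is out of scope,
# otherwise it must be re-executed on the full inputs.

def in_scope(steps, delta):
    data = delta
    for step in steps:
        data = step(data)
        if not data:
            return False   # the change cannot reach this invocation's output
    return True            # non-empty all the way: full re-execution needed

# Toy P': select this patient's variants, then keep pathogenic ones.
patient_variants = {"v1", "v7"}
steps = [
    lambda d: [r for r in d if r["id"] in patient_variants],
    lambda d: [r for r in d if r["sig"] == "Pathogenic"],
]

delta_hit = [{"id": "v1", "sig": "Pathogenic"}]
delta_miss = [{"id": "v9", "sig": "Pathogenic"}]
```

The hope expressed in the talk is that, with a well-tailored diff function, most invocations fall out at the first empty intermediate result, avoiding the full re-execution entirely.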
  46. 46. ReComp–BITSmeeting Italy,June,2017–P.Missier 59 Scoping: precision • The approach avoids the majority of re-computations given a ClinVar change • Reduction in number of complete re-executions from 495 down to 71
  47. 47. ReComp–BITSmeeting Italy,June,2017–P.Missier 61 Conclusions: ReComp open problems Change Events History DB Reproducibility: - virtualisation Sensitivity analysis unlikely to work well Small input perturbations → potentially large impact on diagnosis Learning useful estimators is hard Diff functions are both type- and application-specific Not all runtime environments support provenance recording specific → generic Data diff(.,.) functions Process P Observe Exec
  48. 48. ReComp–BITSmeeting Italy,June,2017–P.Missier 62 Questions? http://recomp.org.uk/

Editor's Notes

  • Genomics is a form of data-intensive / computation-intensive analysis

  • Changes in the reference databases have an impact on the classification
  • returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):


    $\diffOM(\OM^t, \OM^{t'}) = \{ \langle \dt, genes(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$
    where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.

    \begin{align*}
    \diffCV(\CV^t, \CV^{t'}) = {} & \{ \langle v, \varst(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
    & {} \cup (\CV^{t'} \setminus \CV^t) \cup (\CV^t \setminus \CV^{t'})
    \label{eq:diff-cv}
    \end{align*}
    where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.


  • Point of slide: sparsity of impact demands better than blind recomp.

    Table 1 summarises the results. We recorded four types of outcome. Firstly, confirming the current diagnosis, which happens when additional variants are added to the Red class. Secondly, retracting the diagnosis, which may happen (rarely) when all red variants are retracted. Thirdly, changes in the amber class which do not alter the diagnosis, and finally, no change at all.

    The table reports results from nearly 500 executions, concerning a cohort of 33 patients, for a total runtime of about 58.7 hours. As merely 14 relevant output changes were detected, this is about 4.2 hours of computation per change: a steep cost, considering that the actual execution time of SVI takes a little over 7 minutes.
  • our recommendation is the use of BWA-MEM and Samtools pipeline for SNP calls and BWA-MEM and GATK-HC pipeline for indel calls. 
  • Changes can be frequent or rare, disruptive or marginal
  • Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store).

    These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).

  • A Modular architecture
  • Each sample included 2-lane, pair-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  • 3 workflow engines perform better than our HPC benchmark on larger sample sizes
  • Largely, all variant calling pipelines consist of a number of common steps such as sequence alignment, calling variants and variant annotation
    In detail, however, they may differ in how particular steps are connected together. For example:
    alignment –> variant calling
    GATK pipeline best practices suggest that between these steps the pipeline should run base quality score recalibration (our “GATK phase 1” workflow or the “recalibrate alignment” step).
    Freebayes pipeline does not include this step.
    variant calling –> annotation
    GATK pipeline includes additional variant recalibration step for SNPs and INDELs.
    Freebayes pipeline is simpler and does not include variant recalibration.
  • y^t = \mathit{exec}(P,x,D^t)

    y^{t'}_+ = \mathit{exec}(P,x,\delta^+)
  • This is only a small selection of rows and a subset of columns. In total there was 30 columns, 349074 rows in the old set, 543841 rows in the new set, 200746 of the added rows, 5979 of the removed rows, 27662 of the changed rows.

    As on the previous slide, you may want to highlight that the selection of key-columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091 which looks very similar in both added (green) and removed (red) sets. They differ, however, in the Chromosome column.
    Considering the where-columns, using only ClinicalSignificance returns blue rows which differ between versions only in that columns. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
  • Firstly, if we can analyse the structure and semantics of process P, to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in the processing of the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
  • Experimental setup for our study of ReComp techniques:

    SVI workflow with automated provenance recording
    Cohort of about 100 exomes (neurological disorders)
    Changes in ClinVar and OMIM GeneMap
  • y^t = \mathit{exec}(P,x,D^t)

    y^{t'}_+ = \mathit{exec}(P,x,\delta^+)
  • Also, as in Tab. 2 and 3 in the paper, I’d mention whether this reduction was possible with generic diff function or specific function tailored to SVI.
    What is also interesting and what I would highlight is that even if the reduction is very close to 100% but below, the cost of recomputation of the process may still be significant because of some constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data) and that’s why Fig. 6 shows increase in runtime for GeneMap executed with 2 \deltas even if the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for GeneMap diff between 16-10-30 –> 16-10-31).
  • Regarding the algorithm, you show the simplified version (Alg. 1). But please take also look on Alg. 2 and mention that you can only run the loop if the distributiveness holds for all P in the downstream graph. Otherwise, you need to break and re-execute on full inputs just after first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well tailored diff function the output will be empty for majority of cases.
  • This figure emphasizes the penalty for running the algorithm when the difference sets were large compared to the actual new data. But it also highlights the importance of the diff and impact functions. Clearly, the more accurate the functions are, the higher the runtime savings may be, which stems from two facts. Firstly, a more accurate diff function tends to produce smaller difference sets, which reduces the time of task re-execution (cf. CV-diff and CV-SVI-diff lines in Fig. 7). Secondly, a more accurate impact function tends to return false more frequently, and so the algorithm can more often avoid re-computation with the complete new version of the data (cf. the number of black squares vs the total number of patients affected by a change in Tab. 4).