Preserving the currency of analytics outcomes over time through selective re-computation: techniques, initial findings, and open challenges
1. 1
ReComp–UniversityofLeeds
November,2017
Preserving the currency of analytics outcomes over time
through selective re-computation:
techniques, initial findings, and open challenges
recomp.org.uk
Paolo Missier, Jacek Cala, Jannetta Steyn
School of Computing
Newcastle University, UK
University of Leeds
School of Computing Colloquia series
November, 2017
Meta-*
In collaboration with
• Cambridge University (Prof. Chinnery,
Department of Clinical Neurosciences)
• Institute of Genetic Medicine, Newcastle
University
• School of GeoSciences, Newcastle University
4. 4
Talk Outline
• The importance of quantifying changes to meta-knowledge, and
their impact
• ReComp: selective re-computation to refresh outcomes in reaction
to change
• Techniques and initial findings
• Open challenges
5. 5
Data Analytics enabled by Next Gen Sequencing
Genomics: WES / WGS, Variant calling, Variant interpretation diagnosis
- Eg 100K Genome Project, Genomics England, GeCIP
[Figure: submission of sequence data for archiving and analysis → data analysis using selected EBI and external software tools → data presentation and visualisation through web interface]
[Figure: three-stage WES pipeline, one lane per sample. Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information. Stage 2: call variants → recalibrate variants. Stage 3: filter variants → annotate → annotated variants]
Metagenomics: Species identification
- Eg The EBI metagenomics portal
6. 6
SVI: Simple Variant Interpretation
Genomics: WES / WGS, Variant calling, Variant interpretation diagnosis
- Eg 100K Genome Project, Genomics England, GeCIP
[Figure: three-stage WES pipeline, one lane per sample. Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information. Stage 2: call variants → recalibrate variants. Stage 3: filter variants → annotate → annotated variants]
SVI filters variants, then classifies them into three categories: pathogenic,
benign and unknown/uncertain
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya,
E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences,
Los Angeles, CA, 2015. Springer
7. 7
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in number of variants that affect patients
(a) with a specific phenotype
(b) Across all phenotypes
9. 9
Whole-exome variant calling: expensive
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From
FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in
Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
Pipeline stages and tools:
• Alignment: BWA, Bowtie, Novoalign
• Picard: MarkDuplicates
• GATK quality score recalibration
• Variant calling: GATK HaplotypeCaller, FreeBayes, SamTools
• Variant recalibration
• Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
10. 10
Whole-Exome Sequencing pipeline: scale
Data stats per sample:
• 4 files per sample (2-lane, paired-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)
Usually 30-40 input samples:
• 0.45-0.6 TB of compressed data
• 1.2-1.6 TB uncompressed
Most steps use 8-10 GB of reference data
A small 6-sample run takes about 30h on the IGM HPC machine (Stage 1+2)
Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.;
Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue:
Big Data in the Cloud, 2016
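The cohort totals on this slide follow directly from the per-sample figures. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB); the function name is illustrative only:

```python
# Per-sample sizes quoted on the slide, in GB.
COMPRESSED_GB = 15
UNCOMPRESSED_GB = 40


def cohort_size_tb(n_samples, per_sample_gb):
    """Total cohort size in TB, assuming 1 TB = 1000 GB."""
    return n_samples * per_sample_gb / 1000


for n in (30, 40):
    print(n, "samples:",
          cohort_size_tb(n, COMPRESSED_GB), "TB compressed,",
          cohort_size_tb(n, UNCOMPRESSED_GB), "TB uncompressed")
```

For 30 and 40 samples this gives the 0.45-0.6 TB and 1.2-1.6 TB ranges quoted above.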
11. 11
Workflow Design
# Prepare output and temporary directories
echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p "$PICARD_OUTDIR"
mkdir -p "$PICARD_TEMP"
# Clean the sorted BAM file
echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT="$SORTED_BAM_FILE" OUTPUT="$SORTED_BAM_FILE_CLEANED"
# Remove duplicate reads
echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT="$SORTED_BAM_FILE_CLEANED" OUTPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" \
  METRICS_FILE="$PICARD_LOG" REMOVE_DUPLICATES=true ASSUME_SORTED=true
# Add read group information
echo "Adding read group information to bam file..."
$Picard_AddRG INPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" OUTPUT="$SORTED_BAM_FILE_NODUPS" \
  RGID="$READ_GROUP_ID" RGPL=illumina RGSM="$SAMPLE_ID" \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
# Index the final BAM file
echo "Indexing bam files..."
samtools index "$SORTED_BAM_FILE_NODUPS"
[Figure: from the shell script to a workflow built from “wrapper” blocks and utility blocks]
13. 13
Parallelism in the pipeline
Stage I: per-sample parallel processing (Sample 1 … Sample n): align-clean-recalibrate-coverage
Chromosome split
Stage II: per-chromosome parallel processing: variant calling, recalibration
Stage III: variant filtering, annotation (parallel instances)
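The staged parallelism can be sketched as three successive parallel maps. This is only an illustration: the three functions are hypothetical stand-ins for the real tools, and it assumes that Stage II works per chromosome across all samples and that Stage III is again per sample.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the real tool invocations.
def preprocess(sample):
    """Stage I: align, clean, recalibrate, compute coverage for one sample."""
    return sample + ".bam"


def call_variants(chromosome, bams):
    """Stage II: call and recalibrate variants for one chromosome."""
    return chromosome + ".vcf"


def filter_and_annotate(sample, vcfs):
    """Stage III: filter and annotate variants (assumed per sample)."""
    return sample + ".annotated"


def run_pipeline(samples, chromosomes, workers=4):
    """Run the three stages in order; each stage fans out in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bams = list(pool.map(preprocess, samples))
        vcfs = list(pool.map(lambda c: call_variants(c, bams), chromosomes))
        return list(pool.map(lambda s: filter_and_annotate(s, vcfs), samples))


print(run_pipeline(["s1", "s2"], ["chr1", "chr2"]))
```

Each stage waits for the previous one to finish, which mirrors the synchronisation point at the chromosome split.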
14. 15
Performance
Configurations for 3VMs experiments:
Azure workflow engines:
D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04.
[Figure: response time [hh:mm] (00:00 to 72:00) against number of samples (0 to 24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
15. 17
Whole-exome variant calling: unstable
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From
FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in
Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
Pipeline stages and tools:
• Alignment: BWA, Bowtie, Novoalign
• Picard: MarkDuplicates
• GATK quality score recalibration
• Variant calling: GATK HaplotypeCaller, FreeBayes, SamTools
• Variant recalibration
• Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
dbSNP builds (release date): 150 (2/17), 149 (11/16), 148 (6/16), 147 (4/16)
Any of these stages may change over time – semi-independently
16. 18
Comparing three versions of Freebayes
Should we care about changes in the pipeline?
• Tested three versions of the caller:
• 0.9.10 Dec 2013
• 1.0.2 Dec 2015
• 1.1 Nov 2016
• The Venn diagram shows quantitative comparison (% and number) of filtered
variants;
• Phred quality score >30
• 16 patient BAM files (7 AD, 9 FTD-ALS)
17. 20
Impact on SVI classification
Patient phenotypes: 7 Alzheimer’s, 9 FTD-ALS
The ONLY change in the pipeline is the version of Freebayes used to call variants
(R)ed – confirmed pathogenicity (A)mber – uncertain pathogenicity
Patient ID   Phenotype   Freebayes 0.9.10   Freebayes 1.0.2   Freebayes 1.1
B_0190       ALS-FTD     A                  A                 A
B_0191       ALS-FTD     A                  A                 A
B_0192       ALS-FTD     R                  R                 R
B_0193       ALS-FTD     A                  A                 A
B_0195       ALS-FTD     R                  R                 R
B_0196       ALS-FTD     R                  R                 R
B_0198       AD          R                  A                 A
B_0199       ALS-FTD     R                  A                 A
B_0201       AD          R                  R                 R
B_0202       AD          A                  A                 A
B_0203       AD          R                  R                 R
B_0208       AD          R                  A                 A
B_0209       AD          R                  R                 R
B_0211       ALS-FTD     R                  A                 A
B_0213       ALS-FTD     A                  A                 A
B_0214       AD          R                  R                 R
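The effect of the caller version on the classification can be checked mechanically. A small sketch using the values transcribed from this slide (the helper name is illustrative):

```python
# Patient IDs and per-version classifications from the slide (R = red, A = amber).
PATIENTS = ["B_0190", "B_0191", "B_0192", "B_0193", "B_0195", "B_0196",
            "B_0198", "B_0199", "B_0201", "B_0202", "B_0203", "B_0208",
            "B_0209", "B_0211", "B_0213", "B_0214"]
CALLS = {
    "0.9.10": "A A R A R R R R R A R R R R A R".split(),
    "1.0.2":  "A A R A R R A A R A R A R A A R".split(),
    "1.1":    "A A R A R R A A R A R A R A A R".split(),
}


def changed_patients(old, new):
    """Patients whose classification differs between two caller versions."""
    return [p for p, a, b in zip(PATIENTS, CALLS[old], CALLS[new]) if a != b]


print(changed_patients("0.9.10", "1.0.2"))
print(changed_patients("1.0.2", "1.1"))
```

Between 0.9.10 and 1.0.2 four patients move from red to amber (B_0198, B_0199, B_0208, B_0211); between 1.0.2 and 1.1 nothing changes.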
18. 21
Changes: frequency / impact / cost
[Figure: sources of change plotted by change frequency (low to high) against change impact on a cohort (low to high): GATK; variant annotations (Annovar); reference human genome; variant DB (e.g. ClinVar); phenotype-disease mapping (e.g. OMIM GeneMap); new sequences (N+1 problem); variant caller. Sources are grouped into variant calling and variant interpretation]
19. 22
Changes: frequency / impact / cost
[Figure: the same frequency / impact chart, with the “ReComp space” region marked]
21. 24
When should we repeat an expensive simulation?
[Figure: two runs of the CityCat flood simulator for an extreme rainfall event, compared]
• New buildings may alter data flow
• Can we predict high difference areas?
• Processing the Newcastle area: 5 hours
22. 25
Talk Outline
• The importance of quantifying changes to meta-knowledge, and
their impact
• ReComp: selective re-computation to refresh outcomes in reaction
to change
• Techniques and initial findings
• Open challenges
Project structure
• 3 years funding - Feb. 2016 - Jan. 2019
• In collaboration with
• Cambridge University (Prof. Chinnery, Department of Clinical Neurosciences)
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
23. 26
The ReComp meta-process
[Figure: the ReComp loop around process P: detect and measure changes (change events, data diff(.,.) functions) → estimate impact of changes → select and enact → record execution history (observe executions, History DB)]
Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap…)
Approach:
1. Quantify data-diff and impact of changes on prior outcomes
2. Collect and exploit process history metadata
Process history:
1. Capture the history of past computations:
- Process structure and dependencies
- Cost
- Provenance of the outcomes
2. Metadata analytics: learn from history
- Estimation models for impact, cost, benefits
27. 31
History DB: Workflow Provenance
Each invocation of an eSC workflow generates a provenance trace
http://vcvcomputing.com/provone/provone.html
[Figure: ProvONE data model. Classes: User, Execution, Program, Workflow, Controller, Channel, Port, Entity, Collection. Relations include wasPartOf, hasSubProgram, hasInPort, hasOutPort, hasDefaultParam, connectsTo, controls, controlledBy, hadPlan, hadMember, hadEntity, hadInPort, hadOutPort, used, wasGeneratedBy, wasAssociatedWith, wasInformedBy, wasDerivedFrom, and the qualified forms qualifiedUsage, qualifiedGeneration, qualifiedAssociation]
[Figure: example trace. “Plan”: workflow WF with blocks B1 and B2. “Plan execution”: WFexec with B1exec and B2exec (each partOf WFexec and associated with its plan); Data is generated by B1exec and used by B2exec; db is used as an input]
28. 32
Approach – a combination of techniques
1. Partial re-execution
• Identify and re-enact the portions of a process that are affected by change
2. Differential execution
• Input to the new execution consists of the differences between two versions of a changed dataset
• Only feasible if some algebraic properties of the process hold
3. Identifying the scope of change (lossless)
• Exclude instances of the population that are certainly not affected
31. 35
1. Partial re-execution
1. Change detection: a provenance fact indicates that a new version Dnew of
database d is available:
wasDerivedFrom(“db”,Dnew)
2. Reacting to the change:
2.1 Find the entry point(s) into the workflow, where db was used:
:- execution(WFexec), wasPartOf(Xexec,WFexec), used(Xexec, “db”)
2.2 Discover the rest of the sub-workflow graph (execute recursively):
:- execution(WFexec), execution(B1exec), execution(B2exec),
wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec),
wasGeneratedBy(Data, B1exec), used(B2exec,Data)
Ex. db = “ClinVar v.x”
[Figure: provenance pattern. “Plan”: WF with blocks B1 and B2; “plan execution”: WFexec with B1exec and B2exec; Data is generated by B1exec and used by B2exec; db is used as an input]
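The two queries can be read as a graph traversal over provenance facts. The sketch below is not the project's implementation: it stores facts as plain Python tuples instead of a Datalog/Prolog store, and it assumes (for illustration) that B1exec is the block execution that used db.

```python
# Provenance facts as tuples; names follow the slide's example.
USED = {("B1exec", "db"), ("B2exec", "Data")}            # used(exec, entity); B1exec/db is an assumption
GENERATED_BY = {("Data", "B1exec")}                      # wasGeneratedBy(entity, exec)
PART_OF = {("B1exec", "WFexec"), ("B2exec", "WFexec")}   # wasPartOf(exec, wfexec)


def entry_points(wf_exec, db):
    """Step 2.1: block executions of wf_exec that used db directly."""
    return sorted(x for (x, e) in USED
                  if e == db and (x, wf_exec) in PART_OF)


def affected_blocks(wf_exec, db):
    """Step 2.2: follow generated data downstream from the entry points."""
    frontier = entry_points(wf_exec, db)
    affected = []
    while frontier:
        x = frontier.pop(0)
        if x in affected:
            continue
        affected.append(x)
        outputs = {e for (e, g) in GENERATED_BY if g == x}
        for (y, e) in sorted(USED):
            if e in outputs and (y, wf_exec) in PART_OF:
                frontier.append(y)
    return affected


print(entry_points("WFexec", "db"))
print(affected_blocks("WFexec", "db"))
```

The list returned by `affected_blocks` is the sub-workflow to re-enact; everything outside it can reuse cached intermediate data.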
32. 36
Minimal sub-graphs in SVI
[Figure: minimal SVI sub-graphs affected by a change in ClinVar and by a change in GeneMap]
Overhead: cache intermediate data required for partial re-execution
• 156 MB for GeneMap changes and 37 kB for ClinVar changes
Time savings:
Change     Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
GeneMap    325                          455                           28.5
ClinVar    287                          455                           37
36. 44
P2: Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff:
Q(CV) → Q(diff(CV1, CV2))
Works for SVI, but hard to generalise: depends on the type of process
Bigger gain: diff(CV1, CV2) is much smaller than CV2

GeneMap versions (from –> to)   To-version record count   Difference record count   Reduction
16-03-08 –> 16-06-07            15910                     1458                      91%
16-03-08 –> 16-04-28            15871                     1386                      91%
16-04-28 –> 16-06-01            15897                     78                        99.5%
16-06-01 –> 16-06-02            15897                     2                         99.99%
16-06-02 –> 16-06-07            15910                     33                        99.8%

ClinVar versions (from –> to)   To-version record count   Difference record count   Reduction
15-02 –> 16-05                  290815                    38216                     87%
15-02 –> 16-02                  285042                    35550                     88%
16-02 –> 16-05                  290815                    3322                      98.9%
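A keyed diff and the reduction figure in the tables can be sketched as follows. This is a generic illustration, not the diff function used in the study: it assumes each version is a mapping from a record key to its value, and it takes the reduction to be one minus the ratio of difference records to to-version records.

```python
def diff(old, new):
    """Keyed diff between two versions: records added, removed or changed."""
    return {
        "added":   {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {k: new[k] for k in old.keys() & new.keys()
                    if old[k] != new[k]},
    }


def reduction(diff_count, to_version_count):
    """Reduction in input size (%) when querying the diff, not the full version."""
    return 100 * (1 - diff_count / to_version_count)


# Toy versions (illustrative values only).
cv1 = {"v1": "benign", "v2": "uncertain", "v3": "pathogenic"}
cv2 = {"v1": "benign", "v2": "pathogenic", "v4": "benign"}
print(diff(cv1, cv2))

# Figures from the tables above.
print(round(reduction(3322, 290815), 1))   # ClinVar 16-02 to 16-05
print(round(reduction(1458, 15910)))       # GeneMap 16-03-08 to 16-06-07
```

With the table's counts the function reproduces the 98.9% and 91% reductions.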
38. 46
3: precisely identify the scope of a change
Patient / DB version
impact matrix
Strong scope:
(fine-grained provenance)
Weak scope: “if CVi was used in the processing of pj then pj is in scope”
(coarse-grained provenance – next slide)
Semantic scope:
(domain-specific scoping rules)
39. 47
A weak scoping algorithm
Coarse-grained provenance
Candidate invocation: any invocation I of P whose provenance contains statements of the form:
used(A,”db”),wasPartOf(A,I),wasAssociatedWith(I,_,WF)
Sketch of the algorithm:
- For each candidate invocation I of P:
  - partially re-execute using the difference sets as inputs # see (2)
  - find the minimal subgraph P’ of P that needs re-computation # see (1)
  - repeat:
      execute P’ one step at a time
    until <empty output> or <P’ completed>
  - If <P’ completed> and not <empty output> then
    - Execute P’ on the full inputs
[Figure: provenance pattern for WF, B1, B2 and their executions WFexec, B1exec, B2exec, with Data and db]
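The sketch above can be turned into a few lines of Python. This is a simplified illustration under stated assumptions: the sub-graph P’ is modelled as a list of step functions, the two example steps are hypothetical, and (as the speaker notes point out) probing with the difference set is only valid when every step is distributive.

```python
def run_steps(steps, records, context):
    """Execute steps one at a time; return None as soon as a step yields nothing."""
    for step in steps:
        records = step(records, context)
        if not records:
            return None
    return records


def selective_recompute(invocations, steps, delta, full_db):
    """Probe each candidate invocation with the difference set;
    re-run on the full inputs only where the probe completes non-empty."""
    refreshed = {}
    for inv_id, context in invocations.items():
        if run_steps(steps, delta, context) is not None:
            refreshed[inv_id] = run_steps(steps, full_db, context)
    return refreshed


# Hypothetical two-step P': select the patient's variants, keep pathogenic ones.
def select_patient_variants(records, patient_variants):
    return [r for r in records if r[0] in patient_variants]


def keep_pathogenic(records, _context):
    return [r for r in records if r[1] == "pathogenic"]


invocations = {"p1": {"v1", "v2"}, "p2": {"v3"}}   # patient -> variants
delta = [("v2", "pathogenic")]                     # difference set
full_db = [("v1", "pathogenic"), ("v2", "pathogenic"), ("v3", "benign")]

result = selective_recompute(
    invocations, [select_patient_variants, keep_pathogenic], delta, full_db)
print(result)
```

Here only p1 is re-executed on the full inputs; the probe for p2 produces an empty output at the first step, so p2 is left untouched.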
44. 52
changes, data diff, impact
1) Observed change events:
(inputs, dependencies, or both)
2) Type-specific diff functions:
3) Impact occurs to varying degrees on multiple prior outcomes.
Impact of change C on the processing of a specific X:
Impact is process- and data-specific:
45. 53
Impact: importance and Scope
Scope: which cases are affected?
- Individual variants have an associated phenotype.
- Patient cases also have a phenotype
“a change in variant v can only have impact on a case X if v and X
share the same phenotype”
Importance: “Any variant with status moving from/to Red causes High
impact on any X that is affected by the variant”
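The two rules combine naturally into a single impact function. A minimal sketch, assuming statuses are plain strings and that any in-scope change not involving Red counts as "low" (the slide only defines the "high" case):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantChange:
    """A change in the status of one variant (field names are illustrative)."""
    variant: str
    phenotype: str
    old_status: str
    new_status: str


def in_scope(change, case_phenotype):
    """Scope rule: variant and case must share the same phenotype."""
    return change.phenotype == case_phenotype


def importance(change):
    """Importance rule: a status moving from/to red is high impact."""
    moved = change.old_status != change.new_status
    if moved and "red" in (change.old_status, change.new_status):
        return "high"
    return "low"   # assumption: the slide does not define this case


def impact(change, case_phenotype):
    """Impact of a variant change on one case: none, low or high."""
    return importance(change) if in_scope(change, case_phenotype) else "none"


c = VariantChange("v1", "AD", "amber", "red")
print(impact(c, "AD"), impact(c, "ALS-FTD"))
```

The scope rule prunes cases cheaply before the importance rule is evaluated at all.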
46. 54
History Database
HDB: A metadata-database containing records of past executions:
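Execution records:
Example: consider only one type of change: variant caller
[Figure: HDB as a grid of inputs X1 … X5 against variant caller versions (GATK HaplotypeCaller, FreeBayes 0.9, FreeBayes 1.0, FreeBayes 1.1) with changes C1, C2, C3; recorded outcomes are Y11, Y21, Y31, Y41, Y51, Y12, Y52, Y43, Y53]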
47. 55
ReComp decisions
Given:
- A population X of processed inputs:
- Change
ReComp must learn to make yes/no decisions for each
returns True if P is to be executed again on X, and False otherwise
To decide, ReComp must estimate impact:
(as well as estimate the re-computation cost)
Example:
Objective: maximise reward
48. 57
History DB and Differences DB
Whenever P is re-computed on input X, a new er’ is added to HDB for X:
Using diff() functions we produce a derived difference record dr:
… collected in a Differences database:
DDB:
dr1 = Imp(C1,X1)
dr2 = Imp(C12,X4)
dr3 = Imp(C1,X5)
dr4 = Imp(C2,X5)
[Figure: the HDB grid of inputs X1 … X5 against variant caller versions, as on the History Database slide]
50. 59
Learning challenges
• Evidence is small and sparse
• How can it be used for selecting from X?
• Learning a reliable imp() function is not feasible
• What’s the use of history? You never see the same change twice!
• Must somehow use evidence from related changes
• A possible approach:
• ReComp makes probabilistic decisions, takes chances
• Associate a reward to each ReComp decision → reinforcement learning
• Bayesian inference (use new evidence to update probabilities)
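One simple way to "use new evidence to update probabilities" is a Beta-Bernoulli model. The sketch below is an illustrative stand-in, not the method adopted by ReComp: it keeps a belief that a given kind of change has impact, updates it after each observed re-computation, and recomputes when the expected benefit exceeds the cost. All names and the decision rule are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ImpactBelief:
    """Beta(alpha, beta) belief that a change of some kind has impact."""
    alpha: float = 1.0   # prior pseudo-count of "had impact"
    beta: float = 1.0    # prior pseudo-count of "no impact"

    def update(self, had_impact):
        """Bayesian update after observing one re-computation outcome."""
        if had_impact:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self):
        """Posterior mean probability of impact."""
        return self.alpha / (self.alpha + self.beta)


def should_recompute(belief, benefit, cost):
    """Hypothetical decision rule: recompute if expected benefit exceeds cost."""
    return belief.mean * benefit > cost


belief = ImpactBelief()
for observed in (True, False, False):
    belief.update(observed)
print(belief.mean, should_recompute(belief, benefit=10, cost=3))
```

Because the evidence is small and sparse, the prior pseudo-counts matter a great deal in practice; sharing one belief across related changes is one way to use evidence from changes that are never seen twice.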
[Figure: HDB (inputs X1 … X5 against variant caller versions and changes C1, C2, C3, with outcomes Y11 … Y53) and DDB (dr1 = Imp(C1,X1), dr2 = Imp(C12,X4), dr3 = Imp(C1,X5), dr4 = Imp(C2,X5))]
Genomics is a form of data-intensive / computation-intensive analysis
Changes in the reference databases have an impact on the classification
$\diffOM$ returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{\langle t, genes(\dt) \rangle | genes(\dt) \neq genes'(\dt) \} $\\
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV&(\CV^t, \CV^{t'}) = \\
&\{ \langle v, \varst(v) \rangle | \varst(v) \neq \varst'(v) \} \\
& \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\label{eq:diff-cv}
\end{align*}
where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.
Point of slide: sparsity of impact demands better than blind recomp.
Table 1 summarises the results. We recorded four types of outcomes. Firstly, confirming the current diagnosis, which happens when additional variants are added to the Red class. Secondly, retracting the diagnosis, which may happen (rarely) when all red variants are retracted, denoted ❖. Thirdly, changes in the amber class which do not alter the diagnosis, and finally, no change at all.
The table reports results from nearly 500 executions, concerning a cohort of 33 patients, for a total runtime of about 58.7 hours. As merely 14 relevant output changes were detected, this is about 4.2 hours of computation per change: a steep cost, considering that a single execution of SVI takes a little over 7 minutes.
Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store).
These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).
A Modular architecture
Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
3 workflow engines perform better than our HPC benchmark on larger sample sizes
our recommendation is the use of BWA-MEM and Samtools pipeline for SNP calls and BWA-MEM and GATK-HC pipeline for indel calls.
In four cases, a change in the caller version changes the classification
Changes can be frequent or rare, disruptive or marginal
This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349074 rows in the old set, 543841 rows in the new set, 200746 added rows, 5979 removed rows and 27662 changed rows.
As on the previous slide, you may want to highlight that the selection of key-columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091 which looks very similar in both added (green) and removed (red) sets. They differ, however, in the Chromosome column.
Considering the where-columns, using only ClinicalSignificance returns blue rows which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
Firstly, if we can analyse the structure and semantics of process P, to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in the processing of the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
Experimental setup for our study of ReComp techniques:
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
Also, as in Tab. 2 and 3 in the paper, I’d mention whether this reduction was possible with generic diff function or specific function tailored to SVI.
What is also interesting and what I would highlight is that even if the reduction is very close to 100% but below, the cost of recomputation of the process may still be significant because of some constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data) and that’s why Fig. 6 shows increase in runtime for GeneMap executed with 2 \deltas even if the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for GeneMap diff between 16-10-30 –> 16-10-31).
Strong scope: $v \in (\delta^- \cup \delta^+) \cap \mathit{used}(p_j, v) \Rightarrow p_j \text{ in scope}$
Semantic scope: $v.\mathit{phenotype} = p_j.\mathit{phenotype} \Rightarrow p_j \text{ in scope}$
Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only run the loop if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on full inputs just after the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty for the majority of cases.