The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
1. Paolo Missier and Jacek Cala
Newcastle University
Cardiff, Dec 4th, 2019
Efficient Re-computation of Big Data Analytics Processes
in the Presence of Changes
In collaboration with
• Institute of Genetic Medicine, Newcastle University
3. 3
What changes?
• Data analytics pipelines
• Reference databases
• Algorithms and libraries
• Simulation
• Large parameter space
• Input conditions
• Machine Learning
• Evolving ground truth datasets
• Model re-training
4. 4
Analytics pipelines: Genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 mins / GB
@ 13GB / exome: about 10 hours
5. 5
Genomics: WES / WGS → variant calling → variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification : pathogenic, benign and unknown/uncertain
6. 6
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in number of variants that affect patients
(a) with a specific phenotype
(b) across all phenotypes
7. 7
Blind reaction to change: a game of battleship
Sparsity issue:
• About 500 executions
• 33 patients
• total runtime about 60 hours
• Only 14 relevant output changes
detected
4.2 hours of computation per change
Should we care about updates?
Evolving knowledge about
gene variations
8. 8
ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent changes +
not all changes significant
Challenge:
Make re-computation efficient in response
to changes
Assumptions:
Processes are
• Observable
• Reproducible
Estimates are cheap
Insight: replace re-computation with change impact estimation
Using history of past executions
9. 9
Reproducibility: how? why, when, and to what extent?
Selective:
- Across a cohort of past executions: which subset of individuals?
- Within a single re-execution: which process fragments?
Example change events: a change in ClinVar, a change in GeneMap
10. 10
The rest of the talk
Technical Approach:
• The ReComp meta-process
• Selecting past executions: measuring impact of changes on each past outcome
• Selecting process fragments for partial re-run
Empirical evaluation
Architecture and customization opportunities
11. 11
The ReComp meta-process
[Diagram: the ReComp meta-process loop]
1. Record execution history: the analytics process P writes logs / provenance into the History DB
2. Detect and quantify changes: change events on reference datasets and inputs yield data diff(d, d')
3. For each past instance: estimate the impact of changes, Impact(d → d', o), via impact estimation functions
4. Scope: select relevant sub-processes
5. Partially re-execute: P(D) → P'(D')
(Optimisation spans steps 4 and 5)
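The loop above can be sketched in a few lines. Everything here (the set-based diff, the counting impact estimate, the history layout) is an illustrative stand-in for the framework's customisable components, not the actual ReComp API:

```python
# Minimal, self-contained sketch of the ReComp loop. The set-based diff,
# the impact estimate, and the history layout are illustrative stand-ins
# for the framework's customisable components, not the actual ReComp API.

def diff(old, new):
    """Type-specific diff function: here, symmetric set difference."""
    return (new - old) | (old - new)

def estimate_impact(dd, used_items):
    """Impact estimate: non-zero iff the execution used a changed item."""
    return len(dd & used_items)

def recomp_loop(change, history):
    """Return the past executions selected for partial re-execution."""
    old, new = change
    dd = diff(old, new)
    if not dd:
        return []                       # nothing changed, re-run nothing
    return [ex for ex, used in history.items()
            if estimate_impact(dd, used) > 0]

# Two past executions; only p1 used the record that changed (v2 -> v2b)
history = {"p1": {"v1", "v2"}, "p2": {"v3"}}
change = ({"v1", "v2", "v3"}, {"v1", "v2b", "v3"})
print(recomp_loop(change, history))     # ['p1']
```

The point of the sketch: re-computation is replaced by a cheap estimate over recorded history, and only the executions with non-null estimated impact are re-run.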
12. 12
Changes, data diff, impact
1) Observed change events (inputs, dependencies, or both)
2) Type-specific diff functions
3) Impact of change C on output y
Impact is process- and data-specific.
13. 13
Impact
Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output.
However, a change in one of the dependencies, C = {d → d'},
affects all outputs y_t where version d of D was used.
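This distinction can be made concrete with a small provenance lookup; the record layout and function names are illustrative assumptions, not part of ReComp:

```python
# Illustrative provenance lookup (the record layout is an assumption):
# an input change touches one output, while a dependency change touches
# every past output whose trace used that dependency version.

provenance = [
    # (output, inputs used, dependency versions used)
    ("y1", {"x1"}, {"ClinVar-2016"}),
    ("y2", {"x2"}, {"ClinVar-2016"}),
    ("y3", {"x3"}, {"ClinVar-2017"}),
]

def affected_by_input_change(x):
    return [y for y, xs, _ in provenance if x in xs]

def affected_by_dependency_change(version):
    return [y for y, _, ds in provenance if version in ds]

print(affected_by_input_change("x2"))                 # ['y2']
print(affected_by_dependency_change("ClinVar-2016"))  # ['y1', 'y2']
```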
14. 14
How much do we know about P?
[Figure: impact estimation vs. re-execution options along two axes, process structure and execution trace, from less to more knowledge]
- Black box, I/O provenance only (IO, DO): all-or-nothing re-execution — monolithic or legacy process, a complex simulator
- White box, step-by-step provenance (workflows, R / Python code — the typical genomics analytics process): fine-grained impact, partial restart trees (*)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.
15. 15
ReComp meta-process flow
PaoloMissier2019
Identify the subset of executions that are
potentially affected by the changes
Determine whether changes may have
had any impact on outputs
Identify and re-execute the
minimal fragments of workflow
that have been affected
16. 16
1. Identify the subset of executions that are potentially affected by the changes
2. Identify and re-execute the minimal fragments of workflow that have been affected
17. 17
Diff and impact estimation functions
Actual
impact
Estimated
impact
19. 19
Change impact analysis algorithm
Aim:
Identify the minimal subset of observed changes that have an actual effect on past outcomes.
This is done by progressively eliminating changes whose impact has been estimated as null.
Intuition:
- From the workflow, derive an impact graph
- This is a new type of dataflow whose execution semantics is designed to:
  - propagate input changes
  - compute diff functions
  - compute impact functions on diffs
  - eliminate changes from the inputs when their impact is null
- Input: a set of changes
- Output: a set of bindings that indicates which changes are relevant and have non-zero impact on the process
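The elimination idea can be sketched under simplifying assumptions (a DAG of steps, one entry step per change, and boolean per-step impact functions); the step names and graph are hypothetical, not the SVI workflow itself:

```python
# Sketch of the impact-graph idea under simplifying assumptions: the
# workflow is a DAG of steps, each change enters at one step, and a
# per-step impact function decides whether the change propagates.
from collections import deque

# Hypothetical SVI-like dataflow: step -> downstream steps
graph = {"select_genes": ["select_variants"],
         "select_variants": ["classify"],
         "classify": []}

def always(_change): return True   # change propagates through this step
def never(_change): return False   # estimated impact is null: stop here

impact_fn = {s: always for s in graph}

def relevant_changes(changes):
    """changes: {change_id: entry step}. Return the ids whose impact
    reaches an output step; null-impact changes are dropped on the way."""
    surviving = set()
    for cid, entry in changes.items():
        frontier = deque([entry])
        while frontier:
            step = frontier.popleft()
            if not impact_fn[step](cid):
                continue               # eliminated: impact estimated null
            if not graph[step]:
                surviving.add(cid)     # reached an output step
                break
            frontier.extend(graph[step])
    return surviving

print(relevant_changes({"omim-update": "select_genes"}))  # {'omim-update'}
impact_fn["select_variants"] = never   # downstream step absorbs the change
print(relevant_changes({"omim-update": "select_genes"}))  # set()
```

The second call shows the elimination: once a step's impact function returns null, the change never reaches the output and is dropped from the relevant set.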
20. 20
SVI: data-diff and impact functions
- Data-specific
- Process-specific
Diff functions: OMIM, ClinVar
Impact on ‘p1 select genes’
Impact on the SVI output
Overall impact
21. 21
Diff functions for SVI: ClinVar
[Figure: diff between ClinVar 1/2016 and ClinVar 1/2017 — added, removed, updated, and unchanged rows]
Relational data → simple set difference
The ClinVar dataset: 30 columns
Changes between the two versions:
- Records: 349,074 → 543,841
- Added: 200,746; Removed: 5,979; Updated: 27,662
22. 22
For tabular data, difference is just Select-Project:
- Key columns: {"#AlleleID", "Assembly", "Chromosome"}
- "Where" columns: {"ClinicalSignificance"}
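A pandas sketch of this select-project diff: rows are matched on the key columns, and "changed" means a where-column differs. The key and where column names come from the slide; the merge-based implementation is an assumption about how such a diff could be computed:

```python
# A pandas sketch of the select-project diff: match rows on the key
# columns, then flag rows whose "where" columns differ between versions.
import pandas as pd

KEY = ["#AlleleID", "Assembly", "Chromosome"]
WHERE = ["ClinicalSignificance"]

def table_diff(old, new):
    m = old.merge(new, on=KEY, how="outer", indicator=True,
                  suffixes=("_old", "_new"))
    added = m[m["_merge"] == "right_only"]
    removed = m[m["_merge"] == "left_only"]
    both = m[m["_merge"] == "both"]
    # changed = key present in both versions, but a "where" column differs
    mask = (both[[c + "_old" for c in WHERE]].values
            != both[[c + "_new" for c in WHERE]].values).any(axis=1)
    return added, removed, both[mask]

old = pd.DataFrame({"#AlleleID": [1, 2], "Assembly": ["GRCh38"] * 2,
                    "Chromosome": ["1", "2"],
                    "ClinicalSignificance": ["Benign", "Pathogenic"]})
new = pd.DataFrame({"#AlleleID": [1, 3], "Assembly": ["GRCh38"] * 2,
                    "Chromosome": ["1", "3"],
                    "ClinicalSignificance": ["Pathogenic", "Benign"]})
added, removed, changed = table_diff(old, new)
print(len(added), len(removed), len(changed))  # 1 1 1
```

Restricting the comparison to the where-columns is what keeps the diff small: changes in columns such as LastEvaluated are deliberately not reported.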
23. 23
Binary SVI impact function
Returns True iff:
- Known variants for this patient have moved in/out of Red status
- New Red variants have appeared for this patient’s phenotype
- Known Red variants for this patient’s phenotype have been retracted
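The three conditions above can be folded into a single predicate over status transitions; the data layout and helper names here are illustrative, not SVI's actual implementation:

```python
# Sketch of the binary impact rule: the data layout is illustrative.
RED = "red"

def svi_impact(patient_variants, phenotype_variants, status_diff):
    """status_diff: {variant: (old_status, new_status)} from the diff step.
    True iff a variant relevant to this patient or phenotype moved into
    or out of Red status. New and retracted variants are included by
    using None as the status on the missing side."""
    for v, (old, new) in status_diff.items():
        moved = (old == RED) != (new == RED)
        if moved and (v in patient_variants or v in phenotype_variants):
            return True
    return False

print(svi_impact({"v1"}, set(), {"v1": ("amber", "red")}))  # True
print(svi_impact({"v9"}, set(), {"v1": ("amber", "red")}))  # False
```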
24. 25
ReComp decision matrix for SVI
Impact: yes / no / not assessed
delta functions: data diff detected?
27. 28
Role of provenance
Impact facts:
- During each execution ReComp records port-data bindings for all the data that flow
through annotated input and output ports
- Each impact function is able to use some of these bindings as its own inputs
- These are the impact facts that the function is evaluated on
- To find these bindings, traverse the dependencies of impact to diff functions
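A minimal illustration of retrieving impact facts: per-execution port-data bindings, filtered to the ports an impact function needs. The store layout and port names are assumptions for the sake of the example:

```python
# Illustrative lookup of impact facts: per-execution port-data bindings,
# filtered to the ports an impact function declares it needs.

bindings = {  # execution id -> {port name: data reference}
    "exec-42": {"phenotype": "Alzheimer",
                "genes": ["APP", "PSEN1"],
                "classified_variants": "out-42.csv"},
}

def impact_facts(exec_id, needed_ports):
    """Return only the bindings the impact function is evaluated on."""
    b = bindings[exec_id]
    return {p: b[p] for p in needed_ports if p in b}

print(impact_facts("exec-42", ["phenotype", "classified_variants"]))
```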
28. 29
ReComp History DB
(The meta-process diagram again, shown here to highlight the History DB: P's execution history and provenance are recorded in the History DB; changes to reference datasets and inputs are detected and quantified as data diff(d, d'); for each past instance the impact Impact(d → d', o) is estimated via impact estimation functions; relevant sub-processes are selected and P is partially re-executed on D'.)
30. 31
1. Identify the subset of executions that are potentially affected by the changes
2. Identify and re-execute the minimal fragments of workflow that have been affected
31. 32
SVI implemented using workflow
[Workflow diagram]
Steps: phenotype-to-genes → variant selection → variant classification
Inputs: phenotype, patient variants, GeneMap, ClinVar
Output: classified variants
33. 34
History DB: Workflow Provenance
Each invocation of an eSC workflow generates a provenance trace
[Diagram: essential ProvONE fragment used by ReComp — a workflow WF (the "plan") contains program blocks B1 and B2; a workflow execution WFexec (the "plan execution") contains B1exec and B2exec (partOf), each associated with its block; usage and generation edges link the executions to Data entities, including the reference database db.]
34. 35
Within an instance: Partial re-execution
Change detection: a provenance fact indicates that a new version Dnew of
database "db" is available:
wasDerivedFrom("db", Dnew)
Reacting to the change (ex. db = "ClinVar v.x"):
1. Find the entry point(s) into the workflow, where db was used:
:- execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db")
2. Discover the rest of the sub-workflow graph (executed recursively):
:- execution(WFexec), execution(Xexec), execution(Yexec),
wasPartOf(Xexec, WFexec), wasPartOf(Yexec, WFexec),
wasGeneratedBy(Data, Xexec), used(Yexec, Data)
[Provenance pattern diagram: WF ("plan") with blocks B1, B2; WFexec ("plan execution") with Xexec, Yexec; usage and generation edges through Data and the db entity]
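The recursive discovery can be sketched as a closure over generation and usage edges, starting from the executions that used the changed database. Provenance facts are modelled here as plain tuples, an assumption for the example (ReComp stores and queries them in Prolog):

```python
# Sketch of the recursive sub-workflow discovery: start from the
# executions that used the changed entity, then follow generation and
# usage edges downstream to collect the fragment to re-execute.

used = {("Xexec", "db"), ("Yexec", "Data")}   # used(Exec, Entity)
generated = {("Data", "Xexec")}               # wasGeneratedBy(Entity, Exec)

def downstream(changed_entity):
    """All executions reachable from the entry points that used the
    changed entity; these form the fragment to re-execute."""
    frontier = {e for e, d in used if d == changed_entity}  # entry points
    closure = set(frontier)
    while frontier:
        produced = {d for d, e in generated if e in frontier}
        frontier = {e for e, d in used if d in produced} - closure
        closure |= frontier
    return closure

print(sorted(downstream("db")))  # ['Xexec', 'Yexec']
```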
35. 36
SVI – partial re-execution
Overhead: caching intermediate data
Time savings:
- Change in GeneMap: partial re-exec 325 s vs. complete re-exec 455 s → 28.5% saved
- Change in ClinVar: partial re-exec 287 s vs. complete re-exec 455 s → 37% saved
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
38. 43
Summary of ReComp challenges
(The meta-process diagram again, annotated with open challenges:)
- Not all runtime environments support provenance recording
- Diff functions are both type- and application-specific
- Sensitivity analysis is unlikely to work well: small input perturbations can have a large impact
- Reproducibility
39. 44
Summary
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Generic ReComp framework:
- Observe changes, Provenance DB (History), control re-exec
Customisation:
- Diff functions, impact functions
Fine-grained provenance + control → max savings
Editor's Notes
We are going to ignore BDA in this talk
And also simulation although it’s a case study
We are going to use this smaller process as a testbed
Changes in the reference databases have an impact on the classification
Returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{\langle \dt, genes(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$\\
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV(\CV^t, \CV^{t'}) = &\{ \langle v, \varst(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
&\cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\label{eq:diff-cv}
\end{align*}
where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.
Threats: Will any of the changes invalidate prior findings?
Opportunities: Can the findings be improved over time?
Can we do better in a generic way?
We need to control re-computation on two dimensions
Across a population
Within a single process
Success criteria:
performance, but this is on a case-by-case basis
Ease of customization. The focus of this paper
The framework is a meta-process…
Changes can also occur to OS, libraries and other dependencies but these are out of scope
This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349,074 rows in the old set and 543,841 rows in the new set: 200,746 rows added, 5,979 removed, and 27,662 changed.
As on the previous slide, you may want to highlight that the selection of key-columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091 which looks very similar in both added (green) and removed (red) sets. They differ, however, in the Chromosome column.
Considering the where-columns, using only ClinicalSignificance returns blue rows which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
\text{let } v \in \diff{Y}(Y^t, Y^{t'}): \\
\text{for any $X$: } \impact_{P}(C,X) = \texttt{High} \text{ if }\\
v.\texttt{status:}
\begin{cases}
* \rightarrow \texttt{red} \\
\texttt{red} \rightarrow *
\end{cases}
\delta_1, \delta_4
\phi_1, \phi_5
Shows Essential ProvONE fragment used by ReComp
This shows the good case of “Gerry box” workflow and box-level provenance
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
How these two restart trees are discovered is explained in the two papers
IPAW
BDC
Uses difference and impact services to analyse the impact of the changes on past executions, and submits a subset of affected executions for rerun.
HDB will have been discussed earlier
Facts stored and queried using Prolog
store/retrieve REST API. Canned queries or ad hoc queries (advanced interface)
Impact functions realized as external services reachable through a REST API
reExec function takes restart trees and executes them; this may not always be possible, which in fact is a major limitation for current systems
ReComp loop produces recomp/no-recomp decisions at the level of each restart tree
Data diff is an additional external service