Provenance Annotation and Analysis to Support Process Re-Computation
1. Provenance Annotation and Analysis to Support Process Re-Computation
Jacek Cała, Paolo Missier
School of Computing
Newcastle University, UK
2. Problem Outline
• Consider process P, e.g. the following NGS pipeline [5]:
[5] Cała, J., Marei, E., Xu, Y., Takeda, K., Missier, P.: Scalable and efficient whole-exome data processing using workflows on the cloud. Future Generation Computer Systems (Jan 2016).
3. Problem Outline
• P is only rarely a static entity.
• Usually, a variety of elements in P change:
• data dependencies,
• software tools & dependencies,
• [out of scope] the structure of P.
• Changes in the elements of P
=> the need to update past P outcomes
=> the need for re-computation.
4. The Re-Computation Framework
• To control the re-computation of processes
• proposed earlier in [6].
• The core of the framework is the re-computation loop:
[6] Cała, J., Missier, P.: Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study. Big Data Research (2018); in press.
6. Preliminaries
• The ProvONE model: prospective + retrospective provenance [7].
• Set of software and data dependencies: D = {a0, b0, …}
• Process, input and execution configuration: E(P, x, V)
• Version change event: C = {a_n → a_{n-1}}
• Composite version change event: C = {a_n → a_{n-1}, b_m → b_{m-1}, …}
• Change front.
• Re-computation front.
• Restart tree.
[7] Cuevas-Vicenttín, V., Ludäscher, B., Missier, P., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2016).
7. Change Front
• The accumulation of change events over a specified time window.
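[Figure: timeline over time t showing change events, change fronts and executions]
Change events and change fronts, in time order:
C0 = {a1 → a0}
C1 = {b1 → b0}
C2 = {a2 → a1, c1 → c0}
C3 = {a3 → a2, c2 → c1}
CF3 = {a3, b1, c2}
C4 = {d1 → d0}
C5 = {b2 → b1}
CF5 = {a3, b2, c2, d1}
Executions shown against the timeline:
E(…, [a0, b0, e0])
E(…, [a0, b1, d0])
E(…, [a2, b1, c1])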
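The accumulation shown on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation (which the conclusions describe as written in Prolog). The `(new, old)` encoding of a version change and the function name `change_front` are assumptions made here. The window is assumed to start at C0, which matches CF5 still listing a3 and c2.

```python
from typing import Dict, Iterable, Tuple

# A change event maps each changed artefact to (new_version, old_version),
# mirroring the slide notation {a1 → a0}. This encoding is an assumption.
ChangeEvent = Dict[str, Tuple[int, int]]


def change_front(events: Iterable[ChangeEvent]) -> Dict[str, int]:
    """Accumulate change events, keeping only the newest version of each changed artefact."""
    front: Dict[str, int] = {}
    for event in events:
        for name, (new, _old) in event.items():
            # Artefacts that never changed do not appear in the front at all.
            front[name] = max(new, front.get(name, new))
    return front


# Change events C0..C5 as listed on the slide.
C = [
    {"a": (1, 0)},
    {"b": (1, 0)},
    {"a": (2, 1), "c": (1, 0)},
    {"a": (3, 2), "c": (2, 1)},
    {"d": (1, 0)},
    {"b": (2, 1)},
]

cf3 = change_front(C[:4])  # window C0..C3
cf5 = change_front(C)      # window C0..C5
print(cf3)  # {'a': 3, 'b': 1, 'c': 2}
print(cf5)  # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
```

The two results agree with CF3 = {a3, b1, c2} and CF5 = {a3, b2, c2, d1} on the slide.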
8. Re-computation Front
• Over time the population of executions grows
• Some of them may result from re-executions
• Some of them may be user-initiated
• may use historical versions of elements
• Looking for the transitive closure of the elements’ derivation is too broad.
=> find out which of the past executions really need an update.
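One possible reading of this selection step, sketched in Python under stated assumptions: an execution is a candidate if it used an older version of some artefact in the change front, and it is dropped if another execution was already informed by it (that is, it has already been refreshed). The `Execution` record, its `informed_by` field and the re-execution `E0r` are hypothetical names introduced for this example. The actual algorithm is the one described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Execution:
    name: str
    versions: Dict[str, int]           # artefact name -> version used
    informed_by: Optional[str] = None  # execution this one re-computes, if any


def recomputation_front(execs: List[Execution], cf: Dict[str, int]) -> List[Execution]:
    """Select executions that used outdated artefacts and have not been refreshed yet."""
    # Executions that already served as an informant to a re-execution.
    refreshed = {e.informed_by for e in execs if e.informed_by is not None}
    front = []
    for e in execs:
        outdated = any(
            name in e.versions and e.versions[name] < newest
            for name, newest in cf.items()
        )
        if outdated and e.name not in refreshed:
            front.append(e)
    return front


execs = [
    Execution("E0", {"a": 0, "b": 0, "e": 0}),
    Execution("E1", {"a": 0, "b": 1, "d": 0}),
    Execution("E2", {"a": 2, "b": 1, "c": 1}),
    # Hypothetical re-execution of E0 against the newest versions.
    Execution("E0r", {"a": 3, "b": 1, "e": 0}, informed_by="E0"),
]
cf3 = {"a": 3, "b": 1, "c": 2}

front_names = [e.name for e in recomputation_front(execs, cf3)]
print(front_names)  # ['E1', 'E2']
```

E0 is left out because E0r already refreshed it, and E0r is left out because it uses the newest version of every artefact in the change front that it depends on.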
13. Restart Tree
• Re-computation front handles single executions well.
• What if the process is more complex than that?
• pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline.
The provenance trace includes multiple interrelated executions.
During re-execution we have to combine all of them within a single context: the top-level execution.
18. Restart Tree
• To build a restart tree we rely on the provone:wasPartOf statements.
CF = {b2, e1}
19. Restart Tree
• Captures the vertical dimension of a single execution
• the transitive closure of the wasPartOf relation.
RT ≝ {Execution, [DataChange], [Children]}
CF = {b2, e1}
RT = {E0, [], [{SE0, [], [{SSE1, [⟨b2 → b0⟩], []},{SSE3, [⟨e1 → e0⟩], []}]}, {SE1, …}, …]}
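A tree of this shape could be assembled from child-to-parent `wasPartOf` pairs as in the Python sketch below. The tuple layout follows the RT definition on the slide. The pair list, the `changes` mapping and the pruning rule (keep a node only if it or one of its descendants is touched by a change) are assumptions for illustration. SSE0 and SSE2 are hypothetical unaffected siblings, and the elided SE1 branch from the slide is not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def restart_tree(
    root: str,
    was_part_of: List[Tuple[str, str]],  # (child, parent) pairs
    changes: Dict[str, List[str]],       # execution -> data changes that affect it
) -> Optional[tuple]:
    """Build (execution, [data changes], [children]), pruning unaffected branches."""
    children = defaultdict(list)
    for child, parent in was_part_of:
        children[parent].append(child)

    def build(node: str) -> Optional[tuple]:
        subtrees = [t for t in (build(c) for c in children[node]) if t is not None]
        own = changes.get(node, [])
        if not own and not subtrees:
            return None  # nothing below this node is affected by the change front
        return (node, own, subtrees)

    return build(root)


was_part_of = [
    ("SE0", "E0"),
    ("SSE0", "SE0"),  # hypothetical, unaffected
    ("SSE1", "SE0"),
    ("SSE2", "SE0"),  # hypothetical, unaffected
    ("SSE3", "SE0"),
]
changes = {"SSE1": ["b2 → b0"], "SSE3": ["e1 → e0"]}

tree = restart_tree("E0", was_part_of, changes)
print(tree)
```

The result keeps E0 and SE0 as context for the two affected leaves SSE1 and SSE3, and drops the unaffected siblings.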
21. The algorithm
• Combines all three aspects:
• the change front,
• the re-computation front and
• the restart tree.
• For a given change front, it
=> produces the re-computation front, which
=> includes a set of restart trees,
=> each referring to a single top-level execution with only the parts related to the change(s).
• Enables ReComp to identify the minimal set of executions that may be affected by the change(s).
• The remaining executions are either unaffected or have already been refreshed.
24. Conclusions
• We address the problem of the re-computation of:
• complex hierarchical processes,
• run over a cohort of input data samples,
• with multiple points of change,
• in an open system, where users can initiate (re-)executions at any time.
• The solution starts from the changes observed,
• in contrast to previous work, e.g. smart re-run and workflow caching.
• We proposed a simple algorithm to find the re-computation front:
• written in Prolog,
• very efficient (response times in the order of 1–100 ms),
• available on GitHub.
• The algorithm is the initial step in further scope identification and execution optimisation.
Highlight the key elements: complex hierarchical workflow, multiple patient inputs, variety of software and data dependencies.
A single run of the pipeline may include 30–40 patient samples, and tens or hundreds of such executions are made in practice, e.g. for 1511 brain tumor patients from IGM.
Examples of why changes in P should invoke updates of past P outcomes:
reference genome ???
dbSNP: variants no longer considered de novo (important in the case of rare diseases)
tool versions: more accurate alignment or variant discovery
Multiple past executions – scope of change is to filter irrelevant execs.
Mention that this single pass here is slightly simplified – no cost considered.
notation is slightly different – for the sake of presentation
Mention that x is explicitly indicated to denote the fact that V describes dependency changes, whereas x ∈ X is an element of a population of inputs.
The last three are here to explain the effective algorithm. The alg. is supposed to produce the minimal set of potentially affected execs.
Simple change – wait until CF
Composite change
Only the changed artefacts
Change front CF3
Only the newest references – transitive closure of wDF on the set of all derivations of artefact D.v.
The open system – users can introduce execs with non-recent versions.
More changes
Change front CF5
Includes all change artefacts – keeps the system open.
Prioritisation within impact analysis may mean that not all past execs are recomputed.
Windowing, windowing policy
Explain that P may depend on many other components but C and CF include only the changed ones.
Also, mention that CF includes references to only the newest version of each changed dependency, and includes all of them. This is important because various executions may depend on earlier versions, not necessarily the immediate predecessor. Also, the user can introduce executions which depend on versions well before the last CF.
Mention that the windowing policy may vary widely, e.g. fixed window size or adaptive window based on some measure of change event significance.
Isn’t it as simple as filtering out execs that appear in wasInformedBy statements?
Grey arrows indicate the data flow: solid lines – the usage of data, dotted lines – the communication / generation-usage pattern
Black dashed arrows reflect the structure of the process.
Mention that for the sake of simplicity we are not showing the port-data usage, which is also needed.
How is it different from workflow caching?
A change front marks a point in time when ReComp initiates the re-computation loop.
Unless a top-level coordination process is already specified.
A few simplifications:
All execs depend on all shown artefacts {a & b}.
There may be a whole set of execs not on the front, which depend on other elements/artefacts.
Re-execution of E0 but not E1 (as explained in the paper) may be due to the scope of the changes in ‘a’: E1 may not be affected by a2 → a1 and so is not refreshed.
Execs on the front are those that are not an informant to any other exec.