Provenance Annotation and Analysis to Support Process Re-Computation

Jacek Cała, Paolo Missier
School of Computing
Newcastle University, UK

Problem Outline
• Consider process P, e.g. the following NGS pipeline [5]:
[5] Cała, J., Marei, E., Xu, Y., Takeda, K., Missier, P.: Scalable and efficient whole-exome data processing using workflows on the cloud.
Future Generation Computer Systems (Jan 2016).

Problem Outline
• P is only rarely a static entity.
• Usually, a variety of elements in P change:
• data dependencies,
• software tools & dependencies,
• [out of scope] the structure of P.
• Changes in the elements of P
=> the need to update past P outcomes
=> the need for re-computation.

The Re-Computation Framework
• To control the re-computation of processes
• proposed earlier in [6].
• The core of the framework is
the re-computation loop:
[6] Cała, J., Missier, P.: Selective and recurring re-computation of Big Data analytics
tasks: insights from a Genomics case study. Big Data Research (2018); in press.

Re-Computation Process
• Here we consider a single pass of the loop:
• And focus on the first step only (S1).

Preliminaries
• The ProvONE model: prospective + retrospective provenance [7].
• Set of software and data dependencies: D = {a0, b0, …}
• Process, input and execution configuration: E(P, x, V)
• Version change event: C = {a_n → a_{n-1}} (a_n is the new version, a_{n-1} the version it replaces)
• Composite version change event: C = {a_n → a_{n-1}, b_m → b_{m-1}, …}
• Change front.
• Re-computation front.
• Restart tree.
[7] Cuevas-Vicenttín, V., Ludäscher, B., Missier, P., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2016).

Change Front
• The accumulation of change events over a specified time window.
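Timeline of change events (in order of time t):
C0 = {a1 → a0}
C1 = {b1 → b0}
C2 = {a2 → a1, c1 → c0}
C3 = {a3 → a2, c2 → c1}
C4 = {d1 → d0}
C5 = {b2 → b1}
Change fronts:
CF3 = {a3, b1, c2}
CF5 = {a3, b2, c2, d1}
Executions shown on the timeline:
E(…, [a0, b0, e0])
E(…, [a0, b1, d0])
E(…, [a2, b1, c1])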
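
A minimal sketch of how a change front could be accumulated from the change events listed above. It rests on assumptions that the slide does not state: versions are ordered by their numeric suffix, and the time window is given as a range of event indices. The names `parse_version` and `change_front` are hypothetical.

```python
import re


def parse_version(item):
    """Split an identifier such as 'a3' into ('a', 3)."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", item)
    return match.group(1), int(match.group(2))


def change_front(events, start, end):
    """Return the latest version of every element changed by events start..end (inclusive)."""
    latest = {}
    for event in events[start:end + 1]:
        for new, _old in event:
            name, version = parse_version(new)
            # Assumption: a higher numeric suffix means a more recent version.
            if version > latest.get(name, -1):
                latest[name] = version
    return {f"{name}{version}" for name, version in latest.items()}


# Change events C0..C5 from the timeline, each pair written as (new, old).
events = [
    [("a1", "a0")],                # C0
    [("b1", "b0")],                # C1
    [("a2", "a1"), ("c1", "c0")],  # C2
    [("a3", "a2"), ("c2", "c1")],  # C3
    [("d1", "d0")],                # C4
    [("b2", "b1")],                # C5
]

cf3 = change_front(events, 0, 3)   # window C0..C3
cf5 = change_front(events, 0, 5)   # window C0..C5
```

Under these assumptions, `cf3` and `cf5` reproduce the sets CF3 and CF5 given on the slide.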

Re-computation Front
• Over time the population of executions grows.
• Some of them may result from re-executions.
• Some of them may be user-initiated
• and may use historical versions of elements.
• Following the transitive closure of the elements’ derivation selects too many executions.
=> Find out which of the past executions really need an update.

We use:
wasInformedBy(..., [prov:type="recomp:re-execution"])
to denote a ReComp-initiated re-execution.
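
One way to read the selection step is sketched below. It is a simplified interpretation, not the authors' Prolog implementation: an execution is placed on the front if it used a version older than the one in the change front and no ReComp-initiated re-execution has already refreshed it. The `Execution` class, the `reexecution_of` field (standing in for the annotated wasInformedBy statement), the execution names E0..E3 and the re-execution E3 itself are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


def parse_version(item):
    """Split an identifier such as 'a3' into ('a', 3)."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", item)
    return match.group(1), int(match.group(2))


@dataclass(frozen=True)
class Execution:
    name: str
    used: tuple                            # versioned elements, e.g. ("a0", "b0")
    reexecution_of: Optional[str] = None   # set only for ReComp-initiated re-executions


def is_outdated(execution, front):
    """True if the execution used an older version of any element in the change front."""
    newest = dict(parse_version(item) for item in front)
    return any(
        name in newest and version < newest[name]
        for name, version in map(parse_version, execution.used)
    )


def recomputation_front(executions, front):
    """Names of outdated executions that have not been refreshed by a re-execution."""
    refreshed = {e.reexecution_of for e in executions if e.reexecution_of}
    return {
        e.name
        for e in executions
        if e.name not in refreshed and is_outdated(e, front)
    }


executions = [
    Execution("E0", ("a0", "b0", "e0")),
    Execution("E1", ("a0", "b1", "d0")),
    Execution("E2", ("a2", "b1", "c1")),
    # Hypothetical ReComp-initiated re-execution of E0 with refreshed inputs.
    Execution("E3", ("a3", "b1", "e0"), reexecution_of="E0"),
]

front_cf3 = recomputation_front(executions, {"a3", "b1", "c2"})
front_cf5 = recomputation_front(executions, {"a3", "b2", "c2", "d1"})
```

In this sketch E0 never appears on the front because E3 supersedes it, while E3 itself joins the front once b2 arrives.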


Restart Tree
• Re-computation front handles single executions well.
• What if the process is more complex than that?
• pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline.
=> The provenance trace includes multiple interrelated executions.
=> During re-execution we have to combine all of them within a single context: the top-level execution.

Restart Tree
• To build a restart tree we rely on the provone:wasPartOf statements.
CF = {b2, e1}

Restart Tree
• Captures the vertical dimension of a single execution
• the transitive closure of the wasPartOf relation.
RT ≝ {Execution, [DataChange], [Children]}
RT = {E0, [], [{SE0, [], [{SSE1, [⟨b2 → b0⟩], []}, {SSE3, [⟨e1 → e0⟩], []}]}, {SE1, …}, …]}
CF = {b2, e1}
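
A minimal sketch of building such a tree by following wasPartOf statements downwards from a top-level execution and keeping only branches that contain a change. The trace below is hypothetical: it includes only the SE0 branch (the SE1 branch is elided on the slide and is left out here), and the sibling executions SSE0 and SSE2 with their inputs are invented for illustration. Tuples stand in for the {Execution, [DataChange], [Children]} structure.

```python
import re


def parse_version(item):
    """Split an identifier such as 'b2' into ('b', 2)."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", item)
    return match.group(1), int(match.group(2))


def restart_tree(root, was_part_of, used, front):
    """Build (execution, data_changes, children), pruned to branches affected by the front."""
    newest = dict(parse_version(item) for item in front)

    # Index the (child, parent) wasPartOf pairs by parent.
    children = {}
    for child, parent in was_part_of:
        children.setdefault(parent, []).append(child)

    def build(execution):
        changes = []
        for item in used.get(execution, []):
            name, version = parse_version(item)
            if name in newest and version < newest[name]:
                # Record the change as (new version, version actually used).
                changes.append((f"{name}{newest[name]}", item))
        subtrees = [
            tree for tree in map(build, children.get(execution, []))
            if tree is not None
        ]
        if not changes and not subtrees:
            return None  # nothing below this execution is affected
        return (execution, changes, subtrees)

    return build(root)


# Hypothetical trace fragment: (child, parent) wasPartOf pairs and the data used.
was_part_of = [
    ("SE0", "E0"),
    ("SSE0", "SE0"), ("SSE1", "SE0"), ("SSE2", "SE0"), ("SSE3", "SE0"),
]
used = {"SSE0": ["a0"], "SSE1": ["b0"], "SSE2": ["c0"], "SSE3": ["e0"]}

tree = restart_tree("E0", was_part_of, used, {"b2", "e1"})
```

The recursion over direct wasPartOf pairs covers the transitive closure, and pruning unaffected branches is what keeps the tree limited to the parts related to the change(s).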

The algorithm
• Combines all three aspects:
• the change front,
• the re-computation front and
• the restart tree.
• For a given change front, the algorithm
=> produces the re-computation front, which
=> comprises a set of restart trees,
=> each referring to a single top-level execution and containing only the parts related to the change(s).
• Enables ReComp to identify the minimal set of executions that may be affected by the change(s).
• The remaining executions are either unaffected or were refreshed previously.

Re-Computation Process
• Enables difference and impact analysis of the executions on the front
and their partial re-execution.

Difference and Impact Analysis

Conclusions
• We address the problem of the re-computation of:
• complex hierarchical processes,
• run over a cohort of input data samples,
• with multiple points of change,
• in an open system, where users can initiate (re-)executions at any time.
• The solution starts from the changes observed,
• in contrast to previous work, e.g. smart re-run and workflow caching.
• We proposed a simple algorithm to find the re-computation front:
• written in Prolog,
• fast (response times on the order of 1–100 ms),
• available on GitHub.
• The algorithm is the initial step in further scope identification and execution optimisation.

Thank you!
http://www.recomp.org.uk
