Provenance Annotation and Analysis to Support Process Re-Computation

Paper talk given at the IPAW 2018 conference.
The paper is available at: http://www.lamsade.dauphine.fr/~belhajjame/Program_files/pdf/s3_p1.pdf


  1. 1. Provenance Annotation and Analysis to Support Process Re-Computation Jacek Cała, Paolo Missier School of Computing Newcastle University, UK
  2. 2. Problem Outline • Consider process P, e.g. the following NGS pipeline [5]: [5] Cała, J., Marei, E., Xu, Y., Takeda, K., Missier, P.: Scalable and efficient whole-exome data processing using workflows on the cloud. Future Generation Computer Systems (Jan 2016).
  3. 3. Problem Outline • P is only rarely a static entity. • Usually, a variety of elements in P change: • data dependencies, • software tools & dependencies, • [out of scope] the structure of P. • Changes in the elements of P => the need to update past outcomes of P => the need for re-computation.
  4. 4. The Re-Computation Framework • To control the re-computation of processes • proposed earlier in [6]. • The core of the framework is the re-computation loop: [6] Cała, J., Missier, P.: Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study. Big Data Research (2018); in press.
  5. 5. Re-Computation Process • Here we consider a single pass of the loop: • And focus on the first step only (S1).
  6. 6. Preliminaries • The ProvONE model: prospective + retrospective provenance [7]. • Set of software and data dependencies: D = {a0, b0, …} • Process, input and execution configuration: E(P, x, V) • Version change event: C = {an → an-1} • Composite version change event: C = {an → an-1, bm → bm-1, …} • Change front. • Re-computation front. • Restart tree. [7] Cuevas-Vicenttín, V., Ludäscher, B., Missier, P., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2016).
  7. 7. Change Front • The accumulation of change events over a specified time window. [Figure: timeline t of change events C0 {a1 → a0}, C1 {b1 → b0}, C2 {a2 → a1, c1 → c0}, C3 {a3 → a2, c2 → c1}, C4 {d1 → d0}, C5 {b2 → b1}; change fronts CF3 {a3, b1, c2} and CF5 {a3, b2, c2, d1}; executions E(…, [a0, b0, e0]), E(…, [a0, b1, d0]), E(…, [a2, b1, c1]).]
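The accumulation described on the Change Front slide can be sketched in a few lines of Python. This is only an illustration, not the authors' Prolog implementation: it assumes that versions are ordered by their numeric suffix, that the time window starts at C0, and that each change event is a list of (new, old) pairs. The names `parse_version`, `change_front` and `EVENTS` are hypothetical.

```python
import re


def parse_version(name):
    """Split a versioned element such as 'a3' into ('a', 3)."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", name)
    return match.group(1), int(match.group(2))


def change_front(events):
    """Accumulate change events into the newest version of each changed element.

    Assumption: a higher numeric suffix means a newer version.
    """
    latest = {}
    for event in events:              # events are given in chronological order
        for new, _old in event:
            elem, ver = parse_version(new)
            if ver > latest.get(elem, -1):
                latest[elem] = ver
    return {f"{elem}{ver}" for elem, ver in latest.items()}


# Change events C0..C5 as listed on the slide, each a list of (new, old) pairs.
EVENTS = [
    [("a1", "a0")],                  # C0
    [("b1", "b0")],                  # C1
    [("a2", "a1"), ("c1", "c0")],    # C2
    [("a3", "a2"), ("c2", "c1")],    # C3
    [("d1", "d0")],                  # C4
    [("b2", "b1")],                  # C5
]

cf3 = change_front(EVENTS[:4])       # window C0..C3
cf5 = change_front(EVENTS)           # window C0..C5
print(sorted(cf3))                   # ['a3', 'b1', 'c2']
print(sorted(cf5))                   # ['a3', 'b2', 'c2', 'd1']
```

Under these assumptions the sketch reproduces the two fronts shown on the slide, CF3 and CF5.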
  8. 8. Re-computation Front • Over time the population of executions grows. • Some of them may result from re-executions. • Some of them may be user-initiated and • may use historical versions of elements. • Looking for the transitive closure of the elements’ derivation is too broad. => Find out which of the past executions really need an update.
  9. 9. Re-computation Front We use: wasInformedBy(..., [prov:type=“recomp:re-execution”]) to denote a ReComp-initiated re-execution.
  10. 10. Re-computation Front
  11. 11. Re-computation Front … user-initiated
  12. 12. Re-computation Front
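One simplified reading of the re-computation front can be sketched as follows. This is a hedged illustration, not the authors' algorithm: it assumes an execution belongs to the front when it used an older version of some element on the change front and has not already been superseded by a ReComp-initiated re-execution (the `wasInformedBy` link typed `recomp:re-execution`). The execution identifiers E0..E3, the re-execution E3 and all function names are hypothetical; only the three lists of used versions for E0..E2 come from the slides.

```python
def split_version(name):
    """Split a versioned element such as 'b2' into ('b', 2)."""
    elem = name.rstrip("0123456789")
    return elem, int(name[len(elem):])


def is_stale(used, change_front):
    """True if the execution used an older version of any element on the change front."""
    newest = dict(split_version(v) for v in change_front)
    for item in used:
        elem, ver = split_version(item)
        if elem in newest and ver < newest[elem]:
            return True
    return False


def recomputation_front(executions, re_executions, change_front):
    """Return ids of stale executions that no re-execution has superseded.

    re_executions holds (new, old) pairs, mirroring
    wasInformedBy(new, old, [prov:type="recomp:re-execution"]).
    """
    superseded = {old for _new, old in re_executions}
    return {
        eid
        for eid, used in executions.items()
        if eid not in superseded and is_stale(used, change_front)
    }


EXECUTIONS = {
    "E0": ["a0", "b0", "e0"],
    "E1": ["a0", "b1", "d0"],
    "E2": ["a2", "b1", "c1"],
    "E3": ["a3", "b2", "e0"],    # hypothetical ReComp re-execution of E0
}
RE_EXECUTIONS = [("E3", "E0")]   # E3 wasInformedBy E0
CF5 = {"a3", "b2", "c2", "d1"}

front = recomputation_front(EXECUTIONS, RE_EXECUTIONS, CF5)
print(sorted(front))             # ['E1', 'E2']
```

E0 is stale but already refreshed by E3, and E3 itself is up to date with respect to CF5, so only E1 and E2 remain on the front. A user-initiated execution that reuses historical versions carries no re-execution link, so in this sketch it is judged on its own used versions.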
  13. 13. Restart Tree • Re-computation front handles single executions well. • What if the process is more complex than that? • pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline.
  17. 17. Restart Tree • Re-computation front handles single executions well. • What if the process is more complex than that? • pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline. • The provenance trace includes multiple interrelated executions. • During re-execution we have to combine all of them within a single context: the top-level execution.
  18. 18. Restart Tree • To build a restart tree we rely on the provone:wasPartOf statements. CF = {b2, e1}
  19. 19. Restart Tree • Captures the vertical dimension of a single execution • the transitive closure of the wasPartOf relation. RT ≝ {Execution, [DataChange], [Children]} CF = {b2, e1}
  20. 20. Restart Tree • Captures the vertical dimension of a single execution • the transitive closure of the wasPartOf relation. RT = {E0, [], [{SE0, [], [{SSE1, [⟨b2 → b0⟩], []},{SSE3, [⟨e1 → e0⟩], []}]}, {SE1, …}, …]} CF = {b2, e1}
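The structure RT ≝ {Execution, [DataChange], [Children]} can be illustrated with a short recursive sketch over wasPartOf pairs. It is an assumption-laden illustration rather than the authors' Prolog code: the hierarchy below (including the sub-execution SSE2 and its input c0) is hypothetical, the elided SE1 branch from the slide is left out rather than guessed, and branches with no change are pruned so that only the parts related to the change front remain.

```python
def split_version(name):
    """Split a versioned element such as 'e1' into ('e', 1)."""
    elem = name.rstrip("0123456789")
    return elem, int(name[len(elem):])


def restart_tree(root, was_part_of, used, change_front):
    """Build (execution, data_changes, children), keeping only changed branches.

    was_part_of: (child, parent) pairs, as in provone:wasPartOf(child, parent).
    used: maps an execution id to the versioned elements it consumed.
    """
    newest = {split_version(v)[0]: v for v in change_front}
    children_of = {}
    for child, parent in was_part_of:
        children_of.setdefault(parent, []).append(child)

    def build(execution):
        changes = []
        for item in used.get(execution, []):
            elem, ver = split_version(item)
            new = newest.get(elem)
            if new is not None and split_version(new)[1] > ver:
                changes.append((new, item))        # reads as <new -> old>
        subtrees = []
        for child in sorted(children_of.get(execution, [])):
            subtree = build(child)
            if subtree is not None:
                subtrees.append(subtree)
        if not changes and not subtrees:
            return None                            # branch unaffected by the change front
        return (execution, changes, subtrees)

    return build(root)


# Hypothetical hierarchy: E0 contains SE0, which contains SSE1, SSE2 and SSE3.
WAS_PART_OF = [("SE0", "E0"), ("SSE1", "SE0"), ("SSE2", "SE0"), ("SSE3", "SE0")]
USED = {"SSE1": ["b0"], "SSE2": ["c0"], "SSE3": ["e0"]}
CF = {"b2", "e1"}

tree = restart_tree("E0", WAS_PART_OF, USED, CF)
print(tree)
```

For this input the result is `('E0', [], [('SE0', [], [('SSE1', [('b2', 'b0')], []), ('SSE3', [('e1', 'e0')], [])])])`, which matches the shape of the SE0 branch in the slide's example; SSE2 is dropped because c is not on the change front.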
  21. 21. The algorithm • Combines all three aspects: • the change front, • the re-computation front and • the restart tree. • For a given change front, it produces the re-computation front, which includes a set of restart trees; each tree refers to a single top-level execution and contains only the parts related to the change(s). • Enables ReComp to identify the minimal set of executions that may be affected by the change(s). • The remaining executions are either not affected at all or were refreshed previously.
  22. 22. Re-Computation Process • Enables difference and impact analysis of the executions on the front and their partial re-execution.
  23. 23. Difference and Impact Analysis [Figure: diagram with <<hasSubProgram>> relations.]
  24. 24. Conclusions • We address the problem of the re-computation of: • complex hierarchical processes, • run over a cohort of input data samples, • with multiple points of change, • in an open system, where users can initiate (re-)executions at any time. • The solution starts from the changes observed: • in contrast to previous work, e.g. smart re-run and workflow caching. • We proposed a simple algorithm to find the re-computation front: • written in Prolog, • very effective (response in the order of 1–100 ms), • available on GitHub. • The algorithm is the initial step in further scope identification and execution optimisation.
  25. 25. Thank you! http://www.recomp.org.uk
