Provenance Annotation and Analysis to Support Process Re-Computation

  1. 1. Provenance Annotation and Analysis to Support Process Re-Computation Jacek Cała, Paolo Missier School of Computing Newcastle University, UK
  2. 2. Problem Outline • Consider process P, e.g. the following NGS pipeline [5]: [5] Cała, J., Marei, E., Xu, Y., Takeda, K., Missier, P.: Scalable and efficient whole-exome data processing using workflows on the cloud. Future Generation Computer Systems (Jan 2016).
  3. 3. Problem Outline • P is only rarely a static entity. • Usually, a variety of elements in P change: • data dependencies, • software tools & dependencies, • [out of scope] the structure of P. • Changes in the elements of P => the need to update past P outcomes => the need for re-computation.
  4. 4. The Re-Computation Framework • To control the re-computation of processes • proposed earlier in [6]. • The core of the framework is the re-computation loop: [6] Cała, J., Missier, P.: Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study. Big Data Research (2018); in press.
  5. 5. Re-Computation Process • Here we consider a single pass of the loop: • And focus on the first step only (S1).
  6. 6. Preliminaries • The ProvONE model: prospective + retrospective provenance [7]. • Set of software and data dependencies: D = {a0, b0, …} • Process, input and execution configuration: E(P, x, V) • Version change event: C = {a_n → a_{n-1}} • Composite version change event: C = {a_n → a_{n-1}, b_m → b_{m-1}, …} • Change front. • Re-computation front. • Restart tree. [7] Cuevas-Vicenttín, V., Ludäscher, B., Missier, P., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2016).
  7. 7. Change Front • The accumulation of change events over a specified time window. [Figure: timeline t of change events, change fronts and executions] • Change events: C0 {a1 → a0}, C1 {b1 → b0}, C2 {a2 → a1, c1 → c0}, C3 {a3 → a2, c2 → c1}, C4 {d1 → d0}, C5 {b2 → b1} • Change fronts: CF3 {a3, b1, c2}, CF5 {a3, b2, c2, d1} • Executions: E(…, [a0, b0, e0]), E(…, [a0, b1, d0]), E(…, [a2, b1, c1])
  8. 8. Re-computation Front • Over time the population of executions grows. • Some of them may result from re-executions. • Some of them may be user-initiated • and may use historical versions of elements. • Looking for the transitive closure of the elements’ derivation is too broad. => Find out which of the past executions really need an update.
  9. 9. Re-computation Front We use: wasInformedBy(..., [prov:type=“recomp:re-execution”]) to denote a ReComp-initiated re-execution.
  10. 10. Re-computation Front
  11. 11. Re-computation Front … user-initiated
  12. 12. Re-computation Front
  13. 13. Restart Tree • Re-computation front handles single executions well. • What if the process is more complex than that? • pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline.
  17. 17. Restart Tree • Re-computation front handles single executions well. • What if the process is more complex than that? • pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline. => The provenance trace includes multiple interrelated executions. => During re-execution we have to combine all of them within a single context – the top-level execution.
  18. 18. Restart Tree • To build a restart tree we rely on the provone:wasPartOf statements. CF = {b2, e1}
  19. 19. Restart Tree • Captures the vertical dimension of a single execution • the transitive closure of the wasPartOf relation. RT ≝ {Execution, [DataChange], [Children]} CF = {b2, e1}
  20. 20. Restart Tree • Captures the vertical dimension of a single execution • the transitive closure of the wasPartOf relation. RT = {E0, [], [{SE0, [], [{SSE1, [⟨b2 → b0⟩], []},{SSE3, [⟨e1 → e0⟩], []}]}, {SE1, …}, …]} CF = {b2, e1}
  21. 21. The algorithm • Combines all three aspects: • the change front, • the re-computation front and • the restart tree. • For a given change front, it produces the re-computation front, which includes a set of restart trees; each restart tree refers to a single top-level execution and contains only the parts related to the change(s). • Enables ReComp to identify the minimal set of executions that may be affected by the change(s). • The remaining executions are either not affected at all or were refreshed previously.
  22. 22. Re-Computation Process • Enables difference and impact analysis of the executions on the front and their partial re-execution.
  23. 23. Difference and Impact Analysis [Figure; visible labels: <<hasSubProgram>>]
  24. 24. Conclusions • We address the problem of the re-computation of: • complex hierarchical processes, • run over a cohort of input data samples, • with multiple points of change, • in an open system, where users can initiate (re-)executions at any time. • The solution starts from the changes observed: • in contrast to previous work, e.g. smart re-run and workflow caching. • We proposed a simple algorithm to find the re-computation front: • written in Prolog, • fast (response times in the order of 1–100 ms), • available on GitHub. • The algorithm is the initial step in further scope identification and execution optimisation.
  25. 25. Thank you! http://www.recomp.org.uk
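
As a rough illustration of the restart tree on slides 18–20, the Python sketch below walks the wasPartOf statements from a top-level execution and keeps only the branches that used an outdated version of an element on the change front. It is not the authors' Prolog implementation. The function name restart_tree, the tuple layout (execution, data changes, children), the usage table and the unaffected sub-executions SSE0 and SSE2 are assumptions made for this example. The SE1 branch is left out because the slide elides it.

```python
def restart_tree(root, was_part_of, used, front):
    """Build RT = (execution, [data changes], [children]) for `root`.

    was_part_of: list of (child, parent) pairs, one per wasPartOf statement.
    used:        execution -> {dependency name: version it used} (assumed layout).
    front:       change front as {dependency name: newest version}.
    Returns None when neither `root` nor anything below it is affected.
    """
    # Data changes are recorded as (name, newest version, version used).
    changes = [
        (name, front[name], version)
        for name, version in sorted(used.get(root, {}).items())
        if name in front and version < front[name]
    ]
    children = []
    for child, parent in was_part_of:
        if parent == root:
            subtree = restart_tree(child, was_part_of, used, front)
            if subtree is not None:
                children.append(subtree)
    if not changes and not children:
        return None  # prune branches unrelated to the change front
    return (root, changes, children)


# Hypothetical trace shaped like the slide-20 example, CF = {b2, e1}.
was_part_of = [
    ("SE0", "E0"),
    ("SSE0", "SE0"),
    ("SSE1", "SE0"),
    ("SSE2", "SE0"),
    ("SSE3", "SE0"),
]
used = {
    "SSE0": {"a": 0},  # assumed: does not depend on b or e
    "SSE1": {"b": 0},
    "SSE2": {"c": 1},  # assumed: does not depend on b or e
    "SSE3": {"e": 0},
}
front = {"b": 2, "e": 1}

rt = restart_tree("E0", was_part_of, used, front)
print(rt)
```

With these inputs only SSE1 (b2 → b0) and SSE3 (e1 → e0) survive under SE0, which mirrors the SE0 part of the RT shown on slide 20.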

Editor's Notes

  • Highlight the key elements: complex hierarchical workflow, multiple patient inputs, variety of software and data dependencies.
    A single run of the pipeline may include 30–40 patient samples, and tens or hundreds of such executions are made in practice, e.g. 1511 brain tumor patients from IGM.
  • Examples of why changes in P should invoke updates of past P outcomes:
    reference genome → ???
    dbSNP → variants no longer considered as de-novo – important in the case of rare diseases
    tool versions → more accurate alignment or variant discovery.
  • Multiple past executions – scope of change is to filter irrelevant execs.
    Mention that this single pass here is slightly simplified – no cost considered.
  • notation is slightly different – for the sake of presentation

    Mention that x is explicitly indicated to denote the fact that V describes dependency changes, whereas x ∈ X is an element of a population of inputs.

    The last three are here to explain the effective algorithm. The alg. is supposed to produce the minimal set of potentially affected execs.
  • Simple change – wait until CF
    Composite change
    Only the changed artefacts
    Change front CF3
    Only the newest references – transitive closure of wDF on the set of all derivations of artefact D.v.
    The open system – users can introduce execs with non-recent versions.
    More changes
    Change front CF5
    Includes all change artefacts – keeps the system open.
    Prioritisation within impact analysis may mean that not all past execs are recomputed.
    Windowing, windowing policy


    Explain that P may depend on many other components but C and CF include only the changed ones.
    Also, mention that CF includes the references to only the newest version of the dependencies and includes all of them –> it is important because various executions may depend on earlier versions, not necessarily the immediate predecessor. Also, the user can introduce executions which depend on versions well before the last CF.
    Mention that the windowing policy may vary widely, e.g. fixed window size or adaptive window based on some measure of change event significance.
  • Isn’t it as simple as filtering out execs which are wasInformedBy
  • Grey arrows indicate the data flow: solid lines – the usage of data, dotted lines – the communication / generation-usage pattern
    Black dashed arrows reflect the structure of the process.
  • Mention that for the sake of simplicity we are not showing the port-data usage, which is also needed.
  • How is it different from workflow caching?
  • Simple change
    Composite change
    Only the changed artefacts
    Change front CF0
    a point in time when ReComp initiates the re-computation loop
    Only the newest references – transitive closure of wDF on the set of all derivations of artefact D.v.
    The open system – users can introduce execs with non-recent versions.
    More changes
    Change front CF1
    Includes all change artefacts – keeps the system open.
    Prioritisation within impact analysis may mean that not all past execs are recomputed.
    Windowing, windowing policy


    Explain that P may depend on many other components but C and CF include only the changed ones.
    Also, mention that CF includes the references to only the newest version of the dependencies and includes all of them –> it is important because various executions may depend on earlier versions, not necessarily the immediate predecessor. Also, the user can introduce executions which depend on versions well before the last CF.
    Mention that the windowing policy may vary widely, e.g. fixed window size or adaptive window based on some measure of change event significance.
  • The last Unless there is the top-level coordination process specified already.
  • A few simplifications:
    All execs depend on all shown artefacts {a & b}.
    There may be a whole set of execs not on the front, which depend on other elements/artefacts.

    Re-execution of E0 but not E1 (as explained in the paper) may be due to the scope of changes in ‘a’ by which E1 may not be affected by a2–>a1 and so not refreshed.
    Execs on the front are those that are not an informant of any other exec, i.e. no later exec wasInformedBy them.
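
To make the notes on the change front and the re-computation front more concrete, here is a minimal Python sketch that replays the change events from slide 7 and then filters past executions. It is only an illustration under stated assumptions, not the authors' Prolog algorithm: the names change_front and recomputation_front, the (name, new version, old version) event layout, the cumulative window and the re-execution E3 are all hypothetical. An execution is placed on the front if it used a version older than the change front and is not the informant of a ReComp re-execution.

```python
def change_front(events):
    """Accumulate change events into {dependency: newest version}.

    Each event is a list of (name, new_version, old_version) triples.
    Assumption: the window spans all events passed in.
    """
    front = {}
    for event in events:
        for name, new, _old in event:
            front[name] = max(new, front.get(name, new))
    return front


def recomputation_front(executions, re_executions, front):
    """Select executions that may need a refresh for the given change front.

    executions:    execution id -> {dependency: version used}.
    re_executions: (informed, informant) pairs, assumed to correspond to
                   wasInformedBy statements typed recomp:re-execution.
    """
    superseded = {informant for _informed, informant in re_executions}
    return {
        exec_id
        for exec_id, used in executions.items()
        if exec_id not in superseded
        and any(n in front and v < front[n] for n, v in used.items())
    }


# Change events C0..C5 as listed on slide 7.
C = [
    [("a", 1, 0)],
    [("b", 1, 0)],
    [("a", 2, 1), ("c", 1, 0)],
    [("a", 3, 2), ("c", 2, 1)],
    [("d", 1, 0)],
    [("b", 2, 1)],
]
cf3 = change_front(C[:4])  # {'a': 3, 'b': 1, 'c': 2}
cf5 = change_front(C)      # {'a': 3, 'b': 2, 'c': 2, 'd': 1}

# The three executions from slide 7 plus a hypothetical re-execution E3 of E0.
executions = {
    "E0": {"a": 0, "b": 0, "e": 0},
    "E1": {"a": 0, "b": 1, "d": 0},
    "E2": {"a": 2, "b": 1, "c": 1},
    "E3": {"a": 3, "b": 1, "e": 0},  # assumed refresh of E0 at CF3
}
re_executions = {("E3", "E0")}

front3 = recomputation_front(executions, re_executions, cf3)
front5 = recomputation_front(executions, re_executions, cf5)
print(sorted(front3), sorted(front5))
```

In this toy setting E0 never appears on the front because E3 supersedes it, while E3 itself joins the front at CF5 once b2 arrives. Real traces would also need the scope-of-change check mentioned in the notes, which this sketch ignores.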