provenance of lists - TAPP'11 Mini-tutorial

short invited mini-tutorial on workflow-based provenance and principled extensions to the Open Provenance Model,
TAPP'11 workshop, June 2011

  • provenance of lists - TAPP'11 Mini-tutorial

    1. Provenance of workflow data products
       Paolo Missier, School of Computing, Newcastle University, UK
       TAPP’11 workshop, Heraklion, Crete, June 20-21, 2011
    2-4. Workflow provenance
       Taverna type system:
       - strings + nested lists
       - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]
       Dataflow model:
       - data-driven execution
       - services activate when input is ready
       Raw provenance: a detailed trace of workflow execution
       - tasks performed, data transformations
       - inputs used, outputs produced
       [Figure: example workflow; node and port labels: lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes]
       Linköping, Sweden -- January 2010
    5-9. Implicit iteration in Taverna
       [Figure: processor P with input ports X1, X2, X3, annotated (0,1), (1,1), (0,1), and output port Y]
       Inputs: a = [a1 ... an], b = [b1 ... bm], c = [c1 ... ck]
       Output: y = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
       How y is computed at P:
         let I  = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ]    // cross product
             I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ]         // same product but with c interleaved
         y = (map (map P) I’)
           = [ (map P [ <a1, c, b1> ... <a1, c, bm> ]), ..., (map P [ <an, c, b1> ... <an, c, bm> ]) ]
           = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
       Bottom line: yij depends only on values ai, c, bj
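The computation of y at P can be sketched in Python. This is only an illustration of the formula on the slide, not Taverna's implementation: `implicit_iteration` and `demo_p` are hypothetical names, and P is assumed to be an ordinary three-argument function.

```python
def implicit_iteration(p, a, b, c):
    """Apply p over the cross product of a and b, passing c whole each time."""
    # I' = [ [ <ai, c, bj> | bj in b ] | ai in a ]: outer list over a, inner over b
    tuples = [[(ai, c, bj) for bj in b] for ai in a]
    # y = map (map P) I'
    return [[p(*t) for t in row] for row in tuples]


def demo_p(x1, x2, x3):
    # Hypothetical stand-in for P: x1 and x3 are single items, x2 is the whole list c
    return f"{x1}{x3}/{len(x2)}"


y = implicit_iteration(demo_p, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
for row in y:
    print(row)
```

With n = 2 and m = 3 the result has two rows of three values each, and each y[i][j] is computed from a[i], c and b[j] only.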
    10. Fine-grained (precise?) provenance
       [Figure: nested lists whose elements are addressed by index paths: [] for the whole list, [0], [1], [2] at depth 1, and [0,0], [0,1], [1,0], [1,1], [2,1], [2,2] at depth 2]
    11. Which provenance model/language?
       • Let’s look at the Open Provenance Model as a starting point
       Core OPM:
       • agnostic wrt Artifact, Processor types
       • roles: annotations on binary relations
       • extensions by subclassing
         • node types
         • relation types
       Formal (temporal) semantics hopefully available soon
       [Background: excerpt from the OPM specification, partly legible]
       The Open Provenance Model aims to capture the causal dependencies between the artifacts, processes, and agents. Therefore, a provenance graph is defined as a directed graph, whose nodes are artifacts, processes and agents, and whose edges belong to one of the following categories depicted in Figure 1. An edge represents a causal dependency, between its source, denoting the effect, and its destination, denoting the cause.
       The first two edges express that a process used an artifact and that an artifact was generated by a process. Since a process may have used several artifacts, it is important to identify the roles under which these artifacts were used. (Roles are denoted by letter ‘R’ in Figure 1.) Likewise, [...]
       [...] make this dependency explicit, it is required to assert that artifact A2 was derived from another artifact A1. This edge gives us a dataflow oriented view of provenance. It is also recognized that we may not be aware of the exact artifact that process P2 used, but that there was some artifact generated by another process P1. Process P2 is then said to have been triggered by P1. In contrast to edge was derived from, a was triggered by edge allows for a process oriented view of past executions to be adopted. (Since these edges summarize some activities for which all details are not being exposed, it was felt that it was not necessary to associate a role with them.)
       As far as conventions are concerned, we note that causality edges use past tense to indicate that they refer to past execution. Causal relationships are defined as follows.
       Definition 4 (Causal Relationship). A causal relationship is represented by an arc and denotes the presence of a causal dependency between the source of the arc (the effect) and the destination of the arc (the cause). Five causal relationships are recognized: a process used an artifact, an artifact was generated by a process, a process was triggered by a process, an artifact was derived from an artifact, and a process was controlled by an agent.
       By means of annotations (see Section 8), we allow edges to be further subtyped from these five categories.
       Figure 1: Edges in the Open Provenance Model: sources are effects, destinations causes
       Multiple notions of causal dependencies were considered for OPM. A very strong notion of causal dependency would express that a set of entities was necessary and sufficient to explain the existence of another entity. It was felt that such a notion was not practical, since, with an open world assumption, one could always argue that additional factors may have influenced an outcome (e.g. electricity was used, temperature range allowed computer to work, [...]
    12. OPM relations in the workflow context
       [Figure: single user’s view (Alice), in three layers]
       - Process space: workflow specification (WA), with processors A and B connected through in/out ports to X, Y and Z
       - Data space: data binding & enactment, data storage and binding (SA): X / [x1, x2], Z / [z1, z2]
       - Provenance space: execution & trace capture; provenance trace (TA) with read and write edges over x1, a1, y1, b1, z1 and x2, a2, y2, b2, z2, plus idep and ddep edges
       Provenance queries: ans(X) :- ddep*(X, z1).
       Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
    13. Data and invocation dependencies
       - read, write are natural observables for a workflow run
       - possible additional relations (recorded or inferred):
         • invocation dependencies: explicit, or via: a2 depends on a1 because a1 has written data d and a2 has read d
         • data dependencies: explicit, or via: d2 depends on d1 because some actor invocation a read d1 prior to writing d2
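The two inference patterns on this slide can be sketched over a time-ordered list of read/write observations. This is an illustration under stated assumptions: the trace format, the function name `infer_dependencies`, and the (source, dependent) ordering of each pair are choices made here, not something the slides specify.

```python
def infer_dependencies(trace):
    """Derive data (ddep) and invocation (idep) dependencies from read/write events.

    trace: time-ordered list of (invocation, op, data) with op in {"read", "write"}.
    Pairs are written (source, dependent); this ordering is an assumption.
    """
    ddep, idep = set(), set()
    reads = {}    # invocation -> data items it has read so far
    writer = {}   # data item -> invocation that wrote it
    for inv, op, data in trace:
        if op == "read":
            reads.setdefault(inv, []).append(data)
            # a2 depends on a1 because a1 has written d and a2 has read d
            if data in writer and writer[data] != inv:
                idep.add((writer[data], inv))
        elif op == "write":
            writer[data] = inv
            # d2 depends on d1 because inv read d1 prior to writing d2
            for d1 in reads.get(inv, []):
                ddep.add((d1, data))
        else:
            raise ValueError(f"unknown operation: {op}")
    return ddep, idep


trace = [
    ("a1", "read", "x1"), ("a1", "write", "y1"),
    ("b1", "read", "y1"), ("b1", "write", "z1"),
    ("a2", "read", "x2"), ("a2", "write", "y2"),
    ("b2", "read", "y2"), ("b2", "write", "z2"),
]
ddep, idep = infer_dependencies(trace)
print(sorted(ddep))
print(sorted(idep))
```

The sample trace reuses the names x1, a1, y1, b1, z1 from the trace figure on slide 12; the rule treats every earlier read by an invocation as a cause of its later writes, which is the coarsest reading of "read d1 prior to writing d2".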
    14. Provenance queries
       • Closure queries: operate on the transitive closure ddep* over ddep
       But also:
       - queries on the workflow structure
       - queries on the data structures (e.g. collections)
       and importantly: use workflow graphs to justify/explain the provenance graph for one workflow run:
         TA is a trace instance of WA: h: TA ➔ WA is a homomorphism
         h(x1 ➔ a1) = h(x2 ➔ a2) = X ➔ A,  h(a1 ➔ y1) = h(a2 ➔ y2) = A ➔ Y ...
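A closure query such as ans(X) :- ddep*(X, z1) and the homomorphism condition can both be sketched in a few lines of Python. The function names, the edge sets, and the reading of ddep(X, Y) as "Y depends on X" are assumptions made for this sketch.

```python
from collections import defaultdict, deque


def closure_query(ddep, target):
    """Return every X with ddep*(X, target), where ddep* is the transitive closure."""
    preds = defaultdict(set)
    for src, dst in ddep:
        preds[dst].add(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for p in preds[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen


def is_homomorphism(h, trace_edges, workflow_edges):
    """Check that h maps every trace edge onto an edge of the workflow graph."""
    return all((h[u], h[v]) in workflow_edges for u, v in trace_edges)


ddep = {("x1", "y1"), ("y1", "z1"), ("x2", "y2"), ("y2", "z2")}
answer = closure_query(ddep, "z1")

workflow_edges = {("X", "A"), ("A", "Y"), ("Y", "B"), ("B", "Z")}
trace_edges = {("x1", "a1"), ("a1", "y1"), ("y1", "b1"), ("b1", "z1"),
               ("x2", "a2"), ("a2", "y2"), ("y2", "b2"), ("b2", "z2")}
h = {"x1": "X", "x2": "X", "a1": "A", "a2": "A", "y1": "Y", "y2": "Y",
     "b1": "B", "b2": "B", "z1": "Z", "z2": "Z"}
ok = is_homomorphism(h, trace_edges, workflow_edges)
print(sorted(answer), ok)
```

The breadth-first search visits each dependency edge at most once, so the query does not need to materialise the full closure relation.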
    15-18. OPM extensions, principled
       core OPM / PIL(*):
       - used
       - wasGeneratedBy
       - wasDerivedFrom
       Processor types: operators on lists
       - create, insert, delete, select...
       - map, fold
       Data types: nested ordered lists
       (Additional context): who, when, where ...
       Open questions: Graph models? Graph matching queries? Relations (sets of tuples)? Relational queries?
       (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
    19. Proposed extensions

       Table 2: New roles for Used and Contained relations

       Role      | Used in relation | Context of use
       element   | Contained        | L Contained(element) x
       list      | Used             | P Used(list) L
       position  | Used             | P Used(position) p
       term      | Used             | list comprehensions, see Sec. 4.2
       generator | Used             | list comprehensions, see Sec. 4.2
       filter    | Used             | list comprehensions, see Sec. 4.2
       function  | Used             | map, see Sec. 4.3
       operand   | Used             | map, see Sec. 4.3

       Table 3: Specialised OPM dependency relations

       Causal relation                          | Example                             | Meaning
       Contained(R) ⊆ [τ] × A × [Int]           | L Contained([i1 ... in]) x          | x was inserted into L at position [i1 ... in]
       wasSelectedFrom(R) ⊆ [τ] × [τ] × [Int]   | L’ wasSelectedFrom([i1 ... in]) L   | L’ at position [i1 ... in] was selected from L
       wasRemovedFrom(R) ⊆ [τ] × [τ] × [Int]    | L’ wasRemovedFrom([i1 ... in]) L    | L’ at position [i1 ... in] was deleted from L
       wasSameAs ⊆ A × A                        | L wasSameAs x                       | inferred (various contexts)
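Positions in these relations are index paths [i1 ... in] into a nested list. A minimal sketch of how such paths could be enumerated follows; the function name `contained_facts` and the (path, element) tuple representation are hypothetical.

```python
def contained_facts(value, path=()):
    """List (index path, element) pairs for every element of a nested list."""
    facts = []
    if isinstance(value, list):
        for i, item in enumerate(value):
            child = path + (i,)
            facts.append((child, item))                   # element found at this path
            facts.extend(contained_facts(item, child))    # descend into sublists
    return facts


nested = [["cat", "dog"], ["large", "small"]]
facts = contained_facts(nested)
for path, element in facts:
    print(list(path), element)
```

Each pair corresponds to one "L Contained([i1 ... in]) x" assertion about the top-level list; strings are treated as atomic values, matching the strings + nested lists type system described earlier.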
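    20. OPM fragments for elementary operations
       [Figure: one OPM graph template per operation; edges recoverable from the labels]
       - empty list: the new list L wasGeneratedBy the creating process
       - unit: the new list L wasGeneratedBy P:unit; P:unit used(element) x; L contained x
       - insertion: the new list wasGeneratedBy P:ins; P:ins used(list) L, used(element) x, used(position) p; the new list wasDerivedFrom L and contained x
       - selection: the result wasGeneratedBy P:sel; P:sel used(list) L, used(position) p; the result wasSelectedFrom L
       - deletion: the new list wasGeneratedBy P:del; P:del used(list) L, used(position) p; the new list wasRemovedFrom L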
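The templates for insertion and selection can be written down as small edge lists. This is a sketch with an assumed representation: each edge is a hypothetical (relation, effect, cause, annotation) tuple, and the function names are illustrative.

```python
def ins_fragment(proc, out_list, in_list, element, position):
    """Edges for one insertion: out_list is in_list with element inserted at position."""
    return [
        ("wasGeneratedBy", out_list, proc, None),
        ("used", proc, in_list, "list"),
        ("used", proc, element, "element"),
        ("used", proc, position, "position"),
        ("wasDerivedFrom", out_list, in_list, None),
        ("contained", out_list, element, position),
    ]


def sel_fragment(proc, result, in_list, position):
    """Edges for one selection: result is the item of in_list at position."""
    return [
        ("wasGeneratedBy", result, proc, None),
        ("used", proc, in_list, "list"),
        ("used", proc, position, "position"),
        ("wasSelectedFrom", result, in_list, position),
    ]


# Composing operators concatenates their fragments; the shared node L1 links them.
graph = ins_fragment("P1", "L1", "L0", "x", "p") + sel_fragment("P2", "y", "L1", "p")
for edge in graph:
    print(edge)
```

Every edge runs from effect to cause, following the OPM convention that sources are effects and destinations are causes.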
    21-22. Operator composition → graph composition
       • Equalities on operators may translate into inferences on the graphs
       [Excerpt from the paper, partly legible] [...] templates for the two operations. Additionally, however, the following equality holds whenever insertion is followed by selection, if no other intervening operation changes the state of the list:
         sel (ins x L p) p = x
       [...] translates into the following OPM inference rule:
         sameAs(L1, X) :- pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L, P1),
                          pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1, P2).
       [Figure: the ins and sel templates composed. P:ins used(list) the input list, used(element) x and used(position) p, and generated a list that wasDerivedFrom the input and contained x. P:sel used(list) that list and used(position) p; its result wasSelectedFrom the list and is inferred wasSameAs x.]
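The inference rule can be evaluated as a join over sets of facts. The sketch below is one possible reading, not the author's implementation: the fact encodings and the name `infer_same_as` are assumptions, the inferred pair links the selected value to the inserted element x as the equality sel (ins x L p) p = x requires, and the side condition that no intervening operation changed the list is not checked.

```python
def infer_same_as(ptype, used, wgby):
    """Infer wasSameAs pairs for an insertion followed by a selection at the same position.

    ptype: set of (process, operator); used: set of (process, artifact, role);
    wgby: set of (artifact, process), read as "artifact wasGeneratedBy process".
    """
    inferred = set()
    for p1, op1 in ptype:
        if op1 != "ins":
            continue
        elements = {x for (p, x, role) in used if p == p1 and role == "element"}
        positions = {x for (p, x, role) in used if p == p1 and role == "position"}
        produced = {lst for (lst, p) in wgby if p == p1}
        for p2, op2 in ptype:
            if op2 != "sel":
                continue
            # p2 must select from a list produced by p1, at the same position node
            reads_list = any((p2, lst, "list") in used for lst in produced)
            same_pos = any((p2, pos, "position") in used for pos in positions)
            if reads_list and same_pos:
                for result, p in wgby:
                    if p == p2:
                        inferred.update((result, x) for x in elements)
    return inferred


ptype = {("P1", "ins"), ("P2", "sel")}
used = {("P1", "L0", "list"), ("P1", "x", "element"), ("P1", "p", "position"),
        ("P2", "L1", "list"), ("P2", "p", "position")}
wgby = {("L1", "P1"), ("y", "P2")}
same_as = infer_same_as(ptype, used, wgby)
print(same_as)
```

Positions are compared by node identity here, so two processes match only when they used the same position artifact, as in the Datalog rule where both atoms share the variable Pos.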
    23. Approach applies to map, fold (reduce)...
       [Excerpt from the paper, partly legible] [...] and p a position in the list L’ = map f L. The following equality follows from the definition of map:
         sel (map f x) p = f (sel x p)    (10)
       This equality is captured by the inferred dependency (y wasSameAs y’) in the enhanced OPM template of Fig. 7 (inference rule omitted for simplicity).
       [Figure: two OPM templates for map followed by selection]
       - basic: y wasGeneratedBy Q2:sel; Q2:sel used(list) L’ and used(position) p; L’ wasGeneratedBy P:map; P:map used(list) L and used(function) f; y wasSelectedFrom L’
       - enhanced: additionally, y’ wasGeneratedBy R:apply; R:apply used(operand) x and used(function) f; x wasGeneratedBy Q1:sel; Q1:sel used(list) L and used(position) p; y wasSameAs y’
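Equality (10) is easy to check on flat lists. In this sketch `sel` is a hypothetical helper that selects by integer index, and `check_map_sel` is an illustrative name; nested positions are not covered.

```python
def sel(lst, p):
    """Select the element of lst at position p."""
    return lst[p]


def check_map_sel(f, lst):
    """Check sel (map f lst) p == f (sel lst p) for every position p of lst."""
    mapped = list(map(f, lst))
    return all(sel(mapped, p) == f(sel(lst, p)) for p in range(len(lst)))


holds = check_map_sel(str.upper, ["cat", "dog", "large"])
print(holds)
```

The left-hand side corresponds to the path through P:map and Q2:sel in the template, and the right-hand side to the path through Q1:sel and R:apply, which is why their outputs can be linked by wasSameAs.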
    24. Take-home message
       • OPM (PIL) is a candidate starting point for workflow-based provenance
       • extension mechanisms are provided, but they must be used sensibly
         – data types
         – processor types
       • Provenance of nested ordered lists used as a prototypical example
         – semantics of provenance graphs and graph composition grounded in the semantics of lists
       • Can this approach be useful for other interesting data types?
         – sets of tuples / relational algebra
