Fine-grained and efficient lineage querying   of collection-based workflow provenance    Paolo Missier, Norman Paton, Khalid...
Problem statement and outline• Setting:  – Black box provenance of workflow data products• Fine-grained provenance:  – tra...
Computing provenance through a graph• Provenance graph is an unfolding of the workflow graph  structure   – large: grows w...
Main result             Workflow graph                           Provenance graph      X                      X           ...
Main result               Workflow graph                         Provenance graph             [1]                  []     ...
Main result               Workflow graph                         Provenance graph             [1]                  []     ...
Workflow as data integrator            EDBT, Lausanne, March 2010
Workflow as data integrator  QTLgenomicregions  genes in QTLmetabolicpathways (KEGG)                        EDBT, Lausanne...
Workflow as data integrator  QTLgenomicregions  genes in QTLmetabolicpathways (KEGG)                        EDBT, Lausanne...
Motivation for fine-grained provenance                                List-structured                                KEGG ...
Motivation for fine-grained provenance                                List-structured                                KEGG ...
Motivation for fine-grained provenance                                List-structured                                KEGG ...
Motivation for fine-grained provenance                                List-structured                                KEGG ...
Motivation for fine-grained provenance                                List-structured                                KEGG ...
Problem statement and outline• Setting:  – Black box provenance of workflow data products• Fine-grained provenance:  – tra...
Functional model for collection processing /1Simple processing:service expects atomic values,receives atomic values    v1 ...
Functional model for collection processing /1Simple processing:                       Simple iteration:service expects ato...
Functional model for collection processing /1Simple processing:                       Simple iteration:service expects ato...
Functional model for collection processing /1Simple processing:                           Simple iteration:service expects...
Functional model for collection processing /1  Simple processing:                             Simple iteration:  service e...
Functional model /2The simple iteration model                v = [[...], ...[...]]generalises by induction to a           ...
Functional model /2The simple iteration model                             v = [[...], ...[...]]generalises by induction to...
Functional model - multiple inputs /3                                   v2 = [v21 ... v2k]v1 = [v11 ... vin]     X1      X...
Functional model - multiple inputs /3                                   v2 = [v21 ... v2k]v1 = [v11 ... vin]     X1      X...
Functional model - multiple inputs /3                                   v2 = [v21 ... v2k]v1 = [v11 ... vin]     X1      X...
Generalised cross productBinary product, δ = 1:   a × b = [[ ai , bj ]|bj ← b]|ai ← a]                         (eval2 P a,...
Generalised cross productBinary product, δ = 1:             a × b = [[ ai , bj ]|bj ← b]|ai ← a]                          ...
Generalised cross productBinary product, δ = 1:               a × b = [[ ai , bj ]|bj ← b]|ai ← a]                        ...
Static mapping of output to input values          The iteration structure can be determined statically                    ...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Static mapping of output to input values          The iteration structure can be determined statically     (0,1)       (1,...
Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large...
Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large...
Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large...
From granularity to efficient query processingSummary so far:• whenever iterations are involved, we can trace theprovenanc...
Lineage query model /1I - Focusing:Not all processors areinteresting:– report lineage only at  specified nodes in the  gra...
Lineage query model /2                                List-structured                                KEGG gene ids:       ...
Lineage query model and language                                  workflow scope             optionally specifies one or m...
Fine granularity + efficient processing• Scalability:  – query time depends on size of workflow graph, not size   of prove...
Outline• Assumption:  – Black box provenance of workflow data products✓Fine-grained provenance:  – tracking provenance thr...
Experimental setup - I• Performance evaluation performed on programmatically  generated dataflows– the “T-towers”parameter...
125                                                                                                      time (           ...
Summary• A simple lineage query model for Taverna  –grounded in the semantics of collection-oriented   processing  –combin...
Upcoming SlideShare
Loading in …5
×

Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance.

511 views

Published on

Missier, P., Paton, N., & Belhajjame, K. (2010). Fine-grained and efficient lineage querying of collection-based workflow provenance. Procs. EDBT. Lausanne, Switzerland.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
511
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
9
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance.

  1. 1. Fine-grained and efficient lineage querying of collection-based workflow provenance Paolo Missier, Norman Paton, Khalid Belhajjame Information Management GroupSchool of Computer Science, University of Manchester, UK EDBT Conference Lausanne, Switzerland, March 2010
  2. 2. Problem statement and outline• Setting: – Black box provenance of workflow data products• Fine-grained provenance: – tracking provenance through collections: motivation – functional model of collection-oriented workflow processing• Efficient query processing: – leveraging the functional model to achieve efficient processing for a simple query model• Experimental evaluation EDBT, Lausanne, March 2010
  3. 3. Computing provenance through a graph• Provenance graph is an unfolding of the workflow graph structure – large: grows with size of input – lineage queries involve graph traversalFrom:Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific 3Workflows," Procs. ICDE, 2009. EDBT, Lausanne, March 2010
  4. 4. Main result Workflow graph Provenance graph X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm X1 X2 X3 P y11 ... ymn Y y = [ [y11 ... y1n], ... [ym1 ... ymn] ]• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph • potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries EDBT, Lausanne, March 2010
  5. 5. Main result Workflow graph Provenance graph [1] [] X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm [] X1 X2 X3 [n] [1] P y11 ... ymn Y y = [ [y11 ... y1n], ... [ym1 ... ymn] ]• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph • potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries EDBT, Lausanne, March 2010
  6. 6. Main result Workflow graph Provenance graph [1] [] X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm [] X1 X2 X3 [n] [1] P y11 ... ymn Y y = [ [y11 ... y1n], ... [ym1 ... ymn] ]• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph • potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries EDBT, Lausanne, March 2010
  7. 7. Workflow as data integrator EDBT, Lausanne, March 2010
  8. 8. Workflow as data integrator QTLgenomicregions genes in QTLmetabolicpathways (KEGG) EDBT, Lausanne, March 2010
  9. 9. Workflow as data integrator QTLgenomicregions genes in QTLmetabolicpathways (KEGG) EDBT, Lausanne, March 2010
  10. 10. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  11. 11. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] geneIDs pathways • • • • • • • • • • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  12. 12. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] geneIDs pathways • • • • • • • • • • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  13. 13. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] geneIDs pathways • • • • • • • • • • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  14. 14. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] geneIDs pathways • • • • • • • • • • [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  15. 15. Problem statement and outline• Setting: – Black box provenance of workflow data products• Fine-grained provenance: – tracking provenance through collection elements – motivation ➡ functional model of collection-oriented workflow processing EDBT, Lausanne, March 2010
  16. 16. Functional model for collection processing /1Simple processing:service expects atomic values,receives atomic values v1 v2 v3 X1 X2 X3 P Y1 Y2 w1 w2 EDBT, Lausanne, March 2010
  17. 17. Functional model for collection processing /1Simple processing: Simple iteration:service expects atomic values, service expects atomic values,receives atomic values receives input list v = [v1 ... vn] v1 v2 v3 v = [v1 ... vn] v1 vn X1 X2 X3 X X X Y1 P Y2 Y P ➠ P1 Y ... Pn Y w1 w2 w = [w1 ... wn] w1 wn w = [w1 ... wn] EDBT, Lausanne, March 2010
  18. 18. Functional model for collection processing /1Simple processing: Simple iteration:service expects atomic values, service expects atomic values,receives atomic values receives input list v = [v1 ... vn] v1 v2 v3 v = [v1 ... vn] v1 vn X1 X2 X3 X X X Y1 P Y2 Y P ➠ P1 Y ... Pn Y w1 w2 w = [w1 ... wn] w1 wn lineage(wi) = vi w = [w1 ... wn] EDBT, Lausanne, March 2010
  19. 19. Functional model for collection processing /1Simple processing: Simple iteration:service expects atomic values, service expects atomic values,receives atomic values receives input list v = [v1 ... vn] v1 v2 v3 v = [v1 ... vn] v1 vn X1 ad = 1 X2 X3 X X X Y1 P Y2 dd = 0 Y P ➠ P1 Y ... Pn Y δ=1 w1 w2 w = [w1 ... wn] w1 wn lineage(wi) = vi w = [w1 ... wn] EDBT, Lausanne, March 2010
  20. 20. Functional model for collection processing /1 Simple processing: Simple iteration: service expects atomic values, service expects atomic values, receives atomic values receives input list v = [v1 ... vn] v1 v2 v3 v = [v1 ... vn] v1 vn X1 ad = 1 X2 X3 X X X Y1 P Y2 dd = 0 P Y ➠ P1 Y ... Pn Y δ=1 w1 w2 w = [w1 ... wn] w1 wn lineage(wi) = vi w = [w1 ... wn] v = [[...], ...[...]]Extension:service expects atomic ad =2 Xvalues,receives input nested list dd = 0 P ➠ lineage(wii) = vij Y δ=2 w = [[..] ...[...]] EDBT, Lausanne, March 2010
  21. 21. Functional model /2The simple iteration model v = [[...], ...[...]]generalises by induction to a ad =ngeneric δ=n-m X dd = m P Y δ = n-m ≥ 0 w = [[..] ...[...]] - depth = n-m EDBT, Lausanne, March 2010
  22. 22. Functional model /2The simple iteration model v = [[...], ...[...]]generalises by induction to a ad =ngeneric δ=n-m X dd = m P Y δ = n-m ≥ 0 w = [[..] ...[...]] - depth = n-m This leads to a recursive functional formulation for simple collection processing: v = a1 . . . an (P v) if l = 0 (evall P v) = (map (evall−1 P ) v) if l > 0 EDBT, Lausanne, March 2010
  23. 23. Functional model - multiple inputs /3 v2 = [v21 ... v2k]v1 = [v11 ... vin] X1 X2 X3 v3 = [v31 ... v3m] P dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 Y dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 w = [ [w11 ... w1n], ... [wm1 ...wmn] ] EDBT, Lausanne, March 2010
  24. 24. Functional model - multiple inputs /3 v2 = [v21 ... v2k]v1 = [v11 ... vin] X1 X2 X3 v3 = [v31 ... v3m] P dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 Y dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 w = [ [w11 ... w1n], ... [wm1 ...wmn] ] Cross-product involving v1 and v2 (but not v3): v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ] EDBT, Lausanne, March 2010
  25. 25. Functional model - multiple inputs /3 v2 = [v21 ... v2k]v1 = [v11 ... vin] X1 X2 X3 v3 = [v31 ... v3m] P dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 Y dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 w = [ [w11 ... w1n], ... [wm1 ...wmn] ] lineage(wii) = < v1i, v2, v3j> Cross-product involving v1 and v2 (but not v3): v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ] EDBT, Lausanne, March 2010
  26. 26. Generalised cross productBinary product, δ = 1: a × b = [[ ai , bj ]|bj ← b]|ai ← a] (eval2 P a, b ) = (map (eval1 P ) a × b) EDBT, Lausanne, March 2010
  27. 27. Generalised cross productBinary product, δ = 1: a × b = [[ ai , bj ]|bj ← b]|ai ← a] (eval2 P a, b ) = (map (eval1 P ) a × b)Generalized to arbitrary depths:  [[(vi , wj )|wj ← w]|vi ← v] if d1 > 0, d2  >0  [(v , w)|v ← v] i i if d1 > 0, d2 =0(v, d1 ) ⊗ (w, d2 ) = [(v, wj )|wj ← w]  if d1 = 0, d2 >0   (v, w) if d1 = 0, d2 =0...and to n operands: ⊗i:1...n (vi , di ) EDBT, Lausanne, March 2010
  28. 28. Generalised cross productBinary product, δ = 1: a × b = [[ ai , bj ]|bj ← b]|ai ← a] (eval2 P a, b ) = (map (eval1 P ) a × b)Generalized to arbitrary depths:  [[(vi , wj )|wj ← w]|vi ← v] if d1 > 0, d2  >0  [(v , w)|v ← v] i i if d1 > 0, d2 =0(v, d1 ) ⊗ (w, d2 ) = [(v, wj )|wj ← w]  if d1 = 0, d2 >0   (v, w) if d1 = 0, d2 =0...and to n operands: ⊗i:1...n (vi , di )Finally: general functional semantics for collection-based processing (evall P (v1 , d1 ), . . . , (vn , dn ) ) (P v1 , . . . , vn ) if l = 0 = (map (evall−1 P ) ⊗i:1...n vi , di ) if l > 0 EDBT, Lausanne, March 2010
  29. 29. Static mapping of output to input values The iteration structure can be determined statically dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j][i1 . i2 . ... . ik] = EDBT, Lausanne, March 2010
  30. 30. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j][i1 . i2 . ... . ik] = EDBT, Lausanne, March 2010
  31. 31. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j] δ1[i1 . i2 . ... . ik] = EDBT, Lausanne, March 2010
  32. 32. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j][i1 . i2 . ... . ik] = X1 EDBT, Lausanne, March 2010
  33. 33. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j] δ2[i1 . i2 . ... . ik] = X1 EDBT, Lausanne, March 2010
  34. 34. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j][i1 . i2 . ... . ik] = X1 X2 EDBT, Lausanne, March 2010
  35. 35. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j] δk[i1 . i2 . ... . ik] = X1 X2 EDBT, Lausanne, March 2010
  36. 36. Static mapping of output to input values The iteration structure can be determined statically (0,1) (1,1) (0,1) dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1 X1 X2 X3 P dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0 Y dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1 (0,2) this leads to a simple mapping rule: index of an output list value ➔ {index of input values} Y[i.j] → X1[i], X2[], X3[j][i1 . i2 . ... . ik] = X1 X2 Xk EDBT, Lausanne, March 2010
  37. 37. Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end Workflow graph Provenance graph X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm X1 X2 X3 P y11 ... ymn Y y = [ [y11 ... y1n], ... EDBT, Lausanne, March 2010 [ym1 ... ymn] ]
  38. 38. Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end Workflow graph Provenance graph [1] [] X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm [] X1 X2 X3 [n] [1] P y11 ... ymn Y y = [ [y11 ... y1n], ... EDBT, Lausanne, March 2010 [ym1 ... ymn] ]
  39. 39. Workflow graph as index for query processing1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end Workflow graph Provenance graph [1] [] X X v1 ... vn w Q R Y Y a1 ... an b1 ... bm [] X1 X2 X3 [n] [1] P y11 ... ymn Y y = [ [y11 ... y1n], ... EDBT, Lausanne, March 2010 [ym1 ... ymn] ]
  40. 40. From granularity to efficient query processingSummary so far:• whenever iterations are involved, we can trace theprovenance of individual elements of a processor’soutput• iterations are explained in terms of a functionalmodel and based on list depth discrepancies• The relationships between output and input indexesare derived using the workflow specification graph(statically)How about expressivity and efficient processing oflineage queries? EDBT, Lausanne, March 2010
  41. 41. Lineage query model /1I - Focusing:Not all processors areinteresting:– report lineage only at specified nodes in the graph EDBT, Lausanne, March 2010
  42. 42. Lineage query model /2 List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] II - Granularity: Trace lineage for individual elements within collections - when possible! [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ][ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ] EDBT, Lausanne, March 2010
  43. 43. Lineage query model and language workflow scope optionally specifies one or more defaults to latest run runs for the target workflow <pquery> <scope workflow="keggPathways"> <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/> </scope> <select> <outputPort name="paths_per_gene" index="[1,2]"/> <outputPort name="paths_per_gene" index="[3,4]"/> <outputPort name="commonPathways" index="[1]"/> <processor name="getPathwayDescriptions"> <outputPort name="return"/> </processor> </select> <focus> <processor name="get_pathway_by_genes" /> </focus> </pquery>port values for which lineage is sought: processors where lineage is to be reportedglobal outputs or processor-qualified - possibly workflow-qualified EDBT, Lausanne, March 2010
  44. 44. Fine granularity + efficient processing• Scalability: – query time depends on size of workflow graph, not size of provenance graph – workflow graphs are small, fit in memory, can be indexed easily• Graceful degradation: – worst case is a completely unfocused query – no worse than other approaches• Fine-grain answers provided at the same time EDBT, Lausanne, March 2010
  45. 45. Outline• Assumption: – Black box provenance of workflow data products✓Fine-grained provenance: – tracking provenance through collection elements – motivation, functional model of collection-oriented workflow processing✓Efficient query processing: ✓leveraging the functional model to achieve efficient processing for a simple query model➡Experimental evaluation EDBT, Lausanne, March 2010
  46. 46. Experimental setup - I• Performance evaluation performed on programmatically generated dataflows– the “T-towers”parameters:- size of the lists involved- length of the paths- includes one cross product 20 EDBT, Lausanne, March 2010
  47. 47. 125 time ( Experimental results - II 100 75 50 25 0 10 28 50 75 100 150 Naive traversal of path length l 10 elements in provenance graph 150 elements input list d=10 ind=150 list input 225 225 200 200 175 workflow as index single multi-join query 175 150 150 (our approach) time (ms) time (ms) 125 125 NI NGQ 100 100 PP 75 75 50 50 25 25 0 0 10 28 50 75 100 150 10 25 50 75 100 150 path length l path length l d=150 225 200 175 150 time (ms) 125 100 NR NGQ performance degradation on 75 workflow pre-processing time by graph size PP fully unfocused queries 50 3500 3257 3250 25 response times for PP on unfocused queries (l=150) 3000 130 0 2750 120 10 25 50 75 100 150 2500 110 2250 path length l 100time (ms) 2024 90 2000 80 time (ms) 1750 70 1500 60 1250 1000 50 1000 40 750 700 30 500 440 20 256 250 124 10 0 0 10 28 50 75 100 150 200 1.33% 6.67% 13.33% 20.00% 26.67% 33.33% 40.00% 46.67% path length l % of processors in target set EDBT, Lausanne, March 2010
  48. 48. Summary• A simple lineage query model for Taverna –grounded in the semantics of collection-oriented processing –combines fine-grain answers with efficient query processing• Ongoing work: – space compression, indexing – QLP? – semantic provenance (initial paper submitted)• Currently part of the Taverna 2.1 release EDBT, Lausanne, March 2010

×