- 1. Fine-grained and efficient lineage querying of collection-based workflow provenance
  - Paolo Missier, Norman Paton, Khalid Belhajjame
  - Information Management Group, School of Computer Science, University of Manchester, UK
  - EDBT Conference, Lausanne, Switzerland, March 2010
- 2. Problem statement and outline
  - Setting: black-box provenance of workflow data products
  - Fine-grained provenance:
    - tracking provenance through collections: motivation
    - functional model of collection-oriented workflow processing
  - Efficient query processing:
    - leveraging the functional model to achieve efficient processing for a simple query model
  - Experimental evaluation
  - EDBT, Lausanne, March 2010
- 3. Computing provenance through a graph
  - The provenance graph is an unfolding of the workflow graph structure
    - large: grows with the size of the input
    - lineage queries involve graph traversal
  - From: Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," Procs. ICDE, 2009.
- 4-6. Main result
  - [Figure: the workflow graph (processors Q, R and P, with ports X, Y, X1, X2, X3) shown beside the provenance graph (values v1 ... vn, w, a1 ... an, b1 ... bm and the output y = [ [y11 ... y1n], ... [ym1 ... ymn] ]); the later builds annotate ports with the indices [1], [n] and []]
  - Query the provenance of individual collection elements
  - But avoid computing transitive closures on the provenance graph, which is potentially very large
  - Traverse the workflow graph instead, which is much smaller
  - This results in a substantial performance improvement for typical queries
- 7-9. Workflow as data integrator
  - [Figure: workflow diagram annotated with the labels "QTL genomic regions", "genes in QTL" and "metabolic pathways (KEGG)"]
- 10-14. Motivation for fine-grained provenance
  - List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ]
  - [Figure: workflow diagram; the later builds link individual elements of geneIDs to individual elements of pathways]
  - [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
  - [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ... ], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ... ] ]
- 15. Problem statement and outline
  - Setting: black-box provenance of workflow data products
  - Fine-grained provenance: tracking provenance through collection elements
    - motivation
    - ➡ functional model of collection-oriented workflow processing
- 16-20. Functional model for collection processing /1
  - Simple processing: the service expects atomic values and receives atomic values
    - [Figure: P with input ports X1, X2, X3 receiving v1, v2, v3 and output ports Y1, Y2 producing w1, w2]
  - Simple iteration: the service expects atomic values but receives an input list $v = [v_1 \ldots v_n]$
    - P ➠ $P_1 \ldots P_n$, each with input port X and output port Y, producing $w = [w_1 \ldots w_n]$
    - $ad = 1$, $dd = 0$, so $\delta = 1$ (ad: actual depth of the value; dd: declared depth of the port)
    - $\mathrm{lineage}(w_i) = v_i$
  - Extension: the service expects atomic values but receives a nested input list $v = [[\ldots], \ldots, [\ldots]]$
    - $ad = 2$, $dd = 0$, so $\delta = 2$, producing $w = [[\ldots], \ldots, [\ldots]]$
    - $\mathrm{lineage}(w_{ij}) = v_{ij}$
- 21-22. Functional model /2
  - The simple iteration model generalises by induction to a generic $\delta = n - m$:
    - input $v = [[\ldots], \ldots, [\ldots]]$ with $ad = n$, on a port X with $dd = m$, so $\delta = n - m \geq 0$
    - output $w = [[\ldots], \ldots, [\ldots]]$ of depth $n - m$
  - This leads to a recursive functional formulation for simple collection processing, with $v = [a_1 \ldots a_n]$:

    $$(\mathrm{eval}_l\ P\ v) = \begin{cases} (P\ v) & \text{if } l = 0 \\ (\mathrm{map}\ (\mathrm{eval}_{l-1}\ P)\ v) & \text{if } l > 0 \end{cases}$$
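The recursive formulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `actual_depth`, `eval_l` and the toy service `P` are hypothetical names, and nested Python lists stand in for workflow collections. It is not Taverna's implementation.

```python
from typing import Any, Callable


def actual_depth(value: Any) -> int:
    """Nesting depth (ad) of a value: 0 for an atom, 1 for a flat list, etc.

    Assumption: lists are non-empty and uniformly nested, so the first
    element is representative of the whole value.
    """
    depth = 0
    while isinstance(value, list):
        if not value:
            raise ValueError("cannot infer the depth of an empty list")
        value = value[0]
        depth += 1
    return depth


def eval_l(P: Callable[[Any], Any], v: Any, l: int) -> Any:
    """(eval_l P v): apply P if l == 0, otherwise map eval_{l-1} over v."""
    if l == 0:
        return P(v)
    return [eval_l(P, element, l - 1) for element in v]


def P(x: str) -> str:
    # Toy service that expects an atomic value (declared depth dd = 0).
    return x.upper()


v = [["a", "b"], ["c"]]
dd = 0
delta = actual_depth(v) - dd  # depth mismatch: 2 - 0 = 2
w = eval_l(P, v, delta)
print(w)  # [['A', 'B'], ['C']]
```

Because `w` has the same shape as `v`, the element `w[i][j]` is computed from `v[i][j]` alone, which is what makes element-level lineage possible.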
- 23-25. Functional model - multiple inputs /3
  - Inputs to P: $v_1 = [v_{11} \ldots v_{1n}]$ on X1, $v_2 = [v_{21} \ldots v_{2k}]$ on X2, $v_3 = [v_{31} \ldots v_{3m}]$ on X3
  - Depth mismatches:
    - $dd(X_1) = 0$, $ad(v_1) = 1$ ➠ $\delta_1 = 1$
    - $dd(X_2) = 1$, $ad(v_2) = 1$ ➠ $\delta_2 = 0$
    - $dd(X_3) = 0$, $ad(v_3) = 1$ ➠ $\delta_3 = 1$
  - Output on Y: $w = [[w_{11} \ldots w_{1m}], \ldots, [w_{n1} \ldots w_{nm}]]$
  - $\mathrm{lineage}(w_{ij}) = \langle v_{1i}, v_2, v_{3j} \rangle$
  - Cross product involving $v_1$ and $v_3$ (but not $v_2$): $v_1 \otimes v_3 = [\,[\langle v_{1i}, v_{3j} \rangle \mid j: 1..m] \mid i: 1..n\,]$
  - and including $v_2$: $[\,[\langle v_{1i}, v_2, v_{3j} \rangle \mid j: 1..m] \mid i: 1..n\,]$
- 26-28. Generalised cross product
  - Binary product, $\delta = 1$:

    $$a \times b = [\,[\langle a_i, b_j \rangle \mid b_j \leftarrow b] \mid a_i \leftarrow a\,]$$

    $$(\mathrm{eval}_2\ P\ \langle a, b \rangle) = (\mathrm{map}\ (\mathrm{eval}_1\ P)\ a \times b)$$

  - Generalised to arbitrary depths:

    $$(v, d_1) \otimes (w, d_2) = \begin{cases} [\,[(v_i, w_j) \mid w_j \leftarrow w] \mid v_i \leftarrow v\,] & \text{if } d_1 > 0,\ d_2 > 0 \\ [\,(v_i, w) \mid v_i \leftarrow v\,] & \text{if } d_1 > 0,\ d_2 = 0 \\ [\,(v, w_j) \mid w_j \leftarrow w\,] & \text{if } d_1 = 0,\ d_2 > 0 \\ (v, w) & \text{if } d_1 = 0,\ d_2 = 0 \end{cases}$$

  - ...and to n operands: $\otimes_{i:1 \ldots n} (v_i, d_i)$
  - Finally, the general functional semantics for collection-based processing:

    $$(\mathrm{eval}_l\ P\ \langle (v_1, d_1), \ldots, (v_n, d_n) \rangle) = \begin{cases} (P\ \langle v_1, \ldots, v_n \rangle) & \text{if } l = 0 \\ (\mathrm{map}\ (\mathrm{eval}_{l-1}\ P)\ \otimes_{i:1 \ldots n} (v_i, d_i)) & \text{if } l > 0 \end{cases}$$
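A possible reading of the generalised cross product, written as runnable Python. The names `cross` and `eval_general` are hypothetical, and one point is an assumption rather than something the slides state: an operand whose depth is greater than 1 is unwrapped one level at a time, with the leftmost operand outermost.

```python
from typing import Any, Callable, List, Tuple

Operand = Tuple[Any, int]  # (value, depth mismatch d_i)


def cross(operands: List[Operand]) -> Any:
    """Generalised cross product of (value, depth) operands.

    Operands with depth 0 are passed through whole; operands with
    depth > 0 are iterated over, leftmost operand outermost.
    Assumption: depths above 1 are unwrapped one level per recursion.
    """
    for pos, (value, depth) in enumerate(operands):
        if depth > 0:
            return [
                cross(operands[:pos] + [(element, depth - 1)] + operands[pos + 1:])
                for element in value
            ]
    # Every depth is 0: a single tuple of argument values.
    return tuple(value for value, _ in operands)


def eval_general(P: Callable[..., Any], operands: List[Operand]) -> Any:
    """Apply P to every tuple of the generalised cross product."""
    level = sum(depth for _, depth in operands)

    def apply(nested: Any, l: int) -> Any:
        if l == 0:
            return P(*nested)
        return [apply(item, l - 1) for item in nested]

    return apply(cross(operands), level)


# The three-input case: delta1 = 1, delta2 = 0, delta3 = 1.
v1, v2, v3 = ["a", "b"], ["k1", "k2"], [1, 2, 3]
operands = [(v1, 1), (v2, 0), (v3, 1)]


def P(a: str, ks: List[str], c: int) -> str:
    # Toy service: X2 expects a whole list, X1 and X3 expect atoms.
    return f"{a}{c}/{len(ks)}"


w = eval_general(P, operands)
print(w[1][2])                # b3/2
print(cross(operands)[1][2])  # ('b', ['k1', 'k2'], 3)
```

The tuple found at position `[i][j]` of the cross product is exactly the set of inputs that produced `w[i][j]`, so the same structure that drives the iteration also yields the lineage.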
- 29-36. Static mapping of output to input values
  - The iteration structure can be determined statically:
    - $dd(X_1) = 0$, $ad(v_1) = 1$ ➠ $\delta_1 = 1$
    - $dd(X_2) = 1$, $ad(v_2) = 1$ ➠ $\delta_2 = 0$
    - $dd(X_3) = 0$, $ad(v_3) = 1$ ➠ $\delta_3 = 1$
  - [Figure: P with ports X1, X2, X3 annotated (0,1), (1,1), (0,1), matching the (dd, ad) values listed, and output port Y annotated (0,2)]
  - This leads to a simple mapping rule: index of an output list value ➔ {index of input values}
    - $Y[i.j] \rightarrow X_1[i],\ X_2[\,],\ X_3[j]$
    - In general (built up over several animation steps), the output index $[i_1 . i_2 . \ldots . i_k]$ is split into consecutive segments of lengths $\delta_1, \delta_2, \ldots, \delta_k$, one for each of the input ports $X_1, X_2, \ldots, X_k$
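The mapping rule needs only the depth mismatches, not the data, which is why it can be applied statically. A minimal sketch follows; `input_indices` is a hypothetical helper, and accepting an output index shorter than the sum of the mismatches (a coarser query) is an assumption of this sketch.

```python
from typing import Dict, List, Sequence, Tuple


def input_indices(
    output_index: Sequence[int],
    deltas: Sequence[Tuple[str, int]],
) -> Dict[str, List[int]]:
    """Split an output index into one index per input port.

    `deltas` lists (port name, depth mismatch) in port order. Each port
    takes as many components of the output index as its mismatch; a port
    with mismatch 0 gets the empty index [], meaning "the whole value".
    """
    if len(output_index) > sum(delta for _, delta in deltas):
        raise ValueError("output index is longer than the sum of the mismatches")
    result: Dict[str, List[int]] = {}
    position = 0
    for port, delta in deltas:
        result[port] = list(output_index[position:position + delta])
        position += delta
    return result


# (port, dd, ad) values taken from the slide.
ports = [("X1", 0, 1), ("X2", 1, 1), ("X3", 0, 1)]
deltas = [(name, ad - dd) for name, dd, ad in ports]

print(input_indices([2, 5], deltas))
# {'X1': [2], 'X2': [], 'X3': [5]}
```

With these mismatches, the query for `Y[2.5]` resolves to `X1[2]`, all of `X2`, and `X3[5]`, matching the rule Y[i.j] → X1[i], X2[], X3[j].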
- 37-39. Workflow graph as index for query processing
  1. traverse the workflow graph (small) rather than the provenance trace (large)
  2. use static prediction of iterations to trace through collection elements
  3. "parachute" into the actual trace only at the end
  - [Figure: the workflow graph and provenance graph of the "Main result" slide; the later builds add the index annotations [1], [n] and [] to the ports]
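The three steps can be illustrated with a toy traversal. Everything here is hypothetical: the dictionaries `DELTAS`, `ARCS` and `TRACE`, the wiring of Q and R into P, and the in-memory trace are assumptions made for the sketch, not the data structures used in Taverna.

```python
from typing import Dict, List, Set, Tuple

Index = Tuple[int, ...]
Binding = Tuple[str, str, Index]  # (processor, input port, index)

# Hypothetical workflow specification: input ports with depth mismatches.
DELTAS: Dict[str, List[Tuple[str, int]]] = {
    "Q": [("X", 1)],
    "R": [("X", 0)],
    "P": [("X1", 1), ("X2", 0), ("X3", 1)],
}
# Hypothetical arcs: (processor, input port) -> upstream processor.
ARCS: Dict[Tuple[str, str], str] = {
    ("P", "X1"): "Q",
    ("P", "X3"): "R",
}


def split_index(processor: str, index: Index) -> Dict[str, Index]:
    """Step 2: statically map an output index to one index per input port."""
    result: Dict[str, Index] = {}
    position = 0
    for port, delta in DELTAS[processor]:
        result[port] = tuple(index[position:position + delta])
        position += delta
    return result


def lineage(processor: str, index: Index, focus: Set[str]) -> List[Binding]:
    """Step 1: walk the workflow graph upstream, collecting bindings only
    for processors in `focus`. The trace is never touched here."""
    found: List[Binding] = []
    for port, port_index in split_index(processor, index).items():
        if processor in focus:
            found.append((processor, port, port_index))
        upstream = ARCS.get((processor, port))
        if upstream is not None:
            found.extend(lineage(upstream, port_index, focus))
    return found


# Hypothetical provenance trace, keyed by (processor, port, index).
TRACE: Dict[Binding, str] = {
    ("Q", "X", (0,)): "v1",
    ("Q", "X", (1,)): "v2",
    ("R", "X", ()): "w",
}

targets = lineage("P", (1, 0), focus={"Q", "R"})
values = [TRACE[t] for t in targets]  # Step 3: one trace lookup per binding
print(targets)  # [('Q', 'X', (1,)), ('R', 'X', ())]
print(values)   # ['v2', 'w']
```

The traversal cost depends on the number of processors and arcs, while the trace is consulted once per reported binding, regardless of how many elements the lists contain.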
- 40. From granularity to efficient query processing
  - Summary so far:
    - whenever iterations are involved, we can trace the provenance of individual elements of a processor's output
    - iterations are explained in terms of a functional model and based on list depth discrepancies
    - the relationships between output and input indexes are derived statically, using the workflow specification graph
  - How about expressivity and efficient processing of lineage queries?
- 41. Lineage query model /1
  - I - Focusing: not all processors are interesting
    - report lineage only at specified nodes in the graph
- 42. Lineage query model /2
  - II - Granularity: trace lineage for individual elements within collections, when possible!
  - List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ]
  - [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
  - [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ... ], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ... ] ]
- 43. Lineage query model and language

      <pquery>
        <scope workflow="keggPathways">
          <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
        </scope>
        <select>
          <outputPort name="paths_per_gene" index="[1,2]"/>
          <outputPort name="paths_per_gene" index="[3,4]"/>
          <outputPort name="commonPathways" index="[1]"/>
          <processor name="getPathwayDescriptions">
            <outputPort name="return"/>
          </processor>
        </select>
        <focus>
          <processor name="get_pathway_by_genes" />
        </focus>
      </pquery>

  - scope: the workflow scope; optionally specifies one or more runs for the target workflow, and defaults to the latest run
  - select: port values for which lineage is sought, either global outputs or processor-qualified
  - focus: processors where lineage is to be reported, possibly workflow-qualified
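To show how the three parts of such a query might be consumed, the sketch below reads the slide's XML with Python's standard `xml.etree.ElementTree`. The helper `parse_index` and the tuple layout of `selected` are assumptions made for illustration; the slides do not describe how Taverna parses the query.

```python
import ast
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

QUERY = """
<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes"/>
  </focus>
</pquery>
"""

Selection = Tuple[Optional[str], str, Tuple[int, ...]]  # (processor, port, index)


def parse_index(text: Optional[str]) -> Tuple[int, ...]:
    """Turn an index attribute such as "[1,2]" into a tuple.

    Assumption: a missing index means the whole port value, encoded as ().
    """
    return tuple(ast.literal_eval(text)) if text else ()


root = ET.fromstring(QUERY.strip())

scope = root.find("scope")
workflow = scope.get("workflow")
runs = [run.get("id") for run in scope.findall("run")]

selected: List[Selection] = []
select = root.find("select")
for port in select.findall("outputPort"):  # global workflow outputs
    selected.append((None, port.get("name"), parse_index(port.get("index"))))
for proc in select.findall("processor"):   # processor-qualified ports
    for port in proc.findall("outputPort"):
        selected.append(
            (proc.get("name"), port.get("name"), parse_index(port.get("index")))
        )

focus = [proc.get("name") for proc in root.find("focus").findall("processor")]

print(workflow, runs)
print(selected)
print(focus)
```

Each entry of `selected` is a starting point for a lineage traversal, and `focus` restricts where results are reported.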
- 44. Fine granularity + efficient processing
  - Scalability:
    - query time depends on the size of the workflow graph, not the size of the provenance graph
    - workflow graphs are small, fit in memory, and can be indexed easily
  - Graceful degradation:
    - the worst case is a completely unfocused query
    - no worse than other approaches
  - Fine-grain answers are provided at the same time
- 45. Outline
  - Assumption: black-box provenance of workflow data products
  - ✓ Fine-grained provenance: tracking provenance through collection elements
    - motivation, functional model of collection-oriented workflow processing
  - ✓ Efficient query processing:
    - ✓ leveraging the functional model to achieve efficient processing for a simple query model
  - ➡ Experimental evaluation
- 46. Experimental setup - I
  - Performance evaluation performed on programmatically generated dataflows: the "T-towers"
  - Parameters:
    - size of the lists involved
    - length of the paths
    - includes one cross product
- 47. Experimental results - II
  - [Figure: performance plots; the plotted values are not recoverable from the transcript]
    - Plots of time (ms) against path length l (10 to 150) for input lists of 10 and 150 elements (d=10, d=150), with legend entries NI (NR in one plot), NGQ and PP, and the callouts "Naive traversal of provenance graph", "single multi-join query" and "workflow as index (our approach)"
    - "workflow pre-processing time by graph size": time (ms) against path length l (10 to 200)
    - "performance degradation on fully unfocused queries": response times for PP on unfocused queries (l=150), time (ms) against % of processors in target set (1.33% to 46.67%)
- 48. Summary
  - A simple lineage query model for Taverna
    - grounded in the semantics of collection-oriented processing
    - combines fine-grain answers with efficient query processing
  - Ongoing work:
    - space compression, indexing
    - QLP?
    - semantic provenance (initial paper submitted)
  - Currently part of the Taverna 2.1 release
