Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008
1. Granular workflow provenance in Taverna
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
Symposium on Provenance in Scientific Workflows
Salt Lake City, Oct. 2008
2. Outline
• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe
granular data derivations
• Provenance service architecture - brief description
10. Collections example: from genes to SNPs
• See myexperiment.org: http://www.myexperiment.org/workflows/166
Input:
[ ENSG00000139618 , ENSG00000083093 ]
Workflow steps:
– gene -> genomic region
– extend region
– retrieve SNPs in the region
– rearrange SNP details
Output:
[[<1,23554512,16,rs45585833>,
<1,23554712,16,rs45594034>,
...
],
[<1,31820153,13,ENSSNP10730823>,
<1,31818497,13,ENSSNP10730820>,
...
]]
11. Computational model for collections
Depth mismatch between declared / offered type:
type(P4:X1) = s but type(a) = list(s)
type(P4:X2) = type(c) = list(s)
type(P4:X3) = s but type(c) = list(s)
Execution at P4:
Y = (map P1 <(a ⊗ b) , c>) // cross product
Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
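The execution line above can be read as an implicit iteration: the processor runs once per pair drawn from a and b, while c is passed whole to every invocation. A minimal Python sketch of that reading follows. The function `iterate_cross` and the stand-in processor `p` are hypothetical names, and the result is kept as a flat list exactly as the slide writes it, which may not match how the engine nests its outputs.

```python
from itertools import product


def iterate_cross(processor, a, b, c):
    """Call processor once per (a_i, b_j) pair; c is handed over unchanged."""
    return [processor(ai, bj, c) for ai, bj in product(a, b)]


def p(x1, x2, x3):
    # Hypothetical stand-in for a processor: it just records its arguments.
    return (x1, x2, tuple(x3))


result = iterate_cross(p, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
for item in result:
    print(item)
```

With n elements in a and m in b, the processor is invoked n·m times, which is why a depth mismatch on two ports multiplies the number of provenance events to record.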
24. Tracing granular lineage
• Provenance traces are most useful when they are
granular
– trace individual items in a collection
– “which geneID is responsible for the presence of SNP
rs169546 in the output?”
• Curse of black box processors:
– M-M (many-many) and M-1 (many-one) processors
destroy granularity
40. Granular lineage: recap
• Lineage query model accounts for granular traces
over nested collections
• arbitrary nesting levels:
– values are trees in general
– lineage query identifies the correct sub-trees
• Lineage queries are efficient
– recursion problem “compiled away” by query rewriting
– (shameless claim - details omitted)
• But:
– One single M-* processor can destroy granularity
– in some cases annotations are a remedy
43. Towards provenance-friendly workflows
1. Define metrics for workflow provenance precision
– how well is granularity preserved over a lineage trace?
– what is the impact of M-* processors?
– use to prioritize remedial actions
2. Make workflows more provenance friendly:
– Add knowledge (static):
• “lightweight annotations” [MBZ+08] (see IPAW08)
– Add knowledge (dynamic):
• provenance-active workflow processors
– Redesign processors / workflow
• general guidelines, provenance friendly patterns
[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble. Data lineage model for Taverna workflows with lightweight annotation requirements. Proc. International Provenance and Annotation Workshop (IPAW 2008).
50. Lineage precision: example
Variable bindings in the example workflow:
a = [a1, a2]
b = [b1, b2]
c = [c1, c2, c3]
d = [d1, d2]
e = [e1, e2]
f
lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }
precision = (1 + .5 + .5) / 3 = 2/3
52. Precision relative to a sub-graph
• Refining the previous idea:
– precision relative to a set O of output variables and a set I of input variables
• because not all variables are equally interesting...
• weights WI, WO account for relative importance of variables
$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j=1}^{|O|} W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)}$$
$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$
[Figure: workflow graph with inputs I1, I2 and outputs O1, O2, O3]
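The weighted precision can be sketched as a short Python function. This is only an illustration under stated assumptions: len(p_i) is read as the length of the index path returned for input X_i and nl(X_i) as the nesting level of that variable, the function name `precision` is hypothetical, and every number in the example is invented.

```python
from fractions import Fraction


def precision(lineage, w_in, w_out):
    """Weighted precision over a lineage result.

    lineage maps each output variable to a list of
    (input_variable, path_length, nesting_level) triples.
    """
    total = Fraction(0)
    for out_var, bindings in lineage.items():
        inner = sum(
            (w_in[x] * Fraction(path_len, nl) for x, path_len, nl in bindings),
            Fraction(0),
        )
        total += w_out[out_var] * inner
    return total


# Invented example: two inputs and two outputs, all weights 1/2.
w_in = {"I1": Fraction(1, 2), "I2": Fraction(1, 2)}
w_out = {"O1": Fraction(1, 2), "O2": Fraction(1, 2)}
lineage = {
    "O1": [("I1", 1, 1), ("I2", 1, 2)],  # I1 fully granular, I2 half-way
    "O2": [("I1", 0, 1)],                # only the whole list is known
}
print(precision(lineage, w_in, w_out))
```

Exact fractions are used so that the weights sum to 1 without rounding noise.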
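53. Impact of M-* processors on precision
• Count the number of variables in O that can be reached from P
• weighted sum:
$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$
[Figure: workflow graph with inputs I1, I2, a processor P, and outputs O1, O2, O3]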
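The impact measure reduces to a reachability test over the dataflow graph. The sketch below assumes the workflow is available as an adjacency dictionary; the graph, the weights, and the names `reachable` and `impact` are all hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact(graph, processor, weights):
    """Sum the weights W(o) of the output variables reachable from processor."""
    reached = reachable(graph, processor)
    return sum(w for o, w in weights.items() if o in reached)


# Invented dataflow: P feeds O1 and O2, Q feeds O3.
graph = {"I1": ["P"], "I2": ["Q"], "P": ["O1", "O2"], "Q": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
print(impact(graph, "P", weights))
print(impact(graph, "Q", weights))
```

Ranking the M-* processors by this value gives one possible order in which to annotate or refactor them.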
54. Improving provenance precision
• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active
63. Provenance-active processors
– Passive processors do not contribute explicit provenance info
– Provenance-active processors actively feed metadata to the lineage service
                      Example 1                            Example 2
Input of P            X: l(s) = [a1, a2, a3]               X: l(s) = [a1, a2, a3]
Output of P           Y: s = b                             Y: l(s) = [b1, b2]
Static annotations    aggregation f()                      P is index-preserving
Dynamic annotations   b = X[i]; b = f(X[1]...X[k])         sorting: Y = Π(X)
IPAW'08 – Salt Lake City, Utah, June 2008
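As an illustration of the dynamic annotation "sorting: Y = Π(X)", here is a minimal sketch of a sorting processor that returns the permutation it applied alongside its result. The function name is hypothetical, indices are 0-based rather than the 1-based positions used on the slides, and how the permutation would actually reach the lineage service is not shown.

```python
def sorting_processor(xs):
    """Sort xs and also report where each output element came from.

    Returns (ys, perm) with ys[k] == xs[perm[k]] for every k.
    """
    perm = sorted(range(len(xs)), key=lambda i: xs[i])
    ys = [xs[i] for i in perm]
    return ys, perm


xs = ["b", "c", "a"]
ys, perm = sorting_processor(xs)
print(ys, perm)
```

A passive sort would look like a many-to-many black box to the lineage service, whereas the reported permutation keeps each output element tied to exactly one input element.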
64. Open Provenance Model
• A graph notation to represent process provenance
– independent of the provenance producers
– suitable for exchanging provenance across different workflow
systems
• State: draft 1.01 (July 2008)
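69. Mapping to OPM - granularity issue
[Figure, left: Taverna dataflow. P0 has input ports X1, X2 bound to a, b and output ports Y1, Y2 bound to c, d. P1 (X:s, Y:s) consumes c and produces e; P2 (X:s, Y:s) consumes d and produces f.]
[Figure, right: OPM graph. P0 used a and b; c and d wasGeneratedBy (wgb) P0; P1 used c; P2 used d; a wasDerivedFrom edge relates b and d.]
b[p] wasDerivedFrom d[p’]
How can this granular dependency be described for arbitrary paths p?
It currently cannot be expressed using OPM.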
72. Path mapping rules
[Figure: OPM graph. P1 used a and b; c and d wasGeneratedBy (wgb) P1; P2 used c; P3 used d; a wasDerivedFrom edge between b[p] and d[p’] is labelled "actual lineage".]
• Coarse graph (used / wgb edges): static graph structure sufficient to provide this (in Taverna)
• Actual lineage between b[p] and d[p’]: but this is only known at query time (extensional enumeration not an option)
Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage
Hint:
granularity is only determined by depth of the path
At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
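One way to picture a local path mapping rule is sketched below. It assumes cross-product iteration in which an output index path is the concatenation of the iteration indices contributed by each input port, so the depth mismatch of a port decides how many path elements belong to it. This is an interpretation of the hint above, not the actual Taverna algorithm; `map_output_path` and the port names are hypothetical.

```python
def map_output_path(p, deltas):
    """Split an output index path p into one index path per input port.

    deltas maps each input port, in iteration order, to its depth
    mismatch, i.e. the number of index levels it contributes to p.
    An empty path means the whole input value was used.
    """
    paths = {}
    pos = 0
    for port, delta in deltas.items():
        paths[port] = p[pos:pos + delta]
        pos += delta
    return paths


# Invented example: X1 and X2 are iterated one level each, X3 is not.
mapping = map_output_path([1, 2, 2], {"X1": 1, "X2": 1, "X3": 0})
print(mapping)
```

Any trailing elements of p beyond the summed mismatches index into the value produced by a single invocation; without annotations they cannot be mapped back, which is where granularity is lost.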
75. Architecture provenance-active processors
[Figure: architecture diagram. Components: Taverna workflow engine (inputs, outputs), external services, provenance events, provenance manager, provenance information repository, p-active API, and a lineage query interface accepting lin( P:Y, , Psel, E(D)).]
1. Common content:
– processor execution details
– binding of input/output variables to values
– completion status
2. Optional content for provenance-active processors:
– explicit output → input dependency assertions:
let I, O be the input and output variable sets, respectively
depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
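A dependency assertion of the form depends(Y, X[p], <depType>) could be carried as a small record emitted next to the processor's output. The sketch below is hypothetical throughout: the `Depends` class, its field names, the `"selection"` dependency type and the `select_max` processor are illustrative and not part of any Taverna API. Paths are 0-based.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Depends:
    """One assertion: output_var depends on input_var at input_path."""
    output_var: str
    input_var: str
    input_path: Tuple[int, ...]
    dep_type: str


def select_max(xs):
    """Many-to-one processor that reports which input element it picked."""
    i = max(range(len(xs)), key=lambda k: xs[k])
    return xs[i], Depends("Y", "X", (i,), "selection")


value, assertion = select_max([3, 9, 4])
print(value, assertion)
```

The record narrows the lineage of Y from the whole list X to the single element X[i], which a passive many-to-one processor cannot provide.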
76. Ongoing work
• Experimental evaluation:
– to what extent is granularity a real practical problem?
– Quantify provenance friendliness by analysing a large
collection of workflows from myExperiment
– Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with
other workflow models
– can we successfully exchange provenance graphs?
• Integration of the provenance service with the new
version of Taverna
– to be released before end of year