Reconstructing Provenance
Sara Magliacane - VU University Amsterdam
Advisors: Paul Groth and Frank van Harmelen
ISWC DC poster "Reconstructing Provenance"
These slides are based on my poster for the ISWC 2012 Doctoral Consortium, "Reconstructing Provenance". They try to summarize my plans for the next 3 years of my PhD.


  1. Reconstructing Provenance

     Problem Statement

     The provenance of a data item is the metadata describing how, when and by whom the data item was produced. Provenance is crucial in many settings, but often it is not tracked, resulting in collections of files with only basic filesystem metadata, e.g. timestamps. In this case, is it possible to reconstruct provenance post hoc?

     [Figure: an activity-feed entry ("you edited the file poster.ppt") with the question "How can we get more information?", followed by reconstructed entries such as "adding an image from the file paper.pdf", "adding paragraphs from the file techreport.pdf" and "modified a paragraph similar to paper.pdf".]

     Research Question

     How can one automatically, accurately and efficiently reconstruct a plausible provenance of files in a shared folder, intended as the sequences of operations connecting the files?

     Approach & Methodology

     We propose a multi-signal pipeline approach that reconstructs plausible provenance traces using the contents of the files and metadata as evidence of the relationships between files. The pipeline consists of four stages, each containing several components that can be executed in parallel:

     [Figure: pipeline diagram. Docs → Preprocessing → Hypotheses Generation (signal detectors) → Hypotheses Pruning (signal filters) → Aggregation and ranking (aggregators), producing relationships with confidence values.]

     The research methodology is an iterative process that will incrementally integrate existing approaches from the literature and evaluate their performance on benchmark corpora.

     An initial prototype implementation

     As a first step we focus on dependencies between files instead of sequences of operations. We implemented a prototype of the pipeline using open-source components, like Apache Lucene, Apache Tika and Dropbox API. As signal detectors we used well-known similarity measures.

     [Figure: prototype components. Preprocessing: extract metadata, index content, infer semantic types. Hypotheses Generation: text similarity, image similarity, metadata similarity and domain-specific similarity, with similarity thresholds. Hypotheses Pruning: filter temporal incoherence, domain-specific filtering. Aggregation and ranking: weighted sum.]

     Initial (encouraging) results

     We performed an experiment with a small set of biomedical publications, annotated manually by two domain experts.

     [Figure: two graphs over nodes numbered 0 to 24, grouped into Cluster 1: Blood Cultures, Cluster 2: Markers and Cluster 3: General.]

     F1-score of 0.49 for only text similarity
     F1-score of 0.70 for the aggregation of various similarities

     Future work

     Following the planned methodology, we will explore additional components for each of the pipeline phases and also consider computational efficiency.

     Bibliography

     (1) Sara Magliacane: Reconstructing Provenance, ISWC Doctoral Consortium 2012
     (2) Paul Groth, Yolanda Gil, Sara Magliacane: Automatic Metadata Annotation through Reconstructing Provenance, Third International Workshop on the role of Semantic Web in Provenance Management, ESWC 2012
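The four pipeline stages described on the poster can be illustrated with a minimal sketch. It is not the prototype itself: the poster names Apache Lucene, Apache Tika and the Dropbox API, while this example uses only the Python standard library. The `FileInfo` type, the two Jaccard-based detectors, the threshold and the weights are all hypothetical choices made for illustration, and the temporal filter is a simple heuristic (a file is assumed not to depend on a file modified later). The `f1_score` helper shows how predicted dependencies could be compared against manual annotations.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical preprocessed file: name, extracted text, last-modified time.
    name: str
    text: str
    modified: float


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def text_similarity(src: FileInfo, dst: FileInfo) -> float:
    """Signal detector (assumed): word-set overlap of the file contents."""
    return jaccard(set(src.text.lower().split()), set(dst.text.lower().split()))


def name_similarity(src: FileInfo, dst: FileInfo) -> float:
    """Signal detector (assumed): character-trigram overlap of file names."""
    def trigrams(s: str) -> set:
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return jaccard(trigrams(src.name.lower()), trigrams(dst.name.lower()))


def generate_hypotheses(files, detectors, threshold):
    """Hypotheses generation: keep (src, dst) pairs where a detector clears the threshold."""
    hypotheses = {}
    for src, dst in permutations(files, 2):
        scores = {name: detect(src, dst) for name, detect in detectors.items()}
        kept = {name: s for name, s in scores.items() if s >= threshold}
        if kept:
            hypotheses[(src.name, dst.name)] = kept
    return hypotheses


def prune_temporal(hypotheses, files):
    """Hypotheses pruning: drop pairs whose source was modified after the target."""
    modified = {f.name: f.modified for f in files}
    return {(s, d): scores for (s, d), scores in hypotheses.items()
            if modified[s] <= modified[d]}


def aggregate(hypotheses, weights):
    """Aggregation and ranking: normalised weighted sum, highest confidence first."""
    total = sum(weights.values())
    ranked = [(pair, sum(weights[n] * s for n, s in scores.items()) / total)
              for pair, scores in hypotheses.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def f1_score(predicted: set, gold: set) -> float:
    """F1 of predicted dependency pairs against manually annotated pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    files = [
        FileInfo("paper.pdf", "provenance reconstruction of shared files", 1.0),
        FileInfo("poster.ppt", "provenance reconstruction poster of shared files", 2.0),
        FileInfo("notes.txt", "grocery list milk eggs", 3.0),
    ]
    detectors = {"text": text_similarity, "name": name_similarity}
    hypotheses = generate_hypotheses(files, detectors, threshold=0.3)
    pruned = prune_temporal(hypotheses, files)
    for pair, confidence in aggregate(pruned, {"text": 0.7, "name": 0.3}):
        print(pair, round(confidence, 2))
```

Detector scores below the threshold contribute zero to the weighted sum, so a pair supported by several signals ranks above a pair supported by one. Each detector is an independent function of a file pair, which is what would allow the components of a stage to run in parallel.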