ISWC DC poster "Reconstructing Provenance"
Slides based on my poster for the ISWC 2012 Doctoral Consortium, "Reconstructing Provenance". They summarize how I envision the next three years of my PhD.

Transcript

  • 1. Reconstructing Provenance
    Sara Magliacane - VU University Amsterdam
    Advisors: Paul Groth and Frank van Harmelen

    Problem Statement
    The provenance of a data item is the metadata describing how, when and by whom the data item was produced. Provenance is crucial in many settings, but often it is not tracked, resulting in collections of files with only basic filesystem metadata, e.g. timestamps. In this case, is it possible to reconstruct provenance post hoc?

    Research Question
    How can one automatically, accurately and efficiently reconstruct a plausible provenance of files in a shared folder, intended as the sequences of operations connecting the files?

    Approach & Methodology
    We propose a multi-signal pipeline approach that reconstructs plausible provenance traces using the contents of the files and metadata as evidence of the relationships between files. The pipeline consists of four stages, each containing several components that can be executed in parallel:
    [Figure: pipeline diagram. Docs pass through Preprocessing, Hypotheses Generation (signal detectors), Hypotheses Pruning (signal filters), and Aggregation and ranking (aggregators).]
    The research methodology is an iterative process, that will incrementally integrate existing approaches in literature and evaluate the performance on benchmark corpora.

    An initial prototype implementation
    As a first step we focus on dependencies between files instead of sequences of operations. We implemented a prototype of the pipeline using open-source components, like Apache Lucene, Apache Tika and Dropbox API. As signal detectors we used well-known similarity measures.
    [Figure: prototype pipeline. Preprocessing: extract metadata, index content, infer semantic types. Hypotheses Generation: text similarity, image similarity, domain-specific similarity, metadata similarity. Hypotheses Pruning: filter temporal incoherence, similarity thresholds, domain-specific filtering. Aggregation and ranking: weighted sum.]

    Initial (encouraging) results
    We performed an experiment with a small set of biomedical publications, annotated manually by two domain experts.
    [Figure: dependency graphs over numbered files, grouped into Cluster 1: Blood Cultures, Cluster 2: Markers, Cluster 3: General.]
    F1-score of 0.49 for only text similarity
    F1-score of 0.70 for the aggregation of various similarities

    Future work
    Following the planned methodology, we will explore additional components for each of the pipeline phases and consider also computational efficiency.

    Bibliography
    (1) Sara Magliacane: Reconstructing Provenance, ISWC Doctoral Consortium 2012
    (2) Paul Groth, Yolanda Gil, Sara Magliacane: Automatic Metadata Annotation through Reconstructing Provenance, Third International Workshop on the role of Semantic Web in Provenance Management, ESWC 2012
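    Illustrative sketch
    The poster describes the pipeline only at a high level, so the following is a minimal Python sketch of the idea and not the prototype itself (which was built on Apache Lucene, Apache Tika and the Dropbox API). Everything named here is an assumption made for illustration: the FileRecord type, the two similarity signals (word-level Jaccard and difflib's character ratio), the weights, and the 0.5 score threshold. Preprocessing is assumed to have already produced plain text and a modification timestamp per file. The sketch generates dependency hypotheses between files, prunes temporally incoherent ones, aggregates the signals with a weighted sum, and scores the result against manual annotations with an F1-score.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import permutations
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class FileRecord:
    name: str
    text: str        # plain text, assumed to come from a preprocessing step
    modified: float  # filesystem modification timestamp


Edge = Tuple[str, str]  # hypothesis: (source file, file derived from it)


def word_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity from the standard library's difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical signal detectors; the poster only says "well-known similarity measures".
SIGNALS: Dict[str, Callable[[str, str], float]] = {
    "words": word_jaccard,
    "chars": char_ratio,
}


def generate(files: List[FileRecord]) -> Dict[Edge, Dict[str, float]]:
    """Hypotheses generation: score every ordered pair of files with every signal."""
    return {
        (src.name, dst.name): {n: f(src.text, dst.text) for n, f in SIGNALS.items()}
        for src, dst in permutations(files, 2)
    }


def prune(hyps: Dict[Edge, Dict[str, float]],
          files: List[FileRecord]) -> Dict[Edge, Dict[str, float]]:
    """Hypotheses pruning: drop edges whose source was modified after the derived file."""
    mtime = {f.name: f.modified for f in files}
    return {e: s for e, s in hyps.items() if mtime[e[0]] <= mtime[e[1]]}


def aggregate(hyps: Dict[Edge, Dict[str, float]],
              weights: Dict[str, float],
              min_score: float = 0.5) -> List[Tuple[Edge, float]]:
    """Aggregation and ranking: weighted sum of signals, thresholded and sorted."""
    scored = [(e, sum(weights[n] * v for n, v in s.items())) for e, s in hyps.items()]
    kept = [item for item in scored if item[1] >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def f1_score(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """Harmonic mean of precision and recall over predicted dependency edges."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    files = [
        FileRecord("draft.txt", "blood culture results for patients", 1.0),
        FileRecord("final.txt", "blood culture results for all patients", 2.0),
        FileRecord("notes.txt", "meeting notes budget", 3.0),
    ]
    ranked = aggregate(prune(generate(files), files), {"words": 0.5, "chars": 0.5})
    for edge, score in ranked:
        print(edge, round(score, 2))
```

    Keeping each stage as a separate function mirrors the poster's point that components can be added per stage; a new signal detector is one more entry in SIGNALS with a matching weight.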