Camp 4-data workshop presentation

A presentation at the CAMP-4-DATA workshop, Sept. 6, Lisbon:
http://dcevents.dublincore.org/IntConf/index/pages/view/camp-4-data

Published in: Technology, Education

Transcript of "Camp 4-data workshop presentation"

  1. Provenance Central: More Mileage from Provenance Metadata
     Bertram Ludäscher, UC Davis, USA, ludaesch@ucdavis.edu
     Paolo Missier, Newcastle University, UK, paolo.missier@ncl.ac.uk
     Members of the DataONE Provenance Working Group
     CAMP-4-DATA workshop @ iPRES 2013, Sept. 6, 2013, Lisbon, Portugal
     Friday, 6 September 13
  2. Outline
     • A foundation for provenance management: the PROV data model
       – From the W3C; a Recommendation as of spring 2013
       – Generic, extensible model
     • The role of provenance in the DataONE project
       – Provenance enables search and discovery, reuse, reproducibility
       – PBase: provenance warehousing
       – Integration with the DataONE architecture
       – Provenance mining: the social life of research data
  3. PROV: scope and structure
     Source: http://www.w3.org/TR/prov-overview/
     Figure labels: Recommendation track; plus: Prov-dictionary
  5. PROV Core Elements (graph depiction)
     • Entity: a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
     • Activity: something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
     • Agent: something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
     Figure labels:
       – Nodes: drafting, commenting, paper1, paper2, draft v1, draft comments, Alice, Bob
       – Relations: used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf
       – Time axis: Remote past, Recent past
       – Attributes: distribution=internal, status=draft, version=0.1, ex:role=main_editor, type=person, ex:role=sr_editor, prov:role=editor, time=...
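
The three core element kinds and the relations named on this slide can be illustrated with a small sketch. The node names come from the slide, but the wiring of the edges is an assumption, because the flattened slide text does not preserve which node each relation connects. The relation signatures (for example, `used` goes from an activity to an entity) follow PROV-DM; `KINDS`, `STATEMENTS`, and `ill_typed` are hypothetical names introduced only for illustration.

```python
# Kind of each node label recoverable from the slide.
KINDS = {
    "paper1": "entity",
    "paper2": "entity",
    "draft v1": "entity",
    "draft comments": "entity",
    "drafting": "activity",
    "commenting": "activity",
    "Alice": "agent",
    "Bob": "agent",
}

# PROV-DM signatures: relation -> (kind of subject, kind of object).
SIGNATURES = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}

# ASSUMPTION: this edge wiring is a plausible reading of the figure,
# not something the transcript states.
STATEMENTS = [
    ("drafting", "used", "paper1"),
    ("drafting", "used", "paper2"),
    ("draft v1", "wasGeneratedBy", "drafting"),
    ("commenting", "used", "draft v1"),
    ("draft comments", "wasGeneratedBy", "commenting"),
    ("drafting", "wasAssociatedWith", "Alice"),
    ("commenting", "wasAssociatedWith", "Bob"),
    ("Bob", "actedOnBehalfOf", "Alice"),
]


def ill_typed(statements, kinds):
    """Return the statements whose endpoints do not match the relation signature."""
    bad = []
    for subject, relation, obj in statements:
        if (kinds.get(subject), kinds.get(obj)) != SIGNATURES[relation]:
            bad.append((subject, relation, obj))
    return bad
```

Calling `ill_typed(STATEMENTS, KINDS)` on the assumed graph returns an empty list, while a statement such as `("Alice", "used", "paper1")` is reported because an agent cannot be the subject of `used`.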
  6. Summary of the PROV Core model
     – PROV-DC mapping available
     – Recent tutorial @EDBT’13 (June, 2013) [1]
       • Model, Constraints, Applications
     [1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.
  7. PROV-DM relations at a glance
  8. Context: ProvWG@DataONE
     • DataONE: Data Observation Network for Earth
       – 5-year NSF DataNet data preservation project (current phase)
       – Provides a large-scale, federated data infrastructure to the Earth sciences community
     • Provenance Working Group
       – Active until July 2014 (current phase; an extension is being considered)
       – One or two interns per year since 2010
       – One dedicated researcher (postdoc) since 2012
       – 12 core members, plus additional guest members on rotation
     • Specific focus on the provenance of workflow-based e-science data
  9. DataONE collaboration scenario - 2012
     • Alice’s workflow generates benchmark climate data for model comparison.
     • Input is retrieved from DataONE to generate an output file.
  10. DataONE collaboration scenario - 2012
      • The workflow, provenance, and other metadata are uploaded to DataONE.
      • A data package is created and indexed.
  11. Searching
      • Bob searches based on keywords in the metadata, including provenance terms.
      • Bob discovers Alice’s workflow. He may be able to execute it again.
  12. PBase and DataONE
      Diagram labels: System Metadata, Extract-Align-Augment Metadata, Science Data, Search API, Science Metadata, Provenance Curation, Index, Identifiers/Text fields, Graph Structure, ProvExplorer, Internal Metadata Index, Repository, PBase/D-PROV, Querying
      – Provenance traces in PBase linked to DataONE packages
      – Provenance traces indexed for searching
  13. DataONE Provenance components I: D-PROV
      D-PROV extends PROV and connects trace metadata to workflow structure.
      Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
  14. DataONE Provenance components I: D-PROV
      D-PROV extends PROV and connects trace metadata to workflow structure.
      Diagram labels:
        – Nodes: wf, wfInv, T1, T2, T1Inv, T2Inv, op1, ip1, d
        – Relations: isTaskOf, hasInputPort, hasOutputPort, dataLink, wasAssociatedWith, wasStartedBy, onOutPort, onInPort
      Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
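
A minimal sketch of how trace metadata might be connected to workflow structure, using only the node and relation names visible on the slide. The direction and endpoints of every edge below are assumptions (the flattened diagram does not preserve them), and the triple-set encoding and helper names are hypothetical rather than the D-PROV serialization.

```python
# Workflow structure. ASSUMPTION: edge endpoints are guessed from the labels.
STRUCTURE = {
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
}

# Execution trace. ASSUMPTION: same caveat as above.
TRACE = {
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
    ("d", "onOutPort", "op1"),
    ("d", "onInPort", "ip1"),
}


def objects(triples, subject, relation):
    """All objects o such that (subject, relation, o) is in the triple set."""
    return {o for s, r, o in triples if s == subject and r == relation}


def follows_data_link(data_item):
    """True if the item left an output port and arrived at an input port
    that the workflow structure connects with a dataLink."""
    for out_port in objects(TRACE, data_item, "onOutPort"):
        for in_port in objects(TRACE, data_item, "onInPort"):
            if in_port in objects(STRUCTURE, out_port, "dataLink"):
                return True
    return False


def tasks_of_run(workflow_invocation):
    """Map each task invocation started by the given run to the tasks it enacts."""
    result = {}
    for subject, relation, obj in TRACE:
        if relation == "wasStartedBy" and obj == workflow_invocation:
            result[subject] = sorted(objects(TRACE, subject, "wasAssociatedWith"))
    return result
```

With this encoding, `follows_data_link("d")` checks the trace against the declared structure, and `tasks_of_run("wfInv")` recovers which tasks a given run actually invoked.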
  17. DataONE Provenance components II: PBase
      • Mappings: R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
      • In-house components
      • Layers: Neo4J loader, Graph storage, Query layer, indexing, Analytical services
      • Graph storage: Neo4J graph DBMS [AllegroGraph] [Graph-*]
      • Query layer: Cypher (Neo, declarative) [Gremlin (procedural)]; can we do better? Scaling graph queries
      • Indexing: can we do better than the built-in Neo indexing?
      • Analytical services: to be developed
  18. Baseline provenance queries in PBase
      • Ancestors and descendants (lineage): [2,3]
        – Which datasets were involved in the production of data at node “e33”?
        – Reachability: was task “e11_miny” involved in producing data at node “e38”?
      • Execution analysis: [3]
        – Which tasks did not execute to completion for execution X of a given workflow?
        – Find all inputs [outputs] of a given workflow across all its executions
        – Given a data item, find all workflows / tasks that have used it as input
        – Suppose we discover that service S has a bug: which data products were impacted by it?
        – How many times was task T activated across a pool of workflow executions?
      • Provenance differencing: [4]
        – Why do the results from two executions of the same workflow differ?
      • Attribution: [5]
        – Who was responsible for this {data {usage, production}, service invocation}?
      [2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
      [3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for provenance querying and reasoning." In Workshop on the Theory and Practice of Provenance (TaPP). 2012.
      [4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013.
      [5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun, and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.
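
The lineage and reachability questions above reduce to graph traversal. The sketch below is one way to phrase them in plain Python; it is not the PBase implementation, which the slides describe as built on Neo4J and Cypher. The node identifiers `e33`, `e38`, and `e11_miny` are taken from the slide, but the edges between them (and the extra node `e30`) are hypothetical sample data.

```python
from collections import deque

# HYPOTHETICAL sample graph: each node maps to the nodes it directly depends on.
DEPENDS_ON = {
    "e38": ["e11_miny"],
    "e11_miny": ["e33"],
    "e33": ["e30"],
    "e30": [],
}


def ancestors(graph, node):
    """All nodes reachable from `node` by following dependency edges."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def descendants(graph, node):
    """All nodes that depend, directly or indirectly, on `node`."""
    inverted = {}
    for child, parents in graph.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(child)
    return ancestors(inverted, node)


def was_involved(graph, candidate, target):
    """Reachability: did `candidate` contribute to producing `target`?"""
    return candidate in ancestors(graph, target)
```

The "service S has a bug" question is the descendant direction of the same traversal: `descendants(DEPENDS_ON, "e33")` lists everything downstream of a suspect node.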
  19. Application - The social life of research data
      Social provenance for community building
      • We know all about searching in the publications space
        – Who else is working on problems similar to mine?
        – Which results are available?
      • In the data and process space:
        1. Search and discovery
           • Who else has used the {datasets, services, workflows, ...} I am using?
             – How do others rate them?
           • Who used my {datasets, services, workflows, ...}? How were they used?
        2. Reuse, reproduction, validation
           • Can I reproduce these results?
             – using the same exact method
             – using a variation of the method
           • How do I apply this method to my data?
           • ...
  20. From Pull (client queries) to Push (notifications)
      • Uncovering latent connections amongst services / data / people:
        – Ranking, clustering, association rules
        – Requires new similarity metrics
      • A recommender system for scientists
        – Analytics layer activated when new traces are added
      • Challenges:
        – How large a corpus of provenance graphs is needed?
        – How global should the community be?
          • Little new to discover in a small community
        – Requires graphs with rich attribution / association relations
      Diagram labels: Graph storage, Query layer, indexing, Analytical services
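
The push idea can be sketched as an index that is re-evaluated each time a new trace arrives. Everything here is an assumption made for illustration: the slide says new similarity metrics are required, so plain Jaccard similarity over the sets of resources each user has touched is only a stand-in, and `UsageIndex`, `add_trace`, and the threshold value are hypothetical names and settings.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class UsageIndex:
    """Tracks which resources each user has used, based on incoming traces."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.used_by = {}  # user -> set of resource identifiers

    def add_trace(self, user, resources):
        """Record a new trace and return (other_user, score) pairs whose
        usage overlaps enough with this user's to justify a notification."""
        mine = self.used_by.setdefault(user, set())
        mine.update(resources)
        matches = []
        for other, theirs in self.used_by.items():
            if other == user:
                continue
            score = jaccard(mine, theirs)
            if score >= self.threshold:
                matches.append((other, score))
        # Highest score first; ties broken by user name for stable output.
        return sorted(matches, key=lambda pair: (-pair[1], pair[0]))
```

In this sketch the analytics step runs inside `add_trace`, which mirrors the slide's point that the analytics layer is activated when new traces are added rather than when a client queries.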
  21. Credits
      Current members of the DataONE Provenance Working Group:
      In the USA:
        – Bertram Ludaescher, UC Davis (co-lead)
        – Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
        – Saumen Dey, UC Davis (researcher)
        – Parisa Kianmajd, UC Davis (intern)
        – Juliana Freire, NYU-Poly
        – David Koop, NYU-Poly
        – Fernando Chirigati, NYU-Poly
        – Shawn Bowers, Gonzaga University
        – Ilkay Altintas, SDSC/UCSD
        – Karthik Ram, UC Berkeley
        – Yolanda Gil, USC - ISI
        – Yaxing Wei, ORNL
        – Dave Vieglais, DataONE Technical Lead
      In the UK:
        – Paolo Missier, Newcastle University
        – James Cheney, University of Edinburgh
        – Khalid Belhajjame, University of Manchester