Camp 4-data workshop presentation

A presentation at the CAMP-4-DATA workshop, Sept. 6, Lisbon:

http://dcevents.dublincore.org/IntConf/index/pages/view/camp-4-data


Presentation Transcript

  • Provenance Central: More Mileage from Provenance Metadata
    Bertram Ludäscher, UC Davis, USA (ludaesch@ucdavis.edu)
    Paolo Missier, Newcastle University, UK (paolo.missier@ncl.ac.uk)
    Members of the DataONE Provenance Working Group
    CAMP-4-DATA workshop @ iPRES 2013, Sept. 6, 2013, Lisbon, Portugal
    Friday, 6 September 13
  • Outline
    • A foundation for provenance management: the PROV data model
      – From the W3C; a Recommendation as of spring 2013
      – Generic, extensible model
    • The role of provenance in the DataONE project
      – Provenance enables search and discovery, reuse, reproducibility
      – PBase: provenance warehousing
      – Integration with the DataONE architecture
      – Provenance mining: the social life of research data
  • PROV: scope and structure
    Figure (source: http://www.w3.org/TR/prov-overview/). Annotations: Recommendation track; plus: Prov-dictionary
  • PROV Core Elements (graph depiction)
    – Entity: a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
    – Activity: something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
    – Agent: something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
    Figure (example graph) labels: activities drafting, commenting; entities paper1, paper2, draft v1, draft comments; agents Alice, Bob; relations used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf; time axis from remote past to recent past; attributes distribution=internal, status=draft, version=0.1, ex:role=main_editor, type=person, ex:role=sr_editor, prov:role=editor, time=...
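To make the three core element types and the relations on this slide concrete, here is a minimal Python sketch that records the drafting/commenting example and prints PROV-N-style statements. It uses only the standard library; the `ProvGraph` class, the `ex:` identifiers, and the choice of which agent is associated with which activity are illustrative assumptions, since the transcript does not preserve the figure's edges.

```python
from dataclasses import dataclass, field

# Expected (subject kind, object kind) for each PROV core relation used here.
SIGNATURES = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}


@dataclass
class ProvGraph:
    """Hypothetical in-memory store for PROV core elements and relations."""
    elements: dict = field(default_factory=dict)   # id -> "entity" | "activity" | "agent"
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def add(self, kind, ident):
        self.elements[ident] = kind

    def relate(self, relation, subject, obj):
        # Reject relations whose endpoints have the wrong element kind.
        expected = SIGNATURES[relation]
        actual = (self.elements.get(subject), self.elements.get(obj))
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.relations.append((relation, subject, obj))

    def to_provn(self):
        lines = [f"{kind}({ident})" for ident, kind in self.elements.items()]
        lines += [f"{rel}({s}, {o})" for rel, s, o in self.relations]
        return lines


graph = ProvGraph()
for ident in ("ex:paper1", "ex:paper2", "ex:draft_v1", "ex:draft_comments"):
    graph.add("entity", ident)
for ident in ("ex:drafting", "ex:commenting"):
    graph.add("activity", ident)
for ident in ("ex:Alice", "ex:Bob"):
    graph.add("agent", ident)

graph.relate("used", "ex:drafting", "ex:paper1")
graph.relate("used", "ex:drafting", "ex:paper2")
graph.relate("wasGeneratedBy", "ex:draft_v1", "ex:drafting")
graph.relate("used", "ex:commenting", "ex:draft_v1")
graph.relate("wasGeneratedBy", "ex:draft_comments", "ex:commenting")
# Assumed edges: the agent/activity pairing is not recoverable from the transcript.
graph.relate("wasAssociatedWith", "ex:drafting", "ex:Alice")
graph.relate("wasAssociatedWith", "ex:commenting", "ex:Bob")
graph.relate("actedOnBehalfOf", "ex:Bob", "ex:Alice")

print("\n".join(graph.to_provn()))
```

The signature check is the point of the sketch: `used` runs from an activity to an entity, while `wasGeneratedBy` runs from an entity to an activity, so a statement with swapped endpoints is rejected.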
  • Summary of the PROV Core model
    – PROV-DC mapping available
    – Recent tutorial @EDBT'13 (June, 2013) [1]
      • Model, Constraints, Applications
    [1] Missier, Paolo, Khalid Belhajjame, and James Cheney. "The W3C PROV Family of Specifications for Modelling Provenance Metadata." In Procs. EDBT'13 (Tutorial). Genova, Italy: ACM, 2013.
  • PROV-DM relations at a glance (figure)
  • Context: ProvWG@DataONE
    • DataONE: Data Observation Network for Earth
      – 5-year NSF DataNet data preservation project (current phase)
      – Provides a large-scale, federated data infrastructure to the Earth Sciences community
    • Provenance Working Group
      – Active until July 2014 (current phase; looking at extending)
      – One or two interns per year since 2010
      – One dedicated researcher (postdoc) since 2012
      – 12 core members, additional guest members on a rotation
        • Specific focus on the provenance of workflow-based e-science data
  • DataONE collaboration scenario - 2012
    – Alice's workflow generates benchmark climate data for model comparison
    – Input is retrieved from DataONE to generate an output file
  • DataONE collaboration scenario - 2012 (continued)
    – The workflow, provenance, and other metadata are uploaded to DataONE
    – A data package is created and indexed
  • Searching
    – Bob searches based on keywords in the metadata
      ➡ including provenance terms
    – Bob discovers Alice's workflow. He may be able to execute it again
  • PBase and DataONE
    Figure (architecture diagram) labels: System Metadata; Extract-Align-Augment Metadata; Science Data; Search API; Science Metadata; Provenance Curation; Index; Identifiers / Text fields; Graph Structure; ProvExplorer; Internal Metadata Index; Repository; PBase / D-PROV; Querying
    – Provenance traces in PBase are linked to DataONE packages
    – Provenance traces are indexed for searching
  • DataONE Provenance components I: D-PROV
    – D-PROV extends PROV
    – Connects trace metadata to workflow structure
    Figure (example graph) labels: wf, T1, T2, op1, ip1, dataLink, wfInv, T1Inv, T2Inv, d; isTaskOf, hasInputPort, hasOutputPort, wasAssociatedWith, wasStartedBy, onOutPort, onInPort
    Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. "D-PROV: Extending the PROV Provenance Model with Workflow Structure." In Procs. TAPP'13. Lombard, IL, 2013.
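As a rough illustration of what "connecting trace metadata to workflow structure" can mean, the sketch below stores the figure's labels as subject-predicate-object triples and checks that a data item observed in the trace moved along a data link declared in the workflow. The edge directions, the treatment of `dataLink` as a relation from `op1` to `ip1`, and both helper functions are assumptions made for this example; they are not taken from the D-PROV paper.

```python
# Workflow structure (assumed reading of the figure labels).
STRUCTURE = {
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
}

# Execution trace (assumed reading of the figure labels).
TRACE = {
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
    ("d", "onOutPort", "op1"),
    ("d", "onInPort", "ip1"),
}


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in triples."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def follows_data_link(data_item, structure=STRUCTURE, trace=TRACE):
    """True if each input port that received data_item is linked from a port that emitted it."""
    out_ports = objects(trace, data_item, "onOutPort")
    in_ports = objects(trace, data_item, "onInPort")
    return all(
        any((out, "dataLink", inp) in structure for out in out_ports)
        for inp in in_ports
    )


def task_of(invocation, trace=TRACE):
    """Map a task invocation in the trace back to the workflow element it executed."""
    return objects(trace, invocation, "wasAssociatedWith")


print(follows_data_link("d"))
print(task_of("T1Inv"))
```

Keeping structure and trace in separate triple sets mirrors the slide's point: the trace alone says what happened, and the structure says where in the workflow it happened.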
  • DataONE Provenance components II: PBase
    – Mappings: R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
    – Stack: Neo4J loader, Graph storage, Query layer, indexing, Analytical services
    – In-house components
    – Neo4J graph DBMS [AllegroGraph] [Graph-*]
    – Cypher (Neo, declarative) [Gremlin (procedural)]: can we do better? Scaling graph queries
    – Can we do better than the built-in Neo indexing?
    – To be developed
  • Baseline provenance queries in PBase
    Ancestors and descendants (lineage): [2,3]
    – Which datasets were involved in the production of data at node "e33"?
    – Reachability: was task "e11_miny" involved in producing data at node "e38"?
    Execution analysis: [3]
    – Which tasks did not execute to completion for execution X of a given workflow?
    – Find all inputs [outputs] of a given workflow across all its executions
    – Given a data item, find all workflows / tasks that have used it as input
    – Suppose we discover that service S has a bug: which data products were impacted by it?
    – How many times was task T activated across a pool of workflow executions?
    Provenance differencing: [4]
    – Why do the results from two executions of the same workflow differ?
    Attribution: [5]
    – Who was responsible for this {data {usage, production}, service invocation}?
    [2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
    [3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for provenance querying and reasoning." In Workshop on the Theory and Practice of Provenance (TaPP). 2012.
    [4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. "Provenance and Data Differencing for Workflow Reproducibility Analysis." Concurrency and Computation: Practice and Experience, 2013.
    [5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun, and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.
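The lineage, reachability, and impact questions on this slide all reduce to graph traversal. The following Python sketch shows one way to answer them over a small dependency graph using breadth-first search. The node names "e33", "e38", and "e11_miny" come from the slide, but the edges, the other nodes, and the function names are made up for illustration, and the code stands in for whatever PBase actually runs on Neo4J.

```python
from collections import deque

# node -> nodes it directly depends on (used / wasGeneratedBy collapsed into one edge type).
# Hypothetical edges; only the node names e33, e38 and e11_miny appear on the slide.
DEPENDS_ON = {
    "e38": {"e33", "e20"},
    "e33": {"e11_miny"},
    "e11_miny": {"e01"},
    "e20": {"e01"},
    "e01": set(),
}


def ancestors(node, depends_on=DEPENDS_ON):
    """Everything that node transitively depends on (its lineage)."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent in depends_on.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def descendants(node, depends_on=DEPENDS_ON):
    """Everything transitively derived from node, found by inverting the edges."""
    derived = {}
    for child, parents in depends_on.items():
        for parent in parents:
            derived.setdefault(parent, set()).add(child)
    return ancestors(node, derived)


def was_involved(source, target, depends_on=DEPENDS_ON):
    """Reachability: did source contribute to producing target?"""
    return source in ancestors(target, depends_on)


print(sorted(ancestors("e33")))          # datasets and tasks behind e33
print(was_involved("e11_miny", "e38"))   # reachability question from the slide
print(sorted(descendants("e11_miny")))   # products impacted if e11_miny were buggy
```

The "service S has a bug" question is the `descendants` call: every node reachable forward from the faulty step is a candidate for re-computation.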
  • Application - The social life of research data
    • We know all about searching in the publications space
      – who else is working on problems similar to mine?
      – which results are available?
    • In the data and process space:
      1. Search and discovery
         • Who else has used the {datasets, services, workflows,...} I am using?
           – how do others rate them?
         • Who used my {datasets, services, workflows,...}? How were they used?
      2. Reuse, reproduction, validation
         • Can I reproduce these results?
           – using the same exact method
           – using a variation of the method
         • How do I apply this method to my data?
         • ...
    Social provenance for community building
  • From Pull (client queries) to Push (notifications)
    • Uncovering latent connections amongst services / data / people:
      – Ranking, clustering, association rules
      – Requires new similarity metrics
    • A recommender system for scientists
      – Analytics layer activated when new traces are added
    • Challenges:
      – How large a corpus of provenance graphs is needed?
      – How global should the community be?
        • Little new to discover in a small community
      – Requires graphs with rich attribution / association relations
    Figure labels: Graph storage; Query layer; indexing; Analytical services
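The slide calls for new similarity metrics without naming one. As a placeholder, the sketch below uses plain Jaccard similarity over the sets of resources each scientist has used (as could be derived from attribution and association relations) and recommends resources used by sufficiently similar peers. The usage data, the threshold, and the function names are hypothetical, and Jaccard is a stand-in, not the metric the authors propose.

```python
# scientist -> resources (datasets, workflows, services) they have used.
# Hypothetical data for illustration only.
USAGE = {
    "alice": {"ds1", "ds2", "wfA"},
    "bob": {"ds2", "wfA", "svcX"},
    "carol": {"ds9"},
}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, usage=USAGE, min_similarity=0.3):
    """Rank resources the user has not used, weighted by similarity to their users."""
    scores = {}
    for other, items in usage.items():
        if other == user:
            continue
        similarity = jaccard(usage[user], items)
        if similarity < min_similarity:
            continue  # too dissimilar to be a useful signal
        for item in items - usage[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


print(jaccard(USAGE["alice"], USAGE["bob"]))
print(recommend("alice"))
print(recommend("carol"))
```

In a push setting, `recommend` would be re-run for affected users whenever a new trace adds entries to the usage map. The empty result for "carol" illustrates the slide's concern that a small or disconnected community leaves little to discover.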
  • Credits
    Current members of the DataONE Provenance Working Group:
    In the USA:
    – Bertram Ludaescher, UC Davis (co-lead)
    – Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
    – Saumen Dey, UC Davis (researcher)
    – Parisa Kianmajd, UC Davis (intern)
    – Juliana Freire, NYU-Poly
    – David Koop, NYU-Poly
    – Fernando Chirigati, NYU-Poly
    – Shawn Bowers, Gonzaga University
    – Ilkay Altintas, SDSC/UCSD
    – Karthik Ram, UC Berkeley
    – Yolanda Gil, USC - ISI
    – Yaxing Wei, ORNL
    – Dave Vieglais, DataONE Technical Lead
    In the UK:
    – Paolo Missier, Newcastle University
    – James Cheney, University of Edinburgh
    – Khalid Belhajjame, University of Manchester