Provenance: From e-Science to the Web Of Data

Provenance is broadly defined as the origin or source from which something comes and the history of its subsequent owners. In data-, process- and computation-intensive disciplines, provenance focuses on describing and understanding where and how data is produced, the actors involved in its production, and the processes applied to it. Provenance has been a hot topic in scientific disciplines in recent years, with a strong emphasis on eScience, where technologies and means for representing provenance have been proposed with varying degrees of expressivity. As the amount of data involved has grown across domains, provenance models have evolved into semantic overlays, which describe provenance at different levels of granularity and make it easier for users to understand. The need for provenance analysis has now expanded beyond scientific domains into the Web of Data. The abundance of data is encouraging organizations and governments to publish and expose their data through the Linked Data initiative, so that it is available to the public and can be reused for many purposes. However, while a significant number of large, interlinked data sets, such as those of the UK government and the BBC web sites, are now becoming publicly available, important challenges must still be addressed before this vision can be achieved. Among them, provenance is one of the most prominent, since it is needed to guarantee data quality, trustworthiness and reliability in the Web of Data. In this talk, we offer insight into provenance, from eScience to the Web of Data, describing old problems and new challenges that need to be addressed in the coming years.

Published in: Technology

Transcript

  • 1. iSOCO Provenance: From eScience to the Web of Data José Manuel Gómez Pérez Invited Lecture CETINIA 17/11/2009
  • 2. Agenda
    - Introduction to Provenance
    - Semantic Overlays for Provenance Analysis
    - The Web of Data
    - Provenance in the Web of Data
  • 3. Provenance is…
    - Records of origin or source from which something comes
    - History of subsequent owners (change of custody)
    Adapted from James Cheney’s Principles of Provenance
  • 4. Provenance is…
    - Evidence of authenticity, integrity, and quality
    - Certifies products of good process
    Adapted from James Cheney’s Principles of Provenance
  • 5. Provenance is…
    - Valuable
    - Hard to collect and verify
    - Necessary to assign credit… and blame, i.e. establish trust
    Adapted from James Cheney’s Principles of Provenance
  • 6. Why provenance of electronic data is difficult
    Paper data:
    - Creation process leaves a paper trail
    - Easier to detect modification, copy, forgery
    - Usually, one can judge a book by the cover
    Electronic data:
    - Often, there is no bits trail
    - Easy to forge, plagiarize, and modify data
    - There is no cover to judge by
    Addressing this requires explicitly representing the provenance of data, storing it, keeping it secure, and reasoning with it.
    Adapted from James Cheney’s Principles of Provenance
  • 7. Provenance in eScience
    One of the most active fields in provenance development
    Curated scientific biological databases:
    - Ensure database quality
    - Need provenance for data quality control and accountability
    - Currently done manually by curators
    Scientific workflows (grid computing):
    - Abstract process execution complexity
    - Need provenance for process reproducibility, efficiency
    - Currently supported by ad-hoc systems
  • 8. Past approaches to provenance in eScience
  • 9. Agenda
    - Introduction to Provenance
    - Semantic Overlays for Provenance Analysis
    - The Web of Data
    - Provenance in the Web of Data
  • 10. Provenance analysis of process executions?
  • 11. Semantic overlays for provenance analysis
    Objective: to support domain experts in understanding process executions
    Problem Solving Methods (PSMs) (McDermott 1988):
    - Provide reusable guidelines to formulate process knowledge
    - Support reasoning
    - Describe the main rationale behind a process
    Diagram labels: How, What, Whom; Semantic Overlays, PROVENANCE, SMEs
  • 12. PSM perspectives
    Task-method decomposition:
    - Hierarchically defines how tasks decompose into simpler (sub)tasks
    - Describes tasks at several levels of detail
    - Provides alternative ways to achieve a task
    Interaction:
    - Black-box perspective
    - PSM establishes and controls the sequence of actions required to perform a task
    - Defines knowledge required at each task step
    Knowledge flow:
    - Knowledge transformation within the PSM
    Diagram legend: Task, Method, Role
  • 13. Towards knowledge provenance
    PSMs as semantic overlays on top of existing process documentation
    - Task: WHAT is going to be achieved by executing a process
    - PSM: HOW
    Provenance, from a knowledge perspective:
    - How recorded provenance relates to the execution of a process
    - Simpler process analysis, proposing decompositions into simpler subprocesses
    - Visualize provenance at different levels of detail
    Supporting domain experts in two main ways:
    - Validation of process executions
    - Identification of reasoning patterns in process executions
    Figure source: myGrid
  • 14. The twig join function
    Based on XML pattern matching algorithms on Directed Acyclic Graphs (Bruno et al., 2002)
    twig_join detects the occurrence of a pattern in an XML DAG
    Given:
    - P, a process
    - T, a task potentially describing P
    - M, a PSM providing a strategy on how to achieve T
    - i(T), the set of input roles of T
    - o(T), the set of output roles of T
    - D, the DAG resulting from documenting the execution of P
    twig_join(D, i(T), o(T)) checks whether a twig exists for M that connects i(T) with o(T) in D
    In this case, PSM M is the pattern to be identified in the process documentation DAG D
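The check described on slide 14 can be illustrated with a much simpler stand-in. The sketch below is not the holistic twig join of Bruno et al. and not the KOPE implementation; it only tests plain reachability, i.e. whether every output role is connected to some input role in the documentation DAG. The DAG contents, node names and the dictionary encoding are hypothetical, chosen for illustration.

```python
from collections import deque


def reachable(dag, start):
    """Return the set of nodes reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in dag.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def twig_join(dag, inputs, outputs):
    """Simplified stand-in: True if every output role is reachable
    from at least one input role in the DAG."""
    if not inputs or not outputs:
        return False
    covered = set()
    for src in inputs:
        covered |= reachable(dag, src)
    return all(out in covered for out in outputs)


# Hypothetical process documentation DAG D (node -> successors).
D = {
    "raw_image": ["aligned_image"],
    "reference": ["aligned_image"],
    "aligned_image": ["resliced_image"],
    "resliced_image": ["atlas"],
    "log": [],
}

connected = twig_join(D, {"raw_image", "reference"}, {"atlas"})
disconnected = twig_join(D, {"log"}, {"atlas"})
```

A real twig join would additionally match the internal structure of the PSM pattern M against D, which this reachability test deliberately leaves out.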
  • 15. A twig join example in provenance analysis
    Figure labels: Domain entities, Bridges (mapping), PSM entities; twig join!
  • 16. The matching algorithm
    - twig_join recursively applied at each decomposition level
    - Each task decomposed by one or several PSMs (task-method decomposition view)
    - Knowledge flow defines the sequence of evaluation
    - Backtracking possible at PSM and role levels
    Diagram: twig_join(Ti, D), then decompose(Ti), then twig_join(T11, D), twig_join(T12, D), twig_join(T13, D), twig_join(T14, D); views shown: Task-method decomposition, Knowledge flow, Interaction
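The recursion on slide 16 can be sketched as follows, under stated assumptions. The `matches` callback is a hypothetical stand-in for twig_join(Ti, D), the catalogue layout (task name mapped to alternative PSMs, each with subtasks listed in knowledge-flow order) is invented for illustration, and only PSM-level backtracking is shown; role-level backtracking is omitted. Returning a task-level match when no PSM fits all subtasks is also an assumption of this sketch.

```python
def match_task(task, catalogue, matches):
    """Return a nested match tree for task, or None if task does not match.

    catalogue: task name -> list of (psm_name, [subtasks in knowledge-flow order])
    matches:   callable standing in for twig_join(task, D)
    """
    if not matches(task):
        return None
    alternatives = catalogue.get(task, [])
    for psm_name, subtasks in alternatives:
        children = []
        for sub in subtasks:
            child = match_task(sub, catalogue, matches)
            if child is None:
                break  # this PSM fails: backtrack and try the next one
            children.append(child)
        else:
            # every subtask matched under this PSM
            return {"task": task, "psm": psm_name, "subtasks": children}
    # primitive task, or no PSM matched all of its subtasks
    return {"task": task, "psm": None, "subtasks": []}


# Hypothetical catalogue: T1 can be achieved by psm_a or psm_b.
catalogue = {
    "T1": [("psm_a", ["T11", "T12"]), ("psm_b", ["T13", "T14"])],
}
# Hypothetical set of tasks for which the twig join succeeds.
observed = {"T1", "T11", "T13", "T14"}

result = match_task("T1", catalogue, lambda t: t in observed)
missing = match_task("T9", catalogue, lambda t: t in observed)
```

Here psm_a is abandoned because T12 is not found, and the search backtracks to psm_b, whose subtasks T13 and T14 both match.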
  • 17. KOPE: A Knowledge-Oriented Provenance Environment
    Diagram labels: PSM-Ontology bridges, Matching visualization, Provenance query, Matching detection
  • 18. KOPE Evaluation
    Figure labels: PSM Catalogue (Task-Method Decomposition), Brain Atlas Workflow, PSM Catalogue (Knowledge Flow)
  • 19. KOPE video
  • 20. KOPE evaluation (II)
    Focus on precision and recall metrics, identified at three different layered contexts:
    - Method
    - Task
    - Decomposition-level
    Goal 1: To identify the main rationale behind process executions by detecting occurrences of semantic overlays in their logs
    Goal 2: To exploit the structure of semantic overlays to describe process executions at different levels of detail
    Chart: precision and recall (0% to 120%) for Level1 to Level4
    Legend: Perfect match, Partial match, No match
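For readers unfamiliar with the two metrics named on slide 20, a minimal sketch of how precision and recall are computed from a set of detected overlays and a set of expected ones. The task names are invented and are not the KOPE evaluation data.

```python
def precision_recall(detected, expected):
    """Precision: share of detected items that are correct.
    Recall: share of expected items that were detected."""
    detected, expected = set(detected), set(expected)
    true_positives = len(detected & expected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: 3 overlays detected, 4 expected, 2 in common.
p, r = precision_recall(
    {"align", "reslice", "convert"},
    {"align", "reslice", "slice", "softmean"},
)
```

With two of three detections correct and two of four expected items found, precision is 2/3 and recall is 1/2.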
  • 21. Agenda
    - Introduction to Provenance
    - Semantic Overlays for Provenance Analysis
    - The Web of Data
    - Provenance in the Web of Data
  • 22. WAKE UP VIDEO!
  • 23. While the economy contracts, the digital universe expands…
    - In 2006, the size of the digital universe was estimated at 161 exabytes, 3 million times the information in all books ever written
    - By 2010, it is expected to reach 988 exabytes
    - …and all this data is potentially exposed online
    Source: IDC
  • 24. Web data
  • 25. The Linked Data paradigm
    Tim Berners-Lee, 2006 (Design Issues)
    1. Use URIs to identify things
       - Anything, not just documents
    2. Use HTTP URIs for people to look up such names
       - Globally unique names
       - Distributed ownership
    3. Provide useful information in RDF upon URI resolution
    4. Include RDF links to other URIs
       - Enable discovery of related information
    How can we exploit all the available data?
    - Data reuse and remix
    - Common flexible and usable APIs
    - Standard vocabularies to describe interlinked datasets
    - Tools
    - Realize the Semantic Web vision
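Principles 3 and 4 above (useful information on URI resolution, links to other URIs) can be sketched without any network access. In the sketch below an in-memory dictionary stands in for HTTP dereferencing, and the example.org and example.net URIs are hypothetical; only the rdfs:label and owl:sameAs predicate URIs are real vocabulary terms.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical "web": URI -> list of (predicate, object) pairs returned on lookup.
WEB = {
    "http://example.org/id/madrid": [
        (RDFS_LABEL, "Madrid"),
        (OWL_SAME_AS, "http://example.net/place/madrid"),
    ],
    "http://example.net/place/madrid": [
        (RDFS_LABEL, "Madrid (city)"),
    ],
}


def dereference(uri):
    """Stand-in for an HTTP lookup that returns RDF statements about uri."""
    return WEB.get(uri, [])


def discover(start, max_hops=2):
    """Follow links between URIs to collect related descriptions."""
    found = {}
    frontier = [start]
    for _ in range(max_hops + 1):
        next_frontier = []
        for uri in frontier:
            if uri in found:
                continue
            found[uri] = dereference(uri)
            for _predicate, obj in found[uri]:
                if obj.startswith("http://") and obj not in found:
                    next_frontier.append(obj)
        frontier = next_frontier
    return found


found = discover("http://example.org/id/madrid")
labels = sorted(o for stmts in found.values() for p, o in stmts if p == RDFS_LABEL)
```

Starting from one URI, the owl:sameAs link leads to a second dataset's description of the same thing, which is the discovery behaviour the fourth principle aims for.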
  • 26. The Linked Data Cloud (May 2007)
  • 27. The Linked Data Cloud (August 2007)
  • 28. The Linked Data Cloud (March 2008)
  • 29. The Linked Data Cloud (September 2008)
  • 30. The Linked Data Cloud (March 2009)
  • 31. The Web of Data
    - Apply the Linked Data principles to expose open datasets in RDF
    - Define RDF links between data items for different datasets
    - Over 7.5 billion triples, 5 million links (as of November 2009)
  • 32. Linked Data going mainstream
  • 33. Agenda
    - Introduction to Provenance
    - Semantic Overlays for Provenance Analysis
    - The Web of Data
    - Provenance in the Web of Data
  • 34. A real-life example
    Linking and exploiting distributed data sets without the means to verify their provenance can be harmful, especially in sensitive domains.
    - Two fake web sites
    - A fake Wikipedia entry
    - Fake California public safety phone numbers
    The hoax caused a 1000-word tome in the Frankfurter Allgemeine Zeitung… and public apologies from DPA
    Trust in Wikipedia misled DPA
    In a provenance-aware world, DPA would have had means based on data provenance to automatically check that:
    - The town did not exist
    - The Berlin Boys do not exist
    - The reporting local TV station does not exist
  • 35. The Linked Data flow
    Diagram layers, top to bottom:
    - Linked Data applications
    - Exploit Linked Data (SPARQL EPRs)
    - Linked Data
    - Publish Linked Data (RDF, HTTP, URIs)
    - Web documents, multimedia, legacy resources (e.g. DBs, XML repositories)
    Provenance spans the flow: data trustworthiness, data quality, data lineage
  • 36. Provenance and Linked Data
    Linked Data is largely about reusing. However, reusing data from 3rd parties requires knowing its provenance!!!
    Is the data reliable? Is the quality of the data good?
    Provenance shall provide the ability to:
    - Trace the sources of data
    - Enable the exploration of relationships between datasets, their authors and affiliations
    Provenance analysis shall provide insight into how data is produced and exploited
    Provenance should create a notion of information quality:
    - Is a certain dataset consistent and up to date?
    - Is the connection between two interlinked datasets meaningful?
    - Is a given dataset relevant for a particular domain?
    Provenance to establish information trustworthiness
    Provenance to provide data views following some criteria
  • 37. Provenance challenges in the Web of Data
    Provenance information needs to be:
    - Represented
    - Captured and recorded
    - Stored and secured, queried, and reasoned about
    - Visualized and browsed
  • 38. A Provenance architecture for the Web of Data
    Authoritative agencies required to certify and keep data provenance secure!!!
  • 39. Semantics in support of provenance in the Web of Data
    Semantic Web stack alongside a Provenance stack ("This, we still need to define!"):
    - Information quality inference
    - Trust inference
    - Reasoning with provenance
    - Provenance querying
    - Provenance capture
    - Provenance access policy definition
    - Provenance encryption
  • 40. Towards a model of Web Data provenance
    Adapted from Olaf Hartig’s Provenance Information in the Web of Data
    Provenance represented as a graph:
    - Nodes: provenance elements (pieces of provenance information)
    - Edges: relate provenance elements to each other
    - Subgraphs for related data items possible
    Provenance models define:
    - Types of provenance elements (roles)
    - Relationships between them
    Element types shown: Actor, Execution, Artifact
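The graph model on slide 40 can be made concrete with a small sketch. The three node types (actor, execution, artifact) come from the slide; the relation names ("produced_by", "used", "performed_by"), the dataset names and the people are hypothetical and do not reproduce Hartig's vocabulary.

```python
from collections import defaultdict

# Hypothetical provenance elements and their types.
NODE_TYPES = {
    "dataset_v2": "artifact",
    "dataset_v1": "artifact",
    "survey_raw": "artifact",
    "merge_run": "execution",
    "cleaning_run": "execution",
    "alice": "actor",
    "bob": "actor",
}

# Hypothetical edges: (source element, relation, target element).
EDGES = [
    ("dataset_v2", "produced_by", "merge_run"),
    ("merge_run", "used", "dataset_v1"),
    ("merge_run", "performed_by", "bob"),
    ("dataset_v1", "produced_by", "cleaning_run"),
    ("cleaning_run", "used", "survey_raw"),
    ("cleaning_run", "performed_by", "alice"),
]


def build_index(edges):
    """Index outgoing edges by source element."""
    index = defaultdict(list)
    for src, relation, dst in edges:
        index[src].append((relation, dst))
    return index


def trace(item, index):
    """Collect every provenance element reachable from item."""
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for _relation, dst in index.get(node, ()):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


index = build_index(EDGES)
lineage = trace("dataset_v2", index)
actors = sorted(n for n in lineage if NODE_TYPES[n] == "actor")
# Original sources: artifacts in the lineage with no recorded provenance of their own.
sources = sorted(
    n for n in lineage if NODE_TYPES[n] == "artifact" and not index.get(n)
)
```

Walking the graph backwards from a data item answers the two questions raised earlier in the talk: which sources the data came from, and which actors were involved in producing it.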
  • 41. Provenance-related vocabularies
    - DC: Dublin Core Metadata Terms
    - FOAF: Friend of a Friend
    - SIOC: Semantically-Interlinked Online Communities
    - SWP: Semantic Web Publishing vocabulary
    - WOT: Web Of Trust schema
    - VOiD: VOcabulary of Interlinked Datasets
    However, general lack of provenance-related metadata on the Web of Data!
  • 42. Action points
    Provenance vocabularies:
    - Represent and reason with trust and information quality
    - Extend emerging Linked Data vocabularies (VOiD)
    Awareness of data providers:
    - W3C Provenance IG
    - Provenance authoritative agencies
    - Linked Data standards (VOiD again)
    Tools for data providers:
    - Generation of provenance metadata
    - Provenance visualization
  • 43. An example of provenance visualization
  • 44. Questions
  • 45. Thanks for your attention!
    José Manuel Gómez-Pérez, R&D Director
    T +34913349778
    M +34609077103
    jmgomez@isoco.com
    iSOCO: For more information on how it can help your company optimize its digital business and provide an innovative solution, contact us at www. .com
    - Barcelona: Tel +34 93 5677200, Edificio Testa A, C/ Alcalde Barnils 64-68, St. Cugat del Vallès, 08174 Barcelona
    - Madrid: Tel +34 91 3349797, C/Pedro de Valdivia, 10, 28006 Madrid
    - Valencia: Tel +34 96 3467143, Oficina 107, C/ Prof. Beltrán Báguena 4, 46009 Valencia