iSOCO


Provenance: From eScience to the Web of Data

           José Manuel Gómez Pérez

                Invited Lecture
...
Agenda




Introduction to Provenance
Semantic Overlays for Provenance Analysis
The Web of Data
Provenance in the Web of D...
Provenance is…




Records of

Origin or source from
which something comes
History of subsequent
owners (change of
custody...
Provenance is…




Evidence of authenticity, integrity,
and quality
Certifies products of good process




               ...
Provenance is…




Valuable
Hard to collect and verify
Necessary to assign credit
…and blame

i.e. establish

Trust
      ...
Why provenance of electronic data is difficult


      Paper data                        Electronic data


Creation proces...
Provenance in eScience




One of the most active fields in Provenance development

Curated scientific biological databases
...
Past approaches to provenance in eScience




Agenda




Introduction to Provenance
Semantic Overlays for Provenance Analysis
The Web of Data
Provenance in the Web of D...
Provenance analysis of process executions




?
Semantic overlays for provenance analysis


Objective: To support domain experts in
                                      ...
PSM perspectives

  Task-method                      Interaction
 decomposition




                                      ...
Towards knowledge provenance


                                           PSMs as semantic overlays on top
               ...
The twig join function


    Based on XML pattern matching algorithms on Directed Acyclic
    Graphs (Bruno et al., 2002)
...
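As a rough illustration of the check this slide describes (does pattern M connect the input roles i(T) with the output roles o(T) in the documentation DAG D?), here is a minimal sketch. It is not the holistic twig join of Bruno et al.: it simplifies the tree-shaped twig to a linear sequence of step labels and uses plain depth-first search. The graph encoding, the `labels` mapping and all node and role names are assumptions made for the example.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node id -> successor node ids (assumed acyclic)


def twig_join(dag: Graph, labels: Dict[str, str],
              inputs: Set[str], outputs: Set[str],
              pattern: List[str]) -> bool:
    """Return True if some path from an input-role node to an output-role
    node visits the pattern's step labels in order (linear simplification)."""

    def search(node: str, idx: int) -> bool:
        # Advance in the pattern when this node matches the next expected step.
        if idx < len(pattern) and labels.get(node) == pattern[idx]:
            idx += 1
        # Success: whole pattern seen and we reached an output role.
        if idx == len(pattern) and labels.get(node) in outputs:
            return True
        return any(search(nxt, idx) for nxt in dag.get(node, []))

    starts = [n for n in dag if labels.get(n) in inputs]
    return any(search(s, 0) for s in starts)


# Hypothetical process documentation: raw -> align -> avg -> atlas
dag = {"raw": ["align"], "align": ["avg"], "avg": ["atlas"], "atlas": []}
labels = {"raw": "input_data", "align": "transform",
          "avg": "aggregate", "atlas": "result"}

found = twig_join(dag, labels, {"input_data"}, {"result"},
                  ["transform", "aggregate"])
print(found)
```

The search is exponential in the worst case; a real implementation would index the DAG or memoize visited (node, pattern position) pairs.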
A twig join example in provenance analysis


Domain      Bridges    PSM entities
entities   (mapping)




                ...
The matching algorithm

                                                •   twig_join recursively applied at
             ...
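The recursion on this slide (apply twig_join to a task, decompose it, then apply twig_join to each subtask in knowledge-flow order, backtracking to another PSM on failure) can be sketched as follows. This is an assumption-laden sketch, not KOPE's implementation: the catalogue layout is hypothetical, the `twig_join` argument stands in for twig_join(Ti, D) with D already bound, and backtracking at role level is not modelled.

```python
from typing import Callable, Dict, List

# Hypothetical PSM catalogue: task -> alternative PSMs, each given as the
# list of subtasks in the order defined by its knowledge flow.
Catalogue = Dict[str, List[List[str]]]


def match_task(task: str, catalogue: Catalogue,
               twig_join: Callable[[str], bool]) -> bool:
    """Check a task against the log, descending the task-method decomposition."""
    if not twig_join(task):          # the task itself must be found in the log
        return False
    alternatives = catalogue.get(task, [])
    if not alternatives:             # leaf task: nothing left to decompose
        return True
    # Backtracking at PSM level: if one PSM fails, try the next alternative.
    return any(all(match_task(sub, catalogue, twig_join) for sub in psm)
               for psm in alternatives)


# Task T1 can be achieved by PSM A (T11, T12) or PSM B (T13, T14).
catalogue = {"T1": [["T11", "T12"], ["T13", "T14"]]}
observed = {"T1", "T13", "T14"}      # tasks a stand-in matcher "finds" in the log

matched = match_task("T1", catalogue, observed.__contains__)
print(matched)
```

Here PSM A fails at T11, so the matcher falls back to PSM B, which succeeds.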
KOPE: A Knowledge-Oriented Provenance Environment



                  PSM-             Matching
                 Ontology...
KOPE Evaluation
                    PSM Catalogue
                     Task-Method
                    Decomposition




B...
KOPE video




KOPE evaluation (II)

 120%
                                                                Focus on precision and recall
...
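The slide reports precision and recall without spelling out how they were computed. Assuming the standard definitions (precision: share of detected overlay occurrences that are correct; recall: share of expected occurrences that were detected), a minimal sketch looks like this. The occurrence identifiers are made up for the example and are not taken from the KOPE evaluation.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Standard precision and recall over two sets of occurrence identifiers."""
    true_positives = len(detected & expected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical data: four detected overlay occurrences, three expected ones.
detected = {"occ_a", "occ_b", "occ_c", "occ_d"}
expected = {"occ_a", "occ_b", "occ_e"}

precision, recall = precision_recall(detected, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```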
Agenda




Introduction to Provenance
Semantic Overlays for Provenance Analysis
The Web of Data
Provenance in the Web of D...
WAKE UP VIDEO!




While the economy contracts, the digital universe expands…




Source: IDC
                                       In 2006,...
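To put the two IDC figures on this slide in proportion (161 exabytes in 2006, 988 exabytes expected by 2010), the implied growth can be worked out directly. The compound annual rate below is derived here from those two numbers only and assumes smooth year-on-year growth; it is not a figure quoted in the talk.

```python
size_2006_eb = 161   # exabytes, from the slide
size_2010_eb = 988   # exabytes, from the slide
years = 2010 - 2006

growth_factor = size_2010_eb / size_2006_eb
annual_rate = growth_factor ** (1 / years) - 1   # compound annual growth

print(f"growth factor: {growth_factor:.2f}x over {years} years")
print(f"implied annual growth: {annual_rate:.1%}")
```

That is roughly a sixfold increase in four years, or growth somewhat above 50% per year.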
Web data




The Linked Data paradigm



                                  Tim Berners-Lee, 2006 (Design Issues)
         How can we
  ...
The Linked Data Cloud (May 2007)




The Linked Data Cloud (August 2007)




The Linked Data Cloud (March 2008)




The Linked Data Cloud (September 2008)




The Linked Data Cloud (March 2009)




The Web of Data


Apply the Linked Data principles to expose open datasets in
RDF
Define RDF links between data items for ...
Linked Data going mainstream




Agenda




Introduction to Provenance
Semantic Overlays for Provenance Analysis
The Web of Data
Provenance in the Web of D...
A real-life example


  Linking and exploiting distributed data sets without the
means that allow contrasting its provenan...
The Linked Data flow




 Linked Data applications

                                     Data trustworthiness

           ...
Provenance and Linked Data


Linked Data is largely about reusing. However, reusing data from 3rd
parties requires knowing...
Provenance challenges in the Web of Data




Provenance information needs to be

Represented
Captured and recorded
Stored ...
A Provenance architecture for the Web of Data



   Authoritative
agencies required
to certify and keep
 data provenance
 ...
Semantics in support of provenance in the Web of Data


Semantic Web                                    Provenance
   stac...
Towards a model of Web Data provenance

                                               Adapted from Olaf Hartig’s Provenan...
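The graph view on this slide (nodes are provenance elements of type Actor, Execution or Artifact; edges relate them) can be sketched with a small in-memory structure. This is only an illustration of the idea: the class, the relation names (`producedBy`, `performedBy`, `used`) and the example identifiers are hypothetical and are not taken from Hartig's model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ProvenanceGraph:
    # node id -> element type ("Actor", "Execution" or "Artifact")
    node_types: Dict[str, str] = field(default_factory=dict)
    # (source, relation, target) triples relating provenance elements
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def lineage(self, node_id: str) -> Set[str]:
        """All provenance elements reachable from node_id by following edges."""
        seen: Set[str] = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for src, _, tgt in self.edges:
                if src == current and tgt not in seen:
                    seen.add(tgt)
                    stack.append(tgt)
        return seen


graph = ProvenanceGraph()
graph.add_node("dataset_v1", "Artifact")
graph.add_node("dataset_v2", "Artifact")
graph.add_node("cleaning_run", "Execution")
graph.add_node("alice", "Actor")
graph.relate("dataset_v2", "producedBy", "cleaning_run")
graph.relate("cleaning_run", "performedBy", "alice")
graph.relate("cleaning_run", "used", "dataset_v1")

sources = graph.lineage("dataset_v2")
print(sorted(sources))
```

Tracing `dataset_v2` yields the execution that produced it, the actor responsible and the artifact it was derived from, which is the kind of source tracing the talk asks provenance to support.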
Provenance-related vocabularies


DC – Dublin Core Metadata Terms
FOAF – Friend of a Friend
SIOC – Semantically-Interlinke...
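To make the role of these vocabularies concrete, the sketch below attaches a few provenance statements to a dataset using Dublin Core Terms and FOAF properties and prints them as N-Triples lines. The `example.org` URIs, the person and the dates are invented for illustration; the helper does no literal escaping, so it is a sketch rather than a serializer.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def ntriple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Format one statement as an N-Triples line (no escaping of literals)."""
    value = f'"{obj}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {value} ."


# Hypothetical dataset, source dataset and creator.
dataset = "http://example.org/dataset/1"
creator = "http://example.org/people/alice"

triples = [
    ntriple(dataset, DCTERMS + "creator", creator),
    ntriple(dataset, DCTERMS + "source", "http://example.org/dataset/0"),
    ntriple(dataset, DCTERMS + "created", "2009-11-17", literal=True),
    ntriple(creator, FOAF + "name", "Alice", literal=True),
]

for line in triples:
    print(line)
```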
Action points




Provenance             Awareness of         Tools for data
vocabularies           data providers        ...
An example of provenance visualization




Questions




José Manuel Gómez-Pérez
           Thanks for                                      R&D Director
              your        ...

Provenance: From e-Science to the Web Of Data


Provenance is broadly defined as the origin or source from which something comes and the history of its subsequent owners. In data-, process- and computation-intensive disciplines, provenance focuses on describing and understanding where and how data is produced, the actors involved in its production, and the processes applied to it. Provenance has been a hot topic in scientific disciplines in recent years, with a strong emphasis on eScience, where technologies and means for representing provenance have been proposed with varying degrees of expressivity. As the amount of data involved has grown across domains, provenance models have evolved into semantic overlays, which describe provenance at different levels of granularity and make it easier for users to understand.

The need for provenance analysis has now expanded beyond scientific domains into the Web of Data. The abundance of data is encouraging organizations and governments to publish and expose their data through the Linked Data initiative, so that it is available to the public and can be reused for a number of purposes. However, while a significant number of large, interlinked data sets, such as those of the UK government and the BBC web sites, are starting to become publicly available, important challenges must still be addressed before this vision can be achieved. Among them, provenance is one of the most pressing issues for guaranteeing data quality, trustworthiness and reliability in the Web of Data. In this talk, we offer insight into provenance, from eScience to the Web of Data, describing old problems and new challenges that need to be addressed in the coming years.


Provenance: From e-Science to the Web Of Data

  1. iSOCO
     Provenance: From eScience to the Web of Data
     José Manuel Gómez Pérez
     Invited Lecture, CETINIA, 17/11/2009
  2. Agenda
     - Introduction to Provenance
     - Semantic Overlays for Provenance Analysis
     - The Web of Data
     - Provenance in the Web of Data
  3. Provenance is…
     Records of
     - Origin or source from which something comes
     - History of subsequent owners (change of custody)
     Adapted from James Cheney’s Principles of Provenance
  4. Provenance is…
     - Evidence of authenticity, integrity, and quality
     - Certifies products of good process
     Adapted from James Cheney’s Principles of Provenance
  5. Provenance is…
     - Valuable
     - Hard to collect and verify
     - Necessary to assign credit …and blame
     i.e. establish Trust
     Adapted from James Cheney’s Principles of Provenance
  6. Why provenance of electronic data is difficult
     Paper data                                    | Electronic data
     Creation process leaves paper trail           | Often, there is no bits trail
     Easier to detect modification, copy, forgery  | Easy to forge, plagiarize, and modify data
     Usually, one can judge a book by the cover    | There is no cover to judge by
     Addressing this requires explicitly representing the provenance of data, storing it, keeping it secure, and reasoning with it.
     Adapted from James Cheney’s Principles of Provenance
  7. Provenance in eScience
     One of the most active fields in provenance development
     Curated scientific biological databases
     - Ensure database quality
     - Need provenance for data quality control and accountability
     - Currently done manually by curators
     Scientific workflows – grid computing
     - Abstract process execution complexity
     - Need provenance for process reproducibility, efficiency
     - Currently supported by ad-hoc systems
  8. Past approaches to provenance in eScience
  9. Agenda
     - Introduction to Provenance
     - Semantic Overlays for Provenance Analysis
     - The Web of Data
     - Provenance in the Web of Data
  10. Provenance analysis of process executions ?
  11. Semantic overlays for provenance analysis
      Objective: To support domain experts in understanding process executions
      Problem Solving Methods (PSMs) (McDermott 1988)
      • Provide reusable guidelines to formulate process knowledge
      • Support reasoning
      • Describe the main rationale behind a process
      [Diagram labels: How, What, Whom, Semantic Overlays, PROVENANCE, SMEs]
  12. PSM perspectives
      Task-method decomposition
      - Black-box perspective
      - Hierarchically defines how tasks decompose into simpler (sub)tasks
      - Describes tasks at several levels of detail
      - Provides alternative ways to achieve a task
      Interaction
      - Knowledge transformation within the PSM
      - PSM establishes and controls the sequence of actions required to perform a task
      - Defines knowledge required at each task step
      Knowledge flow
      [Diagram legend: Task, Method, Role]
  13. Towards knowledge provenance
      PSMs as semantic overlays on top of existing process documentation
      Task: What is going to be achieved by executing a process
      PSM: HOW
      Provenance, from a knowledge perspective
      - How recorded provenance relates to the execution of a process
      - Simpler process analysis proposing decompositions into simpler subprocesses
      - Visualize provenance at different levels of detail
      Supporting domain experts in two main ways
      - Validation of process executions
      - Identification of reasoning patterns in process executions
      Source: myGrid
  14. The twig join function
      Based on XML pattern matching algorithms on Directed Acyclic Graphs (Bruno et al., 2002)
      twig_join detects the occurrence of a pattern in an XML DAG
      Given
      - P, a process
      - T, a task potentially describing P
      - M, a PSM providing a strategy on how to achieve T
      - i(T), the set of input roles of T
      - o(T), the set of output roles of T
      - D, the DAG resulting from documenting the execution of P
      twig_join(D,i(T),o(T)) checks whether a twig exists for M that connects i(T) with o(T) in D
      In this case, PSM M is the pattern to be identified in the process documentation DAG D
  15. A twig join example in provenance analysis
      [Diagram columns: Domain entities, Bridges (mapping), PSM entities; callout: twig join!]
  16. The matching algorithm
      • twig_join recursively applied at each decomposition level
      • Each task decomposed by one or several PSMs (task-method decomposition view)
      • Knowledge flow defines the sequence of evaluation
      Backtracking possible at PSM and role levels
      [Diagram: twig_join(Ti, D); decompose(Ti); twig_join(T11, D), twig_join(T12, D), twig_join(T13, D), twig_join(T14, D); view labels: Task-method decomposition, Knowledge flow, Interaction]
  17. KOPE: A Knowledge-Oriented Provenance Environment
      [Diagram labels: PSM-Ontology bridges, Matching visualization, Provenance query, Matching detection]
  18. KOPE Evaluation
      [Diagram labels: PSM Catalogue (Task-Method Decomposition), Brain Atlas Workflow, PSM Catalogue (Knowledge Flow)]
  19. KOPE video
  20. KOPE evaluation (II)
      Focus on precision and recall metrics
      Identified at three different layered contexts
      - Method
      - Task
      - Decomposition-level
      Goal 1: To identify the main rationale behind process executions by detecting occurrences of semantic overlays in their logs
      Goal 2: To exploit the structure of semantic overlays to describe process executions at different levels of detail
      [Chart: Precision and Recall, 0% to 120%, for Level1 to Level4]
      [Legend: Perfect match, Partial match, No match]
  21. Agenda
      - Introduction to Provenance
      - Semantic Overlays for Provenance Analysis
      - The Web of Data
      - Provenance in the Web of Data
  22. WAKE UP VIDEO!
  23. While the economy contracts, the digital universe expands…
      - In 2006, the size of the digital universe was estimated at 161 exabytes: 3 million times the information in all books ever written
      - By 2010, it is expected to reach 988 exabytes
      - …and all this data is potentially exposed online
      Source: IDC
  24. Web data
  25. The Linked Data paradigm
      How can we exploit all the available data?
      - Data reuse and remix
      - Common flexible and usable APIs
      - Standard vocabularies to describe interlinked datasets
      - Tools
      - Realize the Semantic Web vision
      Tim Berners-Lee, 2006 (Design Issues)
      1. Use URIs to identify things
         - Anything, not just documents
      2. Use HTTP URIs for people to look up such names
         - Globally unique names
         - Distributed ownership
      3. Provide useful information in RDF upon URI resolution
      4. Include RDF links to other URIs
         - Enable discovery of related information
  26. The Linked Data Cloud (May 2007)
  27. The Linked Data Cloud (August 2007)
  28. The Linked Data Cloud (March 2008)
  29. The Linked Data Cloud (September 2008)
  30. The Linked Data Cloud (March 2009)
  31. The Web of Data
      - Apply the Linked Data principles to expose open datasets in RDF
      - Define RDF links between data items from different datasets
      - Over 7.5 billion triples, 5 million links (as of November 2009)
  32. Linked Data going mainstream
  33. Agenda
      - Introduction to Provenance
      - Semantic Overlays for Provenance Analysis
      - The Web of Data
      - Provenance in the Web of Data
  34. A real-life example
      Linking and exploiting distributed data sets without the means to check their provenance can be harmful, especially in sensitive domains.
      - Two fake web sites
      - A fake Wikipedia entry
      - Fake California public safety phone numbers
      The hoax caused a 1000-word tome in Frankfurter Allgemeine Zeitung… and public apologies from DPA
      Trust in Wikipedia misled DPA
      In a provenance-aware world, DPA would have had means based on data provenance to automatically check that
      - The town does not exist
      - The Berlin Boys do not exist
      - The reporting local TV station does not exist
  35. The Linked Data flow
      [Diagram, top to bottom]
      - Linked Data applications
      - Exploit Linked Data: SPARQL EPRs
      - Linked Data
      - Publish Linked Data (RDF, HTTP, URIs)
      - Web documents, Multimedia, Legacy resources (e.g. DBs, XML repositories)
      [Provenance, alongside the flow: Data trustworthiness, Data quality, Data lineage]
  36. Provenance and Linked Data
      Linked Data is largely about reuse. However, reusing data from third parties requires knowing its provenance!
      [Callouts: Is the data reliable? Is the quality of the data good?]
      Provenance shall provide the ability to
      - Trace the sources of data
      - Enable the exploration of relationships between datasets, their authors and affiliations
      Provenance analysis shall provide insight into how data is produced and exploited
      Provenance should create a notion of information quality
      - Is a certain dataset consistent and up to date?
      - Is the connection between two interlinked datasets meaningful?
      - Is a given dataset relevant for a particular domain?
      Provenance to establish information trustworthiness
      Provenance to provide data views following some criteria
  37. Provenance challenges in the Web of Data
      Provenance information needs to be
      - Represented
      - Captured and recorded
      - Stored and secured, queried, and reasoned about
      - Visualized and browsed
  38. A Provenance architecture for the Web of Data
      Authoritative agencies required to certify and keep data provenance secure!
  39. Semantics in support of provenance in the Web of Data
      Semantic Web stack
      Provenance stack (This, we still need to define!)
      - Information quality inference
      - Trust inference
      - Reasoning with provenance
      - Provenance querying
      - Provenance capture
      - Provenance access policy definition
      - Provenance encryption
  40. Towards a model of Web Data provenance
      Provenance represented as a graph
      - Nodes: provenance elements (pieces of provenance information)
      - Edges: relate provenance elements to each other
      - Subgraphs for related data items possible
      Provenance models define
      - Types of provenance elements (roles)
      - Relationships between them
      [Diagram labels: Actor, Execution, Artifact]
      Adapted from Olaf Hartig’s Provenance Information in the Web of Data
  41. Provenance-related vocabularies
      - DC – Dublin Core Metadata Terms
      - FOAF – Friend of a Friend
      - SIOC – Semantically-Interlinked Online Communities
      - SWP – Semantic Web Publishing vocabulary
      - WOT – Web Of Trust schema
      - VOiD – VOcabulary of Interlinked Datasets
      However, general lack of provenance-related metadata on the Web of Data!
  42. Action points
      Provenance vocabularies
      - Represent and reason with trust and information quality
      - Extend emerging Linked Data vocabularies (VOiD)
      Awareness of data providers
      - W3C Provenance IG
      - Provenance authoritative agencies
      - Linked Data standards (VOiD again)
      Tools for data providers
      - Generation of provenance metadata
      - Provenance visualization
  43. An example of provenance visualization
  44. Questions
  45. Thanks for your attention!
      José Manuel Gómez-Pérez, R&D Director
      T +34913349778
      M +34609077103
      jmgomez@isoco.com
      iSOCO
      For more information on how we can help your company optimize its digital business and provide an innovative solution, contact us at www. .com
      Barcelona: Tel +34 93 5677200, Edificio Testa A, C/ Alcalde Barnils 64-68, St. Cugat del Vallès, 08174 Barcelona
      Madrid: Tel +34 91 3349797, C/Pedro de Valdivia, 10, 28006 Madrid
      Valencia: Tel +34 96 3467143, Oficina 107, C/ Prof. Beltrán Báguena 4, 46009 Valencia
