Provenance is broadly defined as the origin or source from which something comes, together with the history of its subsequent owners. In the context of data-, process-, and computation-intensive disciplines, provenance focuses on describing and understanding where and how data is produced, the actors involved in its production, and the processes applied to it. Provenance has been a hot topic in recent years across scientific disciplines, with a strong emphasis in eScience, where technologies and means for representing provenance have been proposed, ranging across different degrees of expressivity. As the amount of data involved has grown across domains, provenance models have evolved into semantic overlays, which describe provenance at different levels of granularity and facilitate user understanding. Nowadays, the need for provenance analysis has expanded beyond scientific domains into the Web of Data arena. The abundance of data is encouraging organizations and governments to publish and expose their data so that it can be made available to the public and reused for a number of purposes through the Linked Data initiative. However, while a significant number of large, interlinked data sets, such as those of the UK government and the BBC, are now becoming publicly available, important challenges still need to be addressed before this vision can be achieved. Among them, provenance is one of the most outstanding issues for guaranteeing data quality, trustworthiness, and reliability in the Web of Data. In this talk, we provide insight into provenance, from eScience to the Web of Data, describing old problems and new challenges that need to be addressed in the upcoming years.
In this presentation we explore different applications of AI, mainly Natural Language Processing but also some innovative uses of Computer Vision, to deal with two significant challenges: the analysis of scientific publications, involving not only text but also scientific figures and diagrams, and the identification of potentially radical aspects in text.
The digital universe is booming, especially metadata and user-generated data. This raises strong challenges: identifying the portions of data relevant to a particular problem and dealing with the lifecycle of data. Finer-grained problems include data evolution and the potential impact of change on the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Objects concept as the means to identify and structure relevant data in scientific domains, treating data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.
We present an approach towards the acquisition of process knowledge for the natural sciences. The work has been conducted within Project Halo, which is creating advanced knowledge authoring and question answering systems for the natural sciences. An analysis of AP®-level questions for Biology, Chemistry, and Physics uncovered that process knowledge is the single most frequent type of knowledge required. Thus, we developed means to acquire process knowledge, to formally represent it, and to reason about it in order to answer novel questions about the domains.
All these tasks are supported by an abstract process meta-model. It provides the terminology for user-tailored process diagrams, which are automatically translated into executable FLogic code. The meta-model and the code generation are based on the notion of Problem Solving Methods (PSMs), which represent an abstract formalization of the reasoning strategies needed for processes.
This talk is about process knowledge and how to enable users without any IT skills to i) model processes and ii) analyze the provenance of process executions, without the intervention of software or knowledge engineers. Jose Manuel proposes the utilization of Problem Solving Methods (PSMs) as key enablers for accomplishing such objectives and demonstrates the solutions developed, evaluated in the contexts of Project Halo and the Provenance Challenge, respectively. Jose Manuel concludes the talk with a process-centric overview of the challenges raised by the new web-driven computing paradigm, where large amounts of data are contributed and exploited by users on the web, requiring scalable, non-monotonic reasoning techniques as well as stimulating collaboration while preserving trust.
5th Workshop on Automation and Industrial Informatics: "Information and control technologies in healthcare".
In this talk we discuss why semantic technology matters in eHealth, focusing largely on the relationship between SNOMED and OWL and on how semantic technologies can streamline the post-coordination process. We also discuss how semantic technologies can help solve the interoperability problem between terminologies, and we review some of the semantic resources available today. We go over the projects in which we have addressed these topics and present an overview of what is currently being done at the European level.
"At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons."
Tim Berners-Lee, W3C Director, Web Design Issues, September 1997
Provenance focuses on the description and understanding of where and how data is produced, the actors involved in its production, and the processes by which the data was manipulated and transformed until it arrived in the collection from which it is being accessed. Provenance aims to provide the ability to trace the sources of data, enabling the exploration not just of the relationships between datasets, but also of their authors and affiliations, with the goal of preserving data ownership and establishing a notion of trust based on authenticity and reliability.
The Future Internet poses important challenges for provenance, derived from complex and rich scenarios characterized by large amounts of data stemming from heterogeneous sources such as user communities, services, and things. Such challenges span technical as well as socioeconomic dimensions. The former includes aspects like vocabularies for representing provenance, interoperability and scalability issues, and means to produce, acquire, and reason with provenance in order to provide measures of trust and information quality. However, it is probably in the socioeconomic dimension where the most significant efforts need to be made, addressing issues like the role of provenance in the overall picture of the Future Internet, entry barriers preventing the generation of provenance-aware internet content, the means required to incentivize the production of such content, and ways to prevent provenance forgery.
In this talk, we provide an overview of provenance and the above-mentioned challenges and introduce ongoing work to address trust issues from the provenance perspective in the Future Internet. We also link provenance to other aspects relevant to trust discussed in the session, such as security, legal frameworks, and economics.
3. Provenance is…
Records of
- the origin or source from which something comes
- the history of subsequent owners (change of custody)
Adapted from James Cheney’s Principles of Provenance
4. Provenance is…
- Evidence of authenticity, integrity, and quality
- Certifies products of good process
Adapted from James Cheney’s Principles of Provenance
5. Provenance is…
- Valuable
- Hard to collect and verify
- Necessary to assign credit… and blame
i.e. to establish Trust
Adapted from James Cheney’s Principles of Provenance
6. Why provenance of electronic data is difficult
Paper data
- The creation process leaves a paper trail
- Modification and forgery are easier to detect
- Usually, one can judge a book by the cover
Electronic data
- Often, there is no bits trail
- Easy to forge, copy, plagiarize, and modify data
- There is no cover to judge by
Addressing this requires explicitly representing the provenance of data, storing it, keeping it secure, and reasoning with it.
Adapted from James Cheney’s Principles of Provenance
7. Provenance in eScience
One of the most active fields in provenance development
Curated scientific biological databases
- Ensure database quality
- Need provenance for data quality control and accountability
- Currently done manually by curators
Scientific workflows – grid computing
- Abstract away process execution complexity
- Need provenance for process reproducibility and efficiency
- Currently supported by ad-hoc systems
11. Semantic overlays for provenance analysis
Objective: to support domain experts in understanding process executions
How: Problem Solving Methods (PSMs) (McDermott 1988)
- Provide reusable guidelines to formulate process knowledge
- Support reasoning
- Describe the main rationale behind a process
What: semantic overlays
Whom: subject-matter experts (SMEs), working over provenance
12. PSM perspectives
Interaction (black-box perspective)
- The PSM establishes and controls the sequence of actions required to perform a task
- Defines the knowledge required at each task step
Task-method decomposition
- Hierarchically defines how tasks decompose into simpler (sub)tasks
- Describes tasks at several levels of detail
- Provides alternative ways to achieve a task
Knowledge flow
- Knowledge transformation within the PSM
(Diagram legend: Task, Method, Role)
13. Towards knowledge provenance
PSMs as semantic overlays on top of existing process documentation
- Task: WHAT is going to be achieved by executing a process
- PSM: HOW
Provenance, from a knowledge perspective
- How recorded provenance relates to the execution of a process
- Simpler process analysis by proposing decompositions into simpler subprocesses
- Visualize provenance at different levels of detail
Supporting domain experts in two main ways
- Validation of process executions
- Identification of reasoning patterns in process executions
(Image source: myGrid)
14. The twig join function
Based on XML pattern matching algorithms on Directed Acyclic Graphs (Bruno et al., 2002)
twig_join detects the occurrence of a pattern in an XML DAG
Given
- P, a process
- T, a task potentially describing P
- M, a PSM providing a strategy on how to achieve T
- i(T), the set of input roles of T
- o(T), the set of output roles of T
- D, the DAG resulting from documenting the execution of P
twig_join(D, i(T), o(T)) checks whether a twig exists for M that connects i(T) with o(T) in D
In this case, PSM M is the pattern to be identified in the process documentation DAG D
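To make the idea concrete, here is a minimal, hypothetical Python sketch. It is not the holistic twig join of Bruno et al.; simple reachability over a dict-of-lists DAG stands in for pattern detection, and the example process names (sequence, blast_run, alignment, report) are invented for illustration.

```python
from collections import deque

def reachable(dag, start):
    """Return the set of nodes reachable from `start` in `dag` (dict of lists)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for succ in dag.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

def twig_join(dag, input_roles, output_roles):
    """Simplified stand-in for twig_join(D, i(T), o(T)): succeed if every
    output role of T is reachable in D from some input role of T,
    i.e. a twig connects i(T) with o(T)."""
    covered = set()
    for role in input_roles:
        covered |= reachable(dag, role)
    return all(out in covered for out in output_roles)

# Hypothetical process documentation DAG: artifacts and their derivations.
D = {"sequence": ["blast_run"], "blast_run": ["alignment"], "alignment": ["report"]}
print(twig_join(D, {"sequence"}, {"report"}))  # True: a twig connects i(T) and o(T)
```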
15. A twig join example in provenance analysis
(Figure: domain entities are mapped through bridges onto PSM entities, and the twig join is detected in the provenance graph.)
16. The matching algorithm
- twig_join is recursively applied at each decomposition level
- Each task is decomposed by one or several PSMs (task-method decomposition view)
- The knowledge flow defines the sequence of evaluation
- Backtracking is possible at the PSM and role levels
(Figure: twig_join(Ti, D) triggers decompose(Ti), which applies twig_join(T11, D) through twig_join(T14, D) along the knowledge flow, combining the task-method decomposition and interaction perspectives.)
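The recursion can be sketched as follows, building on the twig_join stand-in above; the Task tuple and the decompose table are hypothetical, and backtracking over role bindings is omitted for brevity.

```python
from collections import namedtuple

# A task with its input and output roles; hypothetical, for illustration.
Task = namedtuple("Task", "name inputs outputs")

def match_task(task, dag, decompose):
    """Recursively check a task against the documentation DAG.

    `decompose` maps a task name to its candidate PSMs; each PSM is an
    ordered list of subtasks (the knowledge flow). A leaf task matches
    if twig_join finds its twig; a composite task matches if any one of
    its PSMs has all subtasks matching (backtracking over alternatives).
    """
    methods = decompose.get(task.name, [])
    if not methods:  # leaf task: look for its twig directly
        return twig_join(dag, task.inputs, task.outputs)
    for psm in methods:  # try alternative PSMs, backtracking on failure
        if all(match_task(sub, dag, decompose) for sub in psm):
            return True
    return False
```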
20. KOPE evaluation (II)
Focus on precision and recall metrics, identified at three different layered contexts
- Method
- Task
- Decomposition-level
Goal 1: identify the main rationale behind process executions by detecting occurrences of semantic overlays in their logs
Goal 2: exploit the structure of semantic overlays to describe process executions at different levels of detail
(Chart: precision and recall across Level1 to Level4, with results classified as perfect match, partial match, or no match.)
23. While the economy contracts, the digital universe expands…
In 2006, the size of the digital universe was estimated at 161 exabytes
- 3 million times the information in all books ever written
By 2010, it is expected to reach 988 exabytes
…and all this data is potentially exposed online
(Source: IDC)
25. The Linked Data paradigm
How can we exploit all the available data?
- Data reuse and remix
- Common, flexible, and usable APIs
- Standard vocabularies to describe interlinked datasets
- Tools
- Realize the Semantic Web vision
Tim Berners-Lee, 2006 (Design Issues):
1. Use URIs to identify things
- Anything, not just documents
2. Use HTTP URIs so that people can look up those names
- Globally unique names
- Distributed ownership
3. Provide useful information in RDF upon URI resolution
4. Include RDF links to other URIs
- Enable discovery of related information
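As a small illustration of the four principles in practice, the following Python sketch uses the rdflib library to dereference an HTTP URI and follow the RDF links it exposes; the DBpedia URL is just an example endpoint and is assumed to still serve Turtle at that address.

```python
from rdflib import Graph, URIRef

# Principles 1 and 2: an HTTP URI names a thing and can be looked up.
resource = URIRef("http://dbpedia.org/resource/Berlin")

# Principle 3: resolving the URI yields useful information in RDF.
g = Graph()
g.parse("http://dbpedia.org/data/Berlin.ttl", format="turtle")

# Principle 4: the returned RDF links to other URIs we can follow next.
for _, predicate, obj in g.triples((resource, None, None)):
    if isinstance(obj, URIRef):
        print(predicate, "->", obj)
```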
31. The Web of Data
- Apply the Linked Data principles to expose open datasets in RDF
- Define RDF links between data items from different datasets
- Over 7.5 billion triples, 5 million links (as of November 2009)
34. A real-life example
Linking and exploiting distributed data sets without the means to contrast their provenance can be harmful, especially in sensitive domains.
The hoax comprised
- Two fake web sites
- A fake Wikipedia entry
- Fake California public safety phone numbers
The hoax caused a 1000-word tome in the Frankfurter Allgemeine Zeitung… and public apologies from DPA. Trust in Wikipedia misled DPA.
In a provenance-aware world, DPA would have had means based on data provenance to automatically check that
- The town did not exist
- The Berlin Boys do not exist
- The reporting local TV station does not exist
35. The Linked Data flow
(Diagram: legacy resources such as databases and XML repositories, web documents, and multimedia are published as Linked Data using RDF, HTTP, and URIs; Linked Data applications then exploit the data through SPARQL EPRs. Provenance concerns, namely data trustworthiness, data quality, and data lineage, accompany both the publishing and the exploitation steps.)
36. Provenance and Linked Data
Linked Data is largely about reuse. However, reusing data from 3rd parties requires knowing its provenance! Is the data reliable? Is the quality of the data good?
Provenance shall provide the ability to
- Trace the sources of data
- Enable the exploration of relationships between datasets, their authors, and affiliations
Provenance analysis shall provide insight into how data is produced and exploited
Provenance should create a notion of information quality
- Is a certain dataset consistent and up to date?
- Is the connection between two interlinked datasets meaningful?
- Is a given dataset relevant for a particular domain?
Provenance to establish information trustworthiness
Provenance to provide data views following some criteria
37. Provenance challenges in the Web of Data
Provenance information needs to be
- Represented
- Captured and recorded
- Stored and secured, queried, and reasoned about
- Visualized and browsed
38. A Provenance architecture for the Web of Data
Authoritative agencies are required to certify data provenance and keep it secure!
39. Semantics in support of provenance in the Web of Data
The Semantic Web stack is well established; the provenance stack we still need to define! Its layers include:
- Information quality inference
- Trust inference
- Reasoning with provenance
- Provenance querying
- Provenance capture
- Provenance access policy definition
- Provenance encryption
40. Towards a model of Web Data provenance
Adapted from Olaf Hartig’s Provenance Information in the Web of Data
Provenance represented as a graph
- Nodes: provenance elements (pieces of provenance information)
- Edges: relate provenance elements to each other
- Subgraphs for related data items possible
Provenance models define
- Types of provenance elements (roles)
- Relationships between them
(Diagram: example element types Actor, Execution, and Artifact.)
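A minimal sketch of such a model in Python, assuming the three element types shown in the diagram; the class and field names (used, generated, controlled_by) are invented for illustration and do not correspond to any specific provenance vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Artifact:
    """A piece of data; a node in the provenance graph."""
    uri: str

@dataclass
class Actor:
    """Who or what controlled an execution."""
    uri: str

@dataclass
class Execution:
    """A process execution; its fields are the edges of the graph."""
    uri: str
    used: List[Artifact] = field(default_factory=list)       # inputs
    generated: List[Artifact] = field(default_factory=list)  # outputs
    controlled_by: Optional[Actor] = None

# A tiny provenance subgraph for one derived data item.
raw = Artifact("http://example.org/data/raw")
clean = Artifact("http://example.org/data/clean")
curator = Actor("http://example.org/agents/curation-bot")
run = Execution("http://example.org/runs/42",
                used=[raw], generated=[clean], controlled_by=curator)
```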
41. Provenance-related vocabularies
- DC – Dublin Core Metadata Terms
- FOAF – Friend of a Friend
- SIOC – Semantically-Interlinked Online Communities
- SWP – Semantic Web Publishing vocabulary
- WOT – Web Of Trust schema
- VOiD – VOcabulary of Interlinked Datasets
However, there is a general lack of provenance-related metadata on the Web of Data!
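As a hedged illustration of how these vocabularies combine, the following snippet attaches basic provenance metadata to a dataset description, assuming a recent rdflib that ships the Dublin Core, FOAF, and VOiD namespaces; the example URIs and the author name are hypothetical.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, FOAF, VOID, RDF

g = Graph()
dataset = URIRef("http://example.org/datasets/experiments")
author = URIRef("http://example.org/people/author1")

g.add((dataset, RDF.type, VOID.Dataset))     # VOiD: an interlinked dataset
g.add((dataset, DCTERMS.creator, author))    # DC: who created it
g.add((dataset, DCTERMS.source,              # DC: where it comes from
       URIRef("http://example.org/datasets/raw")))
g.add((author, RDF.type, FOAF.Person))       # FOAF: describe the author
g.add((author, FOAF.name, Literal("Jane Doe")))

print(g.serialize(format="turtle"))
```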
42. Action points
Provenance vocabularies
- W3C Provenance IG
- Extend emerging Linked Data vocabularies and standards (VOiD again)
Awareness of data providers
- Generation of provenance metadata
- Provenance authoritative agencies
Tools for data providers
- Represent and reason with trust and information quality
- Provenance visualization
45. José Manuel Gómez-Pérez
R&D Director
Thanks for your attention!
T +34 913349778
M +34 609077103
jmgomez@isoco.com
iSOCO
For more information on how iSOCO can help your company optimize its digital business and deliver an innovative solution, contact us at www.isoco.com
Barcelona: Edificio Testa A, C/ Alcalde Barnils 64-68, St. Cugat del Vallès, 08174 Barcelona, Tel +34 93 5677200
Madrid: C/ Pedro de Valdivia, 10, 28006 Madrid, Tel +34 91 3349797
Valencia: Oficina 107, C/ Prof. Beltrán Báguena 4, 46009 Valencia, Tel +34 96 3467143