Provenance and the W3C PROV model
(in the Big Data context)
Paolo Missier
School of Computing Science
Newcastle University, UK
First Keystone Summer School,
Malta, July 2015
Some of the slides courtesy of Luc Moreau – thanks!
Topical research dissemination events
Lecture goals and outline
• What is provenance, and why does it matter?
• Definitions and case studies
• The W3C PROV standard in a nutshell
• PROV-O: the Provenance Ontology and examples of its usage
• Provenance and Big Data: what’s the connection?
• Opportunities and challenges
• Provenance tools [from Southampton]
One recent book
1- Reproducibility and dissemination in Science
Independent validation of scientific claims is a cornerstone of
experimental science
• Scientific claims are supported by experiments
• How do I express my “material and methods” so that you can
independently verify my results?
• How do I document my results to promote their understanding /
Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a
• Replay, validate the execution, compare it with
To what extent can these be formalised and automated in data-intensive science?
2- Explaining the outcome of a complex decision process
• Which process was used
to derive a diagnosis?
• How did the process use
the input data?
• How were the steps
• Which decisions were
made by human experts
[Figure: pipeline for the clinical diagnosis of genetic diseases. Stages: Variant filtering, Variant Scoping, Variant Classification. Labels: MAF threshold; non-synonymous, stop/gain, frameshift; known polymorphisms; homo / heterozygous; HPO match; OMIM match; OMIM to Gene; genes in scope; genes list; disease keywords; preferred genes; in scope; not found.]
3- Understanding the results of a computation
• Why has my [very complicated algorithm] produced this particular
• Why is my predictive analytics model suggesting that it will rain
• Why is this record part of the result of my database query?
• Database provenance
• Why is this record included in the result of my keyword search?
4- Content reuse on the Social Web
Open Data, Data Journalism
• A consume-select-curate-share workflow, not only professional
• Ethos: to expose the data and methods used to produce news items
• But: Data wrangling can introduce errors
• Is the data I am using valid? What is its primary source? What are the
transformation steps?
NowNews publishes an article based on the latest employment data published by GovStat
PolicyOrg compiles a report including the NowNews article
:L-Moreau a Agent.
:original-slide a Entity;
wasAttributedTo L-Moreau.
:this-slide a Entity;
wasDerivedFrom original-slide
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners
Magna Carta (‘the Great Charter’) was
agreed between King John and his barons
on 15 June 1215.
What is provenance?
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
Provenance is a record that describes the people, institutions,
entities, and activities, involved in producing, influencing, or
delivering a piece of data or a thing in the world
Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
• A browser button by which the user can express their uncertainty about a
document being displayed “so how do I know I can trust this information?”.
• Upon activation of the button, the software then retrieves metadata about the
document, listing assumptions on which trust can be based.
Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs METHOD 2013: The 2nd IEEE
International Workshop on Methods for Establishing Trust with Open Data Held in conjunction with COMPSAC,
the IEEE Signature Conference on Computers, Software & Applications - July 22-26, 2013 - Kyoto, Japan
Provenance in the Semantic Web Stack
Use cases on the Social Web
Open Data, Data Journalism
NowNews publishes an article based on the latest employment
data published by GovStat
PolicyOrg compiles a report including NowNews article
Bob: Journalist
Alice: Data Cruncher
Tom: Editor
Nick: Web Master
Derivation - Timeliness
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?
Is the report based on the most
up-to-date data?
Derivation - Trusted sources
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However that information is hidden:
• the chart was produced by a complex process performed by Alice
Policy rule:
“data supplied by the government
is reliable”
Tracing the source of errors
Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?
Nick discovers an error in the
chart included in Bob’s article
Ensuring policy compliance
Process inspection:
• Which process steps led to publication?
• Was editorial check part of it?
Policy rule:
“posts are to be checked by an
editor prior to publication”
Ensuring credit and acknowledgement
NowNews relies on multiple
Attribution and responsibility:
• How do we ensure that all relevant
contributors are acknowledged?
Documenting the data
generation process:
• How do we ensure that
the figures can be
reproduced using the
new versions of the
NowNews must ensure that the
article figures reflect the most
recent data
version: 1.0 version: 2.0
So, why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To enable process analysis for debugging, improvement, evolution
• To enable reproducibility of processes (e.g. in science, data journalism…)
See also:
ACM Journal of Data and Information Quality (JDIQ) - Special Issue on Provenance, Data
and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5 Issue 3, February 2015
DOI: 10.1145/2692312
The W3C Working Group on Provenance
Incubator Group on provenance
Chair: Yolanda Gil
Main output: “Provenance XG Final Report”
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group
Working Group (April 2011 to April 2013)
Chairs: Luc Moreau, Paul Groth
- prov-dm: data model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints: a number of non-prescriptive
PROV: scope and structure
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures
on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed
aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it
may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place,
for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not
exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had
not begun to utilize this entity
PROV is based on a notion of instantaneous events, that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
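The first ordering constraint above can be checked mechanically when events carry timestamps. The following is a minimal Python sketch, not part of PROV itself: the structures `generated_at` and `used_at`, the function name `ordering_violations`, and the activity `reviewing` are assumptions made for illustration, and real provenance may lack comparable times.

```python
from datetime import datetime

def ordering_violations(generated_at, used_at):
    """Return (activity, entity, usage_time) triples whose usage precedes generation.

    generated_at: dict mapping entity id -> generation time (hypothetical layout)
    used_at: list of (activity id, entity id, usage time) tuples (hypothetical layout)
    """
    violations = []
    for activity, entity, t_use in used_at:
        t_gen = generated_at.get(entity)
        # Entities without a recorded generation time cannot be checked.
        if t_gen is not None and t_use < t_gen:
            violations.append((activity, entity, t_use))
    return violations

generated_at = {"draftV1": datetime(2013, 3, 18, 9, 0)}
used_at = [
    ("commenting", "draftV1", datetime(2013, 3, 18, 11, 10)),  # after generation
    ("reviewing", "draftV1", datetime(2013, 3, 17, 8, 0)),     # before generation
]
violations = ordering_violations(generated_at, used_at)
print(violations)
```

Only the usage by the hypothetical `reviewing` activity is reported, because it is timestamped before the generation of `draftV1`.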
Concepts and relations
Generation of “draft v1” expressed as relation:
wasGeneratedBy(“draft v1”, ...)
Usage of “draft v1” by “commenting” expressed as relation:
used(“commenting”, “draft v1”, ...)
PROV notation
prefix prov <>
prefix ex <>
entity(ex:draftV1, [ex:distr="internal", ex:status="draft"])
wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
used(ex:commenting, ex:draftV1, -)
wasGeneratedBy(ex:draftV1, ex:drafting, -)
used(ex:drafting, ex:paper1, -)
used(ex:drafting, ex:paper2, -)
Same example — PROV-O notation (RDF/N3)
:draftComments a prov:Entity ;
:distr "internal"^^xsd:string ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity ;
:distr "internal"^^xsd:string ;
:status "draft"^^xsd:string ;
:version "0.1"^^xsd:string ;
prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ;
prov:used :paper1,
:paper2 .
:paper1 a prov:Entity,
"reference"^^xsd:string .
:paper2 a prov:Entity,
"reference"^^xsd:string .
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ex:distr="internal"])
agent(ex:Bob, [prov:type="mainEditor"])
agent(ex:Alice, [prov:type="srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role="editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example — PROV-O notation (RDF/N3)
:Alice a prov:Agent ;
:firstName "Alice" ;
:lastName "Cooper" .
:Bob a prov:Agent ;
:firstName "Robert" ;
:lastName "Thompson" ;
prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:drafting a prov:Activity ;
prov:wasAssociatedWith :Bob .
Association and Attribution
Q.: what is the relationship between attribution and association?
This is defined as an inference rule in the PROV-CONSTR document:
IF wasAttributedTo(e, Ag)
THEN there exists an activity a such that
wasGeneratedBy(e, a, -) and
wasAssociatedWith(a, Ag, -)
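To make the link between attribution, generation and association concrete, here is a small Python sketch. It assumes statements are held as plain tuples, and the function name `unsupported_attributions` is hypothetical. It lists the attributions for which no generating activity associated with the same agent has been recorded. Such an attribution is not invalid: the inference only says that a suitable activity exists, even if it was never asserted.

```python
def unsupported_attributions(attributed, generated, associated):
    """Find attributions (e, ag) lacking an explicit supporting activity.

    attributed: iterable of (entity, agent) pairs
    generated:  set of (entity, activity) pairs
    associated: set of (activity, agent) pairs
    """
    unsupported = []
    for entity, agent in sorted(attributed):
        # Activities recorded as having generated this entity.
        activities = {a for (e, a) in generated if e == entity}
        if not any((a, agent) in associated for a in activities):
            unsupported.append((entity, agent))
    return unsupported

attributed = {("draftComments", "Bob"), ("draftV1", "Alice")}
generated = {("draftComments", "commenting"), ("draftV1", "drafting")}
associated = {("commenting", "Bob")}

gaps = unsupported_attributions(attributed, generated, associated)
print(gaps)
```

Here the attribution of `draftComments` to `Bob` is backed by the `commenting` activity, while the (illustrative) attribution of `draftV1` to `Alice` has no recorded association to back it.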
Communication amongst activities
Communication is the exchange of some unspecified entity by two
activities, one activity using some entity generated by the other.
wasInformedBy(ex:commenting, ex:drafting)
:drafting a prov:Activity .
:commenting a prov:Activity ;
prov:wasInformedBy :drafting .
Communication, generation, usage
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e,ex:drafting, -)
used(ex:commenting, e, -)
Q.: what is the relationship between communication, generation, and usage?
These are Inferences 5 and 6 in the PROV-CONSTR document
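One direction of this relationship (generation plus usage implies communication) is easy to compute. The Python sketch below assumes generation and usage statements are stored as tuples; `infer_communication` is an illustrative name, not an API of any PROV library.

```python
def infer_communication(generated, used):
    """Derive wasInformedBy pairs from generation and usage statements.

    generated: set of (entity, activity) pairs, i.e. wasGeneratedBy(entity, activity)
    used:      set of (activity, entity) pairs, i.e. used(activity, entity)
    Returns a set of (informed, informant) activity pairs.
    """
    informed = set()
    for entity, producer in generated:
        for consumer, used_entity in used:
            if used_entity == entity:
                # consumer used something that producer generated
                informed.add((consumer, producer))
    return informed

generated = {("draftV1", "drafting")}
used = {("commenting", "draftV1"), ("drafting", "paper1")}
communication = infer_communication(generated, used)
print(communication)
```

The usage of `paper1` yields nothing because no generation of `paper1` is recorded. The converse direction only guarantees that some entity was exchanged, without naming it.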
Three Views of Provenance
Summary of the PROV Core model
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing entity.
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
Provenance and Big Data: what’s the connection?
opportunities and challenges
Provenance {as,of} Big Data
1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?
2. Provenance of analytics processes
• “Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?
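As an illustration of the second item, the provenance of a model and of a prediction can be written down with the core PROV relations. This is only a sketch under assumptions: the tuple encoding, the function names, and identifiers such as `model_v1` or `train_run_1` are hypothetical and do not come from any particular system.

```python
def training_provenance(model, training_run, dataset):
    """PROV-style statements recording how a model was trained."""
    return [
        ("used", training_run, dataset),
        ("wasGeneratedBy", model, training_run),
        ("wasDerivedFrom", model, dataset),
    ]

def prediction_provenance(prediction, predict_run, model, inputs):
    """PROV-style statements recording how a prediction was made."""
    return [
        ("used", predict_run, model),
        ("used", predict_run, inputs),
        ("wasGeneratedBy", prediction, predict_run),
    ]

statements = (
    training_provenance("model_v1", "train_run_1", "training_set")
    + prediction_provenance("forecast_42", "predict_run_7", "model_v1", "todays_readings")
)
for statement in statements:
    print(statement)
```

Because `model_v1` appears both as the output of training and as an input to prediction, a prediction can be traced back to the training data that shaped it.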
Recent research on Provenance as Big Data
Chen, Peng, and Beth A. Plale. “Big Data Provenance Analysis and Visualization.” In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85
Chen, Peng, and Beth A. Plale. “ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model.” In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
• A Provenance Generator tool for experimenting with provenance at scale
• Why generate synthetic provenance?
• Synthetic PROV graphs can be a valuable complement to emerging natural
provenance collections
• … provided their structural properties reflect specific provenance patterns
• control over their repetition and variability
• varying scales
• Useful for benchmarking emerging provenance management systems
• Useful to test analytics algorithms that operate on large provenance collections
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In
Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014.
What does ProvGen do?
• Accept a seed PROV graph
• Grow the graph
• Add nodes and relationships following the seed graph
• … with constraints on how to grow
entity(e1, [type="Document",
entity(e2, [type="Document"])
entity(e3, [type="Document"])
activity(a1, [type="create"])
activity(a2, [type="edit"])
activity(a3, [type="edit"])
agent(ag, [type="Person"])
used(a2, e1)
used(a3, e2)
wasGeneratedBy(e2, a2, [fct="save"])
wasGeneratedBy(e1, a1, [fct="publish"])
wasGeneratedBy(e3, a3, [fct="save"])
wasAssociatedWith(a3, ag,
wasAssociatedWith(a2, ag,
wasAssociatedWith(a1, ag,
wasDerivedFrom(e2, e1)
wasDerivedFrom(e3, e2)
[Figure: graph depiction of the seed PROV graph. Node labels: activities of type create and edit; three entities of type Document, one with version: original; an entity of type prov:plan; an agent of type Person.]
ProvGen constraints
an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has
the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2)
unless e1 has relationship "Used" with the Activity(a) and e2 has the
relationship "WasGeneratedBy" with the Activity(a);
an Entity must have relationship "WasGeneratedBy" exactly 1 times;
an Entity must have property("version"="original") with probability 0.05;
an Entity must have out degree at most 2;
an Activity must have relationship "Used" at most 1 times;
an Activity must have property("type"="create") with probability 0.01;
an Activity must have relationship "WasAssociatedWith" exactly 1 times;
an Activity must have relationship "Used" exactly 1 times unless it has
an Activity must have relationship "WasGeneratedBy" exactly 1 times;
an Agent must have relationship "WasAssociatedWith" with probability 0.1;
an Agent must have relationship "WasAssociatedWith" between 1, 120 times with
distribution gamma(1.3, 2.4);
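The cardinality constraints above amount to counting typed edges per node. The Python sketch below checks two of them against the seed graph's generation and usage edges. It is not ProvGen's implementation: the edge encoding and the helper names are assumptions for illustration only.

```python
from collections import Counter

def count_outgoing(edges, relation):
    """Count, per source node, the edges labelled with `relation`."""
    return Counter(src for (rel, src, _dst) in edges if rel == relation)

def violates_exactly(nodes, edges, relation, n):
    """Nodes that do not have exactly n outgoing `relation` edges."""
    counts = count_outgoing(edges, relation)
    return [node for node in nodes if counts[node] != n]

def violates_at_most(nodes, edges, relation, n):
    """Nodes that have more than n outgoing `relation` edges."""
    counts = count_outgoing(edges, relation)
    return [node for node in nodes if counts[node] > n]

entities = ["e1", "e2", "e3"]
activities = ["a1", "a2", "a3"]
edges = [
    ("Used", "a2", "e1"),
    ("Used", "a3", "e2"),
    ("WasGeneratedBy", "e1", "a1"),
    ("WasGeneratedBy", "e2", "a2"),
    ("WasGeneratedBy", "e3", "a3"),
]

print(violates_exactly(entities, edges, "WasGeneratedBy", 1))
print(violates_at_most(activities, edges, "Used", 1))
print(violates_exactly(activities, edges, "Used", 1))
```

Every entity has exactly one generation and no activity uses more than one entity. Only `a1`, which creates a document without using anything, fails a strict "exactly 1" usage count, which is the kind of case the "unless" clauses are there to exempt.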
Some test queries
Generated graph loaded to Neo4J GDBMS
Queries expressed using the Cypher graph query language
Transitive closure over Derivation:
Return all the derivation chains, along with their length:
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b)
RETURN a, b, length(r)
Return the 50 longest derivation chains (of length > 10):
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b)
WHERE length(r) > 10
RETURN a, b, length(r)
ORDER BY length(r) DESC LIMIT 50
All agents and their associated activities:
RETURN a as Agent, b as Activity
All agents who created new documents:
MATCH (a{type:'create'})-[:`WASASSOCIATEDWITH`]->(b)
All agents who edited a document that was derived from an original:
MATCH (doc1{version:'original'}) <-[:WASDERIVEDFROM]- (doc2)
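For readers without a Neo4j instance at hand, the derivation-chain query can be approximated in plain Python. This sketch assumes derivations are given as (derived, source) pairs forming an acyclic graph; `derivation_chains` is an illustrative name. Like a variable-length pattern match, it enumerates every path rather than only the maximal ones.

```python
def derivation_chains(derived_from):
    """Enumerate all derivation paths as (start, end, length) triples.

    derived_from: iterable of (derived, source) pairs, assumed acyclic.
    """
    successors = {}
    for derived, source in derived_from:
        successors.setdefault(derived, []).append(source)

    chains = []

    def walk(start, node, length):
        # Follow every wasDerivedFrom edge leaving `node`.
        for nxt in successors.get(node, []):
            chains.append((start, nxt, length + 1))
            walk(start, nxt, length + 1)

    for start in successors:
        walk(start, start, 0)
    return chains

# Derivations from the seed graph: e2 from e1, e3 from e2.
edges = [("e2", "e1"), ("e3", "e2")]
chains = sorted(derivation_chains(edges), key=lambda c: -c[2])
print(chains[:50])
```

Sorting by descending length and slicing mirrors the ORDER BY ... LIMIT 50 clause. On large graphs the number of paths can grow quickly, which is part of what makes such queries a useful benchmark.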
Provenance of Big Data
Provenance of analytics processes:
“Prediction provenance”
• Train a model → provenance of the model as a record of the training
process and data involved
• Use the model to make predictions → provenance of the prediction
Relations may be given identifiers
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)
gen1 denotes a generation event
use1 denotes a usage event
General derivation relation:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Relation IDs make it possible to refer to relations in other relations
Rendering N-ary relations in PROV-O
RDF is for binary relations; N-ary relations require reification
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01+09:00)
used(use1; ex:commenting, ex:draftV1, -)
:draftComments a prov:Entity ;
prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ;
prov:activity :commenting;
prov:atTime "2013-03-18T10:00:01+09:00".
:commenting a prov:Activity ;
prov:qualifiedUsage :use1 .
:use1 a prov:Usage ;
:note "found comments useful";
prov:atTime "2013-03-21T10:00:01+09:00";
prov:entity :draftV1.
The RDF above follows the “qualified relation” pattern: :gen1 and :use1 are resources that stand for the Generation and Usage relations themselves, so that attributes such as prov:atTime can be attached to them.
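The expansion of an identified generation into the qualified pattern is mechanical. The Python sketch below produces the triples as plain tuples using the PROV-O terms shown in the RDF above; the function name and the tuple representation are assumptions, and no RDF library is involved.

```python
def qualify_generation(gen_id, entity, activity, time=None):
    """Expand wasGeneratedBy(gen_id; entity, activity, time) into
    (subject, predicate, object) triples following the qualified pattern."""
    triples = [
        (entity, "prov:qualifiedGeneration", gen_id),
        (gen_id, "rdf:type", "prov:Generation"),
        (gen_id, "prov:activity", activity),
    ]
    if time is not None:
        # The time is an attribute of the generation itself,
        # which is why the relation needs its own resource.
        triples.append((gen_id, "prov:atTime", time))
    return triples

triples = qualify_generation(
    ":gen1", ":draftComments", ":commenting", "2013-03-18T10:00:01+09:00"
)
for triple in triples:
    print(triple)
```

The same shape applies to usage, with prov:qualifiedUsage, prov:Usage and prov:entity in place of the generation terms.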
Plans — why was something done?
Most relation types are binary, with both arguments drawn from {Entity, Activity, Agent}
Derivation is one exception:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope
wasAssociatedWith(id; a, ag, pl, attrs)
A plan is an entity that represents a set of actions or steps
intended by one or more agents to achieve some goal
Association with a plan
A plan plays a role in an association
Plans are typed entities
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label="Program 1"])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram,
[prov:role="defaultRuntime", ex:accessPath="webapp"])
A plan is an entity having prov:type = 'prov:Plan'
Plan pattern as PROV-O
:_aProgramExecution a prov:Activity ;
:execTime "22.5sec" ;
prov:qualifiedAssociation [ a prov:Association ;
:accessPath "webapp" ;
prov:agent :_aJVM ;
prov:hadPlan :myCleverProgram ;
prov:hadRole "defaultRuntime" ] .
:_aJVM a prov:Agent, "Java-6.0" .
:myCleverProgram a prov:Entity, prov:Plan .
Delegation within an activity scope
Real-world artifacts vs provenance entities
“What do I know about the car I see in this Cambridge street today?”
• It was bought by Joe in 2011
• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
• Joe drove it to Cambridge on March 18th, 2013.
“Same” car, but different provenance at
each stage of its evolution
Alternate-specialization pattern
Two alternate entities present aspects of the same thing. These aspects may be the same or
different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally
presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1)
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
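The three semantic notes can be applied mechanically to a set of statements. The following Python sketch assumes specializationOf and alternateOf are stored as sets of pairs; the function names and the car identifiers are invented for the example.

```python
def close_specialization(spec):
    """Transitive closure of specializationOf pairs (note 3)."""
    closure = set(spec)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def infer_alternates(spec, alt):
    """Apply notes 1 and 2: specialization implies alternate, alternate is symmetric."""
    result = set(alt) | close_specialization(spec)
    result |= {(b, a) for (a, b) in result}
    return result

# Hypothetical entities: the car on a given day, the car in 2013, the car in general.
spec = {("car_2013_03_18", "car_2013"), ("car_2013", "car")}
specializations = close_specialization(spec)
alternates = infer_alternates(spec, set())
print(sorted(specializations))
print(sorted(alternates))
```

The closure adds the statement that the car on 18 March is a specialization of the car in general, and every specialization pair then appears as an alternate in both directions.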
[Figure labels: alternates “differing in their” (truncated); specialization with “same owner, added location”]
Reserved attributes and types
A small set of reserved attributes, with some usage restrictions
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity,
so allowing provenance of provenance to be expressed.
bundle pm:bundle1
wasGeneratedBy(ex:draftComments, ex:commenting, -)
used(ex:commenting, ex:draftV1, -)
endBundle
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)
Bundles in PROV-O
Bundle definition (an RDF named graph):
ex:bundle1 {
:draftComments a prov:Entity ;
:status "blah" ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity .
}
Bundle usage:
ex:bundle1 a prov:Entity, prov:Bundle ;
prov:qualifiedGeneration [ a prov:Generation ;
prov:atTime "2013-03-20T10:30:00+09:00" ] ;
prov:wasAttributedTo :Bob .
PROV-DM relations at a glance
Component Structure for PROV
Core vs Extended
Core Extended
Time, Events
wasStartedBy(id; a2, e, a1, t, attrs)
wasEndedBy(id; a2, e, a1, t, attrs)
- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single global timeline
Therefore, PROV minimizes assumptions about time
Instead, the PROV data model is implicitly based on a notion of instantaneous events that mark transitions in the world (*):
- activity start, activity end
- entity generation, entity usage, entity invalidation
(*) PROV-CONSTR (non-normative)
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
More generally, how do we formally define what it means for a set of
provenance statements to be valid?
PROV defines a set of temporal constraints that ensure consistency
of a provenance graph
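One way to picture such a validity check: each constraint contributes an "event x strictly precedes event y" edge, and a set of statements is rejected when those edges form a cycle. The Python sketch below is a simplification under that assumption. The event labels and the function name are hypothetical, and the PROV constraints document distinguishes more cases than this (for instance strict and non-strict precedence).

```python
def has_ordering_cycle(precedes):
    """Detect a cycle in a 'strictly precedes' relation.

    precedes: iterable of (earlier_event, later_event) pairs.
    """
    graph = {}
    for earlier, later in precedes:
        graph.setdefault(earlier, set()).add(later)
        graph.setdefault(later, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: the ordering contradicts itself
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)

# Generation of draftV1 precedes its usage by commenting.
valid = [("gen:draftV1", "use:commenting:draftV1")]
# A 'scruffy' record that also claims the usage came first.
scruffy = valid + [("use:commenting:draftV1", "gen:draftV1")]

print(has_ordering_cycle(valid), has_ordering_cycle(scruffy))
```

No timestamps are needed: the check works on the relative order implied by the statements, which fits PROV's minimal assumptions about time.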
• Motivation for collecting provenance of data and information
• In Science
• In the Social Web
• The W3C PROV Recommendation (2013)
• PROV-DM: The PROV data model
• PROV-O: the Provenance Ontology
• Provenance as Big Data
• High volume provenance
• Storage, analytics, visualisation
• Provenance of analytics
• How can I explain my predictions?
• The ProvGen tool
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
ProvGen: generating synthetic PROV graphs with predictable structure.
Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer
ProvAbs: model, policy, and tooling for abstracting PROV graphs.
Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and
Annotations), Koln, Germany, 2014. Springer
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance
Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh,
Scotland: USENIX Association, 2015.

More Related Content

What's hot

Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationStefan Dietze
LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)Stefan Dietze
VALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataVALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataPeter Neish
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesLearning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesStefan Dietze
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Stefan Dietze
The SFX Framework for Context-Sensitive Reference Linking
The SFX Framework for  Context-Sensitive Reference LinkingThe SFX Framework for  Context-Sensitive Reference Linking
The SFX Framework for Context-Sensitive Reference LinkingHerbert Van de Sompel
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueHerbert Van de Sompel
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedStefan Dietze
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017petermurrayrust
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebMathieu d'Aquin
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataStefan Dietze

What's hot (20)

Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & Education
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
VALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataVALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open Data
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesLearning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
Data hv seminar_thadthong_v05_slshr
Data hv seminar_thadthong_v05_slshrData hv seminar_thadthong_v05_slshr
Data hv seminar_thadthong_v05_slshr
Broad Data
Broad DataBroad Data
Broad Data
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Keystone summer school 2015 paolo-missier-provenance

  • 1. First Keystone Summer School – Malta, July 2015 – P. Missier
    Provenance and the W3C PROV model (in the Big Data context)
    Paolo Missier, School of Computing Science, Newcastle University, UK
    Tutorial, First Keystone Summer School, Malta, July 2015
    Some of the slides courtesy of Luc Moreau – thanks!
  • 3. Lecture goals and outline
      • What is provenance, and why does it matter?
      • Definitions and case studies
      • The W3C PROV standard in a nutshell
      • PROV-O: the Provenance Ontology and examples of its usage
      • Provenance and Big Data: what’s the connection?
      • Opportunities and challenges
      • Provenance tools [from Southampton]
  • 5. 1- Reproducibility and dissemination in Science. Independent validation of scientific claims is a cornerstone of experimental science • Scientific claims are supported by experiments • How do I express my “material and methods” so that you can independently verify my results? • How do I document my results to promote their understanding and reuse? Provenance is the equivalent of a logbook • Capture all steps involved in the derivation of a result • Replay, validate the execution, compare it with others. To what extent can these be formalised and automated in data-intensive science?
  • 6. 2- Explaining the outcome of a complex decision process • Which process was used to derive a diagnosis? • How did the process use the input data? • How were the steps configured? • Which decisions were made by human experts (clinicians)? [Figure: NGS pipeline for the clinical diagnosis of genetic diseases. Variant filtering (MAF threshold; non-synonymous, stop/gain, frameshift; known polymorphisms; homo/heterozygous; pathogenicity predictors), then variant scoping (HPO match, HPO to OMIM, OMIM match, OMIM to Gene, gene union and gene intersect over the user-supplied genes list, user-supplied disease keywords and user-defined preferred genes; select variants in scope), then variant classification of annotated patient variants via ClinVar and OMIM lookup (RED: found, pathogenic; AMBER: not found or uncertain; GREEN: found, benign).]
  • 7. 3- Understanding the results of a computation • Why has my [very complicated algorithm] produced this particular result? • Why is my predictive analytics model suggesting that it will rain tomorrow? • Why is this record part of the result of my database query? • Database provenance • Why is this record included in the result of my keyword search?
  • 8. 4- Content reuse on the Social Web. Open Data, Data Journalism • A consume-select-curate-share workflow, not only professional • Ethos: to expose the data and methods used to produce news items • But: data wrangling can introduce errors • Is the data I am using valid? What is its primary source? What are the transformation steps? Example: NowNews publishes an article based on the latest employment data published by GovStat; PolicyOrg compiles a report including the NowNews article. :L-Moreau a Agent. :original-slide a Entity; wasAttributedTo L-Moreau. :this-slide a Entity; wasDerivedFrom original-slide
  • 9. What is provenance? Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • a record of the passage of an item through its various owners. Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
  • 10. What is provenance? Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*). Provenance is a description of how things came to be, and how they came to be in the state they are in today (*). Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world.
  • 11. Provenance on the Web. Tim Berners-Lee’s “Oh Yeah” button: • A browser button by which the user can express their uncertainty about a document being displayed: “so how do I know I can trust this information?” • Upon activation of the button, the software then retrieves metadata about the document, listing assumptions on which trust can be based. Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22-26, 2013, Kyoto, Japan
  • 12. Provenance in the Semantic Web Stack
  • 13. Use cases on the Social Web. Open Data, Data Journalism. NowNews publishes an article based on the latest employment data published by GovStat; PolicyOrg compiles a report including the NowNews article. Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
  • 14. Derivation - Timeliness. Derivation: • Charts, graphs and visualizations are all based on multiple data sets • E.g. Bob’s article on employment that appeared in NowNews • Which data was a figure based upon? Is the report based on the most up-to-date data? Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
  • 15. Derivation - Trusted sources. Policy rule: “data supplied by the government is reliable”. Derivation: • Is this content derived from data coming from a reliable source? • The chart within Bob’s article is based on GovStat data • However, that information is hidden: • the chart was produced by a complex process performed by Alice. Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
  • 16. Tracing the source of errors. Nick discovers an error in the chart included in Bob’s article. Derivation, attribution: • When did this error occur? • Who was responsible for the chart? now:employment-article-v1.html prov:wasAttributedTo nowpeople:Bob
  • 17. Ensuring policy compliance. Policy rule: “posts are to be checked by an editor prior to publication”. Process inspection: • Which process steps led to publication? • Was editorial check part of it?
  • 18. Ensuring credit and acknowledgement. NowNews relies on multiple contributors. Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master. Attribution and responsibility: • How do we ensure that all relevant contributors are acknowledged? [Diagram labels: employment-article-v1.html, David, Bob; edges: att (attribution), del (delegation).]
  • 19. Reproducibility. NowNews must ensure that the article figures reflect the most recent data. Documenting the data generation process: • How do we ensure that the figures can be reproduced using the new versions of the data? Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master. [Diagram: the data-crunching activity uses data-source-A and data-source-B, is associated with Alice, and generates employment-article-v1.html; labels: version: 1.0, version: 2.0.]
  • 20. So, why does provenance matter? • To establish quality, relevance, trust • To track information attribution through complex transformations • To enable process analysis for debugging, improvement, evolution • To enable reproducibility of processes (e.g. in science, data journalism…) See also: ACM Journal of Data and Information Quality (JDIQ), Special Issue on Provenance, Data and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5, Issue 3, February 2015. DOI: 10.1145/2692312
  • 21. The W3C Working Group on Provenance • 2009-2010: W3C Incubator Group on provenance. Chair: Yolanda Gil, ISI, USC. Main output: “Provenance XG Final Report”, which provides an overview of the various existing approaches and vocabularies, and proposes the creation of a dedicated W3C Working Group • April 2011: W3C Working Group approved. Chairs: Luc Moreau, Paul Groth • April 2013: Proposed Recommendations finalised: prov-dm (data model), prov-o (OWL ontology, RDF encoding), prov-n (PROV notation), prov-constraints, plus a number of non-prescriptive Notes
  • 22. PROV: scope and structure. source: Recommendation track. See also: Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.
  • 23. PROV Core Elements (graph depiction). An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary. An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities. An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
  • 24. Generation, Usage. Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation. Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity. PROV is based on a notion of instantaneous events that mark transitions in the world: generation, usage (and others). Ordering constraints amongst events: “generation of e must precede each of its usages”; “a can only use / generate e after it has started and before it has ended”.
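The two ordering constraints quoted on this slide can be checked mechanically. The following is a minimal Python sketch that is not part of the original slides: the dictionary and tuple encodings, the function name `ordering_violations`, and the timestamps are all illustrative assumptions. It also assumes every event carries a comparable timestamp, whereas PROV itself only requires a partial order.

```python
from datetime import datetime

def ordering_violations(generations, usages, activities):
    """Return human-readable violations of two PROV ordering constraints.

    generations: {entity: (activity, time)}
    usages:      [(activity, entity, time), ...]
    activities:  {activity: (start_time, end_time)}
    All names and structures here are hypothetical, for illustration only.
    """
    problems = []
    # Constraint 1: generation of e must precede each of its usages.
    for act, ent, t_use in usages:
        if ent in generations and generations[ent][1] > t_use:
            problems.append(f"{ent} used by {act} before it was generated")
    # Constraint 2: an activity can only use/generate within its own lifetime.
    events = [(a, t, "usage") for a, _, t in usages]
    events += [(a, t, "generation") for a, t in generations.values()]
    for act, t, kind in events:
        start, end = activities[act]
        if not (start <= t <= end):
            problems.append(f"{kind} by {act} falls outside its start/end")
    return problems

def t(hour):
    """Hypothetical helper: a timestamp on 18 March 2013 at the given hour."""
    return datetime(2013, 3, 18, hour)

activities = {"drafting": (t(8), t(10)), "commenting": (t(11), t(12))}
generations = {"draftV1": ("drafting", t(9))}
ok = ordering_violations(generations, [("commenting", "draftV1", t(11))], activities)
bad = ordering_violations(generations, [("commenting", "draftV1", t(8))], activities)
```

Here `ok` is empty, while `bad` reports both that draftV1 was used before it was generated and that the usage falls outside the lifetime of the commenting activity.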
  • 25. Concepts and relations. Generation of “draft v1” expressed as a relation: wasGeneratedBy(“draft v1”, ...). Usage of “draft v1” by “commenting” expressed as a relation: used(“commenting”, “draft v1”, ...)
  • 26. PROV notation
      document
        prefix prov <>
        prefix ex <>
        entity(ex:draftComments)
        entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"])
        entity(ex:paper1)
        entity(ex:paper2)
        activity(ex:commenting)
        activity(ex:drafting)
        wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
        used(ex:commenting, ex:draftV1, -)
        wasGeneratedBy(ex:draftV1, ex:drafting, -)
        used(ex:drafting, ex:paper1, -)
        used(ex:drafting, ex:paper2, -)
      endDocument
  • 27. Same example: PROV-O notation (RDF/N3)
      :draftComments a prov:Entity ;
          :distr "internal"^^xsd:string ;
          prov:wasGeneratedBy :commenting .
      :commenting a prov:Activity ;
          prov:used :draftV1 .
      :draftV1 a prov:Entity ;
          :distr "internal"^^xsd:string ;
          :status "draft"^^xsd:string ;
          :version "0.1"^^xsd:string ;
          prov:wasGeneratedBy :drafting .
      :drafting a prov:Activity ;
          prov:used :paper1, :paper2 .
      :paper1 a prov:Entity, "reference"^^xsd:string .
      :paper2 a prov:Entity, "reference"^^xsd:string .
  • 28. Association, Attribution, Delegation: who did what? An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. Attribution is the ascribing of an entity to an agent.
      entity(ex:draftComments, [ ex:distr='internal' ])
      activity(ex:commenting)
      agent(ex:Bob, [prov:type = "mainEditor"])
      agent(ex:Alice, [prov:type = "srEditor"])
      wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
      actedOnBehalfOf(ex:Bob, ex:Alice)
      wasAttributedTo(ex:draftComments, ex:Bob)
  • 29. Same example: PROV-O notation (RDF/N3)
      :Alice a prov:Agent, "ex:chiefEditor" ;
          :firstName "Alice" ;
          :lastName "Cooper" .
      :Bob a prov:Agent, "ex:seniorEditor" ;
          :firstName "Robert" ;
          :lastName "Thompson" ;
          prov:actedOnBehalfOf :Alice .
      :draftComments prov:wasAttributedTo :Bob .
      :drafting a prov:Activity ;
          prov:wasAssociatedWith :Bob .
  • 30. Association and Attribution. Q.: what is the relationship between attribution and association? This is defined as an inference rule in the PROV-CONSTR document.
      entity(e)
      agent(Ag)
      activity(a)
      wasAttributedTo(e, Ag)
      wasGeneratedBy(e, a, -)
      wasAssociatedWith(a, Ag, -)
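One reading of this rule, following the attribution inference in PROV-CONSTRAINTS, is: if an entity is attributed to an agent, then some activity exists that generated the entity and was associated with that agent. The Python sketch below illustrates that reading only; the tuple encoding of statements, the function name, and the `_:aN` placeholder names for the existentially introduced activity are assumptions made here, not part of the slides or of any PROV library.

```python
from itertools import count

def infer_from_attribution(statements):
    """Expand each wasAttributedTo(e, ag) into the generation and association
    it implies, inventing a placeholder activity when none is already known.

    statements: set of 3-tuples such as ("wasAttributedTo", "e", "ag").
    The tuple encoding and the "_:aN" placeholder names are assumptions.
    """
    fresh = count()
    inferred = set(statements)
    for stmt in sorted(statements):
        if stmt[0] != "wasAttributedTo":
            continue
        _, e, ag = stmt
        # Is there already an activity that generated e and involved ag?
        known = any(
            s[0] == "wasAssociatedWith" and s[2] == ag
            and ("wasGeneratedBy", e, s[1]) in statements
            for s in statements
        )
        if not known:
            a = f"_:a{next(fresh)}"  # existential placeholder activity
            inferred.add(("wasGeneratedBy", e, a))
            inferred.add(("wasAssociatedWith", a, ag))
    return inferred

facts = {("wasAttributedTo", "ex:draftComments", "ex:Bob")}
closed = infer_from_attribution(facts)

complete = facts | {
    ("wasGeneratedBy", "ex:draftComments", "ex:commenting"),
    ("wasAssociatedWith", "ex:commenting", "ex:Bob"),
}
unchanged = infer_from_attribution(complete)
```

With only the attribution given, the sketch adds a generation and an association through the placeholder activity; when the generating activity and its association are already stated, nothing is added.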
  • 31. Communication amongst activities. Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.
      activity(ex:commenting)
      activity(ex:drafting)
      wasInformedBy(ex:commenting, ex:drafting)

      :drafting a prov:Activity .
      :commenting a prov:Activity ;
          prov:wasInformedBy :drafting .
  • 32. Communication, generation, usage. Q.: what is the relationship between communication, generation, and usage? These are inference rules 5 and 6 in the PROV-CONSTR document.
      activity(ex:commenting)
      activity(ex:drafting)
      entity(e)
      wasInformedBy(ex:commenting, ex:drafting)
      wasGeneratedBy(e, ex:drafting, -)
      used(ex:commenting, e, -)
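As these two rules are usually read, one direction says that wasInformedBy(a2, a1) implies the existence of some entity generated by a1 and used by a2, and the other says that such a generation and usage pair implies wasInformedBy(a2, a1). The Python sketch below illustrates only the second direction; the pair encodings and the function name `infer_communication` are assumptions made for illustration.

```python
def infer_communication(generated_by, used):
    """Derive wasInformedBy(a2, a1) pairs: a2 used an entity generated by a1.

    generated_by: iterable of (entity, activity) pairs
    used:         iterable of (activity, entity) pairs
    The pair encodings are an assumption made for this sketch.
    """
    generator_of = {}
    for entity, activity in generated_by:
        generator_of.setdefault(entity, set()).add(activity)
    informed = set()
    for a2, entity in used:
        for a1 in generator_of.get(entity, ()):
            informed.add((a2, a1))
    return informed

pairs = infer_communication(
    generated_by=[("e", "ex:drafting")],
    used=[("ex:commenting", "e")],
)
```

For the slide's example, `pairs` contains the single pair (ex:commenting, ex:drafting), matching wasInformedBy(ex:commenting, ex:drafting).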
  • 35. Derivation amongst entities. A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity. Q.: what is the relationship between derivation, generation, and usage?
      entity(ex:draftV1)
      entity(ex:draftComments)
      wasDerivedFrom(ex:draftComments, ex:draftV1)

      :draftComments a prov:Entity ;
          prov:wasDerivedFrom :draftV1 .
      :draftV1 a prov:Entity .
  • 36. Provenance and Big Data: what’s the connection? Opportunities and challenges
  • 37. Provenance {as,of} Big Data 1. BigProv: Provenance as big data • High volume provenance • What kind of analytics are interesting on big provenance? 2. Provenance of analytics processes • “Prediction provenance” • Train a model → provenance of the model as a record of the training process and data involved • Use the model to make predictions → provenance of the prediction 3. Provenance of a search • What is the provenance of a keyword search? • Why would it be interesting? What can we learn from it?
  • 38. Recent research on Provenance as Big Data
      • Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85
      • Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86
      • Peter Macko and Margo Seltzer, Harvard University. Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs. Procs. TAPP’11, Crete, Greece, 2011
      • Devarshi Ghoshal and Beth Plale. Provenance from Log Files: a BigData Problem. Procs. BigProv workshop, EDBT, Genova, Italy, 2013
      • Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. Procs. TAPP’15 workshop, Edinburgh, 2015
  • 39. ProvGen • A Provenance Generator tool for experimenting with provenance at scale • Why generate synthetic provenance? • Synthetic PROV graphs can be a valuable complement to emerging natural provenance collections • … provided their structural properties reflect specific provenance patterns • control over their repetition and variability • varying scales • Useful for benchmarking emerging provenance management systems • Useful to test analytics algorithms that operate on large provenance collections [Figure: natural provenance collections plotted by trace size against number of traces: science datasets, git2PROV, MediaWiki history, retweet history.] Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014.
  • 40. What does ProvGen do? • Accept a seed PROV graph • Grow the graph • Add nodes and relationships following the seed graph structure • … with constraints on how to grow
      document
        entity(e1, [type="Document", version="original"])
        entity(e2, [type="Document"])
        entity(e3, [type="Document"])
        activity(a1, [type="create"])
        activity(a2, [type="edit"])
        activity(a3, [type="edit"])
        agent(ag, [type="Person"])
        used(a2, e1)
        used(a3, e2)
        wasGeneratedBy(e2, a2, [fct="save"])
        wasGeneratedBy(e1, a1, [fct="publish"])
        wasGeneratedBy(e3, a3, [fct="save"])
        wasAssociatedWith(a3, ag, [role="contributor"])
        wasAssociatedWith(a2, ag, [role="contributor"])
        wasAssociatedWith(a1, ag, [role="creator"])
        wasDerivedFrom(e2, e1)
        wasDerivedFrom(e3, e2)
      endDocument
      [Figure: graph rendering of the seed document above, with activities a1 (create), a2 and a3 (edit), Document entities e1 (version: original), e2 and e3, agent ag (Person), a plan node (type: prov:plan), and use, gen, der and assoc edges.]
  • 41. ProvGen constraints
      an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has property("version"="original");
      the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2) unless e1 has relationship "Used" with the Activity(a) and e2 has the relationship "WasGeneratedBy" with the Activity(a);
      an Entity must have relationship "WasGeneratedBy" exactly 1 times;
      an Entity must have property("version"="original") with probability 0.05;
      an Entity must have out degree at most 2;
      an Activity must have relationship "Used" at most 1 times;
      an Activity must have property("type"="create") with probability 0.01;
      an Activity must have relationship "WasAssociatedWith" exactly 1 times;
      an Activity must have relationship "Used" exactly 1 times unless it has property("type"="create");
      an Activity must have relationship "WasGeneratedBy" exactly 1 times;
      an Agent must have relationship "WasAssociatedWith" with probability 0.1;
      an Agent must have relationship "WasAssociatedWith" between 1, 120 times with distribution gamma(1.3, 2.4);
  • 42. Some test queries. The generated graph is loaded into the Neo4j graph DBMS; queries are expressed in the Cypher graph query language.
      Transitive closure over derivation: return all the derivation chains, along with their length
        MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) RETURN a, b, length(r)
      Return the 50 longest derivation chains (among those longer than 10)
        MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) WHERE length(r) > 10 RETURN a, b, length(r) ORDER BY length(r) DESC LIMIT 50
      All agents and their associated activities
        MATCH (a)-[:`WASASSOCIATEDWITH`]->(b) RETURN b AS Agent, a AS Activity
      All agents who created new documents
        MATCH (a {type:'create'})-[:`WASASSOCIATEDWITH`]->(b) RETURN a, b LIMIT 25
      All agents who edited a document that was derived from an original
        MATCH (doc1 {version:'original'})<-[:`WASDERIVEDFROM`]-(doc2)-[:`WASGENERATEDBY`]->(act)-[:`WASASSOCIATEDWITH`]->(agent) RETURN agent LIMIT 25
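To make clear what the first two queries compute, here is a small Python sketch of the same idea outside a graph database: it enumerates every derivation path and its length, then keeps the longest ones. It is an illustration only and not how ProvGen or Neo4j evaluate the query; the adjacency-dictionary encoding and the function name `derivation_chains` are assumptions, and the sketch assumes the derivation graph has no cycles.

```python
def derivation_chains(derived_from):
    """Enumerate every derivation path as (start, end, length).

    derived_from: {entity: [entities it was directly derived from]}
    Assumes the graph is acyclic; the encoding is hypothetical.
    """
    chains = []

    def walk(start, node, length):
        # Follow each wasDerivedFrom edge and record the path reached so far.
        for parent in derived_from.get(node, []):
            chains.append((start, parent, length + 1))
            walk(start, parent, length + 1)

    for entity in derived_from:
        walk(entity, entity, 0)
    return chains

# Same shape as the ProvGen seed graph: e3 derived from e2, e2 from e1.
graph = {"e3": ["e2"], "e2": ["e1"]}
chains = derivation_chains(graph)
# Analogue of ORDER BY length DESC LIMIT 50.
longest = sorted(chains, key=lambda c: c[2], reverse=True)[:50]
```

On this three-entity graph the sketch finds three chains, the longest being the two-step chain from e3 back to e1.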
  • 43. Provenance of Big Data. Provenance of analytics processes: “Prediction provenance” • Train a model → provenance of the model as a record of the training process and data involved • Use the model to make predictions → provenance of the prediction
  • 44. Relations may be given identifiers
      entity(ex:draftComments)
      entity(ex:draftV1)
      activity(ex:commenting)
      wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
      used(use1; ex:commenting, ex:draftV1, -)
      gen1 denotes a generation event; use1 denotes a usage event. Relation IDs make it possible to refer to relations in other relations. General derivation relation: wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
  • 45. Rendering N-ary relations in PROV-O. RDF is for binary relations; N-ary relations require reification.
      entity(ex:draftComments)
      entity(ex:draftV1)
      activity(ex:commenting)
      wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01)
      used(use1; ex:commenting, ex:draftV1, -)

      :draftComments a prov:Entity ;
          prov:qualifiedGeneration :gen1 .
      :gen1 a prov:Generation ;
          prov:activity :commenting ;
          prov:atTime "2013-03-18T10:00:01+09:00" .
      :commenting a prov:Activity ;
          prov:qualifiedUsage :use1 .
      :use1 a prov:Usage ;
          :note "found comments useful" ;
          prov:atTime "2013-03-21T10:00:01+09:00" ;
          prov:entity :draftV1 .
  • 46. “Qualified relation” RDF pattern
      :draftComments a prov:Entity ;
          prov:qualifiedGeneration :gen1 .
      :gen1 a prov:Generation ;
          prov:activity :commenting ;
          prov:atTime "2013-03-18T10:00:01+09:00" .
      :commenting a prov:Activity ;
          prov:qualifiedUsage :use1 .
      :use1 a prov:Usage ;
          :note "found comments useful" ;
          prov:atTime "2013-03-21T10:00:01+09:00" ;
          prov:entity :draftV1 .
  • 47. Plans: why was something done? Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent. Derivation is one exception: wasDerivedFrom(id; e2, e1, a, g2, u1, attrs). Two other notable exceptions: associations with a plan, and delegation with an activity scope. wasAssociatedWith(id; a, ag, pl, attrs). A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal.
  • 48. Association with a plan. A plan plays a role in an association
  • 49. Plans are typed entities. A plan is an entity having prov:type = 'prov:Plan'.
      activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
      agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
      entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])
      wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath="webapp"])
  • 50. Plan pattern as PROV-O
      activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
      agent(ex:_aJVM, [prov:type = 'JVM-6.0'])
      entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label='Program 1'])
      wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role='defaultRuntime', ex:accessPath='webapp'])

      :_aProgramExecution a prov:Activity ;
          :execTime "22.5sec" ;
          prov:qualifiedAssociation [
              a prov:Association ;
              :accessPath "webapp" ;
              prov:agent :_aJVM ;
              prov:hadPlan :myCleverProgram ;
              prov:hadRole "defaultRuntime" ] .
      :_aJVM a prov:Agent, "Java-6.0" .
      :myCleverProgram a prov:Entity, prov:Plan .
  • 51. Plan pattern as PROV-O
      :_aProgramExecution a prov:Activity ;
          :execTime "22.5sec" ;
          prov:qualifiedAssociation [
              a prov:Association ;
              :accessPath "webapp" ;
              prov:agent :_aJVM ;
              prov:hadPlan :myCleverProgram ;
              prov:hadRole "defaultRuntime" ] .
      :_aJVM a prov:Agent, "Java-6.0" .
      :myCleverProgram a prov:Entity, prov:Plan .
  • 53. Real-world artifacts vs provenance entities. ref: “What do I know about the car I see in this Cambridge street today?” • It was bought by Joe in 2011 • Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it • Joe drove it to Cambridge on March 18th, 2013. “Same” car, but different provenance at each stage of its evolution
  • 54. Alternate-specialization pattern. ...But, this is still that car! Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time (diagram label: differing in their location). An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter (diagram label: same owner, added location). Semantic notes: 1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2). 2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1). 3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
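The three semantic notes on this slide can be applied mechanically to a set of specializationOf statements. The Python sketch below is one illustrative way to do so; the pair encoding, the function name `close_alternates`, and the entity names in the usage example are hypothetical and chosen to echo the car example.

```python
def close_alternates(specializations):
    """Apply the three semantic notes to a set of specializationOf(e1, e2) pairs.

    Returns (specialization_closure, alternate_pairs).
    The pair encoding is an assumption made for this sketch.
    """
    spec = set(specializations)
    # Note 3: specialization is transitive; iterate until nothing new appears.
    changed = True
    while changed:
        changed = False
        for (a, b) in list(spec):
            for (c, d) in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    # Note 1: specialization implies alternate; note 2: alternate is symmetric.
    alt = set()
    for (a, b) in spec:
        alt.add((a, b))
        alt.add((b, a))
    return spec, alt

spec, alt = close_alternates({
    ("car-in-cambridge", "car-owned-by-joe"),
    ("car-owned-by-joe", "car"),
})
```

From the two stated specializations, the sketch derives that the car in Cambridge is also a specialization of the car itself, and that each specialization pair is an alternate in both directions.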
  • 55. Reserved attributes and types. A small set of reserved attributes, with some usage restrictions
  • 56. Bundles, provenance of provenance. A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.
      bundle pm:bundle1
        entity(ex:draftComments)
        entity(ex:draftV1)
        activity(ex:commenting)
        wasGeneratedBy(ex:draftComments, ex:commenting, -)
        used(ex:commenting, ex:draftV1, -)
      endBundle
      ...
      entity(pm:bundle1, [ prov:type='prov:Bundle' ])
      agent(ex:Bob)
      wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
      wasAttributedTo(pm:bundle1, ex:Bob)
  • 57. Bundles in PROV-O
      Bundle definition (an RDF named graph):
      ex:bundle1 {
          :draftComments a prov:Entity ;
              :status "blah" ;
              prov:wasGeneratedBy :commenting .
          :commenting a prov:Activity ;
              prov:used :draftV1 .
          :draftV1 a prov:Entity .
      }
      Bundle usage:
      ex:bundle1 a prov:Entity, "prov:Bundle" ;
          prov:qualifiedGeneration [
              a prov:Generation ;
              prov:atTime "2013-03-20T10:30:00+09:00" ] ;
          prov:wasAttributedTo :Bob .
  • 61. Time, Events. Provenance statements are combined by different systems, and an application may not be able to align the times involved to a single global timeline. Therefore, PROV minimizes assumptions about time. Instead, the PROV data model is implicitly based on a notion of instantaneous events that mark transitions in the world (*). Events: activity start, activity end, entity generation, entity usage, entity invalidation. wasStartedBy(id; a2, e, a1, t, attrs) wasEndedBy(id; a2, e, a1, t, attrs) (*) PROV-CONSTR (non-normative)
  • 62. From “scruffy” provenance to “valid” provenance. Are all possible temporal partial orderings of events equally acceptable? How can we specify the set of all valid orderings? More generally, how do we formally define what it means for a set of provenance statements to be valid? PROV defines a set of temporal constraints that ensure consistency of a provenance graph
  • 63. Summary • Motivation for collecting provenance of data and information • In Science • In the Social Web • The W3C PROV Recommendation (2013) • PROV-DM: The PROV data model • PROV-O: the Provenance Ontology • (PROV-CONSTRAINTS) • Provenance as Big Data • High volume provenance • Storage, analytics, visualisation • Provenance of analytics • How can I explain my predictions? • The ProvGen tool
  • 64. Selected bibliography Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. ProvGen: generating synthetic PROV graphs with predictable structure. Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer ProvAbs: model, policy, and tooling for abstracting PROV graphs. Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. program/presentation/de-oliveira.

Editor's Notes

  1. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  2. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  3. For many of its articles, NowNews relies on the integration of multiple data sources. In order to ensure correct credit is given, NowNews wants to provide a central acknowledgments list that recognizes all the people and data sources that contribute to all the various articles and information that it publishes.
  4. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  5. W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
  6. remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations. Working Group Note A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
  7. Alice, a senior editor, produces draft V1 of a document after reading papers paper1 and paper2. V1 is for internal distribution only. Later Bob, who is the main editor and works for Alice, comments on the draft, producing a new document, draft comments.
  8. duality between elements (generation) and relations (wasGeneratedBy)
  9. baseline-noAgents.provn
  10. baseline-noAgents-unqual.n3
  11. baseline-noAgents.provn. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  12. baseline-noAgents-unqual.n3. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  13. Agents are software, organization, person (non-normative). Distinguish between normative and non-normative parts of the PROV documents. Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum.
  14. mention that derivation is missing -- this requires more insight into relation IDs
  15. Most relations admit optional arguments (e.g. time). First-class arguments may be optional, too: for instance, generation with an implicit activity. Often only some combinations of arguments are legal.
  16. A single (real-world) artifact may correspond to several entities in a provenance model that includes descriptions of that artifact. The choice of mapping is determined by which characteristics of the artifact are relevant for a specific provenance description of it. Whenever one of these attributes changes, a new entity is created, e.g. the document before and after editing: some characteristics that are relevant for provenance have changed.
  17. These entities are, however, related. These relationships can be expressed in PROV.
  18. ... and I could have bundles that refer to other bundles...