SlideShare a Scribd company logo
1 of 31
Research Objects: Preserving
Scientific Workflows and
Provenance
Khalid Belhajjame
Université Paris Dauphine 1
Storyline
• Why Research Objects?
• Overview of the Research Objects
• Portfolio of Research Object Management
Tools
• Provenance Distillation Through Workflow
Summarization
2
Electronic papers are not enough
3
Electronic
paper
Electronic papers are not enough
4
Research
Object
Datasets
Results
Scientists
Hypothesis
Experiments
Annotations
Provenance
Electronic
paper
Benefits Of Research Objects
• A research object aggregates all elements
that are necessary to understand research
investigations.
• Methods (experiments) are viewed as first
class citizens
• Promote reuse
• Enable the verification of reproducibility of
the results
5
Reproducibility
a principle of the
scientific method
normal people
and
scientist
6
http://xkcd.com/242/
47 of 53
“landmark”
publications
could not be
replicated
Inadequate cell lines
and animal models
Nature, 483, 2012
Credit to Carole Goble JCDL 2012 Keynote
7
Fig. 2. Overview of the Research Object Model with the different ontologies that encode it: Core RO for describing the basic
Research Object structure, roevo for tracking Research Object evolution and wfprov and wfdesc for describing workflows and
Overview of the Research Object
Model
8
Research Object as an ORE
Aggregation
9
Fig. 3. Research Object as an ORE aggregation.
to annotation; and ao: Body, which comprises a
description of the target.
Research Objects use annotations as a means for
decorating a resource (or a set of resources) with
metadata information. The body is specified in the
@base <ht t p : / / exampl e. com/ r o/ 389/ > .
@pr ef i x dct : <ht t p: / / pur l . or g/ dc / t er ms / >
@pr ef i x or e:
<ht t p : / / www. openar chi ves . or g/ or e/ t er ms
Scientific Workflows
• Data driven analysis pipelines
• Systematic gathering of data and
analysis tools into computational
solutions for scientific problem-solving
• Tools for automating frequently
performed data intensive activities
• Provenance for the resulting datasets
– The method followed
– The resources used
– The datasets used
10
PROV Primer, Gil et al
WF Execution Trace
Retrospective Provenance:
Actual data used, actual
invocations, timestamps and data
derivation trace
WF Description
Prospective Provenance:
Intended method for analysis
11
Specifying Workflows using WfDESC
12
Fig. 5. T he wfdesc ontology and its relation to PROV-O.
21
Specifying Workflow Provenance using
WfPROV
13
Fig. 7. T he wfprov ontology and its relationship to PROV-O.
Portfolio of Research Object Tools
14
DEMO
15
Workflows can get complex!
• Overwhelming for users who are not
the developers
• Abstractions required for reporting
• Lineage queries result in very long
trails
16
Overall Approach
17
Workflow
Designer
Taverna
Workbench
Motif
WF Summary
WF Description
Summarizer
Summarization
Rules
PART-1: Scientific Workflow
Motifs
• Domain Independent
categorization
– Data-Oriented Nature
– Resource/Implementation-
Oriented Nature
• Captured In a lightweight
OWL Ontology
18
http://purl.org/net/wf-motifs
Motif annotations over operations
motifs(color_pathway_by_objects) = {m1:DataRetrieval}
motifs(Get_Image_From_URL_2) = {m2:DataMoving} 19
DataRetrieval
DataMovingl
PART-2: Workflow reduction primitives
• Collapse (Up/Down)
• Compose
• Eliminate
20
Eliminate
21
Two sample strategies
• By-Elimination
– Minimal annotation effort
– Single rule
• By Collapse
– More specific annotation
– Multiple rules
22
By-Collapse23
By-Elimination24
Analysis Data Set
• 30 Workflows from the Taverna system
• Entire dataset & queries accessible from
http://www.myexperiment.org/packs/467.html
• Manual Annotation using Motif Vocabulary
25
Mechanistic Effect of Summarization
26
User Summaries vs. Summary Graphs
27
Highlights
• Research Object model and associated
management tools
• Annotations of Workflow Using Motifs
• Methods for Summarizing Workflow and distilling
their provenance traces
• Algorithms for Repairing Workflows
• Validation of the workflow summarization
• Querying of Workflow Execution Provenance
using summaries.
28
Ongoing Work
References
• P Alper, K Belhajjame, C Goble, P Karagoz. Small Is Beautiful: Summarizing Scientific
Workflows Using Semantic Annotations. IEEE International Congress on Big Data, 2013
• Pinar Alper, Khalid Belhajjame, Carole A. Goble, Pinar Karagoz: Enhancing and abstracting
scientific workflow provenance for data publishing. EDBT/ICDT Workshops 2013.
• Belhajjame K, Corcho O, Garijo D, et al. Workflow-Centric Research Objects: A First Class
Citizen in the Scholarly Discourse. In proceedings of the ESWC2012 Workshop on the Future
of Scholarly Communication in the Semantic Web (SePublica2012), Heraklion, Greece, May
2012
• Belhajjame K, Zhao J, Garijo D, et al. The Research Object Suite of Ontologies: Sharing and
Exchanging Research Data and Methods on the Open Web, submitted to the Journal of Web
Semanics.
• Daniel Garijo, Pinar Alper, Khalid Belhajjame, Óscar Corcho, Yolanda Gil, Carole A. Goble:
Common motifs in scientific workflows: An empirical analysis. eScience 2012.
• Zhao J, Gómez-Pérez JM, Belhajjame K, Klyne G, et al. Why workflows break - Understanding
and combating decay in Taverna workflows. IEEE eScience 2012.
29
Acknowledgement
Acknowledgement
EU Wf4Ever project (270129)
funded under EU FP7 (ICT- 2009.4.1).
(http://www.wf4ever-project.org)

More Related Content

Similar to Credible workshop

Metadata for Research Objects
Metadata for Research ObjectsMetadata for Research Objects
Metadata for Research Objectsseanb
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods dgarijo
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...Khalid Belhajjame
 
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Idafen Santana Pérez
 
Research Objects Tutorial (TPDL)
Research Objects Tutorial (TPDL)Research Objects Tutorial (TPDL)
Research Objects Tutorial (TPDL)dgarijo
 
Research Objects in Scientific Publications
Research Objects in Scientific PublicationsResearch Objects in Scientific Publications
Research Objects in Scientific Publicationsdgarijo
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Sciencedgarijo
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyRichard Zijdeman
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Nolan Nichols
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityOscar Corcho
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...Aravind Sesagiri Raamkumar
 
SC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCSC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCJohn Cobb
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015dgarijo
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research ObjectsCarole Goble
 
RO Advisory Kickoff Slides
RO Advisory Kickoff SlidesRO Advisory Kickoff Slides
RO Advisory Kickoff Slidesseanb
 

Similar to Credible workshop (20)

Metadata for Research Objects
Metadata for Research ObjectsMetadata for Research Objects
Metadata for Research Objects
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
 
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
 
Research Objects Tutorial (TPDL)
Research Objects Tutorial (TPDL)Research Objects Tutorial (TPDL)
Research Objects Tutorial (TPDL)
 
Research Objects in Scientific Publications
Research Objects in Scientific PublicationsResearch Objects in Scientific Publications
Research Objects in Scientific Publications
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
 
Part 1 Research workshop
Part 1 Research workshopPart 1 Research workshop
Part 1 Research workshop
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibility
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...
 
Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
SC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCSC13 BoF: RDA and HPC
SC13 BoF: RDA and HPC
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research Objects
 
RO Advisory Kickoff Slides
RO Advisory Kickoff SlidesRO Advisory Kickoff Slides
RO Advisory Kickoff Slides
 

More from Khalid Belhajjame

Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsKhalid Belhajjame
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScienceKhalid Belhajjame
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsKhalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsKhalid Belhajjame
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Khalid Belhajjame
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsKhalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in SepublicaKhalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenanceKhalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Khalid Belhajjame
 

More from Khalid Belhajjame (19)

Provenance witha purpose
Provenance witha purposeProvenance witha purpose
Provenance witha purpose
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
 
Irpb workshop
Irpb workshopIrpb workshop
Irpb workshop
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018Aussois bda-mdd-2018
Aussois bda-mdd-2018
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objects
 
Anr cair meeting feb 2016
Anr cair meeting feb 2016Anr cair meeting feb 2016
Anr cair meeting feb 2016
 
Ikc 2015
Ikc 2015Ikc 2015
Ikc 2015
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
 
Tapp 2014 (belhajjame)
Tapp 2014 (belhajjame)Tapp 2014 (belhajjame)
Tapp 2014 (belhajjame)
 
Edbt2014 talk
Edbt2014 talkEdbt2014 talk
Edbt2014 talk
 
Why Workflows Break
Why Workflows BreakWhy Workflows Break
Why Workflows Break
 
D-prov use-case
D-prov use-caseD-prov use-case
D-prov use-case
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow Results
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in Sepublica
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenance
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
 
Edbt 2010, Belhajjame
Edbt 2010, BelhajjameEdbt 2010, Belhajjame
Edbt 2010, Belhajjame
 

Recently uploaded

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Recently uploaded (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Credible workshop

  • 1. Research Objects: Preserving Scientific Workflows and Provenance Khalid Belhajjame Université Paris Dauphine 1
  • 2. Storyline • Why Research Objects? • Overview of the Research Objects • Portfolio of Research Object Management Tools • Provenance Distillation Through Workflow Summarization 2
  • 3. Electronic papers are not enough 3 Electronic paper
  • 4. Electronic papers are not enough 4 Research Object Datasets Results Scientists Hypothesis Experiments Annotations Provenance Electronic paper
  • 5. Benefits Of Research Objects • A research object aggregates all elements that are necessary to understand research investigations. • Methods (experiments) are viewed as first class citizens • Promote reuse • Enable the verification of reproducibility of the results 5
  • 6. Reproducibility a principle of the scientific method normal people and scientist 6 http://xkcd.com/242/
  • 7. 47 of 53 “landmark” publications could not be replicated Inadequate cell lines and animal models Nature, 483, 2012 Credit to Carole Goble JCDL 2012 Keynote 7
  • 8. Fig. 2. Overview of the Research Object Model with the different ontologies that encode it: Core RO for describing the basic Research Object structure, roevo for tracking Research Object evolution and wfprov and wfdesc for describing workflows and Overview of the Research Object Model 8
  • 9. Research Object as an ORE Aggregation 9 Fig. 3. Research Object as an ORE aggregation. to annotation; and ao: Body, which comprises a description of the target. Research Objects use annotations as a means for decorating a resource (or a set of resources) with metadata information. The body is specified in the @base <ht t p : / / exampl e. com/ r o/ 389/ > . @pr ef i x dct : <ht t p: / / pur l . or g/ dc / t er ms / > @pr ef i x or e: <ht t p : / / www. openar chi ves . or g/ or e/ t er ms
  • 10. Scientific Workflows • Data driven analysis pipelines • Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving • Tools for automating frequently performed data intensive activities • Provenance for the resulting datasets – The method followed – The resources used – The datasets used 10
  • 11. PROV Primer, Gil et al WF Execution Trace Retrospective Provenance: Actual data used, actual invocations, timestamps and data derivation trace WF Description Prospective Provenance: Intended method for analysis 11
  • 12. Specifying Workflows using WfDESC 12 Fig. 5. T he wfdesc ontology and its relation to PROV-O. 21
  • 13. Specifying Workflow Provenance using WfPROV 13 Fig. 7. T he wfprov ontology and its relationship to PROV-O.
  • 14. Portfolio of Research Object Tools 14
  • 16. Workflows can get complex! • Overwhelming for users who are not the developers • Abstractions required for reporting • Lineage queries result in very long trails 16
  • 18. PART-1: Scientific Workflow Motifs • Domain Independent categorization – Data-Oriented Nature – Resource/Implementation- Oriented Nature • Captured In a lightweight OWL Ontology 18 http://purl.org/net/wf-motifs
  • 19. Motif annotations over operations motifs(color_pathway_by_objects) = {m1:DataRetrieval} motifs(Get_Image_From_URL_2) = {m2:DataMoving} 19 DataRetrieval DataMovingl
  • 20. PART-2: Workflow reduction primitives • Collapse (Up/Down) • Compose • Eliminate 20
  • 22. Two sample strategies • By-Elimination – Minimal annotation effort – Single rule • By Collapse – More specific annotation – Multiple rules 22
  • 25. Analysis Data Set • 30 Workflows from the Taverna system • Entire dataset & queries accessible from http://www.myexperiment.org/packs/467.html • Manual Annotation using Motif Vocabulary 25
  • 26. Mechanistic Effect of Summarization 26
  • 27. User Summaries vs. Summary Graphs 27
  • 28. Highlights • Research Object model and associated management tools • Annotations of Workflow Using Motifs • Methods for Summarizing Workflow and distilling their provenance traces • Algorithms for Repairing Workflows • Validation of the workflow summarization • Querying of Workflow Execution Provenance using summaries. 28 Ongoing Work
  • 29. References • P Alper, K Belhajjame, C Goble, P Karagoz. Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE International Congress on Big Data, 2013 • Pinar Alper, Khalid Belhajjame, Carole A. Goble, Pinar Karagoz: Enhancing and abstracting scientific workflow provenance for data publishing. EDBT/ICDT Workshops 2013. • Belhajjame K, Corcho O, Garijo D, et al. Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012), Heraklion, Greece, May 2012 • Belhajjame K, Zhao J, Garijo D, et al. The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web, submitted to the Journal of Web Semanics. • Daniel Garijo, Pinar Alper, Khalid Belhajjame, Óscar Corcho, Yolanda Gil, Carole A. Goble: Common motifs in scientific workflows: An empirical analysis. eScience 2012. • Zhao J, Gómez-Pérez JM, Belhajjame K, Klyne G, et al. Why workflows break - Understanding and combating decay in Taverna workflows. IEEE eScience 2012. 29
  • 31. Acknowledgement EU Wf4Ever project (270129) funded under EU FP7 (ICT- 2009.4.1). (http://www.wf4ever-project.org)

Editor's Notes

  1. Research is increasingly digital. Most of research results are disseminated in the form of electronic papers through traditional communication channels, such as conferences, journals, or using new mediums such as microblogging. While electronic papers have played and continue to play a primordial role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to ran the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist, are all elements that may be needed to understand, assess the claim, or be able to re-use the results of previous research investigations.In the context of the wf4ever project, we are investigating a new abstraction that we name research objects and that aggregate all these elements. Research object provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of research objects that we studying is that they contain workflows that specify and implements data intensive scientific experiments.
  2. Research is increasingly digital. Most of research results are disseminated in the form of electronic papers through traditional communication channels, such as conferences, journals, or using new mediums such as microblogging. While electronic papers have played and continue to play a primordial role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to ran the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist, are all elements that may be needed to understand, assess the claim, or be able to re-use the results of previous research investigations.In the context of the wf4ever project, we are investigating a new abstraction that we name research objects and that aggregate all these elements. Research object provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of research objects that we studying is that they contain workflows that specify and implements data intensive scientific experiments.
  3. In this age of data-intensive science we’re witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links.(Figure 1 illustrates a workflow from genomics, which “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”).As well as being automation pipelines, workflows are of paramount importance for the provenance of data generated from their execution [6]. Provenance refers to data’s deriva- tion history starting from the original sources, namely its lineage.
  4. By explicitly outlining the analysis process, work- flow descriptions make-up a prospective part of provenance. The instrumented execution of the workflow generates a veracious and elaborate data lineage trail outlining the derivation relations among resultant, intermediary and initial data artifacts, which makes up the retrospective part of provenance.
  5. http://alpha.myexperiment.org/packs/405.html0.41 Exemple of a research objects as a pack in myExpeirment of a Gene Wide Association Studies: the workflow is brokedn2.33 The workflow was repared by the user, and uploaded as part of the new packOne the user upload the workfow, this will be transfomedintoa format that is compatible with wfdesc. The idea here is that regardless of the workflow language, the workflow will be transformed into a simple workflow language, which in this case is wfdesc.7.24 You can also downloads worflow runs, and in this page you will have a summary of their characetrization.9.18 You can browse the workflow in the RO portal, which is the back end library that store information about the Research objects. And here you can find infromation about the evolution of the research object, in particular, which pack was used to create which pack.
  6. Workflows are typically complex artifacts containing sev- eral processing steps and dataflow links. Complexity is characterized by the environment in which workflows op- erate.Obfuscation degrades reporta- bility of the scientific intent embedded in the workflow. Complex workflows combined with the faithful collection of execution provenance for each step results in even more complex provenance traces containing long data trails, that overwhelm users who explore or query them [3], [1].
  7. Scientific Workflow MotifsAt the heart of our approach lies the exploitation of information on the data processing nature of activities in workflows. We propose that this information is provided through semantic annotations, which we call motifs. We use a light-weight ontology, which we developed in previous work [8] based on an analysis of 200+ workflows (Ac- cessible: http://purl.org/net/wf-motifs). The motif ontology characterizes operations wrt their Data-Oriented functional nature and the Resource-Oriented implementation nature. We refer the reader to [8] for the details, here we briefly introduce some of the motifs in light of our running example. • Data-Oriented nature. Certain activities in workflows, such as DataRetrieval, Analysis or Visualizations, perform the scientific heavy lifting in a workflow. The gene annotation pipeline in Figure 1) collects data from various resources through several retrieval steps ( see “get enzymes by gene” , “get genes by enzyme” opera- tions). The data retrieval steps are pipelined to each other through use of adapter steps, which are categorized with the DataP reparation motif. Augmentation is a sub-category of preparation operations. Augmentation decorates data with resource/protocol specific padding or formatting. (The sub- workflow “Workflow89” in our example is an augmenter that builds up a well formed query request out of given parameters.) Extraction activities perform the inverse of augmentation, they extract data from raw results returned from analyses or retrievals (e.g. SOAP XML messages). Splitting and Merging are also a sub-category of data preparation, they consume and produce the same data but change its cardinality (see the “Flatten List” steps in our example). Another general category is DataOrganization activities, which perform querying and organization func- tionsoverdatasuchas,F iltering,J oiningandGrouping. The “Remove String Duplicates” method in our example is an instance of the filtering motif. Another frequent motif is DataMoving. In order to make data accessible to external analysis tools data may need to be moved in and out of the workflow execution environment such activities fall into this category (e.g. upload to web, write to file etc). The “Get Image From URL 2” operation in our genomics workflow is an instance of this motif.• Resource-Oriented nature, which outlines the implemen- tation aspects and reflects the characteristic of resources that underpin the operation. Classifications in this category are: Human − Interactions vs entirely Computational steps, Statefull vs Stateless Invocations and using Internal (e.g. local scripts, sub-workflows) vs External (e.g. web service, or external command line tools) computational artifacts. In our example, all data retrieval operations are realized by external, stateless resources (web services), whereas all data preparation operations are realized with
  8. A data-flow decorated with motif annota- tions is 3-tuple: W = ⟨N,E,motifs⟩, where motifs is a function that maps each graph node n ∈ N, be it an operation or port, to its motifs. Operation nodes could have motif annotations from operation, resource perspectives, and port nodes can have annotations from the data perspective (omitted in this paper due to space limitations).As an example, we illustrate below two motif annotations of two operations in our running example. It is worth mentioning that a given node can be associated with multiple motifs.motifs(color pathway by objects) = {⟨m1 : Augmenter⟩} motifs(Get Image From URL 2) = {⟨m2 : DataRetrieval⟩}
  9. 2) Elimination: We define the function eliminate op : ⟨W,op⟩ -&gt; ⟨W′,M⟩ as one, which takes an annotated workflow definition, and eliminates a designated operation node from it. Elimination results in a set of indirect data links connecting the predecessors of the deleted operation to its successors. The source and target ports of these new indirect links are the cartesian product of the to-be-deleted operation’s inlinks’ sources and its outlinks’ targets. As with other primitives elimination also returns a set of mappings between the ports of the resulting workflow and the input workflow. Motif Behavior: The designated operation is eliminatedtogether with its motifs and these are not propagated to any other node in the workflow graph.Constraint Checks: There are no particular constraint checks for the elimination primitive.Dataflow Characteristic: Within the conventional defini- tion of a workflow, ports are connected with direct datalinks, which correspond to data transfers i.e. the data artifact!-# appearing at the source port is transferred to the target port as-is. In the indirect link case, however, there is no direct data transfer as the source and target artifacts are dissimilar,%&amp;# and some “eliminated” data processing occurs inbetween.
  10. Summary-By-Elimination: Requires minimal amount of annotation effort and operates with a single rule. We have only classified the scientifically significant operations to denote their motifs, and all remaining operations in the workflow are designated to be DataPreparation steps. The following reduction rule is then used to summarize the workflow graph W. Summary-By-Collapse: Requires annotation of work- flow with motif classes having more specificity and a more fine-grained set of rules. Consequently annotations on data preparation steps are more specific designating what kind of preparation activity is occurring (e.g. Merging, Splitting, Augmentation etc). We use these annotations to remove those data preparation steps from the work- flow by the collapse primitive and collapsing in the di- rection so as to retain the ports that carry more reporting friendly artifacts. We collapse Augmentation, Splitting andRef erenceGenerationoperationsdownstreamastheir outputs are less favored than their inputs, whereas we col- lapse Extraction, Merging, ReferenceResolution and all DataOrganization steps downstream as their outputs are preferred over inputs in a summary. Two rules in this approach are given below:
  11. In our evaluation we annotated the workflow corpusmanually using the motif ontology. The effort that goes into the design of workflows can be significant (up to several days for a workflow). When compared to design, the cost of annotation is minor (as it amounts to single attribute setup per operation). Annotation can be (semi)automated through application of mining techniques to workflows and execution traces. It is also possible to pool annotations in service registries or module libraries and automatically propagate annotations to operations that rare re-used from the registry. Instead of the motif ontology it is also possible to use other ontologies [10] to characterize operations in workflows.
  12. Motifs are used in conjunction with reduction primitives to build summarization rules. The objective of rules is to determine how to reduce the workflow when certain motifs or combinations of motifs are encountered. We propose graph transformation rules of the form L 􏰀 P, where L specifies a pattern to be matched in the host graph specified in terms of a workflow-specific graph model and motif decorations and P specifies a workflow reduction primitive to be enacted upon the node(s) identified by the left-hand- side L. Primitive P captures both the replacement graph pattern and how the replacement is to be embedded into the host graph. Our approach is a more controlled primitive based re-writing mechanism, rather than a fully declarative one (as in traditional graph re-writing). This allows us to: 1) Preserve data causality relations among operations together with the constraints that such relations have (e.g. acyclicity for Scientific Datalows) 2) We can allow the user to specify high-level reduction policies by creating simple &lt;motif, primitive&gt; pairs. It would be a big undertaking to expect the users, who are not computer scientists or software engineers, to specify re-writes in a fully declarative manner (i.e. a matching pattern, a replacement pattern and embedding info)