Scientific data management from the lab to the web

Jose Manuel Gómez-Pérez, R&D Director at Expert System
www.wf4ever-project.org

Scientific Data Management -
  From the Lab to the Web
      José Manuel Gómez Pérez, iSOCO

         Semantic Data Management
             Dagstuhl Seminar
             22-27 April 2012
The data deluge
                                                          Some facts

                             »    In 2010 the size of the digital
                                  universe exceeded 1 zettabyte
                                  (= 1 trillion GB)
                             »    1.8 ZB in 2011
                             »    35 ZB expected in 2020

                             »    90% unstructured data
                             »    70% user-generated
                             »    75% resulting from data copying,
                                  merging, and transforming

                             »    Metadata is the fastest growing
                                  data category
                             »    Much of such data is dynamic,
                                  real-time, volatile

Source: IDC, "The 2011 Digital Universe Study: Extracting Value from Chaos"
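
The figures on this slide imply a steep compound growth rate. As a rough illustration only, the sketch below checks the unit conversion and works out the annual rate implied by the 2011 figure and the 2020 projection. It assumes decimal (SI) units and smooth exponential growth between the two data points, neither of which the slide states; the function names are made up for this example.

```python
# Back-of-the-envelope arithmetic on the slide's figures.
# Assumption: decimal (SI) units, i.e. 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes.
ZETTABYTE = 10**21
GIGABYTE = 10**9


def gigabytes_per_zettabyte() -> int:
    """Number of gigabytes in one zettabyte under SI units."""
    return ZETTABYTE // GIGABYTE


def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Compound annual growth rate implied by two data points."""
    return (end_zb / start_zb) ** (1 / years) - 1


if __name__ == "__main__":
    print(f"1 ZB = {gigabytes_per_zettabyte():,} GB")
    rate = implied_annual_growth(1.8, 35.0, 2020 - 2011)
    print(f"Implied growth 2011-2020: {rate:.1%} per year")
```

Under these assumptions one zettabyte is 10^12 (one trillion) gigabytes, and the digital universe would grow by roughly 39% per year between 2011 and 2020.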

Dealing with dynamicity
                                         Two main challenges


» Challenge 1: Identifying and
  structuring the relevant portions of
  the data for the task at hand
   › First-class data citizens
» Challenge 2: Managing the lifecycle
  of data entities
   › Preservation
   › Evolution and versioning
   › Decay                         Both technical and
                                 social aspects involved

The Research Lifecycle
                                                Workflows in the Scientific Method


[Figure labels, left to right: Background, Hypothesis, Assumptions, Input data, Method; Experiment; Results (data); Scientific Interpretation; Publication]


   Example: Genome-Wide Association Studies




Workflow-based Science
              What is a Scientific Workflow?


»    A mechanism for coordinating the
     execution of services and linking together
     resources.

»    The combination of data and processes
     into a configurable, structured set of steps
     that implement semi-automated
     computational solutions in scientific
     problem-solving


    Scientific workflows are at the core of
    scientific data management
        › Enable automation
        › Encourage best practices
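
To make the second definition concrete, here is a minimal sketch of a workflow as a configurable, structured set of steps applied to data. It is not the API of any workflow system mentioned in this talk; the `Step` and `Workflow` classes and the three toy steps are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One configurable processing step (hypothetical structure)."""
    name: str
    run: Callable[..., Any]
    params: dict = field(default_factory=dict)


@dataclass
class Workflow:
    """An ordered set of steps; executing it also records a simple trace."""
    steps: list

    def execute(self, data: Any):
        trace = []
        for step in self.steps:
            data = step.run(data, **step.params)
            # Record which step ran and with which configuration.
            trace.append((step.name, dict(step.params)))
        return data, trace


def normalise(values):
    top = max(values)
    return [v / top for v in values]


def keep_above(values, threshold):
    return [v for v in values if v >= threshold]


def mean(values):
    return sum(values) / len(values)


demo = Workflow([
    Step("normalise", normalise),
    Step("filter", keep_above, {"threshold": 0.5}),
    Step("summarise", mean),
])

if __name__ == "__main__":
    result, trace = demo.execute([2.0, 4.0, 8.0])
    print(result)
    print(trace)
```

Changing a step's `params` reconfigures the workflow without touching the step code, and the recorded trace is a crude stand-in for the provenance that real workflow systems capture.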




Challenge 1

 Identifying and structuring
the relevant portions of the
  data for the task at hand

    First-class data citizens
Questions for Scientific Data and Workflows                    Issues
Who are you?                                                   Identity and Description
Where and when were you born?                                  Authenticity
Who were your parents (creators)?                              Uniqueness
For which purpose were you conceived and have been used?       Reuse, Repurpose

What do you have inside?                                       Inspection
                                                               Visualization
                                                               Annotations
How is your content linked?                                    Graphical Representation
May I access all your parts?                                   Access Rights
Which parts can I replace?                                     Adaptability
What have they done to you?                                    Provenance
Who and when?                                                  Versioning
Why did they do that?

Why have you been recommended to me?                           Information Quality
Can I believe what you are saying or trust your results?

Do you still produce the same results?                         Reproducibility
Are you still working?                                         Completeness
How could I repair you?                                        Stability

How could I thank you?                                         Credit
How could I talk about you?
Challenge 1: Identifying and structuring the relevant data
                                         Research Objects as Technical Objects

Carriers of Research Context
» Referentiable
» Aggregation, Dispersed
    › Heterogeneous
    › Local and External
» Annotated metadata
    › Provenance
    › Structured: Manifests,
      Recipes, Permissions,
      Discourse
» Lifecycle
    › Publishing, Evolution
    › Versioning
» Mixed Stewardship
    › Graceful Degradation
» Sharing
    » Security & Privacy
» Stereotypical User Profiles
» Services

[Figure labels: Distributed, Third Party Tenancy, Alien Store; Technical Objects, Social Objects; OAI-ORE]
Research Objects as Social Objects




                    Package,
                    Explore, Inspect,
                    Review,
                    Exchange,
                    Share, Reuse,
                    Publish, Credit




Research Object model core (simplified)

    Namespace: http://purl.org/wf4ever/ro#
    RO specification: http://wf4ever.github.com/ro

[Figure: simplified view of the RO core. Classes shown: ro:ResearchObject, ro:Resource, ro:Manifest, ro:AggregatedAnnotation, wfdesc:Workflow. Properties shown: ore:aggregates, ore:isDescribedBy, ro:annotatesAggregatedResource.]

›    ro (aggregation and annotation)
›    wfdesc (workflow description)
›    Minim* (minimum info model)
›    wfprov (workflow provenance)
›    roprov (RO provenance)
›    roevo (evolution model)

*Minim based on M. Gamble's MIM
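
The classes and properties named on this slide can be read as a small graph. The sketch below spells that graph out as plain subject-predicate-object tuples so that it runs without an RDF library. The direction of each edge follows my reading of the slide and of the RO specification it cites; the example base URI, the resource names and both helper functions are hypothetical.

```python
# Namespaces for the vocabularies named on the slide.
RO = "http://purl.org/wf4ever/ro#"
ORE = "http://www.openarchives.org/ore/terms/"
WFDESC = "http://purl.org/wf4ever/wfdesc#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_research_object(base: str, workflow: str, annotation: str) -> set:
    """Return triples for one RO that aggregates a workflow and an annotation.

    All URIs derived from `base` are illustrative, not prescribed by the model.
    """
    ro = base
    manifest = base + "manifest"
    wf = base + workflow
    ann = base + annotation
    return {
        (ro, RDF_TYPE, RO + "ResearchObject"),
        (ro, ORE + "isDescribedBy", manifest),
        (manifest, RDF_TYPE, RO + "Manifest"),
        # The aggregated workflow is both a generic resource and a workflow.
        (ro, ORE + "aggregates", wf),
        (wf, RDF_TYPE, RO + "Resource"),
        (wf, RDF_TYPE, WFDESC + "Workflow"),
        # The annotation is itself aggregated and points at the workflow.
        (ro, ORE + "aggregates", ann),
        (ann, RDF_TYPE, RO + "AggregatedAnnotation"),
        (ann, RO + "annotatesAggregatedResource", wf),
    }


def aggregated_by(triples: set, ro: str) -> set:
    """Everything the given research object aggregates."""
    return {o for s, p, o in triples if s == ro and p == ORE + "aggregates"}


if __name__ == "__main__":
    base = "http://example.org/ro/demo/"  # hypothetical RO URI
    graph = build_research_object(base, "workflow.t2flow", "annotations/1")
    for member in sorted(aggregated_by(graph, base)):
        print(member)
```

In practice these statements would live in the manifest as RDF; the tuple form is used here only to keep the example self-contained.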
Challenge 2

Managing the lifecycle of
     data entities

   Evolution and Decay
Challenge 2: Managing the lifecycle of data entities
                 RO Evolution & Versioning




Challenge 2: Managing the lifecycle of data entities
                                                                       RO Decay



Workflow Decay
•   Component level: flux/decay/unavailability
•   Data level
•   Infrastructure level

Experiment Decay
•   Methodological changes
•   New technologies
•   New resources/components
•   New data




Preservation, Conservation, Recreating


Preserving
Archived Record
Fixed Snapshots
Review
Rerun & Replay

Conserving
Active Instrument
Live
Rerun & Reuse
Repair & Restore

Recreating
Archived Record
Active Instrument
Live
Rebuild Recycle Repurpose

Challenge 2: Managing the lifecycle of data entities
    Possible types of decay (an example)




Decay Analysis
                    A Taxonomy of RO decay



1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
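
One way to put the taxonomy above to work is to encode the fourteen decay types as an enumeration and map observed symptoms onto them. The sketch below does this for the service-related types only. The `ServiceProbe` fields and the mapping rules are assumptions made for illustration; the slides do not say how decay is actually detected.

```python
from dataclasses import dataclass
from enum import Enum


class Decay(Enum):
    """The 14 decay types listed on the slide, numbered as in the taxonomy."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABLE = 6
    DATA_FORMAT_CHANGED = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


@dataclass
class ServiceProbe:
    """Hypothetical result of probing one service used by a workflow."""
    descriptor_found: bool
    reachable: bool
    output_matches_reference: bool


def diagnose_service(probe: ServiceProbe) -> list:
    """Map a probe result to decay types (illustrative rules only)."""
    if not probe.descriptor_found:
        return [Decay.SERVICE_DESCRIPTOR_DISAPPEARED]
    if not probe.reachable:
        return [Decay.SERVICE_NOT_CONTACTABLE]
    if not probe.output_matches_reference:
        return [Decay.SERVICE_FUNCTIONALITY_CHANGED]
    return []


if __name__ == "__main__":
    probe = ServiceProbe(descriptor_found=True, reachable=False,
                         output_matches_reference=False)
    print(diagnose_service(probe))
```

The checks are ordered so that the earliest failure is reported: an unreachable service cannot be tested for changed functionality, so only one type is returned per probe.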

A taxonomy of workflow decay
      Sample decay type




Decay Analysis
                                    1.0 Certificate – Evaluation of Stability and Completeness

1.0 Certificate of quality

Stability
      Is the RO free from any form of decay preventing workflow execution?
      »    Focus on reproducibility
      »    Assisted detection of RO decay
      »    Active monitoring on decay forms
      »    RO and workflow provenance
      »    Notification
      »    Explanation

Completeness
      Is the minimal aggregation of resources encapsulated by the RO consistent?
      »    RO checklists
      »    Produced by scientists
      »    Automatically checked against minimal model (minim)
      »    RO evolution

1.0 Certificate notion originally proposed by Yde de Jong
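
The two questions behind the certificate can be sketched as two checks: completeness compares what the RO aggregates against a checklist written by scientists, and stability asks whether any decay that prevents execution has been detected. The code below is a simplified illustration, not the minim model itself; `Requirement`, `evaluate_certificate` and the resource kinds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One checklist item: the RO should aggregate a resource of this kind."""
    description: str
    resource_kind: str
    mandatory: bool = True


def evaluate_certificate(aggregated_kinds: set, checklist: list,
                         decay_findings: list) -> dict:
    """Evaluate stability and completeness and explain any failure."""
    missing = [r for r in checklist
               if r.mandatory and r.resource_kind not in aggregated_kinds]
    explanation = [f"missing: {r.description}" for r in missing]
    explanation += [f"decay: {finding}" for finding in decay_findings]
    return {
        "stable": not decay_findings,   # no decay that blocks execution
        "complete": not missing,        # every mandatory item is present
        "explanation": explanation,
    }


if __name__ == "__main__":
    checklist = [
        Requirement("workflow definition", "workflow"),
        Requirement("input data", "input_data"),
        Requirement("example run", "workflow_run", mandatory=False),
    ]
    print(evaluate_certificate({"workflow"}, checklist,
                               ["service up but not contactable"]))
```

Returning the explanation alongside the two verdicts mirrors the "Notification" and "Explanation" bullets: a failed certificate should say what is missing or broken, not only that it failed.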
Recap
                                      Lessons learnt


» Data with a Purpose

» Encapsulate & Conquer
   › Goal-driven (purpose)
   › Aggregation
   › Community-managed

» Nothing is immutable, especially data.
   › Foster evolution
   › Monitor decay

[Side labels on the slide: Scalability, Provenance]
Thanks for your Attention!
                                               Questions




 Any Questions?

http://www.wf4ever-project.org/





Editor's Notes

  1. In this scenario, a student, Dennis, has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet-laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected. The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter the genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, in which each gene is coupled to the pathways it was implicated in by other experiments (there is a database for that: KEGG).

     Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment by keyword. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.

     Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It will lead to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to his experiment and the processed data. Dennis notes that in a next cycle he will want to run another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part of his workflow with the global test workflow. In his log he records this fact. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.

     Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication has cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.

     Dennis's Working Research Object will contain:
     •   A reference to the source of the data and the people to acknowledge for it
     •   The initial hypothesis
     •   The conceptual workflow or a summary of the experiment plan
     •   References to workflows that were tested, with comments on their application to Dennis's case
     •   A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
     •   Dennis's workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
     •   Dennis's workflow run, results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
     •   The final hypothesis, with comments
     •   A reference to the results of the workflow
     •   A design log that records Dennis's considerations while making the workflow
     •   A run log that records Dennis's considerations while running and interpreting the workflow

     His Publication Research Object will contain:
     •   The workflow
     •   A caption for his workflow (filtered from his design and run logs: all the information a reviewer needs to run the experiment)
     •   A workflow run (results, and a caption filtered from the run log)
     •   His initial hypothesis
     •   His final hypothesis
     •   The data source
     •   Acknowledgements

     In time, Dennis's workflow can be found on the basis of the metadata of his Published and Working ROs. This will create a rich and wide range of search capabilities for Dennis's successors. The Working RO is kept at Dennis's local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
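
     The relationship between the two research objects in this scenario can be sketched as a filtering step: the Publication RO keeps the final workflow, one run, both hypotheses, the data source and acknowledgements, and condenses the logs into captions. The sketch below is one possible reading of the note. The `WorkingRO` fields, the per-entry "for reviewers" flag and the `publish` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingRO:
    """Simplified stand-in for the Working Research Object in the scenario."""
    data_source: str
    acknowledgements: list
    initial_hypothesis: str
    final_hypothesis: str
    workflow: str
    workflow_run: str
    # Log entries are (text, for_reviewers) pairs; the flag is an assumption.
    design_log: list = field(default_factory=list)
    run_log: list = field(default_factory=list)
    tested_workflows: list = field(default_factory=list)
    previous_versions: list = field(default_factory=list)


def _caption(entries: list) -> list:
    """Keep only the log entries marked as relevant for reviewers."""
    return [text for text, for_reviewers in entries if for_reviewers]


def publish(working: WorkingRO) -> dict:
    """Derive the contents of a Publication RO from a Working RO."""
    return {
        "workflow": working.workflow,
        "workflow_caption": _caption(working.design_log + working.run_log),
        "workflow_run": working.workflow_run,
        "run_caption": _caption(working.run_log),
        "initial_hypothesis": working.initial_hypothesis,
        "final_hypothesis": working.final_hypothesis,
        "data_source": working.data_source,
        "acknowledgements": list(working.acknowledgements),
    }
```

     Tested-but-rejected workflows and earlier versions stay in the Working RO only, which matches the note's point that the Working RO is the richer resource for reuse.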
  2. Emphasise the use of Linked Data. Note: the figures here are not intended to be readable; they simply emphasise the existence of the models. Example user requirements addressed by RO:
     •   UR1.3: aggregate existing resources to conveniently access related resources from a single place
     •   UR1.6: describe the relationships between aggregated resources so that other researchers can see how the resources fit together
     •   UR1.16: annotate experimental results using semantic models so that I can find/show links to other, relevant research objects