www.wf4ever-project.org




Scientific Data Management -
  From the Lab to the Web
      José Manuel Gómez Pérez, iSOCO

         Semantic Data Management
             Dagstuhl Seminar
             22-27 April 2012
The data deluge
                                                          Some facts

                             »    In 2010 the size of the digital
                                  universe exceeded 1 zettabyte
                                  (= 1 trillion GB)
                             »    1.8 ZB in 2011
                             »    35 ZB expected in 2020

                             »    90% unstructured data
                             »    70% user-generated
                             »    75% resulting from data copying,
                                  merging, and transforming

                             »    Metadata is the fastest growing
                                  data category
                             »    Much of such data is dynamic,
                                  real-time, volatile

Source: IDC, The 2011 Digital Universe Study: Extracting Value from Chaos
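
The figures above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units and a constant yearly growth rate between the 2011 and 2020 figures; the function name is illustrative and not taken from the IDC study.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21


def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


# How many gigabytes fit in one zettabyte.
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# Growth implied by 1.8 ZB (2011) and 35 ZB (expected in 2020).
implied_growth = compound_annual_growth(1.8, 35.0, 2020 - 2011)

print(f"1 ZB = {gigabytes_per_zettabyte:,} GB")
print(f"Implied growth 2011-2020: {implied_growth:.1%} per year")
```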

Dealing with dynamicity
                                         Two main challenges


» Challenge 1: Identifying and
  structuring the relevant portions of
  the data for the task at hand
   › First-class data citizens
» Challenge 2: Managing the lifecycle
  of data entities
   › Preservation
   › Evolution and versioning
   › Decay                         Both technical and
                                 social aspects involved

The Research Lifecycle
                                                Workflows in the Scientific Method


[Figure: the research lifecycle. Background, hypothesis, assumptions, input data, and method feed the Experiment, which produces Results (data), followed by Scientific Interpretation and Publication.]


   Example: Genome-Wide Association Studies




Workflow-based Science
              What is a Scientific Workflow?


»    A mechanism for coordinating the
     execution of services and linking together
     resources.

»    The combination of data and processes
     into a configurable, structured set of steps
     that implement semi-automated
     computational solutions in scientific
     problem-solving


    Scientific workflows are at the core of
    scientific data management
        › Enable automation
        › Encourage best practices
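
As a rough illustration of the second definition above, a workflow can be modelled as an ordered set of steps, each reading named data items and writing one. This is only a minimal sketch: the `Step` and `Workflow` classes and the two example steps are hypothetical and do not correspond to any particular workflow system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    name: str
    run: Callable[..., Any]   # the process applied in this step
    inputs: List[str]         # names of the data items the step reads
    output: str               # name of the data item the step writes


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)

    def execute(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Run the steps in order, keeping every intermediate data item."""
        results = dict(data)
        for step in self.steps:
            args = [results[name] for name in step.inputs]
            results[step.output] = step.run(*args)
        return results


# Hypothetical two-step workflow: scale the raw values, then average them.
workflow = Workflow(steps=[
    Step("normalise", lambda xs: [x / max(xs) for x in xs], ["raw"], "normalised"),
    Step("mean", lambda xs: sum(xs) / len(xs), ["normalised"], "mean"),
])
outputs = workflow.execute({"raw": [2.0, 4.0, 8.0]})
print(outputs)
```

Keeping intermediate results in the returned dictionary is a deliberate choice here: it leaves a simple trace of what each step produced.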




Challenge 1

 Identifying and structuring
the relevant portions of the
  data for the task at hand

    First-class data citizens
Questions for Scientific Data and Workflows                    Issues
Who are you?                                                   Identity and Description
Where and when were you born?                                  Authenticity
Who were your parents (creators)?                              Uniqueness
For what purpose were you conceived and used?                  Reuse, Repurpose

What do you have inside?                                       Inspection
                                                               Visualization
                                                               Annotations
How is your content linked?                                    Graphical Representation
May I access all your parts?                                   Access Rights
Which parts can I replace?                                     Adaptability
What have they done to you?                                    Provenance
Who and when?                                                  Versioning
Why did they do that?

Why have you been recommended to me?                           Information Quality
Can I believe what you are saying or trust your results?

Do you still produce the same results?                         Reproducibility
Are you still working?                                         Completeness
How could I repair you?                                        Stability

How could I thank you?                                         Credit
How could I talk about you?
Challenge 1: Identifying and structuring the relevant data
                                         Research Objects as Technical Objects

Carriers of Research Context
» Referentiable
» Aggregation, Dispersed
    › Heterogeneous
    › Local and External
» Annotated metadata
    › Provenance
    › Structured: Manifests,
      Recipes, Permissions,
      Discourse
» Lifecycle
    › Publishing, Evolution
    › Versioning
» Mixed Stewardship
    › Graceful Degradation
» Sharing
    › Security & Privacy
» Stereotypical User Profiles
» Services

[Figure labels: Distributed; Third Party Tenancy; Alien Store; Technical Objects; Social Objects; OAI-ORE]
Research Objects as Social Objects




                    Package,
                    Explore, Inspect,
                    Review,
                    Exchange,
                    Share, Reuse,
                    Publish, Credit




http://purl.org/wf4ever/ro#
                                                   Research Object model core (simplified)

    RO specification: http://wf4ever.github.com/ro


[Figure: simplified RO core model. Nodes: ro:ResearchObject, ro:Resource, wfdesc:Workflow, ro:Manifest, ro:AggregatedAnnotation. Edges: ore:aggregates, ore:isDescribedBy, ro:annotatesAggregatedResource.]



Note: This figure shows a simplified view of the RO core.

›    ro (aggregation and annotation)
›    wfdesc (workflow description)
›    Minim* (minimum info model)
›    wfprov (workflow provenance)
›    roprov (RO provenance)
›    roevo (evolution model)

*Minim is based on M. Gamble's MIM
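
The aggregation and annotation relations named on this slide can be sketched as plain data structures that emit subject-predicate-object triples. This is a simplified illustration, not the RO specification: the class layout, the `example.org` URIs, and the method names are assumptions, and only the prefixed terms (ro:, ore:, wfdesc:) come from the slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Resource:
    uri: str
    rdf_type: str = "ro:Resource"


@dataclass(frozen=True)
class AggregatedAnnotation:
    uri: str
    target: Resource  # the resource this annotation is about


@dataclass
class ResearchObject:
    uri: str
    manifest_uri: str
    resources: List[Resource] = field(default_factory=list)
    annotations: List[AggregatedAnnotation] = field(default_factory=list)

    def aggregate(self, resource: Resource) -> Resource:
        self.resources.append(resource)
        return resource

    def annotate(self, annotation_uri: str, target: Resource) -> AggregatedAnnotation:
        # Only resources aggregated by this RO may be annotated in this sketch.
        if target not in self.resources:
            raise ValueError(f"{target.uri} is not aggregated by {self.uri}")
        annotation = AggregatedAnnotation(annotation_uri, target)
        self.annotations.append(annotation)
        return annotation

    def triples(self) -> Iterator[Triple]:
        yield (self.uri, "rdf:type", "ro:ResearchObject")
        yield (self.uri, "ore:isDescribedBy", self.manifest_uri)
        for resource in self.resources:
            yield (self.uri, "ore:aggregates", resource.uri)
            yield (resource.uri, "rdf:type", resource.rdf_type)
        for annotation in self.annotations:
            yield (annotation.uri, "rdf:type", "ro:AggregatedAnnotation")
            yield (annotation.uri, "ro:annotatesAggregatedResource", annotation.target.uri)


# Hypothetical research object holding one annotated workflow.
ro = ResearchObject("http://example.org/ro/1/", "http://example.org/ro/1/manifest.rdf")
workflow = ro.aggregate(Resource("http://example.org/ro/1/workflow.t2flow", "wfdesc:Workflow"))
ro.annotate("http://example.org/ro/1/annotations/1", workflow)
graph = list(ro.triples())
for triple in graph:
    print(triple)
```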
Challenge 2

Managing the lifecycle of
     data entities

   Evolution and Decay
Challenge 2: Managing the lifecycle of data entities
                 RO Evolution & Versioning




Challenge 2: Managing the lifecycle of data entities
                                                                       RO Decay



Workflow Decay
•   Component level
•   flux/decay/unavailability
•   Data level
•   Infrastructure level

Experiment Decay
•   Methodological changes
•   New technologies
•   New resources/components
•   New data




Preservation, Conservation, Recreating


Preserving
Archived Record
Fixed Snapshots
Review
Rerun & Replay

Conserving
Active Instrument
Live
Rerun & Reuse
Repair & Restore

Recreating
Archived Record
Active Instrument
Live
Rebuild Recycle Repurpose

Challenge 2: Managing the lifecycle of data entities
    Possible types of decay (an example)




Decay Analysis
                    A Taxonomy of RO decay



1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
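
One way to make the taxonomy above usable by monitoring tools is to encode it as an enumeration. The sketch below keeps the numbering from the slide; the member names, the `summarise` helper, and the sample observations are illustrative assumptions rather than part of the taxonomy itself.

```python
from collections import Counter
from enum import IntEnum
from typing import List, Tuple


class DecayType(IntEnum):
    """RO decay types, numbered as in the taxonomy above."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABILITY = 6
    DATA_FORMAT_CHANGED = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


def summarise(observations: List[Tuple[str, DecayType]]) -> Counter:
    """Count how often each decay type was observed across resources."""
    return Counter(decay for _, decay in observations)


# Hypothetical observations: (resource name, observed decay type).
observations = [
    ("blast-service", DecayType.SERVICE_NOT_CONTACTABLE),
    ("input.csv", DecayType.DATA_UNAVAILABILITY),
    ("genes.csv", DecayType.DATA_UNAVAILABILITY),
]
summary = summarise(observations)
for decay, count in summary.most_common():
    print(f"{int(decay):2d}. {decay.name}: {count}")
```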

A taxonomy of workflow decay
      Sample decay type




Decay Analysis
                                    1.0 Certificate – Evaluation of Stability and Completeness

1.0 Certificate of quality

Stability
Is the RO free from any form of decay preventing workflow execution?
»    Focus on reproducibility
»    Assisted detection of RO decay
»    Active monitoring of decay forms
»    RO and workflow provenance
»    Notification
»    Explanation

Completeness
Is the minimal aggregation of resources encapsulated by the RO consistent?
»    RO checklists
»    Produced by scientists
»    Automatically checked against minimal model (minim)
»    RO evolution

1.0 Certificate notion originally proposed by Yde de Jong
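
The completeness check can be pictured as comparing the resources an RO aggregates against a checklist of required resource types. The following is a minimal sketch under that assumption; it is not the Minim model, and the `Requirement` class, the `ex:` type names, and the sample file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Requirement:
    description: str
    resource_type: str  # a type that must be aggregated at least once


def check_completeness(aggregated: Dict[str, str],
                       checklist: List[Requirement]) -> List[Requirement]:
    """Return the checklist requirements that the aggregation does not satisfy.

    `aggregated` maps each resource name to its type.
    """
    present: Set[str] = set(aggregated.values())
    return [req for req in checklist if req.resource_type not in present]


# Hypothetical checklist, as a scientist might write it.
checklist = [
    Requirement("A workflow description is present", "wfdesc:Workflow"),
    Requirement("Input data is present", "ex:InputData"),
    Requirement("A hypothesis is stated", "ex:Hypothesis"),
]

# Hypothetical RO contents: resource name -> resource type.
aggregated = {
    "workflow.t2flow": "wfdesc:Workflow",
    "input.csv": "ex:InputData",
}

missing = check_completeness(aggregated, checklist)
for req in missing:
    print(f"Missing: {req.description}")
```

An empty result would mean every checklist item is covered; anything returned can feed the notification and explanation steps listed above.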
Recap
                                      Lessons learnt


» Data with a Purpose
» Encapsulate & Conquer
   › Goal-driven (purpose)
   › Aggregation
   › Community-managed
» Nothing is immutable, especially data.
   › Foster evolution
   › Monitor decay

[Side labels: Scalability; Provenance]

Thanks for your Attention!
                                               Questions




 Any Questions?

http://www.wf4ever-project.org/





Scientific data management from the lab to the web

  • 1. www.wf4ever-project.org Scientific Data Management - From the Lab to the Web José Manuel Gómez Pérez, iSOCO Semantic Data Management Dagstuhl Seminar 22-27 April 2012
  • 2. The data deluge Some facts » In 2010 the size of the digital universe exceeded 1 zettabyte (= 1 trillion GB) » 1.8 ZB in 2011 » 35 ZB expected in 2020 » 90% unstructured data » 70% user-generated » 75% resulting from data copying, merging, and transforming » Metadata is the fastest growing data category » Much of such data is dynamic, real-time, volatile Source: IDC's The 2011 Digital Universe Study – Extracting Value from Chaos
  • 3. Dealing with dynamicity Two main challenges » Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand › First-class data citizens » Challenge 2: Managing the lifecycle of data entities › Preservation › Evolution and versioning › Decay Both technical and social aspects involved
  • 4. The Research Lifecycle: Workflows in the Scientific Method [Figure: lifecycle stages: Background, Hypothesis, Assumptions; Experiment; Results (data); Scientific Interpretation; Publication. Additional labels: (Data), Input data, Method] Example: Genome-Wide Association Studies
  • 5. Workflow-based Science What is a Scientific Workflow? » A mechanism for coordinating the execution of services and linking together resources. » The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving Scientific workflows are at the core of scientific data management › Enable automation › Encourage best practices
  • 6. Challenge 1 Identifying and structuring the relevant portions of the data for the task at hand First-class data citizens
  • 7. Questions for Scientific Data and Workflows Issues Who are you ? Identity and Description Where and when were you born ? Authenticity Who were your parents (creators) ? Uniqueness For which purpose were you conceived and have been used ? Reuse, Repurpose What do you have inside ? Inspection Visualization Annotations How is your content linked ? Graphical Representation May I access all your parts ? Access Rights Which parts can I replace ? Adaptability What have they done to you ? Provenance Who and When ? Versioning Why did they do that ? Why have you been recommended to me ? Information Quality Can I believe what you are saying or trust your results ? Do you still produce the same results ? Reproducibility Are you still working ? Completeness How could I repair you ? Stability How could I thank you ? Credit How could I talk about you ? 7
  • 8. Challenge 1: Identifying and structuring the relevant data Research Objects as Technical Objects Carriers of Research Context Third Party Alien » Referentiable Distributed Tenancy Store » Aggregation, Dispersed › Heterogeneous › Local and External » Annotated metadata › Provenance › Structured: Manifests, Recipes, Permissions, Discourse » Lifecycle › Publishing, Evolution › Versioning » Mixed Stewardship › Graceful Degradation » Sharing » Security & Privacy Technical Objects Social Objects » Stereotypical User Profiles » Services OAI-ORE 8
  • 9. Research Objects as Social Objects Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit
  • 10. http://purl.org/wf4ever/ro# Research Object model core (simplified) RO specification: http://wf4ever.github.com/ro ore:aggregates ro:ResearchObject ro:Resource ore:isDescribedBy ro:Manifest wfdesc:Workflow ro:annotatesAggregatedResource ro:AggregatedAnnotation › ro (aggregation and annotation) Note: This figure shows a simplified view of the RO core. › wfdesc (workflow description) › Minim* (minimum info model) › wfprov (workflow provenance) › roprov (RO provenance) › roevo (evolution model) 10 *Minim based on M. Gamble’s MIM
  • 11. Challenge 2 Managing the lifecycle of data entities Evolution and Decay
  • 12. Challenge 2: Managing the lifecycle of data entities RO Evolution & Versioning
  • 13. Challenge 2: Managing the lifecycle of data entities RO Decay. Workflow Decay: • Component level • flux/decay/unavailability • Data level • Infrastructure level. Experiment Decay: • Methodological changes • New technologies • New resources/components • New data
  • 14. Preservation, Conservation, Recreating. Preserving: Archived Record, Fixed Snapshots, Review, Rerun & Replay. Conserving: Active Instrument, Live, Rerun & Reuse, Repair & Restore. Recreating: Archived Record, Active Instrument, Live, Rebuild, Recycle, Repurpose
  • 15. Challenge 2: Managing the lifecycle of data entities Possible types of decay (an example)
  • 16. Decay Analysis: A Taxonomy of RO decay
       1. Service tool is missing
       2. Service file descriptor disappeared
       3. Service up but not contactable
       4. Service up but functionality changed
       5. Local software dependencies
       6. Data unavailability
       7. Changes in data formats
       8. Chained dependency
       9. Credentials deprecated
       10. Input data superseded by other data
       11. RO metadata outdated (upon versioning)
       12. Old fashioned RO
       13. External references lose credit
       14. Execution framework no longer available
  • 17. A taxonomy of workflow decay Sample decay type 17
  • 18. Decay Analysis: 1.0 Certificate – Evaluation of Stability and Completeness. 1.0 Certificate of quality.
       Stability: Is the RO free from any form of decay preventing workflow execution? » Focus on reproducibility » Assisted detection of RO decay » Active monitoring on decay forms » RO and workflow provenance » RO evolution
       Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent? » RO checklists » Produced by scientists » Automatically checked against minimal model (minim)
       » Notification » Explanation
       1.0 Certificate notion originally proposed by Yde de Jong
  • 19. Recap: Lessons learnt » Data with a Purpose › Goal-driven (purpose) › Community-managed » Encapsulate & Conquer › Aggregation » Nothing is immutable, especially data. › Foster evolution › Monitor decay. Side labels: Scalability, Provenance
  • 20. Thanks for your Attention! Questions Any Questions? http://www.wf4ever-project.org/ 20
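
The stability and completeness questions on slide 18 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `ResearchObject` class, the `REQUIRED_KINDS` checklist and the `decayed` set are hypothetical stand-ins, not the Wf4Ever tooling or the minim model itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical, simplified stand-in for a Research Object (not the Wf4Ever API).
@dataclass
class ResearchObject:
    # Maps each aggregated resource name to its kind, e.g. "workflow".
    resources: Dict[str, str] = field(default_factory=dict)
    # Names of resources that a monitor has flagged as unreachable or changed.
    decayed: Set[str] = field(default_factory=set)

# Illustrative checklist (an assumption): kinds of resource the RO must aggregate.
REQUIRED_KINDS = {"workflow", "input_data", "hypothesis"}

def completeness(ro: ResearchObject) -> Set[str]:
    """Return the required kinds that the RO does not aggregate."""
    return REQUIRED_KINDS - set(ro.resources.values())

def stability(ro: ResearchObject) -> Set[str]:
    """Return the aggregated resources currently flagged as decayed."""
    return ro.decayed & set(ro.resources)

def report(ro: ResearchObject) -> dict:
    """Combine both checks into one small summary."""
    missing = completeness(ro)
    broken = stability(ro)
    return {
        "complete": not missing,
        "stable": not broken,
        "missing": sorted(missing),
        "decayed": sorted(broken),
    }
```

Calling `report` on an RO that aggregates a workflow and an input dataset but no hypothesis would flag it as incomplete. A real checklist would presumably be written by scientists and checked against the minimal model, as the slide describes.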

Editor's Notes

  1. In this scenario, student Dennis has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected.

     The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter those genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, in which each gene is coupled to the pathways it was implicated in by other experiments (there is a database for that: KEGG).

     Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment for keywords. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.

     Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It leads to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to his experiment and the processed data.

     Dennis notes that in a next cycle he will want to perform another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part in his workflow with the global test workflow. In his log he indicates this fact. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.

     Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication has cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.

     Dennis's Working Research Object will contain:
     • A reference to the source of the data and the people to acknowledge for it
     • The initial hypothesis
     • The conceptual workflow or a summary of the experiment plan
     • References to workflows that were tested, with comments on their application for Dennis's case
     • A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
     • Dennis's workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
     • Dennis's workflow run, results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
     • The final hypothesis, with comments
     • A reference to the results of the workflow
     • A Design log that records Dennis's considerations while making the workflow
     • A Run log that records Dennis's considerations while running and interpreting the workflow

     His Publication Research Object will contain:
     • The workflow
     • A caption for his workflow (filtered from his design and run log, all information necessary to run the experiment by a reviewer)
     • A workflow run (results, and a caption filtered from run log)
     • His initial hypothesis
     • His final hypothesis
     • The data source
     • Acknowledgements

     In time, Dennis's workflow can be found on the basis of his Published and Working ROs' metadata. This will create a rich and wide range of search capabilities for Dennis's successors. The Working RO is kept at Dennis's local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
  2. Emphasise the use of Linked Data. Note: the figures here are not intended to be readable. They’re simply emphasising the existence of the models. Example user requirements being addressed by RO:
     • UR1.3: aggregate existing resources to conveniently access related resources from a single place
     • UR1.6: describe the relationships between aggregated resources so that other researchers can see how the resources fit together
     • UR1.16: annotate experimental results using semantic models so that I can find/show links to other, relevant research objects
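
As a rough illustration of the aggregation and annotation requirements above, the following Python sketch builds the simplified RO core from slide 10 as plain (subject, predicate, object) triples. The prefixed terms (`ore:aggregates`, `ore:isDescribedBy`, `ro:ResearchObject`, `ro:Resource`, `ro:Manifest`, `ro:AggregatedAnnotation`, `ro:annotatesAggregatedResource`) come from the slide. The function names, the `example.org` URIs, the manifest and annotation paths, and the use of `rdfs:comment` to carry the annotation text are assumptions made for the example, not part of the RO specification.

```python
def build_ro(ro_uri, resources, notes):
    """Return (subject, predicate, object) triples for a simplified RO.

    resources: list of resource URIs to aggregate.
    notes: list of (target_uri, text) pairs; each becomes one annotation.
    """
    manifest = ro_uri + "manifest"  # hypothetical manifest location
    triples = [
        (ro_uri, "rdf:type", "ro:ResearchObject"),
        (ro_uri, "ore:isDescribedBy", manifest),
        (manifest, "rdf:type", "ro:Manifest"),
    ]
    # Aggregate each resource into the RO (cf. UR1.3).
    for res in resources:
        triples.append((ro_uri, "ore:aggregates", res))
        triples.append((res, "rdf:type", "ro:Resource"))
    # Attach annotations to aggregated resources (cf. UR1.6 and UR1.16).
    for i, (target, text) in enumerate(notes):
        if target not in resources:
            raise ValueError(f"cannot annotate non-aggregated resource: {target}")
        ann = f"{ro_uri}annotations/{i}"  # hypothetical annotation URI scheme
        triples += [
            (ro_uri, "ore:aggregates", ann),
            (ann, "rdf:type", "ro:AggregatedAnnotation"),
            (ann, "ro:annotatesAggregatedResource", target),
            (ann, "rdfs:comment", text),  # simplification: text stored inline
        ]
    return triples


def aggregated(triples, ro_uri):
    """List everything the RO aggregates, in insertion order."""
    return [o for s, p, o in triples if s == ro_uri and p == "ore:aggregates"]
```

Keeping the triples as plain tuples avoids depending on any RDF library. In practice the same statements would more likely be built and serialised with a Linked Data toolkit.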