The role of annotation in reproducibility (Empirical 2014)

Invited presentation at ESWC2014 Empirical workshop


  • However, some published protocols present problems such as insufficient granularity, and their instructions can be imprecise or ambiguous because they are written in natural language. To avoid arbitrary interpretations, we are designing an ontological structure that facilitates the formal representation of experimental protocols.
  • This is the What vs How vs Why.
  • Why test more than one subgraph algorithm? Because we want to compare the fragments obtained from the algorithms.
    There is always a trade-off between the size of the subgraph and its frequency.

Transcript

  • 1. The role of annotation in reproducibility ESWC2014 Empirical workshop 26/05/2014 Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho
  • 2. Setting the context of this presentation Our main assumption “We are not so good at describing our experiments, and this has a negative impact on reproducibility (and understandability, and conservation, and reconstruction)” • Let’s see if this happens in different areas of scientific research • In vitro experiments in Plant Biology • In silico experiments in several domains • The challenge • Let’s use annotation as a means to increase reproducibility • Note: see the last slide on terminology
  • 3. Ingredients for reproducibility
  • 4. Ingredients for reproducibility
  • 5. The role of laboratory protocols in Life Sciences Laboratory Protocols http://mibbi.sourceforge.net/about.shtml Laboratory protocols support the scientific results
  • 6. Laboratory Protocols • Written in natural language • Generally, presented in a “recipe” style • Description of a sequence of operations that include inputs and outputs • Step-by-step descriptions of procedures • A protocol is a type of workflow • They must be described in sufficient and unambiguous detail. • To enable another agent (human or machine) to replicate the original experiment. • Specific journals: Biotechniques, CSH protocols, Current protocols, GMR, Jove, Protocol exchange, Plant methods, Plos One, Springer protocols
  • 7. Detailed instructions on journal’s guides for authors
  • 8. And other useful elements, including ontologies: minimal information models, checklists, and even ontologies. It maintains checklists that promote how to report an experiment. It models the design of an investigation, including protocols, instrumentation, materials and data generated. EXPO: aims to formalize knowledge about the organization, execution and analysis of scientific experiments. EXACT: it provides a model for the description of experiment actions.
  • 9. However… • Ambiguity is the norm • Let’s analyse protocols written for the plant biology community • Incubate the centrifuge tubes in a water bath. • Incubate the samples for 5 min with gentle shaking. • Rinse DNA briefly in 1-2 ml of wash. • Incubate at -20°C overnight.
  • 10. Analysis of Laboratory Protocols
    Repository | Number of Protocols
    Biotechniques | 8
    CSH protocols | 11
    Current protocols | 25
    GMR | 4
    Jove | 21
    Protocol exchange | 12
    Plant methods | 10
    Plos One | 3
    Springer protocols | 5
    Total | 99
  • 11. Minimal Information to Report a Laboratory Protocol
    Our model | Occurrence in other models
    TITLE | 100%
    AUTHOR | 100%
    INTRODUCTION
      Purpose | 89%
      Provenance of the protocol | 89%
      Applications of the protocol | 89%
      Comparison with other protocols | 89%
      Limitations | 89%
    MATERIALS
      Sample | 100%
        · strain or line genotype
        · Developmental stage
        · Organism part (tissue)
      Laboratory consumables/supplies
        · Laboratory consumable name | 22%
        · Manufacturer name | 11%
        · Laboratory consumable ID (catalog number) | 11%
      Buffer recipes
        · Buffer name | 67%
        · Chemical compound name | 67%
        · Initial concentration of chemical compound | 67%
        · Final concentration or amount of chemical compound | 56%
        · Storage conditions | 56%
        · Cautions | 56%
        · Hints | 67%
      Reagent
        · Reagent name | 100%
        · Reagent vendor or manufacturer | 100%
        · Reagent ID (catalog number) | 100%
      Kit
        · Kit name | 100%
        · Kit vendor or manufacturer | 100%
        · Kit ID (catalog number) | 56%
      Primer
        · Primer name | 67%
        · Primer sequence | 89%
        · Primer vendor or manufacturer | 33%
      Equipment
        · Equipment name | 67%
        · Equipment vendor or Manufacturer | 67%
        · Equipment ID (catalog number) | 67%
      Software
        · Software name | 67%
        · Software version | 67%
    METHODS/PROCEDURE
      Protocol | 100%
        · Cautions | 56%
        · Critical steps | 56%
        · Pause point | 33%
        · Hints | 22%
        · Troubleshooting | 44%
  • 12. How to Formalize the Protocols? • Incubate the centrifuge tubes at 65°C in a water bath for 10 min. • Rinse DNA briefly in 1-2 ml of wash. • Incubate at -20°C overnight. “Briefly” can indicate different lengths of time: 2 seconds? 5-10 seconds? Action: incubate. Object: centrifuge tubes, water bath. Unit of measure: 65°C, 10 min.
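The decomposition on this slide (action, object, unit of measure) can be sketched in code. The following is a minimal illustration only, not the SMARTProtocols implementation: the word lists `ACTIONS`, `OBJECTS` and `VAGUE`, the `MEASURE` pattern and the `annotate` function are all hypothetical names invented for this example, and the vocabulary is limited to the instructions quoted in the slides.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical mini-vocabulary, built only from the example instructions.
ACTIONS = {"incubate", "rinse"}
OBJECTS = ["centrifuge tubes", "water bath", "DNA", "samples"]
VAGUE = {"briefly", "overnight", "gentle"}

# A number or range (65, -20, 1-2) followed by a unit.
MEASURE = re.compile(r"(-?\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)\s*(°C|C|min|ml)\b")


@dataclass
class Step:
    action: Optional[str]
    objects: List[str]
    measures: List[Tuple[str, str]]
    vague_terms: List[str]


def annotate(instruction: str) -> Step:
    """Split one natural-language instruction into annotated parts."""
    lowered = instruction.lower()
    words = re.findall(r"[a-z]+", lowered)
    action = next((w for w in words if w in ACTIONS), None)
    objects = [o for o in OBJECTS if o.lower() in lowered]
    measures = MEASURE.findall(instruction)
    vague_terms = [w for w in words if w in VAGUE]
    return Step(action, objects, measures, vague_terms)


precise = annotate("Incubate the centrifuge tubes at 65°C in a water bath for 10 min.")
ambiguous = annotate("Rinse DNA briefly in 1-2 ml of wash.")
```

Here `precise` carries two measurements and no vague terms, while `ambiguous` is flagged with "briefly", which is the kind of wording a formal annotation would need to replace with an explicit duration.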
  • 13. SMARTProtocols ontology • http://vocab.linkeddata.es/SMARTProtocols/
  • 14. Currently working on protocol annotation (Source: Biotechniques)
    Meta-information about content | Content
    Plant material | Arabidopsis thaliana (rosette leaves, flowers, siliques),… and Larix decidua (young needles)
    Instrument name | Leitz DMRB microscope
    Manufacturer | Leica Micro-systems
    Buffer recipe | 50 mM EDTA, 1.4% SDS
    Reagent name | 96% ethanol ~ absolute ethanol
    Laboratory consumable name | 2-mL tube, zeolite beads
  • 15. From the wet lab to our computers: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment
  • 16. Ingredients for reproducibility
  • 17. Scientific Workflows “Template defining the set of tasks needed to carry out a computational experiment” [1] • Inputs • Steps • Intermediate results • Outputs • Data driven, usually represented as Directed Acyclic Graphs (DAGs) [1] Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor, Workflows and e-science: an overview of workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528–540.
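To make the DAG view of a workflow concrete, here is a small sketch using Python's standard `graphlib` module. The step names and the `workflow` dictionary are hypothetical and do not come from Taverna, Wings or any other system mentioned in the slides; the sketch only shows how inputs, steps and outputs fall out of the graph structure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow template: each step maps to the steps it depends on.
workflow = {
    "fetch_data": set(),
    "clean_data": {"fetch_data"},
    "align": {"clean_data"},
    "plot": {"align"},
    "report": {"align", "plot"},
}


def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """A workflow template is only a DAG if no step depends on itself, directly or not."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


def sources_and_sinks(dag):
    """Steps with no dependencies take the workflow inputs;
    steps that nothing depends on produce the final outputs."""
    depended_on = set().union(*dag.values())
    sources = sorted(step for step, deps in dag.items() if not deps)
    sinks = sorted(step for step in dag if step not in depended_on)
    return sources, sinks


order = execution_order(workflow)
sources, sinks = sources_and_sinks(workflow)
```

Every step that is neither a source nor a sink produces intermediate results, which is the part of a run that provenance traces typically need to record.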
  • 18. Plenty of workflow tools and platforms: Taverna, Wings, LONI Pipeline
  • 19. What do I want from these workflows and repositories? • As a designer: Discovery • Workflows with similar functionality fragments/methods • Design based on previous templates. • As user/reuser/reviewer: Understandability, Exploration • Search workflows by functionality • Commonalities between execution runs • Component categorization • Reproducibility
  • 20. Working on different aspects of workflow preservation • Workflow representation • Plan/template representation • Provenance trace representation • Link between templates and traces • Creation of abstractions/motifs in scientific workflows • Abstraction catalog • Find how different workflows are related • Understandability and reuse of scientific workflows • Relation between the workflows involved in the same experiment (Research Objects) CH1: Can we export an abstract template of the method being represented? CH2: How do we interoperate with other workflow results? CH3: How do we access the workflow results? CH4: How do we link an abstract method with several implementations? CH5: How can we detect what are the typical operations in scientific workflows? CH6: How can we detect them automatically? CH7: Which workflow parts are related to other workflows? CH8: How do workflows depend on the other parts of the experiments?
  • 21. Overview • Empirical analysis on 260 workflow templates from Taverna, Wings, Galaxy and Vistrails • Catalog of recurring patterns: scientific workflow motifs. • Data Oriented Motifs • Workflow Oriented Motifs • Understandability and reuse http://sensefinancial.com/wp-content/uploads/2012/02/contribution.jpg Common motifs in scientific workflows: An empirical analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013
  • 22. Approach • Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence • Identify workflow abstractions that would facilitate understandability and therefore effective re-use
  • 23. Motif Catalog
    Data-Oriented Motifs (What?)
      Data Retrieval
      Data Preparation
      Format Transformation
      Input Augmentation and Output Splitting
      Data Organisation
      Data Analysis
      Data Curation/Cleaning
      Data Moving
      Data Visualisation
    Workflow-Oriented Motifs (How?)
      Intra-Workflow Motifs
        Stateful (Asynchronous) Invocations
        Stateless (Synchronous) Invocations
        Internal Macros
        Human Interactions
      Inter-Workflow Motifs
        Atomic Workflows
        Composite Workflows
        Workflow Overloading
    Ontology PURL: http://purl.org/net/wf-motifs
  • 24. Macro abstraction detection Problem statement: Given a repository of workflow templates (either abstract or specific) or workflow execution traces, what are the workflow fragments I can deduce from it? Useful for: • Systems like Taverna and Wings: (Many templates, little annotation to relate them) • Finding relationships between workflows and sub-workflows. • Most used fragments, most executed, etc. • Systems like GenePattern, LONI Pipeline and Galaxy: (Many runs, nearly no templates published) • Proposing new templates with the popular fragments.
  • 25. Common workflow fragment detection [Holder et al 1994]: Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook, and S. Djoko. AAAI Workshop on Knowledge Discovery, pages 169-180, 1994. • Given a collection of workflows, which are the most common fragments? • Common sub-graphs among the collection • Sub-graph isomorphism (NP-complete) • We use subgraph mining algorithms • Graph Grammar learning • The rules of the grammar are the workflow fragments • Graph based hierarchical clustering • Each cluster corresponds to a workflow fragment • Iterative algorithm with two measures for compressing the graph: • Minimum Description Length (MDL) • Size
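The question "which are the most common fragments?" can be illustrated with a deliberately simplified sketch. This is not SUBDUE, graph grammar learning or MDL-based compression: it only counts linear chains of step labels, and the `workflows` data and function names are hypothetical. It is enough, though, to show the trade-off between fragment size and frequency mentioned in the speaker notes.

```python
from collections import Counter

# Hypothetical workflows, each given as edges between step labels
# (producer, consumer). Real templates would come from a repository.
workflows = {
    "wf1": [("retrieve", "clean"), ("clean", "analyse"), ("analyse", "plot")],
    "wf2": [("retrieve", "clean"), ("clean", "analyse"), ("analyse", "export")],
    "wf3": [("retrieve", "clean"), ("clean", "merge")],
}


def chains(edges, length):
    """All label paths made of `length` consecutive edges in one workflow."""
    paths = {(a, b) for a, b in edges}
    for _ in range(length - 1):
        paths = {p + (b,) for p in paths for a, b in edges if p[-1] == a}
    return paths


def fragment_support(collection, length):
    """Count in how many workflows each chain fragment occurs."""
    support = Counter()
    for edges in collection.values():
        support.update(chains(edges, length))
    return support


support_1 = fragment_support(workflows, 1)
support_2 = fragment_support(workflows, 2)
support_3 = fragment_support(workflows, 3)
```

In this toy collection the one-edge fragment `retrieve → clean` occurs in all three workflows, the two-edge fragment `retrieve → clean → analyse` in two, and no three-edge fragment in more than one: larger fragments are more informative but rarer.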
  • 26. Exporting the fragment results: Wf-FD model http://purl.org/net/wf-fd
  • 27. Exporting the fragment results: Wf-FD model
  • 28. Ingredients for reproducibility
  • 29. Preserving the infrastructure http://vocab.linkeddata.es/wicus/
  • 30. Working on different aspects of workflow preservation • Workflow representation • Plan/template representation • Provenance trace representation • Link between templates and traces • Creation of abstractions/motifs in scientific workflows • Abstraction catalog • Find how different workflows are related • Understandability and reuse of scientific workflows • Relation between the workflows involved in the same experiment (Research Objects) CH1: Can we export an abstract template of the method being represented? CH2: How do we interoperate with other workflow results? CH3: How do we access the workflow results? CH4: How do we link an abstract method with several implementations? CH5: How can we detect what are the typical operations in scientific workflows? CH6: How can we detect them automatically? CH7: Which workflow parts are related to other workflows? CH8: How do workflows depend on the other parts of the experiments?
  • 31. What is a Research Object? • Aggregation of resources that bundles together the contents of a research work: • Data • Experiments • Examples • Bibliography • Annotations • Provenance • ROs • Etc. http://www.researchobject.org/ Workflow-Centric Research Objects: First Class Citizens in Scholarly Discourse. Belhajjame, K.; Corcho, O.; Garijo, D.; Zhao, J.; Missier, P.; Newman, D.; Palma, R.; Bechhofer, S.; García, E.; Manuel, .G. J.; Klyne, G.; Page, K.; Roos, M.; Ruiz, J. E.; Soiland-Reyes, S.; Verdes-Montenegro, L.; De Roure, D.; and Goble, C. In Proceedings of the Second International Conference on the Future of Scholarly Communication and Scientific Publishing Sepublica2012, pages 1-12, Hersonissos, 2012
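The idea of a Research Object as an aggregation that can itself contain other Research Objects can be sketched with two small classes. This is an illustration only: the class names, the `kind` labels and the `example.org` URIs are hypothetical and are not the vocabulary of the Research Object model published at researchobject.org.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Resource:
    uri: str
    kind: str  # e.g. "data", "workflow", "annotation", "provenance"


@dataclass
class ResearchObject:
    uri: str
    aggregates: List[Union[Resource, "ResearchObject"]] = field(default_factory=list)

    def add(self, item: Union[Resource, "ResearchObject"]) -> "ResearchObject":
        """Aggregate a resource or a nested Research Object; returns self for chaining."""
        self.aggregates.append(item)
        return self

    def resources(self) -> Iterator[Resource]:
        """Yield every resource, descending into nested Research Objects."""
        for item in self.aggregates:
            if isinstance(item, ResearchObject):
                yield from item.resources()
            else:
                yield item


# Hypothetical example: an experiment RO that bundles a nested workflow RO.
workflow_ro = ResearchObject("http://example.org/ro/workflow").add(
    Resource("http://example.org/ro/workflow/template", "workflow")
).add(Resource("http://example.org/ro/workflow/trace", "provenance"))

experiment_ro = ResearchObject("http://example.org/ro/experiment").add(
    Resource("http://example.org/ro/experiment/input.csv", "data")
).add(workflow_ro).add(
    Resource("http://example.org/ro/experiment/notes", "annotation")
)

kinds = [r.kind for r in experiment_ro.resources()]
```

Walking the aggregation recursively is what lets a preservation checklist ask, for a whole experiment, whether data, workflow, provenance and annotations are all present.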
  • 32. ROHub and rohub.linkeddata.es http://www.rohub.org/rodl/ http://rohub.linkeddata.es/
  • 33. Workflow (and RO) Preservation Checklists
  • 34. Acknowledgements :collaboratesWith :collaboratesWith :collaboratesWith :collaboratesWith :supervises :supervises :yolandGil :khalidBelhajjame :varunRatnakar :caroleGoble :pinarAlper :danielGarijo :collaboratesWith :collaboratesWith :idafenSantana :olgaGiraldo Laboratory Protocols Wf Infrastructure :supervises :oscarCorcho OEG
  • 35. The role of annotation in reproducibility ESWC2014 Empirical workshop 26/05/2014 Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho
  • 36. A final note on terminology Source: Idafen Santana; Inspired by [Goble, 2012]