
Data Provenance and Scientific Workflow Management


Introductory class on techniques and tools to manage scientific data, focusing on sources of information and data analysis. Lecturer: Prof. Kelly Rosa Braghetto, a NeuroMat associate investigator and a professor at the University of São Paulo's Department of Computer Science.

Published in: Education


  1. Data Provenance and Scientific Workflow Management
     Data Provenance, Neuroscience Data, Scientific Workflow Management (and Questionnaires)
     Kelly Rosa Braghetto (kellyrb@ime.usp.br)
     Departamento de Ciência da Computação, Instituto de Matemática e Estatística, Universidade de São Paulo
     June 5, 2013
  2. Agenda
       1. Data Provenance
       2. Neuroscience Data
            • CARMEN Project
            • NEMO Project
       3. Scientific Workflow Management Systems (SWMS)
            • Taverna
       4. Questionnaires
  3. Data Provenance: frequently asked questions for scientists
       • Where was a document found?
       • How was this data set produced?
       • Were all facts included in this decision?
       • Were all the latest figures included in this diagram?
       • Can this scientific experiment be reproduced?
     Source: http://openprovenance.org/
  4. Data Provenance
     What is Provenance?
     Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact.
     Why does Provenance matter?
     The provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgments based on provenance that may or may not be explicitly offered to them.
     Problem: lack of a standard model.
     Source: http://www.w3.org/2011/prov/wiki/Main_Page
  5. Data Provenance: works devoted to data provenance
       • Provenance Working Group, maintained by W3C. “Mission: to support the widespread publication and use of provenance information of Web documents, data, and resources.” http://www.w3.org/2011/prov/wiki/Main_Page
       • Wf4Ever project. “Wf4Ever addresses some of the challenges associated to the preservation of scientific experiments in data-intensive science.” http://www.wf4ever-project.org/
       • Open Provenance Model (OPM). http://openprovenance.org/
  6. Data Provenance: Open Provenance Model (OPM)
     The Open Provenance Model is a model of provenance that is designed to meet the following requirements:
       1. To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
       2. To allow developers to build and share tools that operate on such a provenance model.
       3. To define provenance in a precise, technology-agnostic manner.
       4. To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.
       5. To allow multiple levels of description to coexist.
       6. To define a core set of rules that identify the valid inferences that can be made on provenance representation.
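To make the idea of a shared provenance model more concrete, here is a minimal sketch of a provenance graph and a lineage query. The edge labels (`used`, `wasGeneratedBy`, `wasControlledBy`) are borrowed from OPM's core vocabulary, but the `ProvenanceGraph` class, the node names, and the transitive traversal are illustrative assumptions only; they are not OPM's formal inference rules and not any tool's API.

```python
from dataclasses import dataclass, field

# Edge labels borrowed from OPM's core vocabulary; everything else is hypothetical.
USED = "used"
WAS_GENERATED_BY = "wasGeneratedBy"
WAS_CONTROLLED_BY = "wasControlledBy"


@dataclass
class ProvenanceGraph:
    """Toy provenance graph: nodes are strings, edges point from effect to cause."""
    edges: list = field(default_factory=list)  # (effect, label, cause) triples

    def add(self, effect, label, cause):
        self.edges.append((effect, label, cause))

    def causes(self, node):
        """Direct causes of a node, regardless of edge label."""
        return [cause for effect, _, cause in self.edges if effect == node]

    def lineage(self, node):
        """Everything the node depends on, found by following edges transitively."""
        seen, stack = set(), [node]
        while stack:
            for cause in self.causes(stack.pop()):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical EEG analysis: raw data is filtered, then averaged into an ERP.
g = ProvenanceGraph()
g.add("filter_process", USED, "raw_eeg")
g.add("filtered_eeg", WAS_GENERATED_BY, "filter_process")
g.add("filter_process", WAS_CONTROLLED_BY, "researcher")
g.add("average_process", USED, "filtered_eeg")
g.add("erp", WAS_GENERATED_BY, "average_process")

print(sorted(g.lineage("erp")))
```

Answering "how was this data set produced?" then amounts to walking the graph backwards from the artifact of interest.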
  7. Neuroscience Data: projects recording provenance of neuroscience data
       • Code Analysis, Repository & Modelling for e-Neuroscience (CARMEN), http://www.carmen.org.uk/
         “CARMEN is an e-Science Pilot Project funded by the Engineering and Physical Sciences Research Council (UK). It will deliver a virtual laboratory for neurophysiology, enabling sharing and collaborative exploitation of data, analysis code and expertise. Neural activity recordings (signals and image series) are the primary data types.”
       • Neural ElectroMagnetic Ontologies (NEMO), http://nemo.nic.uoregon.edu/wiki/NEMO
     [More details in the next slides...]
  8. Neuroscience Data: CARMEN Project
     The CARMEN consortium: “A core part of our work is the development of minimum reporting guidelines for annotation of data and other computational resources for the purpose of sharing”
     Result: a MINI module for Electrophysiology
       • MINI (Minimum Information about a Neuroscience Investigation) is a family of reporting guideline documents
       • A module represents the minimum information that should be reported about a dataset to:
            • facilitate computational access and analysis
            • allow a reader to interpret and critically evaluate the process performed and the conclusions reached
            • support their experimental corroboration
  9. Neuroscience Data: CARMEN Project, MINI module for Electrophysiology
     The reporting recommendations cover both extracellular and intracellular electrophysiology.
     Covered data:
       • date stamps and responsible persons
       • the subject under study
       • the subject task or stimulus, if appropriate
       • the recording protocol and the resulting description of time series data
     The entire module is described in: http://www.carmen.org.uk/standards/mini.pdf
     The module is registered in the MIBBI portal (http://www.biosharing.org/standards/mibbi and http://mibbi.sourceforge.net/legacy.shtml).
     MIBBI (Minimum Information for Biological and Biomedical Investigations) is a pioneering project that aims to coordinate guidelines for the reporting of metadata across domains.
  10. Neuroscience Data: NEMO Project, Neural ElectroMagnetic Ontologies (NEMO)
       • An NIH-funded project
       • Aims to create EEG and MEG ontologies and ontology-based tools. These resources will be used to support representation, classification, and meta-analysis of brain electromagnetic data.
       • Based on three pillars: DATA, ONTOLOGY, and DATABASE
            • Data: raw EEG, averaged EEG (ERPs), and ERP data analysis results
            • Ontologies: include concepts related to ERP data (including spatial and temporal features of ERP patterns), data provenance, and the cognitive and linguistic paradigms that were used to collect the data
            • Database: the NEMO database portal is a large repository that stores NEMO consortium data, data analysis results, and data provenance
      Site: http://nemo.nic.uoregon.edu
  11. Neuroscience Data: NEMO Project, ontology (informal definition)
      In both computer science and information science, an ontology represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
      Ontologies are used as a form of knowledge representation about the world or some part of it. Ontologies generally describe:
       • Individuals: the basic or “ground level” objects
       • Classes: sets, collections, or types of objects
       • Attributes: properties, features, characteristics, or parameters that objects can have and share
       • Relations: ways that objects can be related to one another
       • Events: the changing of attributes or relations
      Source: http://neurolex.org
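The building blocks listed above (individuals, classes, attributes, relations) can be illustrated with a very small data structure. This is only a sketch under stated assumptions: the `Ontology` class and every term in the example (`ERP_component`, `EEG_measurement`, `peak_01`, `recorded_in`) are hypothetical and are not taken from the NEMO ontology. Real ontologies are normally written in dedicated languages and processed by dedicated reasoners.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology holding classes, individuals, attributes, and relations."""
    subclass_of: dict = field(default_factory=dict)  # class -> parent class
    instance_of: dict = field(default_factory=dict)  # individual -> class
    attributes: dict = field(default_factory=dict)   # individual -> {name: value}
    relations: list = field(default_factory=list)    # (subject, relation, object)

    def classes_of(self, individual):
        """Return the individual's class followed by all of its ancestor classes."""
        chain = []
        cls = self.instance_of.get(individual)
        while cls is not None:
            chain.append(cls)
            cls = self.subclass_of.get(cls)
        return chain


# All terms below are made up for illustration.
onto = Ontology()
onto.subclass_of["ERP_component"] = "EEG_measurement"      # class hierarchy
onto.instance_of["peak_01"] = "ERP_component"              # an individual
onto.attributes["peak_01"] = {"latency_ms": 300}           # an attribute
onto.relations.append(("peak_01", "recorded_in", "session_A"))  # a relation

print(onto.classes_of("peak_01"))
```

The `classes_of` lookup is a very simple form of reasoning: the individual is known to belong to every ancestor of its declared class.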
  12. Neuroscience Data: NEMO Project, MINEMO (an extension of the MINI module for Electrophysiology)
       • MINEMO = Minimal Information for Neural Electromagnetic Ontologies
       • “A standards-compliant method for analysis and integration of event-related potentials (ERP) data”; in other words, a checklist for the description of ERP studies
       • The checklist comprises no more than 60 fields; 20 of these fields are considered “mandatory”
       • MINEMO promotes the use of controlled vocabularies (or lexicons) for data annotation. Aim: to conduct cross-lab meta-analysis
       • Each MINEMO checklist item is linked to a term defined in the NEMO ontology
  13. Neuroscience Data: NEMO Project, subset of “mandatory” MINEMO terms
       1. Research lab (General features)
       2. Experiment (General features)
       3. Publication
       4. Study subjects (Group characteristics)
       5. Experiment condition
       6. Stimulus representation
       7. Behavioral data collection
       8. EEG data collection
       9. EEG/ERP data preprocessing
       10. EEG/ERP data file
      The entire set of terms is defined in the article: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3235514/
      They are also in the MIBBI portal:
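A checklist of this kind lends itself to automatic validation. The sketch below checks a study description against the ten term groups listed on the slide. The dictionary layout and the `missing_groups` helper are assumptions made for illustration; the actual MINEMO field names and formats are defined in the article cited above, not here.

```python
# The ten term groups are copied from the slide; the dict layout is hypothetical.
MANDATORY_TERM_GROUPS = [
    "Research lab",
    "Experiment",
    "Publication",
    "Study subjects",
    "Experiment condition",
    "Stimulus representation",
    "Behavioral data collection",
    "EEG data collection",
    "EEG/ERP data preprocessing",
    "EEG/ERP data file",
]


def missing_groups(description):
    """Return the mandatory term groups that are absent or empty in a description."""
    missing = []
    for group in MANDATORY_TERM_GROUPS:
        value = description.get(group)
        if value is None or not str(value).strip():
            missing.append(group)
    return missing


# A made-up, incomplete study description.
study = {
    "Research lab": "Example Lab",
    "Experiment": "Example oddball experiment",
    "Publication": "",
    "EEG data collection": "64 channels",
}

print(missing_groups(study))
```

A repository could run such a check at upload time and reject data sets whose mandatory annotations are incomplete.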
  14. Neuroscience Data: NEMO Project, more about NEMO...
       • Data in the NEMO Portal are aligned with the MINEMO checklist and ontology: https://portal.nemo.nic.uoregon.edu
       • NIF (the Neuroscience Information Framework project, http://www.neuinfo.org/) uses the NEMO ontology. NIF aggregates online sources of neuroscience data, including databases, web sites, and publications, and provides a search interface across these disparate sources
       • The NEMO ontology can be browsed at: http://bioportal.bioontology.org/ontologies/40522
  15. Neuroscience Data: NEMO Project, a “detail” to worry about...
       • The MINI module for Electrophysiology and MINEMO do not cover the description of image data
       • To see later: MIfMRI (Minimum Information about an fMRI Study), http://www.fmrimethods.org/
  16. Scientific Workflow Management Systems (SWMS): scientific workflows
       • A data analysis (or processing) can generally be described as a workflow, i.e., a set of computational tasks that “transform” data
       • In Bioinformatics, a workflow is frequently called a pipeline
       • In a workflow, the output data of a task is generally used as input data for one or more other tasks. So, the flow of data defines an execution order for the workflow's tasks
       • Frequently, the same task can appear in more than one workflow
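The statement that the flow of data defines an execution order can be sketched as a topological sort. In the example below, each task declares the data it reads and writes, and the order is derived from those declarations using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task and data set names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: task name -> (input data sets, output data sets).
tasks = {
    "plot": ({"erp", "epochs"}, {"figure"}),
    "average": ({"epochs"}, {"erp"}),
    "filter": ({"raw_eeg"}, {"filtered_eeg"}),
    "segment": ({"filtered_eeg"}, {"epochs"}),
}


def execution_order(tasks):
    """Derive a valid task order from the data each task consumes and produces."""
    # Map every data set to the task that produces it.
    producer = {out: name for name, (_, outs) in tasks.items() for out in outs}
    # A task depends on the producers of its inputs; inputs with no producer
    # (such as raw_eeg) are treated as externally supplied data.
    deps = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in tasks.items()
    }
    return list(TopologicalSorter(deps).static_order())


print(execution_order(tasks))
```

The tasks are deliberately declared out of order: the result depends only on the data dependencies, not on the order in which the tasks were written down.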
  17. Scientific Workflow Management Systems (SWMS): Scientific Workflow Management System (SWMS)
       • A computational tool that controls the execution of workflows
       • It provides mechanisms for a scientist to describe his/her workflow using “intuitive” modeling languages
       • It can optimize the execution considering the characteristics of the available computational resources
       • It helps to generate provenance data of an analysis process. In addition, it improves the reproducibility of analyses
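How a workflow system can generate provenance as a by-product of execution can be shown with a toy runner. This is a minimal sketch, not how Taverna or any other SWMS is implemented: the `run_workflow` function, the step tuple layout, and the record fields are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def run_workflow(steps, initial_data):
    """Run steps in order; return the data store and an execution history.

    Each step is a (name, function, input names, output name) tuple.
    """
    data = dict(initial_data)  # work on a copy so the caller's dict is untouched
    history = []
    for name, func, inputs, output in steps:
        started = datetime.now(timezone.utc).isoformat()
        data[output] = func(*[data[i] for i in inputs])
        # One provenance record per executed task.
        history.append({
            "task": name,
            "used": list(inputs),
            "generated": output,
            "started": started,
        })
    return data, history


# Hypothetical two-step analysis.
steps = [
    ("scale", lambda xs: [x * 2 for x in xs], ["raw"], "scaled"),
    ("mean", lambda xs: sum(xs) / len(xs), ["scaled"], "average"),
]

store, history = run_workflow(steps, {"raw": [1, 2, 3]})
for record in history:
    print(record["task"], "used", record["used"], "generated", record["generated"])
```

Because intermediate results stay in the data store and every task leaves a record, the analysis can be inspected or repeated later without relying on the scientist's memory.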
  18. Scientific Workflow Management Systems (SWMS): most successful SWMSs
       • Taverna: http://www.taverna.org.uk
       • VisTrails: http://www.vistrails.org
       • Kepler: https://kepler-project.org
       • Galaxy: http://galaxyproject.org
  19. Scientific Workflow Management Systems (SWMS): online workflow repositories for collaborative science
      myExperiment project (http://www.myexperiment.org/):
       • Users upload their workflow models
       • Models are categorized according to their research domain
       • Users can search and download models uploaded by other users
       • The site stores models from different SWMSs (Taverna, Kepler, etc.)
  20. Scientific Workflow Management Systems (SWMS): Taverna
      Features:
       • Graphical user interface for the description of the workflows
       • Easy installation and use
       • Recording of the “execution history” and intermediate results (= provenance data of the entire analysis)
       • Provenance export capability to OPM
  21. Questionnaires: automatic generation of online questionnaires
      There are computational tools that automatically generate electronic questionnaires. One of the most widely used is LimeSurvey (https://www.limesurvey.org/).
      Functionalities of LimeSurvey:
       • Generates online questionnaires
       • Offers a large set of question types
       • Keeps questionnaire data in a real database
       • Manages users
       • Creates a print version of questionnaires
       • Performs basic statistical analysis
       • ...
