Scientific Workflows:
what do we have, what do we miss?
Paolo Romano
IRCCS AOU San Martino – IST,
Genova, Italy
(paolo.dm.romano@gmail.com, skype: p.romano)
Talk outline
• Aims of data integration in Life Sciences
• A methodology for the automation of data retrieval
and analysis processes
• Workflow Management Systems
• Issues related to:
  o automatic composition,
  o execution performance,
  o workflow reuse
22 June 2013, Scientific Workflows: what do we miss?
Biomedical databases
Accessible on-line by means of
human-centered interfaces
Don’t share interface, data
contents and structure, encoding
Don’t interoperate
Oblige researchers to
“cut & paste” data
May have huge size
Some figures
European Nucleotide Archive:
195,241,608 sequences, 292,078,866,691 bases
UniProtKB:
12,347,303 sequences, 3,974,018,240 AAs
PRIDE: 111,219,191 spectra
IntAct: 229,082 interactions
ArrayExpress:
~16,000 experiments, ~450,000 hybridizations
DB size
Next-Generation Sequencing: 16Gb / experiment!
An international collaboration aimed at building a
detailed map of human genome variability.
Pilot phase: identification of 95% of variations
present in at least 1% of population for three
ethnic groups (Oct 28, 2010).
Data: ~4.9 Tbases (~3 Gbases/individual)
Found: 15M mutations, 1M deletions/insertions,
20K major variants
The 1000 Genomes Consortium. A map of human genome variation from population
scale sequencing. Published online in Nature on 28 October 2010.
DOI:10.1038/nature09534 http://www.1000genomes.org/
1000 Genomes Project
Impossible without
bioinformatics
Unmanageable without
automation of processes
Data integration: aims
• Data integration and automation of retrieval
and analysis processes are needed for:
o Achieving a precise and comprehensive view of
available information
o Carrying out queries and analyses involving many
databases and software tools automatically
o Carrying out analysis of huge data quantities
efficiently
o Implementing effective data mining
“A computerized facilitation or automation of a business
process, in whole or part” (Workflow Management Coalition)
Aim:
• Implementing data analysis processes in standardized
environments
Main advantages:
• efficiency: being automatic procedures, workflows free
researchers from repetitive tasks and support “good practices”,
• reproducibility: analyses may be replicated over time, easily
and effectively,
• reuse: both intermediate results and workflows may be
reused,
• traceability: the workflow is enacted in an environment
that allows tracing back results.
What is a Workflow
An experiment
Prediction of the structure of a protein by homology
Researchers carrying out the analysis need to know:
• Which tools and databases are needed, where they
reside, and how to use them
• In which order they must be used
• How to transfer data between them
• How to reconcile semantics of data used by
services
Manual
In an automated procedure
software must:
• Know which tool/database is able to carry out a given
task (e.g. aligning sequences, retrieving protein
structure data)
• Find real implementations (e.g. BLAST, provided
by NCBI)
• Link services in a workflow that achieves
the desired task
• Transfer data appropriately between services
Automatic
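The four requirements above can be sketched in a few lines of Python. This is an illustrative toy, not how any particular WfMS works: the registry, the task names and the stand-in functions (`fake_blast`, `fake_structure_lookup`) are hypothetical placeholders for real remote services.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an abstract task to a concrete implementation.
# In a real system the entries would be remote services (e.g. BLAST at NCBI).
Registry = Dict[str, Callable[[str], str]]


def fake_blast(sequence: str) -> str:
    """Stand-in for a sequence alignment service; returns a made-up hit ID."""
    return f"HIT_FOR_{sequence[:4]}"


def fake_structure_lookup(hit_id: str) -> str:
    """Stand-in for a structure retrieval service."""
    return f"STRUCTURE_OF_{hit_id}"


REGISTRY: Registry = {
    "align_sequence": fake_blast,
    "retrieve_structure": fake_structure_lookup,
}


def run_workflow(tasks: List[str], data: str, registry: Registry) -> str:
    """Resolve each abstract task, call it, and pass its output to the next."""
    for task in tasks:
        service = registry[task]  # find a real implementation for the task
        data = service(data)      # transfer data between services
    return data


result = run_workflow(["align_sequence", "retrieve_structure"], "MKTAYIAK", REGISTRY)
print(result)
```

The order of the task list plays the role of the workflow definition; swapping a registry entry changes the implementation without touching the workflow itself.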
Workflow for CABRI Network Services
o Define XML languages with controlled vocabularies
o Archive data in XML formats
o Make use of Web Services for data exchange
between services
o Associate data and analysis to proper items of an
ontology of bioinformatics data, data types, and
tasks
o Encode processes as workflows
Methodology: components
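As a rough illustration of the first two components (XML records plus a controlled vocabulary), the sketch below checks one field of a record against an allowed set of values. The element names, the record and the vocabulary are invented for the example and do not come from any actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary for an <organism_type> element.
ORGANISM_TYPES = {"bacteria", "fungi", "yeast", "plasmid"}

# Hypothetical record; element names are illustrative only.
RECORD = """
<strain id="EX-001">
  <organism_type>bacteria</organism_type>
  <name>Escherichia coli</name>
</strain>
"""


def check_record(xml_text: str, vocabulary: set) -> list:
    """Return a list of problems found in one XML record (empty if valid)."""
    root = ET.fromstring(xml_text)
    problems = []
    value = root.findtext("organism_type")
    if value is None:
        problems.append("missing <organism_type>")
    elif value.strip() not in vocabulary:
        problems.append(f"unknown organism_type: {value.strip()!r}")
    return problems


print(check_record(RECORD, ORGANISM_TYPES))
```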
Both industrial and academic Workflow Management Systems (WfMS)
are available and their use for Life Sciences is now widespread.
• Biopipe, an add-on for BioPerl
• GPipe, an extension of Pise
• Taverna (EBI), a component of the myGrid platform
• Pegasys (University of British Columbia)
• EGene (Universidade de São Paulo)
• Wildfire (Bioinformatics Institute, Singapore)
• Pipeline Pilot (SciTegic)
• BioWBI, Bioinformatic Workflow Builder Interface (IBM)
Workflow Management Systems
Software | Type | Standard | License | URL
Taverna Workbench | Stand-alone | XScufl | Open source | http://taverna.sourceforge.net/
Biopipe | Software library | Pipeline XML | Open source | http://www.gmod.org/biopipe/
ProGenGrid | Stand-alone | NA | NA | http://datadog.unile.it/progen
DiscoveryNet | Stand-alone | DPML | Commercial | http://www.discovery-on-the.net/
Kepler | Stand-alone | MoML | Open source | http://kepler-project.org/
GPipe | Web interface, local services | GPipe XML | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
EGene | Stand-alone | NA | Open source | http://www.lbm.fmvz.usp.br/egene/
BioWMS | Web interface, remote services | XPDL | Public use | http://litbio.unicam.it:8080/biowms/
BioWEP | Portal | XScufl, XPDL | Open source | http://bioinformatics.istge.it/biowep/
BioWBI | Web interface, local services | Proprietary | Commercial | http://www.alphaworks.ibm.com/tech/biowbi
Pegasys | Stand-alone | Pegasys DAG | Open source | http://bioinformatics.ubc.ca/pegasys/
Wildfire | Stand-alone | GEL | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
Triana | Stand-alone | Triana Workflow Language | Open source | http://www.trianacode.org/
Pipeline Pilot | Stand-alone | Proprietary | Commercial | http://www.scitegic.com/
FreeFluo | Software library | WSFL and XScufl | Open source | http://freefluo.sourceforge.net/
Biomake | Software library | NA | Open source | http://skam.sourceforge.net/
Workflow Management Systems
Various software types and different standards
Taverna Workbench is the best known and most widely
adopted WfMS in the life sciences
• Developed in the context of the myGrid platform
• Main developers: Univ. of Manchester and EBI
• Open source at SourceForge.net
It allows users to:
• Build and execute workflows for complex analysis
• … by getting access to remote and local services
• … displaying results in various formats
• … describing data through an ad-hoc ontology
Requirements: Java plus Windows / Mac / Linux
Open source: http://taverna.sourceforge.net/ Current version: 2.4
Taverna Workbench
WfMS are increasingly used for data integration
and analysis in biomedical research.
Here, we highlight some of the current issues.
Issues:
• Automatic composition of workflows
• Performance
• Reproducibility and reuse
WfMS: some current issues
Researchers only care about scientific
results!
• Building workflows may be a burden
• Various skills are required, and GUIs do not
solve the problem
• Workflow composition should be much
simpler, and become semi-automatic
Automatic composition
Automatic composition
Diagram: components of automatic composition
• Automatic selection of best services
• Automated service identification and composition
• Adapters for different data formats
• Automatic conversion of formats
• Ontology of methods, tools and data types
• Integration with repositories
• Controlled Language Interface
A trade-off is required between rich
semantic annotations and design
complexity.
Semantic-based solutions are available
for controlled sets of services.
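A minimal sketch of what automated composition over a controlled set of annotated services could look like: each service declares one input and one output data type, and a breadth-first search looks for the shortest chain from the data at hand to the data wanted. The service names and type labels are hypothetical, and real services usually have several inputs and outputs, which this toy ignores.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical annotated services: name -> (input data type, output data type).
SERVICES: Dict[str, Tuple[str, str]] = {
    "fetch_sequence": ("gene symbol", "nucleotide sequence"),
    "translate": ("nucleotide sequence", "protein sequence"),
    "search_homologs": ("protein sequence", "alignment"),
    "build_model": ("alignment", "protein structure"),
}


def compose(source: str, target: str,
            services: Dict[str, Tuple[str, str]]) -> Optional[List[str]]:
    """Breadth-first search for the shortest service chain from source to target type."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        data_type, chain = queue.popleft()
        if data_type == target:
            return chain
        for name, (needs, gives) in services.items():
            if needs == data_type and gives not in seen:
                seen.add(gives)
                queue.append((gives, chain + [name]))
    return None  # no chain of annotated services connects the two types


print(compose("gene symbol", "protein structure", SERVICES))
```

The search is only as good as the annotations: the richer the type vocabulary, the more precise the match, at the cost of more annotation work per service.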
Beyond Taverna
The myGrid team developed tools for identifying
services and supporting reuse of workflows
BioCatalogue
Annotated catalogue of Web Services for Life
Science
MyExperiment
Repository of workflows for Life Science, enabled
by social networking features
Allows defining all:
• Data analysis tasks for bioinformatics
• Data types
• Possible relations between tasks and data types (I/O)
• Transformations between equivalent data (format)
• Transformations between related data (through elaboration,
e.g.: triplet → AA, gene symbol → sequence)
Fundamental in order to:
• Validate data flow and elaborations
• Support automatic workflow composition
EDAM (EMBRACE Data and Methods) Ontology
EDAM Ontology
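To make "validate data flow" concrete, here is a small sketch in which every operation is annotated with the data type it consumes and the data type it produces, and a workflow is checked step by step. The labels are written in the spirit of EDAM but are illustrative strings, not actual EDAM term identifiers.

```python
from typing import Dict, List, Tuple

# Hypothetical annotations: operation -> (consumed data type, produced data type).
OPERATIONS: Dict[str, Tuple[str, str]] = {
    "translation": ("nucleotide sequence", "protein sequence"),
    "sequence alignment": ("protein sequence", "alignment"),
    "tree construction": ("alignment", "phylogenetic tree"),
}


def validate_flow(steps: List[str],
                  operations: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return one message per adjacent pair of steps whose data types do not match."""
    errors = []
    for current, following in zip(steps, steps[1:]):
        produced = operations[current][1]
        expected = operations[following][0]
        if produced != expected:
            errors.append(
                f"{current} produces {produced!r} but {following} expects {expected!r}"
            )
    return errors


valid = validate_flow(["translation", "sequence alignment", "tree construction"], OPERATIONS)
broken = validate_flow(["translation", "tree construction"], OPERATIONS)
print(valid)
print(broken)
```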
EDAM (EMBRACE Data and Methods)
Topic: context of the analysis (domain of a study or an experiment)
Operation: task carried out
Data: a data type used in bioinformatics
Format: a format used for encoding some data
http://edamontology.sourceforge.net/
EDAM Ontology
Topic
"A general bioinformatics subject or category, such as a field of
study, data, processing, analysis or technology."
"Biological data resources", "Nucleic acid analysis",
"Protein analysis", "Sequence analysis",
"Structure analysis", "Phylogenetics",
"Proteomics", "Data handling",
"Chemoinformatics", "Transcriptomics",
"Literature and reference", "Ontologies, nomenclature and classification",
"Immunoinformatics", "Genetics", "Systems biology",
"Ecoinformatics", "Genomics"
Operation
"A function or process performed by a tool; what is done, but
not (typically) how or in what context."
"Alignment", "Analysis and processing",
"Annotation", "Classification",
"Comparison", "Editing",
"Mapping and assembly", "Modelling and simulation",
"Optimisation and refinement", "Plotting and rendering",
"Prediction, detection and recognition",
"Search and retrieval", "Validation and standardisation"
Data
"A type of data in common use in bioinformatics."
Includes: Core data, Identifier, Parameter, Report
"Alignment", "Article", "Biological model",
"Classification", "Codon usage table", "Data index",
"Data reference", "Experimental measurement",
"Gene expression profile", "Image", "Map", "Matrix",
"Microarray data", "Molecular interaction", "Molecular property",
"Ontology", "Ontology concept", "Pathway or network",
"Phylogenetic raw data", "Phylogenetic tree",
"Reaction data", "Schema", "Secondary structure",
"Sequence", "Sequence motif", "Sequence profile",
"Structural (3D) profile", "Structure", "Workflow"
Format and Identifier
Format
"A specific layout for encoding a specific type of data in a
computer file or memory."
"Binary", "Format (typed)", "HTML", "RDF",
"Text", "XML"
Identifier
"A label that identifies (typically uniquely) something such as
data, a resource or a biological entity."
"Accession", "Identifier (hybrid)", "Identifier (typed)",
"Identifier with metadata", "Name"
Researchers want the best possible results
in the shortest possible time!
No matter which database, site, or computer
is used
Distributed nature of data sources (network
issues, e.g. timeouts and unavailability of sites)
Large data volumes (reduced data transfer)
Complex data analysis (implying HPC/cloud)
Performance
Optimization of performance
Diagram: aspects of optimization
• Runtime error detection
• Task-level failure recovery
• Evaluation of alternative services
• Task dependency analysis & flow parallelization
• Parallelization on cluster or HPC architecture
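Task dependency analysis and flow parallelization can be illustrated with a topological sort that groups tasks into stages: everything in one stage could run concurrently. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the task names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical workflow: task -> set of tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "fetch_sequence": set(),
    "fetch_annotations": set(),
    "align": {"fetch_sequence"},
    "predict_structure": {"align"},
    "report": {"predict_structure", "fetch_annotations"},
}


def parallel_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; tasks within a stage have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(parallel_stages(DEPENDENCIES), start=1):
    print(number, stage)
```

Here the two fetch tasks land in the first stage and could be dispatched in parallel, while the remaining tasks form a strict chain.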
Alternative services
SRS by Web Services (SWS) provides
access to public SRS implementations by
selecting the most up-to-date, working site
for any given database
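The idea of falling back to an alternative site when one is unavailable can be sketched as follows. This is not the SWS interface: the site names and the two stand-in functions are hypothetical and simply simulate one failing and one working mirror.

```python
from typing import Callable, List, Tuple

# Each candidate site is modelled as (name, function). The functions below are
# hypothetical stand-ins for remote calls; they are not the actual SWS interface.
Site = Tuple[str, Callable[[str], str]]


def site_down(entry_id: str) -> str:
    """Simulates an unreachable mirror."""
    raise ConnectionError("site unavailable")


def site_up(entry_id: str) -> str:
    """Simulates a working mirror."""
    return f"record for {entry_id}"


def query_with_fallback(entry_id: str, sites: List[Site]) -> Tuple[str, str]:
    """Try sites in order of preference; return (site name, result) from the first that works."""
    failures = []
    for name, fetch in sites:
        try:
            return name, fetch(entry_id)
        except (ConnectionError, TimeoutError) as error:
            failures.append(f"{name}: {error}")  # task-level recovery: try the next site
    raise RuntimeError("all sites failed: " + "; ".join(failures))


SITES: List[Site] = [("mirror_a", site_down), ("mirror_b", site_up)]
print(query_with_fallback("P12345", SITES))
```

Ordering the list by how recently each site was updated would approximate "most up-to-date, working site"; that ranking step is left out here.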
Reproducibility of analysis in life
sciences is fundamental!
• Dependency on current contents of databases
• Dependency on the current status and
variability of tools
NB! Perfect reproducibility in silico is impossible!
Reuse of intermediate results and procedures
Reproducibility and reuse
Reproducibility & reuse
Diagram: aspects of reproducibility and reuse of results
• State of databases and tools
• Prospective provenance data
• Retrospective provenance data
• Reuse of intermediate results
• Caching
Prospective provenance
Workflow structural model; dependencies
on services, databases, or software
libraries; system dependencies
Retrospective provenance
Observations from run-time events: data
produced and consumed, and services
accessed
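The distinction can be shown with a toy runner: the plan is written down before execution (prospective), while a trace of what was actually consumed and produced is collected at run time (retrospective). The same input fingerprints double as cache keys for reusing intermediate results. All names, the plan contents and the two example steps are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

# Prospective provenance: the plan, known before execution (hypothetical example).
PLAN = {"workflow": "demo", "steps": ["normalise", "count"], "tool_versions": {"demo": "0.1"}}

# Retrospective provenance: one record per step actually executed.
TRACE: List[Dict[str, Any]] = []
# Cache of intermediate results, keyed by step name and input digest.
CACHE: Dict[str, Any] = {}


def digest(value: Any) -> str:
    """Stable fingerprint of a JSON-serialisable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()[:12]


def run_step(name: str, func: Callable[[Any], Any], data: Any) -> Any:
    """Run one step, reusing a cached result when the same input was seen before."""
    key = f"{name}:{digest(data)}"
    cached = key in CACHE
    if not cached:
        CACHE[key] = func(data)
    result = CACHE[key]
    TRACE.append({"step": name, "input": digest(data),
                  "output": digest(result), "from_cache": cached})
    return result


words = run_step("normalise", lambda text: text.lower().split(), "A b A")
counts = run_step("count", lambda items: {w: items.count(w) for w in items}, words)
run_step("normalise", lambda text: text.lower().split(), "A b A")  # same input: cache hit
print(counts)
print([record["from_cache"] for record in TRACE])
```

A real system would also record database release and tool version in each trace record, since those are exactly the dependencies that make perfect reproducibility hard.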
In collaboration with
Paolo MISSIER
School of Computing Sciences, Newcastle University, UK
paolo.missier@ncl.ac.uk
Thanks!
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Scientific Workflows: what do we have, what do we miss?

  • 1. Scientific Workflows: what do we have, what do we miss? Paolo Romano IRCCS AOU San Martino – IST, Genova, Italy (paolo.dm.romano@gmail.com, skype: p.romano)
  • 2. Talk outline
      - Aims of data integration in Life Sciences
      - A methodology for the automation of data retrieval and analysis processes
      - Workflow Management Systems
      - Issues related to:
          - automatic composition,
          - execution performance,
          - workflow reuse
      22 June 2013, Scientific Workflows: what do we miss?
  • 3. Biomedical databases
      - Accessible online by means of human-centered interfaces
      - Don’t share interfaces, data contents and structure, or encoding
      - Don’t interoperate
      - Oblige researchers to “cut & paste” data
      - May be huge in size
  • 4. Some figures: DB size
      - European Nucleotide Archive: 195,241,608 sequences, 292,078,866,691 bases
      - UniProtKB: 12,347,303 sequences, 3,974,018,240 AAs
      - PRIDE: 111,219,191 spectra
      - IntAct: 229,082 interactions
      - ArrayExpress: ~16,000 experiments, ~450,000 hybridizations
      - Next-Generation Sequencing: 16Gb / experiment!
  • 6. 1000 Genomes Project
      An international collaboration aimed at building a detailed map of human genome variability.
      - Pilot phase: identification of 95% of variations present in at least 1% of the population for three ethnic groups (Oct 28, 2010).
      - Data: ~4.9 Tbases (~3 Gbases/individual)
      - Found: 15M mutations, 1M deletions/insertions, 20K major variants
      Reference: The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Published online in Nature on 28 October 2010. DOI:10.1038/nature09534 http://www.1000genomes.org/
  • 7. 1000 Genomes Project
      An international collaboration aimed at building a detailed map of human genome variability.
      - Pilot phase: identification of 95% of variations present in at least 1% of the population for three ethnic groups (Oct 28, 2010).
      - Data: ~4.9 Tbases (~3 Gbases/individual)
      - Found: 15M mutations, 1M deletions/insertions, 20K major variants
      Reference: The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Published online in Nature on 28 October 2010. DOI:10.1038/nature09534 http://www.1000genomes.org/
      Impossible without bioinformatics. Unmanageable without automation of processes.
  • 8. Data integration: aims
      Data integration and automation of retrieval and analysis processes are needed for:
      - Achieving a precise and comprehensive view of available information
      - Carrying out queries and analyses involving many databases and software tools automatically
      - Analysing huge quantities of data efficiently
      - Implementing effective data mining
  • 9. What is a Workflow
      “A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition)
      Aim: implementing data analysis processes in standardized environments
      Main advantages:
      - efficiency: being automatic procedures, workflows free researchers from repetitive tasks and support “good practices”,
      - reproducibility: analyses may be replicated over time, easily and effectively,
      - reuse: both intermediate results and workflows may be reused,
      - traceability: the workflow is enacted in an environment that allows tracing back results.
  • 10. An experiment: prediction of the structure of a protein by homology
  • 11. Manual
      Researchers carrying out the analysis need to know:
      - Which tools and databases are needed, where they reside, and how to use them
      - In which order they must be used
      - How to transfer data between them
      - How to reconcile the semantics of data used by services
  • 12. Automatic
      In an automated procedure, software must:
      - Know which tool/db is able to carry out a given task (e.g. aligning sequences, retrieving protein structure data)
      - Find real implementations (e.g. BLAST, provided by NCBI)
      - Link services in a workflow that achieves the desired task
      - Transfer data appropriately between services
  • 13. Workflow for CABRI Network Services
  • 14. Methodology: components
      - Define XML languages with controlled vocabularies
      - Archive data in XML formats
      - Make use of Web Services for data exchange between services
      - Associate data and analyses with the proper items of an ontology of bioinformatics data, data types, and tasks
      - Encode processes as workflows
  • 15. Workflow Management Systems
      Both industrial and academic WfMS are available and their use for Life Sciences is now widespread.
      - Biopipe, an add-on for bioperl
      - GPipe, an extension of Pise
      - Taverna (EBI), a component of myGrid platform
      - Pegasys (University of British Columbia)
      - EGene (Universidade de São Paulo)
      - Wildfire (Bioinformatics Institute, Singapore)
      - Pipeline Pilot (SciTegic)
      - BioWBI, Bioinformatic Workflow Builder Interface (IBM)
  • 16. Workflow Management Systems: various software types and different standards
      Software | Type | Standard | License | URL
      Taverna Workbench | Stand-alone | XScufl | Open source | http://taverna.sourceforge.net/
      Biopipe | Software library | Pipeline XML | Open source | http://www.gmod.org/biopipe/
      ProGenGrid | Stand-alone | NA | NA | http://datadog.unile.it/progen
      DiscoveryNet | Stand-alone | DPML | Commercial | http://www.discovery-on-the.net/
      Kepler | Stand-alone | MoML | Open source | http://kepler-project.org/
      GPipe | Web interface, local services | GPipe XML | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
      EGene | Stand-alone | NA | Open source | http://www.lbm.fmvz.usp.br/egene/
      BioWMS | Web interface, remote services | XPDL | Public use | http://litbio.unicam.it:8080/biowms/
      BioWEP | Portal | XScufl, XPDL | Open source | http://bioinformatics.istge.it/biowep/
      BioWBI | Web interface, local services | Proprietary | Commercial | http://www.alphaworks.ibm.com/tech/biowbi
      Pegasys | Stand-alone | Pegasys DAG | Open source | http://bioinformatics.ubc.ca/pegasys/
      Wildfire | Stand-alone | GEL | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
      Triana | Stand-alone | Triana Workflow Language | Open source | http://www.trianacode.org/
      Pipeline Pilot | Stand-alone | Proprietary | Commercial | http://www.scitegic.com/
      FreeFluo | Software library | WSFL and XScufl | Open source | http://freefluo.sourceforge.net/
      Biomake | Software library | NA | Open source | http://skam.sourceforge.net/
  • 17. Taverna Workbench
      Taverna Workbench is the best known and most widely adopted WfMS in life sciences
      - Developed in the context of the myGrid platform
      - Univ. Manchester and EBI are the main developers
      - Open source at SourceForge.net
      It allows users to:
      - Build and execute workflows for complex analysis
      - … by getting access to remote and local services
      - … displaying results in various formats
      - … describing data through an ad-hoc ontology
      Requirements: Java plus Windows / Mac / Linux
      Open source: http://taverna.sourceforge.net/
      Current version: 2.4
  • 18. WfMS: some current issues
      WfMS are increasingly used for data integration and analysis in biomedical research. Here, we highlight some current issues:
      - Automatic composition of workflows
      - Performance
      - Reproducibility and reuse
  • 19. Automatic composition
      Researchers only care about scientific results!
      - Building workflows may be a burden
      - Various skills are required, and GUIs do not solve this
      - Workflow composition should be much simpler, and become semi-automatic
  • 20. Automatic composition
      - Automatic selection of best services
      - Automated service identification and composition
      - Adapters for different data formats
      - Automatic conversion of formats
      - Ontology of methods, tools and data types
      - Integration with repositories
      - Controlled Language Interface
  • 21. Automatic composition
      - Automatic selection of best services
      - Automated service identification and composition
      - Adapters for different data formats
      - Automatic conversion of formats
      - Ontology of methods, tools and data types
      - Integration with repositories
      - Controlled Language Interface
      A trade-off is required between rich semantic annotations and design complexity. Semantic-based solutions are available for a controlled set of services.
  • 22. Beyond Taverna
      The myGrid team developed tools for identifying services and supporting the reuse of workflows:
      - BioCatalogue: annotated catalogue of Web Services for Life Science
      - MyExperiment: repository of workflows for Life Science, enabled by social networking features
  • 23. EDAM Ontology
      EDAM (EMBRACE Data and Methods) Ontology
      Allows defining all:
      - Data analysis tasks for bioinformatics
      - Data types
      - Possible relations between tasks and data types (I/O)
      - Transformations between equivalent data (format)
      - Transformations between related data (through elaboration, e.g. triplet → AA, gene symbol → sequence)
      Fundamental in order to:
      - Validate data flow and elaborations
      - Support automatic workflow composition
  • 24. EDAM Ontology
      EDAM (EMBRACE Data and Methods)
      - Topic: context of the analysis, i.e. the domain of a study or an experiment
      - Operation: task carried out
      - Data: a data type used in bioinformatics
      - Format: a format used for encoding some data
      http://edamontology.sourceforge.net/
  • 25. Topic
      "A general bioinformatics subject or category, such as a field of study, data, processing, analysis or technology."
      Terms: "Biological data resources", "Nucleic acid analysis", "Protein analysis", "Sequence analysis", "Structure analysis", "Phylogenetics", "Proteomics", "Data handling", "Chemoinformatics", "Transcriptomics", "Literature and reference", "Ontologies, nomenclature and classification", "Immunoinformatics", "Genetics", "Systems biology", "Ecoinformatics", "Genomics"
  • 26. Operation
      "A function or process performed by a tool; what is done, but not (typically) how or in what context."
      Terms: "Alignment", "Analysis and processing", "Annotation", "Classification", "Comparison", "Editing", "Mapping and assembly", "Modelling and simulation", "Optimisation and refinement", "Plotting and rendering", "Prediction, detection and recognition", "Search and retrieval", "Validation and standardisation"
  • 27. Data
      "A type of data in common use in bioinformatics."
      Includes: Core data, Identifier, Parameter, Report
      Terms: "Alignment", "Article", "Biological model", "Classification", "Codon usage table", "Data index", "Data reference", "Experimental measurement", "Gene expression profile", "Image", "Map", "Matrix", "Microarray data", "Molecular interaction", "Molecular property", "Ontology", "Ontology concept", "Pathway or network", "Phylogenetic raw data", "Phylogenetic tree", "Reaction data", "Schema", "Secondary structure", "Sequence", "Sequence motif", "Sequence profile", "Structural (3D) profile", "Structure", "Workflow"
  • 28. Format and Identifier
      Format: "A specific layout for encoding a specific type of data in a computer file or memory."
      Terms: "Binary", "Format (typed)", "HTML", "RDF", "Text", "XML"
      Identifier: "A label that identifies (typically uniquely) something such as data, a resource or a biological entity."
      Terms: "Accession", "Identifier (hybrid)", "Identifier (typed)", "Identifier with metadata", "Name"
  • 29. Performance
      Researchers want the best possible results in the shortest possible time, no matter which database, site, or computer is used!
      - Distributed nature of data sources (network issues, e.g. timeouts and unavailability of sites)
      - Large data volumes (reduced data transfer)
      - Complex data analysis (implying HPC/cloud)
  • 30. Optimization of performance
      - Runtime error detection
      - Task-level failure recovery
      - Evaluation of alternative services
      - Task dependency analysis & flow parallelization
      - Parallelization on cluster or HPC architecture
  • 31. Optimization of performance
      - Runtime error detection
      - Task-level failure recovery
      - Evaluation of alternative services
      - Task dependency analysis & flow parallelization
      - Parallelization on cluster or HPC architecture
      Alternative services: SRS by Web Services (SWS) provides access to public SRS implementations by selecting the most up-to-date, working site for any given database.
  • 32. Reproducibility and reuse
      Reproducibility of analysis in life sciences is fundamental!
      - Dependency on the current contents of databases
      - Dependency on the current status and variability of tools
      NB! Perfect reproducibility in silico is impossible!
      Reuse of intermediate results and procedures
  • 33. Reproducibility & reuse
      Reproducibility and reuse of results:
      - State of databases and tools
      - Prospective provenance data
      - Retrospective provenance data
      - Reuse of intermediate results
      - Caching
  • 34. Reproducibility & reuse
      Reproducibility and reuse of results:
      - State of databases and tools
      - Prospective provenance data
      - Retrospective provenance data
      - Reuse of intermediate results
      - Caching
      Prospective provenance: workflow structural model; dependencies on services, databases, or software libraries; system dependencies.
      Retrospective provenance: observations from run-time events, i.e. data produced and consumed and services accessed.
  • 35. Thanks!
      In collaboration with Paolo MISSIER, School of Computing Sciences, Newcastle University, UK (paolo.missier@ncl.ac.uk)
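
As a rough illustration of the optimization steps named on slides 30 and 31 (task dependency analysis, flow parallelization, task-level failure recovery, evaluation of alternative services) and of the retrospective provenance described on slide 34, a minimal Python sketch might look like the following. It is not taken from the talk: the task names, the two mirror functions, the placeholder result, and the helpers `parallel_stages` and `call_with_fallback` are all hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only as a convenient way to order tasks.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into stages; tasks in the same stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are all available
        stages.append(ready)
        sorter.done(*ready)
    return stages


def call_with_fallback(task, services, provenance):
    """Try alternative services in order and record every attempt."""
    for name, service in services:
        try:
            result = service()
        except Exception as error:  # task-level failure: log it, try the next service
            provenance.append((task, name, f"failed: {error}"))
            continue
        provenance.append((task, name, "ok"))
        return result
    raise RuntimeError(f"no working service for task {task!r}")


# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "fetch_sequence": set(),
    "fetch_template": set(),
    "align": {"fetch_sequence", "fetch_template"},
    "build_model": {"align"},
}


def unavailable_mirror():
    # Stand-in for a remote site that times out (assumption for illustration).
    raise TimeoutError("site unavailable")


def working_mirror():
    # Stand-in for a remote site that answers (assumption for illustration).
    return "PLACEHOLDER_SEQUENCE"


stages = parallel_stages(dependencies)
provenance = []  # retrospective provenance: which services were actually accessed
sequence = call_with_fallback(
    "fetch_sequence",
    [("mirror_a", unavailable_mirror), ("mirror_b", working_mirror)],
    provenance,
)
print(stages)
print(provenance)
```

Tasks within one stage could be dispatched concurrently (for example to a cluster), while the provenance list keeps the kind of run-time record that would let a later run explain which site produced a given result.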