Scientific Workflows: what do we have, what do we miss?

Paolo Romano
IRCCS AOU San Martino – IST,
Genova, Italy
(paolo.dm.romano@gmail.com, skype: p.romano)

Talk outline
• Aims of data integration in Life Sciences
• A methodology for the automation of data retrieval and analysis processes
• Workflow Management Systems
• Issues related to:
  o automatic composition,
  o execution performance,
  o workflow reuse
22 June 2013

Biomedical databases
Accessible online through human-centered interfaces
Do not share interfaces, data contents, structure, or encoding
Do not interoperate
Oblige researchers to “cut & paste” data
May be huge in size

Some figures
European Nucleotide Archive:
195,241,608 sequences, 292,078,866,691 bases
UniProtKB:
12,347,303 sequences, 3,974,018,240 AAs
PRIDE: 111,219,191 spectra
IntAct: 229,082 interactions
ArrayExpress:
~16,000 experiments, ~450,000 hybridizations
DB size
Next-Generation Sequencing: 16Gb / experiment!

An international collaboration aimed at building a
detailed map of human genome variability.
Pilot phase: identification of 95% of variations
present in at least 1% of population for three
ethnic groups (Oct 28, 2010).
Data: ~4.9 Tbases (~3 Gbases/individual)
Found: 15M mutations, 1M deletions/insertions,
20K major variants
The 1000 Genomes Consortium. A map of human genome variation from population
scale sequencing. Published online in Nature on 28 October 2010.
DOI:10.1038/nature09534 http://www.1000genomes.org/
1000 Genomes Project
Impossible without
bioinformatics
Unmanageable without
automation of processes

Data integration: aims
• Data integration and automation of retrieval and analysis processes are needed for:
o Achieving a precise and comprehensive view of available information
o Carrying out queries and analyses involving many databases and software tools automatically
o Analysing huge quantities of data efficiently
o Implementing effective data mining

“A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition)
Aim:
• Implementing data analysis processes in standardized environments
Main advantages:
• efficiency: being automatic procedures, workflows free researchers from repetitive tasks and support “good practices”,
• reproducibility: analyses may be replicated over time, easily and effectively,
• reuse: both intermediate results and workflows may be reused,
• traceability: the workflow is enacted in an environment that allows results to be traced back.
What is a Workflow

An experiment
Prediction of the structure of a protein by homology

Researchers carrying out the analysis need to know:
• Which tools and databases are needed, where they reside, and how to use them
• In which order they must be used
• How to transfer data between them
• How to reconcile the semantics of data used by services
Manual

In an automated procedure, software must:
• Know which tool/database can carry out a given task (e.g. aligning sequences, retrieving protein structure data)
• Find real implementations (e.g. BLAST, provided by NCBI)
• Link services into a workflow that achieves the desired task
• Transfer data appropriately between services
Automatic
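The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the registry contents, the task names, and the stand-in functions are hypothetical, and nothing here calls NCBI or any other real service.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for remote services; they only build strings.
def fake_blast(sequence: str) -> str:
    """Stand-in for a sequence-alignment service; returns a made-up hit ID."""
    return f"HIT_FOR_{sequence[:4]}"

def fake_structure_lookup(hit_id: str) -> str:
    """Stand-in for a structure-retrieval service."""
    return f"STRUCTURE_OF_{hit_id}"

# Hypothetical registry: abstract task name -> concrete implementations.
REGISTRY: Dict[str, List[Callable[[str], str]]] = {
    "align_sequence": [fake_blast],
    "retrieve_structure": [fake_structure_lookup],
}

def find_implementation(task: str) -> Callable[[str], str]:
    """Return the first registered implementation of an abstract task."""
    implementations = REGISTRY.get(task)
    if not implementations:
        raise LookupError(f"no service registered for task {task!r}")
    return implementations[0]

def run_workflow(tasks: List[str], data: str) -> str:
    """Link services in order, passing each output to the next input."""
    for task in tasks:
        data = find_implementation(task)(data)
    return data

result = run_workflow(["align_sequence", "retrieve_structure"], "MKTAYIAK")
print(result)
```

The workflow is expressed in terms of abstract tasks, so a different implementation can be registered for a task without touching the workflow itself.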

Workflow for CABRI Network Services

o Define XML languages with controlled vocabularies
o Archive data in XML formats
o Use Web Services for data exchange between services
o Associate data and analyses with the appropriate terms of an ontology of bioinformatics data, data types, and tasks
o Encode processes as workflows
Methodology: components

Both industrial and academic WfMS are available, and their use in the Life Sciences is now widespread.
• Biopipe, an add-on for BioPerl
• GPipe, an extension of Pise
• Taverna (EBI), a component of the myGrid platform
• Pegasys (University of British Columbia)
• EGene (Universidade de São Paulo)
• Wildfire (Bioinformatics Institute, Singapore)
• Pipeline Pilot (SciTegic)
• BioWBI, Bioinformatic Workflow Builder Interface (IBM)
Workflow Management Systems

Software          | Type                           | Standard                 | License     | URL
Taverna Workbench | Stand-alone                    | XScufl                   | Open source | http://taverna.sourceforge.net/
Biopipe           | Software library               | Pipeline XML             | Open source | http://www.gmod.org/biopipe/
ProGenGrid        | Stand-alone                    | NA                       | NA          | http://datadog.unile.it/progen
DiscoveryNet      | Stand-alone                    | DPML                     | Commercial  | http://www.discovery-on-the.net/
Kepler            | Stand-alone                    | MoML                     | Open source | http://kepler-project.org/
GPipe             | Web interface, local services  | GPipe XML                | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
EGene             | Stand-alone                    | NA                       | Open source | http://www.lbm.fmvz.usp.br/egene/
BioWMS            | Web interface, remote services | XPDL                     | Public use  | http://litbio.unicam.it:8080/biowms/
BioWEP            | Portal                         | XScufl, XPDL             | Open source | http://bioinformatics.istge.it/biowep/
BioWBI            | Web interface, local services  | Proprietary              | Commercial  | http://www.alphaworks.ibm.com/tech/biowbi
Pegasys           | Stand-alone                    | Pegasys DAG              | Open source | http://bioinformatics.ubc.ca/pegasys/
Wildfire          | Stand-alone                    | GEL                      | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
Triana            | Stand-alone                    | Triana Workflow Language | Open source | http://www.trianacode.org/
Pipeline Pilot    | Stand-alone                    | Proprietary              | Commercial  | http://www.scitegic.com/
FreeFluo          | Software library               | WSFL and XScufl          | Open source | http://freefluo.sourceforge.net/
Biomake           | Software library               | NA                       | Open source | http://skam.sourceforge.net/
Workflow Management Systems
Various software types and different standards

Taverna Workbench is the best known and most widely adopted WfMS in the life sciences
• Developed in the context of the myGrid platform
• Main developers: Univ. Manchester and EBI
• Open source at SourceForge.net
It allows users to:
• Build and execute workflows for complex analyses
• … by accessing remote and local services
• … displaying results in various formats
• … describing data through an ad hoc ontology
Requirements: Java plus Windows / Mac / Linux
Open source: http://taverna.sourceforge.net/ Current version: 2.4
Taverna Workbench

WfMS are increasingly used for data integration and analysis in biomedical research.
Here, we highlight some of the current issues.
Issues:
• Automatic composition of workflows
• Performance
• Reproducibility and reuse
WfMS: some current issues

Researchers only care about scientific results!
• Building workflows may be a burden
• Various skills are required, and GUIs do not solve the problem
• Workflow composition should be much simpler, and become semi-automatic
Automatic composition

Automatic composition (diagram of components):
o Automatic selection of best services
o Automated service identification and composition
o Adapters for different data formats
o Automatic conversion of formats
o Ontology of methods, tools and data types
o Integration with repositories
o Controlled Language Interface
A trade-off is required between rich semantic annotations and design complexity.
Semantic-based solutions are available for controlled sets of services.
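As a rough illustration of annotation-driven composition over a controlled set of services, the sketch below searches for a chain of services leading from one data type to another. The service names and their input/output annotations are hypothetical, and breadth-first search is just one simple way to do the matching; it is not presented as the method used by any specific tool.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical annotated services: name -> (input data type, output data type).
SERVICES: Dict[str, Tuple[str, str]] = {
    "gene_symbol_to_sequence": ("gene symbol", "nucleotide sequence"),
    "translate": ("nucleotide sequence", "protein sequence"),
    "find_homologs": ("protein sequence", "alignment"),
    "fetch_structure": ("alignment", "structure"),
}

def compose(source_type: str, target_type: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of services whose
    annotations lead from source_type to target_type."""
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        current, path = queue.popleft()
        if current == target_type:
            return path
        for name, (in_type, out_type) in SERVICES.items():
            if in_type == current and out_type not in seen:
                seen.add(out_type)
                queue.append((out_type, path + [name]))
    return None  # no chain of annotated services connects the two types

chain = compose("gene symbol", "structure")
print(chain)
```

The search only works because every service carries an input and an output annotation, which is the cost side of the trade-off: the richer the annotations, the more a composer can do, and the more work it takes to describe each service.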

Beyond Taverna
The myGrid team developed tools for identifying services and supporting the reuse of workflows
BioCatalogue
Annotated catalogue of Web Services for the Life Sciences
myExperiment
Repository of workflows for the Life Sciences, enabled by social networking features

Allows the definition of all:
• Data analysis tasks for bioinformatics
• Data types
• Possible relations between tasks and data types (I/O)
• Transformations between equivalent data (format)
• Transformations between related data (through elaboration, e.g.: triplet → AA, gene symbol → sequence)
Fundamental in order to:
• Validate data flow and elaborations
• Support automatic workflow composition
EDAM (EMBRACE Data and Methods) Ontology
EDAM Ontology
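To show what validating a data flow could look like once every input and output is annotated with a data type and a format, here is a minimal sketch. The `Port` and `Step` classes, the converter table, and the labels are assumptions made for illustration; they are plain strings, not actual EDAM identifiers or any EDAM API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass(frozen=True)
class Port:
    data: str    # data type label, e.g. "Sequence" (illustrative, not an EDAM ID)
    format: str  # encoding label, e.g. "FASTA"

@dataclass(frozen=True)
class Step:
    name: str
    consumes: Port
    produces: Port

# Hypothetical set of available format converters: (from_format, to_format).
CONVERTERS: Set[Tuple[str, str]] = {("GenBank", "FASTA")}

def validate(steps: List[Step]) -> List[str]:
    """Return the problems found between consecutive steps (empty if none)."""
    problems = []
    for producer, consumer in zip(steps, steps[1:]):
        out, inp = producer.produces, consumer.consumes
        link = f"{producer.name} -> {consumer.name}"
        if out.data != inp.data:
            problems.append(f"{link}: data type mismatch ({out.data} vs {inp.data})")
        elif out.format != inp.format and (out.format, inp.format) not in CONVERTERS:
            problems.append(f"{link}: no converter from {out.format} to {inp.format}")
    return problems

fetch = Step("fetch", Port("Identifier", "Text"), Port("Sequence", "GenBank"))
align = Step("align", Port("Sequence", "FASTA"), Port("Alignment", "Text"))
draw = Step("draw_tree", Port("Phylogenetic tree", "Text"), Port("Image", "Binary"))

ok_report = validate([fetch, align])         # formats differ, but a converter exists
bad_report = validate([fetch, align, draw])  # an alignment is not a tree
print(ok_report, bad_report)
```

The two checks mirror the two kinds of transformation listed above: a format difference between equivalent data can be bridged by a converter, while a data type difference requires an additional elaboration step.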

EDAM (EMBRACE Data and Methods)
Topic: context of the analysis: domain of a study or an experiment
Operation: task carried out
Data: a data type used in bioinformatics
Format: a format used for encoding some data
http://edamontology.sourceforge.net/
EDAM Ontology

Topic
"A general bioinformatics subject or category, such as a field of study, data, processing, analysis or technology."
"Biological data resources" "Nucleic acid analysis"
"Protein analysis" "Sequence analysis"
"Structure analysis" "Phylogenetics"
"Proteomics" "Data handling"
"Chemoinformatics" "Transcriptomics"
"Literature and reference" "Ontologies, nomenclature and classification"
"Immunoinformatics"
"Genetics" "Systems biology"
"Ecoinformatics" "Genomics"

Operation
"A function or process performed by a tool; what is done, but not (typically) how or in what context."
"Alignment" "Analysis and processing"
"Annotation" "Classification"
"Comparison" "Editing"
"Mapping and assembly" "Modelling and simulation"
"Optimisation and refinement" "Plotting and rendering"
"Prediction, detection and recognition"
"Search and retrieval" "Validation and standardisation"

Data
"A type of data in common use in bioinformatics."
Includes: Core data, Identifier, Parameter, Report
"Alignment" "Article" "Biological model"
"Classification" "Codon usage table" "Data index"
"Data reference" "Experimental measurement"
"Gene expression profile" "Image" "Map" "Matrix"
"Microarray data" "Molecular interaction" "Molecular property"
"Ontology" "Ontology concept" "Pathway or network"
"Phylogenetic raw data" "Phylogenetic tree"
"Reaction data" "Schema" "Secondary structure"
"Sequence" "Sequence motif" "Sequence profile"
"Structural (3D) profile" "Structure" "Workflow"

Format and Identifier
Format
"A specific layout for encoding a specific type of data in a computer file or memory."
"Binary" "Format (typed)" "HTML" "RDF"
"Text" "XML"
Identifier
"A label that identifies (typically uniquely) something such as data, a resource or a biological entity."
"Accession" "Identifier (hybrid)" "Identifier (typed)"
"Identifier with metadata" "Name"

Researchers want the best possible results in the shortest possible time!
No matter which database, site, or computer is used
Distributed nature of data sources (network issues, e.g. timeouts and unavailability of sites)
Large data volumes (reduced data transfer)
Complex data analysis (implying HPC/cloud)
Performance

Optimization of performance (diagram of components):
o Runtime error detection
o Task-level failure recovery
o Evaluation of alternative services
o Task dependency analysis & flow parallelization
o Parallelization on cluster or HPC architecture
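Task dependency analysis and flow parallelization can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). The workflow graph and the `run_task` stand-in below are hypothetical; a real engine would invoke services and might dispatch to a cluster rather than to local threads.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical workflow: task -> set of tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "fetch_sequence": set(),
    "fetch_annotations": set(),
    "align": {"fetch_sequence"},
    "report": {"align", "fetch_annotations"},
}

def run_task(name: str) -> str:
    """Stand-in for invoking a service; a real task would do actual work."""
    return f"{name}:done"

def run_parallel(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Run tasks level by level; tasks within a level do not depend on
    each other, so they can be executed concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    levels = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            list(pool.map(run_task, ready))  # wait for the whole level
            sorter.done(*ready)
            levels.append(ready)
    return levels

levels = run_parallel(DEPENDENCIES)
print(levels)
```

Waiting for a whole level before starting the next one keeps the sketch short; a finer-grained scheduler would mark each task as done as soon as it finishes.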
Alternative services
SRS by Web Services (SWS) provides
access to public SRS implementations by
selecting the most up-to-date, working site
for any given database
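The idea of falling back to an alternative site can be sketched as follows. This is not the SWS interface: the two mirror functions, the exception class, and the query value are hypothetical stand-ins that simulate one unavailable site and one working site.

```python
from typing import Callable, Sequence

class ServiceUnavailable(Exception):
    """Raised by a stand-in service to simulate a timeout or an outage."""

def mirror_down(query: str) -> str:
    """Hypothetical site that is currently unreachable."""
    raise ServiceUnavailable("site A timed out")

def mirror_up(query: str) -> str:
    """Hypothetical site that answers normally."""
    return f"record for {query} from site B"

def call_with_fallback(services: Sequence[Callable[[str], str]], query: str) -> str:
    """Try each alternative service in order and return the first result."""
    errors = []
    for service in services:
        try:
            return service(query)
        except ServiceUnavailable as exc:
            errors.append(f"{service.__name__}: {exc}")
    raise ServiceUnavailable("all alternatives failed: " + "; ".join(errors))

answer = call_with_fallback([mirror_down, mirror_up], "EXAMPLE_ID")
print(answer)
```

Ordering the list by how up to date each site is would approximate the selection behaviour described above, provided that information is available.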

Reproducibility of analyses in the life sciences is fundamental!
• Dependency on the current contents of databases
• Dependency on the current status and variability of tools
NB! Perfect reproducibility in silico is impossible!
Reuse of intermediate results and procedures
Reproducibility and reuse

Reproducibility & reuse (diagram of components):
o Reproducibility and reuse of results
o State of databases and tools
o Prospective provenance data
o Retrospective provenance data
o Reuse of intermediate results
o Caching
Prospective provenance
Workflow structural model; dependencies on services, databases, or software libraries; system dependencies
Retrospective provenance
Observations from run-time events: data produced and consumed, and services accessed
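The distinction between the two kinds of provenance can be made concrete with a small sketch: the prospective record is written before the run, and the retrospective record is collected during it. All class names, step names, and version strings below are hypothetical, and the structure is not taken from any provenance standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

@dataclass
class ProspectiveProvenance:
    """What the workflow is meant to do: its structure and dependencies."""
    steps: List[str]
    dependencies: Dict[str, str]  # service or database name -> version

@dataclass
class RunEvent:
    """One run-time observation: which step ran, on what, producing what."""
    step: str
    consumed: str
    produced: str
    timestamp: str

@dataclass
class RetrospectiveProvenance:
    """What actually happened during one enactment."""
    events: List[RunEvent] = field(default_factory=list)

def enact(plan: ProspectiveProvenance,
          services: Dict[str, Callable[[str], str]],
          data: str) -> Tuple[str, RetrospectiveProvenance]:
    """Run the planned steps in order and record every step as an event."""
    trace = RetrospectiveProvenance()
    for step in plan.steps:
        result = services[step](data)
        now = datetime.now(timezone.utc).isoformat()
        trace.events.append(RunEvent(step, data, result, now))
        data = result
    return data, trace

plan = ProspectiveProvenance(
    steps=["clean", "annotate"],
    dependencies={"annotation_db": "release 2013_06"},  # hypothetical
)
services = {"clean": str.strip, "annotate": lambda s: s + " [annotated]"}
output, trace = enact(plan, services, "  MKTAYIAK  ")
print(output, [event.step for event in trace.events])
```

Storing both records together makes it possible to check later whether a result was produced by the workflow as designed and against which versions of its dependencies.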

In collaboration with
Paolo MISSIER
School of Computing Sciences, Newcastle University, UK
paolo.missier@ncl.ac.uk
Thanks!
