Invited talk at the GeoClouds Workshop, Indianapolis, 2009
1. Scientific Workflow Management System
Taverna, Biocatalogue, and myExperiment: a three-legged foundation for effective collaboration in E-science
A collaborative talk by Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material kindly shared by:
Prof. Dave DeRoure and David Newman, University of Southampton
Prof. Carole Goble and the e-Labs design group, University of Manchester
GeoClouds workshop, Indianapolis, IN, Sept. 17, 2009 - P. Missier
Sunday, 13 March 2011
2. What is the myGrid Project?
• UK e-Science pilot project since 2001.
• Centred at Manchester, Southampton and the EMBL-EBI
• Part of Open Middleware Infrastructure Institute UK, http://www.omii.ac.uk
• Mixture of developers, bioinformaticians and researchers
• An alliance of contributing projects and partners
• Open source development and content: LGPL or BSD
• Infrastructure: we don’t own any resources (apart from catalogues), or a Grid.
ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier
3. Taverna
• Graphical workbench for professionals
• Plug-in architecture
• Nested workflows
• Drag and drop, wiring services together
• Rapidly incorporate new services without coding
• Not restricted to predetermined services
• Access to local and remote resources and analysis tools
• 3500+ service operations available at start-up
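The wiring and nesting ideas on this slide can be illustrated with a minimal sketch. This is not Taverna's API (Taverna is a graphical workbench); the `Workflow` class, its `add`, `run` and `as_step` methods, and the step names are all hypothetical. The sketch assumes steps are added in an order where every input is already available.

```python
class Workflow:
    """Toy dataflow: each step reads named values and writes one named value."""

    def __init__(self):
        # Each entry is (output name, callable, tuple of input names).
        self.steps = []

    def add(self, output, func, *inputs):
        """Wire a step: func consumes the named inputs and produces `output`."""
        self.steps.append((output, func, inputs))
        return self

    def run(self, **values):
        """Run steps in insertion order (assumed to respect dependencies)."""
        for output, func, inputs in self.steps:
            values[output] = func(*(values[name] for name in inputs))
        return values

    def as_step(self, result, *inputs):
        """Wrap this workflow as one callable so it can be nested in another."""
        def step(*args):
            return self.run(**dict(zip(inputs, args)))[result]
        return step


# Inner workflow: normalise a sequence, then measure its length.
inner = Workflow().add("upper", str.upper, "seq").add("length", len, "upper")

# Outer workflow: clean the raw input, then call the inner workflow as a step.
outer = (
    Workflow()
    .add("clean", str.strip, "raw")
    .add("n", inner.as_step("length", "seq"), "clean")
)

result = outer.run(raw="  acgt  ")
print(result["clean"], result["n"])
```

Any callable can be wired in as a step, which loosely mirrors the point about not being restricted to predetermined services.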
4. What do Scientists use Taverna for?
Application areas:
• Systems biology model building
• Proteomics
• Sequence analysis
• Protein structure prediction
• Gene/protein annotation
• Microarray data analysis
• QTL studies
• QSAR studies
• Medical image analysis
• Public health care epidemiology
• Heart model simulations
• High throughput screening
• Phenotypical studies
• Phylogeny
• Statistical analysis
• Text mining
• Astronomy, Music, Meteorology

Users and projects:
• Netherlands Bioinformatics Centre
• Genome Canada Bioinformatics Platform
• BioMOBY
• US FLOSS social science program
• RENCI
• SysMO Consortium
• French SIGENAE farm animals project
• ThaiGrid
• CARMEN Neuroscience project
• SPINE consortium
• EU Enfin, EMBRACE, BioSapiens, Casimir
• EU SysMO Consortium
• NERC Centre for Ecology and Hydrology
• Bergen Centre for Computational Biology
• Max-Planck Institute for Plant Breeding Research
• Genoa Cancer Research Centre
• AstroGrid
• 30 USA academic and research institutions
5. Who else is in this space?
• Trident
• Triana
• Kepler
• Ptolemy II
• Taverna
• BioExtract
• BPEL
6. www.myexperiment.org
Socially share, discover and reuse workflows and other methods. Cooperative bazaar.
• Sunday 10th May: 1748 registered users, 143 groups, 669 workflows, 197 files, 52 packs
• 56 different countries. Top 4: UK, US, The Netherlands, Germany
9. Why data provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
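As an illustration of tracking attribution through chained transformations, here is a minimal sketch. It is not Taverna's provenance model or any W3C format; the `ProvenanceGraph` class, its methods, and the file and process names are hypothetical. It records which process derived each item from which inputs, then walks back to the original sources.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    # Maps each derived item to (process name, list of input items).
    derivations: dict = field(default_factory=dict)

    def record(self, output, process, inputs):
        """Note that `process` produced `output` from `inputs`."""
        self.derivations[output] = (process, list(inputs))

    def lineage(self, item):
        """Return the set of original (non-derived) items that `item` depends on."""
        if item not in self.derivations:
            return {item}
        _, inputs = self.derivations[item]
        sources = set()
        for parent in inputs:
            sources |= self.lineage(parent)
        return sources


graph = ProvenanceGraph()
graph.record("aligned.fasta", "align", ["seqs_a.fasta", "seqs_b.fasta"])
graph.record("tree.nwk", "build_tree", ["aligned.fasta"])
graph.record("report.txt", "summarise", ["tree.nwk", "params.json"])

# Which raw inputs support the final report?
print(sorted(graph.lineage("report.txt")))
```

The same backward walk is what lets a reader check the evidence behind a result or see which outputs are affected when one input turns out to be wrong.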
Linköping, Sweden -- January 2010
10. Goals, expected contributions
• Established technology provider - open-source
– traditionally active in the bioinf space
– but also involved in the e-Lico EU project (data mining portal)
– large community base, established production environment
• Main goal:
– to offer our workflow and workflow repository technology, put it to the test on the challenges of data preservation pipelines
• Challenges:
– expect new requirements on our current technology
• robust, high-volume data pipelines
• workflow provenance -- process evolution
• data provenance