Scientific Workflows:
Experience, Advances, and
Where We Go From Here
Terence Critchlow
January 2012
PNNL-SA-85033
This talk will provide answers to 4 questions:
Why did I get involved with scientific workflows?
How do scientific workflows help scientists?
What problems did we find when we first started working
with scientific workflows?
Can scientific workflows be effectively integrated into the
broader scientific process?
I became involved with scientific workflows
through the SciDAC SDM Center
The Scientific Discovery through
Advanced Computing (SciDAC)
program was funded by DOE
starting in 2001 with the goal of
advancing scientific computing by having CS and
domain science teams work together to address
science questions using new HPC platforms
Application initiatives were funded in areas such as
combustion, fusion, astrophysics, and groundwater
CS and math centers were funded in areas critical
to the development of new, scalable capabilities
including solvers, AMR, visualization, performance,
and data management
Focus was on science, not CS research
The Scientific Data Management (SDM)
Center was the focal point for DOE data
management activities
Large, multi-institutional collaborations
Led by Arie Shoshani (LBL)
5 Labs and 5 Universities
Funded for 10 years
Project concluded in 2011
The center had 3 research thrusts:
Storage and efficient access (Rob Ross – ANL)
Data Mining and Analysis (Nagiza Samatova - NCSU)
Software Process Automation (Terence Critchlow – PNNL)
The goal of the SPA team was to develop and deploy
technology that would allow scientists to spend more time
on science by reducing the data management overhead
Workflows had filled that niche in business but, in
2001, saw little use in science applications
As lead for the SPA team, I had both
management and research responsibilities
Team of 10-15 spread across NCSU, Univ. of
Utah, UC Davis, SDSC, ORNL, and PNNL
Identify relevant technology
Work with science teams to design and
deploy solutions
Identify areas requiring additional research
Perform research to improve the existing
capabilities for our target customers
Workflow technology was selected because
time-consuming, repetitive tasks dominate
day-to-day computational science activity
By automating mundane tasks, we
allow scientists to focus on science
not data management
Needed a general-purpose
workflow engine that we could
apply to an HPC-centric
environment
Act as the orchestrator, coordinating
the workflow execution
Allow processing of larger data sets
Support scientific reproducibility
Reduce waste of resources by allowing
timely corrective action to be taken
The SDM Center was one of the founding
organizations of the Kepler Consortium
In 2001 there were no widely used scientific workflow
engines
Kepler is an open source workflow environment
Based on the Ptolemy II system developed at UC Berkeley
Started with several projects coming together based on
a need for a flexible
workflow environment
kepler-project.org
Kepler has become
one of the best-known
and most widely used
scientific workflow
engines
This talk focuses on work that I was directly
involved in
The SDM Center team that I managed performed
a lot of work that I do not focus on here:
Provenance tracking
Dashboard
Templates
Patterns
Deployed workflows
ITER
CPES
Combustion
https://sdm.lbl.gov/sdmcenter/
My research focused on raising the level of
abstraction within scientific workflows
Our first deployed workflow managed a
bioinformatics analysis pipeline (2002)
In collaboration with
Matt Coleman (LLNL)
The TSI workflow was the first of our
“standard” simulation workflows (2005)
[Workflow diagram: submit batch request at NERSC; check job status (Delay while Queued, Update web page if Running, proceed when Running or Done); identify new complete files; transfer files to HPSS and verify the transfer completed correctly; transfer files to SB and verify the transfer completed correctly; delete file; Extract / Get Variables; remap coordinates; create Chem vars and neutrino vars; derive other vars; write diagnostic file; generate plots (Tool-1 through Tool-4); generate thumbnails; generate movie]
In collaboration with
Doug Swesty (Stony Brook)
The workflow can be broken into four
general steps:
Job Submission
Job Monitoring
Moving files
Data Analysis
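The job-monitoring portion of these workflows can be sketched as a simple polling loop. This is a minimal Python sketch, not the Kepler implementation; `check_status` and `update_web_page` are hypothetical stand-ins for the scheduler-query and dashboard-update actors.

```python
import time

def monitor_job(check_status, update_web_page, poll_interval=60, sleep=time.sleep):
    """Poll a batch job until it finishes, mirroring the workflow's
    delay / queued / running-or-done / update-web-page loop."""
    while True:
        status = check_status()          # e.g. query the batch scheduler
        if status == "queued":
            sleep(poll_interval)         # Delay: wait and re-check
        elif status == "running":
            update_web_page(status)      # IfRunning: publish progress
            sleep(poll_interval)
        else:                            # "done" (or failed): leave the loop
            return status
```

Injecting `sleep` makes the loop testable without real delays, a convenience of the sketch rather than a property of the deployed workflows.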
This translates into a complicated Kepler
workflow
Extensive use of nested
workflows to
compartmentalize steps
160 instances of 18
distinct actors
Over a dozen
parameters to control
workflow execution
We ended up building several similar
simulation science workflows
Fusion science
Combustion
Subsurface science
These all have the same
general steps
But there are significant
differences in the details
Unfortunately, workflows are not typically
portable across machines
User authentication
mechanisms depend on
machine-specific policies
Job launch and
monitoring features
depend on scheduler
File transfer
mechanisms depend on
available infrastructure
We developed generic actors as the first step
in raising the level of abstraction for
workflow design
Generic actors embody
general functionality into
actors that work across
platforms / workflows
Improve workflow
portability
Simplify creation of new
workflows
Form the basis for sharing
subworkflows
Reduce the number of
actor choices
We identified several capabilities required
across simulation workflows
User authentication
Job submission
Submit job scheduling
request to batch
scheduler
Job monitoring
Track status of job from
submitted, to running, to
completed
File transfer
Move files, potentially
between machines at
different sites
Developed and deployed
actors capable of
performing the desired
functionality using
available infrastructure
Generalized to manage
multiple implementations
Parameters and
contextual information
determine which options
to utilize
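To illustrate the idea (this is not the actual Kepler actor code, and the backend names are hypothetical), a generic job-submission actor can expose one interface and select a machine-specific implementation from contextual parameters:

```python
# Hypothetical sketch of a "generic actor": one interface, multiple
# machine-specific implementations selected from workflow context.
def submit_pbs(script):
    return f"qsub {script}"       # PBS/Torque-style submission command

def submit_slurm(script):
    return f"sbatch {script}"     # SLURM-style submission command

BACKENDS = {"pbs": submit_pbs, "slurm": submit_slurm}

def generic_submit(script, context):
    """Pick the submission mechanism from context (e.g. a per-machine
    configuration) rather than hard-coding it into the workflow."""
    scheduler = context["scheduler"]
    return BACKENDS[scheduler](script)
```

The same workflow can then run unchanged on machines with different schedulers by supplying a different context.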
Use of generic actors improved workflow
effectiveness
Same workflow could be
used on all of the DOE
leadership class
machines
Significantly less
maintenance required
Fewer workflows needed
per science team
Each workflow is simpler
Still requires parameters
to manage details of
execution
Workflow context can be used to reduce
number of explicit parameters
Workflows run in a context
that provides certain
preferences
Systems
User accounts
Configuration files
Information requested /
computed / bound at run time
instead of design time
Initial results are promising
but more work is required to
determine how effective
run-time binding is for workflows
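One way to realize run-time binding (a sketch under assumed conventions, not the Center's implementation) is to resolve each parameter by consulting layered context sources in priority order: explicit parameters, then user configuration, then system defaults.

```python
def bind_parameter(name, explicit, user_config, system_defaults):
    """Resolve a workflow parameter at run time instead of design time:
    explicit parameters win, then user config, then system defaults."""
    for layer in (explicit, user_config, system_defaults):
        if name in layer:
            return layer[name]
    raise KeyError(f"unbound workflow parameter: {name}")
```

A parameter the user never sets explicitly is still bound correctly at execution time, which is what lets the visible parameter list shrink.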
Scientific workflows still have major
adoption challenges to overcome
The correlation between
the scientific process
and the executable
workflow is loose at best
Executable workflows
are extremely complex
and usually require a
dedicated workflow
designer to create
The translation from idea to napkin drawing to
executable workflow is challenging and lossy
The scientific process is collaborative, fluid,
and time sensitive
Important decisions are
made in meetings and
conversations.
Records are distributed
and not easily
associated with specific
tasks
Decisions can be
revisited and changed
Science is inherently
iterative
Executable workflows
document the results of
these decisions
Lack broader context
Electronic lab notebooks
provide some contextual
information
Lack details and external
information / links
Provenance provides
some associations
Need a way to allow scientists to collect and
share information about their experiments
A single location capable
of collecting all relevant
information about an
experiment
Information needs to be
related in a meaningful
way
Temporal information
must be preserved
Working with
collaborators at UTEP,
we developed a
prototype of what this
could look like
Our prototype is built on annotating abstract
workflows
Design principles
Workflow construction
needs to be a byproduct
of information collection
Information should not
need to be entered more
than once
Annotations should relate
to specific steps in the
process
The research hierarchy contains the steps in
the abstract workflow
Steps are conceptual
At the top level, these
outline the major steps in
the experiment being
performed
Get data
Create conceptual model
Generate model input
Run simulation
Each step can have sub-steps
within it to refine the
concept further
(nested structure)
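The nested research hierarchy can be modeled with a simple recursive data structure; this is an illustrative sketch, not the prototype's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A conceptual step in the research hierarchy; sub-steps refine it."""
    name: str
    description: str = ""          # free-form text describing the step's purpose
    substeps: list = field(default_factory=list)

    def add(self, substep):
        self.substeps.append(substep)
        return substep

# Top-level steps outline the experiment being performed.
experiment = Step("Experiment")
for name in ["Get data", "Create conceptual model",
             "Generate model input", "Run simulation"]:
    experiment.add(Step(name))
```

Each `Step` can itself be refined, e.g. `experiment.substeps[0].add(Step("Download field measurements"))`, giving the nested structure described above.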
Free-form text is associated with each step
in the hierarchy
Allows scientists to easily
describe a step's purpose
Top level describes entire
experiment
Decisions are captured
under research specs tab
The process view shows the steps as a
workflow
Ports are used to identify
inputs and outputs
Lines between steps
indicate information flow
between steps
Steps should, eventually,
connect
Steps are connected by linking input and
output parameters (ports)
Inputs and outputs are
linked
Comment field holds
assumptions and
constraints from the
“other side” of the line
Free-form text makes it
easy to enter
information, but makes
automatic verification
impossible
Zooming in on a specific sub-step provides
additional information about that step
A new tab provides
(sub-)step-specific
information
The process view is
updated to reflect the
sub-steps contained
within this step
Note that the inputs
and outputs to the
workflow come from
the higher-level
workflow
Eventually, some steps correspond to
executable (Kepler) workflows
Prototype expands
Kepler infrastructure
Executable workflows
are (still) typically created
by a dedicated workflow
designer
This places the
executable workflow in
the broader context of
the experiment it is
supporting
Provenance can be
linked into overall
experiment
Annotations are stored in RDF to support
export / import
Semantically Interlinked
Online Communities
(SIOC) format chosen
Supports other tools
using these annotations
Report generation
Search / query
Experiment level
provenance information
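The export path can be sketched with nothing but the standard library (the real prototype used RDF tooling; the SIOC and Dublin Core terms and the example URIs here are illustrative): each step annotation becomes a set of triples that serialize to N-Triples for other tools to consume.

```python
SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"

def annotation_triples(step_uri, text, created):
    """Represent one step annotation as (subject, predicate, object) triples."""
    return [
        (step_uri, SIOC + "content", text),       # the free-form annotation
        (step_uri, DCTERMS + "created", created), # preserve temporal information
    ]

def to_ntriples(triples):
    """Serialize literal-valued triples in N-Triples syntax for export."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

triples = annotation_triples("http://example.org/exp1/step/get-data",
                             "Download well logs from field site",
                             "2012-01-15")
```

Because the output is standard RDF, report generators and query tools can consume the annotations without knowing anything about the prototype itself.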
This prototype represents a starting point for
answering many interesting questions
How do you effectively
link other sources of
information into steps in
an abstract workflow?
How do you select only
the relevant information?
How do you manage
provenance and
attribution in a distributed
environment?
What is the best way to
organize this information
for people filling a variety
of roles?
PIs need a different view
than workflow designers
or bench scientists
How do you effectively
share (subsets of) this
information?
How do you implement
access controls
effectively?
This prototype represents a starting point for
answering many interesting questions
Are workflows the right
abstraction for
representing the
scientific process?
Representing evolution
over time is challenging in
workflows
Does everything have to
correspond to a step?
Is there a way to
generate parts of an
executable workflow
given an abstract
definition?
Can we match steps to
specific actors?
Could you develop a
generic set of wizards or
templates?
Conclusions
The SDM Center has been at the
forefront of scientific workflow R&D
Workflows have been successfully
deployed across a wide variety of
scientific domains
Significant advances have been
made in making workflow engines
more reliable and useful
There remains significant work
required to
Fit workflows within the context of the
overall scientific process
Allow scientists to design and
implement their own workflows
This work involved many, many people
My team
George Chin
Chandrika Sivaramakrishnan
Xiaowen Xin (LLNL)
Anand Kulkarni
Anne Ngu (TX State)
Paulo Pinheiro da Silva
(UTEP)
Aida Gandara (UTEP)
Other SPA team members
Ilkay Altintas
Bertram Ludaescher
Mladen Vouk
Claudio Silva
Scott Klasky
Norbert Podhorszki
Dan Crawl
Ayla Khan
Arie Shoshani
Plus other students and researchers who were
involved for shorter times
Mind map of terminologies used in context of Generative AI
 

Scientific workflow-overview-2012-01-rev-2

  • 1. Scientific Workflows: Experience, Advances, and Where We Go From Here
    Terence Critchlow
    January 2012
    PNNL-SA-85033
  • 2. This talk will provide answers to 4 questions:
    Why did I get involved with scientific workflows?
    How do scientific workflows help scientists?
    What problems did I find when I first started working with scientific workflows?
    Can scientific workflows be effectively integrated into the broader scientific process?
  • 3. I became involved with scientific workflows through the SciDAC SDM Center
    The Scientific Discovery through Advanced Computing (SciDAC) program was funded by DOE starting in 2001 with the goal of advancing scientific computing by having CS and domain science teams work together to address science questions using new HPC platforms
    Application initiatives were funded in areas such as combustion, fusion, astrophysics, and groundwater
    CS and math centers were funded in areas critical to the development of new, scalable capabilities, including solvers, AMR, visualization, performance, and data management
    Focus was on science, not CS research
  • 4. The Scientific Data Management (SDM) Center was the focal point for DOE data management activities
    Large, multi-institutional collaborations
    Led by Arie Shoshani (LBL)
    5 Labs and 5 Universities
    Funded for 10 years; project concluded in 2011
    The center had 3 research thrusts:
    Storage and efficient access (Rob Ross – ANL)
    Data Mining and Analysis (Nagiza Samatova – NCSU)
    Software Process Automation (Terence Critchlow – PNNL)
    The goal of the SPA team was to develop and deploy technology that would allow scientists to spend more time on science by reducing the data management overhead
    Workflows had filled that niche in business but, in 2001, there was little usage in science applications
  • 5. As lead for the SPA team, I had both management and research responsibilities
    Team of 10-15 spread across NCSU, Univ. of Utah, UC Davis, SDSC, ORNL, and PNNL
    Identify relevant technology
    Work with science teams to design and deploy solutions
    Identify areas requiring additional research
    Perform research to improve the existing capabilities for our target customers
  • 6. Workflow technology was selected because time-consuming, repetitive tasks dominate day-to-day computational science activity
    By automating mundane tasks, we allow scientists to focus on science, not data management
    Needed a general-purpose workflow engine that we could apply to an HPC-centric environment
    Act as the orchestrator, coordinating the workflow execution
    Allow processing of larger data sets
    Support scientific reproducibility
    Reduce waste of resources by allowing timely corrective action to be taken
  • 7. The SDM Center was one of the founding organizations of the Kepler Consortium
    In 2001 there were no widely used scientific workflow engines
    Kepler is an open source workflow environment
    Based on the Ptolemy II system developed at UC Berkeley
    Started with several projects coming together based on a need for a flexible workflow environment
    Kepler-project.org
    Kepler has become one of the best-known and most widely used scientific workflow engines
  • 8. This talk focuses on work that I was directly involved in
    There was a lot of work performed by the SDM Center team that I managed but don't focus on here:
    Provenance tracking
    Dashboard
    Templates
    Patterns
    Deployed workflows: ITER, CPES, Combustion
    https://sdm.lbl.gov/sdmcenter/
    My research focused on raising the level of abstraction within scientific workflows
  • 9. Our first deployed workflow was managing a bioinformatics analysis pipeline (2002)
    In collaboration with Matt Coleman (LLNL)
  • 10. The TSI workflow was the first of our “standard” simulation workflows (2005)
    [Workflow diagram: submit batch request at NERSC; check job status (queued / running / done, with a delay loop); identify new complete files; transfer files to HPSS and to SB with transfer verification and local file deletion; extract variables, remap coordinates, create chemistry and neutrino variables, derive other variables; write diagnostic file; generate plots, thumbnails, and movies with Tools 1-4; update web page]
    In collaboration with Doug Swesty (Stony Brook)
  • 11. The workflow can be broken into several general steps
    [Same diagram, with the Job Submission steps highlighted]
  • 12. The workflow can be broken into several general steps
    [Same diagram, with the Job Monitoring steps highlighted]
  • 13. The workflow can be broken into several general steps
    [Same diagram, with the Moving Files steps highlighted]
  • 14. The workflow can be broken into several general steps
    [Same diagram, with the Data Analysis steps highlighted]
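The submit / monitor / transfer / analyze cycle described in these slides can be sketched as a simple polling loop. This is an illustration only, not the actual Kepler implementation; the helper functions are passed in as parameters because the real scheduler queries, HSI/scp transfers, and analysis tools are all site-specific.

```python
import time

def run_pipeline(check_status, list_complete_files, transfer, verify,
                 analyze, poll_interval=0.0):
    """Sketch of a TSI-style monitoring loop: poll the batch job and,
    as output files complete, archive them, verify the copy, then hand
    them to the analysis stage. All callables are hypothetical stand-ins."""
    processed = set()
    while True:
        status = check_status()                  # "queued", "running", "done"
        for f in list_complete_files():
            if f in processed:
                continue
            transfer(f, "hpss")                  # archive copy
            if verify(f, "hpss"):
                transfer(f, "analysis_host")     # copy for post-processing
                if verify(f, "analysis_host"):
                    analyze(f)                   # extract vars, plots, movies
                    processed.add(f)
        if status == "done":
            return processed
        time.sleep(poll_interval)                # delay before re-polling
```

Injecting the helpers keeps the orchestration logic separate from the machine-specific mechanisms, which is the same separation the workflow engine provides.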
  • 15. This translates into a complicated Kepler workflow
    Extensive use of nested workflows to compartmentalize steps
    160 instances of 18 distinct actors
    Over a dozen parameters to control workflow execution
  • 16. We ended up building several similar simulation science workflows
    Fusion science
    Combustion
    Subsurface science
    These all have the same general steps
    But there are significant differences in the details
  • 17. Unfortunately, workflows are not typically portable across machines
    User authentication mechanisms depend on machine-specific policies
    Job launch and monitoring features depend on the scheduler
    File transfer mechanisms depend on available infrastructure
  • 18. We developed generic actors as the first step in raising the level of abstraction for workflow design
    Generic actors embody general functionality into actors that work across platforms / workflows
    Improve workflow portability
    Simplify creation of new workflows
    Form the basis for sharing subworkflows
    Reduce the number of actor choices
  • 19. We identified several capabilities required across simulation workflows
    User authentication
    Job submission: submit job scheduling request to batch scheduler
    Job monitoring: track status of job from submitted, to running, to completed
    File transfer: move files, potentially between machines at different sites
    Developed and deployed actors capable of performing the desired functionality using available infrastructure
    Generalized to manage multiple implementations
    Parameters and contextual information determine which options to utilize
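The "generalized to manage multiple implementations" idea can be pictured as a dispatch layer: one generic interface, with the concrete mechanism chosen from contextual information. The sketch below is hypothetical; the command strings, machine profiles, and preference orders are invented for illustration, and the real generic actors are Kepler components rather than plain Python.

```python
# Sketch of a generic file-transfer actor: one interface, several
# machine-specific implementations, with context picking the mechanism.
# (Commands and machine profiles below are illustrative, not real site configs.)

TRANSFER_METHODS = {
    "hsi":     lambda src, dst: ["hsi", "put", src, ":", dst],
    "scp":     lambda src, dst: ["scp", src, dst],
    "gridftp": lambda src, dst: ["globus-url-copy", src, dst],
}

MACHINE_PROFILES = {                  # hypothetical per-site preference order
    "nersc":   ["hsi", "gridftp", "scp"],
    "generic": ["scp"],
}

def build_transfer_command(src, dst, machine, available):
    """Pick the first mechanism the machine prefers that is actually
    installed, then build its command line."""
    for method in MACHINE_PROFILES.get(machine, MACHINE_PROFILES["generic"]):
        if method in available:
            return TRANSFER_METHODS[method](src, dst)
    raise RuntimeError(f"no usable transfer method on {machine}")
```

The workflow itself only ever calls `build_transfer_command`, which is what makes the same workflow usable on machines with different infrastructure.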
  • 20. Use of generic actors improved workflow effectiveness
    Same workflow could be used on all of the DOE leadership-class machines
    Significantly less maintenance required
    Fewer workflows needed per science team
    Each workflow is simpler
    Still requires parameters to manage details of execution
  • 21. Workflow context can be used to reduce the number of explicit parameters
    Workflows run in a context that provides certain preferences:
    Systems
    User accounts
    Configuration files
    Information requested / computed / bound at run time instead of design time
    Initial results are promising, but more work is required to determine how effective run-time binding is for workflows
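One way to picture this run-time binding is as layered parameter lookup: explicit workflow parameters first, then user configuration, then system defaults, with anything still unresolved computed at run time. A minimal sketch using Python's `collections.ChainMap` (the layer names and keys are hypothetical):

```python
from collections import ChainMap

def make_context(explicit, user_config, system_defaults):
    """Layered context: later layers are consulted only when earlier
    ones lack the key (layer names are illustrative)."""
    return ChainMap(explicit, user_config, system_defaults)

def resolve(context, key, compute=None):
    """Look the key up in the layered context; if it is absent
    everywhere, fall back to computing it at run time."""
    if key in context:
        return context[key]
    if compute is not None:
        return compute()
    raise KeyError(key)
```

The effect is that a workflow only needs explicit parameters for values that genuinely vary per run; everything else is drawn from the context it executes in.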
  • 22. Scientific workflows still have major adoption challenges to overcome
    The correlation between the scientific process and the executable workflow is loose at best
    Executable workflows are extremely complex and usually require a dedicated workflow designer to create
    The translation from idea to napkin drawing to executable workflow is challenging and lossy
  • 23. The scientific process is collaborative, fluid, and time sensitive
    Important decisions are made in meetings and conversations; records are distributed and not easily associated with specific tasks
    Decisions can be revisited and changed
    Science is inherently iterative
    Executable workflows document the results of these decisions but lack broader context
    Electronic lab notebooks provide some contextual information but lack details and external information / links
    Provenance provides some associations
  • 24. Need a way to allow scientists to collect and share information about their experiments
    A single location capable of collecting all relevant information about an experiment
    Information needs to be related in a meaningful way
    Temporal information must be preserved
    Working with collaborators at UTEP, we developed a prototype of what this could look like
  • 25. Our prototype is built on annotating abstract workflows
    Design principles:
    Workflow construction needs to be a byproduct of information collection
    Information should not need to be entered more than once
    Annotations should relate to specific steps in the process
  • 26. The research hierarchy contains the steps in the abstract workflow
    Steps are conceptual
    At the top level, these outline the major steps in the experiment being performed:
    Get data
    Create conceptual model
    Generate model input
    Run simulation
    Each step can have sub-steps within it to refine the concept further (nested structure)
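The nested structure described here can be sketched as a simple recursive data type. This is an illustration of the idea only; the field names are invented and not the prototype's actual model.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A conceptual step in the research hierarchy: free-form text
    describes its purpose, and sub-steps refine it further.
    (Field names are hypothetical.)"""
    name: str
    description: str = ""
    substeps: list = field(default_factory=list)

    def add(self, substep):
        """Attach a refinement of this step and return it for chaining."""
        self.substeps.append(substep)
        return substep

    def outline(self, depth=0):
        """Flatten the hierarchy into indented outline lines."""
        lines = ["  " * depth + self.name]
        for s in self.substeps:
            lines.extend(s.outline(depth + 1))
        return lines
```

A top-level `Step` then plays the role of the experiment itself, with "Get data", "Run simulation", and so on as its children.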
  • 27. Free-form text is associated with each step in the hierarchy
    Allows scientists to easily describe a step’s purpose
    Top level describes the entire experiment
    Decisions are captured under the research specs tab
  • 28. The process view shows the steps as a workflow
    Ports are used to identify inputs and outputs
    Lines between steps indicate information flow between steps
    Steps should, eventually, connect
  • 29. Steps are connected by linking input and output parameters (ports)
    Inputs and outputs are linked
    Comment field holds assumptions and constraints from the “other side” of the line
    Free-form text makes it easy to input information, but impossible to perform automatic verification
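The port-linking idea, including the verification limit noted above, can be sketched in a few lines. The names are hypothetical; the point is that with free-form comments the only property a tool can check automatically is connectivity, not content.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    """An input or output of a conceptual step (names invented)."""
    step: str
    name: str

@dataclass
class Link:
    """Connects an output port to an input port; the free-form comment
    holds assumptions and constraints from the other side of the line."""
    source: Port
    target: Port
    comment: str = ""

def unconnected_inputs(links, input_ports):
    """Report input ports with no incoming link yet. Connectivity is
    checkable; the comments themselves cannot be verified automatically."""
    linked = {link.target for link in links}
    return [p for p in input_ports if p not in linked]
```

A design tool built on this could flag dangling inputs while leaving interpretation of the comments to the scientist.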
  • 30. Zooming in on a specific sub-step provides additional information about that step
    A new tab provides (sub-)step-specific information
    The process view is updated to reflect the sub-steps contained within this step
    Note that the inputs and outputs to the workflow come from the higher-level workflow
  • 31. Eventually, some steps correspond to executable (Kepler) workflows
    Prototype expands Kepler infrastructure
    Executable workflows are (still) typically created by a dedicated workflow designer
    This places the executable workflow in the broader context of the experiment it is supporting
    Provenance can be linked into the overall experiment
  • 32. Annotations are stored in RDF to support export / import
    Semantically Interlinked Online Communities (SIOC) format chosen
    Supports other tools using these annotations:
    Report generation
    Search / query
    Experiment-level provenance information
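An exported annotation would amount to a handful of RDF statements. The hand-rolled sketch below shows the shape of such an export; the subject URI, the choice of `sioc:content` and `dcterms:created` as predicates, and the N-Triples-style serialization are illustrative guesses, not the prototype's actual schema.

```python
# Sketch: serialize one step annotation as N-Triples-style RDF
# statements using SIOC-like terms. URIs and predicate choices are
# illustrative, not the prototype's real vocabulary.

SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"

def annotation_triples(step_uri, text, created):
    """Emit (subject, predicate, object) triples for one annotation."""
    return [
        (step_uri, SIOC + "content", text),
        (step_uri, DCTERMS + "created", created),
    ]

def to_ntriples(triples):
    """Subjects and predicates are URIs; objects here are plain literals."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)
```

Because the export is plain RDF, any downstream tool that understands SIOC terms can consume it for reporting, search, or provenance without knowing about the prototype.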
  • 33. This prototype represents a starting point for answering many interesting questions
    How do you effectively link other sources of information into steps in an abstract workflow?
    How do you select only the relevant information?
    How do you manage provenance and attribution in a distributed environment?
    What is the best way to organize this information for people filling a variety of roles?
    PIs need a different view than workflow designers or bench scientists
    How do you effectively share (subsets of) this information?
    How do you implement access controls effectively?
  • 34. This prototype represents a starting point for answering many interesting questions
    Are workflows the right abstraction for representing the scientific process?
    Representing evolution over time is challenging in workflows
    Does everything have to correspond to a step?
    Is there a way to generate parts of an executable workflow given an abstract definition?
    Can we match steps to specific actors?
    Could you develop a generic set of wizards or templates?
  • 35. Conclusions
    The SDM Center has been at the forefront of scientific workflow R&D
    Workflows have been successfully deployed across a wide variety of scientific domains
    Significant advances have been made in making workflow engines more reliable and useful
    There remains significant work required to:
    Fit workflows within the context of the overall scientific process
    Allow scientists to design and implement their own workflows
  • 36. This work involved many, many people
    My team: George Chin, Chandrika Sivaramakrishnan, Xiaowen Xin (LLNL), Anand Kulkarni, Anne Ngu (TX State), Paulo Pinheiro da Silva (UTEP), Aida Gandara (UTEP)
    Other SPA team members: Ilkay Altintas, Bertram Ludaescher, Mladen Vouk, Claudio Silva, Scott Klasky, Norbert Podhorszki, Dan Crawl, Ayla Khan, Arie Shoshani
    Plus other students and researchers who were involved for shorter times