Scientific Workflows:
Experience, Advances, and
Where We Go From Here
Terence Critchlow
January 2012
PNNL-SA-85033
Summary of the work I did as part of the SciDAC SDM center, which wrapped up in 2012.

Published in: Technology
  1. Scientific Workflows: Experience, Advances, and Where We Go From Here
     • Terence Critchlow, January 2012, PNNL-SA-85033
  2. This talk will provide answers to 4 questions:
     • Why did I get involved with scientific workflows?
     • How do scientific workflows help scientists?
     • What problems did you find when you first started working with scientific workflows?
     • Can scientific workflows be effectively integrated into the broader scientific process?
  3. I became involved with scientific workflows through the SciDAC SDM Center
     • The Scientific Discovery through Advanced Computing (SciDAC) program was funded by DOE starting in 2001 with the goal of advancing scientific computing by having CS and domain science teams work together to address science questions using new HPC platforms
     • Application initiatives were funded in areas such as combustion, fusion, astrophysics, and groundwater
     • CS and math centers were funded in areas critical to the development of new, scalable capabilities including solvers, AMR, visualization, performance, and data management
     • Focus was on science, not CS research
  4. The Scientific Data Management (SDM) Center was the focal point for DOE data management activities
     • Large, multi-institutional collaborations
     • Led by Arie Shoshani (LBL)
     • 5 Labs and 5 Universities
     • Funded for 10 years
     • Project concluded in 2011
     • The center had 3 research thrusts: Storage and efficient access (Rob Ross – ANL); Data Mining and Analysis (Nagiza Samatova – NCSU); Software Process Automation (Terence Critchlow – PNNL)
     • The goal of the SPA team was to develop and deploy technology that would allow scientists to spend more time on science by reducing the data management overhead
     • Workflows had filled that niche in business but, in 2001, there was little usage in science applications
  5. As lead for the SPA team, I had both management and research responsibilities
     • Team of 10-15 spread across NCSU, Univ. of Utah, UC Davis, SDSC, ORNL, and PNNL
     • Identify relevant technology
     • Work with science teams to design and deploy solutions
     • Identify areas requiring additional research
     • Perform research to improve the existing capabilities for our target customers
  6. Workflow technology was selected because time-consuming, repetitive tasks dominate day-to-day computational science activity
     • By automating mundane tasks, we allow scientists to focus on science, not data management
     • Needed a general purpose workflow engine that we could apply to an HPC-centric environment
     • Act as the orchestrator, coordinating the workflow execution
     • Allow processing of larger data sets
     • Support scientific reproducibility
     • Reduce waste of resources by allowing timely corrective action to be taken
  7. The SDM Center was one of the founding organizations of the Kepler Consortium
     • In 2001 there were no widely used scientific workflow engines
     • Kepler is an open source workflow environment
     • Based on the Ptolemy II system developed at UC Berkeley
     • Started with several projects coming together based on a need for a flexible workflow environment
     • Kepler-project.org
     • Kepler has become one of the best known and most widely used scientific workflow engines
  8. This talk focuses on work that I was directly involved in
     • The SDM Center team that I managed performed a lot of work that I don’t focus on here: provenance tracking; dashboard; templates; patterns; deployed workflows (ITER, CPES, Combustion)
     • https://sdm.lbl.gov/sdmcenter/
     • My research focused on raising the level of abstraction within scientific workflows
  9. Our first deployed workflow was managing a bioinformatics analysis pipeline (2002)
     • In collaboration with Matt Coleman (LLNL)
  10. The TSI workflow was the first of our “standard” simulation workflows (2005)
     • In collaboration with Doug Swesty (Stony Brook)
     • Workflow diagram, node labels: Submit batch request at NERSC; Identify new complete files; Check job status; Transfer files to HPSS; Transfer completed correctly; Transfer files to SB; Transfer completed correctly; Delete file; Extract Get Variables; Remap coordinates; Create Chem vars; Create neutrino vars; Derive other vars; Write diagnostic file; Generate plots; Tool-1; Tool-2; Tool-3; Tool-4; Generate thumbnails; Generate movie; Delay; Queued; Running or Done; Update web page; IfRunning
  11. The workflow can be broken into several general steps
     • Workflow diagram repeated from slide 10, labeled: Job Submission
  12. The workflow can be broken into several general steps
     • Workflow diagram repeated from slide 10, labeled: Job Monitoring
  13. The workflow can be broken into several general steps
     • Workflow diagram repeated from slide 10, labeled: Moving files
  14. The workflow can be broken into several general steps
     • Workflow diagram repeated from slide 10, labeled: Data Analysis
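The four general steps (job submission, job monitoring, moving files, data analysis) can be sketched as a simple polling loop. This is only an illustration of the control flow, not the Kepler workflow itself: `FakeCluster`, its methods, the job id, and the file names are hypothetical stand-ins, and a real deployment would sleep between polls (the "Delay" node) and call actual transfer and analysis tools.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical stand-in for a batch system that writes one file per poll."""
    total_files: int
    produced: list = field(default_factory=list)

    def submit(self):
        return "job-1"  # hypothetical job id

    def status(self, job_id):
        return "Done" if len(self.produced) >= self.total_files else "Running"

    def new_complete_files(self):
        # Simulate the simulation finishing one more output file between polls.
        if len(self.produced) < self.total_files:
            self.produced.append(f"out_{len(self.produced):03d}.dat")
            return [self.produced[-1]]
        return []


def run_workflow(cluster):
    """Return a log of (action, target) pairs in the order they were performed."""
    log = []
    job_id = cluster.submit()                      # job submission
    log.append(("submit", job_id))
    while True:
        for name in cluster.new_complete_files():  # job monitoring
            log.append(("transfer", name))         # moving files
            log.append(("analyze", name))          # data analysis
        if cluster.status(job_id) == "Done":
            break
    log.append(("update_web_page", job_id))
    return log


log = run_workflow(FakeCluster(total_files=2))
```

Each new file is transferred and analyzed as soon as it appears, so results become available while the simulation is still running.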
  15. This translates into a complicated Kepler workflow
     • Extensive use of nested workflows to compartmentalize steps
     • 160 instances of 18 distinct actors
     • Over a dozen parameters to control workflow execution
  16. We ended up building several similar simulation science workflows
     • Fusion science
     • Combustion
     • Subsurface science
     • These all have the same general steps
     • But there are significant differences in the details
  17. Unfortunately, workflows are not typically portable across machines
     • User authentication mechanisms depend on machine-specific policies
     • Job launch and monitoring features depend on scheduler
     • File transfer mechanisms depend on available infrastructure
  18. We developed generic actors as the first step in raising the level of abstraction for workflow design
     • Generic actors package general functionality into actors that work across platforms and workflows
     • Improve workflow portability
     • Simplify creation of new workflows
     • Form the basis for sharing subworkflows
     • Reduce the number of actor choices
  19. We identified several capabilities required across simulation workflows
     • User authentication
     • Job submission: submit job scheduling request to batch scheduler
     • Job monitoring: track status of job from submitted, to running, to completed
     • File transfer: move files, potentially between machines at different sites
     • Developed and deployed actors capable of performing the desired functionality using available infrastructure
     • Generalized to manage multiple implementations
     • Parameters and contextual information determine which options to utilize
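The idea of a generic actor that manages multiple implementations can be sketched as one interface backed by a registry, with a parameter or the context choosing the implementation. This is a minimal sketch in Python rather than Kepler's own actor API; `GenericJobSubmitter`, `SUBMIT_COMMANDS`, and the `"scheduler"` context key are hypothetical names, and the two schedulers are only examples of what a site might run.

```python
from typing import Callable, Dict, List, Optional

# Registry: scheduler name -> function that builds the submit command line.
# Only the bare submit commands are shown; site-specific options are omitted.
SUBMIT_COMMANDS: Dict[str, Callable[[str], List[str]]] = {
    "pbs": lambda script: ["qsub", script],
    "slurm": lambda script: ["sbatch", script],
}


class GenericJobSubmitter:
    """One actor for job submission, whatever scheduler the machine uses."""

    def __init__(self, context: Dict[str, str]):
        self.context = context

    def build_command(self, script: str, scheduler: Optional[str] = None) -> List[str]:
        # An explicit parameter wins; otherwise fall back to the context.
        name = scheduler or self.context.get("scheduler")
        if name not in SUBMIT_COMMANDS:
            raise ValueError(f"no submit implementation for scheduler {name!r}")
        return SUBMIT_COMMANDS[name](script)


actor = GenericJobSubmitter({"scheduler": "pbs"})
command = actor.build_command("run_simulation.sh")
```

The workflow only ever refers to the one actor, so moving to another machine means changing the context rather than rewiring the workflow.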
  20. Use of generic actors improved workflow effectiveness
     • Same workflow could be used on all of the DOE leadership class machines
     • Significantly less maintenance required
     • Fewer workflows needed per science team
     • Each workflow is simpler
     • Still requires parameters to manage details of execution
  21. Workflow context can be used to reduce the number of explicit parameters
     • Workflows run in a context that provides certain preferences: systems; user accounts; configuration files
     • Information requested / computed / bound at run time instead of design time
     • Initial results are promising but more work is required to determine how effective run-time binding is for workflows
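Run-time binding from a context can be illustrated with a layered lookup: explicit workflow parameters first, then user preferences, then system defaults. The sketch below uses Python's standard `collections.ChainMap`; the layer names, keys, and values (`account`, `archive_host`, the `example.org` hosts) are hypothetical and not taken from the SDM Center implementation.

```python
from collections import ChainMap


def bind_parameters(explicit, user_prefs, system_defaults):
    """Resolve each parameter at lookup time, searching the layers in order."""
    return ChainMap(explicit, user_prefs, system_defaults)


# Hypothetical context layers.
system_defaults = {"scheduler": "pbs", "archive_host": "hpss.example.org"}
user_prefs = {"account": "m0000", "archive_host": "archive.example.org"}
explicit = {"input_deck": "run42.in"}

params = bind_parameters(explicit, user_prefs, system_defaults)

archive = params["archive_host"]   # found in user_prefs, which shadows the default
scheduler = params["scheduler"]    # falls through to system_defaults
```

Because `ChainMap` keeps references to the underlying dictionaries, a change to the context made after the workflow was designed is picked up the next time a parameter is looked up, which is the design-time versus run-time distinction the slide draws.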
  22. Scientific workflows still have major adoption challenges to overcome
     • The correlation between the scientific process and the executable workflow is loose at best
     • Executable workflows are extremely complex and usually require a dedicated workflow designer to create
     • The translation from idea to napkin drawing to executable workflow is challenging and lossy
  23. The scientific process is collaborative, fluid, and time sensitive
     • Important decisions are made in meetings and conversations
     • Records are distributed and not easily associated with specific tasks
     • Decisions can be revisited and changed
     • Science is inherently iterative
     • Executable workflows document the results of these decisions, but lack broader context
     • Electronic lab notebooks provide some contextual information, but lack details and external information / links
     • Provenance provides some associations
  24. Need a way to allow scientists to collect and share information about their experiments
     • A single location capable of collecting all relevant information about an experiment
     • Information needs to be related in a meaningful way
     • Temporal information must be preserved
     • Working with collaborators at UTEP, we developed a prototype of what this could look like
  25. Our prototype is built on annotating abstract workflows
     • Design principles:
     • Workflow construction needs to be a byproduct of information collection
     • Information should not need to be entered more than once
     • Annotations should relate to specific steps in the process
  26. The research hierarchy contains the steps in the abstract workflow
     • Steps are conceptual
     • At the top level, these outline the major steps in the experiment being performed: Get data; Create conceptual model; Generate model input; Run simulation
     • Each step can have sub-steps within it to refine the concept further (nested structure)
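The nested structure of the research hierarchy maps naturally onto a small tree. The sketch below is an assumption about how such a hierarchy could be represented, not the prototype's data model: the `Step` class and its fields are hypothetical, the four top-level step names come from the slide, and the sub-steps under "Run simulation" are invented purely to show nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """A conceptual step with free-form notes and optional sub-steps."""
    name: str
    notes: str = ""
    substeps: List["Step"] = field(default_factory=list)

    def outline(self, depth: int = 0) -> List[str]:
        """Return the hierarchy as indented lines, one per step."""
        lines = ["  " * depth + self.name]
        for sub in self.substeps:
            lines.extend(sub.outline(depth + 1))
        return lines


experiment = Step(
    "Experiment",
    notes="Top-level description of the entire experiment.",
    substeps=[
        Step("Get data"),
        Step("Create conceptual model"),
        Step("Generate model input"),
        # Hypothetical sub-steps, shown only to illustrate refinement.
        Step("Run simulation", substeps=[Step("Submit job"), Step("Collect output")]),
    ],
)

print("\n".join(experiment.outline()))
```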
  27. Free-form text is associated with each step in the hierarchy
     • Allows scientists to easily describe a step’s purpose
     • Top level describes entire experiment
     • Decisions are captured under research specs tab
  28. The process view shows the steps as a workflow
     • Ports are used to identify inputs and outputs
     • Lines between steps indicate information flow between steps
     • Steps should, eventually, connect
  29. Steps are connected by linking input and output parameters (ports)
     • Inputs and outputs are linked
     • Comment field holds assumptions and constraints from the “other side” of the line
     • Free-form text makes it easy to input information, but impossible to perform automatic verification
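Linking ports can be sketched as a list of links between (step, port) pairs, each carrying a free-form comment. The class and port names below are hypothetical. The one check shown, finding inputs that nothing feeds yet, is structural only; as the slide notes, the free-form comments themselves cannot be verified automatically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    name: str
    inputs: List[str]
    outputs: List[str]


@dataclass
class Link:
    source: Tuple[str, str]  # (step name, output port)
    target: Tuple[str, str]  # (step name, input port)
    comment: str = ""        # assumptions and constraints, free-form text


def unconnected_inputs(steps: List[Step], links: List[Link]) -> List[Tuple[str, str]]:
    """List (step, input port) pairs that no link feeds yet."""
    linked = {link.target for link in links}
    return [(s.name, p) for s in steps for p in s.inputs if (s.name, p) not in linked]


# Hypothetical ports on two of the steps named in the research hierarchy.
steps = [
    Step("Generate model input", inputs=["conceptual_model"], outputs=["input_deck"]),
    Step("Run simulation", inputs=["input_deck", "allocation"], outputs=["results"]),
]
links = [
    Link(("Generate model input", "input_deck"), ("Run simulation", "input_deck"),
         comment="Assumes the deck uses the grid agreed on at the planning meeting."),
]

pending = unconnected_inputs(steps, links)
```

Inputs left in `pending` either come from the higher-level workflow or mark steps that still need to be connected.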
  30. Zooming in on a specific sub-step provides additional information about that step
     • A new tab provides (sub-)step-specific information
     • The process view is updated to reflect the sub-steps contained within this step
     • Note that the inputs and outputs to the workflow come from the higher-level workflow
  31. Eventually, some steps correspond to executable (Kepler) workflows
     • Prototype expands Kepler infrastructure
     • Executable workflows are (still) typically created by a dedicated workflow designer
     • This places the executable workflow in the broader context of the experiment it is supporting
     • Provenance can be linked into overall experiment
  32. Annotations are stored in RDF to support export / import
     • Semantically Interlinked Online Communities (SIOC) format chosen
     • Supports other tools using these annotations: report generation; search / query; experiment-level provenance information
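To make the RDF storage concrete, the sketch below writes one annotation as N-Triples using terms from the SIOC vocabulary (`sioc:Post`, `sioc:about`, `sioc:content`). The slides do not give the prototype's actual schema, so the choice of these terms, the `http://example.org/experiment/` namespace, and the identifiers are assumptions for illustration. The triples are built by hand to keep the example dependency-free; an RDF library would normally do the serialization.

```python
SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/experiment/"  # hypothetical namespace for this sketch


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def annotation_triples(note_id: str, step_id: str, text: str) -> list:
    """Describe one free-form note attached to one workflow step."""
    note = f"<{EX}note/{note_id}>"
    step = f"<{EX}step/{step_id}>"
    return [
        f"{note} <{RDF_TYPE}> <{SIOC}Post> .",
        f"{note} <{SIOC}about> {step} .",
        f"{note} <{SIOC}content> {literal(text)} .",
    ]


triples = annotation_triples("n1", "run-simulation", 'Use the "coarse" grid for the first pass.')
print("\n".join(triples))
```

Because the output is plain RDF, a report generator or query tool can consume the annotations without knowing anything about the tool that produced them.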
  33. This prototype represents a starting point for answering many interesting questions
     • How do you effectively link other sources of information into steps in an abstract workflow?
     • How do you select only the relevant information?
     • How do you manage provenance and attribution in a distributed environment?
     • What is the best way to organize this information for people filling a variety of roles? PIs need a different view than workflow designers or bench scientists
     • How do you effectively share (subsets of) this information?
     • How do you implement access controls effectively?
  34. This prototype represents a starting point for answering many interesting questions
     • Are workflows the right abstraction for representing the scientific process?
     • Representing evolution over time is challenging in workflows
     • Does everything have to correspond to a step?
     • Is there a way to generate parts of an executable workflow given an abstract definition?
     • Can we match steps to specific actors?
     • Could you develop a generic set of wizards or templates?
  35. Conclusions
     • The SDM Center has been at the forefront of scientific workflow R&D
     • Workflows have been successfully deployed across a wide variety of scientific domains
     • Significant advances have been made in making workflow engines more reliable and useful
     • There remains significant work required to: fit workflows within the context of the overall scientific process; allow scientists to design and implement their own workflows
  36. This work involved many, many people
     • My team: George Chin, Chandrika Sivaramakrishnan, Xiaowen Xin (LLNL), Anand Kulkarni, Anne Ngu (TX State), Paulo Pinheiro da Silva (UTEP), Aida Gandara (UTEP)
     • Other SPA team members: Ilkay Altintas, Bertram Ludaescher, Mladen Vouk, Claudio Silva, Scott Klasky, Norbert Podhorszki, Dan Crawl, Ayla Khan, Arie Shoshani
     • Plus other students and researchers who were involved for shorter times
