Scientific Workflows:
Experience, Advances, and
Where We Go From Here
Terence Critchlow
January 2012
PNNL-SA-85033
This talk will provide answers to 4 questions:
Why did I get involved with scientific workflows?
How do scientific workflows help scientists?
What problems did we find when we first started working
with scientific workflows?
Can scientific workflows be effectively integrated into the
broader scientific process?
I became involved with scientific workflows
through the SciDAC SDM Center
The Scientific Discovery through
Advanced Computing (SciDAC)
program was funded by DOE
starting in 2001 with the goal of
advancing scientific computing by having CS and
domain science teams work together to address
science questions using new HPC platforms
Application initiatives were funded in areas such as
combustion, fusion, astrophysics, and groundwater
CS and math centers were funded in areas critical
to the development of new, scalable capabilities
including solvers, AMR, visualization, performance,
and data management
Focus was on science, not CS research
The Scientific Data Management (SDM)
Center was the focal point for DOE data
management activities
Large, multi-institutional collaborations
Led by Arie Shoshani (LBL)
5 Labs and 5 Universities
Funded for 10 years
Project concluded in 2011
The center had 3 research thrusts:
Storage and efficient access (Rob Ross – ANL)
Data Mining and Analysis (Nagiza Samatova - NCSU)
Software Process Automation (Terence Critchlow – PNNL)
The goal of the SPA team was to develop and deploy
technology that would allow scientists to spend more time
on science by reducing the data management overhead
Workflows had filled that niche in business but, in
2001, saw little use in science applications
As lead for the SPA team, I had both
management and research responsibilities
Team of 10-15 spread across NCSU, Univ. of
Utah, UC Davis, SDSC, ORNL, and PNNL
Identify relevant technology
Work with science teams to design and
deploy solutions
Identify areas requiring additional research
Perform research to improve the existing
capabilities for our target customers
Workflow technology was selected because
time-consuming, repetitive tasks dominate
day-to-day computational science activity
By automating mundane tasks, we
allow scientists to focus on science
not data management
Needed a general-purpose
workflow engine that we could
apply to an HPC-centric
environment
Act as the orchestrator, coordinating
the workflow execution
Allow processing of larger data sets
Support scientific reproducibility
Reduce waste of resources by allowing
timely corrective action to be taken
The SDM Center was one of the founding
organizations of the Kepler Consortium
In 2001 there were no widely used scientific workflow
engines
Kepler is an open source workflow environment
Based on the Ptolemy II system developed at UC Berkeley
Started with several projects coming together based on
a need for a flexible
workflow environment
kepler-project.org
Kepler has become
one of the best-known
and most widely used
scientific workflow
engines
This talk focuses on work that I was directly
involved in
The SDM Center team that I managed performed
a lot of work that I do not focus on here:
Provenance tracking
Dashboard
Templates
Patterns
Deployed workflows
ITER
CPES
Combustion
https://sdm.lbl.gov/sdmcenter/
My research focused on raising the level of
abstraction within scientific workflows
Our first deployed workflow managed a
bioinformatics analysis pipeline (2002)
In collaboration with
Matt Coleman (LLNL)
The TSI workflow was the first of our
“standard” simulation workflows (2005)
[Workflow diagram: submit batch request at NERSC; check job status (Delay while Queued, Update web page if Running, proceed when Running or Done); identify new complete files; transfer files to HPSS and verify the transfer completed correctly; transfer files to SB and verify the transfer completed correctly; delete file; Extract / Get Variables; remap coordinates; create Chem vars and neutrino vars; derive other vars; write diagnostic file; generate plots (Tool-1 through Tool-4); generate thumbnails; generate movie]
In collaboration with
Doug Swesty (Stony Brook)
The workflow can be broken into four
general steps:
Job Submission
Job Monitoring
Moving files
Data Analysis
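The job-monitoring portion of these workflows can be sketched as a simple polling loop. This is a minimal Python sketch, not the Kepler implementation; `check_status` and `update_web_page` are hypothetical stand-ins for the scheduler-query and dashboard-update actors.

```python
import time

def monitor_job(check_status, update_web_page, poll_interval=60, sleep=time.sleep):
    """Poll a batch job until it finishes, mirroring the workflow's
    delay / queued / running-or-done / update-web-page loop."""
    while True:
        status = check_status()          # e.g. query the batch scheduler
        if status == "queued":
            sleep(poll_interval)         # Delay: wait and re-check
        elif status == "running":
            update_web_page(status)      # IfRunning: publish progress
            sleep(poll_interval)
        else:                            # "done" (or failed): leave the loop
            return status
```

Injecting `sleep` makes the loop testable without real delays, a convenience of the sketch rather than a property of the deployed workflows.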
This translates into a complicated Kepler
workflow
Extensive use of nested
workflows to
compartmentalize steps
160 instances of 18
distinct actors
Over a dozen
parameters to control
workflow execution
We ended up building several similar
simulation science workflows
Fusion science
Combustion
Subsurface science
These all have the same
general steps
But there are significant
differences in the details
Unfortunately, workflows are not typically
portable across machines
User authentication
mechanisms depend on
machine-specific policies
Job launch and
monitoring features
depend on scheduler
File transfer
mechanisms depend on
available infrastructure
We developed generic actors as the first step
in raising the level of abstraction for
workflow design
Generic actors embody
general functionality into
actors that work across
platforms / workflows
Improve workflow
portability
Simplify creation of new
workflows
Form the basis for sharing
subworkflows
Reduce the number of
actor choices
We identified several capabilities required
across simulation workflows
User authentication
Job submission
Submit job scheduling
request to batch
scheduler
Job monitoring
Track status of job from
submitted, to running, to
completed
File transfer
Move files, potentially
between machines at
different sites
Developed and deployed
actors capable of
performing the desired
functionality using
available infrastructure
Generalized to manage
multiple implementations
Parameters and
contextual information
determine which options
to utilize
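To illustrate the idea (this is not the actual Kepler actor code, and the backend names are hypothetical), a generic job-submission actor can expose one interface and select a machine-specific implementation from contextual parameters:

```python
# Hypothetical sketch of a "generic actor": one interface, multiple
# machine-specific implementations selected from workflow context.
def submit_pbs(script):
    return f"qsub {script}"       # PBS/Torque-style submission command

def submit_slurm(script):
    return f"sbatch {script}"     # SLURM-style submission command

BACKENDS = {"pbs": submit_pbs, "slurm": submit_slurm}

def generic_submit(script, context):
    """Pick the submission mechanism from context (e.g. a per-machine
    configuration) rather than hard-coding it into the workflow."""
    scheduler = context["scheduler"]
    return BACKENDS[scheduler](script)
```

The same workflow can then run unchanged on machines with different schedulers by supplying a different context.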
Use of generic actors improved workflow
effectiveness
Same workflow could be
used on all of the DOE
leadership class
machines
Significantly less
maintenance required
Fewer workflows needed
per science team
Each workflow is simpler
Still requires parameters
to manage details of
execution
Workflow context can be used to reduce
number of explicit parameters
Workflows run in a context
that provides certain
preferences
Systems
User accounts
Configuration files
Information requested /
computed / bound at run time
instead of design time
Initial results are promising
but more work is required to
determine how effective
run-time binding is for workflows
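One way to realize run-time binding (a sketch under assumed conventions, not the Center's implementation) is to resolve each parameter by consulting layered context sources in priority order: explicit parameters, then user configuration, then system defaults.

```python
def bind_parameter(name, explicit, user_config, system_defaults):
    """Resolve a workflow parameter at run time instead of design time:
    explicit parameters win, then user config, then system defaults."""
    for layer in (explicit, user_config, system_defaults):
        if name in layer:
            return layer[name]
    raise KeyError(f"unbound workflow parameter: {name}")
```

A parameter the user never sets explicitly is still bound correctly at execution time, which is what lets the visible parameter list shrink.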
Scientific workflows still have major
adoption challenges to overcome
The correlation between
the scientific process
and the executable
workflow is loose at best
Executable workflows
are extremely complex
and usually require a
dedicated workflow
designer to create
The translation from idea to napkin drawing to
executable workflow is challenging and lossy
The scientific process is collaborative, fluid,
and time sensitive
Important decisions are
made in meetings and
conversations.
Records are distributed
and not easily
associated with specific
tasks
Decisions can be
revisited and changed
Science is inherently
iterative
Executable workflows
document the results of
these decisions
Lack broader context
Electronic lab notebooks
provide some contextual
information
Lack details and external
information / links
Provenance provides
some associations
Need a way to allow scientists to collect and
share information about their experiments
A single location capable
of collecting all relevant
information about an
experiment
Information needs to be
related in a meaningful
way
Temporal information
must be preserved
Working with
collaborators at UTEP,
we developed a
prototype of what this
could look like
Our prototype is built on annotating abstract
workflows
Design principles
Workflow construction
needs to be a byproduct
of information collection
Information should not
need to be entered more
than once
Annotations should relate
to specific steps in the
process
The research hierarchy contains the steps in
the abstract workflow
Steps are conceptual
At the top level, these
outline the major steps in
the experiment being
performed
Get data
Create conceptual model
Generate model input
Run simulation
Each step can have sub-steps
within it to refine the
concept further
(nested structure)
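The nested research hierarchy can be modeled with a simple recursive data structure; this is an illustrative sketch, not the prototype's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A conceptual step in the research hierarchy; sub-steps refine it."""
    name: str
    description: str = ""          # free-form text describing the step's purpose
    substeps: list = field(default_factory=list)

    def add(self, substep):
        self.substeps.append(substep)
        return substep

# Top-level steps outline the experiment being performed.
experiment = Step("Experiment")
for name in ["Get data", "Create conceptual model",
             "Generate model input", "Run simulation"]:
    experiment.add(Step(name))
```

Each `Step` can itself be refined, e.g. `experiment.substeps[0].add(Step("Download field measurements"))`, giving the nested structure described above.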
Free-form text is associated with each step
in the hierarchy
Allows scientists to easily
describe a step's purpose
Top level describes entire
experiment
Decisions are captured
under research specs tab
The process view shows the steps as a
workflow
Ports are used to identify
inputs and outputs
Lines between steps
indicate information flow
between steps
Steps should, eventually,
connect
Steps are connected by linking input and
output parameters (ports)
Inputs and outputs are
linked
Comment field holds
assumptions and
constraints from the
“other side” of the line
Free-form text makes it
easy to enter
information, but makes
automatic verification
impossible
Zooming in on a specific sub-step provides
additional information about that step
A new tab provides
(sub-)step-specific
information
The process view is
updated to reflect the
sub-steps contained
within this step
Note that the inputs
and outputs to the
workflow come from
the higher-level
workflow
Eventually, some steps correspond to
executable (Kepler) workflows
Prototype expands
Kepler infrastructure
Executable workflows
are (still) typically created
by a dedicated workflow
designer
This places the
executable workflow in
the broader context of
the experiment it is
supporting
Provenance can be
linked into overall
experiment
Annotations are stored in RDF to support
export / import
Semantically Interlinked
Online Communities
(SIOC) format chosen
Supports other tools
using these annotations
Report generation
Search / query
Experiment level
provenance information
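The export path can be sketched with nothing but the standard library (the real prototype used RDF tooling; the SIOC and Dublin Core terms and the example URIs here are illustrative): each step annotation becomes a set of triples that serialize to N-Triples for other tools to consume.

```python
SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"

def annotation_triples(step_uri, text, created):
    """Represent one step annotation as (subject, predicate, object) triples."""
    return [
        (step_uri, SIOC + "content", text),       # the free-form annotation
        (step_uri, DCTERMS + "created", created), # preserve temporal information
    ]

def to_ntriples(triples):
    """Serialize literal-valued triples in N-Triples syntax for export."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

triples = annotation_triples("http://example.org/exp1/step/get-data",
                             "Download well logs from field site",
                             "2012-01-15")
```

Because the output is standard RDF, report generators and query tools can consume the annotations without knowing anything about the prototype itself.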
This prototype represents a starting point for
answering many interesting questions
How do you effectively
link other sources of
information into steps in
an abstract workflow?
How do you select only
the relevant information?
How do you manage
provenance and
attribution in a distributed
environment?
What is the best way to
organize this information
for people filling a variety
of roles?
PIs need a different view
than workflow designers
or bench scientists
How do you effectively
share (subsets of) this
information?
How do you implement
access controls
effectively?
This prototype represents a starting point for
answering many interesting questions
Are workflows the right
abstraction for
representing the
scientific process?
Representing evolution
over time is challenging in
workflows
Does everything have to
correspond to a step?
Is there a way to
generate parts of an
executable workflow
given an abstract
definition?
Can we match steps to
specific actors?
Could you develop a
generic set of wizards or
templates?
Conclusions
The SDM Center has been at the
forefront of scientific workflow R&D
Workflows have been successfully
deployed across a wide variety of
scientific domains
Significant advances have been
made in making workflow engines
more reliable and useful
There remains significant work
required to
Fit workflows within the context of the
overall scientific process
Allow scientists to design and
implement their own workflows
This work involved many, many people
My team
George Chin
Chandrika Sivaramakrishnan
Xiaowen Xin (LLNL)
Anand Kulkarni
Anne Ngu (TX State)
Paulo Pinheiro da Silva
(UTEP)
Aida Gandara (UTEP)
Other SPA team members
Ilkay Altintas
Bertram Ludaescher
Mladen Vouk
Claudio Silva
Scott Klasky
Norbert Podhorszki
Dan Crawl
Ayla Khan
Arie Shoshani
Plus other students and researchers who were
involved for shorter times
Mind map of terminologies used in context of Generative AI
 

Scientific workflow-overview-2012-01-rev-2

  • 1. Scientific Workflows: Experience, Advances, and Where We Go From Here
    Terence Critchlow
    January 2012
    PNNL-SA-85033
  • 2. This talk will provide answers to 4 questions:
    Why did I get involved with scientific workflows?
    How do scientific workflows help scientists?
    What problems did I find when I first started working with scientific workflows?
    Can scientific workflows be effectively integrated into the broader scientific process?
  • 3. I became involved with scientific workflows through the SciDAC SDM Center
    The Scientific Discovery through Advanced Computing (SciDAC) program was funded by DOE starting in 2001 with the goal of advancing scientific computing by having CS and domain science teams work together to address science questions using new HPC platforms
    Application initiatives were funded in areas such as combustion, fusion, astrophysics, and groundwater
    CS and math centers were funded in areas critical to the development of new, scalable capabilities, including solvers, AMR, visualization, performance, and data management
    Focus was on science, not CS research
  • 4. The Scientific Data Management (SDM) Center was the focal point for DOE data management activities
    Large, multi-institutional collaborations
    Led by Arie Shoshani (LBL)
    5 Labs and 5 Universities
    Funded for 10 years; project concluded in 2011
    The center had 3 research thrusts:
    Storage and efficient access (Rob Ross – ANL)
    Data Mining and Analysis (Nagiza Samatova – NCSU)
    Software Process Automation (Terence Critchlow – PNNL)
    The goal of the SPA team was to develop and deploy technology that would allow scientists to spend more time on science by reducing the data management overhead
    Workflows had filled that niche in business but, in 2001, there was little usage in science applications
  • 5. As lead for the SPA team, I had both management and research responsibilities
    Team of 10-15 spread across NCSU, Univ. of Utah, UC Davis, SDSC, ORNL, and PNNL
    Identify relevant technology
    Work with science teams to design and deploy solutions
    Identify areas requiring additional research
    Perform research to improve the existing capabilities for our target customers
  • 6. Workflow technology was selected because time-consuming, repetitive tasks dominate day-to-day computational science activity
    By automating mundane tasks, we allow scientists to focus on science, not data management
    Needed a general-purpose workflow engine that we could apply to an HPC-centric environment
    Act as the orchestrator, coordinating the workflow execution
    Allow processing of larger data sets
    Support scientific reproducibility
    Reduce waste of resources by allowing timely corrective action to be taken
  • 7. The SDM Center was one of the founding organizations of the Kepler Consortium
    In 2001 there were no widely used scientific workflow engines
    Kepler is an open source workflow environment
    Based on the Ptolemy II system developed at UC Berkeley
    Started with several projects coming together based on a need for a flexible workflow environment
    Kepler-project.org
    Kepler has become one of the best-known and most widely used scientific workflow engines
  • 8. This talk focuses on work that I was directly involved in
    There was a lot of work performed by the SDM Center team that I managed but don't focus on here:
    Provenance tracking
    Dashboard
    Templates
    Patterns
    Deployed workflows: ITER, CPES, Combustion
    https://sdm.lbl.gov/sdmcenter/
    My research focused on raising the level of abstraction within scientific workflows
  • 9. Our first deployed workflow was managing a bioinformatics analysis pipeline (2002)
    In collaboration with Matt Coleman (LLNL)
  • 10. The TSI workflow was the first of our “standard” simulation workflows (2005)
    [Workflow diagram: submit batch request at NERSC; check job status (queued / running / done, with a delay loop); identify new complete files; transfer files to HPSS and to SB with transfer verification and local file deletion; extract variables, remap coordinates, create chemistry and neutrino variables, derive other variables; write diagnostic file; generate plots, thumbnails, and movies with Tools 1-4; update web page]
    In collaboration with Doug Swesty (Stony Brook)
  • 11. The workflow can be broken into several general steps
    [Same diagram, with the Job Submission steps highlighted]
  • 12. The workflow can be broken into several general steps
    [Same diagram, with the Job Monitoring steps highlighted]
  • 13. The workflow can be broken into several general steps
    [Same diagram, with the Moving Files steps highlighted]
  • 14. The workflow can be broken into several general steps
    [Same diagram, with the Data Analysis steps highlighted]
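The submit / monitor / transfer / analyze cycle described in these slides can be sketched as a simple polling loop. This is an illustration only, not the actual Kepler implementation; the helper functions are passed in as parameters because the real scheduler queries, HSI/scp transfers, and analysis tools are all site-specific.

```python
import time

def run_pipeline(check_status, list_complete_files, transfer, verify,
                 analyze, poll_interval=0.0):
    """Sketch of a TSI-style monitoring loop: poll the batch job and,
    as output files complete, archive them, verify the copy, then hand
    them to the analysis stage. All callables are hypothetical stand-ins."""
    processed = set()
    while True:
        status = check_status()                  # "queued", "running", "done"
        for f in list_complete_files():
            if f in processed:
                continue
            transfer(f, "hpss")                  # archive copy
            if verify(f, "hpss"):
                transfer(f, "analysis_host")     # copy for post-processing
                if verify(f, "analysis_host"):
                    analyze(f)                   # extract vars, plots, movies
                    processed.add(f)
        if status == "done":
            return processed
        time.sleep(poll_interval)                # delay before re-polling
```

Injecting the helpers keeps the orchestration logic separate from the machine-specific mechanisms, which is the same separation the workflow engine provides.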
  • 15. This translates into a complicated Kepler workflow
    Extensive use of nested workflows to compartmentalize steps
    160 instances of 18 distinct actors
    Over a dozen parameters to control workflow execution
  • 16. We ended up building several similar simulation science workflows
    Fusion science
    Combustion
    Subsurface science
    These all have the same general steps
    But there are significant differences in the details
  • 17. Unfortunately, workflows are not typically portable across machines
    User authentication mechanisms depend on machine-specific policies
    Job launch and monitoring features depend on the scheduler
    File transfer mechanisms depend on available infrastructure
  • 18. We developed generic actors as the first step in raising the level of abstraction for workflow design
    Generic actors embody general functionality into actors that work across platforms / workflows
    Improve workflow portability
    Simplify creation of new workflows
    Form the basis for sharing subworkflows
    Reduce the number of actor choices
  • 19. We identified several capabilities required across simulation workflows
    User authentication
    Job submission: submit job scheduling request to batch scheduler
    Job monitoring: track status of job from submitted, to running, to completed
    File transfer: move files, potentially between machines at different sites
    Developed and deployed actors capable of performing the desired functionality using available infrastructure
    Generalized to manage multiple implementations
    Parameters and contextual information determine which options to utilize
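The "generalized to manage multiple implementations" idea can be pictured as a dispatch layer: one generic interface, with the concrete mechanism chosen from contextual information. The sketch below is hypothetical; the command strings, machine profiles, and preference orders are invented for illustration, and the real generic actors are Kepler components rather than plain Python.

```python
# Sketch of a generic file-transfer actor: one interface, several
# machine-specific implementations, with context picking the mechanism.
# (Commands and machine profiles below are illustrative, not real site configs.)

TRANSFER_METHODS = {
    "hsi":     lambda src, dst: ["hsi", "put", src, ":", dst],
    "scp":     lambda src, dst: ["scp", src, dst],
    "gridftp": lambda src, dst: ["globus-url-copy", src, dst],
}

MACHINE_PROFILES = {                  # hypothetical per-site preference order
    "nersc":   ["hsi", "gridftp", "scp"],
    "generic": ["scp"],
}

def build_transfer_command(src, dst, machine, available):
    """Pick the first mechanism the machine prefers that is actually
    installed, then build its command line."""
    for method in MACHINE_PROFILES.get(machine, MACHINE_PROFILES["generic"]):
        if method in available:
            return TRANSFER_METHODS[method](src, dst)
    raise RuntimeError(f"no usable transfer method on {machine}")
```

The workflow itself only ever calls `build_transfer_command`, which is what makes the same workflow usable on machines with different infrastructure.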
  • 20. Use of generic actors improved workflow effectiveness
    Same workflow could be used on all of the DOE leadership-class machines
    Significantly less maintenance required
    Fewer workflows needed per science team
    Each workflow is simpler
    Still requires parameters to manage details of execution
  • 21. Workflow context can be used to reduce the number of explicit parameters
    Workflows run in a context that provides certain preferences:
    Systems
    User accounts
    Configuration files
    Information requested / computed / bound at run time instead of design time
    Initial results are promising, but more work is required to determine how effective run-time binding is for workflows
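One way to picture this run-time binding is as layered parameter lookup: explicit workflow parameters first, then user configuration, then system defaults, with anything still unresolved computed at run time. A minimal sketch using Python's `collections.ChainMap` (the layer names and keys are hypothetical):

```python
from collections import ChainMap

def make_context(explicit, user_config, system_defaults):
    """Layered context: later layers are consulted only when earlier
    ones lack the key (layer names are illustrative)."""
    return ChainMap(explicit, user_config, system_defaults)

def resolve(context, key, compute=None):
    """Look the key up in the layered context; if it is absent
    everywhere, fall back to computing it at run time."""
    if key in context:
        return context[key]
    if compute is not None:
        return compute()
    raise KeyError(key)
```

The effect is that a workflow only needs explicit parameters for values that genuinely vary per run; everything else is drawn from the context it executes in.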
  • 22. Scientific workflows still have major adoption challenges to overcome
    The correlation between the scientific process and the executable workflow is loose at best
    Executable workflows are extremely complex and usually require a dedicated workflow designer to create
    The translation from idea to napkin drawing to executable workflow is challenging and lossy
  • 23. The scientific process is collaborative, fluid, and time sensitive
    Important decisions are made in meetings and conversations; records are distributed and not easily associated with specific tasks
    Decisions can be revisited and changed
    Science is inherently iterative
    Executable workflows document the results of these decisions but lack broader context
    Electronic lab notebooks provide some contextual information but lack details and external information / links
    Provenance provides some associations
  • 24. Need a way to allow scientists to collect and share information about their experiments
    A single location capable of collecting all relevant information about an experiment
    Information needs to be related in a meaningful way
    Temporal information must be preserved
    Working with collaborators at UTEP, we developed a prototype of what this could look like
  • 25. Our prototype is built on annotating abstract workflows
    Design principles:
    Workflow construction needs to be a byproduct of information collection
    Information should not need to be entered more than once
    Annotations should relate to specific steps in the process
  • 26. The research hierarchy contains the steps in the abstract workflow
    Steps are conceptual
    At the top level, these outline the major steps in the experiment being performed:
    Get data
    Create conceptual model
    Generate model input
    Run simulation
    Each step can have sub-steps within it to refine the concept further (nested structure)
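The nested structure described here can be sketched as a simple recursive data type. This is an illustration of the idea only; the field names are invented and not the prototype's actual model.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A conceptual step in the research hierarchy: free-form text
    describes its purpose, and sub-steps refine it further.
    (Field names are hypothetical.)"""
    name: str
    description: str = ""
    substeps: list = field(default_factory=list)

    def add(self, substep):
        """Attach a refinement of this step and return it for chaining."""
        self.substeps.append(substep)
        return substep

    def outline(self, depth=0):
        """Flatten the hierarchy into indented outline lines."""
        lines = ["  " * depth + self.name]
        for s in self.substeps:
            lines.extend(s.outline(depth + 1))
        return lines
```

A top-level `Step` then plays the role of the experiment itself, with "Get data", "Run simulation", and so on as its children.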
  • 27. Free-form text is associated with each step in the hierarchy
    Allows scientists to easily describe a step’s purpose
    Top level describes the entire experiment
    Decisions are captured under the research specs tab
  • 28. The process view shows the steps as a workflow
    Ports are used to identify inputs and outputs
    Lines between steps indicate information flow between steps
    Steps should, eventually, connect
  • 29. Steps are connected by linking input and output parameters (ports)
    Inputs and outputs are linked
    Comment field holds assumptions and constraints from the “other side” of the line
    Free-form text makes it easy to input information, but impossible to perform automatic verification
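The port-linking idea, including the verification limit noted above, can be sketched in a few lines. The names are hypothetical; the point is that with free-form comments the only property a tool can check automatically is connectivity, not content.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    """An input or output of a conceptual step (names invented)."""
    step: str
    name: str

@dataclass
class Link:
    """Connects an output port to an input port; the free-form comment
    holds assumptions and constraints from the other side of the line."""
    source: Port
    target: Port
    comment: str = ""

def unconnected_inputs(links, input_ports):
    """Report input ports with no incoming link yet. Connectivity is
    checkable; the comments themselves cannot be verified automatically."""
    linked = {link.target for link in links}
    return [p for p in input_ports if p not in linked]
```

A design tool built on this could flag dangling inputs while leaving interpretation of the comments to the scientist.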
  • 30. Zooming in on a specific sub-step provides additional information about that step
    A new tab provides (sub-)step-specific information
    The process view is updated to reflect the sub-steps contained within this step
    Note that the inputs and outputs to the workflow come from the higher-level workflow
  • 31. Eventually, some steps correspond to executable (Kepler) workflows
    Prototype expands Kepler infrastructure
    Executable workflows are (still) typically created by a dedicated workflow designer
    This places the executable workflow in the broader context of the experiment it is supporting
    Provenance can be linked into the overall experiment
  • 32. Annotations are stored in RDF to support export / import
    Semantically Interlinked Online Communities (SIOC) format chosen
    Supports other tools using these annotations:
    Report generation
    Search / query
    Experiment-level provenance information
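An exported annotation would amount to a handful of RDF statements. The hand-rolled sketch below shows the shape of such an export; the subject URI, the choice of `sioc:content` and `dcterms:created` as predicates, and the N-Triples-style serialization are illustrative guesses, not the prototype's actual schema.

```python
# Sketch: serialize one step annotation as N-Triples-style RDF
# statements using SIOC-like terms. URIs and predicate choices are
# illustrative, not the prototype's real vocabulary.

SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"

def annotation_triples(step_uri, text, created):
    """Emit (subject, predicate, object) triples for one annotation."""
    return [
        (step_uri, SIOC + "content", text),
        (step_uri, DCTERMS + "created", created),
    ]

def to_ntriples(triples):
    """Subjects and predicates are URIs; objects here are plain literals."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)
```

Because the export is plain RDF, any downstream tool that understands SIOC terms can consume it for reporting, search, or provenance without knowing about the prototype.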
  • 33. This prototype represents a starting point for answering many interesting questions
    How do you effectively link other sources of information into steps in an abstract workflow?
    How do you select only the relevant information?
    How do you manage provenance and attribution in a distributed environment?
    What is the best way to organize this information for people filling a variety of roles?
    PIs need a different view than workflow designers or bench scientists
    How do you effectively share (subsets of) this information?
    How do you implement access controls effectively?
  • 34. This prototype represents a starting point for answering many interesting questions
    Are workflows the right abstraction for representing the scientific process?
    Representing evolution over time is challenging in workflows
    Does everything have to correspond to a step?
    Is there a way to generate parts of an executable workflow given an abstract definition?
    Can we match steps to specific actors?
    Could you develop a generic set of wizards or templates?
  • 35. Conclusions
    The SDM Center has been at the forefront of scientific workflow R&D
    Workflows have been successfully deployed across a wide variety of scientific domains
    Significant advances have been made in making workflow engines more reliable and useful
    There remains significant work required to:
    Fit workflows within the context of the overall scientific process
    Allow scientists to design and implement their own workflows
  • 36. This work involved many, many people
    My team: George Chin, Chandrika Sivaramakrishnan, Xiaowen Xin (LLNL), Anand Kulkarni, Anne Ngu (TX State), Paulo Pinheiro da Silva (UTEP), Aida Gandara (UTEP)
    Other SPA team members: Ilkay Altintas, Bertram Ludaescher, Mladen Vouk, Claudio Silva, Scott Klasky, Norbert Podhorszki, Dan Crawl, Ayla Khan, Arie Shoshani
    Plus other students and researchers who were involved for shorter times