ORNL is managed by UT-Battelle LLC for the US Department of Energy
Enhancing research orchestration capabilities at ORNL
Tyler J. Skluzacek
Research Scientist
Oak Ridge National Laboratory
Oak Ridge Leadership Computing Facility
As research becomes more autonomous, the landscape of ‘human-machine interaction’ evolves…
Task (who does it?)
• Initiate experiment
• Operate instrument
• Initiate data transfers
• Initiate compute
• Computes!
• Validate outputs
• Initiate reactionary analysis
• Publish, clean up
Human driven
Machine driven
Our community has converged on ‘automation when possible; humans when required’.
Requirements driven
Requirements driven experimentation demands software that enables humans or machines to ‘drive’.
Users: Human, Machine
• Zambeze: distributed workflow orchestration
• The OLCF Facility API for easy, remote, reliable interactions with resources (/status, /compute, /data …)
• Globus Flows
• … and more soon!
OLCF
The OLCF Facility API enables users to remotely interact with our resources
Not a new idea, but we can build on it!
• FirecREST (CSCS): https://firecrest.readthedocs.io/en/latest/overview.html#gateway
• SuperFacility (NERSC): https://www.nersc.gov/research-and-development/superfacility/
• Tapis (TACC): https://tapis.readthedocs.io/en/latest/technical/index.html
Coming soon: direct support for computational workflows
“The workflow representing how I perform computation”
1. Project-level auth
2. User-level auth
3. Check whether the resource is online (/status)
4. Submit job (/compute)
5. Monitor job status
6. Send data (/data… coming soon!)
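The steps above can be sketched as a small client. This is only an illustration under stated assumptions, not the OLCF Facility API itself: the slides name the /status and /compute routes, while the base URL, header names, URL layout below those routes, and JSON fields (`state`, `job_id`) are hypothetical. Step 6 is left out because /data is still marked as coming soon.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Hypothetical base URL; the slides only name the /status, /compute and /data routes.
BASE_URL = "https://facility-api.example.invalid"

# A transport takes (method, url, headers, body) and returns the decoded JSON reply.
Transport = Callable[[str, str, dict, Optional[dict]], dict]


def http_transport(method: str, url: str, headers: dict, body: Optional[dict] = None) -> dict:
    """Send one JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={**headers, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode())


def run_job(project_token: str, user_token: str, resource: str, job_spec: dict,
            transport: Transport = http_transport,
            poll_seconds: float = 5.0, max_polls: int = 100) -> dict:
    # Steps 1-2: project-level and user-level auth, modeled as two headers (assumed).
    headers = {"X-Project-Token": project_token,
               "Authorization": f"Bearer {user_token}"}

    # Step 3: check whether the resource is online.
    status = transport("GET", f"{BASE_URL}/status/{resource}", headers, None)
    if status.get("state") != "online":
        raise RuntimeError(f"{resource} is not online: {status}")

    # Step 4: submit the job.
    job = transport("POST", f"{BASE_URL}/compute/{resource}/jobs", headers, job_spec)
    job_id = job["job_id"]

    # Step 5: poll until the job reaches a terminal state (state names assumed).
    for _ in range(max_polls):
        info = transport("GET", f"{BASE_URL}/compute/{resource}/jobs/{job_id}", headers, None)
        if info["state"] in ("completed", "failed"):
            return info
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the transport in as a parameter means the same function can be driven by a person at a terminal or by another program, and it can be exercised without network access by substituting a stub.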
What if we could largely ignore cross-facility configuration/scheduling and just focus on science?
Science campaign activities: microscope capture → train model → validate model → create visualization → store visualization
Distributed workflow orchestration: the act of organizing and executing application and data flow between separate workflow management systems, between potentially separate compute and storage resources.
workflow orchestration != workflow management
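To make the orchestration/management distinction concrete, here is a minimal sketch. It assumes each site already has its own workflow manager, modeled as a plain function; the orchestrator only decides which manager receives which campaign activity and passes context along. The site labels and the `Activity` type are hypothetical and are not taken from Zambeze.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A site-local workflow manager: takes an activity name and the campaign context,
# runs the activity however that site prefers, and returns the updated context.
Manager = Callable[[str, dict], dict]


@dataclass
class Activity:
    name: str
    site: str  # hypothetical label for the facility or resource that runs it


def make_manager(site: str) -> Manager:
    """Stand-in for a real workflow management system at one site."""
    def run(activity: str, context: dict) -> dict:
        return {**context, "log": context["log"] + [f"{site}:{activity}"]}
    return run


def orchestrate(campaign: List[Activity], managers: Dict[str, Manager]) -> dict:
    """Hand each activity, in order, to the manager at its site."""
    context: dict = {"log": []}
    for activity in campaign:
        context = managers[activity.site](activity.name, context)
    return context


campaign = [
    Activity("microscope capture", "instrument"),
    Activity("train model", "hpc"),
    Activity("validate model", "hpc"),
    Activity("create visualization", "hpc"),
    Activity("store visualization", "storage"),
]
managers = {site: make_manager(site) for site in ("instrument", "hpc", "storage")}
result = orchestrate(campaign, managers)
```

The orchestrator never schedules a job itself; it only routes activities, which is the sense in which orchestration differs from management here.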
Zambeze for automated and distributed workflow orchestration
[Figure: Zambeze architecture diagram. Facility A: Compute A1, Compute A2, storage A1, storage A2, instrument A2. Facility B: Compute B1 (label appears twice), storage B, instrument B2. Agents are drawn at both facilities. Legend: activity messages, control messages, data.]
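One way to picture agents exchanging activity and control messages is the toy model below. It is a sketch under loud assumptions: the `Message` and `Agent` classes, the in-memory queue, and the `stop` command are all hypothetical and do not reflect Zambeze's actual interfaces or transport.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass
class Message:
    kind: str      # "activity" or "control", matching the figure's legend
    target: str    # name of the agent that should handle the message
    payload: dict


class Agent:
    """Toy agent: records activities it runs and obeys a 'stop' control message."""

    def __init__(self, name: str, bus: "queue.Queue[Message]") -> None:
        self.name = name
        self.bus = bus
        self.running = True
        self.completed = []

    def handle(self, msg: Message) -> None:
        if msg.kind == "control":
            if msg.payload.get("command") == "stop":
                self.running = False
        elif msg.kind == "activity" and self.running:
            self.completed.append(msg.payload["name"])
            follow_up = msg.payload.get("then")
            if follow_up:  # forward the next activity, possibly to another facility
                self.bus.put(Message("activity", follow_up["target"], follow_up["payload"]))


def drain(bus: "queue.Queue[Message]", agents: Dict[str, Agent]) -> None:
    """Deliver queued messages until the bus is empty."""
    while not bus.empty():
        msg = bus.get()
        agents[msg.target].handle(msg)


bus: "queue.Queue[Message]" = queue.Queue()
agents = {name: Agent(name, bus) for name in ("A1", "B1")}

# An activity at facility A that triggers a follow-up activity at facility B.
bus.put(Message("activity", "A1", {
    "name": "microscope capture",
    "then": {"target": "B1", "payload": {"name": "train model"}},
}))
drain(bus, agents)

# A control message pauses B1, so the next activity sent to it is ignored.
bus.put(Message("control", "B1", {"command": "stop"}))
bus.put(Message("activity", "B1", {"name": "validate model"}))
drain(bus, agents)
```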
Zambeze enables cross-facility analysis
AtomAI use case
• deep learning models for semantic segmentation
• assign each pixel to a category of ‘what it represents’
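As a small illustration of what assigning each pixel to a category means, the sketch below takes per-class score maps (which a trained model would produce) and picks the highest-scoring class for every pixel. It is plain Python with made-up scores and does not use or represent the AtomAI API.

```python
from typing import List

ScoreMap = List[List[float]]  # scores[y][x] for one class


def segment(scores: List[ScoreMap]) -> List[List[int]]:
    """Return labels[y][x] = index of the class with the highest score at that pixel."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical classes (0 = background, 1 = atom) on a 2x2 image.
scores = [
    [[0.9, 0.2], [0.4, 0.6]],  # class 0
    [[0.1, 0.8], [0.6, 0.4]],  # class 1
]
labels = segment(scores)  # [[0, 1], [1, 0]]
```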
Globus Flows @ OLCF
• Action providers allow flexible access to a breadth of APIs
– Future: OLCF Facility API!
• Enables human-in-the-loop input
– Link to Globus web console
• Globus Auth enables secure access to most* facilities
• Extremely fast time-to-implementation
– Organization already has DTNs (data transfer nodes)
– Globus Compute is fast to install and uses existing virtual environments
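The bullets above can be pictured as a flow definition: a small state machine in which each state calls an action provider. The sketch below writes one as a Python dictionary and checks that it forms a single path from start to end. The overall shape (`StartAt`, `States`, `Type`, `ActionUrl`, `Parameters`, `ResultPath`, `Next`, `End`) follows the documented Globus Flows definition format as I understand it, but every URL, state name, and parameter name here is a placeholder assumption to be checked against the Flows documentation, and the helper `linear_path` is hypothetical.

```python
# All ActionUrl values are placeholders, not real action providers.
flow_definition = {
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            "ActionUrl": "https://transfer.example.invalid/action",
            # Keys ending in ".$" take their value from the flow's input (assumed names).
            "Parameters": {
                "source_path.$": "$.input.source_path",
                "destination_path.$": "$.input.destination_path",
            },
            "ResultPath": "$.TransferResult",
            "Next": "RunAnalysis",
        },
        "RunAnalysis": {
            "Type": "Action",
            "ActionUrl": "https://compute.example.invalid/action",
            "Parameters": {"function_name": "reconstruct"},  # hypothetical
            "ResultPath": "$.AnalysisResult",
            "Next": "HumanReview",
        },
        "HumanReview": {
            # Hypothetical provider that waits for a person to approve the result.
            "Type": "Action",
            "ActionUrl": "https://review.example.invalid/action",
            "Parameters": {"prompt": "Approve reconstruction?"},
            "ResultPath": "$.Review",
            "End": True,
        },
    },
}


def linear_path(definition: dict) -> list:
    """Walk the states from StartAt to the End state, checking each Action has a URL."""
    states = definition["States"]
    name = definition["StartAt"]
    path = []
    while True:
        if name in path:
            raise ValueError(f"cycle at state {name!r}")
        state = states[name]
        if state["Type"] == "Action" and "ActionUrl" not in state:
            raise ValueError(f"state {name!r} has no ActionUrl")
        path.append(name)
        if state.get("End"):
            return path
        name = state["Next"]


steps = linear_path(flow_definition)
```

Because each step is just a URL plus parameters, a future OLCF Facility API action provider could slot into such a definition as one more state.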
Globus Flows for tomographic reconstruction
In collaboration with Ryan Chard; credit to Will Engler for ALCF AP
…
In conclusion, Globus enables OLCF to provide key research orchestration capabilities.
If you would like to learn more, please reach out!
Tyler J. Skluzacek
Research Scientist
skluzacektj@ornl.gov
Others to thank for these efforts:
Paul Bryant, Ryan Chard, Rafael Ferreira da Silva, Ryan Prout, A.J. Ruckman, Renan Santos Souza, Mark Coletti, Fred Suter, Gavin Wiggins