SlideShare a Scribd company logo
The Interplay of Workflow Execution
and Resource Provisioning
Rafael Ferreira da Silva
18th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP)
MS10 Resource Management, Scheduling, Workflows: Critical Middleware for HPC and Clouds
March 7, 2018
Funded by the National Science Foundation under grants #1741040 and #1664162
https://pegasus.isi.edu
OUTLINE
Introduction Workflow Systems Workflow Execution
Workflow Optimizations Workload Characterization
Compute Pipelines
Scientific Workflows
Scientific Achievements
Basic Concepts
Execution Challenges
Heterogeneous Environments
Fine-Grained Workflows on
HPC Systems
HPC/HTC
Resource Provisioning
Fault-Tolerance
Task Clustering
Workflow Restructuring
2
Extreme Scale
In Transit Processing
In Situ Processing
https://pegasus.isi.edu 3
Compute Pipelines
Building Blocks
Compute Pipelines
Allows scientists to connect different codes together and execute their
analysis
Pipelines can be very simple (independent or parallel) jobs or complex
represented as DAGs
However, it is still up-to user to figure out
Data Management
How do you ship in the small/large amounts data required by your
pipeline and protocols to use?
How best to leverage different infrastructure setups
Grids have no shared filesystem while XSEDE and your local campus
cluster has one!
Debug and Monitor Computations.
Correlate data across lots of log files
Need to know what host a job ran on and how it was invoked
Restructure Workflows for Improved Performance
Short running tasks? Data placement
https://pegasus.isi.edu
Why Pegasus
4
> > >Automation
Enables parallel, distributed computations
Automatically executes data transfers
Reproducibility
Reusable, aids reproducibility
Records how data was produced (provenance)
Automate
Recover
Debug
Recover & Debug
Handles failures with to provide reliability
Keeps track of data and files
NSF funded project since 2001, with
close collaboration with HTCondor
team
http://pegasus.isi.edu
https://pegasus.isi.edu
Scientific Achievements
5
60,000 compute tasks
Input Data: 5,000 files (10GB total)
Output Data: 60,000 files (60GB total)
executed on LIGO Data Grid,
Open Science Grid and XSEDE
https://pegasus.isi.edu 6
Advanced LIGO
PyCBC Workflow
One of the main pipelines to measure the statistical significance of
data needed for discovery
Contains 100’s of thousands of jobs and accesses on order of terabytes
of data
Uses data from multiple detectors
For the detection, the pipeline was executed on Syracuse and Albert
Einstein Institute Hannover
A single run of the binary black hole + binary neutron star search
through the O1 data (about 3 calendar months of data with 50% duty
cycle) requires a workflow with 194,364 jobs
Generating the final O1 results with all the review required for the first
discovery took about 20 million core hours
PyCBC Papers: An improved pipeline to search for gravitational waves from compact binary coalescence. Samantha Usman, Duncan Brown et al.
The PyCBC search for gravitational waves from compact binary coalescence, Samantha Usman et al (https://arxiv.org/abs/1508.02357)
PyCBC Detection GW150914: First results from the search for binary black hole coalescence with Advanced LIGO. B. P. Abbott et al.
https://pegasus.isi.edu 7
Southern California Earthquake
Center’s CyberShake
Builders ask seismologists: What will the peak ground
motion be at my new building in the next 50 years?
Seismologists answer this question using Probabilistic
Seismic Hazard Analysis (PSHA)
286 sites, 4 models
each workflow has
420,000 tasks
CPU jobs (Mesh generation, seismogram synthesis):
1,094,000 node-hours
GPU jobs: 439,000 node-hours
AWP-ODC finite-difference code
5 billion points per volume, 23000 timesteps
200 GPUs for 1 hour
Titan:
421,000 CPU node-hours, 110,000 GPU node-hours
Blue Waters:
673,000 CPU node-hours, 329,000 GPU node-hours
https://pegasus.isi.edu
Enabled cutting-edge domain science (e.g., drug
delivery) through collaboration with scientists at the DoE
Spallation Neutron Source (SNS) facility
A Pegasus workflow was developed that confirmed that
nanodiamonds can enhance the dynamics of tRNA
It compared SNS neutron scattering data with MD
simulations by calculating the epsilon that best
matches experimental data
Ran on a Cray XE6 at NERSC using 400,000 CPU
hours, and generated 3TB of data
Water is seen as small red and
white molecules on large
nanodiamond spheres. The
colored tRNA can be seen on
the nanodiamond surface.
(Image Credit: Michael
Mattheson, OLCF, ORNL)
An automated analysis workflow for optimization of force-field parameters using
neutron scattering data. V. E. Lynch, J. M. Borreguero, D. Bhowmik, P. Ganesh, B. G.
Sumpter, T. E. Proffen, M. Goswami, Journal of Computational Physics, July 2017.
Impact on DOE Science
8
https://pegasus.isi.edu 9
Soybean Workflow
TACC Wrangler as Execution Environment
Flash Based Shared Storage
Switched to glideins (pilot jobs) - Brings in remote compute nodes
and joins them to the HTCondor pool on the submit host
Workflow runs at a finer granularity
Works well on Wrangler due to more cores and memory per node
(48 cores, 128 GB RAM)
https://pegasus.isi.edu 10
cleanup job
removes unused data
stage-in job
stage-out job
registration job
transfers the workflow input data
transfers the workflow output data
registers the workflow output data
DAGdirected-acyclic graphs
Portable Description
Users do not worry about
low level execution details
abstract
workflow
executable
workflow
transformation
executables (or programs)
platform independent
logical filename (LFN)
platform independent (abstraction)
Basic Concepts
https://pegasus.isi.edu
Overview of Workflow Execution Challenges
11
Workflow
Management
System
Workflow
Design
Workflow
Structure
Sequence
Parallelism
Modeling /
Composition
Language
Scheduling / Resource
Provisioning
Architecture
HPC /
HTC
Optimization
Clustering
/ Cleanup
Pilot
Jobs
Interopera
bility
Task
Estimation
Runtime /
Energy /
Disk
Fault-
Tolerance
Task-level
Anomaly
Detection
Workflow-
level
Exception
Handling
Data
Movement
Data
Locality
Network
Proximity
Reconfigurable
Network
SDN
Execution
Monitoring
Provenance
In situ
NVRAM /
Data
Spaces
In
transit
HDFS /
Burst
Buffers
Deelman, E., et al. The Future of Scientific Workflows, The International Journal of High Performance
Computing Applications, vol. 32 (1), 2017.
https://pegasus.isi.edu
Input data site
Data staging site
shared
filesystem
submit host
(e.g., user’s laptop)
Compute site B
Compute site A
Data staging site
Input data site
Output data site
Output data site
object storage
Heterogeneous Environments
12
HPC
Cloud
https://pegasus.isi.edu
Workflow Optimization: Task Clustering
13
Task Granularity
Low performance of lightweight (fine-grained) tasks
High queuing times
Communication overheads
Solutions
Group into coarse-grained tasks reduces the cost of data
transfers when grouped tasks share input data, and saves
queuing time
Use pilot jobs to dynamically provision a number of
resources at a time
Chen, W., et al. "Dynamic and fault-tolerant clustering for scientific workflows." IEEE Transactions on
Cloud Computing 4.1 (2016): 49-62.
Chen, W., et al. "Using imbalance metrics to optimize task clustering in scientific workflow executions."
Future Generation Computer Systems 46 (2015): 69-84.
Ferreira da Silva, R., et al. "Controlling fairness and task granularity in distributed, online, non‐clairvoyant
workflow executions." Concurrency and Computation: Practice and Experience 26.14 (2014): 2347-2366.
https://pegasus.isi.edu
HPC System
submit host
(e.g., user’s laptop)
workflow wrapped as an MPI job
Allows sub-graphs of a Pegasus workflow to be
submitted as monolithic jobs to remote
resources
Fine-Grained Workflows on HPC systems
14
Dreher, M., et al. "In Situ Workflows at Exascale: System Software to the Rescue." Proceedings of the In
Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ACM, 2017.
Ferreira da Silva, R., et al. "A characterization of workflow management systems for extreme-scale
applications." Future Generation Computer Systems 75 (2017): 228-238.
https://pegasus.isi.edu
Workflow Optimization: Workflow Restructuring
15
Chen, W., and Deelman, E.. "Integration of workflow partitioning and resource
provisioning." Proceedings of the 2012 12th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing (ccgrid 2012). IEEE Computer Society, 2012.
Srinivasan, S., et al. "A cleanup algorithm for implementing storage constraints in
scientific workflow executions." Proceedings of the 9th Workshop on Workflows in
Support of Large-Scale Science. IEEE Press, 2014.
Partition the workflow into subworkflows and send
them for execution to the target system
Storage constraints to limit or minimize the
amount of disk space that workflows use on shared
storage systems (use of cleanup jobs)
https://pegasus.isi.edu
Resource Provisioning: Workload Characteristics (HPC)
16
487
Total Number of Users
MIRA (ALCF)
78,782
Total Number of Jobs
5,698
CPU hours (millions)
6,093
Avg. Runtime (seconds)
786,432
Total Number of Cores
49,152
Total Number of Nodes
2014
Physics
73 Users
24,429 Jobs
2,256 CPU hours (millions)
7,147 Avg. Runtime (sec)
Materials Science
77 Users
12,546 Jobs
895 CPU hours (millions)
5,820 Avg. Runtime (sec)
Chemistry
51 Users
10,286 Jobs
810 CPU hours (millions)
6,131 Avg. Runtime (sec)
Schlagkamp, S., et al. ”Consecutive Job Submission Behavior at Mira
Supercomputer." 25th ACM International Symposium on High-
Performance Parallel and Distributed Computing (HPDC), 2016.
https://pegasus.isi.edu
Resource Provisioning: Workload Characteristics (HTC)
17
COMPACT MUON SOLENOID (CMS)
1,435,280
Total Number of Jobs
AUGUST 2014 OCTOBER 2014
392
Total Number of Users
75
Execution Sites
15,484
Execution Nodes
792,603
Completed Jobs
385,447
Exit Code (!= 0)
9,444.6
Avg. Runtime (sec)
55.3
Avg. Disk Usage (MB)
408
Total Number of Users
15,034
Execution Nodes
476,391
Exit Code (!= 0)
32.9
Avg. Disk Usage (MB)
1,638,803
Total Number of Jobs
72
Execution Sites
816,678
Completed Jobs
9967.1
Avg. Runtime (sec)
Schlagkamp, S., et al. ” Understanding User Behavior: from
HPC to HTC" Procedia Computer Science (ICCS’16), 2016.
https://pegasus.isi.edu
Resource Provisioning: Resource Estimation
18
Jobs’ resource requirements
at Mira
Job submission interarrival
times per day
Job submission
interarrival times per hour
Jobruntime
MemoryUsage
DiskUsage
Ferreira da Silva, Rafael, et al. "Online task resource consumption prediction for scientific workflows."
Parallel Processing Letters 25.03 (2015): 1541003.
https://pegasus.isi.edu
Network Provisioning
19
We predicted completion times of current and future workflow
tasks and start times for data staging under different scenarios
to provision appropriate compute & network resources
We developed mechanisms to arbitrate and prioritize data
flows from competing workflows by leveraging an advanced
network provisioning technologies, a virtual Software Defined
Exchange (SDX)
Virtualized
SDX
Site 1
OpenFlow
Controller Site 3Site 2
Data Site
ExoGENI Slice with Virtual SDX + HTCondor Pools + Data Site
Dynamic
Connectivity
HTCondor
Pool
HTCondor
Pool
HTCondor
Pool
10-80 20-70 30-60 40-50 45-45 50-40 60-30 70-20 80-10
Relative data flow priorities (Workflow1-Workflow2)
0
200
400
600
800
1000
1200
1400
1600
1800
Avg.transfertimefor"individual"jobs(secs)
Transfer time vs. Flow priorities (Different bandwidths)
Workflow1 on HTCondor pool 1 (vSDX bandwidth = 250Mb/s)
Workflow2 on HTCondor pool 2 (vSDX bandwidth = 250Mb/s)
Workflow1 on HTCondor pool 1 (vSDX bandwidth = 500Mb/s)
Workflow2 on HTCondor pool 2 (vSDX bandwidth = 500Mb/s)
Effect of relative flow priorities on observed
data transfer times for data-intensive
workflow tasks for different vSDX
provisioned bandwidths
Deelman, E., et al. “PANORAMA: An Approach to Performance Modeling and Diagnosis of Extreme
Scale Workflows”, International Journal of High Performance Computing Applications, vol. 31, 2017.
Mandal, A., et al. “Toward Prioritization of Data Flows for Scientific Workflows Using Virtual Software
Defined Exchanges”, First International Workshop on Workflow Science (WoWS 2017), 2017.
https://pegasus.isi.edu
Typical Approaches
Task Retries
Task Resubmission
Task Clustering
Checkpointing
Provenance
…
Statistical and
Machine Learning
Linear Regression
Neural Networks
Classification Algorithms
Tree-based Methods
Support Vector Machines
…
Others
Exception Handling
Game Theory
…
Analytical Solutions
Failure Modeling
Markov Chains
Principal Component Analysis
Histograms
…
20
Some Approaches to Handle Faults
https://pegasus.isi.edu
Fault-Tolerance: Anomaly Detection
21
Anomaly Detection System
Developed Auto-Regression (AR) based statistical methods to detect
anomalies for single metrics
AR coefficients and degree of AR calculated from training runs AR model
During application execution, use coefficients to generate predictions; when
error exceeds a threshold, generate triggers
Experimented with anomaly detection by injecting artificial, competing loads Integrated workflow monitoring and anomaly detection system
Mandal, A. et al. “Toward an End-to-end Framework for
Modeling, Monitoring, and Anomaly Detection for
Scientific Workflows”, Workshop on Large-Scale Parallel
Processing (LSPP’16), 2016.
https://pegasus.isi.edu
Fault Tolerance: Preventing Faults using PID Controllers
22
P
I
D
ΣΣ Process
OutputSetpoint Input+
-
+
+
+
z
0.0
0.5
1.0
1.5
TimeDead time
Raise time
Percent overshoot
Settling time
Steady-state error
DATAFOOTPRINT
PID controller aims at detecting the possibility of a
fault far enough in advance so that an action can
be performed to prevent it from happening
Ferreira da Silva, R., et al. “Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific
Workflows”, 11th Workflows in Support of Large-Scale Science (WORKS’16), 2016.
https://pegasus.isi.edu
In transit processing: Burst Buffers in Workflow Systems
23
Automated BB Reservations
BB reservations operations (either persistent or scratch)
consist in the creation and release, as well as stage in and
stage out operations
Transient reservations: needs to implement stage in/out
operations at the beginning/end of each job execution
Execution Directory
Automated mapping between the workflow execution
directory and the BB reservation
No changes to the application code are necessary,
and the application job directly writes its output to
the BB reservation
Placement of the burst buffer nodes within the Cori system (NERSC)
Ferreira da Silva, R., et al. "On the use of burst buffers for accelerating data-intensive scientific workflows."
Proceedings of the 12th Workshop on Workflows in Support of Large-Scale Science. ACM, 2017.
0
2500
5000
7500
10000
1 4 8 16 32 64 128 256 313
# Nodes
MiB/s
BB no−BB
fx.sgt fy.sgt
1 4 8 16 32 64 128 1 4 8 16 32 64 128
0
500
1000
1500
2000
# Nodes
Time(seconds)
BB No−BB
https://pegasus.isi.edu
In situ Processing: WMS collocated with the simulations
24
Integration with In situ Processing Systems
DataSpaces
Distributed in-memory data staging framework
Decaf
Dataflow system for the parallel communication of
coupled tasks
ADIOS
Adaptable IO System
Challenges
Efficient Data Exchanges
Tasks and workflow communication to exchange data
User Code Integration
Expertise required for modifying programs
Task Placement
Management of data locality between tasks
Resilience
Increased probability of failures in large resources
Dynamicity
Workflow graph may need to be restructured at runtime
Portability
Specialized hardware, specific libraries
Dreher, M., et al. "In Situ Workflows at Exascale: System Software to the Rescue." Proceedings of the In
Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ACM, 2017.
Ferreira da Silva, R., et al. "A characterization of workflow management systems for extreme-scale
applications." Future Generation Computer Systems 75 (2017): 228-238.
The Interplay of Workflow Execution
and Resource Provisioning
Rafael Ferreira da Silva, Ph.D.
Research Assistant Professor, Computer Science Department
Computer Scientist, USC Information Sciences Institute
rafsilva@isi.edu – http://rafaelsilva.com
Thank You
Questions?
Funded by the National
Science Foundation
under grants #1741040
and #1664162
http://pegasus.isi.edu
http://analytics4md.org

More Related Content

What's hot

Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
Anubhav Jain
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
Larry Smarr
 
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
Rafael Ferreira da Silva
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
Frederic Desprez
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
Robert Grossman
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
Larry Smarr
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
Ian Foster
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
Robert Grossman
 
Petrel: A Programmatically Accessible Research Data Service
Petrel: A Programmatically Accessible Research Data ServicePetrel: A Programmatically Accessible Research Data Service
Petrel: A Programmatically Accessible Research Data Service
Globus
 
DuraMat Data Analytics
DuraMat Data AnalyticsDuraMat Data Analytics
DuraMat Data Analytics
Anubhav Jain
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
Anubhav Jain
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...
Anubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Ian Foster
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
PacificResearchPlatform
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
Miha Ahronovitz
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
Robert Grossman
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
Robert Grossman
 

What's hot (20)

Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ...
 
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
Automating Real-time Seismic Analysis Through Streaming and High Throughput W...
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Petrel: A Programmatically Accessible Research Data Service
Petrel: A Programmatically Accessible Research Data ServicePetrel: A Programmatically Accessible Research Data Service
Petrel: A Programmatically Accessible Research Data Service
 
DuraMat Data Analytics
DuraMat Data AnalyticsDuraMat Data Analytics
DuraMat Data Analytics
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
 

Similar to The Interplay of Workflow Execution and Resource Provisioning

Scientific
Scientific Scientific
Scientific
marpierc
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
LevLafayette1
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
marpierc
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
Intel IT Center
 
Usage Patterns to Provision for Scientific Experiments in Clouds
Usage Patterns to Provision for Scientific Experiments in CloudsUsage Patterns to Provision for Scientific Experiments in Clouds
Usage Patterns to Provision for Scientific Experiments in Clouds
Eran Chinthaka Withana
 
Larry Smarr - NRP Application Drivers
Larry Smarr - NRP Application DriversLarry Smarr - NRP Application Drivers
Larry Smarr - NRP Application Drivers
Larry Smarr
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
Ian Foster
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
AI Super computer update
AI Super computer update AI Super computer update
AI Super computer update
Ganesan Narayanasamy
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
Ganesan Narayanasamy
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
BOSC 2010
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
Ian Foster
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
Ian Foster
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
Yahoo Developer Network
 
A Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterA Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing Center
James McGalliard
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
Amazon Web Services
 
PNNL April 2011 ogce
PNNL April 2011 ogcePNNL April 2011 ogce
PNNL April 2011 ogce
marpierc
 
Foss4G 2009 Scenz Grid
Foss4G 2009 Scenz GridFoss4G 2009 Scenz Grid
Foss4G 2009 Scenz Grid
noho
 
Research on Blue Waters
Research on Blue WatersResearch on Blue Waters
Research on Blue Waters
inside-BigData.com
 

Similar to The Interplay of Workflow Execution and Resource Provisioning (20)

Scientific
Scientific Scientific
Scientific
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
 
Usage Patterns to Provision for Scientific Experiments in Clouds
Usage Patterns to Provision for Scientific Experiments in CloudsUsage Patterns to Provision for Scientific Experiments in Clouds
Usage Patterns to Provision for Scientific Experiments in Clouds
 
Larry Smarr - NRP Application Drivers
Larry Smarr - NRP Application DriversLarry Smarr - NRP Application Drivers
Larry Smarr - NRP Application Drivers
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
AI Super computer update
AI Super computer update AI Super computer update
AI Super computer update
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
A Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterA Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing Center
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
PNNL April 2011 ogce
PNNL April 2011 ogcePNNL April 2011 ogce
PNNL April 2011 ogce
 
Foss4G 2009 Scenz Grid
Foss4G 2009 Scenz GridFoss4G 2009 Scenz Grid
Foss4G 2009 Scenz Grid
 
Research on Blue Waters
Research on Blue WatersResearch on Blue Waters
Research on Blue Waters
 

More from Rafael Ferreira da Silva

Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...
Rafael Ferreira da Silva
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Rafael Ferreira da Silva
 
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Rafael Ferreira da Silva
 
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling  Scientific Workflow Research a...WorkflowHub: Community Framework for Enabling  Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
Rafael Ferreira da Silva
 
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven EngineeringBridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Rafael Ferreira da Silva
 
Accurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
Accurately Simulating Energy Consumption of I/O-intensive Scientific WorkflowsAccurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
Accurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
Rafael Ferreira da Silva
 
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
Rafael Ferreira da Silva
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
Rafael Ferreira da Silva
 
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific WorkflowsOn the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
Rafael Ferreira da Silva
 
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Rafael Ferreira da Silva
 
Automating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific WorkflowsAutomating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific Workflows
Rafael Ferreira da Silva
 
Analysis of User Submission Behavior on HPC and HTC
Analysis of User Submission Behavior on HPC and HTCAnalysis of User Submission Behavior on HPC and HTC
Analysis of User Submission Behavior on HPC and HTC
Rafael Ferreira da Silva
 
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Rafael Ferreira da Silva
 
Pegasus - automate, recover, and debug scientific computations
Pegasus - automate, recover, and debug scientific computationsPegasus - automate, recover, and debug scientific computations
Pegasus - automate, recover, and debug scientific computations
Rafael Ferreira da Silva
 
Task Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and WorkflowsTask Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and Workflows
Rafael Ferreira da Silva
 
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
Rafael Ferreira da Silva
 
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud InfrastructuresExperiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Rafael Ferreira da Silva
 
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
Rafael Ferreira da Silva
 
Leveraging Semantics to Improve Reproducibility in Scientific Workflows
Leveraging Semantics to Improve Reproducibility in Scientific WorkflowsLeveraging Semantics to Improve Reproducibility in Scientific Workflows
Leveraging Semantics to Improve Reproducibility in Scientific Workflows
Rafael Ferreira da Silva
 
A science-gateway for workflow executions: online and non-clairvoyant self-h...
A science-gateway for workflow executions: online and non-clairvoyant self-h...A science-gateway for workflow executions: online and non-clairvoyant self-h...
A science-gateway for workflow executions: online and non-clairvoyant self-h...
Rafael Ferreira da Silva
 

More from Rafael Ferreira da Silva (20)

Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...
 
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
Modeling and Simulation of Parallel and Distributed Computing Systems with Si...
 
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
 
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling  Scientific Workflow Research a...WorkflowHub: Community Framework for Enabling  Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
 
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven EngineeringBridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
 
Accurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
Accurately Simulating Energy Consumption of I/O-intensive Scientific WorkflowsAccurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
Accurately Simulating Energy Consumption of I/O-intensive Scientific Workflows
 
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
Running Accurate, Scalable, and Reproducible Simulations of Distributed Syste...
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
 
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific WorkflowsOn the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
 
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
 
Automating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific WorkflowsAutomating Environmental Computing Applications with Scientific Workflows
Automating Environmental Computing Applications with Scientific Workflows
 
Analysis of User Submission Behavior on HPC and HTC
Analysis of User Submission Behavior on HPC and HTCAnalysis of User Submission Behavior on HPC and HTC
Analysis of User Submission Behavior on HPC and HTC
 
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
 
Pegasus - automate, recover, and debug scientific computations
Pegasus - automate, recover, and debug scientific computationsPegasus - automate, recover, and debug scientific computations
Pegasus - automate, recover, and debug scientific computations
 
Task Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and WorkflowsTask Resource Consumption Prediction for Scientific Applications and Workflows
Task Resource Consumption Prediction for Scientific Applications and Workflows
 
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi...
 
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud InfrastructuresExperiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
 
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
A Unified Approach for Modeling and Optimization of Energy, Makespan and Reli...
 
Leveraging Semantics to Improve Reproducibility in Scientific Workflows
Leveraging Semantics to Improve Reproducibility in Scientific WorkflowsLeveraging Semantics to Improve Reproducibility in Scientific Workflows
Leveraging Semantics to Improve Reproducibility in Scientific Workflows
 
A science-gateway for workflow executions: online and non-clairvoyant self-h...
A science-gateway for workflow executions: online and non-clairvoyant self-h...A science-gateway for workflow executions: online and non-clairvoyant self-h...
A science-gateway for workflow executions: online and non-clairvoyant self-h...
 

Recently uploaded

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
Fwdays
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Ukraine
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
ScyllaDB
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 

Recently uploaded (20)

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 

The Interplay of Workflow Execution and Resource Provisioning

  • 1. The Interplay of Workflow Execution and Resource Provisioning Rafael Ferreira da Silva 18th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP) MS10 Resource Management, Scheduling, Workflows: Critical Middleware for HPC and Clouds March 7, 2018 Funded by the National Science Foundation under grants #1741040 and #1664162
  • 2. https://pegasus.isi.edu OUTLINE Introduction Workflow Systems Workflow Execution Workflow Optimizations Workload Characterization Compute Pipelines Scientific Workflows Scientific Achievements Basic Concepts Execution Challenges Heterogeneous Environments Fine-Grained Workflows on HPC Systems HPC/HTC Resource Provisioning Fault-Tolerance Task Clustering Workflow Restructuring 2 Extreme Scale In Transit Processing In Situ Processing
  • 3. https://pegasus.isi.edu 3 Compute Pipelines Building Blocks Compute Pipelines Allows scientists to connect different codes together and execute their analysis Pipelines can be very simple (independent or parallel) jobs or complex represented as DAGs However, it is still up-to user to figure out Data Management How do you ship in the small/large amounts data required by your pipeline and protocols to use? How best to leverage different infrastructure setups Grids have no shared filesystem while XSEDE and your local campus cluster has one! Debug and Monitor Computations. Correlate data across lots of log files Need to know what host a job ran on and how it was invoked Restructure Workflows for Improved Performance Short running tasks? Data placement
  • 4. https://pegasus.isi.edu Why Pegasus 4 > > >Automation Enables parallel, distributed computations Automatically executes data transfers Reproducibility Reusable, aids reproducibility Records how data was produced (provenance) Automate Recover Debug Recover & Debug Handles failures with to provide reliability Keeps track of data and files NSF funded project since 2001, with close collaboration with HTCondor team http://pegasus.isi.edu
  • 5. https://pegasus.isi.edu Scientific Achievements 5 60,000 compute tasks Input Data: 5,000 files (10GB total) Output Data: 60,000 files (60GB total) executed on LIGO Data Grid, Open Science Grid and XSEDE
  • 6. https://pegasus.isi.edu 6 Advanced LIGO PyCBC Workflow One of the main pipelines to measure the statistical significance of data needed for discovery Contains 100’s of thousands of jobs and accesses on order of terabytes of data Uses data from multiple detectors For the detection, the pipeline was executed on Syracuse and Albert Einstein Institute Hannover A single run of the binary black hole + binary neutron star search through the O1 data (about 3 calendar months of data with 50% duty cycle) requires a workflow with 194,364 jobs Generating the final O1 results with all the review required for the first discovery took about 20 million core hours PyCBC Papers: An improved pipeline to search for gravitational waves from compact binary coalescence. Samantha Usman, Duncan Brown et al. The PyCBC search for gravitational waves from compact binary coalescence, Samantha Usman et al (https://arxiv.org/abs/1508.02357) PyCBC Detection GW150914: First results from the search for binary black hole coalescence with Advanced LIGO. B. P. Abbott et al.
  • 7. https://pegasus.isi.edu 7 Southern California Earthquake Center’s CyberShake Builders ask seismologists: What will the peak ground motion be at my new building in the next 50 years? Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA) 286 sites, 4 models each workflow has 420,000 tasks CPU jobs (Mesh generation, seismogram synthesis): 1,094,000 node-hours GPU jobs: 439,000 node-hours AWP-ODC finite-difference code 5 billion points per volume, 23000 timesteps 200 GPUs for 1 hour Titan: 421,000 CPU node-hours, 110,000 GPU node-hours Blue Waters: 673,000 CPU node-hours, 329,000 GPU node-hours
  • 8. https://pegasus.isi.edu Enabled cutting-edge domain science (e.g., drug delivery) through collaboration with scientists at the DoE Spallation Neutron Source (SNS) facility A Pegasus workflow was developed that confirmed that nanodiamonds can enhance the dynamics of tRNA It compared SNS neutron scattering data with MD simulations by calculating the epsilon that best matches experimental data Ran on a Cray XE6 at NERSC using 400,000 CPU hours, and generated 3TB of data Water is seen as small red and white molecules on large nanodiamond spheres. The colored tRNA can be seen on the nanodiamond surface. (Image Credit: Michael Mattheson, OLCF, ORNL) An automated analysis workflow for optimization of force-field parameters using neutron scattering data. V. E. Lynch, J. M. Borreguero, D. Bhowmik, P. Ganesh, B. G. Sumpter, T. E. Proffen, M. Goswami, Journal of Computational Physics, July 2017. Impact on DOE Science 8
  • 9. https://pegasus.isi.edu 9 Soybean Workflow TACC Wrangler as Execution Environment Flash Based Shared Storage Switched to glideins (pilot jobs) - Brings in remote compute nodes and joins them to the HTCondor pool on the submit host Workflow runs at a finer granularity Works well on Wrangler due to more cores and memory per node (48 cores, 128 GB RAM)
  • 10. https://pegasus.isi.edu 10 cleanup job removes unused data stage-in job stage-out job registration job transfers the workflow input data transfers the workflow output data registers the workflow output data DAGdirected-acyclic graphs Portable Description Users do not worry about low level execution details abstract workflow executable workflow transformation executables (or programs) platform independent logical filename (LFN) platform independent (abstraction) Basic Concepts
  • 11. https://pegasus.isi.edu Overview of Workflow Execution Challenges 11 Workflow Management System Workflow Design Workflow Structure Sequence Parallelism Modeling / Composition Language Scheduling / Resource Provisioning Architecture HPC / HTC Optimization Clustering / Cleanup Pilot Jobs Interopera bility Task Estimation Runtime / Energy / Disk Fault- Tolerance Task-level Anomaly Detection Workflow- level Exception Handling Data Movement Data Locality Network Proximity Reconfigurable Network SDN Execution Monitoring Provenance In situ NVRAM / Data Spaces In transit HDFS / Burst Buffers Deelman, E., et al. The Future of Scientific Workflows, The International Journal of High Performance Computing Applications, vol. 32 (1), 2017.
  • 12. https://pegasus.isi.edu Input data site Data staging site shared filesystem submit host (e.g., user’s laptop) Compute site B Compute site A Data staging site Input data site Output data site Output data site object storage Heterogeneous Environments 12 HPC Cloud
  • 13. https://pegasus.isi.edu Workflow Optimization: Task Clustering 13 Task Granularity Low performance of lightweight (fine-grained) tasks High queuing times Communication overheads Solutions Group into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time Use pilot jobs to dynamically provision a number of resources at a time Chen, W., et al. "Dynamic and fault-tolerant clustering for scientific workflows." IEEE Transactions on Cloud Computing 4.1 (2016): 49-62. Chen, W., et al. "Using imbalance metrics to optimize task clustering in scientific workflow executions." Future Generation Computer Systems 46 (2015): 69-84. Ferreira da Silva, R., et al. "Controlling fairness and task granularity in distributed, online, non‐clairvoyant workflow executions." Concurrency and Computation: Practice and Experience 26.14 (2014): 2347-2366.
  • 14. https://pegasus.isi.edu HPC System submit host (e.g., user’s laptop) workflow wrapped as an MPI job Allows sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote resources Fine-Grained Workflows on HPC systems 14 Dreher, M., et al. "In Situ Workflows at Exascale: System Software to the Rescue." Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ACM, 2017. Ferreira da Silva, R., et al. "A characterization of workflow management systems for extreme-scale applications." Future Generation Computer Systems 75 (2017): 228-238.
  • 15. https://pegasus.isi.edu Workflow Optimization: Workflow Restructuring 15 Chen, W., and Deelman, E.. "Integration of workflow partitioning and resource provisioning." Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012). IEEE Computer Society, 2012. Srinivasan, S., et al. "A cleanup algorithm for implementing storage constraints in scientific workflow executions." Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science. IEEE Press, 2014. Partition the workflow into subworkflows and send them for execution to the target system Storage constraints to limit or minimize the amount of disk space that workflows use on shared storage systems (use of cleanup jobs)
  • 16. https://pegasus.isi.edu Resource Provisioning: Workload Characteristics (HPC) 16 487 Total Number of Users MIRA (ALCF) 78,782 Total Number of Jobs 5,698 CPU hours (millions) 6,093 Avg. Runtime (seconds) 786,432 Total Number of Cores 49,152 Total Number of Nodes 2014 Physics 73 Users 24,429 Jobs 2,256 CPU hours (millions) 7,147 Avg. Runtime (sec) Materials Science 77 Users 12,546 Jobs 895 CPU hours (millions) 5,820 Avg. Runtime (sec) Chemistry 51 Users 10,286 Jobs 810 CPU hours (millions) 6,131 Avg. Runtime (sec) Schlagkamp, S., et al. ”Consecutive Job Submission Behavior at Mira Supercomputer." 25th ACM International Symposium on High- Performance Parallel and Distributed Computing (HPDC), 2016.
  • 17. https://pegasus.isi.edu Resource Provisioning: Workload Characteristics (HTC) 17 COMPACT MUON SOLENOID (CMS) 1,435,280 Total Number of Jobs AUGUST 2014 OCTOBER 2014 392 Total Number of Users 75 Execution Sites 15,484 Execution Nodes 792,603 Completed Jobs 385,447 Exit Code (!= 0) 9,444.6 Avg. Runtime (sec) 55.3 Avg. Disk Usage (MB) 408 Total Number of Users 15,034 Execution Nodes 476,391 Exit Code (!= 0) 32.9 Avg. Disk Usage (MB) 1,638,803 Total Number of Jobs 72 Execution Sites 816,678 Completed Jobs 9967.1 Avg. Runtime (sec) Schlagkamp, S., et al. ” Understanding User Behavior: from HPC to HTC" Procedia Computer Science (ICCS’16), 2016.
  • 18. https://pegasus.isi.edu Resource Provisioning: Resource Estimation 18 Jobs’ resource requirements at Mira Job submission interarrival times per day Job submission interarrival times per hour Jobruntime MemoryUsage DiskUsage Ferreira da Silva, Rafael, et al. "Online task resource consumption prediction for scientific workflows." Parallel Processing Letters 25.03 (2015): 1541003.
  • 19. https://pegasus.isi.edu Network Provisioning 19 We predicted completion times of current and future workflow tasks and start times for data staging under different scenarios to provision appropriate compute & network resources We developed mechanisms to arbitrate and prioritize data flows from competing workflows by leveraging an advanced network provisioning technologies, a virtual Software Defined Exchange (SDX) Virtualized SDX Site 1 OpenFlow Controller Site 3Site 2 Data Site ExoGENI Slice with Virtual SDX + HTCondor Pools + Data Site Dynamic Connectivity HTCondor Pool HTCondor Pool HTCondor Pool 10-80 20-70 30-60 40-50 45-45 50-40 60-30 70-20 80-10 Relative data flow priorities (Workflow1-Workflow2) 0 200 400 600 800 1000 1200 1400 1600 1800 Avg.transfertimefor"individual"jobs(secs) Transfer time vs. Flow priorities (Different bandwidths) Workflow1 on HTCondor pool 1 (vSDX bandwidth = 250Mb/s) Workflow2 on HTCondor pool 2 (vSDX bandwidth = 250Mb/s) Workflow1 on HTCondor pool 1 (vSDX bandwidth = 500Mb/s) Workflow2 on HTCondor pool 2 (vSDX bandwidth = 500Mb/s) Effect of relative flow priorities on observed data transfer times for data-intensive workflow tasks for different vSDX provisioned bandwidths Deelman, E., et al. “PANORAMA: An Approach to Performance Modeling and Diagnosis of Extreme Scale Workflows”, International Journal of High Performance Computing Applications, vol. 31, 2017. Mandal, A., et al. “Toward Prioritization of Data Flows for Scientific Workflows Using Virtual Software Defined Exchanges”, First International Workshop on Workflow Science (WoWS 2017), 2017.
  • 20. https://pegasus.isi.edu Typical Approaches Task Retries Task Resubmission Task Clustering Checkpointing Provenance … Statistical and Machine Learning Linear Regression Neural Networks Classification Algorithms Tree-based Methods Support Vector Machines … Others Exception Handling Game Theory … Analytical Solutions Failure Modeling Markov Chains Principal Component Analysis Histograms … 20 Some Approaches to Handle Faults
  • 21. https://pegasus.isi.edu Fault-Tolerance: Anomaly Detection 21 Anomaly Detection System Developed Auto-Regression (AR) based statistical methods to detect anomalies for single metrics AR coefficients and degree of AR calculated from training runs AR model During application execution, use coefficients to generate predictions; when error exceeds a threshold, generate triggers Experimented with anomaly detection by injecting artificial, competing loads Integrated workflow monitoring and anomaly detection system Mandal, A. et al. “Toward an End-to-end Framework for Modeling, Monitoring, and Anomaly Detection for Scientific Workflows”, Workshop on Large-Scale Parallel Processing (LSPP’16), 2016.
  • 22. https://pegasus.isi.edu Fault Tolerance: Preventing Faults using PID Controllers 22 P I D ΣΣ Process OutputSetpoint Input+ - + + + z 0.0 0.5 1.0 1.5 TimeDead time Raise time Percent overshoot Settling time Steady-state error DATAFOOTPRINT PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening Ferreira da Silva, R., et al. “Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows”, 11th Workflows in Support of Large-Scale Science (WORKS’16), 2016.
  • 23. https://pegasus.isi.edu In transit processing: Burst Buffers in Workflow Systems 23 Automated BB Reservations BB reservations operations (either persistent or scratch) consist in the creation and release, as well as stage in and stage out operations Transient reservations: needs to implement stage in/out operations at the beginning/end of each job execution Execution Directory Automated mapping between the workflow execution directory and the BB reservation No changes to the application code are necessary, and the application job directly writes its output to the BB reservation Placement of the burst buffer nodes within the Cori system (NERSC) Ferreira da Silva, R., et al. "On the use of burst buffers for accelerating data-intensive scientific workflows." Proceedings of the 12th Workshop on Workflows in Support of Large-Scale Science. ACM, 2017. 0 2500 5000 7500 10000 1 4 8 16 32 64 128 256 313 # Nodes MiB/s BB no−BB fx.sgt fy.sgt 1 4 8 16 32 64 128 1 4 8 16 32 64 128 0 500 1000 1500 2000 # Nodes Time(seconds) BB No−BB
  • 24. https://pegasus.isi.edu In situ Processing: WMS collocated with the simulations 24 Integration with In situ Processing Systems DataSpaces Distributed in-memory data staging framework Decaf Dataflow system for the parallel communication of coupled tasks ADIOS Adaptable IO System Challenges Efficient Data Exchanges Tasks and workflow communication to exchange data User Code Integration Expertise required for modifying programs Task Placement Management of data locality between tasks Resilience Increased probability of failures in large resources Dynamicity Workflow graph may need to be restructured at runtime Portability Specialized hardware, specific libraries Dreher, M., et al. "In Situ Workflows at Exascale: System Software to the Rescue." Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ACM, 2017. Ferreira da Silva, R., et al. "A characterization of workflow management systems for extreme-scale applications." Future Generation Computer Systems 75 (2017): 228-238.
  • 25. The Interplay of Workflow Execution and Resource Provisioning Rafael Ferreira da Silva, Ph.D. Research Assistant Professor, Computer Science Department Computer Scientist, USC Information Sciences Institute rafsilva@isi.edu – http://rafaelsilva.com Thank You Questions? Funded by the National Science Foundation under grants #1741040 and #1664162 http://pegasus.isi.edu http://analytics4md.org