SlideShare a Scribd company logo
1 of 25
Download to read offline
Vas Vasiliadis
vas@uchicago.edu
February 28, 2024
Reliable, Remote Computation at all Scales
Why do we need remote computing?
Workloads
• Interactive, real-time
workloads
• Machine learning training
and inference
• Components may best be
executed in different places
Users
• Diverse backgrounds
and expertise
• Different user interfaces
(e.g., notebooks)
Resources
• Hardware specialization
• Specialization leads to
distribution
Why do researchers “hate” remote computing?
• Remote computing is notoriously complicated
– Authentication
– Network connections
– Configuring/managing jobs
– Interacting with resources (waiting in queues, scaling nodes)
– Configuring execution environment
– Getting results back again
• Researchers need to overcome the same obstacles
every time they move to a new resource
3
How can we bridge this gap?
Borrow page from data management playbook
è“Fire-and-forget” computation
èUniform access interface
èFederated access control
èFast networks making compute resource “local”
How can we bridge this gap?
Move closer to researchers’ environments
èResearchers primarily work in high level languages
èFunctions are a natural unit of computation
Function-as-a-Service (FaaS)!
• Researchers work in familiar language, e.g., Python…
• …using familiar interfaces, e.g., Jupyter
FaaS as offered by cloud providers
6
• Single provider, single
location to submit and
manage tasks
• Homogenous execution
environment
• Transparent and elastic
execution (of even very
small tasks)
• Integrated with
cloud provider data
management
Cloud
Provider 1
Cloud
Provider 2
FaaS as an interface to the advanced computing
ecosystem
7
• Multiple provider,
multiple location, single
interface
• Homogenous execution
environment
• Transparent and elastic
execution
• Integrated with
institution-wide data
management
Globus Compute helps bridge the gap
Managed, federated
Function-as-a-Service for
reliably, scaleably and
securely executing functions
on remote endpoints from
laptops to supercomputers
The Globus Compute model
• Compute service — Highly available cloud-hosted
service for managed, fire-and-forget function execution
• Compute endpoint — Abstracts access to compute
resources, from edge device to supercomputer
• Compute SDK — Python interface for interacting with the
service, suing the familiar (to many) Globus look and feel
• Security — Leverages Globus Auth; Compute endpoints
are resource servers, authN and access via OAuth tokens
9
User interaction with Globus Compute
10
A B
You request a function be
executed on endpoints A and B
1
2 Globus Compute manages
the reliable and secure
execution on these endpoints
3
Globus Compute returns results or
stores them until requested
Compute
Service
You register your code
(as a Python function)
0
A compute
resource
Another
compute
resource
Globus Compute transforms any computing
resource into a function serving endpoint
• Python pip installable agent
• Elastic resource provisioning
from local, cluster, or cloud
system (via Parsl)
• Parallel execution using local
fork or via common schedulers
– Slurm, PBS, LSF, Cobalt, K8s
11
Compute
Service
Executing functions with Globus Compute
• Users invoke functions as tasks
– Register Python function
– Pass input arguments
– Select endpoint(s)
• Service stores tasks in the cloud
• Endpoints fetch waiting tasks (when
online), run tasks, and return results
• Results stored in the cloud and on
Globus storage endpoints
• Users retrieve results asynchronously
12
Compute
Service
Common Use Cases
13
Use case 1: fire-and-forget execution
Executing a bag of tasks, e.g., running simulations with
different parameters, executing ML inferences, on one
or more remote computers directly from your
environment, e.g., Jupyter on your laptop
• Fire-and-forget execution managed by Globus Compute;
tasks/results cached until endpoint/client are online
• Portability across different systems—optionally making use of
specialized hardware
• Elastic scaling to provision resources as needed (from HPC and
cloud systems)
ML-based drug screening
15
National Virtual Biotechnology Laboratory, arXiv:2006.02431
Use case 2: automated analysis of data
Constructing and running automated analysis
pipelines with data processing steps that need to be
executed in different locations
• Automatically process data as they are acquired (event- and
workflow-based)
• Integrate with data movement and other actions—human and
machine
• Execute functions across the computing continuum e.g., near
instrument, in data center, on specialized hardware
Enabling serial crystallography at scale
• Serially image chips with
000’s of embedded crystals
• Quality control first 1,000 to
report failures
• Analyze batches of images as
they are collected
• Report statistics and images
during experiment
• Return structure to scientist
Darren Sherrell, Gyorgy Babnigg, Andrzej Joachimiak
Globus
Flows
How remote computing enables this use case
Image
processing
Local workstation
Carbon!
Check
threshold
Data publication
Transfer
Transfer
raw files
Transfer
Move results
to repo
Compute
Analyze
images
Compute
Visualize
Compute
Gather
metadata
Share
Set access
controls
Compute
Launch QA
job
Search
Ingest to
index
Supercomputer
GPU Cluster
Globus enables “smart” instruments
19
aining of Deep Neural Networks
ources
7sec
19sec
5sec
31sec.
cycle time
Seems too easy? Let’s take a look…
• Run function on my laptop …test, debug
• Scale (a little) on a cloud VM (or K8s cluster)
• Scale (a bit more) on my campus cluster
• …run complete model on DOE supercomputer :--(
20
Common pattern of computing in a flow
Transfer raw
instrument
images
Run a compute
job to process
image files
Transfer Compute
Move processed
images to sharing
system/repository
Set access
controls for
sharing data
Share
Transfer
1 2 3 4
EC2
Instance
“Instrument”
EC2
Instance
Compute
Resource
Our environment Registered
Function
Monitor
script
transfer control
access
result files
0 trigger
flow run
transfer
raw files
1
invoke image
processing function 2
set
permissions
4
transfer
result files
3
Data
Endpoint
Data
Endpoint
Data
Endpoint
ALCF
Eagle
Compute
Endpoint
Globus Compute current state
• Service is running in production
– Tried and tested by 10MMs of invocations on 100Ks endpoints
• Globus Computer Agent supports single user
– Akin to Globus Connect Personal
• Endpoints visible via Globus web app
• Function registration and invocation via SDK
23
Compute service will evolve rapidly
• Multi-user compute endpoints
– Akin to Globus Connect Server
• Native integration with transfer for stage in and stage
out of data for compute tasks
– Can be done today via Globus Flows
• Expanding compute service interfaces in the webapp
for administrators and users
Support resources
• Globus documentation: docs.globus.org
• YouTube channel: youtube.com/GlobusOnline
• Helpdesk: support@globus.org
• Mailing Lists: globus.org/mailing-lists
• Customer engagement team (office hours)
• Professional services team (advisory, custom work)

More Related Content

Similar to Reliable, Remote Computation at All Scales

Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudDatadog
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduriRavi Madduri
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansPeter Clapham
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCOlga Lavrentieva
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!Ian Foster
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Openstack For Beginners
Openstack For BeginnersOpenstack For Beginners
Openstack For Beginnerscpallares
 
StarlingX - Project Onboarding
StarlingX - Project OnboardingStarlingX - Project Onboarding
StarlingX - Project OnboardingShuquan Huang
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmDmitri Zimine
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesIan Foster
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systemsSri Prasanna
 
DOE Magellan OpenStack user story
DOE Magellan OpenStack user storyDOE Magellan OpenStack user story
DOE Magellan OpenStack user storylaurabeckcahoon
 
Overview of Scientific Workflows - Why Use Them?
Overview of Scientific Workflows - Why Use Them?Overview of Scientific Workflows - Why Use Them?
Overview of Scientific Workflows - Why Use Them?inside-BigData.com
 
Globus Labs: Forging the Next Frontier
Globus Labs: Forging the Next FrontierGlobus Labs: Forging the Next Frontier
Globus Labs: Forging the Next FrontierGlobus
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 

Similar to Reliable, Remote Computation at All Scales (20)

Brad stack - Digital Health and Well-Being Festival
Brad stack - Digital Health and Well-Being Festival Brad stack - Digital Health and Well-Being Festival
Brad stack - Digital Health and Well-Being Festival
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Openstack For Beginners
Openstack For BeginnersOpenstack For Beginners
Openstack For Beginners
 
StarlingX - Project Onboarding
StarlingX - Project OnboardingStarlingX - Project Onboarding
StarlingX - Project Onboarding
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
DOE Magellan OpenStack user story
DOE Magellan OpenStack user storyDOE Magellan OpenStack user story
DOE Magellan OpenStack user story
 
Overview of Scientific Workflows - Why Use Them?
Overview of Scientific Workflows - Why Use Them?Overview of Scientific Workflows - Why Use Them?
Overview of Scientific Workflows - Why Use Them?
 
Globus Labs: Forging the Next Frontier
Globus Labs: Forging the Next FrontierGlobus Labs: Forging the Next Frontier
Globus Labs: Forging the Next Frontier
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 

More from Globus

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration TopicsGlobus
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowGlobus
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaSGlobus
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusGlobus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for ResearchersGlobus
 
Introduction to Research Automation with Globus
Introduction to Research Automation with GlobusIntroduction to Research Automation with Globus
Introduction to Research Automation with GlobusGlobus
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System AdministratorsGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersGlobus
 
Introduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersGlobus
 
Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Globus
 
Automating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeGlobus
 
Automating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Globus for New Users
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New UsersGlobus
 
Working with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsGlobus
 
Globus Automation
Globus AutomationGlobus Automation
Globus AutomationGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
Introduction to Globus
Introduction to GlobusIntroduction to Globus
Introduction to GlobusGlobus
 

More from Globus (20)

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration Topics
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a Flow
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaS
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using Globus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for Researchers
 
Introduction to Research Automation with Globus
Introduction to Research Automation with GlobusIntroduction to Research Automation with Globus
Introduction to Research Automation with Globus
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System Administrators
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
 
Introduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for Researchers
 
Introduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for Developers
 
Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)
 
Automating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and Compute
 
Automating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus Platform
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
 
Introduction to Globus for New Users
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New Users
 
Working with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and Portals
 
Globus Automation
Globus AutomationGlobus Automation
Globus Automation
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
Introduction to Globus
Introduction to GlobusIntroduction to Globus
Introduction to Globus
 

Recently uploaded

Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 

Recently uploaded (20)

Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 

Reliable, Remote Computation at All Scales

  • 1. Vas Vasiliadis vas@uchicago.edu February 28, 2024 Reliable, Remote Computation at all Scales
  • 2. Why do we need remote computing? Workloads • Interactive, real-time workloads • Machine learning training and inference • Components may best be executed in different places Users • Diverse backgrounds and expertise • Different user interfaces (e.g., notebooks) Resources • Hardware specialization • Specialization leads to distribution
  • 3. Why do researchers “hate” remote computing? • Remote computing is notoriously complicated – Authentication – Network connections – Configuring/managing jobs – Interacting with resources (waiting in queues, scaling nodes) – Configuring execution environment – Getting results back again • Researchers need to overcome the same obstacles every time they move to a new resource 3
  • 4. How can we bridge this gap? Borrow page from data management playbook è“Fire-and-forget” computation èUniform access interface èFederated access control èFast networks making compute resource “local”
  • 5. How can we bridge this gap? Move closer to researchers’ environments èResearchers primarily work in high level languages èFunctions are a natural unit of computation Function-as-a-Service (FaaS)! • Researchers work in familiar language, e.g., Python… • …using familiar interfaces, e.g., Jupyter
  • 6. FaaS as offered by cloud providers 6 • Single provider, single location to submit and manage tasks • Homogenous execution environment • Transparent and elastic execution (of even very small tasks) • Integrated with cloud provider data management Cloud Provider 1 Cloud Provider 2
  • 7. FaaS as an interface to the advanced computing ecosystem 7 • Multiple provider, multiple location, single interface • Homogenous execution environment • Transparent and elastic execution • Integrated with institution-wide data management
  • 8. Globus Compute helps bridge the gap Managed, federated Function-as-a-Service for reliably, scaleably and securely executing functions on remote endpoints from laptops to supercomputers
  • 9. The Globus Compute model • Compute service — Highly available cloud-hosted service for managed, fire-and-forget function execution • Compute endpoint — Abstracts access to compute resources, from edge device to supercomputer • Compute SDK — Python interface for interacting with the service, suing the familiar (to many) Globus look and feel • Security — Leverages Globus Auth; Compute endpoints are resource servers, authN and access via OAuth tokens 9
  • 10. User interaction with Globus Compute 10 A B You request a function be executed on endpoints A and B 1 2 Globus Compute manages the reliable and secure execution on these endpoints 3 Globus Compute returns results or stores them until requested Compute Service You register your code (as a Python function) 0 A compute resource Another compute resource
  • 11. Globus Compute transforms any computing resource into a function serving endpoint • Python pip installable agent • Elastic resource provisioning from local, cluster, or cloud system (via Parsl) • Parallel execution using local fork or via common schedulers – Slurm, PBS, LSF, Cobalt, K8s 11 Compute Service
  • 12. Executing functions with Globus Compute • Users invoke functions as tasks – Register Python function – Pass input arguments – Select endpoint(s) • Service stores tasks in the cloud • Endpoints fetch waiting tasks (when online), run tasks, and return results • Results stored in the cloud and on Globus storage endpoints • Users retrieve results asynchronously 12 Compute Service
  • 14. Use case 1: fire-and-forget execution Executing a bag of tasks, e.g., running simulations with different parameters, executing ML inferences, on one or more remote computers directly from your environment, e.g., Jupyter on your laptop • Fire-and-forget execution managed by Globus Compute; tasks/results cached until endpoint/client are online • Portability across different systems—optionally making use of specialized hardware • Elastic scaling to provision resources as needed (from HPC and cloud systems)
  • 15. ML-based drug screening 15 National Virtual Biotechnology Laboratory, arXiv:2006.02431
  • 16. Use case 2: automated analysis of data Constructing and running automated analysis pipelines with data processing steps that need to be executed in different locations • Automatically process data as they are acquired (event- and workflow-based) • Integrate with data movement and other actions—human and machine • Execute functions across the computing continuum e.g., near instrument, in data center, on specialized hardware
  • 17. Enabling serial crystallography at scale • Serially image chips with 000’s of embedded crystals • Quality control first 1,000 to report failures • Analyze batches of images as they are collected • Report statistics and images during experiment • Return structure to scientist Darren Sherrell, Gyorgy Babnigg, Andrzej Joachimiak
  • 18. Globus Flows How remote computing enables this use case Image processing Local workstation Carbon! Check threshold Data publication Transfer Transfer raw files Transfer Move results to repo Compute Analyze images Compute Visualize Compute Gather metadata Share Set access controls Compute Launch QA job Search Ingest to index Supercomputer GPU Cluster
  • 19. Globus enables “smart” instruments 19 aining of Deep Neural Networks ources 7sec 19sec 5sec 31sec. cycle time
  • 20. Seems too easy? Let’s take a look… • Run function on my laptop …test, debug • Scale (a little) on a cloud VM (or K8s cluster) • Scale (a bit more) on my campus cluster • …run complete model on DOE supercomputer :--( 20
  • 21. Common pattern of computing in a flow Transfer raw instrument images Run a compute job to process image files Transfer Compute Move processed images to sharing system/repository Set access controls for sharing data Share Transfer 1 2 3 4
  • 22. EC2 Instance “Instrument” EC2 Instance Compute Resource Our environment Registered Function Monitor script transfer control access result files 0 trigger flow run transfer raw files 1 invoke image processing function 2 set permissions 4 transfer result files 3 Data Endpoint Data Endpoint Data Endpoint ALCF Eagle Compute Endpoint
  • 23. Globus Compute current state • Service is running in production – Tried and tested by 10MMs of invocations on 100Ks endpoints • Globus Computer Agent supports single user – Akin to Globus Connect Personal • Endpoints visible via Globus web app • Function registration and invocation via SDK 23
  • 24. Compute service will evolve rapidly • Multi-user compute endpoints – Akin to Globus Connect Server • Native integration with transfer for stage in and stage out of data for compute tasks – Can be done today via Globus Flows • Expanding compute service interfaces in the webapp for administrators and users
  • 25. Support resources • Globus documentation: docs.globus.org • YouTube channel: youtube.com/GlobusOnline • Helpdesk: support@globus.org • Mailing Lists: globus.org/mailing-lists • Customer engagement team (office hours) • Professional services team (advisory, custom work)