Automating Research Data with Globus Flows and Compute

Greg Nawrocki
greg@globus.org
nawrocki@uchicago.edu
nawrocki@anl.gov
Washington University in St. Louis
September 20 & 21, 2022
Case Western Reserve University
October 23 – 24, 2023

Topics
• Globus Flows overview
• Automating data management
–Run an existing Flow
–Build a Flow then run it
• Globus Compute overview
• Automating end-to-end research flows

Globus Platform and Automation Capabilities
Timer Service
The Globus WebApp supports
recurring and scheduled transfers.
(a.k.a. Globus cron)
Command Line Interface
The CLI provides an interface to Globus
services from the shell and is suited to
both interactive and scripting use cases.
Globus API / SDK
Our open REST APIs and Python SDK
empower you to create an integrated
ecosystem of research data services
and applications. Harness the power of
the Globus platform so you can focus
on building your application.

Automation using Globus Flows
Available to all Globus Subscribers
• Managed, secure (Globus Auth), reliable
task orchestration
• Support for heterogeneous resources
• Extensible and authorable event-driven
execution model
– Flow Definition (JSON)
– Input Schema (JSON)
– Deployment
• Extensible via custom actions

Managed automation of tasks
• Flows: A platform service for defining, applying, and
sharing distributed research automation flows
• Flows comprise Actions
• Action Providers: Called by Flows to perform tasks
• Triggers*: Start flows based on events * In development
[Diagram: an example flow as a chain of actions]
– Transfer: transfer raw files
– Compute: launch analysis job
– Carbon!: correct, classify, …
– Compute: extract metadata
– Search: ingest to index
– Transfer: move final files to repo
– Share: set access controls
Globus Flows service implementation
• Built on AWS Step Functions
– Simple state machine language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating
custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
AWS Step Functions + Globus Auth

Automation services ecosystem
GET /provider_url/
POST /provider_url/run
GET /provider_url/action_id/status
POST /provider_url/action_id/cancel
POST /provider_url/action_id/release
Create Action
Providers
Define and
deploy flows
{ "StartAt": "ToProject",
"States": {
"ToProject": { … },
"SetPermission": { … },
"ProcessData": { … } … }}
Run flows

Flow lifecycle: Write once, run many
• Define using JSON/YAML
• Deploy to Flows service
• Set access policy for visibility and execution
• Run (debug) and monitor
• …and run again!

A simple, rather contrived, use case
Actions
1. Transfer: transfer files to intermediate storage
2. Transfer: transfer files to final storage

Ex. 1: Run an existing flow using the web app
• Navigate to app.globus.org/flows
• Find the flow named “Two Stage Globus Transfer” and click “Start”
• Consent to allow the flow access to your account
• Source
– Collection: Globus Tutorial Endpoint 1
– Path: /share/godata/
• Intermediate
– Collection and path of your choice
– You can even use the collection you created yesterday in the admin tutorial
• Destination
– Collection: Globus Tutorial Endpoint 2
– Path: /~/
• Add appropriate labels and tags
• Start Run!
• Click “View Run Details” and “Event Log” to monitor progress

A simple, and very common, use case
Actions
1. Transfer: transfer raw instrument images
2. Share: set access controls for sharing data

Let’s build it!
jupyter.demo.globus.org
globus-jupyter-notebooks
Automation_Using_Globus_Flows.ipynb
https://globus-automate-client.readthedocs.io/en/latest/authoring_flows.html

Example Flow
• Uses Globus-defined Action Providers
– https://globus-automate-client.readthedocs.io/en/latest/globus_action_providers.html
• transfer
– Uses the Globus Transfer Task API to transfer data from one Globus Collection to another.
• set_permission
– Uses the Globus Transfer ACL API to set or manage permissions on a folder or file.

Initial Housekeeping
import sys
import os
import time
import json
import uuid
import pickle
import base64
import globus_sdk
from globus_automate_client import FlowsClient
# ID of this tutorial notebook as registered with Globus Auth
CLIENT_ID = 'f794186b-f330-4595-b6c6-9c9d3e903e47'
• Things we need in place for this Notebook to run and access
the Globus SDK and Globus Flows client.

Initial Housekeeping
# Feel free to replace the collection UUIDs below with those of your own collections
# "Globus Tutorial Endpoint 1"
source_collection = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"
# "Globus Tutorials on ALCF Eagle"
destination_collection = "a6f165fa-aee2-4fe5-95f3-97429c28bf82"
# "Tutorial Users" group
my_collaborators = "50b6a29c-63ac-11e4-8062-22000ab68755"

Authentication and Authorization
• All interactions between users and services on the Globus
automation platform are governed by the Globus Auth service.
• Consent must be given by the user for each interaction taking place on their
behalf.
• When executing a flow.
• When deploying a new flow on the Globus Flows service.
• This Notebook in our JupyterHub.
• Access to the Flows service is already granted to you by virtue of authenticating to the
JupyterHub running this notebook – the tokens are already in place.
• If you're running this notebook in your own environment you will need to manually log
into Globus Auth and get tokens using a native app authorization flow (see the
`Platform_Introduction_Native_App_Auth` notebook for an example of how to initiate
this flow).

The Globus Flows Service in a Jupyter Notebook
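[Diagram: after login, the notebook holds Globus Auth tokens ({"tokens": …}) and presents them as bearer tokens ("Bearer a45cd…") when calling the REST APIs of the Flows service.]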
# Get Globus Auth token data from the JupyterHub environment
tokens = pickle.loads(base64.b64decode(os.getenv('GLOBUS_DATA')))['tokens']
# Introspect tokens
print(json.dumps(tokens, indent=2))
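The decoding step above can be tried outside the JupyterHub with a stand-in value. This is a minimal sketch: the `fake_login_data` dictionary and its token strings are made up for illustration, and the real `GLOBUS_DATA` payload may carry more fields than shown here.

```python
import base64
import json
import os
import pickle

# Hypothetical stand-in for what the JupyterHub places in GLOBUS_DATA.
# The token values are placeholders, not real tokens.
fake_login_data = {
    "tokens": {
        "flows.globus.org": {"access_token": "example-flows-token"},
        "auth.globus.org": {"access_token": "example-auth-token"},
    }
}

# Encode the same way the decoding line expects: pickle, then base64.
os.environ["GLOBUS_DATA"] = base64.b64encode(
    pickle.dumps(fake_login_data)
).decode("ascii")

# Same decoding steps as the notebook cell: base64-decode, unpickle, take "tokens".
tokens = pickle.loads(base64.b64decode(os.getenv("GLOBUS_DATA")))["tokens"]
print(json.dumps(tokens, indent=2))
```

Tokens are keyed by resource server, which is why later cells index `tokens['flows.globus.org']` and `tokens['auth.globus.org']`.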

Authentication and Authorization
# Create a variable for storing flow scope tokens. Each newly deployed flow
# scope needs to be authorized separately and will have its own set of tokens.
# Save each of these tokens by scope.
saved_flow_scopes = {}
# Add a callback to the flows client for fetching scopes.
# It will draw scopes from `saved_flow_scopes`.
def get_flow_authorizer(flow_url, flow_scope, client_id):
    return globus_sdk.AccessTokenAuthorizer(
        access_token=saved_flow_scopes[flow_scope]['access_token'])
# Set up the Flows client, using tokens from our JupyterHub login to access the
# Globus Flows service, and set the `get_flow_authorizer` callback for any new
# flows we authorize.
flows_authorizer = globus_sdk.AccessTokenAuthorizer(
    access_token=tokens['flows.globus.org']['access_token'])
flows_client = FlowsClient.new_client(
    CLIENT_ID, get_flow_authorizer, flows_authorizer)
• Once you have the tokens, the authentication magic happens.

Fetch User Identity
# Create an Auth client so we can look up identities
auth_authorizer = globus_sdk.AccessTokenAuthorizer(
    access_token=tokens['auth.globus.org']['access_token'])
ac = globus_sdk.AuthClient(authorizer=auth_authorizer)
# Get the user's primary identity
primary_identity = ac.oauth2_userinfo()
identity_id = primary_identity['sub']
print(f"Username: {primary_identity['preferred_username']} (ID: {identity_id})")
print(f"Notifications will be sent to: {primary_identity['email']}")
• When transferring files to the destination collection we will put them in a
uniquely named directory:
• <identity_id>-shared-files
• Fetch our user id for this purpose.

Authoring a Flow
• Define a Flow
– Flows are composed of State Types
– The Action Type is what we will highlight in this example
• Define a Schema
– The user inputs needed for this Flow
• Deploy the Flow
– The FlowsClient makes that easy!

Authoring a Flow – Define a Flow
# Define flow
flow_definition = {
"Comment": "Transfer files to a guest collection and set access permissions",
"StartAt": "TransferFiles",
"States": {
• Top-level fields (from the Amazon States Language) can include:
– Comment
– StartAt: the first state in the machine
– States: the state definitions
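To make the role of the top-level fields concrete, here is a minimal sketch of a structural check on a flow definition. The helper `check_flow_definition` is hypothetical (it is not part of the Flows client), and the state bodies are trimmed to the fields the check needs.

```python
def check_flow_definition(definition):
    """Return a list of structural problems found in a flow definition.

    Hypothetical helper for illustration; the Flows service performs its
    own, more thorough validation when a flow is deployed.
    """
    problems = []
    states = definition.get("States", {})
    if not states:
        problems.append("States is missing or empty")
    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt names unknown state: {start!r}")
    for name, state in states.items():
        target = state.get("Next")
        if target is not None and target not in states:
            problems.append(f"{name}: Next names unknown state {target!r}")
    return problems


# Skeleton of the tutorial flow, with state bodies trimmed for brevity.
skeleton = {
    "Comment": "Transfer files to a guest collection and set access permissions",
    "StartAt": "TransferFiles",
    "States": {
        "TransferFiles": {"Type": "Action", "Next": "SetPermission"},
        "SetPermission": {"Type": "Action", "End": True},
    },
}

# Same skeleton, but StartAt points at a state that does not exist.
broken = dict(skeleton, StartAt="Missing")

print(check_flow_definition(skeleton))
print(check_flow_definition(broken))
```

The point is only that `StartAt` and every `Next` must name a key of `States`; the rest of each state is covered in the slides that follow.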

State Types
• Supported states from the Amazon States Language
– Pass: passes input to output; performs no work
– Choice: adds branching logic to a state machine
– Wait: delays the machine from continuing for a specified time
– Fail: terminates the machine as a failed run
• Globus-defined states
– Action: references the Action Providers; the heart of our example
– ExpressionEval: evaluates an expression to create parameter values for passing to an Action
o Combines the Action and Pass state types, providing the ability to compute results for Parameters (Action) and simple storage of the new values (Pass)
o Useful for determining a value to be tested in a Choice state or for computing a “final” value seen in the output of the Flow upon completion
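As an illustration of the Choice state's branching, the sketch below evaluates a Choice state written in Amazon States Language style against some run data. The state names (`ShareWithGroup`, `ShareWithUser`) and the `choose_next` helper are hypothetical, and only two comparison operators are handled; the real service supports more.

```python
def lookup(path, data):
    """Follow a simple dotted reference such as "$.input.principal_type"."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = data
    for key in path[2:].split("."):
        value = value[key]
    return value


def choose_next(choice_state, data):
    """Return the name of the state a Choice state would branch to.

    Simplified: handles only StringEquals and BooleanEquals rules.
    """
    for rule in choice_state["Choices"]:
        value = lookup(rule["Variable"], data)
        if "StringEquals" in rule and value == rule["StringEquals"]:
            return rule["Next"]
        if ("BooleanEquals" in rule and isinstance(value, bool)
                and value == rule["BooleanEquals"]):
            return rule["Next"]
    return choice_state["Default"]


# Hypothetical Choice state; the target state names are made up.
check_principal = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.input.principal_type",
            "StringEquals": "group",
            "Next": "ShareWithGroup",
        }
    ],
    "Default": "ShareWithUser",
}

group_branch = choose_next(check_principal, {"input": {"principal_type": "group"}})
user_branch = choose_next(check_principal, {"input": {"principal_type": "identity"}})
print(group_branch, user_branch)
```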

The Action State Type – by way of example
"TransferFiles": {
"Comment": "Transfer to a guest collection",
"Type": "Action",
"ActionUrl": "https://actions.automate.globus.org/transfer/transfer",
• Name the State – “TransferFiles”
• Comment
• Self-explanatory
• Type : Action (required)
• ActionUrl (required)
• The base URL of the Action (service endpoint), as defined by the Action Interface.

"Parameters": {
"source_endpoint_id.$": "$.input.source.id",
"destination_endpoint_id.$": "$.input.destination.id",
"transfer_items": [
{
"source_path.$": "$.input.source.path",
"destination_path.$": "$.input.destination.path",
"recursive.$": "$.input.recursive_tx"
}
]
},
• Each Action Provider (optionally) defines its own set of properties/inputs.
• Input to the Action can either be referenced by “InputPath” or
“Parameters”.
• In this example the parameters are referenced from the input schema
(we’ll see that soon).
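The `.$` suffix on a parameter name marks its value as a reference into the run's input rather than a literal. The sketch below shows one way such references could be resolved; it is a simplified illustration that handles only plain dotted paths, not the Flows service's actual implementation, and `lookup` and `resolve_parameters` are hypothetical names.

```python
def lookup(path, data):
    """Follow a simple dotted reference such as "$.input.source.id"."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = data
    for key in path[2:].split("."):
        value = value[key]
    return value


def resolve_parameters(params, data):
    """Replace every "name.$" entry with the value its path points to."""
    if isinstance(params, list):
        return [resolve_parameters(item, data) for item in params]
    if isinstance(params, dict):
        resolved = {}
        for key, value in params.items():
            if key.endswith(".$"):
                resolved[key[:-2]] = lookup(value, data)
            else:
                resolved[key] = resolve_parameters(value, data)
        return resolved
    return params


# Parameters block from the TransferFiles state.
parameters = {
    "source_endpoint_id.$": "$.input.source.id",
    "destination_endpoint_id.$": "$.input.destination.id",
    "transfer_items": [
        {
            "source_path.$": "$.input.source.path",
            "destination_path.$": "$.input.destination.path",
            "recursive.$": "$.input.recursive_tx",
        }
    ],
}

# Example run input; the destination path is a made-up placeholder.
run_data = {
    "input": {
        "source": {"id": "ddb59aef-6d04-11e5-ba46-22000b92c6ec", "path": "/share/godata/"},
        "destination": {"id": "a6f165fa-aee2-4fe5-95f3-97429c28bf82", "path": "/example-dir/"},
        "recursive_tx": True,
    }
}

resolved = resolve_parameters(parameters, run_data)
print(resolved)
```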

"ResultPath": "$.TransferFiles",
"WaitTime": 60,
"Next": "SetPermission",
},
• “ResultPath”: a Reference Path indicating where the output of the Action will
be placed in the run-time state of the Flow.
• “WaitTime” (optional, default value 300, i.e. five minutes): the maximum amount
of time, in seconds, to wait for the Action to complete (or abort).
• “Next” or “End” (mutually exclusive, one required): these indicate how the Flow
should proceed after the Action state.
– “Next” indicates the name of the following state of the flow.
– “End” with a value of “True” indicates that the Flow is complete after this state completes.
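A small sketch may help picture what ResultPath does to the run state. The helper `apply_result_path` is hypothetical and handles only simple dotted paths, and the action output shown is a made-up placeholder rather than a real Transfer response.

```python
import copy


def apply_result_path(run_state, result_path, action_output):
    """Return a copy of the run state with the action output stored at result_path."""
    if not result_path.startswith("$."):
        raise ValueError(f"unsupported path: {result_path}")
    new_state = copy.deepcopy(run_state)
    keys = result_path[2:].split(".")
    target = new_state
    # Walk (and create, if needed) every level except the last one.
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = action_output
    return new_state


run_state = {"input": {"recursive_tx": True}}
# Placeholder output; a real action returns a richer document.
action_output = {"status": "SUCCEEDED"}

new_state = apply_result_path(run_state, "$.TransferFiles", action_output)
print(new_state)
```

Because the output lands under its own key, a later state can reference it with a path such as `$.TransferFiles.status` while the original `$.input` values remain available.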

The Action State Type – another example
"SetPermission": {
"Comment": "Grant read permission on the data to a Globus user or group",
"Type": "Action",
"ActionUrl": "https://actions.automate.globus.org/transfer/set_permission",
"Parameters": {
"endpoint_id.$": "$.input.destination.id",
"path.$": "$.input.destination.path",
"operation": "CREATE",
"permissions": "r", # read-only access
"principal_type.$": "$.input.principal_type", # 'group' or 'identity'
"principal.$": "$.input.principal_identifier"
},
"ResultPath": "$.SetPermission",
"End": True
}
}
}

The Action State Type – wrap up
• The examples above are not exhaustive – for more
information on the Action State Type
• https://globus-automate-client.readthedocs.io/en/latest/authoring_flows.html#action-state-type
• Globus Action Providers
• https://globus-automate-client.readthedocs.io/en/latest/globus_action_providers.html
• Roll your own Action Providers
• https://action-provider-tools.readthedocs.io/en/latest/

Authoring a Flow – Define a Schema
• All Flows require schemas to validate user input.
• Yea! More JSON!
# Define input schema
input_schema = {
"required": [
"input"
],
"properties": {
"input": {
"type": "object",
"required": [
"source",
"destination",
"recursive_tx",
"principal_identifier",
"principal_type"
],
• User input we need for this Flow
– source
o Globus Collection containing the
source data
– destination
o Globus Guest collection that will be
the destination of the transfer action
– recursive_tx
o Boolean flag to state whether or not
to transfer files recursively
– principal_identifier
o UUID of the user identity or group to
share data with
– principal_type
o Specifies whether to share with an
individual user or group identity

"properties": {
"source": {
"type": "object",
"title": "Select source collection and path",
"description": "The source collection and path (path MUST end with a slash)",
"format": "globus-collection",
"required": [
"id",
"path"
],
"properties": {
"id": {
"type": "string",
"format": "uuid",
"default": source_collection
},
"path": {
"type": "string"
}
},
"additionalProperties": False
},
• Schema for the “source” object
– globus-collection is a custom format
o https://globus-automate-client.readthedocs.io/en/latest/authoring_flows.html#globus-web-app-custom-formats
– Note the default to source_collection, which we defined at the beginning of this notebook.

"destination": {
"type": "object",
"title": "Select destination collection and path",
etc…
"recursive_tx": {
"type": "boolean",
"title": "Recursive transfer",
etc…
"principal_type": {
"type": "string",
"title": "Type of principal to share with",
etc…
"principal_identifier": {
"type": "string",
"title": "UUID of user identity or group",
etc…
• Finish Schema for remaining user input parameters

Flow Deployment
# Deploy the flow
flow_title = f"Tutorial-Transfer-Share-{str(uuid.uuid4())}" # generate a unique title
flow = flows_client.deploy_flow(
flow_definition,
title=flow_title,
input_schema=input_schema,
)
flow_id = flow['id']
flow_scope = flow['globus_auth_scope']
print(f"Successfully deployed flow (ID: {flow_id})")
print(f"Flow scope: {flow_scope}\n\n")
print(f"View the flow in the Webapp: https://app.globus.org/flows/{flow_id}")
print("Note: You can start your flow directly from the Webapp!")
• Simple method of the FlowsClient

Flow Updating
flow = flows_client.update_flow(
flow_id,
flow_definition,
# administered_by=[f"urn:globus:auth:identity:{identity_id}"])
# runnable_by=[f"urn:globus:auth:identity:{identity_id}"])
visible_to=[f"urn:globus:auth:identity:{identity_id}"])
• If you change the Flow you will need to update it.
• Very similar to the deploy step.
• By default, Flows are visible only to their creator; you can modify that here.
• https://globus-automate-client.readthedocs.io/en/latest/python_sdk_reference.html
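The access-policy arguments above take principal URNs rather than bare UUIDs. A minimal sketch of building them follows; `principal_urn` is a hypothetical helper, the identity form is the one used in the update call, and the group form (`urn:globus:groups:id:`) is stated here as an assumption to check against the Globus documentation.

```python
def principal_urn(principal_type, principal_id):
    """Build a principal URN for use in visible_to / runnable_by lists.

    Hypothetical helper; the group prefix is an assumption, not taken
    from the slides.
    """
    prefixes = {
        "identity": "urn:globus:auth:identity:",
        "group": "urn:globus:groups:id:",
    }
    if principal_type not in prefixes:
        raise ValueError(f"unknown principal type: {principal_type}")
    return prefixes[principal_type] + principal_id


# The "Tutorial Users" group ID from the housekeeping slide.
my_collaborators = "50b6a29c-63ac-11e4-8062-22000ab68755"
# Placeholder identity ID, for illustration only.
identity_id = "00000000-0000-0000-0000-000000000000"

visible_to = [
    principal_urn("identity", identity_id),
    principal_urn("group", my_collaborators),
]
print(visible_to)
```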

Flow Execution – From the API
• Flows may be run via the globus-automate API
– See section C of the Jupyter Notebook
• Authorize the Flow
– Native App Grant process
• Define the Flow Inputs
– Define Flow inputs with a JSON document
• Run the Flow
– Trivial thanks again to the FlowsClient
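For the "define the Flow inputs" step, a sketch of an input document matching the schema's required fields is shown below. The collection and group IDs come from the housekeeping slide; the identity ID and the exact destination path are placeholders assumed for illustration.

```python
import json

# IDs from the "Initial Housekeeping" slide.
source_collection = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"
destination_collection = "a6f165fa-aee2-4fe5-95f3-97429c28bf82"
my_collaborators = "50b6a29c-63ac-11e4-8062-22000ab68755"
# Placeholder; in the notebook this comes from the Auth userinfo lookup.
identity_id = "00000000-0000-0000-0000-000000000000"

flow_input = {
    "input": {
        "source": {"id": source_collection, "path": "/share/godata/"},
        # Assumed layout: a uniquely named directory per user.
        "destination": {
            "id": destination_collection,
            "path": f"/{identity_id}-shared-files/",
        },
        "recursive_tx": True,
        "principal_type": "group",
        "principal_identifier": my_collaborators,
    }
}

# Quick local check against the schema's "required" list before submitting.
required = ["source", "destination", "recursive_tx",
            "principal_identifier", "principal_type"]
missing = [key for key in required if key not in flow_input["input"]]

print(json.dumps(flow_input, indent=2))
print("Missing required fields:", missing)

# In the notebook, the run itself would then be started through the
# FlowsClient (not executed here, since it needs network access and tokens):
# run = flows_client.run_flow(flow_id, flow_scope, flow_input)
```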

Flows - Administrivia
• Flows can be created / updated / run from the Globus CLI
• Flows is a subscription service
– Non-subscribers can have a single flow
– You should delete the flow we just created if you want to follow
along with the next example.
• If my institution has a subscription, how do I run more than
one flow?
– Short answer, contact me (greg@globus.org) or the Globus
Support Staff (support@globus.org)
– This process will improve

Now we’ll add computation to our flow
1. Transfer: transfer raw instrument images
2. Compute: run a compute job to process image files
3. Transfer: move processed images to repository
4. Share: set access controls for sharing data

Globus Compute – Formerly FuncX

Globus Compute
• Globus Compute: managed, federated FaaS (function as a service)
• Compute function: Python code registered with the
Globus Compute service → simple image processing
• Compute endpoint: any system running the Globus
Compute agent → our EC2 instance
• Currently you can only run functions you register
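A compute function is ordinary Python. The sketch below uses a made-up `invert_image` function as a stand-in for "simple image processing" (it is not the tutorial's function) and runs it locally; the registration step appears only as a comment because it needs the `globus-compute-sdk` package and a Globus login.

```python
def invert_image(pixels, max_value=255):
    """Invert a grayscale image given as a list of rows of pixel values.

    Hypothetical example function; the tutorial registers its own
    image-processing function instead.
    """
    return [[max_value - value for value in row] for row in pixels]


# Run locally to check the function before registering it.
result = invert_image([[0, 128], [255, 64]])
print(result)

# Registration would look roughly like this (assumes globus-compute-sdk
# is installed and you are logged in; not executed here):
#   from globus_compute_sdk import Client
#   function_id = Client().register_function(invert_image)
```

Keeping the function free of module-level dependencies makes it easier to ship to a remote endpoint, since it must run wherever the endpoint lives.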

Globus Compute transforms any computing
resource into a function serving endpoint
• pip installable endpoint
– Globus Auth for registration
• Elastic resource provisioning
from local, cluster, or cloud
system (via Parsl)
• Parallel execution using local
fork or via common schedulers
– Slurm, PBS, LSF, Cobalt, K8s

Web interface to Compute
List of compute endpoints available to user
Status and details of compute endpoint

Compute service will evolve rapidly
• Multi-user compute endpoints
• Native integration with transfer for stage in and stage
out of data for compute tasks
• Expanding compute service interfaces in the webapp
for administrators and users

Our environment
[Diagram: an EC2 instance (the computing resource) hosts the Compute endpoint, the registered Compute function, and a GCP endpoint; ALCF (the sharing resource) provides a GCS endpoint; the Compute service and Transfer service sit between them.]

Configure our computing resource
• Register a compute endpoint with Globus Compute
– Activate venv: ~/.compute/bin/activate
o Virtual environment – contains necessary packages
– Register: globus-compute-endpoint configure EP_NAME
– Start: globus-compute-endpoint start EP_NAME
– Save the registered endpoint UUID
– View endpoint in the web app: app.globus.org/compute

Register and execute a function
• Register a function with Globus Compute
– Activate venv: ~/.compute/bin/activate (should
already be done)
– Register: python ~/globus-flows-trigger-examples/compute_function.py
– Save the registered function UUID
• Open interactive Python shell and run the function

Configure our computing resource storage
• We need a way to get the data to that computing
resource
• Set up and run Globus Connect Personal
– Setup: globusconnectpersonal
– Run: globusconnectpersonal -start &
– Save the registered collection UUID
– View collection in the web app

Adding computation to our flow
1. Transfer: transfer raw instrument images
2. Compute: run a compute job to process image files
3. Transfer: move processed images to repository
4. Share: set access controls for sharing data

Our environment
[Diagram: the instrument and the computing resource are the same EC2 instance, each with a GCP endpoint; the computing resource also hosts the Compute endpoint and the registered Compute function. ALCF is the sharing resource, with a GCS endpoint. The Transfer service and Compute service handle transfer and control.]
0. Monitor script triggers a flow run
1. Transfer raw files
2. Invoke image processing function
3. Transfer result files
4. Set permissions

Incorporate compute into a flow (1/3)
• Review the flow definition and schema:
– transfer_compute_share_definition.json
– Actually… we’ll do that after we deploy it
• Deploy the enhanced flow
– Activate venv: ~/.trigger/bin/activate
– Deploy: deploy_flow --flowdef --schema --title

• Update the monitoring script
– Edit trigger_transfer_compute_share.py
• Modify…
– Flow ID
– Source collection ID and path (the “instrument”)
– Destination collection ID and path (the compute endpoint)
– Compute endpoint and function IDs
– Result share collection ID and path (the sharing resource)

• Run the monitoring script
./trigger_transfer_compute_share_flow.py \
--watchdir /home/devN/images \
--patterns .done
• Activate the trigger
– Copy *.png files to directory being monitored
– “touch” iam.done file to trigger the flow
• Monitor the running flow in the web app
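The trigger idea can be sketched with the standard library alone: look in the watched directory for a file whose name matches the pattern, and start a flow run when one appears. This is a simplified illustration, not the tutorial's monitoring script; `find_trigger_files` is a hypothetical helper and the flow start is left out.

```python
import os
import tempfile


def find_trigger_files(watchdir, patterns):
    """Return paths in watchdir whose names end with any of the patterns."""
    matches = []
    for name in sorted(os.listdir(watchdir)):
        if any(name.endswith(pattern) for pattern in patterns):
            matches.append(os.path.join(watchdir, name))
    return matches


# Demonstrate with a temporary directory standing in for the watched one.
with tempfile.TemporaryDirectory() as watchdir:
    # Copying image files alone does not trigger anything.
    open(os.path.join(watchdir, "image1.png"), "w").close()
    before = find_trigger_files(watchdir, [".done"])

    # "touch iam.done" signals that the batch of images is complete.
    open(os.path.join(watchdir, "iam.done"), "w").close()
    after = find_trigger_files(watchdir, [".done"])
    after_names = [os.path.basename(path) for path in after]

print(before, after_names)
```

Using a separate marker file means the flow starts only once all images have been copied, rather than on the first partial file.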

Enjoy our success!
[Diagram: the same environment with steps 0 to 4 complete (trigger flow run, transfer raw files, invoke image processing function, transfer result files, set permissions); collaborators can now access the result files on the ALCF sharing resource.]

Extending the ecosystem: Action providers
• An Action Provider is a
service endpoint
– Run
– Status
– Cancel
– Release
– Resume
• Action Provider Toolkit
action-provider-tools.readthedocs.io
[Diagram: action providers in the ecosystem: compute, ACLs, delete, identifier, transfer, notify, ingest, mkdir, search, ls, Xtract, describe, web form, and custom developed providers]
docs.globus.org/api/flows/hosted-action-providers

Flows Resources
• Globus Documentation: docs.globus.org
• Flows Specific Doc: https://docs.globus.org/api/flows/
–Globus Flows Overview
o Authoring Flows
o Running a Flow automatically
o Python SDK Reference
–Globus Operated Action Providers
–Globus Action Provider API Specification
–Globus Flows API Specification
