We describe how to build, deploy and trigger Globus Flows for automating data management at scale. Examples include running flows, building flows, and integrating compute into flows.
Automating Research Data with Globus Flows and Compute
1. Automating Research Data with Globus Flows and Compute
Greg Nawrocki
greg@globus.org
nawrocki@uchicago.edu
nawrocki@anl.gov
Washington University in St. Louis
September 20 & 21, 2022
Case Western Reserve University
October 23 – 24, 2023
2. Topics
• Globus Flows overview
• Automating data management
–Run an existing Flow
–Build a Flow then run it
• Globus Compute overview
• Automating end-to-end research flows
3. Globus Platform and Automation Capabilities
Timer Service
The Globus WebApp supports
recurring and scheduled transfers.
(a.k.a. Globus cron)
Command Line Interface
The CLI provides an interface to Globus
services from the shell and is suited to
both interactive and scripting use cases.
Globus API / SDK
Our open REST APIs and Python SDK
empower you to create an integrated
ecosystem of research data services
and applications. Harness the power of
the Globus platform so you can focus
on building your application.
4. Automation using Globus Flows
Available to all Globus Subscribers
• Managed, secure (Globus Auth), reliable
task orchestration
• Support for heterogeneous resources
• Extensible and authorable event driven
execution model
– Flow Definition (JSON)
– Input Schema (JSON)
– Deployment
• Extensible via custom actions
5. Managed automation of tasks
• Flows: A platform service for defining, applying, and
sharing distributed research automation flows
• Flows comprise Actions
• Action Providers: Called by Flows to perform tasks
• Triggers*: Start flows based on events * In development
[Figure: example flow pipeline. Transfer (transfer raw files) → Compute (launch analysis job) → Carbon! (correct, classify, …) → Compute (extract metadata) → Search (ingest to index) → Transfer (move final files to repo) → Share (set access controls)]
6. Globus Flows service implementation
• Built on AWS Step Functions
– Simple state machine language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating
custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
AWS Step Functions + Globus Auth
7. Automation services ecosystem
GET /provider_url/
POST /provider_url/run
GET /provider_url/action_id/status
POST /provider_url/action_id/cancel
POST /provider_url/action_id/release
Create Action
Providers
Define and
deploy flows
{ "StartAt": "ToProject",
"States" : {
"ToProject" : { … },
"SetPermission" : { … },
"ProcessData" : { … } … }}
Run flows
12. Flow lifecycle: Write once, run many
• Define using JSON/YAML
• Deploy to Flows service
• Set access policy for
visibility and execution
• Run (debug) and monitor
• …and run again!
15. A simple, rather contrived, use case
[Figure: two Transfer actions in sequence. (1) Transfer files to intermediate storage → (2) Transfer files to final storage]
16. Ex. 1: Run an existing flow using the web app
• Navigate to app.globus.org/flows
• Find the flow named “Two Stage Globus Transfer” and click “Start”
• Consent to allow the flow access to your account
• Source
– Collection: Globus Tutorial Endpoint 1
– Path: /share/godata/
• Intermediate
– Collection and path of your choice
– You can even use the collection you created yesterday in the admin tutorial
• Destination
– Collection: Globus Tutorial Endpoint 2
– Path: /~/
• Add appropriate labels and tags
• Start Run!
• Click “View Run Details” and “Event Log” to monitor progress
20. Example Flow
• Uses Globus-defined Action Providers
• https://globus-automate-client.readthedocs.io/en/latest/globus_action_providers.html
• transfer
• Uses the Globus Transfer Task API to transfer data from one Globus Collection to another.
• set_permission
• Uses the Globus Transfer ACL API to set or manage permissions on a folder or file.
21. Initial Housekeeping
import sys
import os
import time
import json
import uuid
import pickle
import base64
import globus_sdk
from globus_automate_client import FlowsClient
# ID of this tutorial notebook as registered with Globus Auth
CLIENT_ID = 'f794186b-f330-4595-b6c6-9c9d3e903e47'
• Things we need in place for this Notebook to run and access
the Globus SDK and Globus Flows client.
22. Initial Housekeeping
# Feel free to replace the collection UUIDs below with those of your own collections
# "Globus Tutorial Endpoint 1"
source_collection = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"
# "Globus Tutorials on ALCF Eagle"
destination_collection = "a6f165fa-aee2-4fe5-95f3-97429c28bf82"
# "Tutorial Users" group
my_collaborators = "50b6a29c-63ac-11e4-8062-22000ab68755"
23. Authentication and Authorization
• All interactions between users and services on the Globus
automation platform are governed by the Globus Auth service.
• Consent must be given by the user for each interaction taking place on their
behalf.
• When executing a flow.
• When deploying a new flow on the Globus Flow service.
• This Notebook in our JupyterHub.
• Access to the Flow service is already granted to you by virtue of authenticating to the
JupyterHub running this notebook – the tokens are already in place.
• If you're running this notebook in your own environment you will need to manually log
into Globus Auth and get tokens using a native app authorization flow (see the
`Platform_Introduction_Native_App_Auth` notebook for an example of how to initiate
this flow).
24. The Globus Flows Service in a Jupyter Notebook
[Figure: login returns {"tokens": …} to the notebook, which calls the Flow Service REST APIs with a bearer token (Bearer a45cd…)]
# Get Globus Auth token data from the JupyterHub environment
tokens = pickle.loads(base64.b64decode(os.getenv('GLOBUS_DATA')))['tokens']
# Introspect tokens
print(json.dumps(tokens, indent=2))
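To make the decoding step above concrete, here is a minimal sketch of the round trip. It assumes, as the notebook cell implies, that the JupyterHub stores a base64-encoded pickle in `GLOBUS_DATA`. The payload and token value below are made up for illustration; real token data comes from the JupyterHub login.

```python
import base64
import os
import pickle

# Hypothetical payload: the real structure is supplied by the JupyterHub.
fake_data = {"tokens": {"flows.globus.org": {"access_token": "not-a-real-token"}}}

# Encode the way the hub is assumed to: pickle first, then base64.
os.environ["GLOBUS_DATA"] = base64.b64encode(pickle.dumps(fake_data)).decode("ascii")

# Decode exactly as in the notebook cell.
tokens = pickle.loads(base64.b64decode(os.getenv("GLOBUS_DATA")))["tokens"]
flows_token = tokens["flows.globus.org"]["access_token"]
print(flows_token)
```

Unpickling executes whatever the pickle describes, so this pattern is only appropriate when the environment variable comes from a trusted source such as your own hub.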
25. Authentication and Authorization
# Create a variable for storing flow scope tokens. Each newly deployed flow
# scope needs to be authorized separately and will have its own set of tokens.
# Save each of these tokens by scope.
saved_flow_scopes = {}
# Add a callback to the flows client for fetching scopes.
# It will draw scopes from `saved_flow_scopes`
def get_flow_authorizer(flow_url, flow_scope, client_id):
    return globus_sdk.AccessTokenAuthorizer(
        access_token=saved_flow_scopes[flow_scope]['access_token'])
# Set up the Flows client, using tokens from our JupyterHub login to access the
# Globus Flows service, and set the `get_flow_authorizer` callback for any new
# flows we authorize.
flows_authorizer = globus_sdk.AccessTokenAuthorizer(
    access_token=tokens['flows.globus.org']['access_token'])
flows_client = FlowsClient.new_client(
    CLIENT_ID, get_flow_authorizer, flows_authorizer)
• Once you’ve got the tokens the authentication magic happens.
26. Fetch User Identity
# Create an Auth client so we can look up identities
auth_authorizer = globus_sdk.AccessTokenAuthorizer(
    access_token=tokens['auth.globus.org']['access_token'])
ac = globus_sdk.AuthClient(authorizer=auth_authorizer)
# Get the user's primary identity
primary_identity = ac.oauth2_userinfo()
identity_id = primary_identity['sub']
print(f"Username: {primary_identity['preferred_username']} (ID: {identity_id})")
print(f"Notifications will be sent to: {primary_identity['email']}")
• When transferring files to the destination collection we will put them in a
uniquely named directory:
• <identity_id>-shared-files
• Fetch our user id for this purpose.
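As a small illustration of the naming scheme above, the following sketch builds the `<identity_id>-shared-files` directory path. The helper name `shared_directory`, the base path, and the all-zero identity ID are placeholders, not part of the notebook.

```python
def shared_directory(identity_id: str, base_path: str = "/") -> str:
    """Return '<base_path><identity_id>-shared-files/' with a trailing slash."""
    if not base_path.endswith("/"):
        base_path += "/"
    return f"{base_path}{identity_id}-shared-files/"


# Placeholder ID; in the notebook this would be primary_identity['sub'].
example_id = "00000000-0000-0000-0000-000000000000"
destination_path = shared_directory(example_id, "/tutorial")
print(destination_path)
```

Because the identity ID is unique per user, each participant writes to a separate directory on the shared destination collection.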
27. Authoring a Flow
• Define a Flow
• Flows are composed of State Types
• The Action Type is what we will highlight in this example
• Define a Schema
• The user inputs needed for this Flow
• Deploy the Flow
– The FlowsClient makes that easy!
28. Authoring a Flow – Define a Flow
# Define flow
flow_definition = {
"Comment": "Transfer files to a guest collection and set access permissions",
"StartAt": "TransferFiles",
"States": {
• Top Level Fields
• From the Amazon States Language playbook
• Can Include
• Comment
• StartAt
• First State in the Machine
• States
• State definitions
29. State Types
• Supported States from the Amazon States Language playbook
• Pass
• Passes input to output – performs no work
• Choice
• Adds branching logic to a state machine.
• Wait
• Delays the machine from continuing for a specified time.
• Fail
• Terminates the machine as a failed run.
• Globus Defined States
• Action
• References the Action Providers – The heart of our example.
• ExpressionEval
• Method of evaluating an expression to create parameter values for passing to an Action.
• Combines the Action and Pass State Types providing the ability to compute results for
Parameters (Action) and the simple storage of the new values (Pass).
• Useful for determining a value to be tested in a Choice State or to compute a “final” value
seen in the output of the Flow upon completion.
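The non-Action state types can be illustrated with a small, hypothetical flow fragment. The field names (`Choices`, `Variable`, `BooleanEquals`, `Default`, `Seconds`, `Error`, `Cause`) follow the Amazon States Language; the state names and the 30-second wait are invented for this sketch. The helper checks that every transition points at a state that exists, which is a common authoring mistake.

```python
# Illustrative fragment only: state names and values are made up.
flow_definition = {
    "StartAt": "CheckRecursive",
    "States": {
        "CheckRecursive": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.input.recursive_tx", "BooleanEquals": True, "Next": "Pause"}
            ],
            "Default": "Reject",
        },
        "Pause": {"Type": "Wait", "Seconds": 30, "Next": "Done"},
        "Done": {"Type": "Pass", "End": True},
        "Reject": {"Type": "Fail", "Error": "NotRecursive", "Cause": "recursive_tx was false"},
    },
}


def dangling_transitions(definition):
    """Return sorted names of transition targets that are not defined states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return sorted(name for name in targets if name not in states)


print(dangling_transitions(flow_definition))
```

This is a local sanity check, not a substitute for the validation the Flows service performs at deployment.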
30. The Action State Type – by way of example
"TransferFiles": {
"Comment": "Transfer to a guest collection",
"Type": "Action",
"ActionUrl": "https://actions.automate.globus.org/transfer/transfer",
• Name the State – “TransferFiles”
• Comment
• Self explanatory
• Type : Action (required)
• ActionUrl (required)
• The base URL of the Action (Service Endpoint). As defined by the Action Interface.
31. The Action State Type – by way of example
"Parameters": {
"source_endpoint_id.$": "$.input.source.id",
"destination_endpoint_id.$": "$.input.destination.id",
"transfer_items": [
{
"source_path.$": "$.input.source.path",
"destination_path.$": "$.input.destination.path",
"recursive.$": "$.input.recursive_tx"
}
]
},
• Each Action Provider (optionally) defines its own set of properties/inputs.
• Input to the Action can either be referenced by “InputPath” or
“Parameters”.
• In this example the parameters are referenced from the input schema
(we’ll see that soon).
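To show what the `.$` suffix means in practice, here is a simplified sketch of how such parameters could be resolved against the flow's run-time state. It handles only plain dotted paths such as `$.input.source.id`; the real service evaluates reference paths itself, so the functions `lookup` and `resolve` are stand-ins written for this illustration, and the IDs and paths are invented.

```python
def lookup(path, state):
    """Follow a dotted '$.a.b.c' path through nested dictionaries."""
    if not path.startswith("$"):
        raise ValueError(f"not a reference path: {path!r}")
    value = state
    for key in path[1:].split("."):
        if key:
            value = value[key]
    return value


def resolve(parameters, state):
    """Replace each 'name.$' key with 'name' and the value found in `state`."""
    if isinstance(parameters, list):
        return [resolve(item, state) for item in parameters]
    if not isinstance(parameters, dict):
        return parameters
    resolved = {}
    for key, value in parameters.items():
        if key.endswith(".$"):
            resolved[key[:-2]] = lookup(value, state)
        else:
            resolved[key] = resolve(value, state)
    return resolved


# Hypothetical run-time state; the IDs and paths are placeholders.
state = {
    "input": {
        "source": {"id": "src-id", "path": "/in/"},
        "destination": {"id": "dst-id", "path": "/out/"},
        "recursive_tx": True,
    }
}
parameters = {
    "source_endpoint_id.$": "$.input.source.id",
    "destination_endpoint_id.$": "$.input.destination.id",
    "transfer_items": [
        {
            "source_path.$": "$.input.source.path",
            "destination_path.$": "$.input.destination.path",
            "recursive.$": "$.input.recursive_tx",
        }
    ],
}
resolved = resolve(parameters, state)
print(resolved)
```

Keys without the `.$` suffix (for example a literal `"operation": "CREATE"`) pass through unchanged.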
32. The Action State Type – by way of example
"ResultPath": "$.TransferFiles",
"WaitTime": 60,
"Next": "SetPermission",
},
• “ResultPath”: A Reference Path indicating where the output of the Action will
be placed in the run-time state of the Flow.
• “WaitTime” (optional, default value 300, i.e. five minutes): The maximum amount
of time, in seconds, to wait for the Action to complete (or abort).
• “Next” or “End” (mutually exclusive, one required): These indicate how the Flow
should proceed after the Action state.
– “Next” indicates the name of the following state of the flow.
– “End” with a value “True” indicates that the Flow is complete after this state completes.
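The effect of “ResultPath” can be sketched as follows: the Action's output is stored under the given path in the flow state, where later states can reference it. The function `apply_result_path` and the shape of the example result are assumptions made for illustration; they are not the service's implementation.

```python
import copy


def apply_result_path(state, result_path, result):
    """Return a copy of `state` with `result` stored at a '$.a.b' style path."""
    if not result_path.startswith("$."):
        raise ValueError(f"unsupported result path: {result_path!r}")
    new_state = copy.deepcopy(state)
    *parents, leaf = result_path[2:].split(".")
    node = new_state
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = result
    return new_state


# Hypothetical state before and after the TransferFiles action completes.
before = {"input": {"recursive_tx": True}}
after = apply_result_path(before, "$.TransferFiles", {"status": "SUCCEEDED"})
print(after)
```

With `"ResultPath": "$.TransferFiles"`, a later state could then read the transfer outcome through a reference such as `$.TransferFiles.status`.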
33. The Action State Type – another example
"SetPermission": {
"Comment": "Grant read permission on the data to a Globus user or group",
"Type": "Action",
"ActionUrl": "https://actions.automate.globus.org/transfer/set_permission",
"Parameters": {
"endpoint_id.$": "$.input.destination.id",
"path.$": "$.input.destination.path",
"operation": "CREATE",
"permissions": "r", # read-only access
"principal_type.$": "$.input.principal_type", # 'group' or 'identity'
"principal.$": "$.input.principal_identifier"
},
"ResultPath": "$.SetPermission",
"End": True
}
}
}
34. The Action State Type – wrap up
• The examples above are not exhaustive – for more
information on the Action State Type
• https://globus-automate-client.readthedocs.io/en/latest/authoring_flows.html#action-state-type
• Globus Action Providers
• https://globus-automate-client.readthedocs.io/en/latest/globus_action_providers.html
• Roll your own Action Providers
• https://action-provider-tools.readthedocs.io/en/latest/
35. Authoring a Flow – Define a Schema
• All Flows require schemas to validate user input.
• Yea! More JSON!
# Define input schema
input_schema = {
"required": [
"input"
],
"properties": {
"input": {
"type": "object",
"required": [
"source",
"destination",
"recursive_tx",
"principal_identifier",
"principal_type"
],
• User input we need for this Flow
– source
o Globus Collection containing the
source data
– destination
o Globus Guest collection that will be
the destination of the transfer action
– recursive_tx
o Boolean flag to state whether or not
to transfer files recursively
– principal_identifier
o UUID of the user identity or group to
share data with
– principal_type
o Specifies whether to share with an
individual user or group identity
36. Authoring a Flow – Define a Schema
"properties": {
"source": {
"type": "object",
"title": "Select source collection and path",
"description": "The source collection and path (path MUST end with a slash)",
"format": "globus-collection",
"required": [
"id",
"path"
],
"properties": {
"id": {
"type": "string",
"format": "uuid",
"default": source_collection
},
"path": {
"type": "string"
}
},
"additionalProperties": False
},
• Schema for the “source” object
– globus-collection is a custom format
o https://globus-automate-client.readthedocs.io/en/latest/authoring_flows.html#globus-web-app-custom-formats
– Note the default to source_collection which we defined at the beginning of this notebook.
37. Authoring a Flow – Define a Schema
"destination": {
"type": "object",
"title": "Select destination collection and path",
etc…
"recursive_tx": {
"type": "boolean",
"title": "Recursive transfer",
etc…
"principal_type": {
"type": "string",
"title": "Type of principal to share with",
etc…
"principal_identifier": {
"type": "string",
"title": "UUID of user identity or group",
etc…
• Finish Schema for remaining user input parameters
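An input document that satisfies this schema might look like the sketch below. The collection and group UUIDs are the tutorial values defined earlier in the notebook; the destination path is a placeholder. The required-key check is a hand-written stand-in for full JSON Schema validation, which the Flows service performs when a run is started.

```python
import json

# Example input; the destination path is hypothetical.
flow_input = {
    "input": {
        "source": {
            "id": "ddb59aef-6d04-11e5-ba46-22000b92c6ec",
            "path": "/share/godata/",
        },
        "destination": {
            "id": "a6f165fa-aee2-4fe5-95f3-97429c28bf82",
            "path": "/example-destination/",
        },
        "recursive_tx": True,
        "principal_identifier": "50b6a29c-63ac-11e4-8062-22000ab68755",
        "principal_type": "group",
    }
}

# The keys the schema lists as required under "input".
REQUIRED = [
    "source",
    "destination",
    "recursive_tx",
    "principal_identifier",
    "principal_type",
]
missing = [key for key in REQUIRED if key not in flow_input["input"]]

print(json.dumps(flow_input, indent=2))
print("Missing keys:", missing)
```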
38. Flow Deployment
# Deploy the flow
flow_title = f"Tutorial-Transfer-Share-{str(uuid.uuid4())}" # generate a unique title
flow = flows_client.deploy_flow(
flow_definition,
title=flow_title,
input_schema=input_schema,
)
flow_id = flow['id']
flow_scope = flow['globus_auth_scope']
print(f"Successfully deployed flow (ID: {flow_id})")
print(f"Flow scope: {flow_scope}\n\n")
print(f"View the flow in the Webapp: https://app.globus.org/flows/{flow_id}")
print(f"Note: You can start your flow directly from the Webapp!")
• Simple method of the FlowsClient
40. Flow Updating
flow = flows_client.update_flow(
    flow_id,
    flow_definition,
    # administered_by=[f"urn:globus:auth:identity:{identity_id}"],
    # runnable_by=[f"urn:globus:auth:identity:{identity_id}"],
    visible_to=[f"urn:globus:auth:identity:{identity_id}"])
• If you change the Flow you will need to update it.
• Very similar to the deploy step.
• By default Flows are only visible to their creator, you can modify that here.
• https://globus-automate-client.readthedocs.io/en/latest/python_sdk_reference.html
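The access lists take principal URNs rather than bare UUIDs. A minimal sketch of building them is shown below; the helper names are invented for this example, the identity ID is a placeholder, and the group URN form `urn:globus:groups:id:<uuid>` is stated here as the commonly documented Globus format rather than something shown in the slides.

```python
def identity_urn(identity_id: str) -> str:
    """URN for a single Globus Auth identity."""
    return f"urn:globus:auth:identity:{identity_id}"


def group_urn(group_id: str) -> str:
    """URN for a Globus group."""
    return f"urn:globus:groups:id:{group_id}"


# Placeholder identity; the group is the "Tutorial Users" group from the notebook.
visible_to = [
    identity_urn("00000000-0000-0000-0000-000000000000"),
    group_urn("50b6a29c-63ac-11e4-8062-22000ab68755"),
]
print(visible_to)
```

A list like this could be passed as `visible_to`, `runnable_by`, or `administered_by` in the update call above.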
41. Flow Execution – From the API
• Flows may be run via the globus-automate API
– See section C of the Jupyter Notebook
• Authorize the Flow
– Native App Grant process
• Define the Flow Inputs
– Define Flow inputs with a JSON document
• Run the Flow
– Trivial thanks again to the FlowsClient
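Once a run has been started, a script usually needs to wait for it to finish. The sketch below shows one way to poll for a terminal status. It deliberately takes a `get_status` callable instead of calling the Flows client, so the client method used to fetch the status is left to the reader; the status strings `ACTIVE`, `SUCCEEDED`, and `FAILED` are assumed to match what the service reports.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}  # assumed terminal run statuses


def wait_for_run(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {status} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with canned statuses instead of a real service call.
fake_statuses = iter(["ACTIVE", "ACTIVE", "SUCCEEDED"])
final_status = wait_for_run(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_status` would wrap whichever client call returns the run's status for the given flow ID, scope, and run ID.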
42. Flows - Administrivia
• Flows can be created / updated / run from the Globus CLI
• Flows is a subscription service
– Non-subscribers can have a single flow
– You should delete the flow we just created if you want to follow
along with the next example.
• If my institution has a subscription, how do I run more than
one flow?
– Short answer, contact me (greg@globus.org) or the Globus
Support Staff (support@globus.org)
– This process will improve
43. Now we’ll add computation to our flow
[Figure: four-step flow. (1) Transfer: transfer raw instrument images → (2) Compute: run a compute job to process image files → (3) Transfer: move processed images to repository → (4) Share: set access controls for sharing data]
45. Globus Compute
• Globus Compute: managed, federated FaaS
• Compute function: Python code registered with the
Globus Compute service → simple image processing
• Compute endpoint: any system running the Globus
Compute agent → our EC2 instance
• Currently you can only run functions you register
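A compute function is ordinary Python. The sketch below is a hypothetical example, not the tutorial's image-processing function: it lists the image files in a directory. Imports are placed inside the function body, a commonly recommended practice for functions that are serialized and executed on a remote endpoint. Registration with the Globus Compute service is not shown; here the function is simply called locally.

```python
import pathlib
import tempfile


def list_images(directory, extension=".png"):
    """Return the sorted names of files in `directory` with the given extension."""
    # Imported here so the function is self-contained when run remotely.
    import os

    return sorted(
        name for name in os.listdir(directory) if name.lower().endswith(extension)
    )


# Local demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.PNG", "notes.txt"):
        pathlib.Path(tmp, name).touch()
    found = list_images(tmp)

print(found)
```

Testing a function locally like this before registering it helps separate bugs in the function from problems with the endpoint.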
46. Globus Compute transforms any computing
resource into a function serving endpoint
• pip installable endpoint
– Globus Auth for registration
• Elastic resource provisioning
from local, cluster, or cloud
system (via Parsl)
• Parallel execution using local
fork or via common schedulers
– Slurm, PBS, LSF, Cobalt, K8s
Compute
Service
47. Web interface to Compute
List of compute endpoints available to user
Status and details of compute endpoint
48. Compute service will evolve rapidly
• Multi-user compute endpoints
• Native integration with transfer for stage in and stage
out of data for compute tasks
• Expanding compute service interfaces in the webapp
for administrators and users
50. Configure our computing resource
• Register a compute endpoint with Globus Compute
– Activate venv: ~/.compute/bin/activate
o Virtual environment – contains necessary packages
– Register: globus-compute-endpoint configure EP_NAME
– Start: globus-compute-endpoint start EP_NAME
– Save the registered endpoint UUID
– View endpoint in the web app: app.globus.org/compute
51. Register and execute a function
• Register a function with Globus Compute
– Activate venv: ~/.compute/bin/activate (should already be done)
– Register: python ~/globus-flows-trigger-examples/compute_function.py
– Save the registered function UUID
• Open interactive Python shell and run the function
52. Configure our computing resource storage
• We need a way to get the data to that computing
resource
• Setup and run Globus Connect Personal
– Setup: globusconnectpersonal
– Run: globusconnectpersonal -start &
– Save the registered collection UUID
– View collection in the web app
53. Adding computation to our flow
[Figure: four-step flow. (1) Transfer: transfer raw instrument images → (2) Compute: run a compute job to process image files → (3) Transfer: move processed images to repository → (4) Share: set access controls for sharing data]
55. Incorporate compute into a flow (1/3)
• Review the flow definition and schema:
– transfer_compute_share_definition.json
– Actually… we’ll do that after we deploy it
• Deploy the enhanced flow
– Activate venv: ~/.trigger/bin/activate
– Deploy: deploy_flow --flowdef --schema --title
56. Incorporate compute into a flow (2/3)
• Update the monitoring script
– Edit trigger_transfer_compute_share.py
• Modify…
– Flow ID
– Source collection ID and path (the “instrument”)
– Destination collection ID and path (the compute endpoint)
– Compute endpoint and function IDs
– Result share collection ID and path (the sharing resource)
57. Incorporate compute into a flow (3/3)
• Run the monitoring script
./trigger_transfer_compute_share_flow.py --watchdir /home/devN/images --patterns .done
• Activate the trigger
– Copy *.png files to directory being monitored
– “touch” iam.done file to trigger the flow
• Monitor the running flow in the web app
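The trigger idea can be sketched with the standard library alone. The tutorial's monitoring script is not shown in these slides, so the code below is a polling stand-in under that assumption: it scans a directory for new files whose names end with one of the given patterns and invokes a callback, which is where a real script would start the flow.

```python
import os
import pathlib
import tempfile


def find_trigger_files(watchdir, patterns):
    """Return sorted names in `watchdir` that end with any of `patterns`."""
    return sorted(
        name for name in os.listdir(watchdir) if name.endswith(tuple(patterns))
    )


def poll_once(watchdir, patterns, seen, on_trigger):
    """Invoke on_trigger(path) once for each trigger file not seen before."""
    for name in find_trigger_files(watchdir, patterns):
        if name not in seen:
            seen.add(name)
            on_trigger(os.path.join(watchdir, name))


# Demonstration: images alone do nothing; the .done marker fires the callback once.
triggered = []
seen = set()
with tempfile.TemporaryDirectory() as watchdir:
    record = lambda path: triggered.append(os.path.basename(path))
    pathlib.Path(watchdir, "image1.png").touch()
    poll_once(watchdir, [".done"], seen, record)
    pathlib.Path(watchdir, "iam.done").touch()
    poll_once(watchdir, [".done"], seen, record)
    poll_once(watchdir, [".done"], seen, record)  # no duplicate trigger

print(triggered)
```

Using a separate marker file such as `iam.done` avoids starting the flow while image files are still being copied into the directory.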
59. Extending the ecosystem: Action providers
• Action Provider is a
service endpoint
– Run
– Status
– Cancel
– Release
– Resume
• Action Provider Toolkit
action-provider-tools.readthedocs.io
compute
ACLs
delete
identifier
transfer
notify ingest
mkdir
search
ls
Xtract describe
web form
Custom developed
docs.globus.org/api/flows/hosted-action-providers
60. Flows Resources
• Globus Documentation: docs.globus.org
• Flows Specific Doc: https://docs.globus.org/api/flows/
–Globus Flows Overview
o Authoring Flows
o Running a Flow automatically
o Python SDK Reference
–Globus Operated Action Providers
–Globus Action Provider API Specification
–Globus Flows API Specification