Instrument data orchestration with
Globus Search and Flows
Vas Vasiliadis
vas@uchicago.edu
October 13, 2021
Why we’re all here this week…
Instrument data orchestration:
A common design pattern
[Diagram: six numbered steps spanning the Instrument Facility, Advanced Computing Facility, Distribution Store, and Data Portal]
1. Imaging
2. Acquisition
3. Image Analysis
4. Description/Identification
5. Search/Discovery
6. Science!
Three Degrees of Automation
Timer Service
Scheduled and recurring transfers
(a.k.a. Globus cron)
Command Line Interface
Ad hoc scripting and integration
Platform Services
Comprehensive data—and
compute—orchestration (with
human in the loop)
Search
Flows
Transfer
& Sharing
Globus Command Line
Interface (CLI)
…you’re all experts on this already!
Globus Timer Service
The Globus Timer service
• Scheduled/recurring file transfers
• Well suited to backup/sync tasks
• Service with a command line interface
– Simple installation: pypi.org/project/globus-timer-cli
– One-time authentication with a user identity
• Example: NIH – hpc.nih.gov/storage/globus_cron.html
Use case: Data replication
• For backup: initiated by user or by system backup
• Automated transfer of data from science instrument
[Diagram: recurring transfers with sync option; copy /ingest, daily @ 3:30am]
Using the Globus Timer service
$ globus-timer session {login, logout, whoami}
$ globus-timer job transfer \
    --name example-job \
    --label "Timer Transfer Job" \
    --interval 28800 \
    --start '2020-01-01T12:34:56' \
    --source-endpoint ddb59aef-6d04-11e5-ba46-22000b92c6ec \
    --dest-endpoint ddb59af0-6d04-11e5-ba46-22000b92c6ec \
    --item ~/file1.txt ~/new_file1.txt false \
    --item ~/file2.txt ~/new_file2.txt false
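The `--interval` value above is in seconds (28800 s = 8 hours). A minimal Python sketch that assembles the same command line from typed values, so the interval is not computed by hand. The helper name `build_timer_command` is hypothetical; it only builds the argument list using the flags shown on the slide and does not run the CLI.

```python
import shlex
from datetime import timedelta


def build_timer_command(name, label, interval, start, source, dest, items):
    """Return the argv list for a `globus-timer job transfer` call.

    `interval` is a timedelta; the CLI flag takes whole seconds.
    `items` is a list of (source_path, dest_path, recursive) tuples.
    """
    argv = [
        "globus-timer", "job", "transfer",
        "--name", name,
        "--label", label,
        "--interval", str(int(interval.total_seconds())),
        "--start", start,
        "--source-endpoint", source,
        "--dest-endpoint", dest,
    ]
    for src_path, dst_path, recursive in items:
        argv += ["--item", src_path, dst_path, "true" if recursive else "false"]
    return argv


argv = build_timer_command(
    name="example-job",
    label="Timer Transfer Job",
    interval=timedelta(hours=8),
    start="2020-01-01T12:34:56",
    source="ddb59aef-6d04-11e5-ba46-22000b92c6ec",
    dest="ddb59af0-6d04-11e5-ba46-22000b92c6ec",
    items=[("~/file1.txt", "~/new_file1.txt", False)],
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would invoke the CLI, assuming it is installed and a session is already logged in.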
Globus Timer service options
• --items-file {file_name}
• --stop-after-runs
• --stop-after-date
• Transfer behavior (equivalent to options in web app)
– --sync-level (how timer behaves if files exist)
– --verify-checksum
– --encrypt-data
– --preserve-timestamp
Timer options in the webapp
Coming soon…
Platform Services
Relevant Globus platform capabilities
• Data transfer and sharing
• Data description and discovery
• Data (and compute) orchestration
• Authentication and Authorization
Services: Auth, Search, Transfer, Groups, Flows
Globus Auth: Foundational IAM service
Brokers authentication and authorization among…
– End-users
– Identity providers: enterprise, external (federated identities)
– Services: resource servers with REST APIs
– Apps: web, mobile, desktop, command line clients
– Services acting as clients to other services
• OAuth 2.0 Authorization Framework (a.k.a. OAuth2)
• OpenID Connect Core 1.0 (a.k.a. OIDC)
Several authentication models supported
• Application acting as user with consent
– Authorization code grant
• Application authenticating as itself
– Client credentials grant
• Application able to manage tokens for offline or long-running tasks
– Refresh tokens
Authorization Code Grant
Parties: Browser (User); Client (Web Portal, Application); Globus Auth (Authorization Server); Identity Provider; Globus service (Resource Server)
1. User accesses the portal
2. Portal redirects the user to Globus Auth
3. User authenticates and consents
4. Authorization code is returned to the client
5. Client authenticates using client id and secret, and sends the authorization code
6. Globus Auth returns access token(s)
7. Client authenticates with access token(s), giving the client authority to invoke the requested service
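A minimal sketch of steps 2 and 5 of this grant in Python, using only the standard library. It builds the redirect URL and the token-exchange form but sends nothing. The client id, redirect URI, and helper names are placeholders for illustration; the endpoint URLs and the Transfer scope string are the publicly documented Globus Auth values as far as I know, so check them against the current docs before relying on them.

```python
from urllib.parse import urlencode, urlparse, parse_qs

AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"
TOKEN_URL = "https://auth.globus.org/v2/oauth2/token"
TRANSFER_SCOPE = "urn:globus:auth:scope:transfer.api.globus.org:all"


def build_authorize_url(client_id, redirect_uri, scopes, state):
    """Step 2: URL the portal redirects the user's browser to."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),   # OAuth2 scopes are space-separated
        "state": state,              # echoed back; guards against CSRF
        "response_type": "code",
    }
    return AUTHORIZE_URL + "?" + urlencode(params)


def token_exchange_form(code, redirect_uri):
    """Step 5: form fields POSTed to TOKEN_URL.

    The client id and secret go in an HTTP Basic Authorization header,
    not in this form.
    """
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
    }


url = build_authorize_url(
    client_id="my-client-id",                            # placeholder
    redirect_uri="https://portal.example.org/callback",  # placeholder
    scopes=["openid", TRANSFER_SCOPE],
    state="abc123",
)
print(url)
print(parse_qs(urlparse(url).query)["scope"])
```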
Client credential grant
Parties: Application, Science Gateway, Data Portal (Client); Globus Auth (Authorization Server); Globus Transfer (Resource Server)
1. Client authenticates with app client id and secret
2. Globus Auth returns access tokens
3. Client authenticates as app with access tokens to invoke service (on behalf of authorized user, within a given scope)
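Step 1 of this grant can be sketched as a single HTTP request. The example below only constructs the request object with the standard library and does not send it. The function name and credentials are placeholders; the token URL is the Globus Auth endpoint as I understand it.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "https://auth.globus.org/v2/oauth2/token"


def client_credentials_request(client_id, client_secret, scopes):
    """Build (but do not send) the token request for an app acting as itself."""
    # HTTP Basic auth: base64("client_id:client_secret")
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urlencode({
        "grant_type": "client_credentials",
        "scope": " ".join(scopes),
    }).encode()
    return Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


req = client_credentials_request(
    "my-client-id",       # placeholder
    "my-client-secret",   # placeholder; keep real secrets out of source code
    ["urn:globus:auth:scope:transfer.api.globus.org:all"],
)
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` would return a JSON document containing the access token(s), assuming the client is registered.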
Step 0: Application registration
• Set desired scopes
• Set callback URL
• Get client ID and secret
• Consents implement
least privileges principle
developers.globus.org
Data transfer and sharing
…you already know how to do this ;-)
• Move data to collection → Submit Transfer task
• Make data accessible → Set guest collection access rule
• Grant user/app access → Add/confirm Group membership
Groups service:
GET /groups/my_groups
Transfer service:
POST /endpoint/{endpoint_id}/access
POST /transfer
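A sketch of the request bodies behind the two Transfer calls above, written as plain Python dicts. Field names follow the Transfer API documents as I recall them ("access" and "transfer" document types); the helper names are hypothetical, and the submission id is assumed to come from a prior `GET /submission_id` call.

```python
def access_rule(principal, path, permissions="r", principal_type="identity"):
    """Body for POST /endpoint/{endpoint_id}/access."""
    # Access rule paths are directories: they start and end with "/".
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError("path must start and end with '/'")
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    return {
        "DATA_TYPE": "access",
        "principal_type": principal_type,   # "identity", "group", ...
        "principal": principal,             # identity or group UUID
        "path": path,
        "permissions": permissions,
    }


def transfer_task(submission_id, source, dest, items, label=None):
    """Body for POST /transfer.

    `items` is a list of (source_path, dest_path, recursive) tuples.
    """
    doc = {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,
        "source_endpoint": source,
        "destination_endpoint": dest,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": s,
                "destination_path": d,
                "recursive": bool(r),
            }
            for s, d, r in items
        ],
    }
    if label is not None:
        doc["label"] = label
    return doc


rule = access_rule("46bd0f56-e24f-11e5-a510-131bef46955c", "/shared/")
task = transfer_task(
    "submission-id-from-api",   # placeholder
    "ddb59aef-6d04-11e5-ba46-22000b92c6ec",
    "ddb59af0-6d04-11e5-ba46-22000b92c6ec",
    [("/raw/run42/", "/ingest/run42/", True)],
    label="instrument to ingest",
)
print(rule["permissions"], len(task["DATA"]))
```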
Using guest collections in your apps
• Create a guest collection; requires authentication
– Cannot be completely automated – must "log in"
– Create once and automate rest of the steps
• Grant the application Access Manager role
– Allows the application to manage permissions on the collection
– Set for application identity: appclientid@clients.auth.globus.org
• Grant roles for management of endpoint and tasks
Globus Search Service
Data description and discovery
• Metadata store with fine-grained visibility controls
• Schema agnostic → dynamic schemas
• Simple search using URL query parameters
• Complex search using search request document
docs.globus.org/api/search
Cancer Registry Records for Research (CR3)
• Create network of federated cancer registries
– Deploy similar infrastructure at other cancer registries
– Enable queries across multiple registries
• Federation via Globus: network scale ↔ local control
– Data owners input/export data sets, apply QC, set access policies
– Registry data remain at the institution where they were generated
– Identities are provided/authenticated by the institution, not Globus
– System scale depends on data owners providing storage resources
CR3 Architecture
[Diagram]
• Researcher logs in to the CR3 Discovery Portal with UPMC/Pitt credentials
• Authentication initiated to Globus Auth (GA), backed by UPMC/Pitt identity providers
• Cohort search initiated to Globus Search (GS); cohort aggregate counts returned
• Elasticsearch: Cancer Registry De-identified Data Index (minimal criteria data: e.g., staging)
• Request Service; Registry Staff manage authorization
• Data transfer from registrar to researcher mediated by Globus Transfer (GT)
CR3 requirements
• Search Index
– Only de-identified data in search index
– No record-level access for researchers
• Portal
– Fine-grained access control
– Researchers must use a specific identity
– Access must be logged
– Render graphs based on search results
– Faceted search in real time
CR3 Portal (simulated data)
Federated logon using Globus Auth
with Pitt/UPMC as identity providers
Dynamically updating
charts as facets change
Variable facets based on
source registry index
Google-like text search with
facets for filtering
Developed using a framework based
on the Globus Modern Research
Data Portal* design pattern
(docs.globus.org/mrdp)
* PeerJ Articles:cs-144 https://peerj.com/articles/cs-144/
Distinct access policies
may be applied to
Data and Metadata
Data ingest with Globus Search
POST /index/{index_id}/ingest
{
  "ingest_type": "GMetaList",
  "ingest_data": {
    "gmeta": [
      {
        "id": "filetype",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": ["public"],
        "content": {
          "metadata-schema/file#type": "file"
        }
      },
      ...
    ]
  }
}
Data ingest with Globus Search
POST /index/{index_id}/ingest
{
  "ingest_type": "GMetaList",
  "ingest_data": {
    "gmeta": [
      {
        "id": "weight",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": ["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"],
        "content": {
          "metadata-schema/file#size": "37.6",
          "metadata-schema/file#size_human": "<50lb"
        }
      },
      ...
    ]
  }
}
Visibility limited to Globus Auth identity:
- Single user
- Globus Group
- Registered client application
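The two ingest documents above can be assembled programmatically. A minimal sketch, assuming the URN forms `urn:globus:auth:identity:<uuid>` and `urn:globus:groups:id:<uuid>` for `visible_to` principals; the helper names are hypothetical and nothing is sent to the service.

```python
import json


def identity_urn(identity_id):
    """Principal URN for a single user or registered client application."""
    return f"urn:globus:auth:identity:{identity_id}"


def group_urn(group_id):
    """Principal URN for a Globus Group."""
    return f"urn:globus:groups:id:{group_id}"


def gmeta_entry(subject, content, visible_to=("public",), entry_id=None):
    """One entry: metadata (`content`) about `subject`, visible to the listed principals."""
    entry = {
        "subject": subject,
        "visible_to": list(visible_to),
        "content": dict(content),
    }
    if entry_id is not None:
        entry["id"] = entry_id   # lets one subject carry several entries
    return entry


def ingest_document(entries):
    """Body for POST /index/{index_id}/ingest."""
    return {
        "ingest_type": "GMetaList",
        "ingest_data": {"gmeta": list(entries)},
    }


subject = "https://search.api.globus.org/abc.txt"
doc = ingest_document([
    # Public entry: anyone can discover the file type.
    gmeta_entry(subject, {"metadata-schema/file#type": "file"},
                entry_id="filetype"),
    # Restricted entry: only one identity can see the size fields.
    gmeta_entry(subject,
                {"metadata-schema/file#size": "37.6",
                 "metadata-schema/file#size_human": "<50lb"},
                visible_to=[identity_urn("46bd0f56-e24f-11e5-a510-131bef46955c")],
                entry_id="weight"),
])
print(json.dumps(doc, indent=2))
```

Splitting metadata for one subject into separate entries is what allows distinct visibility policies per field group.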
Data discovery with Globus Search
Simple query:
GET /index/{index_id}/search?q=type%3Ahdf5
Response:
{
  "@datatype": "GSearchResult",
  "@version": "2017-09-01",
  "count": 1,
  "gmeta": [
    {
      "@datatype": "GMetaResult",
      "@version": "2019-08-27",
      "entries": [
        { ... }
      ],
      "subject": "https://..."
    }
  ],
  "offset": 0,
  "total": 1
}
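A sketch of how the simple query URL is formed and how subjects can be pulled from a result of the shape shown. The base URL `https://search.api.globus.org/v1` and the index id are assumptions; the function names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://search.api.globus.org/v1"   # assumed base URL


def simple_search_url(index_id, q, limit=None, offset=None):
    """URL for GET /index/{index_id}/search with URL query parameters."""
    params = {"q": q}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    # urlencode percent-encodes the colon: "type:hdf5" becomes "type%3Ahdf5"
    return f"{SEARCH_BASE}/index/{index_id}/search?{urlencode(params)}"


def subjects(result):
    """Subjects of all matches in a GSearchResult-shaped dict."""
    return [match["subject"] for match in result.get("gmeta", [])]


url = simple_search_url("my-index-id", "type:hdf5")   # placeholder index id
print(url)

# Hand-written stand-in for a response, not real service output.
sample_result = {
    "count": 1,
    "gmeta": [{"entries": [{}], "subject": "https://example.org/data/run42.h5"}],
    "offset": 0,
    "total": 1,
}
print(subjects(sample_result))
```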
Data discovery with Globus Search
Complex query:
POST /index/{index_id}/search
{
  "filters": [
    {
      "type": "range",
      "field_name": "pubdate",
      "values": [
        {
          "from": "*",
          "to": "2020-12-31"
        }
      ]
    }
  ],
  "facets": [
    {
      "name": "Publication Date",
      "field_name": "pubdate",
      ...
    }
  ]
}
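A sketch of building the filter part of a complex query in Python. The helper names are hypothetical. Facets are left out on purpose because the slide elides their remaining fields.

```python
import json


def range_filter(field_name, start="*", end="*"):
    """Range filter on one field; "*" leaves that side of the range open."""
    return {
        "type": "range",
        "field_name": field_name,
        "values": [{"from": start, "to": end}],
    }


def search_request(filters, q=None):
    """Body for POST /index/{index_id}/search (filters only in this sketch)."""
    body = {"filters": list(filters)}
    if q is not None:
        body["q"] = q
    return body


# Everything published up to the end of 2020, as in the slide.
body = search_request([range_filter("pubdate", end="2020-12-31")])
print(json.dumps(body, indent=2))
```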
Working with Globus
Search
jupyter.demo.globus.org
Metadata, Search and Discovery
Globus Flows Service
Data (and compute) automation
• Flows: A platform service for defining, applying, and
sharing distributed research automation flows
• Flows comprise Actions
• Action Providers: Called by Flows to perform tasks
• Triggers*: Start flows based on events
* In development
Automation with Globus Flows
• Built on AWS Step Functions
– Simple JSON-based state machine language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating
custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
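To make the state machine idea concrete, here is a small flow definition written as a Python dict, plus a checker that follows `Next` links from `StartAt` to a state marked `End`. The overall shape (`StartAt`, `States`, `Next`, `End`, `ResultPath`) follows the Step Functions style language named above. The `ActionUrl` and the parameter names are assumptions modeled on public Globus examples and should be checked against the Flows documentation. The checker is a hypothetical helper and handles linear flows only (no choice states).

```python
flow_definition = {
    "Comment": "Move new instrument data, then finish",
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            # Assumed URL of the Globus-provided transfer action provider.
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                # Keys ending in ".$" take their value from the flow state.
                "source_endpoint_id.$": "$.input.source_endpoint_id",
                "destination_endpoint_id.$": "$.input.destination_endpoint_id",
                "transfer_items": [
                    {
                        "source_path.$": "$.input.source_path",
                        "destination_path.$": "$.input.destination_path",
                        "recursive": True,
                    }
                ],
            },
            # The action's output is stored here and propagates onward.
            "ResultPath": "$.TransferResult",
            "Next": "Done",
        },
        "Done": {"Type": "Pass", "End": True},
    },
}


def walk_states(definition):
    """Return state names in execution order for a linear flow."""
    states = definition["States"]
    name = definition["StartAt"]
    visited = []
    while True:
        if name not in states:
            raise ValueError(f"unknown state: {name}")
        if name in visited:
            raise ValueError(f"loop detected at state: {name}")
        visited.append(name)
        state = states[name]
        if state.get("End"):
            return visited
        if "Next" not in state:
            raise ValueError(f"state {name} has neither Next nor End")
        name = state["Next"]


print(walk_states(flow_definition))
```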
Extending the ecosystem: Action providers
• Action Provider is a service endpoint
– Run
– Status
– Cancel
– Release
– Resume
• Action Provider Toolkit: action-provider-tools.readthedocs.io/en/latest
[Diagram: action providers, some Globus provided and some custom built: Search, Transfer, Notification, ACLs, Identifier, Delete, Ingest, User Form, Describe, Xtract, funcX, Web Form]
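A toy, in-memory sketch of the run/status/cancel/release lifecycle listed above. This is not the Action Provider Toolkit API: the class, its method signatures, and the `complete` helper are hypothetical, and resume is omitted. The status names (ACTIVE, SUCCEEDED, FAILED) follow the action provider interface as I understand it; treating cancel as a move to FAILED is an assumption.

```python
import uuid


class InMemoryActionProvider:
    """Illustrative stand-in for a service exposing run/status/cancel/release."""

    TERMINAL = ("SUCCEEDED", "FAILED")

    def __init__(self):
        self._actions = {}

    def run(self, body):
        """Start an action and return its initial status document."""
        action_id = str(uuid.uuid4())
        self._actions[action_id] = {
            "action_id": action_id,
            "status": "ACTIVE",
            "details": {"request": body},
        }
        return dict(self._actions[action_id])

    def status(self, action_id):
        """Return the current status document (KeyError if released)."""
        return dict(self._actions[action_id])

    def complete(self, action_id, result):
        """Hypothetical helper simulating the asynchronous work finishing."""
        action = self._actions[action_id]
        action["status"] = "SUCCEEDED"
        action["details"] = {"result": result}

    def cancel(self, action_id):
        """Stop an active action; finished actions are left unchanged."""
        action = self._actions[action_id]
        if action["status"] == "ACTIVE":
            action["status"] = "FAILED"
            action["details"] = {"reason": "canceled"}
        return dict(action)

    def release(self, action_id):
        """Forget a finished action and return its final status document."""
        if self._actions[action_id]["status"] not in self.TERMINAL:
            raise RuntimeError("cannot release an action that is still running")
        return self._actions.pop(action_id)


provider = InMemoryActionProvider()
started = provider.run({"message": "hello"})
provider.complete(started["action_id"], {"echo": "hello"})
final = provider.release(started["action_id"])
print(final["status"])
```

A flow polls status for asynchronous actions until a terminal state is reached, then releases the action so the provider can discard its record.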
Working with Globus
Flows
jupyter.demo.globus.org
Automation Using Globus Flows
Coming soon: Globus Trigger service
• Trigger–Action platform
• Predefined triggers and
actions to create rules
• Globus processes triggers
and reliably executes actions
globus.org
docs.globus.org
outreach@globus.org
support@globus.org
