Simplified Research Data Management
with the Globus Platform
CNI Membership Meeting, Fall 2018 - Project Update
Vas Vasiliadis
vas@uchicago.edu
Topics
• What is Globus?
• Globus from a researcher’s perspective
• Common use cases: research data automation
• Data publication with Globus
• Sustainability – it’s in our DNA
Research data management today (circa 2008)
How do we...
...move, share, describe,
discover, reproduce?
Index?
Facilitate data stewardship
Globus: A Brief History of Time
• Oct. 1998 – Globus Toolkit v1.0.0
• Nov. 2010 – Globus Online initial release
• Nov. 2013 – Sustainability model launched
• Dec. 2016 – 50,000 registered users, 200PB+ moved
• Jan. 2018 – Globus Toolkit support EOL
• Jan. 2019 – 100th subscriber signed, >50% sustainable
• ??? – Globus becomes fully self-sustaining
globus online
Globus…
bridges data and people
within and beyond
organizational boundaries
Research Computing HPC
Desktop Workstations
Mass Storage Instruments
Personal Resources
Public/Private Cloud
National Resources
Unified access to data across storage tiers
Public / private cloud stores
External
campus
storage
Project
repositories,
replication stores
Public repositories
Sharing with collaborators, community
Globus: Core functions
Researcher initiates
transfer request; or
requested automatically
by script, science
gateway
1
Instrument
Compute Facility
Globus transfers files
reliably, securely
2
Globus controls
access to shared
files on existing
storage; no need
to move files to
cloud storage!
4
Curator reviews and
approves; data set
published on campus
or other system
7
Researcher
selects files to
share, selects
user or group,
and sets access
permissions
3
Collaborator logs in to
Globus and accesses
shared files; no local
account required;
download via Globus
5
Researcher
assembles data set;
describes it using
metadata (Dublin Core
and domain-specific)
6
Peers, collaborators
search and discover
datasets; transfer and
share using Globus
8
Publication
Repository
Personal Computer
Transfer
Share
Publish
Discover
• Use a Web browser
• Access any storage
• Use an existing identity
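The sharing step above (select files, select a user or group, set access permissions) can be pictured as a small access rule. The sketch below is a hypothetical in-memory model, not the Globus API: the class name, field names, and the "r"/"rw" permission strings are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed vocabulary for this sketch only.
VALID_PRINCIPAL_TYPES = {"identity", "group"}  # "user or group" on the slide
VALID_PERMISSIONS = {"r", "rw"}                # read-only or read-write


@dataclass(frozen=True)
class AccessRule:
    """Hypothetical model of one sharing rule on a shared folder."""
    principal_type: str
    principal: str
    path: str
    permissions: str

    def __post_init__(self):
        if self.principal_type not in VALID_PRINCIPAL_TYPES:
            raise ValueError(f"unknown principal type: {self.principal_type}")
        if self.permissions not in VALID_PERMISSIONS:
            raise ValueError(f"unknown permissions: {self.permissions}")
        if not (self.path.startswith("/") and self.path.endswith("/")):
            raise ValueError("path must be an absolute folder path ending in '/'")


def can_read(rules, principals, path):
    """True if any rule grants one of the caller's principals access to path.

    `principals` is the set of names the caller holds: their own identity
    plus any groups they belong to.
    """
    return any(
        rule.principal in principals and path.startswith(rule.path)
        for rule in rules
    )
```

The files never move: the rule only decides who may read a path on the existing storage, which is the point of step 4 in the diagram.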
Demonstration
Globus for high assurance data management
• Restricted data handling: PHI, PII, CUI
• Security controls: NIST 800-53, 800-171 Low
• Business Associate Agreement (BAA) w/UChicago
– University of Chicago has a BAA with Amazon
High Assurance features
• Additional authentication assurance
– Per storage gateway policy on frequency of authentication with
specific identity for access to data
– Ensure that user authenticates with the specific identity that gives
them access within session
• Application instance isolation
– Authentication context is per application, per session
• Encryption of user data in transit and Globus data at rest
• Detailed audit log (on data transfer nodes)
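The authentication assurance policy above has two parts: the user must have authenticated with the specific identity that grants access, and must have done so recently enough. A minimal sketch of that check follows; the function and parameter names are hypothetical and this is not how Globus implements it.

```python
from datetime import datetime, timedelta


def needs_reauthentication(session_identity, required_identity,
                           last_authenticated, now, max_age):
    """Return True if the user must log in again before touching the data.

    Assumed policy for this sketch: access requires the exact identity
    named by the storage gateway, authenticated within `max_age`.
    """
    if session_identity != required_identity:
        return True  # wrong identity in this session
    return now - last_authenticated > max_age  # authentication too old


# Usage: a 30-minute policy, checked 45 minutes after the last login.
now = datetime(2018, 12, 10, 12, 0)
stale = needs_reauthentication(
    "researcher@uchicago.edu", "researcher@uchicago.edu",
    last_authenticated=now - timedelta(minutes=45),
    now=now, max_age=timedelta(minutes=30),
)
```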
Accessing Globus from
your own storage
…client software
that makes a
storage system
accessible via
Globus
Globus Connect Personal
• Installers do not require admin access
• Zero configuration; auto updating
• No firewall changes required; handles NATs
Globus Connect Server
• Installed and managed by sysadmin
• Default access for all local accounts
docs.globus.org/globus-connect-server-installation-guide/
Local system users
Local Storage System
(HPC cluster, NAS, …)
Globus Connect Server
MyProxy
CA
GridFTP
Server
OAuth
Server
Data
Transfer
Node
• POSIX + connectors
• Native packaging
Linux: DEB, RPM
IBM Spectrum Scale
Current Planned
Storage Connectors - globus.org/connectors
ActiveScale
Use(r)-appropriate interfaces
GET /endpoint/go%23ep1
PUT /endpoint/vas#my_endpt
200 OK
X-Transfer-API-Version: 0.10
Content-Type: application/json
...
Globus service
Web
CLI
REST
API
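Endpoint names of the form user#name contain a "#", which a URL treats as the start of a fragment. In a request path it has to be percent-encoded as %23, as in the GET line above. A small sketch using only the Python standard library (the helper name is hypothetical):

```python
from urllib.parse import quote


def endpoint_path(endpoint_name: str) -> str:
    """Build a request path for an endpoint name, percent-encoding
    every reserved character (safe="" leaves nothing unescaped)."""
    return "/endpoint/" + quote(endpoint_name, safe="")


print(endpoint_path("go#ep1"))
print(endpoint_path("vas#my_endpt"))
```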
Globus Command Line Interface
• Full-featured (web++)
• Uses Python SDK
• Open source
github.com/globus/globus-cli
docs.globus.org/cli
Globus is PaaS…
…for building science
gateways, portals, and
other web applications in
support of research and
education
Globus Auth
(identity and access management)
…
Globus APIs
(Transfer, Search, Identifiers, …)
Globus Connect
Data Publication
File Sharing
File Transfer, Sync
The Globus Platform
Data Automation
Globus Auth
• Foundational Identity and Access
Management (IAM) service
• Protects REST API communications
• Enables login for diverse app ecosystem,
no new identity required
• Employs least privileges security model
Auth
User
Authentication
Secure service
interactions
Application identity
and interactions
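A least-privileges model means an application is granted only the scopes it asks for, and a service refuses calls that need a scope the token lacks. The sketch below assumes the common OAuth 2.0 convention of a space-delimited scope string; the scope names are made up for illustration and are not real Globus scopes.

```python
def missing_scopes(granted: str, required):
    """Return the required scopes that are absent from a space-delimited
    granted-scope string, sorted for stable output."""
    granted_set = set(granted.split())
    return sorted(set(required) - granted_set)


# Usage: the token allows reads only, but the call also needs write access.
lacking = missing_scopes("openid transfer:read",
                         ["transfer:read", "transfer:write"])
```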
Globus helps
automate data flows
Automated data movement
Scheduled
backup
My Drive/backups
|__/20170930
|__/20170929
|__/20170928
|__….
Recurring transfers
with sync option
Campus Lab
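The backup pattern above writes each run into a folder named for the day, and the sync option avoids re-copying files that have not changed. A minimal local sketch, assuming "sync" means "copy only files that are new or whose size or modification time differs" (helper names are hypothetical; this does not call Globus):

```python
from datetime import date


def backup_folder(root: str, day: date) -> str:
    """Destination folder for one day's backup, e.g. /backups/20170930/."""
    return f"{root.rstrip('/')}/{day:%Y%m%d}/"


def files_to_sync(source: dict, destination: dict) -> list:
    """Given {relative_path: (size, mtime)} maps for both sides, return the
    paths that are missing or different at the destination."""
    return sorted(
        path for path, meta in source.items()
        if destination.get(path) != meta
    )


# Usage
target = backup_folder("/backups", date(2017, 9, 30))
pending = files_to_sync(
    {"run1.dat": (100, 1506729600), "run2.dat": (250, 1506733200)},
    {"run1.dat": (100, 1506729600)},
)
```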
Streamlined data distribution
My Drive/projectX
|__/source
|__/pipe0001
|__/pipe0002
|__….
Secure sharing with
research community
Discover and access
via data portal
HPC resource,
Campus storage,
…
Data distribution example: NCAR RDA
Reliable instrument data egress
My Drive/FASTQ
|__/cohort_0_0
|__/cohort_0_1
|__/cohort_0_2
|__….
Stage data for
downstream analysis
NGS and high-res Imaging
(APS, ALS, CryoEM, fMRI,…)
Instrument data egress example
• Kasthuri Lab at UChicago: brain aging and disease
• Construct connectome (map neuron connections)
Kasthuri Lab neuroanatomy reconstruction pipeline
Sites: UChicago, APS, ALCF, JLSE
Systems: Lab Server 1, Lab Server 2
1. Imaging
2. Acquisition
3. Pre-processing
4. Preview/Center
5. User validation
6. Reconstruction
7. Publication
8. Visualization
9. Science!
Data Management Plan enablement
My Drive/datasets
|__/afdb4523
|__/235fabcc
|__/cd23a421
|__….
Dataset
assembly,
description,
curation
http://hdl.handle.net/11466/OMN5BFB
Access via
persistent
identifier
Diverse
storage
systems
Globus Data Publication V1
• Cloud-based web app
• BYO storage
• User-managed collections
• Select pre-defined schema
• Handle, DOI persistent
identifiers
• >2000 users, >600 datasets
publish.globus.org
Many variations of data publication…
Citable Data
• Standard metadata
• Persistent identifiers
• Durable storage
• Many domains
• Custom metadata
• Locally managed storage
Institutional Data
• Agreed schema
• Larger datasets
• Fine grained metadata
Community Data
…Including active data management
Active Research Data
• Less standard and evolving schema
• Data organized independent of storage
• Support active collaboration
• Location agnostic identifiers
Publication v2 Platform
• Decompose Globus turnkey solution into microservices
• Enable flexible re-composition and adaptation of services
• Support extension and enhancement of publication flows
Automate
Search, Identify, Describe, Transfer, Auth
Create
folder
Transfer
data
Get
metadata
Mint
persistent
identifier
Catalog
Get
credentials
Set ACL
Globus Search service
• Hosted, scalable service for research data discovery
• Schema agnostic
• Fine grained access control
• Plain text search
• Faceted search
• Rich query language
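To make "plain text search", "faceted search", and "fine grained access control" concrete, here is a toy in-memory version. It is only a sketch of the ideas and not the Globus Search API: the record fields, the `visible_to` convention, and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical records; any schema would do, since the idea is schema agnostic.
RECORDS = [
    {"title": "Steel alloy fatigue tests", "domain": "materials",
     "visible_to": {"public"}},
    {"title": "Alloy microscopy images", "domain": "materials",
     "visible_to": {"lab-members"}},
    {"title": "Climate model output", "domain": "climate",
     "visible_to": {"public"}},
]


def search(records, text, principals):
    """Plain text match on the title, restricted to records the caller may see."""
    needle = text.lower()
    return [
        r for r in records
        if r["visible_to"] & principals and needle in r["title"].lower()
    ]


def facet_counts(results, field):
    """Count how many results fall under each value of one field."""
    return Counter(r[field] for r in results)


public_hits = search(RECORDS, "alloy", {"public"})
member_hits = search(RECORDS, "alloy", {"public", "lab-members"})
```

The same query returns different results for different callers, because the visibility filter is applied per record before matching.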
Globus Identifiers service (limited beta release)
• Issue persistent identifiers…
• …within your namespace, with access control
• Identifiers have…
– …link to data
– …landing page
– …visibility
– …checksum
– …extensible metadata
– …versioning
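The attributes listed above can be pictured as a record attached to each identifier. The sketch below computes a SHA-256 checksum with the standard library and assembles such a record; the field names and the choice of SHA-256 are assumptions for illustration, not the Identifiers service schema.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identifier_record(data_link, landing_page, local_path,
                      visibility="public", metadata=None, version=1):
    """Hypothetical record mirroring the bullet list above."""
    return {
        "link_to_data": data_link,
        "landing_page": landing_page,
        "visibility": visibility,
        "checksum": {"algorithm": "sha256", "value": sha256_of(local_path)},
        "metadata": dict(metadata or {}),
        "version": version,
    }
```

A reader who later downloads the data can recompute the checksum and compare it with the stored value to confirm the file is unchanged.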
Globus Automate (coming soon)
• Composition and execution service for automating
research data management
• Higher level flow description language and authoring
tools
• Pluggable API to integrate any actions
– e.g. automated validation, metadata extraction
• Flexible invocation of actions: user or event driven
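One way to picture a flow built from pluggable actions: each action is a function registered under a name, and a flow is an ordered list of those names. This is a hypothetical sketch of the idea only. Globus Automate was not yet released at the time of the talk, so nothing here reflects its actual API or flow language.

```python
ACTIONS = {}


def action(name):
    """Decorator that plugs a function into the registry under `name`."""
    def register(func):
        ACTIONS[name] = func
        return func
    return register


@action("validate")
def validate(ctx):
    # Example of "automated validation": refuse empty datasets.
    if not ctx.get("files"):
        raise ValueError("dataset has no files")
    return ctx


@action("extract_metadata")
def extract_metadata(ctx):
    # Example of "metadata extraction": record a simple file count.
    return {**ctx, "metadata": {"file_count": len(ctx["files"])}}


def run_flow(steps, ctx):
    """Run the named actions in order, passing the context dict along."""
    for name in steps:
        ctx = ACTIONS[name](ctx)
    return ctx


result = run_flow(["validate", "extract_metadata"],
                  {"files": ["a.csv", "b.csv"]})
```

The same `run_flow` call could be triggered by a user request or by an event such as new files arriving, which is the "user or event driven" distinction on the slide.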
Globus
platform applications
Jupyter + Globus for interactive data science at scale
petrel.alcf.anl.gov
materialsdatafacility.org
2PB, 80Gbps store
3.2M materials data
Cooley: 290 TFLOPS
1. Query
2. Transfer
3. Learn
4. Share
Genotype imputation: Wellcome Sanger
National Resource Access
Identity Management
Globus PaaS developer resources
Python SDK
Sample
Application
docs.globus.org/api github.com/globus
Jupyter Notebook
…on sustainability
Globus by the numbers
• 8,300 active shared endpoints
• 70+ petabyte movers
• 500 PB moved
• 20,400 active personal endpoints
• 80 billion files processed
• 1,800 active server endpoints
• 94 subscribers
• 1 PB largest single transfer to date
• 99.9% availability
• 559 identity providers
• 1,923 most shared endpoints at a single institution
• 120,000 users
Thank you to our sponsors...
U.S. Department of Energy
…and THANK YOU, subscribers!
Globus sustainability model
• Standard Subscription
– Sharing, data publication
– HTTPS access
– Console, usage reporting
– Priority support
– App integration support
• High Assurance subscription
– App instance isolation
– Additional authentication assurance
– Audit logging
– NIST 800-53, NIST 800-171 (+ BAA)
• Branded Web Site
• Premium Storage Connectors
• Alternate Identity Provider (InCommon is standard)
Support resources
• Globus documentation: docs.globus.org
• Community email list: developer-discuss@globus.org
• Helpdesk and issue escalation: support@globus.org
• Customer engagement team
• Globus professional services team
– Assist with portal/gateway/app architecture and design
– Develop custom applications that leverage the Globus platform
– Advise on customized deployment and integration scenarios
