Automating Research Data Management at Scale with Globus

  1. Automating Research Data Management at Scale. Vas Vasiliadis, Greg Nawrocki. SGCI Webinar, September 9, 2020
  2. Globus is … a non-profit service developed and operated by
  3. Our mission is to… increase the efficiency and effectiveness of researchers engaged in data-driven science and scholarship through sustainable software
  4. 4
  5. Fast, reliable file transfer …from any to any system. Diagram: (1) user-initiated or automated transfer request; (2) Globus transfers files reliably, securely; (3) optional notifications. Systems shown: instrument, lab server, compute facility. Globally accessible multi-tenant service • Fire-and-forget transfers • Optimized speed • Assured reliability • Unified view of storage • Browser, REST API, CLI
  6. Secure data sharing …from any storage. Diagram: (1) select files to share, select user or group, and set access permissions; (2) collaborator logs into Globus and accesses shared files; no local account required; download via Globus. Globus controls access to shared files on existing storage (on-prem or public cloud storage; laptop, server, compute facility). Globally accessible multi-tenant service • Fine-grained access control “overlay” on storage system • Share with any identity, email, group • No need to stage data just for sharing
  7. Globus Connectors ActiveScale Object Storage
  8. Coming soon… Globus Developed Community Developed
  9. Use(r)-appropriate interfaces to the Globus service: Web, CLI, Platform (RESTful APIs). Example requests: GET /endpoint/go%23ep1; PUT /endpoint/demodoc#my_endpt. Example response: 200 OK, X-Transfer-API-Version: 0.10, Content-Type: application/json …
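The REST interface on slide 9 exchanges JSON documents. The following is a minimal sketch of how a transfer request body could be assembled in Python, assuming the Transfer API's documented `transfer` / `transfer_item` document layout. The endpoint IDs, paths, and submission ID are placeholders, the base URL is an assumption, and nothing is sent over the network here.

```python
import json

# Assumed base URL of the Transfer API; the v0.10 suffix mirrors the
# X-Transfer-API-Version header shown on the slide.
TRANSFER_API = "https://transfer.api.globusonline.org/v0.10"


def build_transfer_document(source_endpoint, destination_endpoint, items,
                            submission_id, label="example transfer"):
    """Return a JSON-serializable body for a transfer submission.

    `items` is a list of (source_path, destination_path, recursive) tuples.
    In a real client the submission_id is obtained from the service first;
    here it is simply passed in.
    """
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "label": label,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": recursive,
            }
            for src, dst, recursive in items
        ],
    }


# Placeholder identifiers, for illustration only.
document = build_transfer_document(
    "SRC_ENDPOINT_ID",
    "DST_ENDPOINT_ID",
    [("/instrument/run1/", "/compute/run1/", True)],
    submission_id="PLACEHOLDER_SUBMISSION_ID",
)
print(json.dumps(document, indent=2))
```

A real client would POST this body, with an access token from Globus Auth, to the transfer submission resource under `TRANSFER_API`.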
  10. Globus Command Line Interface (CLI) • Native application: docs.globus.org/cli • Open source, uses Python SDK • globus login – get access and refresh tokens – Tokens stored locally in ~/.globus.cfg • Service (transfer/auth) invocation uses tokens • globus logout – delete tokens docs.globus.org/cli/examples
  11. Globus CLI Examples
  12. Globus CLI Examples
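The CLI example slides carry no text in this transcript. As a stand-in, here is a small sketch of how a script might compose a `globus transfer` command line. The helper name `globus_transfer_argv` and the endpoint IDs are hypothetical; the `--recursive`, `--label`, and `--sync-level` options are ones the CLI documents. The command is only built and printed, not executed.

```python
import shlex


def globus_transfer_argv(src_endpoint, src_path, dst_endpoint, dst_path,
                         label=None, recursive=False, sync_level=None):
    """Build the argument vector for a `globus transfer` invocation."""
    argv = ["globus", "transfer"]
    if recursive:
        argv.append("--recursive")
    if label:
        argv += ["--label", label]
    if sync_level:
        # Documented choices: exists, size, mtime, checksum.
        argv += ["--sync-level", sync_level]
    # The CLI addresses data as ENDPOINT_ID:PATH.
    argv += [f"{src_endpoint}:{src_path}", f"{dst_endpoint}:{dst_path}"]
    return argv


argv = globus_transfer_argv(
    "SRC_ENDPOINT_ID", "/data/run1/",
    "DST_ENDPOINT_ID", "/archive/run1/",
    label="run1 sync", recursive=True, sync_level="checksum",
)
# Print a shell-safe rendering of the command instead of running it.
print(shlex.join(argv))
```

After `globus login`, the same list could be passed to `subprocess.run(argv, check=True)` to submit the transfer.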
  13. Automated instrument data egress. Instruments: Cryo EM, Lightsheet, Sequencer, ALS/APS, …. Destinations: local system download; remote analysis, visualization. Local policy store: --/cohort045, --/cohort096, --/cohort127 • Reliable, near-real time data access • Automatically set policy based permissions • Self-service access control, management • Federated login for frictionless data access
  14. Repository data distribution. Diagram: (1) search, request data of interest; (2) bulk data transfer or browser based download. Globally accessible multi-tenant service • Gateway/data portal/app enables faceted search • Enforces fine-grained authorization • HTTPS download for “small” data • Asynchronous transfer for larger data sets
  15. Data staging for compute. Diagram: (1) input data upload to --/run123/input (rw); (2) output data staged with access control in --/run123/output (r); (3) user accesses, downloads results. Compute service • User securely uploads data for analysis • Results available with fine-grained permissions • Automated setup/tear down of folders, permissions
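The permission setup on slide 15 (read/write on the input folder, read-only on the output folder) can be sketched as a pair of access rule documents. This assumes the Transfer API's `access` document layout and its requirement that rule paths are directories ending in a slash. The helper names, the run ID, and the user identity are placeholders.

```python
def access_rule(principal_id, path, permissions):
    """Build one access rule document for a single identity."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError("rule path must begin and end with '/'")
    return {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": principal_id,
        "path": path,
        "permissions": permissions,
    }


def staging_rules(run_id, user_id):
    """Rules for one compute run: writable input, read-only output."""
    return [
        access_rule(user_id, f"/{run_id}/input/", "rw"),
        access_rule(user_id, f"/{run_id}/output/", "r"),
    ]


# Placeholder identity, for illustration only.
for rule in staging_rules("run123", "USER_IDENTITY_ID"):
    print(rule["path"], rule["permissions"])
```

Tear-down would remove the same rules once the user has downloaded the results.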
  16. Automation Examples • Syncing a directory – bash script; calls the Globus CLI – Python module; run as script or import as module • Staging data for distribution – bash and Python variants • Removing directories after files are transferred – Python script. github.com/globus/automation-examples
  17. …but scripting only gets us so far…
  18. Globus platform services • Identity and Access Management (IAM): Auth, Groups • Data Services: Connect, Transfer, Manifest* • Search • Identifiers (collaboration with DataCite) • Flows*
  19. Globus automation platform
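The "removing directories after files are transferred" example follows a wait-then-clean-up pattern. The sketch below shows that pattern in isolation and is not the code from the repository above: `get_status` and `cleanup` are caller-supplied callables (hypothetical names), and the status strings are the ones the Transfer service reports for tasks.

```python
import time


def wait_then_cleanup(get_status, cleanup, poll_interval=5.0,
                      max_polls=100, sleep=time.sleep):
    """Poll a transfer task and run cleanup only if it succeeded.

    Returns True if cleanup ran, False if the task failed or polling
    gave up before the task finished.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "SUCCEEDED":
            cleanup()          # e.g. delete the source directory
            return True
        if status == "FAILED":
            return False       # leave the source data untouched
        sleep(poll_interval)   # ACTIVE or INACTIVE: keep waiting
    return False


# Simulated task that finishes on the third poll.
statuses = iter(["ACTIVE", "ACTIVE", "SUCCEEDED"])
removed = []
done = wait_then_cleanup(
    get_status=lambda: next(statuses),
    cleanup=lambda: removed.append("/data/run1/"),
    poll_interval=0.0,
)
print(done, removed)
```

Running the cleanup only on `SUCCEEDED` is the design point: a failed or stalled task never triggers deletion of the source.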
  20. Globus Automate A platform service for defining, applying, and sharing distributed research automation flows • Triggers start flows based on subscribed events • Flows call Action Providers to perform tasks
  21. Globus automation architecture • Built on AWS Step Functions – JSON-based state machine language – Conditions, loops, fault tolerance, etc. – Propagates state through the flow • Standardized API for integrating custom event and action services – Actions: synchronous or asynchronous – Custom Web forms prompt for user input • Actions secured with Globus Auth
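To make the "JSON-based state machine" and "propagates state through the flow" points concrete, here is a toy sketch of a two-step flow definition and a minimal interpreter for it. The field names (`StartAt`, `States`, `ActionUrl`, `Parameters`, `ResultPath`, `Next`, `End`) follow the Amazon States Language style and are assumed to resemble what Globus flows use; the action URLs and both fake providers are hypothetical, and the interpreter is an illustration, not the service's implementation.

```python
# Hypothetical action URLs; real flows reference deployed action providers.
TRANSFER_URL = "https://example.org/actions/transfer"
ANALYZE_URL = "https://example.org/actions/analyze"

flow_definition = {
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            "ActionUrl": TRANSFER_URL,
            "Parameters": {"source": "/instrument/run1/",
                           "destination": "/compute/run1/"},
            "ResultPath": "$.TransferResult",
            "Next": "AnalyzeData",
        },
        "AnalyzeData": {
            "Type": "Action",
            "ActionUrl": ANALYZE_URL,
            "Parameters": {"path": "/compute/run1/"},
            "ResultPath": "$.AnalysisResult",
            "End": True,
        },
    },
}


def run_flow(definition, providers, flow_input):
    """Walk the state machine, storing each action's result in the state."""
    state = dict(flow_input)
    name = definition["StartAt"]
    while True:
        step = definition["States"][name]
        action = providers[step["ActionUrl"]]
        result = action(step.get("Parameters", {}), state)
        # Only top-level result paths of the form "$.Name" are handled.
        state[step["ResultPath"][2:]] = result
        if step.get("End"):
            return state
        name = step["Next"]


def fake_transfer(params, state):
    return {"status": "SUCCEEDED", "destination": params["destination"]}


def fake_analyze(params, state):
    # Reads the earlier step's result from the propagated state.
    return {"status": "SUCCEEDED",
            "analyzed": state["TransferResult"]["destination"]}


providers = {TRANSFER_URL: fake_transfer, ANALYZE_URL: fake_analyze}
final_state = run_flow(flow_definition, providers, {"run": "run1"})
print(final_state["AnalysisResult"])
```

The second action reads the first action's output from the shared state, which is the propagation behavior the slide describes. Conditions, loops, and asynchronous actions are left out of this sketch.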
  22. Automation Action Providers (Globus action providers and custom action providers): Delete, ACLs, Search, DLHub, User Form, Notification, Expression Evaluation, Describe, Web Form, Identifier, Transfer, Ingest, Xtract, funcX
  23. funcX action provider. funcX: FaaS platform for HPC • funcX endpoints deployed at resources • Service routes requests to endpoints • Parsl acquires resources • Singularity containers run functions • Globus Auth secures communication
  24. Xtract action provider. Xtract: Derive rich metadata on-demand
  25. Automation Examples
  26. UChicago Kasthuri Lab: Brain aging and disease • Construct connectomes—mapping of neuron connections • Use APS synchrotron to rapidly image brains – Beam time available once every few months – ~20GB/minute for large (cm) unsectioned brains • Generate segmented datasets/visualizations for the community • Perform semi-standard reconstruction on all data across HPC resources
  27. Neuroanatomy reconstruction pipeline (neurocartography). Steps: (1) Imaging; (2) Acquisition; (3) Pre-processing; (4) Preview/Center; (5) User validation; (6) Reconstruction; (7) Publication; (8) Visualization; (9) Science! Sites and systems shown: APS, Lab Server 1, Lab Server 2, UChicago, Argonne JLSE, Argonne Leadership Computing Facility
  28. Automating neurocartography. Automate flow steps shown (provider: action): Web form: User input • Search: Ingest • Share: Set policy • Identifier: Mint DOI • funcX: Run job • Auth: Get credentials • Describe: Get metadata • Transfer: Transfer data • funcX: Run job • Transfer: Transfer data
  29. XPCS: X-ray Photon Correlation Spectroscopy. Steps: (1) Imaging; (2) Acquisition; (3) XPCS-Eigen; (4) Plot results; (5) Publication; (6) Science! Sites and systems shown: Advanced Photon Source, Lab Server 1, Argonne Leadership Computing Facility, Data Portal, Storage
  30. Automating XPCS. Automate flow steps shown (provider: action): Search: Ingest • funcX: Plot Results • Auth: Get credentials • Transfer: Transfer HDF5 • Transfer: Transfer IMM • funcX: Run Corr • ACLs: Set policy • Transfer: Return Results
  31. Materials Data Facility: > 35 TB of data, > 320 authors, > 400 datasets • Accept data from many locations with flexible interfaces • Index dataset contents in science-aware ways • Dispatch data to the community • Using Automate to simplify building composable flows of services
  32. MDF Data Publication Automation. Automate flow steps shown (provider: action): Ingest: Bulk Ingest • Auth: Get Credentials • Transfer: Transfer Dataset • Xtract: Extract Metadata • Share: Set permissions • Transfer: Move metadata • Transfer: Transfer Dataset • Transfer: Transfer Dataset • Identifier: Mint DOI • Web form: Metadata • Notify: Notify Curator • Web form: Curation • Notify: Notify user
  33. Petrel data services • Data service providing simple, self-managed project-based data and sharing capabilities (https://petreldata.net) • Flexible user-managed search index and discovery portal
  34. PaaS: develop custom action providers • Directly use the platform to build and run extensible flows • Develop action providers – Fit for purpose – Developed and deployed by the project – Plugged into their flows • Action Provider Development toolkit
  35. SaaS: instrument data management • Templated solution • Configurable… – Set transfer triggers – Select destination(s) – Define metadata • Extensible… – Add/remove actions – Change action providers • No development required. Diagram: automated egress from device (Cryo EM, Lightsheet, Sequencer, ….) to --/cohort045, --/cohort096, --/cohort127; indexing for search; image reconstruction, analysis, visualization. Action providers shown: Transfer, funcX, Xtract
  36. SaaS: Data Management Plans • “Turnkey” DMP enablement • Select dataset (collection)… • …add metadata for indexing • …generate persistent ID (DOI, ARK, etc.). Action providers shown: Transfer, Identifier, Ingest. “Point & Click” to findable and accessible data
  37. Thank you, funders... U.S. Department of Energy
  38. Thank you, subscribers!